NumerFrame

Custom data structure for Numerai data. Extends Pandas DataFrame.

Overview: The NumerFrame

NumerFrame is a data structure that extends pd.DataFrame with functionality convenient for Numerai users. The main benefits are:

1. Feature, target, prediction and other columns are tracked automatically, so these data slices are easy to retrieve.
2. Other library functionality automatically recognizes the era column (era, friday_date or date).
3. It integrates with other library components (i.e. preprocessing, model, modelpipeline, postprocessing, evaluation and submission) to create more solid inference pipelines and increase reliability.

In addition, all functionality of Pandas DataFrames is still available in the NumerFrame, so you don’t have to create new pipelines to process your data when using NumerFrame.

We adopt the following convention:

1. All feature column names should start with 'feature'.
2. All target column names should start with 'target'.
3. All prediction column names should start with 'prediction'.
4. Data should contain an 'era', 'friday_date' or 'date' column, as is almost always the case with Numerai datasets.

Every column for which these conditions do not hold will be classified as an 'aux' column.
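The prefix convention above can be illustrated with a small standalone sketch. The helper name classify_columns is hypothetical and is not part of the library. The sketch assumes that era columns end up in the 'aux' group, which matches the aux_cols output shown later on this page.

```python
def classify_columns(columns):
    """Group column names by prefix, following the convention described above."""
    groups = {"feature": [], "target": [], "prediction": [], "aux": []}
    for col in columns:
        name = str(col)
        for prefix in ("feature", "target", "prediction"):
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
        else:
            # No known prefix matched, so the column is auxiliary.
            groups["aux"].append(name)
    return groups


columns = ["era", "data_type", "feature_a", "feature_b",
           "target", "target_nomi_20", "prediction_1"]
groups = classify_columns(columns)
print(groups["aux"])  # ['era', 'data_type']
```
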



NumerFrame

 NumerFrame (*args, **kwargs)

Data structure which extends Pandas DataFrames and allows for additional Numerai specific functionality.

create_numerframe automatically recognizes your data file format, loads it into a NumerFrame and allows for column selection before loading.

Supported file formats are .csv, .parquet, .pkl, .pickle, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods and .odt. If the file format for your use case is missing, feel free to create a GitHub issue or submit a pull request. See README.md for more information on contributing.



create_numerframe

 create_numerframe (file_path:str, columns:list=None, *args, **kwargs)

Convenience function to initialize a NumerFrame. Supports the most commonly used file formats for Pandas DataFrames (.csv, .parquet, .xls, .pkl, etc.). For more details, see https://pandas.pydata.org/docs/reference/io.html

:param file_path: Relative or absolute path to data file.

:param columns: Which columns to read (All by default).

*args, **kwargs will be passed to Pandas loading function.
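To make the idea of a loader that picks a reader from the file extension more concrete, here is a minimal sketch in plain Pandas. The function load_by_suffix is hypothetical and is not the library's implementation. It only assumes that the suffix decides which Pandas reader is called and that the column selection is passed along where the reader supports it.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_by_suffix(file_path, columns=None, **kwargs):
    """Hypothetical loader: choose a Pandas reader based on the file suffix."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(file_path, usecols=columns, **kwargs)
    if suffix == ".parquet":
        return pd.read_parquet(file_path, columns=columns, **kwargs)
    if suffix in {".pkl", ".pickle"}:
        dataf = pd.read_pickle(file_path, **kwargs)
        # read_pickle has no column argument, so select after loading.
        return dataf if columns is None else dataf.loc[:, columns]
    if suffix in {".xls", ".xlsx", ".xlsm", ".xlsb", ".odf", ".ods", ".odt"}:
        return pd.read_excel(file_path, usecols=columns, **kwargs)
    raise NotImplementedError(f"Suffix '{suffix}' is not supported.")


# Write a tiny CSV to a temporary directory and read back two columns.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example.csv"
    pd.DataFrame(
        {"era": [1, 2], "feature_a": [0.25, 0.75], "target": [0.5, 0.0]}
    ).to_csv(csv_path, index=False)
    loaded = load_by_suffix(csv_path, columns=["era", "target"])

print(list(loaded.columns))  # ['era', 'target']
```
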

NumerFrame Usage

A NumerFrame object can be initialized from memory just like you would with a Pandas DataFrame.

1. Initialize from memory

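import uuid

import numpy as np
import pandas as pd
# Import path assumes the numerblox package layout.
from numerblox.numerframe import NumerFrame, create_numerframe

test_features = [f"feature_{l}" for l in "ABCDEFGHIK"]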
id_col = [uuid.uuid4().hex for _ in range(100)]

# Random DataFrame
dataf = pd.DataFrame(np.random.uniform(size=(100, 10)), columns=test_features)
dataf["id"] = id_col
dataf[["target", "target_1", "target_2"]] = np.random.normal(size=(100, 3))
dataf["date"] = range(100)
memory_dataf = NumerFrame(dataf)
assert memory_dataf.meta.era_col == "date"
memory_dataf.head(2)
feature_A feature_B feature_C feature_D feature_E feature_F feature_G feature_H feature_I feature_K id target target_1 target_2 date
0 0.939419 0.248241 0.131653 0.078532 0.203165 0.704196 0.152887 0.053677 0.890165 0.629224 35b61083cea044b280f3d33ebd55b420 1.630649 -1.932858 0.542668 0
1 0.422693 0.537752 0.137142 0.686566 0.582026 0.103345 0.748019 0.186389 0.500554 0.943307 ce5d2abc49fd40849981a3694b96334f 0.811435 -0.841065 1.128461 1

The meta attribute will store which era column is used. This is used in NumerBlox processors to group computations by era where needed.

memory_dataf.meta
{'era_col': 'date'}
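As a rough illustration of how a stored era column can drive era-wise computations, the sketch below detects the era column in plain Pandas and groups by it. The helper find_era_col and the choice to take the first match in the order era, friday_date, date are assumptions made for this example, not documented library behavior.

```python
import numpy as np
import pandas as pd

# Candidate era column names, taken from the convention described earlier.
ERA_CANDIDATES = ("era", "friday_date", "date")


def find_era_col(dataf):
    """Return the first candidate era column that is present, or None."""
    for candidate in ERA_CANDIDATES:
        if candidate in dataf.columns:
            return candidate
    return None


rng = np.random.default_rng(0)
dataf = pd.DataFrame({
    "date": [0, 0, 1, 1],
    "feature_a": rng.uniform(size=4),
    "target": rng.normal(size=4),
})

era_col = find_era_col(dataf)
# Example of an era-wise computation: mean target per era.
per_era_mean = dataf.groupby(era_col)["target"].mean()
print(era_col)  # date
```
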

2. Initialize from file path

You can also initialize a NumerFrame directly from a file with the convenience function create_numerframe. Think of it as a dynamic pd.read_csv, pd.read_parquet, etc.

num_dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
assert num_dataf.meta.era_col == "era"

3. Example functionality

.get_feature_data retrieves all columns whose name starts with "feature".

num_dataf.get_feature_data.head(2)
feature_dichasial_hammier_spawner feature_rheumy_epistemic_prancer feature_pert_performative_hormuz feature_hillier_unpitied_theobromine feature_perigean_bewitching_thruster feature_renegade_undomestic_milord feature_koranic_rude_corf feature_demisable_expiring_millepede feature_unscheduled_malignant_shingling feature_clawed_unwept_adaptability ... feature_unpruned_pedagoguish_inkblot feature_forworn_hask_haet feature_drawable_exhortative_dispersant feature_metabolic_minded_armorist feature_investigatory_inerasable_circumvallation feature_centroclinal_incentive_lancelet feature_unemotional_quietistic_chirper feature_behaviorist_microbiological_farina feature_lofty_acceptable_challenge feature_coactive_prefatorial_lucy
id
n559bd06a8861222 0.25 0.75 0.25 0.75 0.25 0.50 1.0 0.25 0.25 0.75 ... 0.75 0.0 1.00 0.0 0.0 0.25 0.00 0.0 1.00 0.25
n9d39dea58c9e3cf 0.75 0.50 0.75 1.00 0.50 0.25 0.5 0.00 1.00 0.25 ... 1.00 1.0 0.25 0.5 0.0 0.25 0.75 1.0 0.75 1.00

2 rows × 1050 columns

.get_target_data retrieves all columns whose name starts with "target".

num_dataf.get_target_data.head(2)
target target_nomi_20 target_nomi_60 target_jerome_20 target_jerome_60 target_janet_20 target_janet_60 target_ben_20 target_ben_60 target_alan_20 ... target_paul_20 target_paul_60 target_george_20 target_george_60 target_william_20 target_william_60 target_arthur_20 target_arthur_60 target_thomas_20 target_thomas_60
id
n559bd06a8861222 0.25 0.25 0.50 0.0 0.50 0.5 0.5 0.25 0.5 0.5 ... 0.0 0.50 0.25 0.5 0.000000 0.500000 0.166667 0.500000 0.333333 0.500000
n9d39dea58c9e3cf 0.50 0.50 0.75 0.5 0.75 0.5 0.5 0.50 0.5 0.5 ... 0.5 0.75 0.50 0.5 0.666667 0.666667 0.500000 0.666667 0.500000 0.666667

2 rows × 21 columns

.get_single_target_data only retrieves the column "target".

num_dataf.get_single_target_data.head(2)
target
id
n559bd06a8861222 0.25
n9d39dea58c9e3cf 0.50

.get_pattern_data allows you to get columns based on a certain pattern. In this example we retrieve all 20-day targets.

num_dataf.get_pattern_data("_20").head(2)
target_nomi_20 target_jerome_20 target_janet_20 target_ben_20 target_alan_20 target_paul_20 target_george_20 target_william_20 target_arthur_20 target_thomas_20
id
n559bd06a8861222 0.25 0.0 0.5 0.25 0.5 0.0 0.25 0.000000 0.166667 0.333333
n9d39dea58c9e3cf 0.50 0.5 0.5 0.50 0.5 0.5 0.50 0.666667 0.500000 0.500000
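For comparison, a pattern-based selection like the "_20" example can be approximated in plain Pandas with DataFrame.filter. This is a sketch under the assumption that the pattern is matched anywhere in the column name; whether the library method behaves identically is not confirmed here.

```python
import pandas as pd

dataf = pd.DataFrame({
    "target": [0.25, 0.5],
    "target_nomi_20": [0.25, 0.5],
    "target_nomi_60": [0.5, 0.75],
    "target_jerome_20": [0.0, 0.5],
})

# filter(regex=...) keeps the columns whose name contains a regex match.
twenty_day = dataf.filter(regex="_20")
print(list(twenty_day.columns))  # ['target_nomi_20', 'target_jerome_20']
```

Note that an unanchored pattern such as "_20" would also match a hypothetical column ending in "_200". Anchoring the pattern as "_20$" avoids that.
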
num_dataf.head()
era data_type feature_dichasial_hammier_spawner feature_rheumy_epistemic_prancer feature_pert_performative_hormuz feature_hillier_unpitied_theobromine feature_perigean_bewitching_thruster feature_renegade_undomestic_milord feature_koranic_rude_corf feature_demisable_expiring_millepede ... target_paul_20 target_paul_60 target_george_20 target_george_60 target_william_20 target_william_60 target_arthur_20 target_arthur_60 target_thomas_20 target_thomas_60
id
n559bd06a8861222 0297 train 0.25 0.75 0.25 0.75 0.25 0.50 1.00 0.25 ... 0.00 0.50 0.25 0.50 0.000000 0.500000 0.166667 0.500000 0.333333 0.500000
n9d39dea58c9e3cf 0003 train 0.75 0.50 0.75 1.00 0.50 0.25 0.50 0.00 ... 0.50 0.75 0.50 0.50 0.666667 0.666667 0.500000 0.666667 0.500000 0.666667
nb64f06d3a9fc9f1 0472 train 1.00 1.00 1.00 0.50 0.00 1.00 0.25 0.50 ... 0.00 0.25 0.50 0.50 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333
n1927b4862500882 0265 train 0.00 0.00 0.25 0.00 1.00 0.00 0.00 0.00 ... 0.75 0.75 0.50 0.75 0.833333 0.833333 0.666667 0.833333 0.666667 0.666667
nc3234b6eeacd6b7 0299 train 0.75 0.25 0.00 0.75 1.00 0.25 0.00 0.00 ... 0.25 0.50 0.50 0.50 0.166667 0.666667 0.333333 0.500000 0.500000 0.666667

5 rows × 1073 columns

.get_era_batch will return a tf.Tensor or np.array with feature data and target data for one or more eras. Convenient for creating neural network DataGenerators.

X_era, y_era = num_dataf.get_era_batch(['0297'], convert_to_tf=True)
X_era
<tf.Tensor: shape=(1, 1050), dtype=float32, numpy=array([[0.25, 0.75, 0.25, ..., 0.  , 1.  , 0.25]], dtype=float32)>
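The idea of an era batch can be sketched with plain Pandas and NumPy. The function era_batch below is a hypothetical stand-in, not the library method: it assumes features and targets are selected by prefix and returned as float32 arrays for the requested eras.

```python
import numpy as np
import pandas as pd


def era_batch(dataf, eras, era_col="era"):
    """Return (features, targets) as float32 arrays for the given eras."""
    rows = dataf[dataf[era_col].isin(eras)]
    feature_cols = [c for c in rows.columns if c.startswith("feature")]
    target_cols = [c for c in rows.columns if c.startswith("target")]
    X = rows[feature_cols].to_numpy(dtype=np.float32)
    y = rows[target_cols].to_numpy(dtype=np.float32)
    return X, y


dataf = pd.DataFrame({
    "era": ["0001", "0001", "0002"],
    "feature_a": [0.25, 0.5, 0.75],
    "feature_b": [1.0, 0.0, 0.5],
    "target": [0.5, 0.25, 0.75],
})

X_era, y_era = era_batch(dataf, ["0001"])
print(X_era.shape, y_era.shape)  # (2, 2) (2, 1)
```

If TensorFlow tensors are needed, the resulting arrays could be wrapped with tf.convert_to_tensor.
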

If you train an autoencoder + MLP setup, set aemlp_batch=True to get a target that contains 3 elements: features, targets and targets. More info on this setup can be found in the forum post "AutoEncoder and multitask MLP on new dataset".

_, y_era_aemlp = num_dataf.get_era_batch(['0297'], convert_to_tf=True, aemlp_batch=True)
y_era_aemlp
[<tf.Tensor: shape=(1, 1050), dtype=float32, numpy=array([[0.25, 0.75, 0.25, ..., 0.  , 1.  , 0.25]], dtype=float32)>,
 <tf.Tensor: shape=(1, 21), dtype=float32, numpy=
 array([[0.25      , 0.25      , 0.5       , 0.        , 0.5       ,
         0.5       , 0.5       , 0.25      , 0.5       , 0.5       ,
         0.5       , 0.        , 0.5       , 0.25      , 0.5       ,
         0.        , 0.5       , 0.16666667, 0.5       , 0.33333334,
         0.5       ]], dtype=float32)>,
 <tf.Tensor: shape=(1, 21), dtype=float32, numpy=
 array([[0.25      , 0.25      , 0.5       , 0.        , 0.5       ,
         0.5       , 0.5       , 0.25      , 0.5       , 0.5       ,
         0.5       , 0.        , 0.5       , 0.25      , 0.5       ,
         0.        , 0.5       , 0.16666667, 0.5       , 0.33333334,
         0.5       ]], dtype=float32)>]
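The 3-element target shown above can be mimicked with a small generator that yields one batch per era. This is a sketch only: the function aemlp_batches is hypothetical, and the reading that the first element serves as a reconstruction target while the other two serve as prediction targets is an assumption about the AEMLP setup.

```python
import numpy as np
import pandas as pd


def aemlp_batches(dataf, era_col="era"):
    """Yield (X, [X, y, y]) per era, mirroring the 3-element target above."""
    feature_cols = [c for c in dataf.columns if c.startswith("feature")]
    target_cols = [c for c in dataf.columns if c.startswith("target")]
    for _, rows in dataf.groupby(era_col, sort=True):
        X = rows[feature_cols].to_numpy(dtype=np.float32)
        y = rows[target_cols].to_numpy(dtype=np.float32)
        # Features appear once and targets twice in the target list.
        yield X, [X, y, y]


dataf = pd.DataFrame({
    "era": ["0001", "0001", "0002"],
    "feature_a": [0.25, 0.5, 0.75],
    "feature_b": [1.0, 0.0, 0.5],
    "target": [0.5, 0.25, 0.75],
})

batches = list(aemlp_batches(dataf))
X_first, y_first = batches[0]
print(len(batches), [part.shape for part in y_first])  # 2 [(2, 2), (2, 1), (2, 1)]
```
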

.aux_cols lists all columns that are not feature, target or prediction columns.

num_dataf.aux_cols
['era', 'data_type']
num_dataf.get_aux_data.head(2)
era data_type
id
n559bd06a8861222 0297 train
n9d39dea58c9e3cf 0003 train
num_dataf['prediction_1'] = np.random.uniform(size=len(num_dataf))
num_dataf['prediction_2'] = np.random.uniform(size=len(num_dataf))

Columns are tracked at initialization, so to track newly added columns such as prediction columns, initialize a new NumerFrame. Prediction columns can then be retrieved with .get_prediction_data, or with .get_prediction_aux_data if you also want columns like era and data_type. This is handy for ensembling and submission use cases.

num_dataf = NumerFrame(num_dataf)
num_dataf.get_prediction_data.head(2)
prediction_1 prediction_2
id
n559bd06a8861222 0.087993 0.729183
n9d39dea58c9e3cf 0.238604 0.384513
num_dataf.get_prediction_aux_data.head(2)
prediction_1 prediction_2 era data_type
id
n559bd06a8861222 0.087993 0.729183 0297 train
n9d39dea58c9e3cf 0.238604 0.384513 0003 train
num_dataf.meta
{'era_col': 'era'}

Because NumerFrame inherits from pd.DataFrame you still have all functionality of a normal DataFrame at your disposal, like copying.

dataf2 = num_dataf.copy()
assert dataf2.equals(num_dataf)

NumerFrame dynamically tracks which feature, target, aux and prediction columns there are when initialized. For example, here we add a new prediction column. Upon initialization the column will be contained in prediction_cols. Prediction columns are all column names that start with prediction.

num_dataf.loc[:, "prediction_test_1"] = np.random.uniform(size=len(num_dataf))
new_dataset = NumerFrame(num_dataf)
assert "prediction_test_1" in new_dataset.prediction_cols
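Why a fresh initialization is needed can be shown with a toy DataFrame subclass that computes its column groups only in the constructor. The class TrackedFrame is hypothetical and much simpler than NumerFrame; it relies on the Pandas subclassing hooks _metadata and _constructor.

```python
import pandas as pd


class TrackedFrame(pd.DataFrame):
    """Hypothetical subclass that records prediction columns at init."""

    # Names listed in _metadata are treated by Pandas as instance attributes.
    _metadata = ["prediction_cols"]

    @property
    def _constructor(self):
        return TrackedFrame

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Computed once, at construction time only.
        self.prediction_cols = [
            c for c in self.columns if str(c).startswith("prediction")
        ]


frame = TrackedFrame({"feature_a": [0.25, 0.75], "target": [0.5, 0.0]})
frame["prediction_test_1"] = [0.1, 0.9]
print(frame.prediction_cols)      # [] because the column was added after init

refreshed = TrackedFrame(frame)
print(refreshed.prediction_cols)  # ['prediction_test_1']
```
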

Arbitrary columns can be retrieved with .get_column_selection. The input argument can be either a string or a list of column names.

selection1 = num_dataf.get_column_selection("era")
selection1.head(2)
era
id
n559bd06a8861222 0297
n9d39dea58c9e3cf 0003
selection2 = num_dataf.get_column_selection(["era", "prediction_test_1"])
selection2.head(2)
era prediction_test_1
id
n559bd06a8861222 0297 0.952676
n9d39dea58c9e3cf 0003 0.616081
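Accepting either a string or a list is a small normalization step. A plain Pandas sketch, with the hypothetical helper select_columns, could look like this:

```python
import pandas as pd


def select_columns(dataf, cols):
    """Accept one column name or a list of names; always return a DataFrame."""
    if isinstance(cols, str):
        cols = [cols]
    return dataf.loc[:, list(cols)]


dataf = pd.DataFrame({
    "era": ["0297", "0003"],
    "feature_a": [0.25, 0.75],
    "prediction_test_1": [0.95, 0.62],
})

single = select_columns(dataf, "era")
multiple = select_columns(dataf, ["era", "prediction_test_1"])
print(single.shape, multiple.shape)  # (2, 1) (2, 2)
```

Wrapping the single name in a list means the result is a DataFrame rather than a Series in both cases, which matches the outputs shown above.
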

For convenience, .get_feature_target_pair returns a (features, target) pair in one call. With multi_target=False only the "target" column is returned as the target. With multi_target=True all columns whose name starts with "target" are retrieved.

features, single_target = num_dataf.get_feature_target_pair(multi_target=False)
features.head(2)
feature_dichasial_hammier_spawner feature_rheumy_epistemic_prancer feature_pert_performative_hormuz feature_hillier_unpitied_theobromine feature_perigean_bewitching_thruster feature_renegade_undomestic_milord feature_koranic_rude_corf feature_demisable_expiring_millepede feature_unscheduled_malignant_shingling feature_clawed_unwept_adaptability ... feature_unpruned_pedagoguish_inkblot feature_forworn_hask_haet feature_drawable_exhortative_dispersant feature_metabolic_minded_armorist feature_investigatory_inerasable_circumvallation feature_centroclinal_incentive_lancelet feature_unemotional_quietistic_chirper feature_behaviorist_microbiological_farina feature_lofty_acceptable_challenge feature_coactive_prefatorial_lucy
id
n559bd06a8861222 0.25 0.75 0.25 0.75 0.25 0.50 1.0 0.25 0.25 0.75 ... 0.75 0.0 1.00 0.0 0.0 0.25 0.00 0.0 1.00 0.25
n9d39dea58c9e3cf 0.75 0.50 0.75 1.00 0.50 0.25 0.5 0.00 1.00 0.25 ... 1.00 1.0 0.25 0.5 0.0 0.25 0.75 1.0 0.75 1.00

2 rows × 1050 columns

single_target.head(2)
target
id
n559bd06a8861222 0.25
n9d39dea58c9e3cf 0.50