= [f"feature_{l}" for l in "ABCDEFGHIK"]
test_features = [uuid.uuid4().hex for _ in range(100)]
id_col
# Random DataFrame
= pd.DataFrame(np.random.uniform(size=(100, 10)), columns=test_features)
dataf "id"] = id_col
dataf["target", "target_1", "target_2"]] = np.random.normal(size=(100, 3))
dataf[["date"] = range(100) dataf[
NumerFrame
Overview: The NumerFrame
NumerFrame is a data structure that extends pd.DataFrame with functionality convenient for Numerai users. The main benefits include:
1. Automatically track feature, target, prediction and other columns + easily retrieve these data slices.
2. Other library functionality automatically recognizes the era column (era, friday_date or date).
3. Integrations with other library components (e.g. preprocessing, model, modelpipeline, postprocessing, evaluation and submission) to create more solid inference pipelines and increase reliability.
Besides, all functionality of Pandas DataFrames is still available in the NumerFrame. You therefore don't have to create new pipelines to process your data when using NumerFrame.
We adopt the following conventions:
1. All feature column names should start with 'feature'.
2. All target column names should start with 'target'.
3. All prediction column names should start with 'prediction'.
4. Data should contain an 'era', 'friday_date' or 'date' column, as is almost always the case with Numerai datasets.
Every column for which these conditions do not hold will be classified as an 'aux' column.
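For illustration, here is a minimal sketch of how these conventions classify columns. The toy column names are made up for this example, the import path is assumed to be numerblox.numerframe, and feature_cols/target_cols are assumed to mirror the prediction_cols and aux_cols attributes shown later on this page.

import pandas as pd
from numerblox.numerframe import NumerFrame

# Toy frame with one column per naming convention, plus typical aux columns.
toy = pd.DataFrame({
    "feature_example": [0.25, 0.75],
    "target": [0.50, 0.25],
    "prediction_model": [0.10, 0.90],
    "era": ["0001", "0002"],
    "data_type": ["train", "train"],
})
toy = NumerFrame(toy)
print(toy.feature_cols)     # expected: ['feature_example']
print(toy.target_cols)      # expected: ['target']
print(toy.prediction_cols)  # expected: ['prediction_model']
print(toy.aux_cols)         # expected: ['era', 'data_type']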
NumerFrame
NumerFrame (*args, **kwargs)
Data structure which extends Pandas DataFrames and allows for additional Numerai specific functionality.
create_numerframe automatically recognizes your data file format, loads it into a NumerFrame and allows for column selection before loading. Supported file formats are .csv, .parquet, .pkl, .pickle, .xls, .xlsx, .xlsm, .xlsb, .odf, .ods and .odt. If the file format for your use case is missing, feel free to create a GitHub issue or submit a pull request. See README.md for more information on contributing.
create_numerframe
create_numerframe (file_path:str, columns:list=None, *args, **kwargs)
Convenient function to initialize NumerFrame. Support most used file formats for Pandas DataFrames
(.csv, .parquet, .xls, .pkl, etc.). For more details check https://pandas.pydata.org/docs/reference/io.html
:param file_path: Relative or absolute path to data file.
:param columns: Which columns to read (All by default).
*args, **kwargs will be passed to Pandas loading function.
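As a quick sketch of column selection at load time, the test file used later on this page could be read with only a subset of its columns. The exact subset here is just an illustration and the import path is assumed to be numerblox.numerframe.

from numerblox.numerframe import create_numerframe

# Read only a few columns from the Parquet file; all other columns are skipped at load time.
small_dataf = create_numerframe(
    "test_assets/mini_numerai_version_2_data.parquet",
    columns=["era", "data_type", "target"],
)
small_dataf.head(2)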
NumerFrame Usage
A NumerFrame object can be initialized from memory just like you would with a Pandas DataFrame.
1. Initialize from memory
memory_dataf = NumerFrame(dataf)
assert memory_dataf.meta.era_col == "date"
memory_dataf.head(2)
feature_A | feature_B | feature_C | feature_D | feature_E | feature_F | feature_G | feature_H | feature_I | feature_K | id | target | target_1 | target_2 | date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.939419 | 0.248241 | 0.131653 | 0.078532 | 0.203165 | 0.704196 | 0.152887 | 0.053677 | 0.890165 | 0.629224 | 35b61083cea044b280f3d33ebd55b420 | 1.630649 | -1.932858 | 0.542668 | 0 |
1 | 0.422693 | 0.537752 | 0.137142 | 0.686566 | 0.582026 | 0.103345 | 0.748019 | 0.186389 | 0.500554 | 0.943307 | ce5d2abc49fd40849981a3694b96334f | 0.811435 | -0.841065 | 1.128461 | 1 |
The meta attribute stores which era column is used. This is used in NumerBlox processors to group computations by era where needed.
memory_dataf.meta
{'era_col': 'date'}
2. Initialize from file path
You can also use the convenience function create_numerframe so NumerFrame can be easily initialized. Think of it as a dynamic pd.read_csv, pd.read_parquet, etc.
num_dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
assert num_dataf.meta.era_col == "era"
3. Example functionality
.get_feature_data will retrieve all columns where the column name starts with feature.
num_dataf.get_feature_data.head(2)
feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | feature_demisable_expiring_millepede | feature_unscheduled_malignant_shingling | feature_clawed_unwept_adaptability | ... | feature_unpruned_pedagoguish_inkblot | feature_forworn_hask_haet | feature_drawable_exhortative_dispersant | feature_metabolic_minded_armorist | feature_investigatory_inerasable_circumvallation | feature_centroclinal_incentive_lancelet | feature_unemotional_quietistic_chirper | feature_behaviorist_microbiological_farina | feature_lofty_acceptable_challenge | feature_coactive_prefatorial_lucy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
n559bd06a8861222 | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.50 | 1.0 | 0.25 | 0.25 | 0.75 | ... | 0.75 | 0.0 | 1.00 | 0.0 | 0.0 | 0.25 | 0.00 | 0.0 | 1.00 | 0.25 |
n9d39dea58c9e3cf | 0.75 | 0.50 | 0.75 | 1.00 | 0.50 | 0.25 | 0.5 | 0.00 | 1.00 | 0.25 | ... | 1.00 | 1.0 | 0.25 | 0.5 | 0.0 | 0.25 | 0.75 | 1.0 | 0.75 | 1.00 |
2 rows × 1050 columns
.get_target_data retrieves all columns where the column name starts with "target".
num_dataf.get_target_data.head(2)
target | target_nomi_20 | target_nomi_60 | target_jerome_20 | target_jerome_60 | target_janet_20 | target_janet_60 | target_ben_20 | target_ben_60 | target_alan_20 | ... | target_paul_20 | target_paul_60 | target_george_20 | target_george_60 | target_william_20 | target_william_60 | target_arthur_20 | target_arthur_60 | target_thomas_20 | target_thomas_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
n559bd06a8861222 | 0.25 | 0.25 | 0.50 | 0.0 | 0.50 | 0.5 | 0.5 | 0.25 | 0.5 | 0.5 | ... | 0.0 | 0.50 | 0.25 | 0.5 | 0.000000 | 0.500000 | 0.166667 | 0.500000 | 0.333333 | 0.500000 |
n9d39dea58c9e3cf | 0.50 | 0.50 | 0.75 | 0.5 | 0.75 | 0.5 | 0.5 | 0.50 | 0.5 | 0.5 | ... | 0.5 | 0.75 | 0.50 | 0.5 | 0.666667 | 0.666667 | 0.500000 | 0.666667 | 0.500000 | 0.666667 |
2 rows × 21 columns
.get_single_target_data only retrieves the column "target".
num_dataf.get_single_target_data.head(2)
target | |
---|---|
id | |
n559bd06a8861222 | 0.25 |
n9d39dea58c9e3cf | 0.50 |
.get_pattern_data allows you to get columns based on a certain pattern. In this example we retrieve all 20-day targets.
"_20").head(2) num_dataf.get_pattern_data(
target_nomi_20 | target_jerome_20 | target_janet_20 | target_ben_20 | target_alan_20 | target_paul_20 | target_george_20 | target_william_20 | target_arthur_20 | target_thomas_20 | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
n559bd06a8861222 | 0.25 | 0.0 | 0.5 | 0.25 | 0.5 | 0.0 | 0.25 | 0.000000 | 0.166667 | 0.333333 |
n9d39dea58c9e3cf | 0.50 | 0.5 | 0.5 | 0.50 | 0.5 | 0.5 | 0.50 | 0.666667 | 0.500000 | 0.500000 |
num_dataf.head()
era | data_type | feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | feature_demisable_expiring_millepede | ... | target_paul_20 | target_paul_60 | target_george_20 | target_george_60 | target_william_20 | target_william_60 | target_arthur_20 | target_arthur_60 | target_thomas_20 | target_thomas_60 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
n559bd06a8861222 | 0297 | train | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.50 | 1.00 | 0.25 | ... | 0.00 | 0.50 | 0.25 | 0.50 | 0.000000 | 0.500000 | 0.166667 | 0.500000 | 0.333333 | 0.500000 |
n9d39dea58c9e3cf | 0003 | train | 0.75 | 0.50 | 0.75 | 1.00 | 0.50 | 0.25 | 0.50 | 0.00 | ... | 0.50 | 0.75 | 0.50 | 0.50 | 0.666667 | 0.666667 | 0.500000 | 0.666667 | 0.500000 | 0.666667 |
nb64f06d3a9fc9f1 | 0472 | train | 1.00 | 1.00 | 1.00 | 0.50 | 0.00 | 1.00 | 0.25 | 0.50 | ... | 0.00 | 0.25 | 0.50 | 0.50 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 |
n1927b4862500882 | 0265 | train | 0.00 | 0.00 | 0.25 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | ... | 0.75 | 0.75 | 0.50 | 0.75 | 0.833333 | 0.833333 | 0.666667 | 0.833333 | 0.666667 | 0.666667 |
nc3234b6eeacd6b7 | 0299 | train | 0.75 | 0.25 | 0.00 | 0.75 | 1.00 | 0.25 | 0.00 | 0.00 | ... | 0.25 | 0.50 | 0.50 | 0.50 | 0.166667 | 0.666667 | 0.333333 | 0.500000 | 0.500000 | 0.666667 |
5 rows × 1073 columns
.get_era_batch will return a tf.Tensor or np.array with feature data and target data for one or more eras. Convenient for creating neural network DataGenerators (a small generator sketch follows the outputs below).
X_era, y_era = num_dataf.get_era_batch(['0297'], convert_to_tf=True)
X_era
<tf.Tensor: shape=(1, 1050), dtype=float32, numpy=array([[0.25, 0.75, 0.25, ..., 0. , 1. , 0.25]], dtype=float32)>
For people training autoencoders + MLP, you can get a target batch that contains 3 elements: features, targets and targets. Just set aemlp_batch=True. More info on this setup can be found in the "AutoEncoder and multitask MLP on new dataset" forum post.
_, y_era_aemlp = num_dataf.get_era_batch(['0297'], convert_to_tf=True, aemlp_batch=True)
y_era_aemlp
[<tf.Tensor: shape=(1, 1050), dtype=float32, numpy=array([[0.25, 0.75, 0.25, ..., 0. , 1. , 0.25]], dtype=float32)>,
<tf.Tensor: shape=(1, 21), dtype=float32, numpy=
array([[0.25 , 0.25 , 0.5 , 0. , 0.5 ,
0.5 , 0.5 , 0.25 , 0.5 , 0.5 ,
0.5 , 0. , 0.5 , 0.25 , 0.5 ,
0. , 0.5 , 0.16666667, 0.5 , 0.33333334,
0.5 ]], dtype=float32)>,
<tf.Tensor: shape=(1, 21), dtype=float32, numpy=
array([[0.25 , 0.25 , 0.5 , 0. , 0.5 ,
0.5 , 0.5 , 0.25 , 0.5 , 0.5 ,
0.5 , 0. , 0.5 , 0.25 , 0.5 ,
0. , 0.5 , 0.16666667, 0.5 , 0.33333334,
0.5 ]], dtype=float32)>]
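As the generator sketch referenced above, a simple per-era loop can be built on top of .get_era_batch. The era_batches helper and the commented training loop are illustrative, not part of NumerBlox.

def era_batches(dataf, convert_to_tf=True):
    # Yield one (features, targets) batch per era, using the era column tracked in dataf.meta.
    for era in dataf[dataf.meta.era_col].unique():
        yield dataf.get_era_batch([era], convert_to_tf=convert_to_tf)

# Hypothetical training loop consuming the generator:
# for X_era, y_era in era_batches(num_dataf):
#     model.train_on_batch(X_era, y_era)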
.aux_cols denotes all columns that are not feature, target or prediction columns.
num_dataf.aux_cols
['era', 'data_type']
num_dataf.get_aux_data.head(2)
era | data_type | |
---|---|---|
id | ||
n559bd06a8861222 | 0297 | train |
n9d39dea58c9e3cf | 0003 | train |
num_dataf['prediction_1'] = np.random.uniform(size=len(num_dataf))
num_dataf['prediction_2'] = np.random.uniform(size=len(num_dataf))
To track new columns like prediction columns, make sure to initialize a new NumerFrame. Prediction columns can easily be retrieved with .get_prediction_data, or with .get_prediction_aux_data if you also want columns like era and data_type. This can be handy for ensembling and submission use cases (see the ensembling sketch below).
num_dataf = NumerFrame(num_dataf)
num_dataf.get_prediction_data.head(2)
prediction_1 | prediction_2 | |
---|---|---|
id | ||
n559bd06a8861222 | 0.087993 | 0.729183 |
n9d39dea58c9e3cf | 0.238604 | 0.384513 |
num_dataf.get_prediction_aux_data.head(2)
prediction_1 | prediction_2 | era | data_type | |
---|---|---|---|---|
id | ||||
n559bd06a8861222 | 0.087993 | 0.729183 | 0297 | train |
n9d39dea58c9e3cf | 0.238604 | 0.384513 | 0003 | train |
num_dataf.meta
{'era_col': 'era'}
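As the ensembling sketch referenced above, the tracked prediction columns can be averaged into a new prediction column. The averaging logic and the column name "prediction_ensemble" are illustrative, not a NumerBlox API.

# Average all tracked prediction columns into one ensembled column.
num_dataf["prediction_ensemble"] = num_dataf.get_prediction_data.mean(axis=1)

# Re-initialize so the new column is tracked as a prediction column as well.
num_dataf = NumerFrame(num_dataf)
assert "prediction_ensemble" in num_dataf.prediction_cols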
Because NumerFrame inherits from pd.DataFrame, you still have all functionality of a normal DataFrame at your disposal, like copying.
dataf2 = num_dataf.copy()
assert dataf2.equals(num_dataf)
NumerFrame dynamically tracks which feature, target, aux and prediction columns there are when initialized. For example, here we add a new prediction column. Upon initialization the column will be contained in prediction_cols. Prediction columns are all column names that start with prediction.
"prediction_test_1"] = np.random.uniform(size=len(num_dataf))
num_dataf.loc[:, = NumerFrame(num_dataf)
new_dataset assert "prediction_test_1" in new_dataset.prediction_cols
Arbitrary columns can be retrieved with .get_column_selection. The input argument can be either a string or a list of column names.
= num_dataf.get_column_selection("era")
selection1 2) selection1.head(
era | |
---|---|
id | |
n559bd06a8861222 | 0297 |
n9d39dea58c9e3cf | 0003 |
= num_dataf.get_column_selection(["era", "prediction_test_1"])
selection2 2) selection2.head(
era | prediction_test_1 | |
---|---|---|
id | ||
n559bd06a8861222 | 0297 | 0.952676 |
n9d39dea58c9e3cf | 0003 | 0.616081 |
For convenience we can get a feature, target pair with one method. If multi_target=True, all columns where the column name starts with target will be retrieved.
features, single_target = num_dataf.get_feature_target_pair(multi_target=False)
features.head(2)
feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | feature_demisable_expiring_millepede | feature_unscheduled_malignant_shingling | feature_clawed_unwept_adaptability | ... | feature_unpruned_pedagoguish_inkblot | feature_forworn_hask_haet | feature_drawable_exhortative_dispersant | feature_metabolic_minded_armorist | feature_investigatory_inerasable_circumvallation | feature_centroclinal_incentive_lancelet | feature_unemotional_quietistic_chirper | feature_behaviorist_microbiological_farina | feature_lofty_acceptable_challenge | feature_coactive_prefatorial_lucy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
n559bd06a8861222 | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.50 | 1.0 | 0.25 | 0.25 | 0.75 | ... | 0.75 | 0.0 | 1.00 | 0.0 | 0.0 | 0.25 | 0.00 | 0.0 | 1.00 | 0.25 |
n9d39dea58c9e3cf | 0.75 | 0.50 | 0.75 | 1.00 | 0.50 | 0.25 | 0.5 | 0.00 | 1.00 | 0.25 | ... | 1.00 | 1.0 | 0.25 | 0.5 | 0.0 | 0.25 | 0.75 | 1.0 | 0.75 | 1.00 |
2 rows × 1050 columns
single_target.head(2)
target | |
---|---|
id | |
n559bd06a8861222 | 0.25 |
n9d39dea58c9e3cf | 0.50 |
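A typical next step is to fit a model directly on this (features, single_target) pair. Below is a minimal sketch, assuming scikit-learn is installed; the choice of LinearRegression is arbitrary and only for illustration.

from sklearn.linear_model import LinearRegression

# Fit on the feature slice and the single 'target' column retrieved above.
model = LinearRegression()
model.fit(features, single_target["target"])
predictions = model.predict(features)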