```python
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
test_paths = ["test_assets/joblib_v2_example_model.joblib"]
for path in test_paths:
    model = SingleModel(path, model_name="test")
    print(model.predict(dataf).get_prediction_data.head(2))
```
Model
0. Base
0.1. BaseModel
The BaseModel is an abstract base class that handles directory logic and naming conventions. All models should inherit from BaseModel and be sure to implement the .predict method.
In general, models are loaded from disk. However, if no model files are involved in your model, you should pass an empty string ("") as the model_directory argument.
Note that the new prediction column will have the column name prediction_{MODEL_NAME}.
BaseModel
BaseModel (model_directory:str, model_name:str=None)
Setup for model prediction on a Dataset.
:param model_directory: Main directory from which to read in models.
:param model_name: Name that will be used to create column names and for display purposes.
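The inheritance contract above can be sketched as follows. This is a minimal stand-in, not the library's actual code: BaseModelSketch, MeanFeatureModel, and the toy pandas DataFrame (standing in for a NumerFrame) are all illustrative assumptions.

```python
from abc import ABC, abstractmethod
import pandas as pd

class BaseModelSketch(ABC):
    """Simplified stand-in for BaseModel, for illustration only."""
    def __init__(self, model_directory: str, model_name: str = None):
        self.model_directory = model_directory
        self.model_name = model_name if model_name else "base"
        # Naming convention: prediction_{MODEL_NAME}
        self.prediction_col = f"prediction_{self.model_name}"

    @abstractmethod
    def predict(self, dataf: pd.DataFrame) -> pd.DataFrame:
        ...

class MeanFeatureModel(BaseModelSketch):
    """Toy model: predicts the row-wise mean of all feature columns."""
    def predict(self, dataf: pd.DataFrame) -> pd.DataFrame:
        feature_cols = [c for c in dataf.columns if c.startswith("feature")]
        dataf[self.prediction_col] = dataf[feature_cols].mean(axis=1)
        return dataf

dataf = pd.DataFrame({"feature_a": [0.0, 0.5], "feature_b": [1.0, 0.5]})
preds = MeanFeatureModel(model_directory="", model_name="test").predict(dataf)
print(preds["prediction_test"].tolist())  # [0.5, 0.5]
```

Any subclass that implements .predict this way automatically produces a column following the prediction_{MODEL_NAME} convention.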
0.2. DirectoryModel
A DirectoryModel assumes that you have a directory of models and want to load and predict with all models that have a certain file_suffix (for example, .joblib, .cbm or .lgb). This base class handles the prediction logic for this situation.
If you are thinking of implementing your own model and your use case involves reading multiple models from a directory, you should inherit from DirectoryModel and be sure to implement .load_models. You then don’t have to implement any prediction logic in the .predict method.
When inheriting from DirectoryModel, the only mandatory method implementation is .load_models. It should instantiate all models and return them as a list.
DirectoryModel
DirectoryModel (model_directory:str, file_suffix:str, model_name:str=None, feature_cols:list=None, combine_preds=True)
Base class implementation where predictions are averaged out from a directory of models. Walks through every file with given file_suffix in a directory.
:param model_directory: Main directory from which to read in models.
:param file_suffix: File format to load (For example, .joblib, .pkl, .cbm or .lgb)
:param model_name: Name that will be used to create column names and for display purposes.
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.
:param combine_preds: Whether to average predictions along column axis. Only relevant for multi target models.
Convenient when you want to predict the main target by averaging a multi-target model.
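To make the directory walk and averaging behavior concrete, here is a self-contained sketch of the idea. It is not the library's implementation: DirectoryModelSketch and the pickled Offset toy models are assumptions, and pickle/.pkl is used here only because it needs no third-party dependency.

```python
import glob
import os
import pickle
import tempfile
import numpy as np

class DirectoryModelSketch:
    """Illustrative sketch: collect every file with a given suffix and average predictions."""
    def __init__(self, model_directory: str, file_suffix: str):
        pattern = os.path.join(model_directory, f"*{file_suffix}")
        self.model_paths = sorted(glob.glob(pattern))

    def load_models(self):
        # The real class leaves this to subclasses; here we unpickle .pkl files.
        models = []
        for path in self.model_paths:
            with open(path, "rb") as f:
                models.append(pickle.load(f))
        return models

    def predict(self, X):
        # Average predictions from all models found in the directory.
        preds = [m.predict(X) for m in self.load_models()]
        return np.mean(preds, axis=0)

class Offset:
    """Toy picklable model with a .predict method."""
    def __init__(self, offset):
        self.offset = offset
    def predict(self, X):
        return X + self.offset

with tempfile.TemporaryDirectory() as d:
    for i, off in enumerate([0.1, 0.3]):
        with open(os.path.join(d, f"model_{i}.pkl"), "wb") as f:
            pickle.dump(Offset(off), f)
    model = DirectoryModelSketch(d, file_suffix=".pkl")
    result = model.predict(np.array([1.0, 2.0]))

print(result)  # approximately [1.2, 2.2]
```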
1. Single model formats
Implementations for common Numerai model prediction situations.
1.1. SingleModel
In many cases you just want to load a single model file and create predictions for that model. SingleModel supports this.
This class supports multiple model formats for easy use. All models should have a .predict method. Currently, the .joblib, .cbm, .pkl, .pickle and .h5 (Keras) formats are supported.
Things to keep in mind:
- This model will use all available features in the NumerFrame for prediction by default. Define feature_cols in SingleModel or implement a FeatureSelectionPreProcessor as part of your ModelPipeline if you are using a subset of features.
- If you have XGBoost models, we recommend saving them as .joblib.
- The added prediction column will have the column name prediction_{MODEL_NAME} if one target is predicted. For multiple targets, the new column names will be prediction_{MODEL_NAME}_{i} for each target number i (starting with 0).
- We welcome the Numerai community to extend SingleModel with more file formats. See the Contributing section in README.md for more information on contributing.
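The single- vs. multi-target naming convention described above can be illustrated with a small helper. This function is not part of the library; it only reproduces the documented naming scheme:

```python
def prediction_columns(model_name: str, n_targets: int) -> list:
    """Illustrate the documented naming convention for prediction columns."""
    if n_targets == 1:
        # Single target: prediction_{MODEL_NAME}
        return [f"prediction_{model_name}"]
    # Multiple targets: prediction_{MODEL_NAME}_{i}, starting at i=0
    return [f"prediction_{model_name}_{i}" for i in range(n_targets)]

print(prediction_columns("test", 1))  # ['prediction_test']
print(prediction_columns("test", 3))  # ['prediction_test_0', 'prediction_test_1', 'prediction_test_2']
```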
SingleModel
SingleModel (model_file_path:str, model_name:str=None, combine_preds=False, autoencoder_mlp=False, feature_cols:list=None)
Load single model from file and perform prediction logic.
:param model_file_path: Full path to model file.
:param model_name: Name that will be used to create column names and for display purposes.
:param combine_preds: Whether to average predictions along column axis. Only relevant for multi target models. Convenient when you want to predict the main target by averaging a multi-target model.
:param autoencoder_mlp: Whether your model is an autoencoder + MLP model. Will take the 3rd of tuple output in this case. Only relevant for NN models. More info on autoencoders: https://forum.numer.ai/t/autoencoder-and-multitask-mlp-on-new-dataset-from-kaggle-jane-street/4338
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.
```python
model = SingleModel(test_paths[0], model_name="test")
model.suffix_to_model_mapping
```
1.2. WandbKerasModel
This model is for a specific case: you are logging Keras models with Weights & Biases and want to download the best model for a specific run. WandbKerasModel wraps SingleModel and only adds additional logic for downloading models from Weights & Biases.
To authenticate your W&B account you are given several options:
1. Run wandb login in a terminal and follow the instructions (docs).
2. Configure the global environment variable "WANDB_API_KEY".
3. Run wandb.init(project=PROJECT_NAME, entity=ENTITY_NAME) and pass your API key from https://wandb.ai/authorize.
WandbKerasModel
WandbKerasModel (run_path:str, file_name:str='model-best.h5', combine_preds=False, autoencoder_mlp=False, replace=False, feature_cols:list=None)
Download best .h5 model from Weights & Biases (W&B) run in local directory and make predictions. More info on W&B: https://wandb.ai/site
:param run_path: W&B path structured as entity/project/run_id. Can be copied from the Overview tab of a W&B run. For more info: https://docs.wandb.ai/ref/app/pages/run-page#overview-tab
:param file_name: Name of .h5 file as saved in W&B run. ‘model-best.h5’ by default. File name can be found under files tab of W&B run.
:param combine_preds: Whether to average predictions along column axis. Convenient when you want to predict the main target by averaging a multi-target model.
:param autoencoder_mlp: Whether your model is an autoencoder + MLP model. Will take the 3rd of tuple output in this case. Only relevant for NN models.
More info on autoencoders: https://forum.numer.ai/t/autoencoder-and-multitask-mlp-on-new-dataset-from-kaggle-jane-street/4338
:param replace: Replace any model files saved under the same file name with downloaded W&B run model. WARNING: Setting to True may overwrite models in your local environment.
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.
1.3. CSVSub
This model is a wrapper for adding predictions from external CSVs in a directory.
ExternalCSVs
ExternalCSVs (data_directory:str='external_submissions')
Load external submissions and add to NumerFrame.
All CSV files in this directory will be added to the NumerFrame. Make sure all external predictions are prepared and ready for submission, i.e. IDs line up and there is one column named ‘prediction’.
:param data_directory: Directory path for retrieving external submission.
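For intuition, a valid external submission CSV simply pairs ids with a single ‘prediction’ column. The sketch below writes and reads such a file; the file name, the ids, and the temporary directory are all made up for illustration:

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as data_directory:
    # Hypothetical external submission: ids lining up with the NumerFrame
    # and exactly one column named 'prediction'.
    sub = pd.DataFrame({"id": ["n0001", "n0002"], "prediction": [0.42, 0.58]})
    csv_path = os.path.join(data_directory, "my_preds.csv")
    sub.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

print(list(loaded.columns))  # ['id', 'prediction']
```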
For testing, the external_submissions directory contains test_predictions.csv with values from the feature target_thomas_20.
```python
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
external = ExternalCSVs(data_directory="test_assets/external_submissions")
new_dataf = external.predict(dataf)
new_dataf.get_prediction_data.head(2)
```
If no submissions are found in the given data_directory you should receive a warning.

```python
ExternalCSVs(data_directory="Some_nonexisting_directory_12354321");
```
1.4. NumerBay
This model is a wrapper for adding predictions from NumerBay purchases.
Currently only Numerai Classic submissions are supported. Numerai Signals will be supported in a future version.
NumerBayCSVs
NumerBayCSVs (data_directory:str='numerbay_submissions', numerbay_product_full_names:list=None, numerbay_username:str=None, numerbay_password:str=None, numerbay_key_path:str=None, ticker_col:str='bloomberg_ticker')
Load NumerBay submissions and add to NumerFrame.
Make sure to provide correct NumerBay credentials and that your purchases have been confirmed and artifacts are available for download.
:param data_directory: Directory path for caching submissions. Files not already present in the directory will be downloaded from NumerBay.
:param numerbay_product_full_names: List of product full names (in the format of [category]-[product name]) to download from NumerBay. E.g. [‘numerai-predictions-numerbay’]
:param numerbay_username: NumerBay username
:param numerbay_password: NumerBay password
:param numerbay_key_path: NumerBay encryption key json file path (exported from the profile page)
Example usage
```python
# nb_model = NumerBayCSVs(data_directory='/app/notebooks/tmp',
#                         numerbay_product_full_names=['numerai-predictions-someproduct'],
#                         numerbay_username="myusername",
#                         numerbay_password="mypassword",
#                         numerbay_key_path="/app/notebooks/tmp/numerbay.json")
# preds = nb_model.predict(dataf)
```
2. Loading all models in directory
2.1. Joblib directory
Many models, like scikit-learn models, can conveniently be saved as .joblib files. This class automatically loads all .joblib files in a given folder and generates (averaged out) predictions.
JoblibModel
JoblibModel (model_directory:str, model_name:str=None, feature_cols:list=None)
Load and predict for arbitrary models in directory saved as .joblib.
All loaded models should have a .predict method and accept the features present in the data.
:param model_directory: Main directory from which to read in models.
:param model_name: Name that will be used to create column names and for display purposes.
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.
```python
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
model = JoblibModel("test_assets", model_name="Joblib_LGB")
predictions = model.predict(dataf).get_prediction_data
assert predictions['prediction_Joblib_LGB'].between(0, 1).all()
predictions.head(2)
```
2.2. Catboost directory (.cbm)
This model setup loads all CatBoost (.cbm) models present in a given directory and makes (averaged out) predictions.
CatBoostModel
CatBoostModel (model_directory:str, model_name:str=None, feature_cols:list=None)
Load and predict with all .cbm models (CatBoostRegressor) in directory.
:param model_directory: Main directory from which to read in models.
:param model_name: Name that will be used to define column names and for display purposes.
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.
```python
# Example on NumerFrame
# dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
# model = CatBoostModel("test_assets", model_name="CB")
# predictions = model.predict(dataf).get_prediction_data
# assert predictions['prediction_CB'].between(0, 1).all()
# predictions.head(2)
```
2.3. LightGBM directory (.lgb)
This model setup loads all LightGBM (.lgb) models present in a given directory and makes (averaged out) predictions.
LGBMModel
LGBMModel (model_directory:str, model_name:str=None, feature_cols:list=None)
Load and predict with all .lgb models (LightGBM) in directory.
:param model_directory: Main directory from which to read in models.
:param model_name: Name that will be used to define column names and for display purposes.
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.
```python
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
model = LGBMModel("test_assets", model_name="LGB")
predictions = model.predict(dataf).get_prediction_data
assert predictions['prediction_LGB'].between(0, 1).all()
predictions.head(2)
```
3. Baseline models
Setting a baseline is always an important step for data science problems. This section introduces models that should only be used as baselines.
3.1. ConstantModel
This model simply outputs a constant of your choice. Convenient for setting classification baselines.
ConstantModel
ConstantModel (constant:float=0.5, model_name:str=None)
WARNING: Only use this Model for testing purposes.
Create constant prediction.
:param constant: Value for constant prediction.
:param model_name: Name that will be used to create column names and for display purposes.
```python
constant = 0.85
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
constant_model = ConstantModel(constant=constant)
predictions = constant_model.predict(dataf).get_prediction_data
assert (predictions.to_numpy() == constant).all()
predictions.head(2)
```
3.2. RandomModel
This model returns uniformly distributed predictions in the range \([0, 1)\). A solid naive baseline for regression models.
RandomModel
RandomModel (model_name:str=None)
WARNING: Only use this Model for testing purposes.
Create uniformly distributed predictions.
:param model_name: Name that will be used to create column names and for display purposes.
```python
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
random_model = RandomModel()
predictions = random_model.predict(dataf).get_prediction_data
assert predictions['prediction_random'].between(0, 1).all()
predictions.head(2)
```
3.3. Example (validation) predictions
This Model downloads and adds example predictions for Numerai Classic. Convenient when you are constructing a ModelPipeline and want to include example predictions.
ExamplePredictionsModel
ExamplePredictionsModel (file_name:str='example_validation_predictions.parquet', data_directory:str='example_predictions_model', round_num:int=None)
Load example predictions and add to NumerFrame.
:param file_name: File to download from NumerAPI. ‘example_validation_predictions.parquet’ by default.
:param data_directory: Directory path to download example predictions to or directory where example data already exists.
:param round_num: Optional round number. Downloads most recent round by default.
```python
# Download validation data
downloader = NumeraiClassicDownloader("example_predictions_model")
val_file = "numerai_validation_data.parquet"
val_save_path = f"{str(downloader.dir)}/{val_file}"
downloader.download_single_dataset(filename=val_file,
                                   dest_path=val_save_path)

# Load validation data and add example predictions
dataf = create_numerframe(val_save_path)
example_model = ExamplePredictionsModel()
predictions = example_model.predict(dataf).get_prediction_data
assert predictions['prediction_example'].between(0, 1).all()
predictions.head(2)
```
4. Custom Model
There are two different ways to implement new models. Both have their own conveniences and use cases.
4.1. Inherit from BaseModel (custom prediction logic).
4.2. Inherit from DirectoryModel (make predictions for all models in a directory with a given file suffix. Prediction logic is already implemented; only implement model loading logic).
4.1. (From BaseModel) works well when you have no model files, or only a single file that you use for generating predictions. Examples:
1. Loading a model is not relevant or your model is already loaded in memory.
2. You would like predictions for one model loaded from disk.
3. The object you are loading already aggregates multiple models and transformation steps (such as a scikit-learn FeatureUnion).
4.2. (From DirectoryModel) is convenient when you have a lot of similar models in a directory and want to generate predictions for all of them. Examples:
1. You have multiple similar models saved through a cross-validation process.
2. You have a bagging strategy where many models trained on slightly different data or with different initializations are averaged.
4.1. From BaseModel
Arbitrary models can be instantiated and used for prediction generation by inheriting from BaseModel. Arbitrary logic (model loading, prediction, etc.) can be defined in .predict, as long as the method takes a NumerFrame as input and outputs a NumerFrame.
For clear console output we recommend adding the @display_processor_info decorator to the .predict method.
If your model does not involve reading files from disk, specify model_directory="".
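As a sketch of the in-memory case, something like the following could work. This is illustrative only, not the library's code: InMemoryModel, AlreadyTrained, and the toy pandas frame (standing in for a NumerFrame) are all assumptions, and the real BaseModel adds more (directory handling, display output).

```python
import pandas as pd

class InMemoryModel:
    """Illustrative BaseModel-style wrapper for a model already in memory."""
    def __init__(self, model, model_directory: str = "", model_name: str = None):
        # No files are read from disk, hence model_directory="".
        self.model = model
        self.model_name = model_name or "in_memory"

    def predict(self, dataf: pd.DataFrame) -> pd.DataFrame:
        # Takes a frame as input and returns a frame with a new prediction column.
        feature_cols = [c for c in dataf.columns if c.startswith("feature")]
        dataf[f"prediction_{self.model_name}"] = self.model.predict(dataf[feature_cols])
        return dataf

class AlreadyTrained:
    """Stand-in for e.g. a fitted scikit-learn estimator."""
    def predict(self, X):
        return X.mean(axis=1)

dataf = pd.DataFrame({"feature_a": [0.2, 0.4], "feature_b": [0.6, 0.8]})
out = InMemoryModel(AlreadyTrained(), model_directory="", model_name="mem").predict(dataf)
print(out["prediction_mem"].tolist())
```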
AwesomeModel
AwesomeModel (model_directory:str, model_name:str=None, feature_cols:list=None)
TEMPLATE - Predict with arbitrary prediction logic and model formats.
:param model_directory: Main directory from which to read in models.
:param model_name: Name that will be used to define column names and for display purposes.
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.
4.2. From DirectoryModel
You may want to implement a setup similar to JoblibModel and CatBoostModel: load all models of a certain type from a directory, predict with each, and take the average. If this is your use case, inherit from DirectoryModel and be sure to implement the .load_models method.
For a DirectoryModel you should specify a file_suffix (like .joblib or .cbm), which will be used to store all available model paths in self.model_paths.
The .predict method will in this case already be implemented, but it can be overridden if the prediction logic is more complex, for example if you want to apply weighted averaging or a geometric mean for models within a given directory.
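An override of the kind just described might look like the following sketch. GeometricMeanDirectoryModel and the Scale toy models are assumptions made for illustration; a real subclass would load models from self.model_paths via .load_models instead of receiving them directly.

```python
import numpy as np

class GeometricMeanDirectoryModel:
    """Illustrative DirectoryModel-style class with overridden prediction logic."""
    def __init__(self, models):
        # Stand-in for the models a subclass would return from .load_models().
        self.models = models

    def predict(self, X):
        preds = np.stack([m.predict(X) for m in self.models])
        # Geometric mean across models (assumes strictly positive predictions),
        # instead of the default arithmetic mean.
        return np.exp(np.log(preds).mean(axis=0))

class Scale:
    """Toy model that scales its input."""
    def __init__(self, s):
        self.s = s
    def predict(self, X):
        return self.s * X

model = GeometricMeanDirectoryModel([Scale(0.5), Scale(2.0)])
result = model.predict(np.array([1.0, 4.0]))
print(result)  # the 0.5x and 2.0x scalings cancel in the geometric mean
```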
AwesomeDirectoryModel
AwesomeDirectoryModel (model_directory:str, model_name:str=None, feature_cols:list=None)
TEMPLATE - Load in all models of arbitrary file format and predict for all.
:param model_directory: Main directory from which to read in models.
:param model_name: Name that will be used to define column names and for display purposes.
:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.