Model

Generating predictions for Numerai on preprocessed data.

0. Base

0.1. BaseModel

The BaseModel is an abstract base class that handles directory logic and naming conventions. All models should inherit from BaseModel and be sure to implement the .predict method.

In general, models are loaded from disk. However, if your model involves no model files, pass an empty string ("") as the model_directory argument.

Note that a new prediction column will have the column name prediction_{MODEL_NAME}.


source

BaseModel

 BaseModel (model_directory:str, model_name:str=None)

Setup for model prediction on a Dataset.

:param model_directory: Main directory from which to read in models.

:param model_name: Name that will be used to create column names and for display purposes.

0.2. DirectoryModel

A DirectoryModel assumes that you have a directory of models and you want to load + predict for all models with a certain file_suffix (for example, .joblib, .cbm or .lgb). This base class handles prediction logic for this situation.

If you are thinking of implementing your own model and your use case involves reading multiple models from a directory, then you should inherit from DirectoryModel and be sure to implement .load_models. You then don’t have to implement any prediction logic in the .predict method.

When inheriting from DirectoryModel the only mandatory method implementation is for .load_models. It should instantiate all models and return them as a list.
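The loading step can be sketched as a standalone function (illustrative only, using pickled models; this is not the library's implementation, and inside an actual DirectoryModel subclass the matching file paths are already collected for you in self.model_paths):

```python
import pickle
from pathlib import Path

def load_models(model_directory: str, file_suffix: str = ".pkl") -> list:
    """Sketch of what .load_models should do: instantiate every model
    with the given file_suffix and return them as a list."""
    model_paths = sorted(Path(model_directory).glob(f"*{file_suffix}"))
    models = []
    for path in model_paths:
        with open(path, "rb") as file:
            models.append(pickle.load(file))
    return models
```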


source

DirectoryModel

 DirectoryModel (model_directory:str, file_suffix:str,
                 model_name:str=None, feature_cols:list=None,
                 combine_preds=True)

Base class implementation where predictions are averaged out from a directory of models. Walks through every file with given file_suffix in a directory.

:param model_directory: Main directory from which to read in models.

:param file_suffix: File format to load (For example, .joblib, .pkl, .cbm or .lgb)

:param model_name: Name that will be used to create column names and for display purposes.

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.

:param combine_preds: Whether to average predictions along the column axis. Only relevant for multi-target models. Convenient when you want to predict the main target by averaging a multi-target model.
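As an illustration of what combine_preds does (a NumPy sketch with made-up numbers, not the library's code):

```python
import numpy as np

# Hypothetical multi-target output: four rows, three targets.
multi_target_preds = np.array([[0.2, 0.4, 0.6],
                               [0.1, 0.3, 0.5],
                               [0.8, 0.6, 0.4],
                               [0.5, 0.5, 0.5]])

# combine_preds=True averages along the column (target) axis,
# yielding a single prediction per row.
combined = multi_target_preds.mean(axis=1)
```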

1. Single model formats

Implementations for common Numerai model prediction situations.

1.1. SingleModel

In many cases you just want to load a single model file and create predictions for that model. SingleModel supports this.

This class supports multiple model formats for easy use. All models should have a .predict method. Currently, the .joblib, .cbm, .pkl, .pickle and .h5 (Keras) formats are supported.

Things to keep in mind:
- This model will use all available features in the NumerFrame for prediction by default. Define feature_cols in SingleModel or implement a FeatureSelectionPreProcessor as part of your ModelPipeline if you are using a subset of features.
- If you have XGBoost models we recommend saving them as .joblib.
- The added prediction column will have the column name prediction_{MODEL_NAME} if 1 target is predicted. For multiple targets the new column names will be prediction_{MODEL_NAME}_{i} for each target number i (starting with 0).
- We welcome the Numerai community to extend SingleModel for more file formats. See the Contributing section in README.md for more information on contributing.


source

SingleModel

 SingleModel (model_file_path:str, model_name:str=None,
              combine_preds=False, autoencoder_mlp=False,
              feature_cols:list=None)

Load single model from file and perform prediction logic.

:param model_file_path: Full path to model file.

:param model_name: Name that will be used to create column names and for display purposes.

:param combine_preds: Whether to average predictions along column axis. Only relevant for multi target models. Convenient when you want to predict the main target by averaging a multi-target model.

:param autoencoder_mlp: Whether your model is an autoencoder + MLP model. The third element of the tuple output is taken in this case. Only relevant for NN models. More info on autoencoders: https://forum.numer.ai/t/autoencoder-and-multitask-mlp-on-new-dataset-from-kaggle-jane-street/4338

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.

dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
test_paths = ["test_assets/joblib_v2_example_model.joblib"]
for path in test_paths:
    model = SingleModel(path, model_name="test")
    print(model.predict(dataf).get_prediction_data.head(2))
model = SingleModel(test_paths[0], model_name="test")
model.suffix_to_model_mapping

1.2. WandbKerasModel

This model is for a specific use case: you are logging Keras models with Weights & Biases and want to download the best model for a specific run. WandbKerasModel wraps SingleModel and only adds logic for downloading models from Weights & Biases.

To authenticate your W&B account you are given several options:
1. Run wandb login in a terminal and follow the instructions (docs).
2. Configure the global environment variable "WANDB_API_KEY".
3. Run wandb.init(project=PROJECT_NAME, entity=ENTITY_NAME) and pass the API key from https://wandb.ai/authorize.
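Option 2 can also be set from Python before instantiating the model (a sketch; the key value is a placeholder):

```python
import os

# Placeholder key -- replace with your own from https://wandb.ai/authorize.
os.environ["WANDB_API_KEY"] = "your-api-key-here"
```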


source

WandbKerasModel

 WandbKerasModel (run_path:str, file_name:str='model-best.h5',
                  combine_preds=False, autoencoder_mlp=False,
                  replace=False, feature_cols:list=None)

Download best .h5 model from Weights & Biases (W&B) run in local directory and make predictions. More info on W&B: https://wandb.ai/site

:param run_path: W&B path structured as entity/project/run_id. Can be copied from the Overview tab of a W&B run. For more info: https://docs.wandb.ai/ref/app/pages/run-page#overview-tab

:param file_name: Name of .h5 file as saved in W&B run. ‘model-best.h5’ by default. File name can be found under files tab of W&B run.

:param combine_preds: Whether to average predictions along column axis. Convenient when you want to predict the main target by averaging a multi-target model.

:param autoencoder_mlp: Whether your model is an autoencoder + MLP model. The third element of the tuple output is taken in this case. Only relevant for NN models.

More info on autoencoders: https://forum.numer.ai/t/autoencoder-and-multitask-mlp-on-new-dataset-from-kaggle-jane-street/4338

:param replace: Replace any model files saved under the same file name with downloaded W&B run model. WARNING: Setting to True may overwrite models in your local environment.

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.

1.3. ExternalCSVs

This model is a wrapper for adding predictions from external CSV files in a directory.


source

ExternalCSVs

 ExternalCSVs (data_directory:str='external_submissions')

Load external submissions and add to NumerFrame.

All CSV files in this directory will be added to the NumerFrame. Make sure all external predictions are prepared and ready for submission, i.e. the IDs line up and there is exactly one column named ‘prediction’.

:param data_directory: Directory path for retrieving external submission.
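A conforming CSV could be prepared like this (a sketch; the ids and prediction values are made up for illustration):

```python
import os
import pandas as pd

os.makedirs("external_submissions", exist_ok=True)

# Hypothetical external predictions: an 'id' column that lines up with
# the tournament data and exactly one column named 'prediction'.
external_preds = pd.DataFrame({
    "id": ["n0003aa52cab36c2", "n000920ed083903f", "n0038e640522c4a6"],
    "prediction": [0.45, 0.55, 0.50],
})
external_preds.to_csv("external_submissions/my_preds.csv", index=False)
```

Any such file placed in the directory will be picked up and added as a prediction column.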

For testing, the external_submissions directory contains test_predictions.csv with values from feature target_thomas_20.

dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
external = ExternalCSVs(data_directory="test_assets/external_submissions")
new_dataf = external.predict(dataf)
new_dataf.get_prediction_data.head(2)

If no submissions are found in the given data_directory you should receive a warning.

ExternalCSVs(data_directory="Some_nonexisting_directory_12354321");

1.4. NumerBay

This model is a wrapper for adding predictions from NumerBay purchases.

Currently only Numerai Classic submissions are supported. Numerai Signals will be supported in a future version.


source

NumerBayCSVs

 NumerBayCSVs (data_directory:str='numerbay_submissions',
               numerbay_product_full_names:list=None,
               numerbay_username:str=None, numerbay_password:str=None,
               numerbay_key_path:str=None,
               ticker_col:str='bloomberg_ticker')

Load NumerBay submissions and add to NumerFrame.

Make sure to provide correct NumerBay credentials and that your purchases have been confirmed and artifacts are available for download.

:param data_directory: Directory path for caching submissions. Files not already present in the directory will be downloaded from NumerBay.

:param numerbay_product_full_names: List of product full names (in the format of [category]-[product name]) to download from NumerBay. E.g. [‘numerai-predictions-numerbay’]

:param numerbay_username: NumerBay username.

:param numerbay_password: NumerBay password.

:param numerbay_key_path: NumerBay encryption key JSON file path (exported from the profile page).

Example usage

# nb_model = NumerBayCSVs(data_directory='/app/notebooks/tmp',
#                         numerbay_product_full_names=['numerai-predictions-someproduct'],
#                         numerbay_username="myusername",
#                         numerbay_password="mypassword",
#                         numerbay_key_path="/app/notebooks/tmp/numerbay.json")
# preds = nb_model.predict(dataf)

2. Loading all models in directory

2.1. Joblib directory

Many models, like scikit-learn, can conveniently be saved as .joblib files. This class automatically loads all .joblib files in a given folder and generates (averaged out) predictions.


source

JoblibModel

 JoblibModel (model_directory:str, model_name:str=None,
              feature_cols:list=None)

Load and predict for arbitrary models in directory saved as .joblib.

All loaded models should have a .predict method and accept the features present in the data.

:param model_directory: Main directory from which to read in models.

:param model_name: Name that will be used to create column names and for display purposes.

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.

dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
model = JoblibModel("test_assets", model_name="Joblib_LGB")
predictions = model.predict(dataf).get_prediction_data
assert predictions['prediction_Joblib_LGB'].between(0, 1).all()
predictions.head(2)

2.2. Catboost directory (.cbm)

This model setup loads all CatBoost (.cbm) models present in a given directory and makes (averaged out) predictions.


source

CatBoostModel

 CatBoostModel (model_directory:str, model_name:str=None,
                feature_cols:list=None)

Load and predict with all .cbm models (CatBoostRegressor) in directory.

:param model_directory: Main directory from which to read in models.

:param model_name: Name that will be used to define column names and for display purposes.

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.

# Example on NumerFrame
# dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
# model = CatBoostModel("test_assets", model_name="CB")
# predictions = model.predict(dataf).get_prediction_data
# assert predictions['prediction_CB'].between(0, 1).all()
# predictions.head(2)

2.3. LightGBM directory (.lgb)

This model setup loads all LightGBM (.lgb) models present in a given directory and makes (averaged out) predictions.


source

LGBMModel

 LGBMModel (model_directory:str, model_name:str=None,
            feature_cols:list=None)

Load and predict with all .lgb models (LightGBM) in directory.

:param model_directory: Main directory from which to read in models.

:param model_name: Name that will be used to define column names and for display purposes.

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.

dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
model = LGBMModel("test_assets", model_name="LGB")
predictions = model.predict(dataf).get_prediction_data
assert predictions['prediction_LGB'].between(0, 1).all()
predictions.head(2)

3. Baseline models

Setting a baseline is always an important step in data science problems. This section introduces models that should only be used as baselines.

3.1. ConstantModel

This model simply outputs a constant of your choice. Convenient for setting classification baselines.


source

ConstantModel

 ConstantModel (constant:float=0.5, model_name:str=None)

WARNING: Only use this Model for testing purposes.

Create constant prediction.

:param constant: Value for constant prediction.

:param model_name: Name that will be used to create column names and for display purposes.

constant = 0.85
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
constant_model = ConstantModel(constant=constant)
predictions = constant_model.predict(dataf).get_prediction_data
assert (predictions.to_numpy() == constant).all()
predictions.head(2)

3.2. RandomModel

This model returns uniformly distributed predictions in the range \([0, 1)\). A solid naive baseline for regression models.


source

RandomModel

 RandomModel (model_name:str=None)

WARNING: Only use this Model for testing purposes.

Create uniformly distributed predictions.

:param model_name: Name that will be used to create column names and for display purposes.

dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
random_model = RandomModel()
predictions = random_model.predict(dataf).get_prediction_data
assert predictions['prediction_random'].between(0, 1).all()
predictions.head(2)

3.3. Example (validation) predictions

This model downloads example predictions for Numerai Classic and adds them to the NumerFrame. Convenient when you are constructing a ModelPipeline and want to include example predictions.


source

ExamplePredictionsModel

 ExamplePredictionsModel (file_name:str='example_validation_predictions.parquet',
                          data_directory:str='example_predictions_model',
                          round_num:int=None)

Load example predictions and add to NumerFrame.

:param file_name: File to download from NumerAPI. ‘example_validation_predictions.parquet’ by default.

:param data_directory: Directory path to download example predictions to or directory where example data already exists.

:param round_num: Optional round number. Downloads most recent round by default.

# Download validation data
downloader = NumeraiClassicDownloader("example_predictions_model")
val_file = "numerai_validation_data.parquet"
val_save_path = f"{str(downloader.dir)}/{val_file}"
downloader.download_single_dataset(filename=val_file,
                                   dest_path=val_save_path)

# Load validation data and add example predictions
dataf = create_numerframe(val_save_path)
example_model = ExamplePredictionsModel()
predictions = example_model.predict(dataf).get_prediction_data
assert predictions['prediction_example'].between(0, 1).all()
predictions.head(2)

4. Custom Model

There are two different ways to implement new models. Both have their own conveniences and use cases.

4.1. Inherit from BaseModel (custom prediction logic).

4.2. Inherit from DirectoryModel (make predictions for all models in directory with given file suffix. Prediction logic will already be implemented. Only implement model loading logic).

4.1. (From BaseModel) works well when you have no or only a single file that you use for generating predictions.

Examples:
1. Loading a model is not relevant or your model is already loaded in memory.
2. You would like predictions for one model loaded from disk.
3. The object you are loading already aggregates multiple models and transformation steps (such as scikit-learn FeatureUnion).

4.2. (From DirectoryModel) is convenient when you have a lot of similar models in a directory and want to generate predictions for all of them.

Examples:
1. You have multiple similar models saved through a cross-validation process.
2. You have a bagging strategy where many models trained on slightly different data or with different initializations are averaged.

4.1. From BaseModel

Arbitrary models can be instantiated and used for prediction generation by inheriting from BaseModel. Arbitrary logic (model loading, prediction, etc.) can be defined in .predict as long as the method takes a NumerFrame as input and outputs a NumerFrame.

For clear console output we recommend adding the @display_processor_info decorator to the .predict method.

If your model does not involve reading files from disk, specify model_directory="".
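A minimal custom model might look like the sketch below. To keep it runnable on its own, a tiny stand-in for BaseModel is included; the prediction logic (row-wise mean of the feature columns) is made up purely for illustration:

```python
import pandas as pd

class BaseModelStub:
    """Stand-in for BaseModel so this sketch runs on its own.
    The real BaseModel additionally handles directory logic."""
    def __init__(self, model_directory: str, model_name: str = None):
        self.model_directory = model_directory
        self.model_name = model_name

class MeanFeatureModel(BaseModelStub):
    """Sketch: predict the row-wise mean of all feature columns
    (made-up logic, for illustration only)."""
    def predict(self, dataf: pd.DataFrame) -> pd.DataFrame:
        # .predict takes a (Numer)Frame and returns it with a new
        # prediction_{MODEL_NAME} column added.
        feature_cols = [col for col in dataf.columns if col.startswith("feature")]
        dataf[f"prediction_{self.model_name}"] = dataf[feature_cols].mean(axis=1)
        return dataf

# No model files on disk, so model_directory is an empty string.
model = MeanFeatureModel(model_directory="", model_name="mean")
```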


source

AwesomeModel

 AwesomeModel (model_directory:str, model_name:str=None,
               feature_cols:list=None)

TEMPLATE - Predict with arbitrary prediction logic and model formats.

:param model_directory: Main directory from which to read in models.

:param model_name: Name that will be used to define column names and for display purposes.

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.

4.2. From DirectoryModel

You may want to implement a setup similar to JoblibModel and CatBoostModel. Namely, load in all models of a certain type from a directory, predict for all and take the average. If this is your use case, inherit from DirectoryModel and be sure to implement the .load_models method.

For a DirectoryModel you should specify a file_suffix (like .joblib or .cbm) which will be used to store all available models in self.model_paths.

The .predict method will in this case already be implemented, but can be overridden if the prediction logic is more complex. For example, if you want to apply weighted averaging or a geometric mean for models within a given directory.
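For example, weighted averaging of the per-model outputs could be sketched as follows (illustrative only; model_preds stacks one column per model, and the weights are made up):

```python
import numpy as np

def weighted_average(model_preds: np.ndarray, weights: list) -> np.ndarray:
    """Sketch of custom combination logic for an overridden .predict:
    one column per model in model_preds; weights should sum to 1."""
    weights = np.asarray(weights)
    assert np.isclose(weights.sum(), 1.0), "weights should sum to 1"
    return model_preds @ weights

# Two rows of predictions from three models, weighted 50/30/20.
preds = np.array([[0.2, 0.4, 0.6],
                  [0.5, 0.5, 0.5]])
combined = weighted_average(preds, [0.5, 0.3, 0.2])
```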


source

AwesomeDirectoryModel

 AwesomeDirectoryModel (model_directory:str, model_name:str=None,
                        feature_cols:list=None)

TEMPLATE - Load in all models of arbitrary file format and predict for all.

:param model_directory: Main directory from which to read in models.

:param model_name: Name that will be used to define column names and for display purposes.

:param feature_cols: optional list of features to use for prediction. Selects all feature columns (i.e. column names with prefix ‘feature’) by default.