```python
# Random DataFrame
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["date"] = [0, 1, 2, 3] * 25
test_dataf = NumerFrame(df)
```
Postprocessing

Overview

The postprocessing procedure is similar to preprocessing: preprocessors manipulate and/or add feature columns, while postprocessors manipulate and/or add prediction columns.

Every postprocessor should inherit from BasePostProcessor. A postprocessor should take a NumerFrame as input and output a NumerFrame. One or more new prediction column(s) with the prefix prediction are added or manipulated in a postprocessor.
0. BasePostProcessor

Some characteristics are particular to postprocessors but do not belong in the Processor base class. This functionality is implemented in BasePostProcessor.

BasePostProcessor

BasePostProcessor (final_col_name:str)

Base class for postprocessing objects.

Postprocessors manipulate or introduce new prediction columns in a NumerFrame.
1. Common postprocessing steps
We invite the Numerai community to develop new postprocessors so that everyone can benefit from new insights and research. This section implements commonly used postprocessing for Numerai.
1.0. Tournament agnostic
Postprocessing that works for both Numerai Classic and Numerai Signals.
1.0.1. Standardization
Standardizing is an essential step in order to reliably combine Numerai predictions. It is a default postprocessor for ModelPipeline.
Standardizer
Standardizer (cols:list=None)
Uniform standardization of prediction columns. All prediction columns should only contain values in the range [0, 1].
:param cols: All prediction columns that should be standardized. Use all prediction columns by default.
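A uniform standardization of predictions is commonly implemented as a percentile rank within each era. The sketch below illustrates the idea with plain pandas; the helper name and `era_col` argument are assumptions for illustration, not necessarily the exact implementation used by Standardizer.

```python
import pandas as pd

def uniform_standardize(dataf: pd.DataFrame, cols: list, era_col: str = "era") -> pd.DataFrame:
    """Rank each prediction column within its era and scale ranks to (0, 1]."""
    out = dataf.copy()
    # pct=True turns ranks into percentiles; method="first" breaks ties
    # by order of appearance so the output stays uniform.
    out[cols] = out.groupby(era_col)[cols].rank(pct=True, method="first")
    return out
```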
```python
std = Standardizer()
std.transform(test_dataf).get_prediction_data.head(2)
```
✅ Finished step Standardizer. Output shape=(100, 7). Time taken for step: 0:00:00.004771. ✅
|   | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E |
|---|---|---|---|---|---|
| 0 | 0.12 | 0.40 | 0.04 | 0.72 | 0.04 |
| 1 | 0.92 | 0.48 | 0.96 | 0.92 | 0.72 |
1.0.2. Ensembling
Multiple prediction results can be ensembled in multiple ways. We provide the most common use cases here.
1.0.2.1. Simple Mean
MeanEnsembler
MeanEnsembler (final_col_name:str, cols:list=None, standardize:bool=False)
Take simple mean of multiple cols and store in new col.
:param final_col_name: Name of new averaged column. final_col_name should start with “prediction”.
:param cols: Column names to average.
:param standardize: Whether to standardize by era before averaging. Highly recommended as columns that are averaged may have different distributions.
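The averaging logic can be sketched in a few lines of pandas. This is an illustrative stand-in for MeanEnsembler (the helper name and `era_col` argument are assumptions), showing why per-era standardization before averaging matters when the columns have different distributions.

```python
import pandas as pd

def mean_ensemble(dataf: pd.DataFrame, cols: list, final_col_name: str,
                  standardize: bool = False, era_col: str = "era") -> pd.DataFrame:
    """Store the simple mean of `cols` in `final_col_name` (illustrative sketch)."""
    out = dataf.copy()
    data = out[cols]
    if standardize:
        # Percentile-rank per era first so columns on different
        # scales contribute equally to the average.
        data = out.groupby(era_col)[cols].rank(pct=True)
    out[final_col_name] = data.mean(axis=1)
    return out
```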
1.0.2.2. Donate’s formula
This method of weighted averaging is mostly suitable when you have multiple models trained with a time series cross-validation scheme. The first models are trained on less data, so we give them a lower weight than the later models.
Source: Yirun Zhang in his winning solution for the Jane Street 2021 Kaggle competition. Based on a paper by Donate et al.
DonateWeightedEnsembler
DonateWeightedEnsembler (final_col_name:str, cols:list=None)
Weighted average as per Donate et al.'s formula.
Paper: https://doi.org/10.1016/j.neucom.2012.02.053
Code source: https://www.kaggle.com/gogo827jz/jane-street-supervised-autoencoder-mlp
Weightings for 5 folds: [0.0625, 0.0625, 0.125, 0.25, 0.5]
:param cols: Prediction columns to ensemble. Uses all prediction columns by default.
:param final_col_name: New column name for ensembled values.
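The fold weights follow a simple geometric scheme: each later fold gets double the weight of the previous one, and the first two folds share the smallest weight so the total sums to 1. A small helper (hypothetical name; the ensembler computes these internally) reproduces them for any number of folds:

```python
def donate_weights(n_folds: int) -> list:
    """Donate et al. weighting: w_i = 1 / 2**(n + 1 - i) for i >= 2,
    and the first fold reuses the smallest weight so the sum is 1."""
    assert n_folds >= 2
    tail = [1 / 2 ** (n_folds + 1 - i) for i in range(2, n_folds + 1)]
    return [1 / 2 ** (n_folds - 1)] + tail
```

For example, `donate_weights(5)` gives `[0.0625, 0.0625, 0.125, 0.25, 0.5]`, matching the 5-fold weightings listed above.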
```python
# Random DataFrame
#| include: false
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["era"] = range(100)
test_dataf = NumerFrame(df)
```
For 5 folds, the weightings are [0.0625, 0.0625, 0.125, 0.25, 0.5].
```python
w_5_fold = [0.0625, 0.0625, 0.125, 0.25, 0.5]
donate = DonateWeightedEnsembler(
    cols=test_dataf.prediction_cols, final_col_name="prediction"
)
ensembled = donate(test_dataf).get_prediction_data
assert ensembled["prediction"][0] == np.sum(
    [w * elem for w, elem in zip(w_5_fold, ensembled[test_features].iloc[0])]
)
ensembled.head(2)
```
🍲 Ensembled '['prediction_A', 'prediction_B', 'prediction_C', 'prediction_D', 'prediction_E']' with DonateWeightedEnsembler and saved in 'prediction' 🍲
✅ Finished step DonateWeightedEnsembler. Output shape=(100, 8). Time taken for step: 0:00:00.004307. ✅
|   | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E | prediction |
|---|---|---|---|---|---|---|
| 0 | 0.809680 | 0.821740 | 0.673158 | 0.130708 | 0.946340 | 0.691955 |
| 1 | 0.665325 | 0.402088 | 0.454365 | 0.820944 | 0.091936 | 0.374713 |
1.0.2.3. Geometric Mean
Take the geometric mean of multiple prediction columns, i.e. the n-th root of the product of their values.

More info on the geometric mean: Wikipedia, Investopedia.
GeometricMeanEnsembler
GeometricMeanEnsembler (final_col_name:str, cols:list=None)
Calculate the weighted Geometric mean.
:param cols: Prediction columns to ensemble. Uses all prediction columns by default.
:param final_col_name: New column name for ensembled values.
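The row-wise geometric mean is typically computed in log space for numerical stability. A minimal NumPy sketch of the computation (an illustration; the actual GeometricMeanEnsembler may differ in details such as ranking the result):

```python
import numpy as np

def geometric_mean_rows(values: np.ndarray) -> np.ndarray:
    """Row-wise geometric mean via log-space averaging, which avoids
    underflow from multiplying many values in (0, 1]."""
    return np.exp(np.log(values).mean(axis=1))
```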
```python
geo_mean = GeometricMeanEnsembler(final_col_name="prediction_geo")
ensembled = geo_mean(test_dataf).get_prediction_data
ensembled.head(2)
```
🍲 Ensembled '['prediction_A', 'prediction_B', 'prediction_C', 'prediction_D', 'prediction_E']' with GeometricMeanEnsembler and saved in 'prediction_geo' 🍲
✅ Finished step GeometricMeanEnsembler. Output shape=(100, 9). Time taken for step: 0:00:00.031692. ✅
|   | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E | prediction | prediction_geo |
|---|---|---|---|---|---|---|---|
| 0 | 0.809680 | 0.821740 | 0.673158 | 0.130708 | 0.946340 | 0.691955 | 0.560664 |
| 1 | 0.665325 | 0.402088 | 0.454365 | 0.820944 | 0.091936 | 0.374713 | 0.391302 |
1.0.3. Neutralization and penalization
1.0.3.1. Feature Neutralization
Classic feature neutralization (subtracting a linear model from the prediction scores).

The new column name for neutralized values will be {pred_name}_neutralized_{PROPORTION}. pred_name should start with 'prediction'.

Optionally, you can run feature neutralization on the GPU using cupy by setting cuda=True. Make sure you have cupy installed with the correct CUDA Toolkit version. More information: docs.cupy.dev/en/stable/install.html

Detailed explanation of Feature Neutralization by Katsu1110
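Conceptually, classic neutralization computes the least-squares projection of the predictions onto the features, subtracts a proportion of it, and rescales. A minimal NumPy sketch of the idea (illustrative; not necessarily the library's exact implementation, which also works per era):

```python
import numpy as np

def neutralize(predictions: np.ndarray, features: np.ndarray,
               proportion: float = 0.5) -> np.ndarray:
    """Subtract `proportion` of the linear component of `predictions`
    explained by `features`, then rescale to unit standard deviation."""
    # Least-squares fit of the predictions on the features via pseudo-inverse
    exposures = features @ (np.linalg.pinv(features) @ predictions)
    neutralized = predictions - proportion * exposures
    return neutralized / neutralized.std()
```

With proportion=1.0 the result is orthogonal to every feature column; with proportion=0.0 only the rescaling is applied.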
FeatureNeutralizer
FeatureNeutralizer (feature_names:list=None, pred_name:str='prediction', proportion:float=0.5, suffix:str=None, cuda=False)
Classic feature neutralization by subtracting linear model.
:param feature_names: List of column names to neutralize against. Uses all feature columns by default.
:param pred_name: Prediction column to neutralize.
:param proportion: Number in range [0…1] indicating how much to neutralize.
:param suffix: Optional suffix that is added to new column name.
:param cuda: Do neutralization on the GPU
Make sure you have CuPy installed when setting cuda to True.
Installation docs: docs.cupy.dev/en/stable/install.html
```python
testv1_dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
testv1_dataf.loc[:, "prediction"] = np.random.uniform(size=len(testv1_dataf))
testv1_dataf.head(2)
```
|   | id | era | data_type | feature_intelligence1 | feature_intelligence2 | feature_intelligence3 | feature_intelligence4 | feature_intelligence5 | feature_intelligence6 | feature_intelligence7 | ... | feature_wisdom39 | feature_wisdom40 | feature_wisdom41 | feature_wisdom42 | feature_wisdom43 | feature_wisdom44 | feature_wisdom45 | feature_wisdom46 | target | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | n000315175b67977 | era1 | train | 0.0 | 0.5 | 0.25 | 0.00 | 0.5 | 0.25 | 0.25 | ... | 1.0 | 0.75 | 0.5 | 0.75 | 0.50 | 1.0 | 0.50 | 0.75 | 0.50 | 0.612091 |
| 1 | n0014af834a96cdd | era1 | train | 0.0 | 0.0 | 0.00 | 0.25 | 0.5 | 0.00 | 0.00 | ... | 1.0 | 0.00 | 0.0 | 0.75 | 0.25 | 0.0 | 0.25 | 1.00 | 0.25 | 0.636917 |

2 rows × 315 columns
```python
ft = FeatureNeutralizer(
    feature_names=test_dataf.feature_cols, pred_name="prediction", proportion=0.8
)
new_dataf = ft.transform(test_dataf)
```
🤖 Neutralized 'prediction' with proportion '0.8' 🤖
New neutralized column = 'prediction_neutralized_0.8'.
✅ Finished step FeatureNeutralizer. Output shape=(100, 10). Time taken for step: 0:00:00.455949. ✅
```python
assert "prediction_neutralized_0.8" in new_dataf.prediction_cols
assert 0.0 in new_dataf.get_prediction_data["prediction_neutralized_0.8"]
assert 1.0 in new_dataf.get_prediction_data["prediction_neutralized_0.8"]
```
Generated columns and data can be easily retrieved from the NumerFrame.

```python
new_dataf.prediction_cols
```
['prediction_A',
'prediction_B',
'prediction_C',
'prediction_D',
'prediction_E',
'prediction',
'prediction_geo',
'prediction_neutralized_0.8']
```python
new_dataf.get_prediction_data.head(3)
```
|   | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E | prediction | prediction_geo | prediction_neutralized_0.8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.809680 | 0.821740 | 0.673158 | 0.130708 | 0.946340 | 0.691955 | 0.560664 | NaN |
| 1 | 0.665325 | 0.402088 | 0.454365 | 0.820944 | 0.091936 | 0.374713 | 0.391302 | NaN |
| 2 | 0.404700 | 0.519101 | 0.104269 | 0.781825 | 0.263947 | 0.398201 | 0.339651 | NaN |
1.0.3.2. Feature Penalization
FeaturePenalizer
FeaturePenalizer (max_exposure:float, feature_names:list=None, pred_name:str='prediction', suffix:str=None)
Feature penalization with TensorFlow.
Source (by jrb): https://github.com/jonrtaylor/twitch/blob/master/FE_Clipping_Script.ipynb
Source of first PyTorch implementation (by Michael Oliver / mdo): https://forum.numer.ai/t/model-diagnostics-feature-exposure/899/12
:param feature_names: List of column names to reduce feature exposure. Uses all feature columns by default.
:param pred_name: Prediction column to neutralize.
:param max_exposure: Number in range [0…1] indicating how much to reduce max feature exposure to.
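Feature exposure is the correlation between a prediction column and an individual feature, and max_exposure caps the largest such correlation. A small helper to measure it (hypothetical name, for illustration):

```python
import numpy as np

def max_feature_exposure(predictions: np.ndarray, features: np.ndarray) -> float:
    """Largest absolute Pearson correlation between the predictions
    and any single feature column."""
    corrs = [np.corrcoef(predictions, features[:, i])[0, 1]
             for i in range(features.shape[1])]
    return float(np.max(np.abs(corrs)))
```

Measuring this before and after running FeaturePenalizer is a simple way to check that exposure has been reduced to at most max_exposure.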
1.1. Numerai Classic
Postprocessing steps that are specific to Numerai Classic.

```python
# 1.1.
# No Numerai Classic specific postprocessors implemented yet.
```
1.2. Numerai Signals
Postprocessors that are specific to Numerai Signals.
```python
# 1.2.
# No Numerai Signals specific postprocessors implemented yet.
```
2. Custom PostProcessors
As with preprocessors, there are an almost unlimited number of ways to postprocess data. We (once again) invite the Numerai community to develop Numerai Classic and Signals postprocessors.
A new postprocessor should inherit from BasePostProcessor and implement a transform method. The transform method should take a NumerFrame as input and return a NumerFrame object as output. A template for this is given below.

To enable fancy logging output, add the @display_processor_info decorator to the transform method.
AwesomePostProcessor
AwesomePostProcessor (final_col_name:str, *args, **kwargs)
TEMPLATE - Do some awesome postprocessing.
:param final_col_name: Column name to store manipulated or ensembled predictions in.
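To make the template concrete, here is a sketch of what a custom postprocessor could look like. It follows the contract described above (implement transform, prefix new columns with prediction), but uses a plain pandas DataFrame as a stand-in for NumerFrame so the snippet is self-contained; the class name and clipping bounds are hypothetical.

```python
import pandas as pd

class ClipPostProcessor:
    """Sketch of a custom postprocessor: clip a prediction column into
    [lower, upper] and store the result in a new prediction column.
    (In real usage this would inherit from BasePostProcessor, decorate
    transform with @display_processor_info, and operate on a NumerFrame.)"""

    def __init__(self, final_col_name: str, pred_name: str = "prediction",
                 lower: float = 0.05, upper: float = 0.95):
        assert final_col_name.startswith("prediction")
        self.final_col_name = final_col_name
        self.pred_name = pred_name
        self.lower, self.upper = lower, upper

    def transform(self, dataf: pd.DataFrame) -> pd.DataFrame:
        dataf[self.final_col_name] = dataf[self.pred_name].clip(self.lower, self.upper)
        return dataf
```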