Common functionality for preprocessors and postprocessors.
0.2. Logging
We would like to keep an overview of which steps are done in a data pipeline and where processing bottlenecks occur. The decorator below will display for a given function/method:
1. When it has finished.
2. What the output shape of the data is.
3. How long it took to finish.
To use this functionality, simply add @display_processor_info as a decorator to the function/method you want to track.
We will use this decorator throughout the pipeline (preprocessing, model and postprocessing).
✅ Finished step TestDisplay. Output shape=(10, 1073). Time taken for step: 0:00:02.002106. ✅
| id | era | data_type | feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | feature_demisable_expiring_millepede | ... | target_paul_20 | target_paul_60 | target_george_20 | target_george_60 | target_william_20 | target_william_60 | target_arthur_20 | target_arthur_60 | target_thomas_20 | target_thomas_60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0297 | train | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.50 | 1.00 | 0.25 | ... | 0.00 | 0.50 | 0.25 | 0.50 | 0.000000 | 0.500000 | 0.166667 | 0.500000 | 0.333333 | 0.500000 |
| n9d39dea58c9e3cf | 0003 | train | 0.75 | 0.50 | 0.75 | 1.00 | 0.50 | 0.25 | 0.50 | 0.00 | ... | 0.50 | 0.75 | 0.50 | 0.50 | 0.666667 | 0.666667 | 0.500000 | 0.666667 | 0.500000 | 0.666667 |
| nb64f06d3a9fc9f1 | 0472 | train | 1.00 | 1.00 | 1.00 | 0.50 | 0.00 | 1.00 | 0.25 | 0.50 | ... | 0.00 | 0.25 | 0.50 | 0.50 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 |
| n1927b4862500882 | 0265 | train | 0.00 | 0.00 | 0.25 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | ... | 0.75 | 0.75 | 0.50 | 0.75 | 0.833333 | 0.833333 | 0.666667 | 0.833333 | 0.666667 | 0.666667 |
| nc3234b6eeacd6b7 | 0299 | train | 0.75 | 0.25 | 0.00 | 0.75 | 1.00 | 0.25 | 0.00 | 0.00 | ... | 0.25 | 0.50 | 0.50 | 0.50 | 0.166667 | 0.666667 | 0.333333 | 0.500000 | 0.500000 | 0.666667 |
| n1b41d583e12f051 | 0009 | train | 0.00 | 0.50 | 0.50 | 0.25 | 0.25 | 0.50 | 0.50 | 1.00 | ... | 0.50 | 0.25 | 0.50 | 0.00 | 0.500000 | 0.333333 | 0.500000 | 0.333333 | 0.500000 | 0.333333 |
| n116898cdc07d4e2 | 0013 | train | 0.50 | 1.00 | 1.00 | 0.75 | 0.00 | 1.00 | 0.50 | 0.75 | ... | 0.50 | 0.75 | 0.50 | 0.50 | 0.500000 | 0.666667 | 0.500000 | 0.666667 | 0.500000 | 0.666667 |
| nb0a7aef640025dc | 0232 | train | 0.25 | 0.25 | 0.50 | 0.00 | 1.00 | 0.00 | 0.50 | 0.00 | ... | 0.50 | 0.25 | 0.50 | 0.00 | 0.500000 | 0.166667 | 0.500000 | 0.000000 | 0.666667 | 0.166667 |
| n12466a161ab0a24 | 0092 | train | 0.50 | 0.75 | 1.00 | 0.50 | 0.25 | 0.25 | 0.25 | 0.75 | ... | 0.50 | 0.50 | 0.50 | 0.50 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 | 0.500000 |
| n40132f4765f9185 | 0270 | train | 0.50 | 0.50 | 0.00 | 0.50 | 0.75 | 0.50 | 0.00 | 0.25 | ... | 0.25 | 0.50 | 0.00 | 0.50 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.333333 | 0.500000 |
10 rows × 1073 columns
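As a rough illustration of the logging decorator described in this section, the behaviour could be sketched as follows. This is not the library's implementation: deriving the step name from `__qualname__` and reading the shape via `getattr` are assumptions made for the sketch, and the `TestDisplay` class here is only a stand-in.

```python
import time
from datetime import timedelta
from functools import wraps


def display_processor_info(func):
    """Sketch: report step name, output shape and duration after a call."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        result = func(*args, **kwargs)
        elapsed = timedelta(seconds=time.monotonic() - start)
        # Assumption: for a method, the part of __qualname__ before the last
        # dot is the class name; a plain function reports its own name.
        step = func.__qualname__.rsplit(".", 1)[0]
        # Objects without a .shape attribute (e.g. lists) report None.
        shape = getattr(result, "shape", None)
        print(
            f"✅ Finished step {step}. Output shape={shape}. "
            f"Time taken for step: {elapsed}. ✅"
        )
        return result

    return wrapper


class TestDisplay:
    """Stand-in processor used only to exercise the decorator."""

    @display_processor_info
    def transform(self, dataf):
        return dataf


output = TestDisplay().transform([1, 2, 3])
```

Because the wrapper returns the original result unchanged, the decorator can be added to or removed from a step without affecting the pipeline's output.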
1. Common preprocessing steps
This section implements commonly used preprocessing for Numerai. We invite the Numerai community to develop new preprocessors.
1.0 Tournament agnostic
Preprocessors that can be applied for both Numerai Classic and Numerai Signals.
1.0.1. CopyPreProcessor
The first and obvious preprocessor is copying, which is implemented as a default in ModelPipeline (Section 4) to avoid manipulation of the original DataFrame or NumerFrame that you load in.
The TargetSelectionPreProcessor is not relevant for an inference pipeline, but it is especially convenient for Numerai Classic training if you train on a subset of the available targets. It can also be applied to Signals if you are using engineered targets in your pipeline.
✅ Finished step TargetSelectionPreProcessor. Output shape=(10, 1055). Time taken for step: 0:00:00.022602. ✅
| id | target | target_nomi_20 | target_nomi_60 | feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | ... | feature_drawable_exhortative_dispersant | feature_metabolic_minded_armorist | feature_investigatory_inerasable_circumvallation | feature_centroclinal_incentive_lancelet | feature_unemotional_quietistic_chirper | feature_behaviorist_microbiological_farina | feature_lofty_acceptable_challenge | feature_coactive_prefatorial_lucy | era | data_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0.25 | 0.25 | 0.50 | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.50 | 1.0 | ... | 1.00 | 0.0 | 0.0 | 0.25 | 0.00 | 0.0 | 1.00 | 0.25 | 0297 | train |
| n9d39dea58c9e3cf | 0.50 | 0.50 | 0.75 | 0.75 | 0.50 | 0.75 | 1.00 | 0.50 | 0.25 | 0.5 | ... | 0.25 | 0.5 | 0.0 | 0.25 | 0.75 | 1.0 | 0.75 | 1.00 | 0003 | train |
2 rows × 1055 columns
1.0.4. ReduceMemoryProcessor
Numerai datasets can take up a lot of RAM and may put a strain on your compute environment.
For Numerai Classic, many of the feature and target columns can be downscaled to float16, or to int8 if you are using the Numerai int8 datasets. For Signals it depends on the features you are generating.
ReduceMemoryProcessor downscales the type of your numeric columns to reduce the memory footprint as much as possible.
Credits to kainsama and others for writing about memory usage reduction for Numerai data: https://forum.numer.ai/t/reducing-memory/313
:param deep_mem_inspect: Introspect the data deeply by interrogating object dtypes. Yields a more accurate representation of memory usage if you have complex object columns.
✅ Finished step ReduceMemoryProcessor. Output shape=(10, 1073). Time taken for step: 0:00:00.695928. ✅
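The downscaling idea can be illustrated with plain pandas. The sketch below is an assumption about how such a step might look, not the ReduceMemoryProcessor source: the function name `reduce_memory` and the rule "floats to float16, integers to the smallest integer type" are chosen for illustration, and float16 is only safe when the values fit its limited precision (as with Numerai's 0, 0.25, ..., 1 feature levels).

```python
import numpy as np
import pandas as pd


def reduce_memory(dataf: pd.DataFrame, deep_mem_inspect: bool = False) -> pd.DataFrame:
    """Sketch: downcast numeric columns and report the memory saved."""
    out = dataf.copy()
    before = out.memory_usage(deep=deep_mem_inspect).sum()
    for col in out.select_dtypes(include="number").columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that can hold the values (e.g. int8).
            out[col] = pd.to_numeric(out[col], downcast="integer")
        else:
            # Assumption: float16 precision is sufficient for these columns.
            out[col] = out[col].astype(np.float16)
    after = out.memory_usage(deep=deep_mem_inspect).sum()
    print(f"Memory usage reduced from {before} to {after} bytes.")
    return out


example = pd.DataFrame(
    {
        "feature_a": [0.0, 0.25, 0.5, 0.75, 1.0],
        "feature_int": [0, 1, 2, 3, 4],
        "data_type": ["train"] * 5,
    }
)
reduced = reduce_memory(example)
```

Non-numeric columns such as `data_type` are left untouched; only their reported size changes when `deep_mem_inspect=True`.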
1.0.6. UMAPFeatureGenerator
Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that we can utilize to generate new Numerai features. This processor uses umap-learn under the hood to model the manifold. The dimension of the input data will be reduced to n_components number of features.
Generate new Numerai features using UMAP. Uses umap-learn under the hood: https://pypi.org/project/umap-learn/
:param n_components: How many new features to generate.
:param n_neighbors: Number of neighboring points used in local approximations of manifold structure.
:param min_dist: How tightly the embedding is allowed to compress points together.
:param metric: Metric to measure distance in input space. Correlation by default.
:param feature_names: Selection of features used to perform UMAP on. All features by default.
*args, **kwargs will be passed to initialization of UMAP.
The new features will be named with the convention f"feature_umap_{i}".
umap_features = [f"feature_umap_{i}" for i in range(n_components)]
dataf[umap_features].head(3)
| id | feature_umap_0 | feature_umap_1 | feature_umap_2 |
| --- | --- | --- | --- |
| n559bd06a8861222 | 0.419718 | 0.607629 | 1.000000 |
| n9d39dea58c9e3cf | 0.794003 | 0.273073 | 0.963785 |
| nb64f06d3a9fc9f1 | 0.490861 | 1.000000 | 0.522663 |
1.1. Numerai Classic
The Numerai Classic dataset has a certain structure that you may not encounter in the Numerai Signals tournament. Therefore, this section has all preprocessors that can only be applied to Numerai Classic.
1.1.0 Numerai Classic: Version agnostic
Preprocessors that work for all Numerai Classic versions.
:param target_col: Column from which to create fake target.
:param feature_names: Selection of features used for Bayesian GMM. All features by default.
:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.
1.2. Numerai Signals
Preprocessors that are specific to Numerai Signals.
1.2.1. KatsuFeatureGenerator
Katsu1110 provides an excellent and fast feature engineering scheme in his Kaggle notebook on starting with Numerai Signals. It is surprisingly effective, fast and works well for modeling. This preprocessor is based on his feature engineering setup in that notebook.
Features generated:
1. MACD and MACD signal
2. RSI
3. Percentage rate of return
4. Volatility
5. MA (moving average) gap
Effective feature engineering setup based on Katsu’s starter notebook. Based on source by Katsu1110: https://www.kaggle.com/code1110/numeraisignals-starter-for-beginners
:param windows: Time interval to apply for window features:
Percentage Rate of change
Volatility
Moving Average gap
:param ticker_col: Column with tickers to iterate over.
:param close_col: Column name where you have closing price stored.
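To make the feature list above concrete, here is a minimal pandas sketch for a single ticker's closing prices. It is not the KatsuFeatureGenerator source: the column names, the 12/26/9 MACD spans, the 14-row simple-average RSI and the example windows are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def katsu_style_features(close: pd.Series, windows=(5, 10)) -> pd.DataFrame:
    """Sketch of the five feature families for one ticker's close prices."""
    feats = pd.DataFrame(index=close.index)
    # 1. MACD and MACD signal (conventional 12/26/9 EMA spans, assumed here).
    macd = close.ewm(span=12, adjust=False).mean() - close.ewm(span=26, adjust=False).mean()
    feats["feature_MACD"] = macd
    feats["feature_MACD_signal"] = macd.ewm(span=9, adjust=False).mean()
    # 2. RSI over 14 rows, using simple rolling averages of gains and losses.
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    feats["feature_RSI"] = 100 - 100 / (1 + avg_gain / avg_loss)
    for w in windows:
        # 3. Percentage rate of return over the window.
        feats[f"feature_close_ROCP_{w}"] = close.pct_change(w)
        # 4. Volatility: rolling standard deviation of one-row returns.
        feats[f"feature_close_VOL_{w}"] = close.pct_change().rolling(w).std()
        # 5. Moving average gap: price relative to its rolling mean.
        feats[f"feature_close_MA_gap_{w}"] = close / close.rolling(w).mean()
    return feats


rng = np.random.default_rng(0)
close = pd.Series(50 * np.exp(rng.normal(0, 0.01, 60).cumsum()))
feats = katsu_style_features(close)
```

For a multi-ticker frame, the same function would be applied per ticker, for example inside a `groupby` on the ticker column, so that windows never span two different stocks.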
Let’s create a simple synthetic dataset to test preprocessors on. Many preprocessors require at least ticker, date and close columns. More advanced feature engineering preprocessors should also have open, high, low and volume columns.
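A synthetic dataset of this kind could be generated as in the sketch below. The ticker names other than GHI.US, the price process and the offsets used for open, high and low are assumptions; the 3 tickers × 100 daily rows layout is chosen to match the 300-row outputs shown in this section.

```python
import numpy as np
import pandas as pd


def make_synthetic_signals_data(tickers=("ABC.US", "DEF.US", "GHI.US"), n_days=100, seed=0):
    """Sketch: one random-walk price series per ticker with OHLCV columns."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2020-01-01", periods=n_days, freq="D")
    frames = []
    for ticker in tickers:
        # Exponentiated random walk keeps prices strictly positive.
        close = 50 * np.exp(rng.normal(0, 0.01, n_days).cumsum())
        frames.append(
            pd.DataFrame(
                {
                    "ticker": ticker,
                    "date": dates,
                    "open": close - 0.01,
                    "high": close + 0.02,
                    "low": close - 0.02,
                    "close": close,
                    "volume": rng.integers(1_000, 10_000, n_days),
                }
            )
        )
    return pd.concat(frames, ignore_index=True)


synthetic = make_synthetic_signals_data()
```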
Numerai Signals’ objective is predicting a ranking of equities. Therefore, we can benefit from creating rankings out of the features. Doing this reduces noise and works as a normalization mechanism for your features. EraQuantileProcessor bins features in a given number of quantiles for each era in the dataset.
Transform features into quantiles on a per-era basis
:param num_quantiles: Number of buckets to split data into.
:param era_col: Era column name in the dataframe to perform each transformation.
:param features: All features that you want quantized. All feature cols by default.
:param num_cores: CPU cores to allocate for quantile transforming. All available cores by default.
:param random_state: Seed for QuantileTransformer.
:param batch_size: How many features to process at the same time. For Numerai Signals scale data it is advisable to process features one by one. This is the default setting.
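The per-era binning idea can be sketched with pandas alone. Note that this swaps in `pd.qcut` on per-era ranks instead of the QuantileTransformer the docstring refers to, so it illustrates the concept rather than the processor's actual code; the function name and the output column suffix are hypothetical.

```python
import numpy as np
import pandas as pd


def era_quantile_bins(dataf, features, num_quantiles=5, era_col="era"):
    """Sketch: bucket each feature into equally sized quantiles within each era."""
    out = dataf.copy()
    for feat in features:
        out[f"{feat}_quantile{num_quantiles}"] = out.groupby(era_col)[feat].transform(
            # Ranking with method="first" breaks ties, so qcut always finds
            # unique bin edges; labels=False returns bucket indices 0..n-1.
            lambda s: pd.qcut(s.rank(method="first"), num_quantiles, labels=False)
        )
    return out


rng = np.random.default_rng(0)
example = pd.DataFrame(
    {"era": ["0001"] * 10 + ["0002"] * 10, "feature_a": rng.normal(size=20)}
)
binned = era_quantile_bins(example, ["feature_a"], num_quantiles=5)
```

Because the buckets are computed inside each era, a value's bucket only reflects its rank among the stocks of the same era, which is what makes the result comparable across eras.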
✅ Finished step TickerMapper. Output shape=(5, 2). Time taken for step: 0:00:00.005146. ✅
|  | bloomberg_ticker | signals_ticker |
| --- | --- | --- |
| 0 | LLB SW | LLB.SW |
| 1 | DRAK NA | DRAK.AS |
| 2 | SWB MK | 5211.KLSE |
| 3 | ELEKTRA* MF | ELEKTRA.MX |
| 4 | NOT_A_TICKER | NaN |
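The output above is what a dictionary lookup on a ticker column produces. The sketch below reproduces it with `Series.map`; the hard-coded mapping only contains the four pairs shown in the table and is an assumption for illustration, since a real mapping would come from a complete ticker map.

```python
import pandas as pd

# Hypothetical mini-mapping built from the pairs in the example output above.
BLOOMBERG_TO_SIGNALS = {
    "LLB SW": "LLB.SW",
    "DRAK NA": "DRAK.AS",
    "SWB MK": "5211.KLSE",
    "ELEKTRA* MF": "ELEKTRA.MX",
}

tickers = pd.DataFrame(
    {"bloomberg_ticker": ["LLB SW", "DRAK NA", "SWB MK", "ELEKTRA* MF", "NOT_A_TICKER"]}
)
# Tickers missing from the mapping become NaN instead of raising an error.
tickers["signals_ticker"] = tickers["bloomberg_ticker"].map(BLOOMBERG_TO_SIGNALS)
```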
1.2.4. SignalsTargetProcessor
Numerai provides targets for 5000 stocks that are neutralized against all sorts of factors. However, it can be helpful to experiment with creating your own targets. You might want to explore different windows, different target binning and/or neutralization. SignalsTargetProcessor engineers 3 different targets for every given window:
- _raw: Raw return based on price movements.
- _rank: Ranks of raw return.
- _group: Binned returns based on rank.
Note that Numerai provides targets based on 4-day returns and 20-day returns. While you can explore any window you like, it makes sense to start with windows close to these timeframes.
For the bins argument there are also many options. The following binnings are commonly used:
- Nomi bins: [0, 0.05, 0.25, 0.75, 0.95, 1]
- Uniform bins: [0, 0.20, 0.40, 0.60, 0.80, 1]
✅ Finished step SignalsTargetProcessor. Output shape=(300, 27). Time taken for step: 0:00:00.334536. ✅
|  | target_10d_raw | target_10d_rank | target_10d_group |target_20d_raw | target_20d_rank | target_20d_group |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.026874 | 1.0 | 1.0 | 0.039152 | 1.0 | 1.0 |
| 1 | 0.021059 | 1.0 | 1.0 | 0.018829 | 1.0 | 1.0 |
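The three target variants could be derived roughly as follows. This is a sketch under stated assumptions, not the SignalsTargetProcessor source: the raw return is assumed to be a forward return over `window` rows per ticker, ranks are assumed to be percentile ranks per date, and the group labels 0, 0.25, 0.5, 0.75, 1 are hypothetical values assigned to the five Nomi bins.

```python
import numpy as np
import pandas as pd

NOMI_BINS = [0, 0.05, 0.25, 0.75, 0.95, 1]
GROUP_LABELS = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical labels, one per bin


def add_window_targets(dataf, window, bins=NOMI_BINS, labels=GROUP_LABELS):
    """Sketch: add _raw, _rank and _group targets for one window."""
    out = dataf.sort_values(["ticker", "date"]).copy()
    prefix = f"target_{window}d"
    # _raw: forward return per ticker (assumption about the return direction).
    future_close = out.groupby("ticker")["close"].shift(-window)
    out[f"{prefix}_raw"] = future_close / out["close"] - 1
    # _rank: percentile rank of the raw return among all tickers on a date.
    out[f"{prefix}_rank"] = out.groupby("date")[f"{prefix}_raw"].rank(pct=True, method="first")
    # _group: the rank binned with the given edges.
    out[f"{prefix}_group"] = pd.cut(
        out[f"{prefix}_rank"], bins=bins, labels=labels, include_lowest=True
    ).astype(float)
    return out


rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=30, freq="D")
n_tickers = 20
prices = pd.DataFrame(
    {
        "ticker": np.repeat([f"T{i:02d}" for i in range(n_tickers)], len(dates)),
        "date": np.tile(dates, n_tickers),
        "close": 50 * np.exp(rng.normal(0, 0.01, n_tickers * len(dates)).cumsum()),
    }
)
targets = add_window_targets(prices, window=10)
```

The last `window` rows of every ticker have no future price and therefore receive NaN for all three targets.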
1.2.5. LagPreProcessor
Many models like Gradient Boosting Machines (GBMs) don’t learn any time-series patterns by themselves. However, if we create lags of our features, the models can pick up on time dependencies between features. LagPreProcessor creates lag features for given features and windows.
✅ Finished step LagPreProcessor. Output shape=(300, 16). Time taken for step: 0:00:00.036771. ✅
All lag features will contain lag in the column name.
dataf.get_pattern_data("lag").tail(2)
|  | close_lag5 | close_lag10 | close_lag15 | close_lag20 | volume_lag5 | volume_lag10 | volume_lag15 | volume_lag20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 298 | 47.939367 | 46.472799 | 47.846745 | 46.300246 | 4102.0 | 3544.0 | 7182.0 | 7843.0 |
| 299 | 47.322248 | 46.709073 | 47.398402 | 46.820937 | 9846.0 | 6945.0 | 5197.0 | 5164.0 |
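Lag features of this shape can be produced with a grouped shift. The following sketch assumes the `{feature}_lag{window}` naming seen in the output above and a `ticker` column to group on; the function name `add_lags` is hypothetical and the code is not the LagPreProcessor source.

```python
import pandas as pd


def add_lags(dataf, feature_names, windows, ticker_col="ticker"):
    """Sketch: add one lagged copy of each feature per window, per ticker."""
    out = dataf.copy()
    grouped = out.groupby(ticker_col)
    for feat in feature_names:
        for w in windows:
            # Shifting within each ticker prevents values leaking across stocks.
            out[f"{feat}_lag{w}"] = grouped[feat].shift(w)
    return out


prices = pd.DataFrame(
    {
        "ticker": ["A"] * 4 + ["B"] * 4,
        "close": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    }
)
lagged = add_lags(prices, ["close"], windows=[1, 2])
```

The first `w` rows of every ticker have no earlier observation, so their lag columns are NaN.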
1.2.6. DifferencePreProcessor
After creating lags with the LagPreProcessor, it may be useful to create new features that calculate the difference between those lags. Through this process in DifferencePreProcessor, we can provide models with more time-series related patterns.
Add difference features based on given windows. Run LagPreProcessor first.
:param windows: All lag windows to process for all features.
:param feature_names: All features for which you want to create differences. All features that also have lags by default.
:param pct_change: Method to calculate differences. If True, will calculate differences with a percentage change. Otherwise calculates a simple difference. Defaults to False
:param abs_diff: Whether to also calculate the absolute value of all differences. Defaults to True
✅ Finished step DifferencePreProcessor. Output shape=(300, 24). Time taken for step: 0:00:00.047873. ✅
All difference features will contain diff in the column name.
dataf.get_pattern_data("diff").tail(2)
|  | close_diff5 | close_diff10 | close_diff15 | close_diff20 | volume_diff5 | volume_diff10 | volume_diff15 | volume_diff20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 298 | -0.019599 | 0.011340 | -0.017701 | 0.015109 | 0.226719 | 0.419865 | -0.299360 | -0.358409 |
| 299 | -0.014802 | -0.001868 | -0.016384 | -0.004253 | -0.279301 | 0.021742 | 0.365403 | 0.374129 |
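The values above are consistent with a percentage change between the current value and its lag: for row 298, a close of 46.999789 against close_lag5 of 47.939367 gives 46.999789 / 47.939367 - 1 ≈ -0.019599. A minimal sketch of both difference modes follows; it is not the DifferencePreProcessor source, and the `_abs_diff` column name is a hypothetical choice.

```python
import pandas as pd


def add_differences(dataf, feature_names, windows, pct_change=False, abs_diff=True):
    """Sketch: difference between each feature and its existing lag columns."""
    out = dataf.copy()
    for feat in feature_names:
        for w in windows:
            lag = out[f"{feat}_lag{w}"]  # must already exist (run the lag step first)
            diff = out[feat] / lag - 1 if pct_change else out[feat] - lag
            out[f"{feat}_diff{w}"] = diff
            if abs_diff:
                # Hypothetical column name for the absolute difference.
                out[f"{feat}_abs_diff{w}"] = diff.abs()
    return out


# Close and lag values for rows 298 and 299, taken from the outputs in this section.
rows = pd.DataFrame(
    {"close": [46.999789, 46.621805], "close_lag5": [47.939367, 47.322248]},
    index=[298, 299],
)
diffs = add_differences(rows, ["close"], windows=[5], pct_change=True)
```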
1.2.7. PandasTaFeatureGenerator
This generator takes in a pandas-ta strategy and processes it on multiple cores. There is a simple default strategy available with RSI features for 14 and 60 rows.
✅ Finished step PandasTaFeatureGenerator. Output shape=(300, 10). Time taken for step: 0:00:00.943370. ✅
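|  | ticker | date | open | high | low | close | volume | friday_date | feature_RSI_14 | feature_RSI_60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 298 | GHI.US | 2020-04-08 | 46.949789 | 47.019789 | 46.989789 | 46.999789 | 5032 | 2020-04-08 | 51.987534 | 52.353205 |
| 299 | GHI.US | 2020-04-09 | 46.571805 | 46.641805 | 46.611805 | 46.621805 | 7096 | 2020-04-09 | 48.727812 | 51.555874 |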
The feature data can be selected directly through a NumerFrame convenience method called .get_feature_data.
new_pta_df.get_feature_data.tail(2)
|  | feature_RSI_14 | feature_RSI_60 |
| --- | --- | --- |
| 298 | 51.987534 | 52.353205 |
| 299 | 48.727812 | 51.555874 |
A custom pandas-ta strategy can be defined as follows. Check the pandas-ta docs for more information on available indicators and arguments.
ta takes in a list of dictionaries defining indicators and optional additional arguments. We use col_names for convenience so features are prefixed by feature_ and can be easily retrieved within a NumerFrame.
✅ Finished step PandasTaFeatureGenerator. Output shape=(300, 10). Time taken for step: 0:00:00.965338. ✅
|  | feature_CMO | feature_RSI_60 |
| --- | --- | --- |
| 295 | 11.802106 | 53.305259 |
| 296 | 9.111138 | 52.965671 |
| 297 | 8.799414 | 52.927768 |
| 298 | 3.975068 | 52.353205 |
| 299 | -2.544377 | 51.555874 |
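A strategy that would yield the two columns shown above (feature_CMO and feature_RSI_60) could be defined along these lines. This is a sketch, not the original notebook cell: the strategy name and description are hypothetical, and how the strategy is passed to PandasTaFeatureGenerator is not shown in this section, so it is left out.

```python
# Indicator definitions: one dictionary per indicator. "kind" selects the
# pandas-ta indicator, other keys are its arguments, and "col_names" renames
# the output so it carries the feature_ prefix.
custom_indicators = [
    {"kind": "cmo", "col_names": ("feature_CMO",)},
    {"kind": "rsi", "length": 60, "col_names": ("feature_RSI_60",)},
]


def build_strategy():
    """Wrap the indicator list in a pandas-ta Strategy (requires pandas-ta)."""
    import pandas_ta as ta  # imported lazily; only needed when building the strategy

    return ta.Strategy(
        name="custom_sketch",  # hypothetical name
        description="CMO and 60-row RSI",  # hypothetical description
        ta=custom_indicators,
    )
```

Keeping the indicator list as plain data makes it easy to inspect or extend before the strategy object is built.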
2. Custom preprocessors
There is an almost unlimited number of ways to preprocess data (selection, engineering and manipulation). We have only scratched the surface with the preprocessors currently implemented. We invite the Numerai community to develop Numerai Classic and Numerai Signals preprocessors.
A new Preprocessor should inherit from BaseProcessor and implement a transform method. For efficient implementation, we recommend you use NumerFrame functionality for preprocessing. You can also support Pandas DataFrame input as long as the transform method returns a NumerFrame. This ensures that the Preprocessor still works within a full numerai-blocks pipeline. A template for new preprocessors is given below.
To enable fancy logging output, add the @display_processor_info decorator to the transform method.
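The overall shape of a custom preprocessor might look like the sketch below. To keep it self-contained, it uses a locally defined stand-in base class rather than the library's BaseProcessor, and it returns a plain pandas DataFrame where a real preprocessor would return a NumerFrame; the class names and the example feature are hypothetical.

```python
from abc import ABC, abstractmethod

import pandas as pd


class BaseProcessorSketch(ABC):
    """Local stand-in for the library's BaseProcessor (assumption: subclasses
    only need to implement `transform`)."""

    @abstractmethod
    def transform(self, dataf: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        ...

    def __call__(self, dataf: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        return self.transform(dataf, *args, **kwargs)


class FeatureMeanPreProcessor(BaseProcessorSketch):
    """Hypothetical preprocessor: adds the row-wise mean of all feature columns."""

    def __init__(self, new_col: str = "feature_row_mean"):
        self.new_col = new_col

    # In a real pipeline, @display_processor_info would decorate this method.
    def transform(self, dataf: pd.DataFrame) -> pd.DataFrame:
        out = dataf.copy()  # never mutate the caller's frame
        feature_cols = [c for c in out.columns if c.startswith("feature_")]
        out[self.new_col] = out[feature_cols].mean(axis=1)
        # A library-compatible version would wrap `out` in a NumerFrame here.
        return out


example = pd.DataFrame(
    {"feature_a": [0.0, 0.5], "feature_b": [1.0, 0.5], "era": ["0001", "0001"]}
)
processed = FeatureMeanPreProcessor()(example)
```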