Preprocessors
NumerBlox offers a suite of preprocessors for Numerai-specific data transformations. All preprocessors are compatible with scikit-learn pipelines and follow a similar API. Note that some preprocessors require an additional eras or tickers argument in the transform step.
Numerai Classic
GroupStatsPreProcessor
The v4.2 (rain) dataset for Numerai Classic reintroduced feature groups. The GroupStatsPreProcessor calculates group statistics for all data groups. It uses predefined feature group mappings to generate statistical measures (mean, standard deviation, skew) for each of the feature groups.
Example
Here's how you can use the GroupStatsPreProcessor:
from numerblox.preprocessing import GroupStatsPreProcessor
group_processor = GroupStatsPreProcessor(groups=['intelligence'])
# Return features with group statistics for the 'intelligence' group
features = group_processor.transform(X)
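To make the idea of group statistics concrete, here is a minimal pandas sketch of what such a transformation computes. It is not NumerBlox's implementation: the column names, the `groups` mapping, the output column naming, and the choice to take statistics row-wise across the columns of a group are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data with three features in an "intelligence" group.
X = pd.DataFrame({
    "feature_intelligence_a": [0.0, 0.5, 1.0, 0.25],
    "feature_intelligence_b": [1.0, 0.5, 0.0, 0.75],
    "feature_intelligence_c": [0.5, 0.5, 0.5, 0.5],
})

# Hypothetical group mapping (group name -> columns belonging to that group).
groups = {"intelligence": list(X.columns)}


def group_stats(df: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """Compute row-wise mean, standard deviation and skew per feature group."""
    out = pd.DataFrame(index=df.index)
    for name, cols in groups.items():
        out[f"feature_{name}_mean"] = df[cols].mean(axis=1)
        out[f"feature_{name}_std"] = df[cols].std(axis=1)
        out[f"feature_{name}_skew"] = df[cols].skew(axis=1)
    return out


stats = group_stats(X, groups)
```

Each group contributes three new columns, so the output width grows with the number of groups rather than with the number of raw features.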
Numerai Signals
ReduceMemoryProcessor
The ReduceMemoryProcessor reduces the memory usage of the data as much as possible. It is particularly useful for Numerai Signals datasets, which can be quite large.
Note that modern Numerai Classic data (v4.2+) is already stored in int8 format, so this processor will not be useful for Numerai Classic.
from numerblox.preprocessing import ReduceMemoryProcessor
processor = ReduceMemoryProcessor(deep_mem_inspect=True, verbose=True)
reduced_data = processor.fit_transform(dataf)
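For intuition about where the savings come from, the following is a rough sketch of numeric downcasting with plain pandas. It is an illustration of the general technique, not the logic ReduceMemoryProcessor uses internally; the helper name `downcast_numeric` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where numeric columns use the smallest dtype that fits."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical toy data stored in 64-bit dtypes.
dataf = pd.DataFrame({
    "volume": np.array([100, 200, 300], dtype="int64"),
    "close": np.array([10.5, 11.0, 9.75], dtype="float64"),
})

reduced = downcast_numeric(dataf)
before = dataf.memory_usage(deep=True).sum()
after = reduced.memory_usage(deep=True).sum()
```

Comparing `before` and `after` shows the effect on a given DataFrame. Downcasting floats to 32 bits loses precision, so it is worth checking that this is acceptable for your features.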
KatsuFeatureGenerator
KatsuFeatureGenerator performs feature engineering based on Katsu's starter notebook. This is useful for those participating in the Numerai Signals contest.
You can specify custom windows that indicate how many days to look back when generating features.
from numerblox.preprocessing import KatsuFeatureGenerator
feature_gen = KatsuFeatureGenerator(windows=[7, 14, 21])
enhanced_data = feature_gen.fit_transform(dataf)
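The role of the windows argument can be illustrated with a small pandas sketch of look-back features computed per ticker. This is only an illustration of window-based feature engineering: the two features shown (rate of change and gap to the moving average), the column names `ticker` and `close`, and the output naming are assumptions, not the exact feature set that KatsuFeatureGenerator produces.

```python
import pandas as pd


def window_features(df: pd.DataFrame, windows: list) -> pd.DataFrame:
    """Add per-ticker look-back features for each window (in rows/days)."""
    out = df.copy()
    close_by_ticker = out.groupby("ticker")["close"]
    for w in windows:
        # Relative price change compared to w rows earlier for the same ticker.
        out[f"close_roc_{w}"] = out["close"] / close_by_ticker.shift(w) - 1
        # Ratio of the current close to its w-row moving average.
        moving_avg = close_by_ticker.transform(lambda s: s.rolling(w).mean())
        out[f"close_ma_gap_{w}"] = out["close"] / moving_avg
    return out


# Hypothetical toy data, already sorted by date within each ticker.
dataf = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "close": [100.0, 110.0, 121.0, 50.0, 55.0, 60.5],
})

enhanced = window_features(dataf, windows=[2])
```

Rows that do not yet have a full window of history come out as NaN, which is why longer windows need more data per ticker.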
EraQuantileProcessor
EraQuantileProcessor transforms features into quantiles by era. This can help normalize data and make patterns more distinguishable. Quantiling operations are parallelized across features for faster processing.
Using .transform requires passing era_series. The quantiles are calculated per era, so the processor needs that information along with the raw input features.
from numerblox.preprocessing import EraQuantileProcessor
eq_processor = EraQuantileProcessor(num_quantiles=50, random_state=42)
transformed_data = eq_processor.fit_transform(X, era_series=eras_series)
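Why the era information is required becomes clear in a small sketch of per-era quantiling. The version below uses rank-based binning in pandas and is only an approximation of the idea; the function name `quantile_by_era` and the toy data are hypothetical, and NumerBlox may compute its quantiles differently.

```python
import pandas as pd


def quantile_by_era(X: pd.DataFrame, era_series: pd.Series, num_quantiles: int = 4) -> pd.DataFrame:
    """Bin each feature into equal-sized quantiles within every era, scaled to [0, 1]."""
    grouped = X.groupby(era_series)
    ranks = grouped.rank(method="first")   # 1..n within each era
    counts = grouped.transform("count")    # n for each row's era
    bins = ((ranks - 1) * num_quantiles) // counts  # 0..num_quantiles-1
    return bins / (num_quantiles - 1)


# Hypothetical toy data: era2 values are much smaller than era1 values.
X = pd.DataFrame({"f": [10.0, 30.0, 20.0, 40.0, 5.0, 1.0, 3.0, 2.0]})
eras_series = pd.Series(["era1"] * 4 + ["era2"] * 4, name="era")

quantiled = quantile_by_era(X, eras_series, num_quantiles=4)
```

The value 5.0 is small compared to everything in era1, yet it lands in the top quantile because it is the largest value within era2. Without the era labels this per-era ranking would not be possible.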
LagPreProcessor
LagPreProcessor generates lag features based on specified windows. Lag features can capture temporal patterns in time-series data.
Note that LagPreProcessor needs a ticker_series in the .transform step.
from numerblox.preprocessing import LagPreProcessor
lag_processor = LagPreProcessor(windows=[5, 10, 20])
lag_processor.fit(X)
lagged_data = lag_processor.transform(X, ticker_series=tickers_series)
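The need for ticker information can be seen in a minimal pandas sketch of lagging: values must be shifted within each ticker so that one ticker's history never leaks into another's. The helper `add_lags` and the `_lag{window}` column naming are assumptions for illustration and may not match NumerBlox's output exactly.

```python
import pandas as pd


def add_lags(X: pd.DataFrame, ticker_series: pd.Series, windows: list) -> pd.DataFrame:
    """Shift every feature by each window, separately for each ticker."""
    grouped = X.groupby(ticker_series)
    lagged_frames = []
    for window in windows:
        shifted = grouped.shift(window)
        shifted.columns = [f"{col}_lag{window}" for col in X.columns]
        lagged_frames.append(shifted)
    return pd.concat(lagged_frames, axis=1)


# Hypothetical toy data, sorted by date within each ticker.
X = pd.DataFrame({"close": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})
tickers_series = pd.Series(["A", "A", "A", "B", "B", "B"], name="ticker")

lagged = add_lags(X, tickers_series, windows=[1])
```

The first row of ticker B is NaN rather than the last value of ticker A, which is the behaviour a plain `X.shift(1)` would get wrong.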
DifferencePreProcessor
DifferencePreProcessor computes the difference between features and their lags. It's used after LagPreProcessor.
WARNING: DifferencePreProcessor works only on pd.DataFrame and with columns that are generated in LagPreProcessor. If you are using these in a Pipeline, make sure LagPreProcessor is defined before DifferencePreProcessor and that the output API is set to Pandas (pipeline.set_output(transform="pandas")).
Note that LagPreProcessor needs a ticker_series in the .transform step, so a pipeline with both preprocessors will need a tickers argument in .transform.
from sklearn.pipeline import make_pipeline
from numerblox.preprocessing import DifferencePreProcessor, LagPreProcessor
lag = LagPreProcessor(windows=[5, 10])
diff = DifferencePreProcessor(windows=[5, 10], pct_diff=True)
pipe = make_pipeline(lag, diff)
pipe.set_output(transform="pandas")
pipe.fit(X)
diff_data = pipe.transform(X, ticker_series=tickers_series)
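As a rough picture of what the difference step computes, the sketch below derives absolute and percentage differences from a feature and an existing lag column. The helper `add_differences`, the `_lag`, `_diff` and `_pctdiff` column suffixes, and the percentage formula are assumptions for illustration, not a description of NumerBlox internals.

```python
import pandas as pd


def add_differences(df: pd.DataFrame, feature: str, windows: list, pct_diff: bool = False) -> pd.DataFrame:
    """Compute feature minus lag (and optionally the relative change) per window."""
    out = pd.DataFrame(index=df.index)
    for window in windows:
        lag_col = f"{feature}_lag{window}"  # must already exist in df
        out[f"{feature}_diff{window}"] = df[feature] - df[lag_col]
        if pct_diff:
            out[f"{feature}_pctdiff{window}"] = df[feature] / df[lag_col] - 1
    return out


# Hypothetical toy data that already contains a lag column.
dataf = pd.DataFrame({"close": [110.0, 121.0], "close_lag5": [100.0, 110.0]})

diffs = add_differences(dataf, "close", windows=[5], pct_diff=True)
```

Because the lag columns are looked up by name, this kind of step depends on labeled DataFrame input, which is consistent with the requirement to set the pipeline output to pandas.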
PandasTaFeatureGenerator
PandasTaFeatureGenerator uses the pandas-ta library to generate technical analysis features. It's a powerful tool for those interested in financial time-series data.
Make sure you have pandas-ta installed before using this feature generator:
!pip install pandas-ta
Currently PandasTaFeatureGenerator only works on pd.DataFrame input. Its input is a DataFrame with columns [ticker, date, open, high, low, close, volume].
from numerblox.preprocessing import PandasTaFeatureGenerator
ta_gen = PandasTaFeatureGenerator()
ta_features = ta_gen.transform(dataf)
MinimumDataFilter
MinimumDataFilter filters out dates and tickers that don't have enough data. For example, it makes sense to filter out tickers for which you have fewer than 100 days of data. Dates that have fewer than 100 unique tickers can also be filtered out.
Additionally, you can specify a list of tickers to blacklist and exclude from your data.
NOTE: This step only works with DataFrame input.
from numerblox.preprocessing import MinimumDataFilter
min_data_filter = MinimumDataFilter(min_samples_date=200, min_samples_ticker=1200, blacklist_tickers=["SOMETICKER.BLA"])
filtered_data = min_data_filter.fit_transform(dataf)
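The filtering rules described above can be sketched in a few lines of pandas. This is a simplified illustration, not the NumerBlox implementation: the function name `filter_minimum_data`, the `date` and `ticker` column names, and the order in which the filters are applied are assumptions.

```python
import pandas as pd


def filter_minimum_data(df, min_samples_date, min_samples_ticker, blacklist_tickers=()):
    """Drop blacklisted tickers, sparse dates, then tickers with too little history."""
    out = df[~df["ticker"].isin(list(blacklist_tickers))]
    # Keep dates that have enough unique tickers.
    tickers_per_date = out.groupby("date")["ticker"].transform("nunique")
    out = out[tickers_per_date >= min_samples_date]
    # Keep tickers that have enough unique dates.
    dates_per_ticker = out.groupby("ticker")["date"].transform("nunique")
    out = out[dates_per_ticker >= min_samples_ticker]
    return out.reset_index(drop=True)


# Hypothetical toy data: C has one day of history, d4 has a single ticker.
dataf = pd.DataFrame({
    "ticker": ["A", "A", "A", "A", "B", "B", "B", "C", "BAD", "BAD", "BAD"],
    "date": ["d1", "d2", "d3", "d4", "d1", "d2", "d3", "d1", "d1", "d2", "d3"],
})

filtered = filter_minimum_data(
    dataf, min_samples_date=2, min_samples_ticker=2, blacklist_tickers=["BAD"]
)
```

In this toy example only tickers A and B on dates d1 to d3 survive. The thresholds are deliberately tiny here; real Signals data would use values closer to those in the example above.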
Rolling your own preprocessor
We invite the community to contribute their own preprocessors to NumerBlox. If you have a preprocessor that you think would be useful to others, please open a PR with your code and tests. The new preprocessor should adhere to scikit-learn conventions. Here are some of the most important things to keep in mind, followed by a template.
- Make sure that your preprocessor inherits from numerblox.preprocessing.base.BasePreProcessor. This will automatically implement a blank fit method. Your preprocessor will then also inherit from sklearn.base.TransformerMixin and sklearn.base.BaseEstimator.
- Make sure your preprocessor implements a transform method that can take an np.array or pd.DataFrame as input and outputs an np.array. If your preprocessor can only work with pd.DataFrame input, mention this explicitly in the docstring.
- Implement a get_feature_names_out method so it can support pd.DataFrame output with valid column names.
import numpy as np
import pandas as pd
from typing import Union
from sklearn.utils.validation import check_array, check_is_fitted
from numerblox.preprocessing.base import BasePreProcessor

class MyAwesomePreProcessor(BasePreProcessor):
    def __init__(self, random_state: int = 0):
        super().__init__()
        # If you introduce additional arguments be sure to add them as attributes.
        self.random_state = random_state

    def fit(self, X: Union[np.ndarray, pd.DataFrame], y=None):
        # Fitted attributes (with a trailing underscore) can be set for later use.
        self.n_cols_ = X.shape[1]
        return self

    def transform(self, X: Union[np.ndarray, pd.DataFrame]) -> np.ndarray:
        # Do your preprocessing here.
        # Can involve additional checks.
        check_is_fitted(self)
        # check_array validates X alone; check_X_y would also require y.
        X = check_array(X)
        return X

    def get_feature_names_out(self, input_features=None) -> list:
        # Return a list of feature names.
        # If you are not using pandas output, you can skip this method.
        check_is_fitted(self)
        return [f"awesome_output_feature_{i}" for i in range(self.n_cols_)]