Preprocessing

Feature/target selection, engineering and manipulation.

0. Base

These objects will provide a base for all pre- and post-processing functionality and log relevant information.

0.1. BaseProcessor

BaseProcessor defines common functionality for preprocessing and postprocessing (Section 5).

Every Preprocessor should inherit from BaseProcessor and implement the .transform method.


source

BaseProcessor

 BaseProcessor ()

Common functionality for preprocessors and postprocessors.
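
In code, the contract looks roughly like the sketch below. This is a minimal illustration of the interface described above, not the library's actual implementation; the class name MyProcessor is hypothetical. Note that processors can also be invoked directly as callables, as with umap_gen(dataf) and lpp(dataf) later in this section.

# Minimal sketch of the processor contract (illustrative only).
class MyProcessor(BaseProcessor):
    def transform(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        # Manipulate the data here and return a NumerFrame.
        return NumerFrame(dataf)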

0.2. Logging

We would like to keep an overview of which steps are done in a data pipeline and where processing bottlenecks occur. The decorator below will display for a given function/method:

1. When it has finished.
2. What the output shape of the data is.
3. How long it took to finish.

To use this functionality, simply add @display_processor_info as a decorator to the function/method you want to track.

We will use this decorator throughout the pipeline (preprocessing, model and postprocessing).

Inspiration for this decorator: Calmcode Pandas Pipe Logs


source

display_processor_info

 display_processor_info (func)

Fancy console output for data processing.
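
For illustration, a dummy step along the lines of the sketch below would produce output like the log shown underneath it. The TestDisplay name matches that log; the body of the class, including the two-second sleep, is an assumption for demonstration purposes.

import time

# Hypothetical dummy step to demonstrate the decorator. The sleep exists
# only so that the logged duration is clearly visible.
class TestDisplay(BaseProcessor):
    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        time.sleep(2)
        return dataf

dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
TestDisplay().transform(dataf)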

✅ Finished step TestDisplay. Output shape=(10, 1073). Time taken for step: 0:00:02.002106. ✅
era data_type feature_dichasial_hammier_spawner feature_rheumy_epistemic_prancer feature_pert_performative_hormuz feature_hillier_unpitied_theobromine feature_perigean_bewitching_thruster feature_renegade_undomestic_milord feature_koranic_rude_corf feature_demisable_expiring_millepede ... target_paul_20 target_paul_60 target_george_20 target_george_60 target_william_20 target_william_60 target_arthur_20 target_arthur_60 target_thomas_20 target_thomas_60
id
n559bd06a8861222 0297 train 0.25 0.75 0.25 0.75 0.25 0.50 1.00 0.25 ... 0.00 0.50 0.25 0.50 0.000000 0.500000 0.166667 0.500000 0.333333 0.500000
n9d39dea58c9e3cf 0003 train 0.75 0.50 0.75 1.00 0.50 0.25 0.50 0.00 ... 0.50 0.75 0.50 0.50 0.666667 0.666667 0.500000 0.666667 0.500000 0.666667
nb64f06d3a9fc9f1 0472 train 1.00 1.00 1.00 0.50 0.00 1.00 0.25 0.50 ... 0.00 0.25 0.50 0.50 0.333333 0.333333 0.333333 0.333333 0.333333 0.333333
n1927b4862500882 0265 train 0.00 0.00 0.25 0.00 1.00 0.00 0.00 0.00 ... 0.75 0.75 0.50 0.75 0.833333 0.833333 0.666667 0.833333 0.666667 0.666667
nc3234b6eeacd6b7 0299 train 0.75 0.25 0.00 0.75 1.00 0.25 0.00 0.00 ... 0.25 0.50 0.50 0.50 0.166667 0.666667 0.333333 0.500000 0.500000 0.666667
n1b41d583e12f051 0009 train 0.00 0.50 0.50 0.25 0.25 0.50 0.50 1.00 ... 0.50 0.25 0.50 0.00 0.500000 0.333333 0.500000 0.333333 0.500000 0.333333
n116898cdc07d4e2 0013 train 0.50 1.00 1.00 0.75 0.00 1.00 0.50 0.75 ... 0.50 0.75 0.50 0.50 0.500000 0.666667 0.500000 0.666667 0.500000 0.666667
nb0a7aef640025dc 0232 train 0.25 0.25 0.50 0.00 1.00 0.00 0.50 0.00 ... 0.50 0.25 0.50 0.00 0.500000 0.166667 0.500000 0.000000 0.666667 0.166667
n12466a161ab0a24 0092 train 0.50 0.75 1.00 0.50 0.25 0.25 0.25 0.75 ... 0.50 0.50 0.50 0.50 0.500000 0.500000 0.500000 0.500000 0.500000 0.500000
n40132f4765f9185 0270 train 0.50 0.50 0.00 0.50 0.75 0.50 0.00 0.25 ... 0.25 0.50 0.00 0.50 0.333333 0.333333 0.333333 0.333333 0.333333 0.500000

10 rows × 1073 columns

1. Common preprocessing steps

This section implements commonly used preprocessing for Numerai. We invite the Numerai community to develop new preprocessors.

1.0. Tournament agnostic

Preprocessors that can be applied for both Numerai Classic and Numerai Signals.

1.0.1. CopyPreProcessor

The first and obvious preprocessor is copying, which is implemented as a default in ModelPipeline (Section 4) to avoid manipulation of the original DataFrame or NumerFrame that you load in.


source

CopyPreProcessor

 CopyPreProcessor ()

Copy DataFrame to avoid manipulation of original DataFrame.

dataset = create_numerframe(
    "test_assets/mini_numerai_version_2_data.parquet"
)
copied_dataset = CopyPreProcessor().transform(dataset)
assert np.array_equal(copied_dataset.values, dataset.values)
assert dataset.meta == copied_dataset.meta
✅ Finished step CopyPreProcessor. Output shape=(10, 1073). Time taken for step: 0:00:00.019088. ✅

1.0.2. FeatureSelectionPreProcessor

FeatureSelectionPreProcessor keeps all features that you pass, plus all other columns that are not features.


source

FeatureSelectionPreProcessor

 FeatureSelectionPreProcessor (feature_cols:Union[str,list])

Keep only the given features plus all target, prediction and aux columns.

selected_dataset = FeatureSelectionPreProcessor(
    feature_cols=["feature_dichasial_hammier_spawner"]
).transform(dataset)

assert selected_dataset.get_feature_data.shape[1] == 1
assert dataset.meta == selected_dataset.meta
✅ Finished step FeatureSelectionPreProcessor. Output shape=(10, 24). Time taken for step: 0:00:00.001614. ✅
selected_dataset.head(2)
feature_dichasial_hammier_spawner target target_nomi_20 target_nomi_60 target_jerome_20 target_jerome_60 target_janet_20 target_janet_60 target_ben_20 target_ben_60 ... target_george_20 target_george_60 target_william_20 target_william_60 target_arthur_20 target_arthur_60 target_thomas_20 target_thomas_60 era data_type
id
n559bd06a8861222 0.25 0.25 0.25 0.50 0.0 0.50 0.5 0.5 0.25 0.5 ... 0.25 0.5 0.000000 0.500000 0.166667 0.500000 0.333333 0.500000 0297 train
n9d39dea58c9e3cf 0.75 0.50 0.50 0.75 0.5 0.75 0.5 0.5 0.50 0.5 ... 0.50 0.5 0.666667 0.666667 0.500000 0.666667 0.500000 0.666667 0003 train

2 rows × 24 columns

1.0.3. TargetSelectionPreProcessor

TargetSelectionPreProcessor keeps all targets that you pass, plus all other columns that are not targets.

Not relevant for an inference pipeline, but especially convenient for Numerai Classic training if you train on a subset of the available targets. Can also be applied to Signals if you are using engineered targets in your pipeline.


source

TargetSelectionPreProcessor

 TargetSelectionPreProcessor (target_cols:Union[str,list])

Keep only the given targets plus all feature, prediction and aux columns.

dataset = create_numerframe(
    "test_assets/mini_numerai_version_2_data.parquet"
)
target_cols = ["target", "target_nomi_20", "target_nomi_60"]
selected_dataset = TargetSelectionPreProcessor(target_cols=target_cols).transform(
    dataset
)
assert selected_dataset.get_target_data.shape[1] == len(target_cols)
selected_dataset.head(2)
✅ Finished step TargetSelectionPreProcessor. Output shape=(10, 1055). Time taken for step: 0:00:00.022602. ✅
target target_nomi_20 target_nomi_60 feature_dichasial_hammier_spawner feature_rheumy_epistemic_prancer feature_pert_performative_hormuz feature_hillier_unpitied_theobromine feature_perigean_bewitching_thruster feature_renegade_undomestic_milord feature_koranic_rude_corf ... feature_drawable_exhortative_dispersant feature_metabolic_minded_armorist feature_investigatory_inerasable_circumvallation feature_centroclinal_incentive_lancelet feature_unemotional_quietistic_chirper feature_behaviorist_microbiological_farina feature_lofty_acceptable_challenge feature_coactive_prefatorial_lucy era data_type
id
n559bd06a8861222 0.25 0.25 0.50 0.25 0.75 0.25 0.75 0.25 0.50 1.0 ... 1.00 0.0 0.0 0.25 0.00 0.0 1.00 0.25 0297 train
n9d39dea58c9e3cf 0.50 0.50 0.75 0.75 0.50 0.75 1.00 0.50 0.25 0.5 ... 0.25 0.5 0.0 0.25 0.75 1.0 0.75 1.00 0003 train

2 rows × 1055 columns

1.0.4. ReduceMemoryProcessor

Numerai datasets can take up a lot of RAM and may put a strain on your compute environment.

For Numerai Classic, many of the feature and target columns can be downscaled to float16, or to int8 if you are using the Numerai int8 datasets. For Signals it depends on the features you are generating.

ReduceMemoryProcessor downscales the type of your numeric columns to reduce the memory footprint as much as possible.


source

ReduceMemoryProcessor

 ReduceMemoryProcessor (deep_mem_inspect=False)

Reduce memory usage as much as possible.

Credits to kainsama and others for writing about memory usage reduction for Numerai data: https://forum.numer.ai/t/reducing-memory/313

:param deep_mem_inspect: Introspect the data deeply by interrogating object dtypes. Yields a more accurate representation of memory usage if you have complex object columns.

dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
rmp = ReduceMemoryProcessor()
dataf = rmp.transform(dataf)
Memory usage of DataFrame is 0.04 MB
Memory usage after optimization is: 0.02 MB
 Usage decreased by 49.72%
✅ Finished step ReduceMemoryProcessor. Output shape=(10, 1073). Time taken for step: 0:00:00.695928. ✅

1.0.6. UMAPFeatureGenerator

Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that we can use to generate new Numerai features. This processor uses umap-learn under the hood to model the manifold. The dimensionality of the input data will be reduced to n_components features.


source

UMAPFeatureGenerator

 UMAPFeatureGenerator (n_components:int=5, n_neighbors:int=15,
                       min_dist:float=0.0, metric:str='correlation',
                       feature_names:list=None, *args, **kwargs)

Generate new Numerai features using UMAP. Uses umap-learn under the hood:

https://pypi.org/project/umap-learn/

:param n_components: How many new features to generate.

:param n_neighbors: Number of neighboring points used in local approximations of manifold structure.

:param min_dist: How tightly the embedding is allowed to compress points together.

:param metric: Metric to measure distance in input space. Correlation by default.

:param feature_names: Selection of features used to perform UMAP on. All features by default.

*args, **kwargs will be passed to the initialization of UMAP.

n_components = 3
umap_gen = UMAPFeatureGenerator(n_components=n_components, n_neighbors=9)
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
dataf = umap_gen(dataf)

The new features will be named following the convention f"feature_umap_{i}".

umap_features = [f"feature_umap_{i}" for i in range(n_components)]
dataf[umap_features].head(3)
feature_umap_0 feature_umap_1 feature_umap_2
id
n559bd06a8861222 0.419718 0.607629 1.000000
n9d39dea58c9e3cf 0.794003 0.273073 0.963785
nb64f06d3a9fc9f1 0.490861 1.000000 0.522663

1.1. Numerai Classic

The Numerai Classic dataset has a certain structure that you may not encounter in the Numerai Signals tournament. Therefore, this section has all preprocessors that can only be applied to Numerai Classic.

1.1.0. Numerai Classic: Version agnostic

Preprocessors that work for all Numerai Classic versions.

1.1.0.1. BayesianGMMTargetProcessor


source

BayesianGMMTargetProcessor

 BayesianGMMTargetProcessor (target_col:str='target',
                             feature_names:list=None, n_components:int=6)

Generate synthetic (fake) target using a Bayesian Gaussian Mixture model.

Based on Michael Oliver’s GitHub Gist implementation:

https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93

:param target_col: Column from which to create fake target.

:param feature_names: Selection of features used for Bayesian GMM. All features by default.

:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.
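
A minimal usage sketch, assuming the same .transform interface as the other preprocessors (the exact naming of the generated synthetic target column is not documented here):

# Hedged usage sketch: derive a synthetic target from the real "target"
# column, using all features by default.
dataf = create_numerframe("test_assets/mini_numerai_version_2_data.parquet")
bgmm = BayesianGMMTargetProcessor(target_col="target", n_components=6)
dataf = bgmm.transform(dataf)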

1.2. Numerai Signals

Preprocessors that are specific to Numerai Signals.

1.2.1. KatsuFeatureGenerator

Katsu1110 provides an excellent and fast feature engineering scheme in his Kaggle notebook on starting with Numerai Signals. It is surprisingly effective and works well for modeling. This preprocessor is based on his feature engineering setup in that notebook.

Features generated:

1. MACD and MACD signal
2. RSI
3. Percentage rate of return
4. Volatility
5. MA (moving average) gap


source

KatsuFeatureGenerator

 KatsuFeatureGenerator (windows:list, ticker_col:str='ticker',
                        close_col:str='close', num_cores:int=None)

Effective feature engineering setup based on Katsu’s starter notebook. Based on source by Katsu1110: https://www.kaggle.com/code1110/numeraisignals-starter-for-beginners

:param windows: Time interval to apply for window features:

  1. Percentage Rate of change

  2. Volatility

  3. Moving Average gap

:param ticker_col: Column with tickers to iterate over.

:param close_col: Column name where you have closing price stored.

Let’s create a simple synthetic dataset to test preprocessors on. Many preprocessors require at least ticker, date and close columns. More advanced feature engineering preprocessors also expect open, high, low and volume columns.

instances = []
tickers = ["ABC.US", "DEF.US", "GHI.US"]
for ticker in tickers:
    price = np.random.randint(10, 100)
    for i in range(100):
        price += np.random.uniform(-1, 1)
        instances.append(
            {
                "ticker": ticker,
                "date": pd.Timestamp("2020-01-01") + pd.Timedelta(days=i),
                "open": price - 0.05,
                "high": price + 0.02,
                "low": price - 0.01,
                "close": price,
                "volume": np.random.randint(1000, 10000),
            }
        )
dummy_df = NumerFrame(instances)
dummy_df.head(2)
ticker date open high low close volume
0 ABC.US 2020-01-01 73.980959 74.050959 74.020959 74.030959 3307
1 ABC.US 2020-01-02 74.477860 74.547860 74.517860 74.527860 7824
dataf = NumerFrame(dummy_df)
dataf.loc[:, "friday_date"] = dataf["date"]
kfpp = KatsuFeatureGenerator(windows=[20, 40, 60], num_cores=8)
new_dataf = kfpp.transform(dataf)
Feature engineering for 3 tickers using 8 CPU cores.
✅ Finished step KatsuFeatureGenerator. Output shape=(300, 20). Time taken for step: 0:00:00.359029. ✅

12 features are generated in this test (3*3 window features + 3 non-window features).

new_dataf.sort_values(["ticker", "date"]).get_feature_data.tail(2)
feature_close_ROCP_20 feature_close_VOL_20 feature_close_MA_gap_20 feature_close_ROCP_40 feature_close_VOL_40 feature_close_MA_gap_40 feature_close_ROCP_60 feature_close_VOL_60 feature_close_MA_gap_60 feature_RSI feature_MACD feature_MACD_signal
298 0.015109 0.002838 1.000321 0.053793 0.003068 1.007518 0.104728 0.003154 1.033006 51.987534 0.695065 0.813298
299 -0.004253 0.002809 0.992487 0.068869 0.002935 0.997809 0.077094 0.003124 1.023447 48.727812 0.557264 0.762091

1.2.2. EraQuantileProcessor

Numerai Signals’ objective is predicting a ranking of equities. Therefore, we can benefit from creating rankings out of the features. Doing this reduces noise and works as a normalization mechanism for your features. EraQuantileProcessor bins features in a given number of quantiles for each era in the dataset.


source

EraQuantileProcessor

 EraQuantileProcessor (num_quantiles:int=50, era_col:str='friday_date',
                       features:list=None, num_cores:int=None,
                       random_state:int=0, batch_size:int=1)

Transform features into quantiles on a per-era basis

:param num_quantiles: Number of buckets to split data into.

:param era_col: Era column name in the dataframe to perform each transformation.

:param features: All features that you want quantized. All feature cols by default.

:param num_cores: CPU cores to allocate for quantile transforming. All available cores by default.

:param random_state: Seed for QuantileTransformer.

:param batch_size: How many features to process at the same time. For data at Numerai Signals scale it is advisable to process features one by one. This is the default setting.

era_quantiler = EraQuantileProcessor(num_quantiles=50)
era_dataf = era_quantiler.transform(new_dataf)
Quantiling for 12 features using 32 CPU cores.
✅ Finished step EraQuantileProcessor. Output shape=(300, 21). Time taken for step: 0:00:01.111358. ✅
era_dataf.get_feature_data.tail(2)
feature_close_ROCP_20 feature_close_VOL_20 feature_close_MA_gap_20 feature_close_ROCP_40 feature_close_VOL_40 feature_close_MA_gap_40 feature_close_ROCP_60 feature_close_VOL_60 feature_close_MA_gap_60 feature_RSI feature_MACD feature_MACD_signal feature_close_ROCP_20_quantile50
298 0.015109 0.002838 1.000321 0.053793 0.003068 1.007518 0.104728 0.003154 1.033006 51.987534 0.695065 0.813298 1.0
299 -0.004253 0.002809 0.992487 0.068869 0.002935 0.997809 0.077094 0.003124 1.023447 48.727812 0.557264 0.762091 0.5

1.2.3. TickerMapper

Numerai Signals data APIs may work with different ticker formats. Our goal with TickerMapper is to map ticker_col to target_ticker_format.


source

TickerMapper

 TickerMapper (ticker_col:str='ticker',
               target_ticker_format:str='bloomberg_ticker',
               mapper_path:str='https://numerai-signals-public-data.s3-us-
               west-2.amazonaws.com/signals_ticker_map_w_bbg.csv')

Map ticker from one format to another.

:param ticker_col: Column used for mapping. Must already be present in the input data.

:param target_ticker_format: Format to map tickers to. Must be present in the ticker map.

For the default mapper, supported ticker formats are: ['ticker', 'bloomberg_ticker', 'yahoo']

:param mapper_path: Path to CSV file containing at least ticker_col and target_ticker_format columns.

Can be either a web link or a local path. Numerai Signals mapping by default.

Use default signals mapping to convert between Numerai ticker, Bloomberg ticker and Yahoo ticker formats.

test_dataf = pd.DataFrame(["AAPL", "MSFT"], columns=["ticker"])
mapper = TickerMapper()
mapper.transform(test_dataf)
✅ Finished step TickerMapper. Output shape=(2, 2). Time taken for step: 0:00:00.002761. ✅
ticker bloomberg_ticker
0 AAPL AAPL US
1 MSFT MSFT US

You can also use a CSV file for mapping. For example, the mapping that Numerai user degerhan provides in dsignals for EOD data.

test_dataf = pd.DataFrame(["LLB SW", "DRAK NA", "SWB MK", "ELEKTRA* MF", "NOT_A_TICKER"], columns=["bloomberg_ticker"])
mapper = TickerMapper(ticker_col="bloomberg_ticker", target_ticker_format="signals_ticker",
                      mapper_path="test_assets/eodhd-map.csv")
mapper.transform(test_dataf)
✅ Finished step TickerMapper. Output shape=(5, 2). Time taken for step: 0:00:00.005146. ✅
bloomberg_ticker signals_ticker
0 LLB SW LLB.SW
1 DRAK NA DRAK.AS
2 SWB MK 5211.KLSE
3 ELEKTRA* MF ELEKTRA.MX
4 NOT_A_TICKER NaN

1.2.4. SignalsTargetProcessor

Numerai provides targets for 5000 stocks that are neutralized against all sorts of factors. However, it can be helpful to experiment with creating your own targets. You might want to explore different windows, different target binning and/or neutralization. SignalsTargetProcessor engineers 3 different targets for every given window:

- _raw: Raw return based on price movements.
- _rank: Ranks of raw return.
- _group: Binned returns based on rank.

Note that Numerai provides targets based on 4-day returns and 20-day returns. While you can explore any window you like, it makes sense to start with windows close to these timeframes.

For the bins argument many options are possible. The following are commonly used binnings:

- Nomi bins: [0, 0.05, 0.25, 0.75, 0.95, 1]
- Uniform bins: [0, 0.20, 0.40, 0.60, 0.80, 1]


source

SignalsTargetProcessor

 SignalsTargetProcessor (price_col:str='close', windows:list=None,
                         bins:list=None, labels:list=None)

Engineer targets for Numerai Signals.

More information on how Numerai Signals targets are implemented:

https://forum.numer.ai/t/decoding-the-signals-target/2501

:param price_col: Column from which target will be derived.

:param windows: Timeframes to use for engineering targets. 10 and 20-day by default.

:param bins: Binning used to create group targets. Nomi binning by default.

:param labels: Scaling for binned target. Must be one shorter than bins (len(bins) - 1). Numerai labels by default.

stp = SignalsTargetProcessor()
era_dataf.meta.era_col = "date"
new_target_dataf = stp.transform(era_dataf)
new_target_dataf.get_target_data.head(2)
✅ Finished step SignalsTargetProcessor. Output shape=(300, 27). Time taken for step: 0:00:00.334536. ✅
target_10d_raw target_10d_rank target_10d_group target_20d_raw target_20d_rank target_20d_group
0 0.026874 1.0 1.0 0.039152 1.0 1.0
1 0.021059 1.0 1.0 0.018829 1.0 1.0

1.2.5. LagPreProcessor

Many models, like Gradient Boosting Machines (GBMs), don't learn time-series patterns by themselves. However, if we create lags of our features, the models can pick up on time dependencies between features. LagPreProcessor creates lag features for given features and windows.


source

LagPreProcessor

 LagPreProcessor (windows:list=None, ticker_col:str='bloomberg_ticker',
                  feature_names:list=None)

Add lag features based on given windows.

:param windows: All lag windows to process for all features.

[5, 10, 15, 20] by default (4 weeks lookback)

:param ticker_col: Column name for grouping by tickers.

:param feature_names: All features for which you want to create lags. All features by default.

lpp = LagPreProcessor(ticker_col="ticker", feature_names=["close", "volume"])
dataf = lpp(dataf)
✅ Finished step LagPreProcessor. Output shape=(300, 16). Time taken for step: 0:00:00.036771. ✅

All lag features will contain lag in the column name.

dataf.get_pattern_data("lag").tail(2)
close_lag5 close_lag10 close_lag15 close_lag20 volume_lag5 volume_lag10 volume_lag15 volume_lag20
298 47.939367 46.472799 47.846745 46.300246 4102.0 3544.0 7182.0 7843.0
299 47.322248 46.709073 47.398402 46.820937 9846.0 6945.0 5197.0 5164.0

1.2.6. DifferencePreProcessor

After creating lags with the LagPreProcessor, it may be useful to create new features that calculate the difference between those lags. With DifferencePreProcessor, we can provide models with more time-series-related patterns.


source

DifferencePreProcessor

 DifferencePreProcessor (windows:list=None, feature_names:list=None,
                         pct_diff:bool=False, abs_diff:bool=False)

Add difference features based on given windows. Run LagPreProcessor first.

:param windows: All lag windows to process for all features.

:param feature_names: All features for which you want to create differences. All features that also have lags by default.

:param pct_diff: Method to calculate differences. If True, calculates differences as a percentage change. Otherwise calculates a simple difference. Defaults to False.

:param abs_diff: Whether to also calculate the absolute value of all differences. Defaults to False.

dpp = DifferencePreProcessor(
    feature_names=["close", "volume"], windows=[5, 10, 15, 20], pct_diff=True
)
dataf = dpp.transform(dataf)
✅ Finished step DifferencePreProcessor. Output shape=(300, 24). Time taken for step: 0:00:00.047873. ✅

All difference features will contain diff in the column name.

dataf.get_pattern_data("diff").tail(2)
close_diff5 close_diff10 close_diff15 close_diff20 volume_diff5 volume_diff10 volume_diff15 volume_diff20
298 -0.019599 0.011340 -0.017701 0.015109 0.226719 0.419865 -0.299360 -0.358409
299 -0.014802 -0.001868 -0.016384 -0.004253 -0.279301 0.021742 0.365403 0.374129

1.2.7. PandasTaFeatureGenerator

This generator takes in a pandas-ta strategy and processes it on multiple cores. A simple default strategy with RSI features for windows of 14 and 60 rows is available.

To learn more about defining pandas-ta strategies, check this section of the pandas-ta README.


source

PandasTaFeatureGenerator

 PandasTaFeatureGenerator (strategy:pandas_ta.core.Strategy=None,
                           ticker_col:str='ticker', num_cores:int=None)

Generate features with pandas-ta. https://github.com/twopirllc/pandas-ta

:param strategy: Valid Pandas Ta strategy.

For more information on creating a strategy, see:

https://github.com/twopirllc/pandas-ta#pandas-ta-strategy

By default, a strategy with RSI(14) and RSI(60) is used.

:param ticker_col: Column name for grouping by tickers.

:param num_cores: Number of cores to use for multiprocessing.

By default, all available cores are used.

pta = PandasTaFeatureGenerator()
new_pta_df = pta.transform(dummy_df)
new_pta_df.tail(2)
✅ Finished step PandasTaFeatureGenerator. Output shape=(300, 10). Time taken for step: 0:00:00.943370. ✅
ticker date open high low close volume friday_date feature_RSI_14 feature_RSI_60
298 GHI.US 2020-04-08 46.949789 47.019789 46.989789 46.999789 5032 2020-04-08 51.987534 52.353205
299 GHI.US 2020-04-09 46.571805 46.641805 46.611805 46.621805 7096 2020-04-09 48.727812 51.555874

The feature data can be selected directly through a NumerFrame convenience method called .get_feature_data.

new_pta_df.get_feature_data.tail(2)
feature_RSI_14 feature_RSI_60
298 51.987534 52.353205
299 48.727812 51.555874

A custom pandas-ta strategy can be defined as follows. Check the pandas-ta docs for more information on available indicators and arguments.

The ta argument takes in a list of dictionaries defining indicators and optional additional arguments. We use col_names for convenience so features are prefixed with feature_ and can be easily retrieved within a NumerFrame.

import pandas_ta as ta

strategy = ta.Strategy(name="mystrategy",
                       ta=[{"kind": "cmo", "col_names": ("feature_CMO",)}, # Chande Momentum Oscillator
                           {"kind": "rsi", "length": 60, "col_names": ("feature_RSI_60",)} # Relative Strength Index
                           ])
pta = PandasTaFeatureGenerator(strategy=strategy)
new_pta_df = pta.transform(dummy_df)
new_pta_df.get_feature_data.tail(5)
✅ Finished step PandasTaFeatureGenerator. Output shape=(300, 10). Time taken for step: 0:00:00.965338. ✅
feature_CMO feature_RSI_60
295 11.802106 53.305259
296 9.111138 52.965671
297 8.799414 52.927768
298 3.975068 52.353205
299 -2.544377 51.555874

2. Custom preprocessors

There is an almost unlimited number of ways to preprocess (selection, engineering and manipulation). We have only scratched the surface with the preprocessors currently implemented. We invite the Numerai community to develop Numerai Classic and Numerai Signals preprocessors.

A new Preprocessor should inherit from BaseProcessor and implement a transform method. For efficient implementation, we recommend you use NumerFrame functionality for preprocessing. You can also support Pandas DataFrame input as long as the transform method returns a NumerFrame. This ensures that the Preprocessor still works within a full numerai-blocks pipeline. A template for new preprocessors is given below.

To enable fancy logging output, add the @display_processor_info decorator to the transform method.


source

AwesomePreProcessor

 AwesomePreProcessor ()

TEMPLATE - Do some awesome preprocessing.
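
In code, the template amounts to something like the sketch below: a minimal skeleton assuming the standard BaseProcessor interface, with your own logic to be filled in inside transform.

class AwesomePreProcessor(BaseProcessor):
    """TEMPLATE - Do some awesome preprocessing."""

    def __init__(self):
        super().__init__()

    @display_processor_info
    def transform(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        # Do awesome preprocessing logic here and return a NumerFrame.
        return NumerFrame(dataf)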