Feature Transformers
centimators.feature_transformers.ranking
Ranking transformers for cross-sectional normalization.
RankTransformer
Bases: _BaseFeatureTransformer
RankTransformer transforms features into their normalized rank within groups defined by a date series.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_names | list of str | Names of columns to transform. If None, all columns of X are used. | None |
Examples:
>>> import pandas as pd
>>> from centimators.feature_transformers import RankTransformer
>>> df = pd.DataFrame({
... 'date': ['2021-01-01', '2021-01-01', '2021-01-02'],
... 'feature1': [3, 1, 2],
... 'feature2': [30, 20, 10]
... })
>>> transformer = RankTransformer(feature_names=['feature1', 'feature2'])
>>> result = transformer.fit_transform(df[['feature1', 'feature2']], date_series=df['date'])
>>> print(result)
feature1_rank feature2_rank
0 0.5 0.5
1 1.0 1.0
2 1.0 1.0
Source code in src/centimators/feature_transformers/ranking.py
transform(X, y=None, date_series=None)
Transforms features to their normalized rank.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | Any | Ignored. Kept for compatibility. | None |
| date_series | IntoSeries | Series defining groups for ranking (e.g., dates). | None |
Returns:
| Type | Description |
|---|---|
| FrameT | Transformed data frame with ranked features. |
Source code in src/centimators/feature_transformers/ranking.py
get_feature_names_out(input_features=None)
Returns the output feature names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of transformed feature names. |
Source code in src/centimators/feature_transformers/ranking.py
centimators.feature_transformers.time_series
Time-series feature transformers for grouped temporal operations.
LagTransformer
Bases: _BaseFeatureTransformer
LagTransformer shifts features by specified lag windows within groups defined by a ticker series.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| windows | iterable of int | Lag periods to compute. Each feature will have shifted versions for each lag. | required |
| feature_names | list of str | Names of columns to transform. If None, all columns of X are used. | None |
Examples:
>>> import pandas as pd
>>> from centimators.feature_transformers import LagTransformer
>>> df = pd.DataFrame({
... 'ticker': ['A', 'A', 'A', 'B', 'B'],
... 'price': [10, 11, 12, 20, 21]
... })
>>> transformer = LagTransformer(windows=[1, 2], feature_names=['price'])
>>> result = transformer.fit_transform(df[['price']], ticker_series=df['ticker'])
>>> print(result)
price_lag1 price_lag2
0 NaN NaN
1 10.0 NaN
2 11.0 10.0
3 NaN NaN
4 20.0 NaN
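The grouped shift in the example above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the helper name `lag_within_groups` is hypothetical, rows are assumed to be in time order within each ticker, and `None` stands in for NaN.

```python
from collections import defaultdict


def lag_within_groups(values, groups, window):
    """Shift `values` by `window` rows inside each group; missing history is None."""
    history = defaultdict(list)  # values seen so far, per group
    lagged = []
    for value, group in zip(values, groups):
        seen = history[group]
        lagged.append(seen[-window] if len(seen) >= window else None)
        seen.append(value)
    return lagged


tickers = ["A", "A", "A", "B", "B"]
prices = [10, 11, 12, 20, 21]

price_lag1 = lag_within_groups(prices, tickers, window=1)
price_lag2 = lag_within_groups(prices, tickers, window=2)
print(price_lag1)  # [None, 10, 11, None, 20]
print(price_lag2)  # [None, None, 10, None, None]
```

Because history is tracked per group, the first row of ticker B never sees a value from ticker A.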
Source code in src/centimators/feature_transformers/time_series.py
transform(X, y=None, ticker_series=None)
Applies lag transformation to the features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | Any | Ignored. Kept for compatibility. | None |
| ticker_series | IntoSeries | Series defining groups for lagging (e.g., tickers). | None |
Returns:
| Type | Description |
|---|---|
| FrameT | Transformed data frame with lagged features. Columns are ordered by lag. |
Source code in src/centimators/feature_transformers/time_series.py
get_feature_names_out(input_features=None)
Returns the output feature names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of transformed feature names, ordered by lag, then by feature. |
Source code in src/centimators/feature_transformers/time_series.py
MovingAverageTransformer
Bases: _BaseFeatureTransformer
MovingAverageTransformer computes the moving average of a feature over a specified window.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| windows | list of int | The windows over which to compute the moving average. | required |
| feature_names | list of str | The names of the features to compute the moving average for. | None |
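A grouped trailing mean of this kind might look like the following plain-Python sketch. It is not the library's implementation: the helper name `moving_average_within_groups` is hypothetical, rows are assumed to be in time order within each ticker, and returning `None` for incomplete windows is an assumption, since the handling of short histories is not documented here.

```python
from collections import defaultdict, deque


def moving_average_within_groups(values, groups, window):
    """Trailing mean over the last `window` rows of each group."""
    recent = defaultdict(lambda: deque(maxlen=window))  # rolling buffer per group
    averages = []
    for value, group in zip(values, groups):
        buffer = recent[group]
        buffer.append(value)
        # Assumption: incomplete windows produce None rather than a partial mean.
        averages.append(sum(buffer) / window if len(buffer) == window else None)
    return averages


tickers = ["A", "A", "A", "B", "B"]
prices = [10, 11, 12, 20, 21]
price_ma2 = moving_average_within_groups(prices, tickers, window=2)
print(price_ma2)  # [None, 10.5, 11.5, None, 20.5]
```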
Source code in src/centimators/feature_transformers/time_series.py
transform(X, y=None, ticker_series=None)
Applies moving average transformation to the features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | Any | Ignored. Kept for compatibility. | None |
| ticker_series | IntoSeries | Series defining groups for moving average (e.g., tickers). | None |
Returns:
| Type | Description |
|---|---|
| FrameT | Transformed data frame with moving average features. |
Source code in src/centimators/feature_transformers/time_series.py
get_feature_names_out(input_features=None)
Returns the output feature names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of transformed feature names. |
Source code in src/centimators/feature_transformers/time_series.py
LogReturnTransformer
Bases: _BaseFeatureTransformer
LogReturnTransformer computes the log return of a feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_names | list of str | Names of columns to transform. If None, all columns of X are used. | None |
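As a rough illustration, a one-period log return computed within each ticker could be sketched as below. The one-period definition log(current / previous), the `None` for the first row of each group, and the helper name `log_returns_within_groups` are all assumptions made for this sketch; the library's exact behavior may differ.

```python
import math


def log_returns_within_groups(values, groups):
    """log(current / previous) inside each group; the first row of a group is None."""
    previous = {}  # last value seen per group
    returns = []
    for value, group in zip(values, groups):
        if group in previous:
            returns.append(math.log(value / previous[group]))
        else:
            returns.append(None)
        previous[group] = value
    return returns


tickers = ["A", "A", "A", "B", "B"]
prices = [10.0, 11.0, 12.1, 20.0, 21.0]
price_logret = log_returns_within_groups(prices, tickers)
```

Log returns add up over time: the sum of the two returns for ticker A equals log(12.1 / 10.0).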
Source code in src/centimators/feature_transformers/time_series.py
transform(X, y=None, ticker_series=None)
Applies log return transformation to the features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | Any | Ignored. Kept for compatibility. | None |
| ticker_series | IntoSeries | Series defining groups for log return (e.g., tickers). | None |
Returns:
| Type | Description |
|---|---|
| FrameT | Transformed data frame with log return features. |
Source code in src/centimators/feature_transformers/time_series.py
get_feature_names_out(input_features=None)
Returns the output feature names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of transformed feature names. |
Source code in src/centimators/feature_transformers/time_series.py
centimators.feature_transformers.stats
Statistical transformers for horizontal aggregations.
GroupStatsTransformer
Bases: _BaseFeatureTransformer
GroupStatsTransformer calculates statistical measures for defined feature groups.
This transformer computes row-wise statistics (mean, standard deviation, skewness, kurtosis, range, and coefficient of variation) for each group of features specified in the feature_group_mapping.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| feature_group_mapping | dict | Dictionary mapping group names to lists of feature columns. Example: {'group1': ['feature1', 'feature2'], 'group2': ['feature3', 'feature4']} | required |
| stats | list of str | List of statistics to compute for each group. If None, all statistics are computed. Valid options are 'mean', 'std', 'skew', 'kurt', 'range', and 'cv'. | ['mean', 'std', 'skew', 'kurt', 'range', 'cv'] |
Examples:
>>> import pandas as pd
>>> from centimators.feature_transformers import GroupStatsTransformer
>>> df = pd.DataFrame({
... 'feature1': [1, 2, 3],
... 'feature2': [4, 5, 6],
... 'feature3': [7, 8, 9],
... 'feature4': [10, 11, 12]
... })
>>> mapping = {'group1': ['feature1', 'feature2'], 'group2': ['feature3', 'feature4']}
>>> transformer = GroupStatsTransformer(feature_group_mapping=mapping, stats=['mean', 'std', 'skew'])
>>> result = transformer.fit_transform(df)
>>> print(result)
group1_groupstats_mean group1_groupstats_std group1_groupstats_skew group2_groupstats_mean group2_groupstats_std group2_groupstats_skew
0 2.5 1.5 0.0 8.5 1.5 0.0
1 3.5 1.5 0.0 9.5 1.5 0.0
2 4.5 1.5 0.0 10.5 1.5 0.0
>>> transformer_mean_only = GroupStatsTransformer(feature_group_mapping=mapping, stats=['mean'])
>>> result_mean_only = transformer_mean_only.fit_transform(df)
>>> print(result_mean_only)
group1_groupstats_mean group2_groupstats_mean
0 2.5 8.5
1 3.5 9.5
2 4.5 10.5
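The horizontal (per-row) aggregation can be sketched for a single row in plain Python. This is an illustration under stated assumptions, not the library's code: the helper name `row_group_stats` is hypothetical, the population standard deviation is inferred from the 1.5 shown in the example output, 'cv' is assumed to mean std / mean, and the `_groupstats_range` and `_groupstats_cv` column names are assumed to follow the naming pattern seen above.

```python
import statistics


def row_group_stats(row, feature_group_mapping):
    """Compute statistics across the columns of each feature group for one row."""
    out = {}
    for group, columns in feature_group_mapping.items():
        values = [row[c] for c in columns]
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # population std, inferred from the example output
        out[f"{group}_groupstats_mean"] = mean
        out[f"{group}_groupstats_std"] = std
        out[f"{group}_groupstats_range"] = max(values) - min(values)
        # Assumption: cv is std / mean; left as None when the mean is zero.
        out[f"{group}_groupstats_cv"] = std / mean if mean != 0 else None
    return out


mapping = {"group1": ["feature1", "feature2"], "group2": ["feature3", "feature4"]}
first_row = {"feature1": 1, "feature2": 4, "feature3": 7, "feature4": 10}
stats = row_group_stats(first_row, mapping)
print(stats["group1_groupstats_mean"], stats["group1_groupstats_std"])  # 2.5 1.5
```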
Source code in src/centimators/feature_transformers/stats.py
transform(X, y=None)
Calculates group statistics on the features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | Any | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| FrameT | Transformed data frame with group statistics features. |
Source code in src/centimators/feature_transformers/stats.py
get_feature_names_out(input_features=None)
Return feature names for all groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of transformed feature names. |
Source code in src/centimators/feature_transformers/stats.py
centimators.feature_transformers.neutralization
Neutralization transformers for reducing feature exposure.
FeatureNeutralizer
Bases: _BaseFeatureTransformer
Classic feature neutralization by subtracting a linear model to reduce feature exposure.
This transformer neutralizes predictions by removing their linear relationship with specified features. For each era, it:
1. Gaussianizes the predictions (rank -> normalize -> inverse CDF)
2. Fits a linear model: prediction ~ features
3. Subtracts proportion * exposure from predictions
4. Re-normalizes and scales to [0, 1]
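The per-era steps can be sketched in plain Python for a single era. This is a minimal sketch under several assumptions, not the library's implementation: ties are not averaged when ranking, the linear model has no intercept, the final scaling is taken to be min-max, and the helper names (`gaussianize`, `solve`, `neutralize_era`) are hypothetical.

```python
from statistics import NormalDist, pstdev


def gaussianize(values):
    """Rank -> normalize to (0, 1) -> inverse normal CDF. Ties are not averaged."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):
        out[i] = NormalDist().inv_cdf((rank + 0.5) / n)
    return out


def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]


def neutralize_era(preds, features, proportion=0.5):
    """`features` is a list of rows. Returns (residual, scaled) for one era."""
    p = gaussianize(preds)
    k = len(features[0])
    # Least squares via the normal equations: (F^T F) beta = F^T p
    ftf = [[sum(r[i] * r[j] for r in features) for j in range(k)] for i in range(k)]
    ftp = [sum(r[i] * pi for r, pi in zip(features, p)) for i in range(k)]
    beta = solve(ftf, ftp)
    exposure = [sum(b * x for b, x in zip(beta, r)) for r in features]
    residual = [pi - proportion * e for pi, e in zip(p, exposure)]
    std = pstdev(residual)
    normalized = [x / std for x in residual]
    lo, hi = min(normalized), max(normalized)
    scaled = [(x - lo) / (hi - lo) for x in normalized]  # assumption: min-max scaling
    return residual, scaled


preds = [0.7, 0.1, 0.9, 0.4, 0.6]
features = [[0.1, 0.6], [0.2, 0.5], [0.3, 0.1], [0.4, 0.3], [0.5, 0.2]]
residual, scaled = neutralize_era(preds, features, proportion=1.0)
```

With proportion=1.0 the residual is orthogonal to every feature column, which is what "full neutralization" means in this sketch; smaller proportions leave part of the linear exposure in place.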
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| proportion | float or list of float | How much to neutralize in range [0, 1]. 0 = no neutralization, 1 = full neutralization. If list, creates multiple output columns (one per proportion). | 0.5 |
| pred_name | str or list of str | Name(s) of prediction column(s) to neutralize. Used for generating output column names. | 'prediction' |
| feature_names | list of str | Names of feature columns to neutralize against. If None, all columns of X are used. | None |
| suffix | str | Suffix to append to output column names. | None |
| n_jobs | int | Number of parallel jobs. 1 = sequential (default), -1 = all cores. | 1 |
| verbose | bool | Show progress bar over eras. Default False. | False |
Examples:
>>> import pandas as pd
>>> from centimators.feature_transformers import FeatureNeutralizer
>>> # Sample data with eras, features, and predictions
>>> df = pd.DataFrame({
... 'era': ['era1', 'era1', 'era1', 'era2', 'era2', 'era2'],
... 'feature1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
... 'feature2': [0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
... 'prediction': [0.7, 0.8, 0.9, 0.6, 0.7, 0.8]
... })
>>> neutralizer = FeatureNeutralizer(
... proportion=0.5,
... pred_name='prediction',
... feature_names=['feature1', 'feature2']
... )
>>> # Predictions to neutralize (can be separate from features)
>>> result = neutralizer.fit_transform(
... df[['prediction']],
... features=df[['feature1', 'feature2']],
... era_series=df['era']
... )
Source code in src/centimators/feature_transformers/neutralization.py
transform(X, y=None, features=None, era_series=None)
Neutralizes predictions against features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input predictions to neutralize (shape: n_samples x n_predictions). | required |
| y | | Ignored. Kept for sklearn compatibility. | None |
| features | FrameT \| None | DataFrame with features for neutralization. If None, uses X as both predictions and features. | None |
| era_series | IntoSeries \| None | Series with era labels for grouping. If None, treats all data as a single era. | None |
Returns:
| Type | Description |
|---|---|
| FrameT | DataFrame with neutralized predictions, scaled to [0, 1]. |
Source code in src/centimators/feature_transformers/neutralization.py
centimators.feature_transformers.embedding
Embedding transformers for text and categorical features using DSPy.
EmbeddingTransformer
Bases: _BaseFeatureTransformer
EmbeddingTransformer embeds text and categorical features using DSPy's Embedder.
This transformer converts text or categorical columns into dense vector embeddings using either hosted embedding models (e.g., OpenAI) or custom embedding functions (e.g., local SentenceTransformers). The embeddings are expanded into multiple columns for sklearn compatibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str or Callable | The embedding model to use: either a string naming a hosted model (e.g., "openai/text-embedding-3-small") or a callable function (e.g., SentenceTransformer.encode). | required |
| feature_names | list[str] \| None | Names of columns to embed. If None, all columns are embedded. | None |
| categorical_mapping | dict[str, str] \| None | Optional mapping from categorical column names to text templates. For example: {"sector": "Company sector: {}"} will format the sector value as "Company sector: Technology" before embedding. | None |
| batch_size | int | Batch size for embedding computation. Default: 200. | 200 |
| caching | bool | Whether to cache embeddings (for hosted models). Default: True. | True |
| **embedder_kwargs | | Additional keyword arguments passed to dspy.Embedder. | {} |
Examples:
>>> import polars as pl
>>> from centimators.feature_transformers import EmbeddingTransformer
>>> from sentence_transformers import SentenceTransformer
>>>
>>> # Example 1: Using a local model
>>> model = SentenceTransformer('all-MiniLM-L6-v2')
>>> df = pl.DataFrame({
... 'text': ['AI company', 'Bank', 'Pharma firm'],
... 'sector': ['Technology', 'Finance', 'Healthcare']
... })
>>>
>>> transformer = EmbeddingTransformer(
... model=model.encode,
... feature_names=['text', 'sector'],
... categorical_mapping={'sector': 'Company sector: {}'}
... )
>>> embedded = transformer.fit_transform(df[['text', 'sector']])
>>> print(embedded.columns) # text_embed_0, text_embed_1, ..., sector_embed_0, ...
>>>
>>> # Example 2: Using a hosted model
>>> transformer = EmbeddingTransformer(
... model="openai/text-embedding-3-small",
... feature_names=['text']
... )
>>> embedded = transformer.fit_transform(df[['text']])
Notes
- Null values are skipped and filled with zero vectors
- Embedding dimension is inferred from the first batch
- Output columns follow the pattern: {feature_name}_embed_{dim_idx}
- Requires centimators[dspy] installation
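The notes above (zero vectors for nulls, dimension inferred from the embeddings, one output column per dimension) can be illustrated with a small plain-Python sketch. It does not use DSPy and is not the library's implementation: `toy_embed` is a hypothetical stand-in for a real embedding callable, `expand_embeddings` is a hypothetical helper, and batching is omitted.

```python
def toy_embed(texts):
    """Hypothetical stand-in for an embedding callable: one 3-d vector per text."""
    return [[float(len(t)), float(t.count(" ")), float(sum(map(ord, t)) % 7)] for t in texts]


def expand_embeddings(column, feature_name, embed, template=None):
    """Embed non-null values and expand vectors into one output column per dimension."""
    present = [(i, v) for i, v in enumerate(column) if v is not None]
    texts = [template.format(v) if template else str(v) for _, v in present]
    vectors = embed(texts) if texts else []
    dim = len(vectors[0]) if vectors else 0  # dimension inferred from the returned vectors
    rows = [[0.0] * dim for _ in column]  # nulls stay as zero vectors
    for (i, _), vec in zip(present, vectors):
        rows[i] = list(vec)
    return {f"{feature_name}_embed_{d}": [row[d] for row in rows] for d in range(dim)}


sector = ["Technology", None, "Healthcare"]
embedded = expand_embeddings(sector, "sector", toy_embed, template="Company sector: {}")
print(list(embedded))  # ['sector_embed_0', 'sector_embed_1', 'sector_embed_2']
```

The template is applied before embedding, so the first value is embedded as "Company sector: Technology", mirroring the categorical_mapping behavior described in the parameters.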
Source code in src/centimators/feature_transformers/embedding.py
fit(X, y=None)
Fit the transformer and initialize the embedder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| EmbeddingTransformer | The fitted transformer. |
Source code in src/centimators/feature_transformers/embedding.py
transform(X, y=None)
Transform features by embedding them into dense vectors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| FrameT | Transformed data frame with embedding columns expanded. Each input feature becomes multiple columns: {feature_name}_embed_0, {feature_name}_embed_1, etc. |
Source code in src/centimators/feature_transformers/embedding.py
get_feature_names_out(input_features=None)
Return the output feature names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of transformed feature names in the format {feature_name}_embed_{dim_idx}. |
Raises:
| Type | Description |
|---|---|
| ValueError | If called before transform() when dimensions are unknown. |
Source code in src/centimators/feature_transformers/embedding.py
centimators.feature_transformers.dimreduction
Dimensionality reduction transformers for feature compression.
DimReducer
Bases: _BaseFeatureTransformer
DimReducer applies dimensionality reduction to features using PCA, t-SNE, or UMAP.
This transformer reduces the dimensionality of input features by projecting them into a lower-dimensional space using one of three methods: Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), or Uniform Manifold Approximation and Projection (UMAP).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| method | str | The dimensionality reduction method to use. Options are 'pca' (Principal Component Analysis; linear, preserves global structure), 'tsne' (t-SNE; non-linear, preserves local structure, visualization), and 'umap' (UMAP; non-linear, preserves local + global structure). | 'pca' |
| n_components | int | Number of dimensions in the reduced space. | 2 |
| feature_names | list[str] \| None | Names of columns to reduce. If None, all columns are used. | None |
| **reducer_kwargs | | Additional keyword arguments passed to the underlying reducer (sklearn.decomposition.PCA, sklearn.manifold.TSNE, or umap.UMAP). | {} |
Examples:
>>> import polars as pl
>>> from centimators.feature_transformers import DimReducer
>>> df = pl.DataFrame({
... 'feature1': [1.0, 2.0, 3.0, 4.0],
... 'feature2': [4.0, 5.0, 6.0, 7.0],
... 'feature3': [7.0, 8.0, 9.0, 10.0],
... })
>>>
>>> # PCA reduction
>>> reducer = DimReducer(method='pca', n_components=2)
>>> reduced = reducer.fit_transform(df)
>>> print(reduced.columns) # ['dim_0', 'dim_1']
>>>
>>> # t-SNE for visualization
>>> reducer = DimReducer(method='tsne', n_components=2, random_state=42)
>>> reduced = reducer.fit_transform(df)
>>>
>>> # UMAP (requires umap-learn)
>>> reducer = DimReducer(method='umap', n_components=2, random_state=42)
>>> reduced = reducer.fit_transform(df)
Notes
- PCA is deterministic and fast, suitable for preprocessing
- t-SNE is stochastic and slower, primarily for visualization (does not support separate transform - uses fit_transform internally)
- UMAP balances speed and quality, good for both preprocessing and visualization
- UMAP requires the umap-learn package: uv add 'centimators[all]'
- All methods work with any narwhals-compatible backend (pandas, polars, etc.)
Source code in src/centimators/feature_transformers/dimreduction.py
fit(X, y=None)
Fit the dimensionality reduction model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| DimReducer | The fitted transformer. |
Source code in src/centimators/feature_transformers/dimreduction.py
transform(X, y=None)
Transform features by reducing their dimensionality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input data frame. | required |
| y | | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| FrameT | Transformed data frame with reduced dimensionality. Columns are named 'dim_0', 'dim_1', ..., 'dim_{n_components-1}'. |
Source code in src/centimators/feature_transformers/dimreduction.py
get_feature_names_out(input_features=None)
Return the output feature names.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_features | list[str] | Ignored. Kept for compatibility. | None |
Returns:
| Type | Description |
|---|---|
| list[str] | List of output feature names: ['dim_0', 'dim_1', ...]. |
Source code in src/centimators/feature_transformers/dimreduction.py
centimators.feature_transformers.penalization
Feature penalization transformers using iterative optimization (requires JAX).
FeaturePenalizer
Bases: _BaseFeatureTransformer
Feature penalization using iterative optimization to cap feature exposure.
Unlike FeatureNeutralizer which subtracts a fixed proportion of linear exposure, this transformer uses gradient descent to find the minimal adjustment that caps all feature exposures below a threshold. This preserves more of the original signal while ensuring no single feature dominates.
For each era, it:
1. Gaussianizes the predictions (rank -> normalize -> inverse CDF)
2. Trains a linear model to subtract from predictions such that |exposure to any feature| <= max_exposure
3. Re-normalizes and scales to [0, 1]
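Step 2 can be illustrated with a small plain-Python sketch. It is deliberately not the library's method: the documented implementation uses JAX with an Adamax optimizer, whereas this sketch swaps in plain gradient descent with a hand-derived gradient on a squared-hinge loss. Exposure is simplified here to the mean product of a standardized feature with the adjusted prediction, the Gaussianization and final rescaling steps are omitted, and all helper names are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sd = fmean(column), pstdev(column)
    return [(x - mu) / sd for x in column]


def exposures(cols, preds, weights):
    """Return (exposure per feature, adjusted predictions) for the given weights."""
    n = len(preds)
    adjusted = [p - sum(w * c[i] for w, c in zip(weights, cols)) for i, p in enumerate(preds)]
    return [sum(c[i] * adjusted[i] for i in range(n)) / n for c in cols], adjusted


def penalize(preds, feature_columns, max_exposure=0.1, lr=0.1, max_iters=10_000, tol=1e-7):
    cols = [standardize(c) for c in feature_columns]
    preds = standardize(preds)
    n, k = len(preds), len(cols)
    gram = [[sum(a[i] * b[i] for i in range(n)) / n for b in cols] for a in cols]
    weights = [0.0] * k
    for _ in range(max_iters):
        exp, _adjusted = exposures(cols, preds, weights)
        # Signed amount by which each exposure exceeds the cap (zero when inside it)
        excess = [max(abs(e) - max_exposure, 0.0) * (1 if e > 0 else -1) for e in exp]
        loss = sum(x * x for x in excess)
        if loss < tol:  # early stopping once the cap is (nearly) satisfied
            break
        # d(loss)/d(w_m) = -2 * sum_j excess_j * gram[j][m], since d(exp_j)/d(w_m) = -gram[j][m]
        grad = [-2.0 * sum(excess[j] * gram[j][m] for j in range(k)) for m in range(k)]
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return exposures(cols, preds, weights)


f1 = [1, 2, 3, 4, 5, 6, 7, 8]
f2 = [3, 1, 4, 1, 5, 9, 2, 6]
preds = [0.2, 0.1, 0.4, 0.3, 0.6, 0.9, 0.5, 0.8]
final_exposures, adjusted = penalize(preds, [f1, f2], max_exposure=0.1)
```

Unlike full neutralization, the loss is zero anywhere inside the cap, so the optimizer stops adjusting as soon as every exposure is at or below max_exposure. The fixed learning rate of 0.1 suits this two-feature example; more features or stronger correlations may need a smaller step.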
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_exposure | float or list of float | Maximum allowed feature exposure in [0, 1]. Lower = more aggressive penalization. If list, creates multiple outputs. | 0.1 |
| pred_name | str or list of str | Name(s) of prediction column(s) to penalize. | 'prediction' |
| feature_names | list of str | Names of feature columns. | None |
| suffix | str | Suffix to append to output column names. | None |
| lr | float | Learning rate for Adamax optimizer. Default 1e-3. | 0.001 |
| max_iters | int | Maximum optimization iterations per era. | 100000 |
| tol | float | Early stopping tolerance for loss. | 1e-07 |
| n_jobs | int | Number of parallel jobs. 1 = sequential, -1 = all cores. | 1 |
| verbose | bool | Show progress bar over eras. Default False. | False |
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from centimators.feature_transformers import FeaturePenalizer
>>> df = pd.DataFrame({
... 'era': ['era1'] * 50 + ['era2'] * 50,
... 'feature1': np.random.randn(100),
... 'feature2': np.random.randn(100),
... 'prediction': np.random.randn(100)
... })
>>> penalizer = FeaturePenalizer(
... max_exposure=0.1,
... pred_name='prediction',
... feature_names=['feature1', 'feature2']
... )
>>> result = penalizer.fit_transform(
... df[['prediction']],
... features=df[['feature1', 'feature2']],
... era_series=df['era']
... )
Source code in src/centimators/feature_transformers/penalization.py
transform(X, y=None, features=None, era_series=None)
Penalize predictions to cap feature exposure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | FrameT | Input predictions to penalize (shape: n_samples x n_predictions). | required |
| y | | Ignored. Kept for sklearn compatibility. | None |
| features | FrameT \| None | DataFrame with features for penalization. | None |
| era_series | IntoSeries \| None | Series with era labels for grouping. | None |
Returns:
| Type | Description |
|---|---|
| FrameT | DataFrame with penalized predictions, scaled to [0, 1]. |