Meta Estimators
Meta estimators wrap existing scikit-learn estimators to provide additional functionality. Currently, the following meta estimators are available: CrossValEstimator and MetaPipeline.
CrossValEstimator
CrossValEstimator integrates cross-validation directly into model training, fitting one model per data fold. You fit it as a single transformer and obtain the outputs of every fold during the prediction phase.
Why CrossValEstimator?
- Holistic Training: Cross-validation offers a more robust model training process by leveraging multiple subsets of your data, so your model's performance is less susceptible to the peculiarities of any single data split.
- Inherent Ensemble: By training on multiple folds, you are essentially building an ensemble of models. Ensembles often outperform individual models since they average out biases, reduce variance, and are less likely to overfit.
- Custom Evaluation: With the evaluation_func parameter, you can supply your own evaluation logic, allowing flexible and tailored performance assessment for each fold.
- Flexibility with Predictions: Choose between prediction functions like 'predict', 'predict_proba', and 'predict_log_proba' using the predict_func parameter (see the sketch after this list).
- Verbose Logging: Gain insight into the training process with detailed logs during the fitting phase, aiding debugging and understanding of model performance across folds.
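For instance, the prediction function can be swapped out when working with classifiers. The snippet below is a minimal sketch: cv, estimator, and predict_func follow the parameters described above, but treat the exact signature and the layout of the 'predict_proba' output as something to verify against your numerblox version.

from sklearn.model_selection import KFold
from xgboost import XGBClassifier
from numerblox.meta import CrossValEstimator
# Request class probabilities instead of point predictions
# from each fold's model via the predict_func parameter.
clf = CrossValEstimator(cv=KFold(n_splits=5),
                        estimator=XGBClassifier(n_estimators=100, max_depth=3),
                        predict_func="predict_proba")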
Example
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from numerblox.meta import CrossValEstimator
# Define the cross-validation strategy
cv = KFold(n_splits=5)
# Initialize the estimator
estimator = XGBRegressor(n_estimators=100, max_depth=3)
# (optional) Define a custom evaluation function
def custom_eval(y_true, y_pred):
    return {"mse": ((y_true - y_pred) ** 2).mean()}
# Initialize the CrossValEstimator
cross_val_estimator = CrossValEstimator(cv=cv,
                                        estimator=estimator,
                                        evaluation_func=custom_eval)
# Fit the CrossValEstimator
cross_val_estimator.fit(X_train, y_train)
predictions = cross_val_estimator.predict(X_test)
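Since one model is fitted per fold, the prediction phase yields one set of outputs per fold. As a minimal sketch, assuming the fold outputs are returned column-wise (worth verifying for your version), a simple ensemble prediction is the mean across folds:

import numpy as np
# Average the per-fold outputs into a single ensemble prediction.
ensemble_prediction = np.asarray(predictions).mean(axis=1)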
MetaPipeline
MetaPipeline extends scikit-learn's Pipeline by integrating models and post-model transformations. It lets you apply transformation steps not just before, but also after your model's predictions. This is particularly useful for post-processing predictions, such as neutralizing feature exposures in financial models.
Why MetaPipeline?
- Post-Model Transformations: It can be crucial to apply transformations, like feature neutralization, after obtaining predictions. MetaPipeline facilitates such operations, leading to improved model generalization and stability.
- Streamlined Workflow: Instead of managing separate sequences for transformations and predictions, you can orchestrate them under a single umbrella, simplifying both development and production workflows.
- Flexible Integration: MetaPipeline gracefully handles a variety of objects, including Pipeline, FeatureUnion, and ColumnTransformer. This makes it a versatile tool adaptable to diverse tasks and data structures (see the sketch at the end of this section).
Example
Consider a scenario where you have an XGBRegressor model and want to apply a FeatureNeutralizer after obtaining the model's predictions:
from xgboost import XGBRegressor
from numerblox.meta import MetaPipeline
from numerblox.neutralizers import FeatureNeutralizer
# Define MetaPipeline steps
steps = [
    ('xgb_regressor', XGBRegressor(n_estimators=100, max_depth=3)),
    ('feature_neutralizer', FeatureNeutralizer(proportion=0.5))
]
# Create MetaPipeline
meta_pipeline = MetaPipeline(steps)
# Train and predict using MetaPipeline
meta_pipeline.fit(X_train, y_train)
predictions = meta_pipeline.predict(X_test)
For a more succinct creation of a MetaPipeline, you can use the make_meta_pipeline function:
from numerblox.meta import make_meta_pipeline
pipeline = make_meta_pipeline(XGBRegressor(n_estimators=100, max_depth=3),
                              FeatureNeutralizer(proportion=0.5))
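Because MetaPipeline builds on scikit-learn's Pipeline, pre-model transformers can sit in the same sequence as the model and its post-model steps. The sketch below assumes that make_meta_pipeline, like scikit-learn's make_pipeline, accepts an arbitrary number of steps:

from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
from numerblox.meta import make_meta_pipeline
from numerblox.neutralizers import FeatureNeutralizer
# Scaling happens before the model, neutralization after its
# predictions, all orchestrated by a single MetaPipeline.
pipeline = make_meta_pipeline(StandardScaler(),
                              XGBRegressor(n_estimators=100, max_depth=3),
                              FeatureNeutralizer(proportion=0.5))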