DSPyMator
Requires DSPy
This estimator requires the dspy optional dependency. Install with:
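pip install "centimators[dspy]"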
centimators.model_estimators.DSPyMator brings the power of Large Language Models (LLMs) to feature engineering and tabular prediction tasks through DSPy. Unlike traditional neural networks that learn patterns from data through gradient descent, DSPyMator leverages pre-trained LLMs and natural language reasoning to make predictions, making it uniquely suited for tasks where domain knowledge, explainability, and few-shot learning are critical.
Why Use DSPyMator?
DSPyMator excels in scenarios where traditional machine learning falls short:
- Few-Shot Learning: Achieve strong performance with limited training data by leveraging the LLM's pre-existing knowledge
- Domain Knowledge Integration: Incorporate reasoning and expert knowledge naturally through task descriptions
- Explainable Predictions: Access the model's reasoning process (when using chain-of-thought)
- Mixed Data Types: Seamlessly handle numerical, categorical, and text features without complex preprocessing
- Rapid Prototyping: Get baseline predictions quickly before investing in traditional model training
- Scikit-learn Compatible: Stack and compose DSPyMator in scikit-learn pipelines, column transformers, and other compatible workflows
How It Works
DSPyMator wraps any DSPy Module (like dspy.Predict or dspy.ChainOfThought) and exposes it through the familiar scikit-learn API. Under the hood:
- Signature Definition: You define input and output fields via DSPy signatures (e.g., "review_text -> sentiment")
- Feature Mapping: DSPyMator automatically maps your dataframe columns to the signature's input fields
- LLM Execution: During prediction, each row is converted into a prompt and sent to the LLM
- Output Extraction: Results are extracted and returned as numpy arrays (for predict) or dataframes (for transform)
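Conceptually, a single prediction is roughly equivalent to the hand-written DSPy call below (a minimal sketch; DSPyMator handles the LM configuration, row iteration, and batching for you):
import dspy
# Roughly what happens for one dataframe row (illustrative values)
program = dspy.Predict("review_text: str -> sentiment: str")
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", temperature=0.0))
prediction = program(review_text="This movie was absolutely fantastic!")
print(prediction.sentiment)  # e.g. 'positive'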
The real power comes from DSPy's optimization capabilities. You can use optimizers like GEPA, MIPROv2, or BootstrapFewShot to automatically improve prompts, select better demonstrations, or even fine-tune the underlying model, all through the standard fit() method.
Usage
Basic Classification
Let's start with a simple sentiment classification task using movie reviews:
import polars as pl
import dspy
from centimators.model_estimators import DSPyMator
# Sample movie reviews
reviews = pl.DataFrame({
"review_text": [
"This movie was absolutely fantastic! A masterpiece.",
"Terrible waste of time. Boring and predictable.",
"Pretty good, though it had some slow moments.",
"One of the worst films I've ever seen.",
"Loved every minute of it! Highly recommended."
],
"sentiment": ["positive", "negative", "neutral", "negative", "positive"]
})
# Define a DSPy program with input and output signature
# You can use dspy.Predict for simple predictions or dspy.ChainOfThought for reasoning
classifier_program = dspy.Predict("review_text: str -> sentiment: str")
# Create the DSPyMator estimator
sentiment_classifier = DSPyMator(
program=classifier_program,
target_names="sentiment", # Which output field to use as predictions
lm="openai/gpt-4o-mini", # Language model to use
temperature=0.0, # Low temperature for consistent outputs
)
# Fit the classifier (establishes the LM configuration)
X = reviews[["review_text"]]
y = reviews["sentiment"]
sentiment_classifier.fit(X, y)
# Make predictions
test_reviews = pl.DataFrame({
"review_text": [
"An incredible journey with stunning visuals.",
"Could barely stay awake through this one."
]
})
predictions = sentiment_classifier.predict(test_reviews[["review_text"]])
print(predictions) # ['positive', 'negative']
Getting Full Outputs with Transform
Unlike predict() which returns only the target field, transform() returns all output fields from the DSPy program:
# Get all outputs (useful for accessing reasoning, confidence, etc.)
full_outputs = sentiment_classifier.transform(test_reviews[["review_text"]])
print(full_outputs)
# Polars DataFrame with column: sentiment
If you're using dspy.ChainOfThought instead of dspy.Predict, you'll also get reasoning:
# Using ChainOfThought to get reasoning
cot_program = dspy.ChainOfThought("review_text: str -> sentiment: str")
cot_classifier = DSPyMator(
program=cot_program,
target_names="sentiment",
)
cot_classifier.fit(X, y)
outputs = cot_classifier.transform(test_reviews[["review_text"]])
print(outputs)
# Polars DataFrame with columns: reasoning, sentiment
Multi-Output Predictions
DSPyMator supports multiple output fields for richer predictions:
# Define a program with multiple outputs
multi_output_program = dspy.Predict(
"review_text: str -> sentiment: str, confidence: float"
)
multi_classifier = DSPyMator(
program=multi_output_program,
target_names=["sentiment", "confidence"], # Multiple targets
)
multi_classifier.fit(X, y)
predictions = multi_classifier.predict(test_reviews[["review_text"]])
print(predictions.shape) # (2, 2) - rows × outputs
Prompt Optimization with GEPA
The real magic happens when you use DSPy optimizers to automatically improve your prompts:
# Define a metric function for optimization
def sentiment_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
"""
Metric that returns score and optional feedback for GEPA.
Args:
gold: The ground truth example
pred: The predicted output
trace: Optional full program trace
pred_name: Optional name of predictor being optimized
pred_trace: Optional trace of specific predictor
Returns:
float score or dspy.Prediction(score=float, feedback=str)
"""
y_pred = pred.sentiment
y_true = gold.sentiment
is_correct = (y_pred == y_true)
score = 1.0 if is_correct else 0.0
# If GEPA is requesting feedback, provide rich textual guidance
if pred_name:
if is_correct:
feedback = f"Correctly classified as {y_pred}."
else:
feedback = f"Incorrect. Predicted {y_pred} but should be {y_true}."
return dspy.Prediction(score=score, feedback=feedback)
return score
# Create a GEPA optimizer
gepa_optimizer = dspy.GEPA(
metric=sentiment_metric,
auto="light", # or "medium", "heavy" for more thoroughness
reflection_minibatch_size=20,
reflection_lm=dspy.LM(model="openai/gpt-4o-mini", temperature=1.0)
)
# Create a fresh classifier
optimized_classifier = DSPyMator(
program=dspy.Predict("review_text: str -> sentiment: str"),
target_names="sentiment",
)
# Fit with optimization (GEPA will improve the prompts)
optimized_classifier.fit(
X,
y,
optimizer=gepa_optimizer,
validation_data=0.3 # Use 30% of data for validation
)
# The optimized program is now ready to use
predictions = optimized_classifier.predict(test_reviews[["review_text"]])
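To inspect what the optimizer actually changed, one option (after running a prediction) is dspy.inspect_history, which prints the most recent calls to the LM, including the optimized instructions and any demonstrations that ended up in the prompt:
# Show the last prompt/response pair sent to the LM
dspy.inspect_history(n=1)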
Few-Shot Learning with Bootstrap
For few-shot learning, use BootstrapFewShot to automatically select good demonstrations:
# Few-shot optimizer doesn't need validation data
bootstrap_optimizer = dspy.BootstrapFewShot(
metric=sentiment_metric,
max_bootstrapped_demos=3, # Number of examples to use
)
few_shot_classifier = DSPyMator(
program=dspy.ChainOfThought("review_text: str -> sentiment: str"),
target_names="sentiment",
)
# Fit with bootstrap (no validation_data needed)
few_shot_classifier.fit(
X,
y,
optimizer=bootstrap_optimizer,
validation_data=None # Few-shot optimizers only need trainset
)
predictions = few_shot_classifier.predict(test_reviews[["review_text"]])
Advanced: Custom Multi-Input Features
DSPyMator automatically maps multiple dataframe columns to signature fields:
# Multi-feature example
movie_data = pl.DataFrame({
"title": ["The Matrix", "Cats"],
"review_text": ["Mind-bending sci-fi classic", "A catastrophic mistake"],
"rating": [5, 1],
})
# Signature with multiple inputs
multi_input_program = dspy.Predict(
"title: str, review_text: str, rating: int -> sentiment: str"
)
multi_input_classifier = DSPyMator(
program=multi_input_program,
target_names="sentiment",
feature_names=["title", "review_text", "rating"], # Map columns to signature
)
# Fit and predict
multi_input_classifier.fit(movie_data[["title", "review_text", "rating"]], None)
predictions = multi_input_classifier.predict(movie_data[["title", "review_text", "rating"]])
Async Execution for Speed
By default, DSPyMator uses async execution for faster batch predictions:
# Async is on by default
fast_classifier = DSPyMator(
program=dspy.Predict("review_text: str -> sentiment: str"),
target_names="sentiment",
use_async=True, # Default behavior
max_concurrent=50, # Max concurrent API requests
verbose=True, # Show progress bar
)
# For synchronous execution (useful for debugging)
sync_classifier = DSPyMator(
program=dspy.Predict("review_text: str -> sentiment: str"),
target_names="sentiment",
use_async=False,
verbose=True,
)
AsyncIO and Local Model Limitations
DSPyMator's async execution mode shines with API-backed LLMs (e.g., OpenAI, Anthropic) because asyncio can overlap many network-bound requests. Local LLM servers often process requests one at a time, so concurrency adds overhead without a speedup; use use_async=False with local backends unless you know yours supports true parallel inference.
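As a sketch, a local setup might look like the following (the ollama_chat/llama3.1 model string is illustrative; use whatever LiteLLM-style identifier your backend accepts):
local_classifier = DSPyMator(
    program=dspy.Predict("review_text: str -> sentiment: str"),
    target_names="sentiment",
    lm="ollama_chat/llama3.1",  # illustrative local model identifier
    use_async=False,  # local servers often handle one request at a time
)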
Pipeline Integration
DSPyMator works seamlessly in scikit-learn pipelines:
from sklearn.pipeline import make_pipeline
llm_pipeline = make_pipeline(
# ... preprocessing steps ...
DSPyMator(
program=dspy.ChainOfThought("review_text: str -> sentiment: str"),
target_names="sentiment"
),
EmbeddingTransformer(
model="openai/text-embedding-3-small", # or your preferred embedding model
feature_names=["reasoning"], # specify which columns to embed
    ),
)
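Once assembled, the pipeline is driven like any other scikit-learn estimator chain. A usage sketch, assuming X holds a review_text column, y holds the labels, and EmbeddingTransformer has been imported alongside DSPyMator:
llm_pipeline.fit(X, y)
reasoning_embeddings = llm_pipeline.transform(X)  # ChainOfThought reasoning embedded as numeric features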