
DSPyMator

Requires DSPy

This estimator requires the dspy optional dependency. Install with:

uv add centimators[dspy]

centimators.model_estimators.DSPyMator brings the power of Large Language Models (LLMs) to feature engineering and tabular prediction tasks through DSPy. Unlike traditional neural networks that learn patterns from data through gradient descent, DSPyMator leverages pre-trained LLMs and natural language reasoning to make predictions, making it uniquely suited for tasks where domain knowledge, explainability, and few-shot learning are critical.

Why Use DSPyMator?

DSPyMator excels in scenarios where traditional machine learning falls short:

  • Few-Shot Learning: Achieve strong performance with limited training data by leveraging the LLM's pre-existing knowledge
  • Domain Knowledge Integration: Incorporate reasoning and expert knowledge naturally through task descriptions
  • Explainable Predictions: Access the model's reasoning process (when using chain-of-thought)
  • Mixed Data Types: Seamlessly handle numerical, categorical, and text features without complex preprocessing
  • Rapid Prototyping: Get baseline predictions quickly before investing in traditional model training
  • Scikit-learn Compatible: Stack and compose DSPyMator in scikit-learn pipelines, column transformers, and other compatible workflows

How It Works

DSPyMator wraps any DSPy Module (like dspy.Predict or dspy.ChainOfThought) and exposes it through the familiar scikit-learn API. Under the hood:

  1. Signature Definition: You define input and output fields via DSPy signatures (e.g., "review_text -> sentiment")
  2. Feature Mapping: DSPyMator automatically maps your dataframe columns to the signature's input fields
  3. LLM Execution: During prediction, each row is converted into a prompt and sent to the LLM
  4. Output Extraction: Results are extracted and returned as numpy arrays (for predict) or dataframes (for transform)

The real power comes from DSPy's optimization capabilities. You can use optimizers like GEPA, MIPROv2, or BootstrapFewShot to automatically improve prompts, select better demonstrations, or even finetune the model—all through the standard fit() method.
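Conceptually, each row is handled like the following minimal sketch written in plain DSPy (an illustration of the flow, not DSPyMator's actual internals): the row's column values become keyword arguments to the program, and the declared output field is what predict() would return for that row.

import dspy

# Configure a language model (DSPyMator handles this for you during fit())
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", temperature=0.0))

# A program whose signature declares one input field and one output field
program = dspy.Predict("review_text: str -> sentiment: str")

# One dataframe row, expressed as keyword arguments matching the input fields
prediction = program(review_text="A thrilling ride from start to finish.")

# The declared output field holds the LLM's answer
print(prediction.sentiment)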

Usage

Basic Classification

Let's start with a simple sentiment classification task using movie reviews:

import polars as pl
import dspy
from centimators.model_estimators import DSPyMator

# Sample movie reviews
reviews = pl.DataFrame({
    "review_text": [
        "This movie was absolutely fantastic! A masterpiece.",
        "Terrible waste of time. Boring and predictable.",
        "Pretty good, though it had some slow moments.",
        "One of the worst films I've ever seen.",
        "Loved every minute of it! Highly recommended."
    ],
    "sentiment": ["positive", "negative", "neutral", "negative", "positive"]
})

# Define a DSPy program with input and output signature
# You can use dspy.Predict for simple predictions or dspy.ChainOfThought for reasoning
classifier_program = dspy.Predict("review_text: str -> sentiment: str")

# Create the DSPyMator estimator
sentiment_classifier = DSPyMator(
    program=classifier_program,
    target_names="sentiment",  # Which output field to use as predictions
    lm="openai/gpt-4o-mini",   # Language model to use
    temperature=0.0,            # Low temperature for consistent outputs
)

# Fit the classifier (establishes the LM configuration)
X = reviews[["review_text"]]
y = reviews["sentiment"]
sentiment_classifier.fit(X, y)

# Make predictions
test_reviews = pl.DataFrame({
    "review_text": [
        "An incredible journey with stunning visuals.",
        "Could barely stay awake through this one."
    ]
})

predictions = sentiment_classifier.predict(test_reviews[["review_text"]])
print(predictions)  # ['positive', 'negative']

Getting Full Outputs with Transform

Unlike predict() which returns only the target field, transform() returns all output fields from the DSPy program:

# Get all outputs (useful for accessing reasoning, confidence, etc.)
full_outputs = sentiment_classifier.transform(test_reviews[["review_text"]])
print(full_outputs)
# Polars DataFrame with column: sentiment

If you're using dspy.ChainOfThought instead of dspy.Predict, you'll also get reasoning:

# Using ChainOfThought to get reasoning
cot_program = dspy.ChainOfThought("review_text: str -> sentiment: str")

cot_classifier = DSPyMator(
    program=cot_program,
    target_names="sentiment",
)

cot_classifier.fit(X, y)
outputs = cot_classifier.transform(test_reviews[["review_text"]])
print(outputs)
# Polars DataFrame with columns: reasoning, sentiment

Multi-Output Predictions

DSPyMator supports multiple output fields for richer predictions:

# Define a program with multiple outputs
multi_output_program = dspy.Predict(
    "review_text: str -> sentiment: str, confidence: float"
)

multi_classifier = DSPyMator(
    program=multi_output_program,
    target_names=["sentiment", "confidence"],  # Multiple targets
)

multi_classifier.fit(X, y)
predictions = multi_classifier.predict(test_reviews[["review_text"]])
print(predictions.shape)  # (2, 2) - rows × outputs

Prompt Optimization with GEPA

The real magic happens when you use DSPy optimizers to automatically improve your prompts:

# Define a metric function for optimization
def sentiment_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    Metric that returns score and optional feedback for GEPA.

    Args:
        gold: The ground truth example
        pred: The predicted output
        trace: Optional full program trace
        pred_name: Optional name of predictor being optimized
        pred_trace: Optional trace of specific predictor

    Returns:
        float score or dspy.Prediction(score=float, feedback=str)
    """
    y_pred = pred.sentiment
    y_true = gold.sentiment
    is_correct = (y_pred == y_true)
    score = 1.0 if is_correct else 0.0

    # If GEPA is requesting feedback, provide rich textual guidance
    if pred_name:
        if is_correct:
            feedback = f"Correctly classified as {y_pred}."
        else:
            feedback = f"Incorrect. Predicted {y_pred} but should be {y_true}."

        return dspy.Prediction(score=score, feedback=feedback)

    return score

# Create a GEPA optimizer
gepa_optimizer = dspy.GEPA(
    metric=sentiment_metric,
    auto="light",  # or "medium", "heavy" for more thoroughness
    reflection_minibatch_size=20,
    reflection_lm=dspy.LM(model="openai/gpt-4o-mini", temperature=1.0)
)

# Create a fresh classifier
optimized_classifier = DSPyMator(
    program=dspy.Predict("review_text: str -> sentiment: str"),
    target_names="sentiment",
)

# Fit with optimization (GEPA will improve the prompts)
optimized_classifier.fit(
    X, 
    y, 
    optimizer=gepa_optimizer,
    validation_data=0.3  # Use 30% of data for validation
)

# The optimized program is now ready to use
predictions = optimized_classifier.predict(test_reviews[["review_text"]])

Few-Shot Learning with Bootstrap

For few-shot learning, use BootstrapFewShot to automatically select good demonstrations:

# Few-shot optimizer doesn't need validation data
bootstrap_optimizer = dspy.BootstrapFewShot(
    metric=sentiment_metric,
    max_bootstrapped_demos=3,  # Number of examples to use
)

few_shot_classifier = DSPyMator(
    program=dspy.ChainOfThought("review_text: str -> sentiment: str"),
    target_names="sentiment",
)

# Fit with bootstrap (no validation_data needed)
few_shot_classifier.fit(
    X,
    y,
    optimizer=bootstrap_optimizer,
    validation_data=None  # Few-shot optimizers only need trainset
)

predictions = few_shot_classifier.predict(test_reviews[["review_text"]])

Advanced: Custom Multi-Input Features

DSPyMator automatically maps multiple dataframe columns to signature fields:

# Multi-feature example
movie_data = pl.DataFrame({
    "title": ["The Matrix", "Cats"],
    "review_text": ["Mind-bending sci-fi classic", "A catastrophic mistake"],
    "rating": [5, 1],
})

# Signature with multiple inputs
multi_input_program = dspy.Predict(
    "title: str, review_text: str, rating: int -> sentiment: str"
)

multi_input_classifier = DSPyMator(
    program=multi_input_program,
    target_names="sentiment",
    feature_names=["title", "review_text", "rating"],  # Map columns to signature
)

# Fit and predict
multi_input_classifier.fit(movie_data[["title", "review_text", "rating"]], None)
predictions = multi_input_classifier.predict(movie_data[["title", "review_text", "rating"]])

Configuring Language Models

DSPyMator accepts either a model string or a pre-configured dspy.LM object for the lm parameter.

Simple usage with model string:

# Uses default OpenAI API (requires OPENAI_API_KEY env var)
classifier = DSPyMator(
    program=dspy.Predict("text -> label"),
    target_names="label",
    lm="openai/gpt-4o-mini",
    temperature=0.0,
    max_tokens=1000,
)

Using custom providers:

For custom API configuration, pass a pre-configured dspy.LM object. DSPy uses LiteLLM under the hood, so any LiteLLM-supported provider works. Extra kwargs are passed through to LiteLLM:

import dspy

# Pass a pre-configured LM object
classifier = DSPyMator(
    program=dspy.Predict("text -> label"),
    target_names="label",
    lm=dspy.LM(
        "openrouter/anthropic/claude-3-haiku",
        temperature=0.1,
        max_tokens=1000,
        # Additional kwargs are passed to LiteLLM
    ),
)

Temperature and max_tokens

When passing a dspy.LM object, configure temperature and max_tokens on the LM directly. The DSPyMator parameters are ignored when using a pre-configured LM.

Environment variables:

Most providers are configured via environment variables. Set them before calling fit():

import os

# OpenAI
os.environ["OPENAI_API_KEY"] = "sk-..."

# OpenRouter
os.environ["OPENROUTER_API_KEY"] = "sk-or-..."

# Anthropic
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

See the DSPy LM documentation and LiteLLM provider docs for supported providers and configuration.

Async Execution for Speed

By default, DSPyMator uses async execution for faster batch predictions:

# Async is on by default
fast_classifier = DSPyMator(
    program=dspy.Predict("review_text: str -> sentiment: str"),
    target_names="sentiment",
    use_async=True,        # Default behavior
    max_concurrent=50,     # Max concurrent API requests
    verbose=True,          # Show progress bar
)

# For synchronous execution (useful for debugging)
sync_classifier = DSPyMator(
    program=dspy.Predict("review_text: str -> sentiment: str"),
    target_names="sentiment",
    use_async=False,
    verbose=True,
)

AsyncIO and Local Model Limitations

DSPyMator's async mode uses asyncio to issue many requests concurrently, which pays off with API-backed LLMs (e.g., OpenAI, Anthropic) where each call is I/O-bound. Local inference servers often process requests serially, so concurrency buys little there; prefer use_async=False with local LLMs unless you know your backend supports parallel requests (see the sketch below).
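For example, a locally served model (here via Ollama through LiteLLM; the model name and api_base are illustrative and depend on your setup) pairs naturally with synchronous execution:

import dspy
from centimators.model_estimators import DSPyMator

# Local model served by Ollama; adjust the model name and api_base for your server
local_lm = dspy.LM(
    "ollama_chat/llama3.2",
    api_base="http://localhost:11434",
    api_key="",
)

local_classifier = DSPyMator(
    program=dspy.Predict("review_text: str -> sentiment: str"),
    target_names="sentiment",
    lm=local_lm,
    use_async=False,  # a single local backend rarely benefits from many concurrent requests
)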

Pipeline Integration

DSPyMator works seamlessly in scikit-learn pipelines:

from sklearn.pipeline import make_pipeline

llm_pipeline = make_pipeline(
    # ... preprocessing steps ...
    DSPyMator(
        program=dspy.ChainOfThought("review_text: str -> sentiment: str"),
        target_names="sentiment",
    ),
    EmbeddingTransformer(
        model="openai/text-embedding-3-small",  # or your preferred embedding model
        feature_names=["reasoning"],            # which upstream output columns to embed
    ),
)

Parameters Reference

Parameter | Type | Default | Description
program | dspy.Module | required | DSPy module (e.g., dspy.Predict, dspy.ChainOfThought) with a signature defining input/output fields
target_names | str or list[str] | required | Output field name(s) to use as predictions
feature_names | list[str] or None | None | Column names mapping input data to signature fields. If None, inferred from dataframe columns
lm | str or dspy.LM | "openai/gpt-5-nano" | Language model: either a string identifier or a pre-configured dspy.LM object
temperature | float | 1.0 | Sampling temperature (ignored if lm is a dspy.LM object)
max_tokens | int | 16000 | Maximum tokens in responses (ignored if lm is a dspy.LM object)
use_async | bool | True | Use async execution for batch predictions
max_concurrent | int | 50 | Maximum concurrent requests in async mode
verbose | bool | True | Show progress bars during prediction
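As a quick reference, here is a constructor call spelling out every parameter with the defaults from the table above (the program and target_names values are placeholders):

import dspy
from centimators.model_estimators import DSPyMator

estimator = DSPyMator(
    program=dspy.Predict("review_text: str -> sentiment: str"),  # required
    target_names="sentiment",   # required
    feature_names=None,         # infer input fields from dataframe columns
    lm="openai/gpt-5-nano",     # default model string
    temperature=1.0,
    max_tokens=16000,
    use_async=True,
    max_concurrent=50,
    verbose=True,
)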

fit() Parameters

Parameter | Type | Default | Description
X | DataFrame or array | required | Training features
y | Series or array | required | Target values (can be None for unsupervised use)
optimizer | dspy.Optimizer or None | None | DSPy optimizer instance (e.g., dspy.GEPA, dspy.BootstrapFewShot)
validation_data | tuple, float, or None | None | Validation data as an (X_val, y_val) tuple, a float giving the fraction of training data held out for validation, or None
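For completeness, the tuple form of validation_data looks like the following sketch (X_train, y_train, X_val, and y_val are assumed to be splits you have already prepared; gepa_optimizer is the optimizer from the GEPA example above):

# Pass an explicit held-out split instead of a fraction
optimized_classifier.fit(
    X_train,
    y_train,
    optimizer=gepa_optimizer,
    validation_data=(X_val, y_val),  # (features, targets) used by the optimizer for evaluation
)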