DSPyMator Tutorial¶
This tutorial demonstrates how to use DSPyMator, a scikit-learn compatible wrapper that brings the power of Large Language Models (LLMs) to tabular prediction tasks through DSPy. You'll learn how to:
- Build a basic text classifier with DSPyMator
- Use predict() vs transform() to get different outputs
- Extract reasoning with ChainOfThought
- Optimize prompts automatically with GEPA and fit()
- Build advanced pipelines with embeddings and dimensionality reduction
Overview¶
DSPyMator wraps any DSPy module (e.g., dspy.Predict, dspy.ChainOfThought) and exposes it through the familiar scikit-learn API. It enables LLM-based predictions that work seamlessly with sklearn pipelines, cross-validation, and other ML tooling. Unlike traditional ML models that learn patterns through gradient descent, DSPyMator leverages pre-trained LLMs and natural language reasoning—making it ideal for tasks where domain knowledge, explainability, and few-shot learning are critical.
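As a minimal sketch of the pattern (the signature string here is purely illustrative; the concrete sentiment setup follows in the sections below):
import dspy
from centimators.model_estimators import DSPyMator
# Wrap any DSPy module in a sklearn-style estimator
program = dspy.Predict("text: str -> label: str")  # illustrative signature
estimator = DSPyMator(program=program, target_names="label")
# From here, the familiar sklearn calls apply:
# estimator.fit(X, y), then estimator.predict(X)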
Prerequisites¶
To run this tutorial, you'll need:
- An OpenAI API key
- The `dspy` library for LLM orchestration
- The `datasets` library from Hugging Face for loading the Rotten Tomatoes dataset
- `cluestar` for interactive text visualization
Warning: This tutorial uses an LLM via an API (like OpenAI's). Running the code, especially the prompt optimization part, will make calls to this API and may incur costs.
# !pip install centimators[all] datasets cluestar
1. Load the Dataset¶
We'll use the Rotten Tomatoes dataset from Hugging Face - a popular movie review dataset for sentiment analysis. We'll load a subset for faster execution in this tutorial.
# import os
# os.environ["OPENAI_API_KEY"] = "sk-proj-..."
import polars as pl
from datasets import load_dataset
# Load the Rotten Tomatoes movie review dataset from Hugging Face
print("Loading Rotten Tomatoes dataset...")
dataset = load_dataset("rotten_tomatoes")
# Convert to polars and take a subset for faster execution
# Using 300 samples for training and 100 for testing
train_data = dataset["train"].shuffle(seed=42).select(range(300))
test_data = dataset["test"].shuffle(seed=42).select(range(100))
# Convert to polars DataFrames
train_df = pl.DataFrame(
{
"review": train_data["text"],
"sentiment": [
"positive" if label == 1 else "negative" for label in train_data["label"]
],
}
)
test_df = pl.DataFrame(
{
"review": test_data["text"],
"sentiment": [
"positive" if label == 1 else "negative" for label in test_data["label"]
],
}
)
# Prepare X and y
X_train = train_df[["review"]]
y_train = train_df["sentiment"]
X_test = test_df[["review"]]
y_test = test_df["sentiment"]
print("\n✓ Dataset loaded successfully!")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print("\nSentiment distribution (train):")
display(y_train.value_counts())
print("\nSample reviews:")
train_df.head(3)
Loading Rotten Tomatoes dataset...

✓ Dataset loaded successfully!
Training samples: 300
Test samples: 100

Sentiment distribution (train):
| sentiment | count |
|---|---|
| str | u32 |
| "negative" | 146 |
| "positive" | 154 |
Sample reviews:
| review | sentiment |
|---|---|
| str | str |
| ". . . plays like somebody spli… | "negative" |
| "michael moore has perfected th… | "positive" |
| ". . . too gory to be a comedy … | "negative" |
2. Basic Usage: Sentiment Classification Predictions¶
Let's start with a simple sentiment classifier using dspy.Predict. We'll define a signature that maps review text to sentiment.
import dspy
from centimators.model_estimators import DSPyMator
# Define a DSPy program with input -> output signature
# Format: "input_field: type -> output_field: type"
sentiment_program = dspy.Predict("review: str -> sentiment: str")
# Create the DSPyMator estimator
classifier = DSPyMator(
program=sentiment_program,
target_names="sentiment", # Which output field to use as prediction
)
classifier.fit(X_train, y_train) # establishes LM configuration
# Predict sentiments
predictions = classifier.predict(X_test)
# Create a results dataframe
results = pl.DataFrame(
{"review": X_test["review"], "predicted": predictions, "actual": y_test}
)
print("\nPrediction Results:")
display(results)
# Calculate accuracy
accuracy = (results["actual"] == results["predicted"]).sum() / len(results)
print(f"\nAccuracy: {accuracy:.2%}")
DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 379.11it/s]
Prediction Results:
| review | predicted | actual |
|---|---|---|
| str | str | str |
| "unpretentious , charming , qui… | "positive" | "positive" |
| "a film really has to be except… | "negative" | "negative" |
| "working from a surprisingly se… | "positive" | "positive" |
| "it may not be particularly inn… | "positive" | "positive" |
| "such a premise is ripe for all… | "negative" | "negative" |
| … | … | … |
| "ice age is the first computer-… | "negative" | "negative" |
| "there's no denying that burns … | "Positive" | "positive" |
| "it collapses when mr . taylor … | "negative" | "negative" |
| "there's a great deal of corny … | "positive" | "positive" |
| "ah , the travails of metropoli… | "negative" | "negative" |
Accuracy: 79.00%
3. Adding Reasoning with ChainOfThought¶
One of the most powerful features of LLMs is their ability to explain their reasoning. Let's use dspy.ChainOfThought to get not just predictions, but also the reasoning behind them.
# Use the same signature as before, but wrapped in ChainOfThought (adds a reasoning step)
cot_program = dspy.ChainOfThought("review: str -> sentiment: str")
cot_classifier = DSPyMator(program=cot_program, target_names="sentiment")
# Fit and transform to get reasoning
outputs_with_reasoning = cot_classifier.fit_transform(X_test)
# Create a results dataframe
results = pl.DataFrame(
{
"review": X_test["review"],
"reasoning": outputs_with_reasoning["reasoning"],
"predicted": outputs_with_reasoning["sentiment"],
"actual": y_test,
}
)
print("\nPrediction Results:")
display(results)
# Calculate accuracy
accuracy = (results["actual"] == results["predicted"]).sum() / len(results)
print(f"\nAccuracy: {accuracy:.2%}")
DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 1715.87it/s]
Prediction Results:
| review | reasoning | predicted | actual |
|---|---|---|---|
| str | str | str | str |
| "unpretentious , charming , qui… | "This short review uses multipl… | "positive" | "positive" |
| "a film really has to be except… | "This review claims the film is… | "negative" | "negative" |
| "working from a surprisingly se… | "The reviewer describes the scr… | "positive" | "positive" |
| "it may not be particularly inn… | "The reviewer notes a lack of i… | "positive" | "positive" |
| "such a premise is ripe for all… | "The review expresses a negativ… | "negative" | "negative" |
| … | … | … | … |
| "ice age is the first computer-… | "The review criticizes the paci… | "negative" | "negative" |
| "there's no denying that burns … | "The review expresses praise an… | "positive" | "positive" |
| "it collapses when mr . taylor … | "The reviewer criticizes the sh… | "negative" | "negative" |
| "there's a great deal of corny … | "The reviewer acknowledges flaw… | "Positive" | "positive" |
| "ah , the travails of metropoli… | "The reviewer expresses dissati… | "negative" | "negative" |
Accuracy: 77.00%
4. Prompt Optimization with GEPA¶
DSPyMator supports automatic prompt optimization using DSPy optimizers. Let's use GEPA (Genetic-Pareto), a reflective prompt-evolution optimizer, to automatically improve our prompts based on training data.
GEPA iteratively refines prompts by analyzing errors and generating better instructions.
WARNING!! Running a full GEPA optimization can require a significant number of API calls and credits.
# Define a metric function for optimization
def sentiment_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
"""
GEPA-compatible metric that returns score and textual feedback.
Args:
gold: The ground truth example
pred: The predicted output
trace: Optional full program trace
pred_name: Optional name of predictor being optimized
pred_trace: Optional trace of specific predictor
Returns:
float score or dspy.Prediction(score=float, feedback=str)
"""
y_pred = pred.sentiment
y_true = gold.sentiment
is_correct = y_pred == y_true
score = 1.0 if is_correct else 0.0
# If GEPA is requesting predictor-level feedback, provide rich guidance
if pred_name:
if is_correct:
feedback = f"Correctly classified as {y_pred}."
else:
feedback = (
f"Incorrect prediction. Predicted '{y_pred}' but actual was '{y_true}'. "
f"Review text: '{gold.review}'"
)
# Add reasoning context if available
if hasattr(pred, "reasoning"):
feedback += f" Reasoning: {pred.reasoning}"
return dspy.Prediction(score=score, feedback=feedback)
return score
# Create a light/constrained GEPA optimizer so the demo runs faster and cheaper
gepa_optimizer = dspy.teleprompt.GEPA(
metric=sentiment_metric,
auto="light",
reflection_minibatch_size=10,
reflection_lm=dspy.LM(model="openai/gpt-5-nano", temperature=1.0, max_tokens=16000),
)
# Fit with optimization
print(
"Starting GEPA Optimization (this may take a long time and cost a lot of money)..."
)
preoptimized_instructions = cot_classifier.signature_.instructions
cot_classifier.fit(
X_train,
y_train,
optimizer=gepa_optimizer,
validation_data=0.3, # Use 30% of training data for validation
)
print("\n✓ Optimization complete!")
optimized_predictions = cot_classifier.predict(X_test)
optimized_results = pl.DataFrame(
{"review": X_test["review"], "actual": y_test, "predicted": optimized_predictions}
)
optimized_accuracy = (
optimized_results["actual"] == optimized_results["predicted"]
).sum() / len(optimized_results)
print(f"Preoptimized accuracy: {accuracy:.2%}")
print(f"Preoptimized instructions: {preoptimized_instructions} \n")
print(f"Optimized accuracy: {optimized_accuracy:.2%}")
print(
f"Optimized instructions (first 750 characters): {cot_classifier.signature_.instructions[:750]}"
)
DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 551.67it/s]
Preoptimized accuracy: 77.00%
Preoptimized instructions: Given the fields `review`, produce the fields `sentiment`.

Optimized accuracy: 89.00%
Optimized instructions (first 750 characters): New instruction for binary sentiment classification of film reviews

Task
- Determine the overall sentiment toward the film described in a single English review.
- Output exactly one field: sentiment: "positive" or "negative" (all lowercase, no quotes beyond the field value).
- Do not include any other fields, text, or explanations.

Input
- review: A single English text review of a film. It may contain punctuation, quotes, references to acting, directing, plot, cinematography, etc.

Output format
- Only the line: sentiment: positive
  or: sentiment: negative
- Do not prepend, append, or include any reasoning, justification, or extraneous text.

Decision rules
1) Overall sentiment
- If the review expresses clear praise or a positive ov
5. Compose DSPyMator with other feature transformers in scikit-learn pipelines¶
While predict() returns only the pre-specified target_names, transform() returns all output fields from the DSPy program. This is useful when you want access to intermediate outputs or additional fields, like reasoning traces.
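For example, with the ChainOfThought classifier from section 3 (a short sketch reusing objects defined above):
# predict() returns only the field(s) named in target_names
sentiments = cot_classifier.predict(X_test)
# transform() returns every output field of the program,
# here both the reasoning trace and the sentiment
all_fields = cot_classifier.transform(X_test)  # has 'reasoning' and 'sentiment' columns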
A Pipeline Integration Example: From Text to Embeddings to Visualization¶
One of DSPyMator's strengths is its compatibility with scikit-learn pipelines. Let's build a pipeline that:
- Uses DSPyMator to generate reasoning about each review
- Embeds the reasoning using `EmbeddingTransformer`
- Reduces dimensionality with `DimReducer`
This creates a feature extraction pipeline where LLM reasoning becomes structured numerical features.
from sklearn.pipeline import make_pipeline
from centimators.feature_transformers import EmbeddingTransformer, DimReducer
dspymator = DSPyMator(
program=dspy.ChainOfThought("review: str -> sentiment: str"),
target_names="sentiment",
)
# Create an embedder that embeds the reasoning text
embedder = EmbeddingTransformer(
model="openai/text-embedding-3-small",
feature_names=["reasoning"], # Embed the reasoning field
)
# Create a dimensionality reducer
dim_reducer = DimReducer(
method="umap",
n_components=2, # Reduce embeddings to 2D for visualization
)
# Build the pipeline
pipeline = make_pipeline(dspymator, embedder, dim_reducer)
print("Pipeline created:")
display(pipeline)
Pipeline created:
Pipeline(steps=[('dspymator',
DSPyMator(program=predict = Predict(StringSignature(review -> reasoning, sentiment
instructions='Given the fields `review`, produce the fields `sentiment`.'
review = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Review:', 'desc': '${review}'})
reasoning = Field(annotation=str required=True json_sche...
sentiment = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Sentiment:', 'desc': '${sentiment}'})
)),
target_names='sentiment',
feature_names=None,
lm='openai/gpt-5-nano',
temperature=1.0,
max_tokens=16000,
use_async=True,
max_concurrent=50,
verbose=True)),
('embeddingtransformer',
EmbeddingTransformer(categorical_mapping={},
feature_names=['reasoning'],
model='openai/text-embedding-3-small')),
('dimreducer', DimReducer(method='umap'))])
# Run the full pipeline on the 300-review training subset (to save time and API costs)
print("\nRunning pipeline: DSPyMator → Embeddings → UMAP...\n")
# Fit and transform
reduced_features = pipeline.fit_transform(X_train, y_train)
print("\nFirst few rows:")
display(reduced_features.head())
Running pipeline: DSPyMator → Embeddings → UMAP...
DSPyMator predicting: 100%|██████████| 300/300 [00:00<00:00, 618.77it/s]
First few rows:
| dim_0 | dim_1 |
|---|---|
| f32 | f32 |
| 0.286903 | 5.589123 |
| 7.863194 | 4.594091 |
| 0.060558 | 5.437039 |
| 8.057485 | 5.643534 |
| 1.149364 | 4.60095 |
Visualize the Reasoning Embeddings¶
Let's visualize how the LLM's reasoning clusters different sentiments in 2D space. By embedding and visualizing the classifier's reasoning, we can see how the two sentiment clusters separate and inspect why particular examples were misclassified.
import cluestar
# Create an interactive visualization with cluestar
cluestar.plot_text(
X=reduced_features,
texts=X_train["review"].to_list(),
color_array=y_train.to_list(),
)
6. Key Takeaways¶
In this tutorial, you learned:
✅ Basic Usage: How to wrap DSPy programs with DSPyMator for sklearn compatibility
✅ Prediction Methods:
- `predict()` returns only the target field(s)
- `transform()` returns all output fields (including reasoning)
✅ Chain of Thought: Using dspy.ChainOfThought to get explainable predictions
✅ Optimization: Leveraging GEPA to automatically improve prompts based on training data
✅ Pipeline Integration: Building end-to-end pipelines combining LLM reasoning, embeddings, and dimensionality reduction
Next Steps¶
- Experiment with other DSPy optimizers like `MIPROv2` or `BootstrapFewShot`
- Try different LLM providers (Anthropic, local models, etc.)
- Combine DSPyMator with traditional ML models in ensemble pipelines (see the sketch below)
- Explore multi-output predictions for richer feature extraction
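For that ensemble idea, here is a hedged starting sketch. It assumes (not verified here) that EmbeddingTransformer's numeric output can feed a downstream sklearn estimator directly, just as it feeds DimReducer in the pipeline above:
import dspy
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from centimators.model_estimators import DSPyMator
from centimators.feature_transformers import EmbeddingTransformer
# LLM reasoning -> embeddings -> a classical linear model on top
ensemble = make_pipeline(
    DSPyMator(
        program=dspy.ChainOfThought("review: str -> sentiment: str"),
        target_names="sentiment",
    ),
    EmbeddingTransformer(
        model="openai/text-embedding-3-small",
        feature_names=["reasoning"],  # embed the reasoning traces
    ),
    # assumption: the embedding matrix is directly consumable here
    LogisticRegression(max_iter=1000),
)
# ensemble.fit(X_train, y_train)
# ensemble.predict(X_test)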