DSPyMator Tutorial¶
This tutorial demonstrates how to use DSPyMator, a scikit-learn compatible wrapper that brings the power of Large Language Models (LLMs) to tabular prediction tasks through DSPy. You'll learn how to:
- Build a basic text classifier with DSPyMator
- Use predict() vs transform() to get different outputs
- Extract reasoning with ChainOfThought
- Optimize prompts automatically with GEPA and fit()
- Build advanced pipelines with embeddings and dimensionality reduction
Overview¶
DSPyMator wraps any DSPy module (e.g., dspy.Predict, dspy.ChainOfThought) and exposes it through the familiar scikit-learn API. It enables LLM-based predictions that work seamlessly with sklearn pipelines, cross-validation, and other ML tooling. Unlike traditional ML models that learn patterns through gradient descent, DSPyMator leverages pre-trained LLMs and natural language reasoning—making it ideal for tasks where domain knowledge, explainability, and few-shot learning are critical.
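As a minimal sketch of the pattern (the signature string here is purely illustrative; the concrete sentiment setup follows in the sections below):
import dspy
from centimators.model_estimators import DSPyMator
# Wrap any DSPy module in a sklearn-style estimator
program = dspy.Predict("text: str -> label: str")  # illustrative signature
estimator = DSPyMator(program=program, target_names="label")
# From here, the familiar sklearn calls apply:
# estimator.fit(X, y), then estimator.predict(X)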
Prerequisites¶
To run this tutorial, you'll need:
- An OpenAI API key
- The `dspy` library for LLM orchestration
- The `datasets` library from Hugging Face for loading the Rotten Tomatoes dataset
- `cluestar` for interactive text visualization
Warning: This tutorial uses an LLM via an API (like OpenAI's). Running the code, especially the prompt optimization part, will make calls to this API and may incur costs.
# !pip install centimators[all] datasets cluestar
1. Load the Dataset¶
We'll use the Rotten Tomatoes dataset from Hugging Face - a popular movie review dataset for sentiment analysis. We'll load a subset for faster execution in this tutorial.
# import os
# os.environ["OPENAI_API_KEY"] = "sk-proj-..."
import polars as pl
from datasets import load_dataset
# Load the Rotten Tomatoes movie review dataset from Hugging Face
print("Loading Rotten Tomatoes dataset...")
dataset = load_dataset("rotten_tomatoes")
# Convert to polars and take a subset for faster execution
# Using 300 samples for training and 100 for testing
train_data = dataset["train"].shuffle(seed=42).select(range(300))
test_data = dataset["test"].shuffle(seed=42).select(range(100))
# Convert to polars DataFrames
train_df = pl.DataFrame(
{
"review": train_data["text"],
"sentiment": [
"positive" if label == 1 else "negative" for label in train_data["label"]
],
}
)
test_df = pl.DataFrame(
{
"review": test_data["text"],
"sentiment": [
"positive" if label == 1 else "negative" for label in test_data["label"]
],
}
)
# Prepare X and y
X_train = train_df[["review"]]
y_train = train_df["sentiment"]
X_test = test_df[["review"]]
y_test = test_df["sentiment"]
print("\n✓ Dataset loaded successfully!")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print("\nSentiment distribution (train):")
display(y_train.value_counts())
print("\nSample reviews:")
train_df.head(3)
Loading Rotten Tomatoes dataset...

✓ Dataset loaded successfully!
Training samples: 300
Test samples: 100

Sentiment distribution (train):
| sentiment | count |
|---|---|
| str | u32 |
| "negative" | 146 |
| "positive" | 154 |
Sample reviews:
| review | sentiment |
|---|---|
| str | str |
| ". . . plays like somebody spli… | "negative" |
| "michael moore has perfected th… | "positive" |
| ". . . too gory to be a comedy … | "negative" |
2. Basic Usage: Sentiment Classification Predictions¶
Let's start with a simple sentiment classifier using dspy.Predict. We'll define a signature that maps review text to sentiment.
import dspy
from centimators.model_estimators import DSPyMator
# Define a DSPy program with input -> output signature
# Format: "input_field: type -> output_field: type"
sentiment_program = dspy.Predict("review: str -> sentiment: str")
# Create the DSPyMator estimator
classifier = DSPyMator(
program=sentiment_program,
target_names="sentiment", # Which output field to use as prediction
)
classifier.fit(X_train, y_train) # establishes LM configuration
# Predict sentiments
predictions = classifier.predict(X_test)
# Create a results dataframe
results = pl.DataFrame(
{"review": X_test["review"], "predicted": predictions, "actual": y_test}
)
print("\nPrediction Results:")
display(results)
# Calculate accuracy
accuracy = (results["actual"] == results["predicted"]).sum() / len(results)
print(f"\nAccuracy: {accuracy:.2%}")
DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 379.11it/s]
Prediction Results:
| review | predicted | actual |
|---|---|---|
| str | str | str |
| "unpretentious , charming , qui… | "positive" | "positive" |
| "a film really has to be except… | "negative" | "negative" |
| "working from a surprisingly se… | "positive" | "positive" |
| "it may not be particularly inn… | "positive" | "positive" |
| "such a premise is ripe for all… | "negative" | "negative" |
| … | … | … |
| "ice age is the first computer-… | "negative" | "negative" |
| "there's no denying that burns … | "Positive" | "positive" |
| "it collapses when mr . taylor … | "negative" | "negative" |
| "there's a great deal of corny … | "positive" | "positive" |
| "ah , the travails of metropoli… | "negative" | "negative" |
Accuracy: 79.00%
3. Adding Reasoning with ChainOfThought¶
One of the most powerful features of LLMs is their ability to explain their reasoning. Let's use dspy.ChainOfThought to get not just predictions, but also the reasoning behind them.
# Use the same signature as before, but wrapped in ChainOfThought (adds a reasoning step)
cot_program = dspy.ChainOfThought("review: str -> sentiment: str")
cot_classifier = DSPyMator(program=cot_program, target_names="sentiment")
# Fit and transform to get reasoning
outputs_with_reasoning = cot_classifier.fit_transform(X_test)
# Create a results dataframe
results = pl.DataFrame(
{
"review": X_test["review"],
"reasoning": outputs_with_reasoning["reasoning"],
"predicted": outputs_with_reasoning["sentiment"],
"actual": y_test,
}
)
print("\nPrediction Results:")
display(results)
# Calculate accuracy
accuracy = (results["actual"] == results["predicted"]).sum() / len(results)
print(f"\nAccuracy: {accuracy:.2%}")
DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 1715.87it/s]
Prediction Results:
| review | reasoning | predicted | actual |
|---|---|---|---|
| str | str | str | str |
| "unpretentious , charming , qui… | "This short review uses multipl… | "positive" | "positive" |
| "a film really has to be except… | "This review claims the film is… | "negative" | "negative" |
| "working from a surprisingly se… | "The reviewer describes the scr… | "positive" | "positive" |
| "it may not be particularly inn… | "The reviewer notes a lack of i… | "positive" | "positive" |
| "such a premise is ripe for all… | "The review expresses a negativ… | "negative" | "negative" |
| … | … | … | … |
| "ice age is the first computer-… | "The review criticizes the paci… | "negative" | "negative" |
| "there's no denying that burns … | "The review expresses praise an… | "positive" | "positive" |
| "it collapses when mr . taylor … | "The reviewer criticizes the sh… | "negative" | "negative" |
| "there's a great deal of corny … | "The reviewer acknowledges flaw… | "Positive" | "positive" |
| "ah , the travails of metropoli… | "The reviewer expresses dissati… | "negative" | "negative" |
Accuracy: 77.00%
4. Prompt Optimization with GEPA¶
DSPyMator supports automatic prompt optimization using DSPy optimizers. Let's use GEPA (Genetic-Pareto), a reflective prompt-evolution optimizer, to automatically improve our prompts based on training data.
GEPA iteratively refines prompts by analyzing errors and generating better instructions.
WARNING!! Running a full GEPA optimization can require a significant number of API calls and credits.
# Define a metric function for optimization
def sentiment_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
"""
GEPA-compatible metric that returns score and textual feedback.
Args:
gold: The ground truth example
pred: The predicted output
trace: Optional full program trace
pred_name: Optional name of predictor being optimized
pred_trace: Optional trace of specific predictor
Returns:
float score or dspy.Prediction(score=float, feedback=str)
"""
y_pred = pred.sentiment
y_true = gold.sentiment
is_correct = y_pred == y_true
score = 1.0 if is_correct else 0.0
# If GEPA is requesting predictor-level feedback, provide rich guidance
if pred_name:
if is_correct:
feedback = f"Correctly classified as {y_pred}."
else:
feedback = (
f"Incorrect prediction. Predicted '{y_pred}' but actual was '{y_true}'. "
f"Review text: '{gold.review}'"
)
# Add reasoning context if available
if hasattr(pred, "reasoning"):
feedback += f" Reasoning: {pred.reasoning}"
return dspy.Prediction(score=score, feedback=feedback)
return score
# Create a light/constrained GEPA optimizer so the demo runs faster and cheaper
gepa_optimizer = dspy.teleprompt.GEPA(
metric=sentiment_metric,
auto="light",
reflection_minibatch_size=10,
reflection_lm=dspy.LM(model="openai/gpt-5-nano", temperature=1.0, max_tokens=16000),
)
# Fit with optimization
print(
"Starting GEPA Optimization (this may take a long time and cost a lot of money)..."
)
preoptimized_instructions = cot_classifier.signature_.instructions
cot_classifier.fit(
X_train,
y_train,
optimizer=gepa_optimizer,
validation_data=0.3, # Use 30% of training data for validation
)
print("\n✓ Optimization complete!")
optimized_predictions = cot_classifier.predict(X_test)
optimized_results = pl.DataFrame(
{"review": X_test["review"], "actual": y_test, "predicted": optimized_predictions}
)
optimized_accuracy = (
optimized_results["actual"] == optimized_results["predicted"]
).sum() / len(optimized_results)
print(f"Preoptimized accuracy: {accuracy:.2%}")
print(f"Preoptimized instructions: {preoptimized_instructions} \n")
print(f"Optimized accuracy: {optimized_accuracy:.2%}")
print(
f"Optimized instructions (first 750 characters): {cot_classifier.signature_.instructions[:750]}"
)
DSPyMator predicting: 100%|██████████| 100/100 [00:00<00:00, 551.67it/s]
Preoptimized accuracy: 77.00%
Preoptimized instructions: Given the fields `review`, produce the fields `sentiment`.

Optimized accuracy: 89.00%
Optimized instructions (first 750 characters): New instruction for binary sentiment classification of film reviews

Task
- Determine the overall sentiment toward the film described in a single English review.
- Output exactly one field: sentiment: "positive" or "negative" (all lowercase, no quotes beyond the field value).
- Do not include any other fields, text, or explanations.

Input
- review: A single English text review of a film. It may contain punctuation, quotes, references to acting, directing, plot, cinematography, etc.

Output format
- Only the line: sentiment: positive
  or: sentiment: negative
- Do not prepend, append, or include any reasoning, justification, or extraneous text.

Decision rules
1) Overall sentiment
- If the review expresses clear praise or a positive ov
5. Compose DSPyMator with other feature transformers in scikit-learn pipelines¶
While predict() returns only the pre-specified target_names, transform() returns all output fields from the DSPy program. This is useful when you want access to intermediate outputs or additional fields, like reasoning traces.
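For example, with the ChainOfThought classifier from section 3 (a short sketch reusing objects defined above):
# predict() returns only the field(s) named in target_names
sentiments = cot_classifier.predict(X_test)
# transform() returns every output field of the program,
# here both the reasoning trace and the sentiment
all_fields = cot_classifier.transform(X_test)  # has 'reasoning' and 'sentiment' columns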
A Pipeline Integration Example: From Text to Embeddings to Visualization¶
One of DSPyMator's strengths is its compatibility with scikit-learn pipelines. Let's build a pipeline that:
- Uses DSPyMator to generate reasoning about each review
- Embeds the reasoning using `EmbeddingTransformer`
- Reduces dimensionality with `DimReducer`
This creates a feature extraction pipeline where LLM reasoning becomes structured numerical features.
from sklearn.pipeline import make_pipeline
from centimators.feature_transformers import EmbeddingTransformer, DimReducer
dspymator = DSPyMator(
program=dspy.ChainOfThought("review: str -> sentiment: str"),
target_names="sentiment",
)
# Create an embedder that embeds the reasoning text
embedder = EmbeddingTransformer(
model="openai/text-embedding-3-small",
feature_names=["reasoning"], # Embed the reasoning field
)
# Create a dimensionality reducer
dim_reducer = DimReducer(
method="umap",
n_components=2, # Reduce embeddings to 2D for visualization
)
# Build the pipeline
pipeline = make_pipeline(dspymator, embedder, dim_reducer)
print("Pipeline created:")
display(pipeline)
Pipeline created:
Pipeline(steps=[('dspymator',
DSPyMator(program=predict = Predict(StringSignature(review -> reasoning, sentiment
instructions='Given the fields `review`, produce the fields `sentiment`.'
review = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Review:', 'desc': '${review}'})
reasoning = Field(annotation=str required=True json_sche...
sentiment = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Sentiment:', 'desc': '${sentiment}'})
)),
target_names='sentiment',
feature_names=None,
lm='openai/gpt-5-nano',
temperature=1.0,
max_tokens=16000,
use_async=True,
max_concurrent=50,
verbose=True)),
('embeddingtransformer',
EmbeddingTransformer(categorical_mapping={},
feature_names=['reasoning'],
model='openai/text-embedding-3-small')),
('dimreducer', DimReducer(method='umap'))])
# Run the full pipeline on the 300-review training subset (to save time and API costs)
print("\nRunning pipeline: DSPyMator → Embeddings → UMAP...\n")
# Fit and transform
reduced_features = pipeline.fit_transform(X_train, y_train)
print("\nFirst few rows:")
display(reduced_features.head())
Running pipeline: DSPyMator → Embeddings → UMAP...
DSPyMator predicting: 100%|██████████| 300/300 [00:00<00:00, 618.77it/s]
First few rows:
| dim_0 | dim_1 |
|---|---|
| f32 | f32 |
| 0.286903 | 5.589123 |
| 7.863194 | 4.594091 |
| 0.060558 | 5.437039 |
| 8.057485 | 5.643534 |
| 1.149364 | 4.60095 |
Visualize the Reasoning Embeddings¶
Let's visualize how the LLM's reasoning clusters different sentiments in 2D space. By embedding and visualizing the classifier's reasoning, we can see how the two sentiment clusters separate and inspect why particular examples were misclassified.
import cluestar
# Create an interactive visualization with cluestar
cluestar.plot_text(
X=reduced_features,
texts=X_train["review"].to_list(),
color_array=y_train.to_list(),
)
6. Key Takeaways¶
In this tutorial, you learned:
✅ Basic Usage: How to wrap DSPy programs with DSPyMator for sklearn compatibility
✅ Prediction Methods:
- `predict()` returns only the target field(s)
- `transform()` returns all output fields (including reasoning)
✅ Chain of Thought: Using dspy.ChainOfThought to get explainable predictions
✅ Optimization: Leveraging GEPA to automatically improve prompts based on training data
✅ Pipeline Integration: Building end-to-end pipelines combining LLM reasoning, embeddings, and dimensionality reduction
Next Steps¶
- Experiment with other DSPy optimizers like `MIPROv2` or `BootstrapFewShot`
- Try different LLM providers (Anthropic, local models, etc.)
- Combine DSPyMator with traditional ML models in ensemble pipelines (see the sketch below)
- Explore multi-output predictions for richer feature extraction
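For that ensemble idea, here is a hedged starting sketch. It assumes (not verified here) that EmbeddingTransformer's numeric output can feed a downstream sklearn estimator directly, just as it feeds DimReducer in the pipeline above:
import dspy
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from centimators.model_estimators import DSPyMator
from centimators.feature_transformers import EmbeddingTransformer
# LLM reasoning -> embeddings -> a classical linear model on top
ensemble = make_pipeline(
    DSPyMator(
        program=dspy.ChainOfThought("review: str -> sentiment: str"),
        target_names="sentiment",
    ),
    EmbeddingTransformer(
        model="openai/text-embedding-3-small",
        feature_names=["reasoning"],  # embed the reasoning traces
    ),
    # assumption: the embedding matrix is directly consumable here
    LogisticRegression(max_iter=1000),
)
# ensemble.fit(X_train, y_train)
# ensemble.predict(X_test)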