Optimization Framework

DSPy

Stop hand-crafting brittle prompt strings. DSPy lets you write modular Python programs with typed signatures, then automatically compiles them into optimized prompts and weights for any language model — turning prompt engineering from trial-and-error art into reproducible, testable software engineering.

Framework Context: 2022–2025

Introduced: DSPy (Declarative Self-improving Python) emerged from Stanford NLP, led by Omar Khattab. The foundational work began with the DSP (Demonstrate-Search-Predict) paper in December 2022, which introduced the concept of composing retrieval and generation into modular programs. By October 2023, this evolved into DSPy — a full framework where developers write Python code using declarative “signatures” (typed input/output specifications), and a compiler automatically generates optimized prompts, selects few-shot examples, and even fine-tunes weights. The key insight: treat prompts as compiled artifacts, not hand-written strings.

Modern LLM Status: DSPy has become a production-grade standard for building LLM pipelines. With over 28,000 GitHub stars, 160,000+ monthly pip downloads, and 500+ dependent projects, it has moved well beyond research. Major organizations use DSPy for production RAG systems, multi-hop reasoning pipelines, and complex NLP workflows. The framework’s MIPROv2 and BootstrapFewShot optimizers, along with its support for structured outputs, make it the most mature tool for systematic prompt optimization. DSPy represents a fundamental shift: from crafting individual prompts to engineering entire LM programs.

The Core Insight

Programming, Not Prompting

Traditional prompt engineering works like this: you write a prompt string, test it, tweak the wording, test again, and repeat until it works. If you change the model, the temperature, or the task slightly, you often have to start over. The prompt is a fragile, opaque artifact that is difficult to version, test, or improve systematically.

DSPy replaces this with structured programming. Instead of writing “You are an expert analyst. Given the following context: {context}, answer the question: {question}”, you write a Python signature: context, question → answer. DSPy’s compiler then figures out the best way to instruct the model — selecting demonstrations, generating instructions, and optimizing the prompt format — all automatically, guided by a metric you define.

Think of the difference between writing assembly code and writing Python. Assembly gives you total control but is brittle and hard to maintain. Python lets you express intent at a higher level, and the interpreter handles the low-level details. DSPy does the same for LLM interactions — you express what you want (the signature), and the compiler handles how to ask for it (the prompt).

Signatures: The Building Blocks

A DSPy signature defines the inputs and outputs of a language model call with typed fields. For example, “question → answer” tells the compiler this module takes a question and produces an answer. You can add field descriptions, constraints, and types. The compiler uses these declarations to automatically generate optimal prompts, select relevant few-shot examples from training data, and even fine-tune the model’s weights — all without you ever writing a prompt string.

The DSPy Pipeline

Five stages from signature to optimized program

1

Define Signatures

Declare what each language model call should do using typed input/output specifications. Signatures are the contracts between your code and the LLM — they specify the interface without dictating the implementation.

Example

Signature: “context: str, question: str → answer: str” — This tells DSPy the module receives a context paragraph and a question, and should produce a string answer. The compiler will figure out the best prompt format, instructions, and examples.

2

Build Modules

Wrap signatures in DSPy modules like Predict, ChainOfThought, or ReAct. Each module adds a specific prompting strategy around your signature — CoT adds reasoning steps, ReAct adds tool use loops. Modules are composable: you can nest them like regular Python classes.

Example

Use ChainOfThought("context, question → answer") to automatically add step-by-step reasoning. Or use dspy.ReAct("question → answer", tools=[SearchTool]) to create a tool-using agent — all from a one-line module definition.

3

Compose the Pipeline

Connect multiple modules into a complete program. A multi-hop QA system might chain a retriever, a relevance filter, and a generator. Each module can use a different LLM, different prompting strategy, and different optimization target — but they compose naturally as Python code.

Example

A RAG pipeline: (1) Retrieve(question → passages) fetches documents, (2) Rerank(passages, question → top_passages) filters for relevance, (3) Generate(top_passages, question → answer) produces the final answer with citations. Three modules, one pipeline, all optimizable together.

4

Compile with an Optimizer

Choose an optimizer (BootstrapFewShot, MIPROv2, or others) and provide a training set with a metric function. The optimizer runs your pipeline on training examples, collects traces of successful executions, generates candidate instructions, and searches for the combination that maximizes your metric.

Example

Compile with MIPROv2 using 200 training examples and an F1 metric. The optimizer bootstraps 50 demonstration traces, proposes 10 candidate instructions per module, and runs Bayesian search over 100 trials to find the optimal configuration — automatically producing prompts that score 23% higher than hand-written ones.

5

Execute and Iterate

Run the compiled program on new inputs. The optimized prompts, selected demonstrations, and tuned parameters are all saved and versioned. When the task changes, the model updates, or performance degrades, re-compile with new data — no manual prompt rewriting needed.

Example

After switching from GPT-4 to Claude, re-compile the same pipeline on the same training set. DSPy automatically adapts the prompt format, example selection, and instructions to the new model’s strengths — maintaining performance without any manual prompt changes.

See the Difference

Manual prompt engineering versus programmatic optimization

Manual Prompt Engineering

Approach

You are an expert question-answering system. Given the following context, answer the question accurately. If the answer is not in the context, say “I don’t know.” Context: {context} Question: {question} Answer:

Problems

Prompt works for GPT-4 but fails on Claude. Adding few-shot examples helps but requires manual curation. Changing the task (e.g., adding citations) means rewriting the entire prompt. No systematic way to measure or improve quality. Breaks when deployed to a different domain.

Fragile, model-specific, no systematic optimization
VS

DSPy Program

Approach

Define signature: “context, question → answer”. Wrap in ChainOfThought module. Provide 200 labeled examples and an exact-match metric. Compile with MIPROv2. The compiler generates instructions, selects optimal demonstrations, and formats the prompt — all automatically.

Result

Compiled program scores 23% higher than hand-written prompt on held-out test set. Switching models requires only re-compilation, not rewriting. Adding citations means adding a field to the signature and re-compiling. Every optimization is reproducible, versioned, and testable.

Systematic, model-agnostic, reproducible optimization


DSPy in Action

Real-world examples of programmatic prompt optimization

DSPy Program Definition

Task: Answer complex questions that require reasoning across multiple documents.

Signature 1 (Retrieve): “question → search_queries: list[str]” — Generate multiple search queries from the original question.

Signature 2 (Generate): “context: list[str], question → reasoning, answer” — Chain-of-thought module that reasons through retrieved passages to produce an answer.

Metric: Combined score of answer correctness (exact match) and reasoning faithfulness (all claims grounded in retrieved context).

Optimizer: MIPROv2 with 300 training examples, 50 bootstrap trials.
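The combined metric described above could be sketched as a plain function. The substring-overlap grounding check is a deliberately crude assumption standing in for a real faithfulness judge:

```python
def multihop_metric(example, prediction, trace=None):
    """0.7 * exact-match correctness + 0.3 * fraction of grounded claims."""
    correct = float(
        prediction.answer.strip().lower() == example.answer.strip().lower()
    )
    context = " ".join(prediction.context).lower()
    claims = [c.strip() for c in prediction.reasoning.split(".") if c.strip()]
    grounded = [
        c for c in claims
        if any(w in context for w in c.lower().split() if len(w) > 4)
    ]
    faithfulness = len(grounded) / max(len(claims), 1)
    return 0.7 * correct + 0.3 * faithfulness
```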

Compiled Execution

Input: “What river runs through the city where the company that created the first commercially successful smartphone was founded?”

Step 1 — Compiled query generator: The optimized first module generates three search queries: (1) “first commercially successful smartphone company”, (2) “BlackBerry RIM founding city”, (3) “rivers in Waterloo Ontario.” The compiler learned to decompose multi-hop questions into parallel retrieval paths rather than sequential ones.

Step 2 — Retrieval: Three parallel searches return passages about Research In Motion (RIM), Waterloo Ontario, and the Grand River.

Step 3 — Compiled reasoning: The optimized generator uses automatically-selected demonstrations of similar multi-hop reasoning. Its compiled instructions emphasize citing specific passages for each reasoning step: “The first commercially successful smartphone was the BlackBerry, created by Research In Motion [Passage 1]. RIM was founded in Waterloo, Ontario [Passage 2]. The Grand River runs through Waterloo [Passage 3].”

Answer: “The Grand River.” Every claim is traceable to a retrieved passage — the faithfulness metric during compilation ensured grounded reasoning. Always verify AI-generated answers against primary sources.

DSPy Program Definition

Task: Classify customer support tickets into categories (billing, technical, feature request, complaint, general) with reasoning for each classification.

Signature: “ticket_text → reasoning, category, confidence” — The model must explain its reasoning before classifying, and provide a confidence level.

Module: ChainOfThought with typed output validation (category must be one of the five allowed values, confidence must be high/medium/low).

Optimizer: BootstrapFewShot with 500 labeled tickets and weighted F1 metric.

Compiled Execution

Input ticket: “I’ve been charged twice for my subscription this month. The second charge appeared on the 15th but I only have one account. I need this resolved before my next billing cycle.”

Compiled response: The optimizer bootstrapped 8 demonstration examples that the compiler selected for maximum diversity — covering edge cases like tickets that mention both billing and technical issues. The compiled instructions include specific guidance the optimizer discovered: “Classify based on the primary action the customer needs, not just mentioned topics.”

Reasoning: “The customer reports a duplicate charge and requests a billing correction before the next cycle. While the duplicate might have a technical cause, the customer’s primary need is a billing resolution.”

Category: billing
Confidence: high

The bootstrapped system achieves 94% weighted F1 compared to 78% with a hand-written prompt on the same test set. Classification decisions should be reviewed by humans, especially for edge cases.

DSPy Program Definition

Task: Extract structured information from unstructured research abstracts — methods used, datasets, reported metrics, and key findings.

Signature 1 (Extract): “abstract → methods: list[str], datasets: list[str], metrics: dict, findings: list[str]” — Typed extraction with structured outputs.

Signature 2 (Validate): “abstract, extraction → corrections: list[str], validated_extraction” — Self-verification module that checks extracted data against the source text.

Pipeline: Extract then Validate, composed as a two-stage program.

Optimizer: MIPROv2 with 150 expert-labeled abstracts and a custom field-level accuracy metric.
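The custom field-level accuracy metric mentioned above might look like this; the field list and exact-match comparison are simplifying assumptions:

```python
def field_accuracy(example, prediction, trace=None):
    """Fraction of extraction fields whose values exactly match the labels."""
    fields = ["methods", "datasets", "findings"]
    hits = sum(
        sorted(getattr(prediction, f, [])) == sorted(getattr(example, f, []))
        for f in fields
    )
    return hits / len(fields)
```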

Compiled Execution

Input abstract: A machine learning paper about using transformer architectures for protein structure prediction, reporting RMSD improvements on CASP14.

Stage 1 — Compiled extraction: The optimized extractor uses compiler-generated instructions that emphasize distinguishing between methods the authors propose vs. methods they compare against (a distinction the optimizer learned was critical for accuracy). Extracts: methods=[“ProteinTransformer (proposed)”, “AlphaFold2 (baseline)”], datasets=[“CASP14”, “PDB-2024”], metrics={“RMSD”: “0.82Å vs 1.1Å baseline”}, findings=[“23% RMSD improvement over AlphaFold2 on novel folds”].

Stage 2 — Compiled validation: The verifier cross-references each extracted claim against the abstract text. Catches that “PDB-2024” was used for pre-training, not evaluation — moves it to a separate field. Final output is corrected and verified.

Two-stage pipeline achieves 91% field-level accuracy vs. 72% with single-stage extraction. Extracted data should always be verified against source documents before use in research or production.

Python Implementation

A complete DSPy program for automated website auditing — from signature to compiled module

website_auditor.py
import dspy

# Configure the language model
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

# Step 1: Define the Signature — the contract between code and LLM
class WebsiteAuditSignature(dspy.Signature):
    """Analyze HTML/CSS/JS code for accessibility (WCAG AA), security (CSP),
    and performance issues. Return findings by severity with remediation."""

    code_snippet = dspy.InputField(desc="The HTML/CSS/JS code to audit")
    context = dspy.InputField(desc="Production standards: CSP policy, WCAG level, target metrics")

    detailed_report = dspy.OutputField(desc="Point-by-point findings with severity and location")
    critical_vulnerabilities = dspy.OutputField(desc="Issues that would fail production deployment")
    remediation_steps = dspy.OutputField(desc="Prioritized fixes with corrected code examples")

# Step 2: Build the Module — compose DSPy primitives into a program
class ProfessionalAuditor(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_audit = dspy.ChainOfThought(WebsiteAuditSignature)

    def forward(self, code_snippet, context):
        return self.generate_audit(code_snippet=code_snippet, context=context)

# Step 3: Define a quality metric for the optimizer
def audit_metric(example, prediction, trace=None):
    expected = set(example.expected_issues)
    found = set(prediction.detailed_report.split("\n"))
    recall = len(expected & found) / max(len(expected), 1)
    has_fixes = "remediation" in prediction.remediation_steps.lower()
    return recall * 0.7 + (0.3 if has_fixes else 0)

# Step 4: Compile — the optimizer finds the best prompt automatically
# (training_data: a list of dspy.Example objects labeled with expected_issues)
optimizer = dspy.BootstrapFewShot(metric=audit_metric, max_bootstrapped_demos=4)
compiled_auditor = optimizer.compile(ProfessionalAuditor(), trainset=training_data)

# Step 5: Use the compiled module
# Important: Always verify AI-generated audit findings against actual code
result = compiled_auditor(
    code_snippet='<img src="hero.jpg" onclick="load()">',
    context="CSP A+ (no inline scripts), WCAG AA, zero external dependencies"
)
print(result.detailed_report)
print(result.critical_vulnerabilities)
AI-generated audit findings should always be verified against actual source code before acting on remediation steps.

When to Use DSPy

Best for systematic optimization of multi-stage LLM programs

Perfect For

Multi-Stage LLM Pipelines

When your application chains multiple LLM calls — retrieval, reasoning, generation, validation — and each stage needs independently optimized prompts.

Production Systems Requiring Reproducibility

When prompts need to be versioned, tested against benchmarks, and reliably reproduced across deployments and model updates.

Model-Agnostic Applications

When you need to switch between LLM providers (OpenAI, Anthropic, open-source) without rewriting prompts — DSPy re-compiles automatically for each model.

Performance Optimization at Scale

When manual prompt tuning has plateaued and you need data-driven optimization to squeeze the last 10–25% of quality from your LLM application.

Skip It When

Quick One-Off Prompts

When you need a single prompt for a simple task — the overhead of setting up signatures, training data, and compilation is not justified.

No Labeled Training Data

When you have no examples of correct outputs to optimize against — DSPy’s optimizers need a metric and at least a small training set to be effective.

No-Code Environments

When you or your team cannot write Python code — DSPy is a Python library and requires programming ability to use effectively.

Use Cases

Where DSPy delivers the most value

Enterprise RAG Systems

Optimize retrieval-augmented generation pipelines for internal knowledge bases, documentation search, and customer-facing Q&A with measurable accuracy gains.

Research Automation

Build multi-hop reasoning systems for literature review, evidence synthesis, and systematic analysis of research papers.

Data Processing Pipelines

Structured extraction, classification, and transformation of unstructured data at scale with typed outputs and validation.

Multi-Agent Systems

Design and optimize individual agent modules that compose into complex multi-agent workflows with measurable end-to-end performance.

Content Generation

Optimized pipelines for generating, fact-checking, and formatting content with consistent quality across topics and styles.

Evaluation Frameworks

Build LLM-as-judge evaluation systems with optimized scoring criteria, calibrated rubrics, and consistent grading across evaluators.

Where DSPy Fits

The evolution from manual prompting to programmatic optimization

Manual Prompting: hand-crafted strings, trial-and-error prompt writing
Few-Shot Learning: example-based, manual example selection
Chain-of-Thought: structured reasoning, step-by-step reasoning chains
DSPy: compiled programs, automatic prompt optimization
The Compiler Paradigm

DSPy represents the same paradigm shift that compilers brought to software engineering. Just as C compilers freed programmers from writing assembly, DSPy frees AI engineers from hand-tuning prompt strings. You declare your intent through signatures and modules, and the compiler translates that into optimized instructions for whatever model you target. This makes LLM applications more maintainable, more testable, and more portable across models and providers.
