MIPRO
Hand-written instructions plateau. MIPRO (Multiprompt Instruction PRoposal Optimizer) automatically generates, evaluates, and searches for the best combination of instructions and demonstrations across every stage of a multi-step LLM program — with reported gains of around 11% over manual optimization, found through Bayesian search.
Introduced: MIPRO was developed by Krista Opsahl-Ong, Michael Ryan, and collaborators at Stanford, published in June 2024 as “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” It addressed a critical gap: while DSPy could optimize few-shot examples (via BootstrapFewShot), it lacked a systematic way to optimize the instruction text itself — the natural language directions that tell each module what to do. MIPRO solved this by using the LLM itself to propose candidate instructions grounded in actual execution traces, then searching for the best combination using Bayesian optimization.
Modern LLM Status: MIPROv2 is now the default advanced optimizer in DSPy and the most effective approach for optimizing multi-stage LLM programs. It has been adopted by organizations building production RAG pipelines, classification systems, and complex reasoning chains. The key innovation — using program traces to ground instruction proposals — has influenced how the broader community thinks about prompt optimization. Instead of asking “what is the best prompt?”, MIPRO asks “what instructions, grounded in how this program actually runs, will maximize this metric?”
Instructions Should Be Data-Driven
When humans write prompt instructions, we guess what the model needs to hear. We write “Be concise” or “Think step by step” based on intuition, then test and tweak. This approach works for simple tasks, but in multi-stage programs — where each module has its own prompt — the search space becomes enormous. There are thousands of possible instruction combinations, and the optimal instructions for one stage depend on what the other stages do.
MIPRO flips this by generating instructions from evidence. It runs your program on training data, collects traces of what actually happens at each stage (inputs, outputs, successes, failures), and uses those traces to propose grounded instructions. Instead of guessing “summarize the key points,” MIPRO might discover that “extract the specific numerical claims and their sources” works better for your particular pipeline — an instruction a human might never have tried.
Think of the difference between a coach who watches game tape and one who gives generic advice. The tape-watching coach sees your team’s specific strengths and weaknesses, and designs plays around them. MIPRO watches your program’s actual execution traces and designs instructions around what it observes.
MIPRO operates in three phases. First, it bootstraps by running your program on training examples and collecting execution traces — the actual inputs and outputs at every module. Second, it proposes candidate instructions by showing the LLM your code structure, data patterns, and execution traces, and asking it to draft instructions that would improve performance. Third, it searches using Bayesian optimization to find the optimal combination of instructions and demonstrations across all modules simultaneously.
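The three phases can be sketched in a few dozen lines. This is a toy simulation with stub functions standing in for real LLM calls — every name here is illustrative, not the DSPy API:

```python
import random

random.seed(0)

# Stub pipeline: stands in for a real multi-stage LLM program.
def run_program(question, instruction):
    return {"question": question,
            "answer": question.upper(),
            "correct": len(question) % 2 == 0}

# Phase 1: bootstrap — run the program and record one trace per example.
def bootstrap(trainset, instruction):
    return [run_program(q, instruction) for q in trainset]

# Phase 2: propose — draft candidate instructions grounded in the traces.
def propose_instructions(traces, n=3):
    # A real proposer shows the traces and program structure to an LLM.
    return [f"candidate #{i} grounded in {len(traces)} traces" for i in range(n)]

# Phase 3: search — score each candidate on a mini-batch, keep the best.
def search(trainset, candidates, batch_size=2):
    def score(instruction):
        batch = random.sample(trainset, k=batch_size)
        return sum(run_program(q, instruction)["correct"] for q in batch)
    return max(candidates, key=score)

trainset = ["what is gdp", "define rag", "explain tpe"]
traces = bootstrap(trainset, "answer the question")
best = search(trainset, propose_instructions(traces))
```

The real optimizer differs mainly in scale and in the search phase, which uses a Bayesian surrogate rather than brute scoring — but the bootstrap-propose-search loop is the whole shape of the algorithm.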
The MIPRO Process
Five stages from raw execution to optimized instructions
Bootstrap Execution Traces
Run the unoptimized program across your training set. At each module, record the input, the prompt sent to the LLM, the output, and whether the final pipeline result was correct. These traces capture the program’s real behavior — including failure modes and edge cases.
A RAG pipeline is run on 200 questions. The bootstrap captures traces like: Module 1 received “What is the GDP of France?” and generated search query “France GDP 2024”. Module 2 received 5 passages and generated an answer. The final answer was correct in 142/200 cases.
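One way to represent such a trace in code — a hypothetical record layout for illustration, not DSPy's internal format:

```python
from dataclasses import dataclass, field

@dataclass
class ModuleTrace:
    module: str   # which stage produced this record
    inputs: dict  # what the module received
    output: str   # what it produced

@dataclass
class RunTrace:
    steps: list = field(default_factory=list)
    correct: bool = False  # did the final pipeline result score as correct?

run = RunTrace()
run.steps.append(ModuleTrace("query_generator",
                             {"question": "What is the GDP of France?"},
                             "France GDP 2024"))
run.steps.append(ModuleTrace("answer_generator",
                             {"num_passages": 5},
                             "France's GDP is roughly $3 trillion."))
run.correct = True
```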
Filter by Success Metric
Keep only the traces from successful executions — those where the pipeline produced the correct output. These traces represent the program working at its best, and they form the evidence base for instruction proposals.
Of 200 traces, 142 scored correct. These are kept as “gold traces.” The 58 failures are analyzed separately to understand common failure patterns — like the retriever returning irrelevant passages for ambiguous queries.
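The filtering step itself is a one-line partition over the recorded runs; a sketch with synthetic traces:

```python
def split_traces(traces):
    """Partition runs into gold (successful) and failure traces."""
    gold = [t for t in traces if t["correct"]]
    failures = [t for t in traces if not t["correct"]]
    return gold, failures

# Synthetic run log: roughly 7 in 10 executions succeed.
traces = [{"id": i, "correct": i % 10 < 7} for i in range(200)]
gold, failures = split_traces(traces)
# gold feeds instruction proposals; failures feed failure-pattern analysis
```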
Propose Grounded Instructions
Show a proposer LLM your program’s code structure, the gold traces, and the failure patterns. Ask it to generate candidate instructions for each module that would improve performance. Because the proposals are grounded in actual execution data, they address real problems rather than hypothetical ones.
For the retriever module, MIPRO proposes 10 candidate instructions including: “Generate 3 diverse search queries that cover different aspects of the question” and “If the question is ambiguous, generate one query for each possible interpretation.” These were proposed because the traces showed ambiguous queries caused the most failures.
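Mechanically, the proposal step amounts to assembling a prompt for a proposer LLM out of the module's signature, a few gold traces, and a failure summary. A hypothetical assembly function — the field names and wording are invented for illustration:

```python
def build_proposer_prompt(module_name, signature, gold_traces,
                          failure_summary, n=10):
    examples = "\n".join(
        f"- input: {t['input']!r} -> output: {t['output']!r}"
        for t in gold_traces[:3]  # a few representative gold traces
    )
    return (
        f"You are writing instructions for the '{module_name}' module "
        f"with signature {signature}.\n"
        f"Successful traces:\n{examples}\n"
        f"Common failure pattern: {failure_summary}\n"
        f"Propose {n} candidate instructions that address this failure pattern."
    )

prompt = build_proposer_prompt(
    "retriever", "question -> search_queries",
    [{"input": "What is the GDP of France?", "output": "France GDP 2024"}],
    "ambiguous questions produce a single vague query",
)
```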
Bayesian Search for Optimal Combination
With 10 candidate instructions per module and multiple demonstration options, the combinatorial space is huge. MIPRO uses Bayesian optimization (specifically, Tree-structured Parzen Estimators) to efficiently search this space, evaluating candidates on mini-batches of training data and using a surrogate model to guide the search toward the best combinations.
With 2 modules and 10 candidates each, there are 100 instruction combinations, plus demonstration choices on top. MIPRO evaluates 50 combinations on mini-batches of 20 examples each, using the surrogate model to focus on the most promising region of the search space — 1,000 example evaluations (50 × 20) instead of an exhaustive sweep.
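The budget arithmetic is easy to see in code. The real optimizer uses a TPE surrogate to bias later trials toward promising regions; this sketch substitutes plain random sampling with noisy mini-batch scores (the scoring function is entirely made up) to show the trial structure:

```python
import random

random.seed(1)

# Hypothetical "true" quality of a (retriever, generator) instruction pair.
def true_score(inst_a, inst_b):
    return 0.5 + 0.04 * inst_a - 0.03 * abs(inst_b - 4)

# Noisy mini-batch estimate: each trial costs one mini-batch of examples.
def minibatch_eval(combo):
    return true_score(*combo) + random.gauss(0, 0.02)

candidates = [(a, b) for a in range(10) for b in range(10)]  # 100 combinations
trials = random.sample(candidates, 50)                       # evaluate only 50
best = max(trials, key=minibatch_eval)
budget = 50 * 20  # 50 trials x 20-example mini-batches = 1,000 evaluations
```

Swapping `random.sample` for a Tree-structured Parzen Estimator (e.g., Optuna's `TPESampler`) is what lets MIPRO spend the same budget far more efficiently on larger search spaces.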
Select and Deploy Best Configuration
The search identifies the best-performing combination of instructions and demonstrations across all modules. This compiled configuration is saved and deployed. When the program needs re-optimization (new data, model change, or performance drift), the process is repeated automatically.
The winning configuration: Retriever uses instruction variant #7 with 3 bootstrapped demonstrations; Generator uses instruction variant #3 with 5 demonstrations. This combination scores 89% on the held-out test set, up from 71% with the original hand-written instructions — an 18-point improvement.
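The end product of the search is just a serializable configuration. DSPy programs carry their own save/load methods; this JSON layout is invented purely to illustrate what gets persisted:

```python
import json
import os
import tempfile

# Hypothetical compiled configuration produced by the search.
config = {
    "retriever": {"instruction_variant": 7, "num_demos": 3},
    "generator": {"instruction_variant": 3, "num_demos": 5},
    "test_score": 0.89,
}

path = os.path.join(tempfile.gettempdir(), "mipro_config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# At deployment (or re-optimization time), reload the winning configuration.
with open(path) as f:
    loaded = json.load(f)
```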
See the Difference
Hand-written instructions versus MIPRO-optimized instructions
Manual Instructions
Write instructions for each module by hand. “Given the context, answer the question accurately. Be concise and cite sources.” Test on a few examples. Tweak wording based on gut feeling. Same generic instructions for every module in the pipeline.
Instructions are generic — they don’t address this pipeline’s specific failure modes. No way to know if the retriever needs different instructions than the generator. Manual tuning plateaus at 71% accuracy with no systematic way to improve further.
MIPRO-Optimized
Bootstrap 200 traces. MIPRO discovers that retrieval failures are caused by ambiguous queries and proposes query-diversification instructions. For the generator, it proposes instructions emphasizing numerical precision — a pattern it found in the gold traces.
Per-module optimized instructions address specific failure modes. Bayesian search finds the best combination across all modules simultaneously. Final accuracy: 89% — an 18-point gain. Every instruction is traceable to observed execution patterns.
MIPRO in Action
See how automated instruction optimization transforms LLM programs
Pipeline: A 3-module RAG system for answering technical documentation questions. Module 1: Query Generator (question → search queries). Module 2: Passage Ranker (passages, question → top passages). Module 3: Answer Generator (top passages, question → answer with citations).
Training data: 300 question-answer pairs from real user queries against product documentation.
Metric: Combined score: 60% answer correctness + 40% citation accuracy (does the answer cite the right documentation sections?).
Baseline: Hand-written instructions score 68% on held-out test set.
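A composite metric like this is just a Python function the optimizer maximizes; a hypothetical implementation of the 60/40 weighting (the example data and field names are invented):

```python
def combined_metric(example, prediction):
    """60% answer correctness + 40% citation accuracy (illustrative)."""
    answer_ok = float(prediction["answer"] == example["gold_answer"])
    gold = set(example["gold_citations"])
    cited = set(prediction["citations"])
    citation_acc = len(cited & gold) / len(gold) if gold else 1.0
    return 0.6 * answer_ok + 0.4 * citation_acc

score = combined_metric(
    {"gold_answer": "Use the --force flag.",
     "gold_citations": ["cli.md#flags", "cli.md#safety"]},
    {"answer": "Use the --force flag.",
     "citations": ["cli.md#flags"]},
)
# 0.6 * 1.0 + 0.4 * 0.5 = 0.8
```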
Bootstrap phase: Ran pipeline on 300 training examples. 204 correct, 96 incorrect. Analysis of failures: 41 had poor retrieval (wrong docs), 33 had correct docs but wrong answer, 22 had correct answer but wrong citations.
Instruction discovery (Query Generator): MIPRO proposed: “Generate 3 search queries: one using the user’s exact terminology, one using official documentation terms, and one that captures the underlying intent. This ensures retrieval works even when users use colloquial terms.” This instruction was grounded in traces showing that user terminology often didn’t match documentation vocabulary.
Instruction discovery (Answer Generator): MIPRO proposed: “When citing documentation, include the section name and paragraph number. If the answer synthesizes information from multiple sections, list all sections and explain how they connect.” Grounded in traces where citation failures involved answers pulling from multiple sources without attribution.
Final result: The optimized pipeline scores 84% on the held-out test set — a 16-point improvement. Query-retrieval failures dropped from 41 to 12.
Pipeline: A multi-hop QA system that answers questions requiring 2–4 reasoning steps. Module 1: Hop Planner (question → sub-questions). Module 2: Sub-Question Answerer (sub-question, accumulated context → intermediate answer). Module 3: Synthesizer (intermediate answers → final answer).
Training data: 250 multi-hop questions from HotpotQA with ground-truth decompositions.
Metric: Exact-match on final answer AND decomposition quality (are sub-questions logically ordered?).
Baseline: Hand-written instructions: 52% exact match.
Key discovery from traces: The bootstrap revealed that the Hop Planner consistently generated good first sub-questions but failed on later hops because it didn’t account for information already gathered. Sub-question 3 would often repeat what sub-question 1 already answered.
MIPRO’s proposed instruction for Hop Planner: “Before generating the next sub-question, review what has already been established by previous intermediate answers. Each new sub-question should build on prior findings and move closer to the final answer. Never ask for information that was already answered.”
MIPRO’s proposed instruction for Synthesizer: “Combine the intermediate answers by tracing the logical chain from first hop to last. State the final answer, then show the reasoning path: [hop 1 finding] leads to [hop 2 finding] which establishes [final answer].”
Final result: 67% exact match — a 15-point improvement. Redundant sub-questions dropped from 34% to 8%.
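The "never re-ask what earlier hops established" failure can even be detected mechanically. A toy term-overlap heuristic — an illustration of the redundancy the proposed instruction targets, not anything MIPRO runs internally:

```python
import re

def terms(text):
    """Lowercase word set for a crude lexical comparison."""
    return set(re.findall(r"[a-z]+", text.lower()))

def is_redundant(new_subq, established, threshold=0.5):
    """Flag a sub-question whose terms mostly overlap an established finding."""
    t = terms(new_subq)
    return any(len(t & terms(fact)) / max(len(t), 1) >= threshold
               for fact in established)

established = ["the river flows through three countries"]
redundant = is_redundant("What countries does the river flow through?",
                         established)  # True: mostly restates the finding
fresh = is_redundant("What is the capital of the third country?",
                     established)      # False: asks for new information
```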
Pipeline: A 2-module system for classifying news articles into 8 categories with explanations. Module 1: Content Analyzer (article → key themes, entities, tone). Module 2: Classifier (themes, entities, tone → category, explanation, confidence).
Training data: 500 expert-labeled news articles with category justifications.
Metric: 70% weighted F1 + 30% explanation quality (does the explanation mention the right distinguishing features?).
Baseline: Generic “classify this article” instruction: 76% weighted F1.
Trace analysis insight: MIPRO discovered that the Content Analyzer consistently failed to distinguish “Business” from “Technology” articles when companies were involved; 80% of misclassifications fell in these two categories.
MIPRO’s proposed instruction for Content Analyzer: “Identify whether the primary subject is a business event (earnings, merger, market movement, executive change) or a technology development (product launch, research breakthrough, engineering achievement). When a technology company is involved, focus on whether the article discusses the business aspects or the technology itself.”
Bootstrapped demonstrations: MIPRO selected 4 demonstrations specifically targeting the Business/Technology confusion boundary, drawn from the gold traces where the pipeline correctly handled ambiguous cases.
Final result: 88% weighted F1 with optimized instructions and selected demonstrations — a 12-point gain. Business/Technology confusion dropped from 80% of errors to 22%.
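The error-concentration analysis behind this case is simple to reproduce from an error log; a sketch over synthetic (gold, predicted) pairs with a hypothetical helper:

```python
from collections import Counter

def pair_shares(errors):
    """Share of misclassifications per unordered (gold, predicted) pair."""
    counts = Counter(frozenset(e) for e in errors)
    total = len(errors)
    return {tuple(sorted(pair)): n / total for pair, n in counts.items()}

# Synthetic error log: 8 of 10 errors confuse Business with Technology.
errors = ([("Business", "Technology")] * 5
          + [("Technology", "Business")] * 3
          + [("Sports", "Politics")] * 2)
shares = pair_shares(errors)
# shares[("Business", "Technology")] == 0.8
```

Concentrations like this tell the proposer exactly which category boundary the new instruction must address.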
When to Use MIPRO
Best for data-driven instruction optimization in multi-stage programs
Perfect For
When your pipeline has multiple LLM calls and each stage needs independently tuned instructions that work together as a system.
When you’ve tried many prompt variations and can’t improve further — MIPRO’s data-grounded search explores instructions you wouldn’t think to write.
When you have a measurable quality metric (accuracy, F1, BLEU, etc.) and enough labeled data to run meaningful optimization trials.
When you need to understand why your LLM program fails — MIPRO’s trace analysis reveals specific failure modes and which module is responsible.
Skip It When
When your application is a single LLM call — MIPRO’s power comes from optimizing across multiple modules simultaneously.
When you have fewer than 50 labeled examples — MIPRO needs enough data to bootstrap meaningful traces and evaluate candidates.
When your program’s structure is still changing frequently — optimize after the architecture stabilizes, not during early experimentation.
Use Cases
Where MIPRO delivers the most value
Enterprise Search
Optimize RAG pipelines for internal knowledge bases where query vocabulary differs from document terminology.
Content Generation Pipelines
Optimize multi-stage content workflows: research, draft, fact-check, and format with per-stage instruction tuning.
Medical NLP
Optimize clinical text processing pipelines where accuracy and grounded reasoning are critical for patient safety.
Legal Document Analysis
Optimize extraction and reasoning pipelines for contracts, filings, and regulatory documents with high precision requirements.
Customer Support
Optimize ticket classification, routing, and response generation pipelines for faster resolution and higher satisfaction.
Financial Analysis
Optimize multi-stage pipelines for earnings analysis, risk assessment, and market research with numerical precision metrics.
Where MIPRO Fits
The evolution from manual prompting to automated instruction optimization
MIPRO consistently discovers instructions that humans wouldn’t write — not because they’re unnatural, but because they address failure modes that are invisible without trace analysis. When a human writes “be accurate,” MIPRO writes “distinguish between business events and technology developments when companies are involved” — because the traces showed that’s where accuracy actually breaks down. This is the power of grounding instruction proposals in execution evidence rather than human intuition.