MIPRO
Hand-written instructions plateau. MIPRO (Multiprompt Instruction PRoposal Optimizer) automatically generates, evaluates, and searches for the best combination of instructions and demonstrations across every stage of a multi-step LLM program — with reported gains of around 11% over manual optimization, found through Bayesian search.
Introduced: MIPRO was developed by Krista Opsahl-Ong, Michael Ryan, and collaborators at Stanford, published in June 2024 as “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs.” It addressed a critical gap: while DSPy could optimize few-shot examples (via BootstrapFewShot), it lacked a systematic way to optimize the instruction text itself — the natural language directions that tell each module what to do. MIPRO solved this by using the LLM itself to propose candidate instructions grounded in actual execution traces, then searching for the best combination using Bayesian optimization.
Modern LLM Status: MIPROv2 is now the default advanced optimizer in DSPy and the most effective approach for optimizing multi-stage LLM programs. It has been adopted by organizations building production RAG pipelines, classification systems, and complex reasoning chains. The key innovation — using program traces to ground instruction proposals — has influenced how the broader community thinks about prompt optimization. Instead of asking “what is the best prompt?”, MIPRO asks “what instructions, grounded in how this program actually runs, will maximize this metric?”
Instructions Should Be Data-Driven
When humans write prompt instructions, we guess what the model needs to hear. We write “Be concise” or “Think step by step” based on intuition, then test and tweak. This approach works for simple tasks, but in multi-stage programs — where each module has its own prompt — the search space becomes enormous. There are thousands of possible instruction combinations, and the optimal instructions for one stage depend on what the other stages do.
MIPRO flips this by generating instructions from evidence. It runs your program on training data, collects traces of what actually happens at each stage (inputs, outputs, successes, failures), and uses those traces to propose grounded instructions. Instead of guessing “summarize the key points,” MIPRO might discover that “extract the specific numerical claims and their sources” works better for your particular pipeline — an instruction a human might never have tried.
Think of the difference between a coach who watches game tape and one who gives generic advice. The tape-watching coach sees your team’s specific strengths and weaknesses, and designs plays around them. MIPRO watches your program’s actual execution traces and designs instructions around what it observes.
MIPRO operates in three phases. First, it bootstraps by running your program on training examples and collecting execution traces — the actual inputs and outputs at every module. Second, it proposes candidate instructions by showing the LLM your code structure, data patterns, and execution traces, and asking it to draft instructions that would improve performance. Third, it searches using Bayesian optimization to find the optimal combination of instructions and demonstrations across all modules simultaneously.
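The three phases can be sketched in a few dozen lines. This is a toy simulation with stub functions standing in for real LLM calls — every name here is illustrative, not the DSPy API:

```python
import random

random.seed(0)

# Stub pipeline: stands in for a real multi-stage LLM program.
def run_program(question, instruction):
    return {"question": question,
            "answer": question.upper(),
            "correct": len(question) % 2 == 0}

# Phase 1: bootstrap — run the program and record one trace per example.
def bootstrap(trainset, instruction):
    return [run_program(q, instruction) for q in trainset]

# Phase 2: propose — draft candidate instructions grounded in the traces.
def propose_instructions(traces, n=3):
    # A real proposer shows the traces and program structure to an LLM.
    return [f"candidate #{i} grounded in {len(traces)} traces" for i in range(n)]

# Phase 3: search — score each candidate on a mini-batch, keep the best.
def search(trainset, candidates, batch_size=2):
    def score(instruction):
        batch = random.sample(trainset, k=batch_size)
        return sum(run_program(q, instruction)["correct"] for q in batch)
    return max(candidates, key=score)

trainset = ["what is gdp", "define rag", "explain tpe"]
traces = bootstrap(trainset, "answer the question")
best = search(trainset, propose_instructions(traces))
```

The real optimizer differs mainly in scale and in the search phase, which uses a Bayesian surrogate rather than brute scoring — but the bootstrap-propose-search loop is the whole shape of the algorithm.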
The MIPRO Process
Five stages from raw execution to optimized instructions
Bootstrap Execution Traces
Run the unoptimized program across your training set. At each module, record the input, the prompt sent to the LLM, the output, and whether the final pipeline result was correct. These traces capture the program’s real behavior — including failure modes and edge cases.
A RAG pipeline is run on 200 questions. The bootstrap captures traces like: Module 1 received “What is the GDP of France?” and generated search query “France GDP 2024”. Module 2 received 5 passages and generated an answer. The final answer was correct in 142/200 cases.
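One way to represent such a trace in code — a hypothetical record layout for illustration, not DSPy's internal format:

```python
from dataclasses import dataclass, field

@dataclass
class ModuleTrace:
    module: str   # which stage produced this record
    inputs: dict  # what the module received
    output: str   # what it produced

@dataclass
class RunTrace:
    steps: list = field(default_factory=list)
    correct: bool = False  # did the final pipeline result score as correct?

run = RunTrace()
run.steps.append(ModuleTrace("query_generator",
                             {"question": "What is the GDP of France?"},
                             "France GDP 2024"))
run.steps.append(ModuleTrace("answer_generator",
                             {"num_passages": 5},
                             "France's GDP is roughly $3 trillion."))
run.correct = True
```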
Filter by Success Metric
Keep only the traces from successful executions — those where the pipeline produced the correct output. These traces represent the program working at its best, and they form the evidence base for instruction proposals.
Of 200 traces, 142 scored correct. These are kept as “gold traces.” The 58 failures are analyzed separately to understand common failure patterns — like the retriever returning irrelevant passages for ambiguous queries.
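The filtering step itself is a one-line partition over the recorded runs; a sketch with synthetic traces:

```python
def split_traces(traces):
    """Partition runs into gold (successful) and failure traces."""
    gold = [t for t in traces if t["correct"]]
    failures = [t for t in traces if not t["correct"]]
    return gold, failures

# Synthetic run log: roughly 7 in 10 executions succeed.
traces = [{"id": i, "correct": i % 10 < 7} for i in range(200)]
gold, failures = split_traces(traces)
# gold feeds instruction proposals; failures feed failure-pattern analysis
```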
Propose Grounded Instructions
Show a proposer LLM your program’s code structure, the gold traces, and the failure patterns. Ask it to generate candidate instructions for each module that would improve performance. Because the proposals are grounded in actual execution data, they address real problems rather than hypothetical ones.
For the retriever module, MIPRO proposes 10 candidate instructions including: “Generate 3 diverse search queries that cover different aspects of the question” and “If the question is ambiguous, generate one query for each possible interpretation.” These were proposed because the traces showed ambiguous queries caused the most failures.
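Mechanically, the proposal step amounts to assembling a prompt for a proposer LLM out of the module's signature, a few gold traces, and a failure summary. A hypothetical assembly function — the field names and wording are invented for illustration:

```python
def build_proposer_prompt(module_name, signature, gold_traces,
                          failure_summary, n=10):
    examples = "\n".join(
        f"- input: {t['input']!r} -> output: {t['output']!r}"
        for t in gold_traces[:3]  # a few representative gold traces
    )
    return (
        f"You are writing instructions for the '{module_name}' module "
        f"with signature {signature}.\n"
        f"Successful traces:\n{examples}\n"
        f"Common failure pattern: {failure_summary}\n"
        f"Propose {n} candidate instructions that address this failure pattern."
    )

prompt = build_proposer_prompt(
    "retriever", "question -> search_queries",
    [{"input": "What is the GDP of France?", "output": "France GDP 2024"}],
    "ambiguous questions produce a single vague query",
)
```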
Bayesian Search for Optimal Combination
With 10 candidate instructions per module and multiple demonstration options, the combinatorial space is huge. MIPRO uses Bayesian optimization (specifically, Tree-structured Parzen Estimators) to efficiently search this space, evaluating candidates on mini-batches of training data and using a surrogate model to guide the search toward the best combinations.
With 2 modules and 10 candidates each, there are 100 instruction combinations, plus demonstration choices on top. MIPRO evaluates 50 combinations on mini-batches of 20 examples each, using the surrogate model to focus on the most promising region of the search space — 1,000 example evaluations (50 × 20) instead of an exhaustive sweep.
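The budget arithmetic is easy to see in code. The real optimizer uses a TPE surrogate to bias later trials toward promising regions; this sketch substitutes plain random sampling with noisy mini-batch scores (the scoring function is entirely made up) to show the trial structure:

```python
import random

random.seed(1)

# Hypothetical "true" quality of a (retriever, generator) instruction pair.
def true_score(inst_a, inst_b):
    return 0.5 + 0.04 * inst_a - 0.03 * abs(inst_b - 4)

# Noisy mini-batch estimate: each trial costs one mini-batch of examples.
def minibatch_eval(combo):
    return true_score(*combo) + random.gauss(0, 0.02)

candidates = [(a, b) for a in range(10) for b in range(10)]  # 100 combinations
trials = random.sample(candidates, 50)                       # evaluate only 50
best = max(trials, key=minibatch_eval)
budget = 50 * 20  # 50 trials x 20-example mini-batches = 1,000 evaluations
```

Swapping `random.sample` for a Tree-structured Parzen Estimator (e.g., Optuna's `TPESampler`) is what lets MIPRO spend the same budget far more efficiently on larger search spaces.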
Select and Deploy Best Configuration
The search identifies the best-performing combination of instructions and demonstrations across all modules. This compiled configuration is saved and deployed. When the program needs re-optimization (new data, model change, or performance drift), the process is repeated automatically.
The winning configuration: Retriever uses instruction variant #7 with 3 bootstrapped demonstrations; Generator uses instruction variant #3 with 5 demonstrations. This combination scores 89% on the held-out test set, up from 71% with the original hand-written instructions — an 18-point improvement.
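The end product of the search is just a serializable configuration. DSPy programs carry their own save/load methods; this JSON layout is invented purely to illustrate what gets persisted:

```python
import json
import os
import tempfile

# Hypothetical compiled configuration produced by the search.
config = {
    "retriever": {"instruction_variant": 7, "num_demos": 3},
    "generator": {"instruction_variant": 3, "num_demos": 5},
    "test_score": 0.89,
}

path = os.path.join(tempfile.gettempdir(), "mipro_config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# At deployment (or re-optimization time), reload the winning configuration.
with open(path) as f:
    loaded = json.load(f)
```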
See the Difference
Hand-written instructions versus MIPRO-optimized instructions
Manual Instructions
Write instructions for each module by hand. “Given the context, answer the question accurately. Be concise and cite sources.” Test on a few examples. Tweak wording based on gut feeling. Same generic instructions for every module in the pipeline.
Instructions are generic — they don’t address this pipeline’s specific failure modes. No way to know if the retriever needs different instructions than the generator. Manual tuning plateaus at 71% accuracy with no systematic way to improve further.
MIPRO-Optimized
Bootstrap 200 traces. MIPRO discovers that retrieval failures are caused by ambiguous queries and proposes query-diversification instructions. For the generator, it proposes instructions emphasizing numerical precision — a pattern it found in the gold traces.
Per-module optimized instructions address specific failure modes. Bayesian search finds the best combination across all modules simultaneously. Final accuracy: 89% — an 18-point gain. Every instruction is traceable to observed execution patterns.
MIPRO in Action
See how automated instruction optimization transforms LLM programs
Pipeline: A 3-module RAG system for answering technical documentation questions. Module 1: Query Generator (question → search queries). Module 2: Passage Ranker (passages, question → top passages). Module 3: Answer Generator (top passages, question → answer with citations).
Training data: 300 question-answer pairs from real user queries against product documentation.
Metric: Combined score: 60% answer correctness + 40% citation accuracy (does the answer cite the right documentation sections?).
Baseline: Hand-written instructions score 68% on held-out test set.
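A composite metric like this is just a Python function the optimizer maximizes; a hypothetical implementation of the 60/40 weighting (the example data and field names are invented):

```python
def combined_metric(example, prediction):
    """60% answer correctness + 40% citation accuracy (illustrative)."""
    answer_ok = float(prediction["answer"] == example["gold_answer"])
    gold = set(example["gold_citations"])
    cited = set(prediction["citations"])
    citation_acc = len(cited & gold) / len(gold) if gold else 1.0
    return 0.6 * answer_ok + 0.4 * citation_acc

score = combined_metric(
    {"gold_answer": "Use the --force flag.",
     "gold_citations": ["cli.md#flags", "cli.md#safety"]},
    {"answer": "Use the --force flag.",
     "citations": ["cli.md#flags"]},
)
# 0.6 * 1.0 + 0.4 * 0.5 = 0.8
```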
Bootstrap phase: Ran pipeline on 300 training examples. 204 correct, 96 incorrect. Analysis of failures: 41 had poor retrieval (wrong docs), 33 had correct docs but wrong answer, 22 had correct answer but wrong citations.
Instruction discovery (Query Generator): MIPRO proposed: “Generate 3 search queries: one using the user’s exact terminology, one using official documentation terms, and one that captures the underlying intent. This ensures retrieval works even when users use colloquial terms.” This instruction was grounded in traces showing that user terminology often didn’t match documentation vocabulary.
Instruction discovery (Answer Generator): MIPRO proposed: “When citing documentation, include the section name and paragraph number. If the answer synthesizes information from multiple sections, list all sections and explain how they connect.” Grounded in traces where citation failures involved answers pulling from multiple sources without attribution.
Final result: The optimized pipeline scores 84% on the held-out test set — a 16-point improvement. Query-retrieval failures dropped from 41 to 12.
Pipeline: A multi-hop QA system that answers questions requiring 2–4 reasoning steps. Module 1: Hop Planner (question → sub-questions). Module 2: Sub-Question Answerer (sub-question, accumulated context → intermediate answer). Module 3: Synthesizer (intermediate answers → final answer).
Training data: 250 multi-hop questions from HotpotQA with ground-truth decompositions.
Metric: Exact-match on final answer AND decomposition quality (are sub-questions logically ordered?).
Baseline: Hand-written instructions: 52% exact match.
Key discovery from traces: The bootstrap revealed that the Hop Planner consistently generated good first sub-questions but failed on later hops because it didn’t account for information already gathered. Sub-question 3 would often repeat what sub-question 1 already answered.
MIPRO’s proposed instruction for Hop Planner: “Before generating the next sub-question, review what has already been established by previous intermediate answers. Each new sub-question should build on prior findings and move closer to the final answer. Never ask for information that was already answered.”
MIPRO’s proposed instruction for Synthesizer: “Combine the intermediate answers by tracing the logical chain from first hop to last. State the final answer, then show the reasoning path: [hop 1 finding] leads to [hop 2 finding] which establishes [final answer].”
Final result: 67% exact match — a 15-point improvement. Redundant sub-questions dropped from 34% to 8%.
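The "never re-ask what earlier hops established" failure can even be detected mechanically. A toy term-overlap heuristic — an illustration of the redundancy the proposed instruction targets, not anything MIPRO runs internally:

```python
import re

def terms(text):
    """Lowercase word set for a crude lexical comparison."""
    return set(re.findall(r"[a-z]+", text.lower()))

def is_redundant(new_subq, established, threshold=0.5):
    """Flag a sub-question whose terms mostly overlap an established finding."""
    t = terms(new_subq)
    return any(len(t & terms(fact)) / max(len(t), 1) >= threshold
               for fact in established)

established = ["the river flows through three countries"]
redundant = is_redundant("What countries does the river flow through?",
                         established)  # True: mostly restates the finding
fresh = is_redundant("What is the capital of the third country?",
                     established)      # False: asks for new information
```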
Pipeline: A 2-module system for classifying news articles into 8 categories with explanations. Module 1: Content Analyzer (article → key themes, entities, tone). Module 2: Classifier (themes, entities, tone → category, explanation, confidence).
Training data: 500 expert-labeled news articles with category justifications.
Metric: 70% weighted F1 + 30% explanation quality (does the explanation mention the right distinguishing features?).
Baseline: Generic “classify this article” instruction: 76% weighted F1.
Trace analysis insight: MIPRO discovered that the Content Analyzer consistently failed to distinguish “Business” from “Technology” articles when companies were involved; 80% of misclassifications fell in these two categories.
MIPRO’s proposed instruction for Content Analyzer: “Identify whether the primary subject is a business event (earnings, merger, market movement, executive change) or a technology development (product launch, research breakthrough, engineering achievement). When a technology company is involved, focus on whether the article discusses the business aspects or the technology itself.”
Bootstrapped demonstrations: MIPRO selected 4 demonstrations specifically targeting the Business/Technology confusion boundary, drawn from the gold traces where the pipeline correctly handled ambiguous cases.
Final result: 88% weighted F1 with optimized instructions and selected demonstrations — a 12-point gain. Business/Technology confusion dropped from 80% of errors to 22%.
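The error-concentration analysis behind this case is simple to reproduce from an error log; a sketch over synthetic (gold, predicted) pairs with a hypothetical helper:

```python
from collections import Counter

def pair_shares(errors):
    """Share of misclassifications per unordered (gold, predicted) pair."""
    counts = Counter(frozenset(e) for e in errors)
    total = len(errors)
    return {tuple(sorted(pair)): n / total for pair, n in counts.items()}

# Synthetic error log: 8 of 10 errors confuse Business with Technology.
errors = ([("Business", "Technology")] * 5
          + [("Technology", "Business")] * 3
          + [("Sports", "Politics")] * 2)
shares = pair_shares(errors)
# shares[("Business", "Technology")] == 0.8
```

Concentrations like this tell the proposer exactly which category boundary the new instruction must address.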
When to Use MIPRO
Best for data-driven instruction optimization in multi-stage programs
Perfect For
When your pipeline has multiple LLM calls and each stage needs independently tuned instructions that work together as a system.
When you’ve tried many prompt variations and can’t improve further — MIPRO’s data-grounded search explores instructions you wouldn’t think to write.
When you have a measurable quality metric (accuracy, F1, BLEU, etc.) and enough labeled data to run meaningful optimization trials.
When you need to understand why your LLM program fails — MIPRO’s trace analysis reveals specific failure modes and which module is responsible.
Skip It When
When your application is a single LLM call — MIPRO’s power comes from optimizing across multiple modules simultaneously.
When you have fewer than 50 labeled examples — MIPRO needs enough data to bootstrap meaningful traces and evaluate candidates.
When your program’s structure is still changing frequently — optimize after the architecture stabilizes, not during early experimentation.
Use Cases
Where MIPRO delivers the most value
Enterprise Search
Optimize RAG pipelines for internal knowledge bases where query vocabulary differs from document terminology.
Content Generation Pipelines
Optimize multi-stage content workflows: research, draft, fact-check, and format with per-stage instruction tuning.
Medical NLP
Optimize clinical text processing pipelines where accuracy and grounded reasoning are critical for patient safety.
Legal Document Analysis
Optimize extraction and reasoning pipelines for contracts, filings, and regulatory documents with high precision requirements.
Customer Support
Optimize ticket classification, routing, and response generation pipelines for faster resolution and higher satisfaction.
Financial Analysis
Optimize multi-stage pipelines for earnings analysis, risk assessment, and market research with numerical precision metrics.
Where MIPRO Fits
The evolution from manual prompting to automated instruction optimization
MIPRO consistently discovers instructions that humans wouldn’t write — not because they’re unnatural, but because they address failure modes that are invisible without trace analysis. When a human writes “be accurate,” MIPRO writes “distinguish between business events and technology developments when companies are involved” — because the traces showed that’s where accuracy actually breaks down. This is the power of grounding instruction proposals in execution evidence rather than human intuition.