Active Prompting
Not all questions are equally hard for a model. Active Prompting identifies the questions where the model is most uncertain — where its answers disagree across multiple samples — and focuses human annotation effort on crafting chain-of-thought reasoning for exactly those cases. The result: better few-shot examples with less human work.
Introduced: Active Prompting was published in 2023 by Diao et al. The technique applies active learning principles to chain-of-thought prompting. Rather than randomly selecting questions for human-annotated CoT demonstrations, Active Prompting queries the model multiple times on each question, measures the disagreement across answers (uncertainty), and selects the most uncertain questions for annotation. Humans then write detailed chain-of-thought reasoning for only those high-uncertainty cases, producing a focused, high-impact set of few-shot exemplars.
Modern LLM Status: The core insight — that annotation effort should target where the model struggles most — remains a sound principle for optimizing few-shot CoT examples. However, modern LLMs like Claude, GPT-4, and Gemini have significantly improved at zero-shot reasoning, reducing the need for carefully curated few-shot demonstrations in many scenarios. Active Prompting is most relevant today when you are building a production pipeline that relies on few-shot CoT and want to maximize the return on human annotation investment. For everyday prompting tasks, simpler approaches often suffice.
Let the Model Tell You Where It Needs Help
Traditional chain-of-thought prompting uses a fixed set of hand-picked examples as few-shot demonstrations. The problem? Those examples are chosen by humans guessing which questions are “representative” — not by measuring where the model actually struggles. A question that seems difficult to a human may be trivially easy for the model, while a seemingly simple question might consistently trip it up.
Active Prompting flips the selection process. Instead of guessing which examples to annotate, you ask the model itself. By generating multiple answers to each question (using temperature sampling) and measuring how much those answers disagree, you get a direct signal of model uncertainty. High disagreement means the model is unsure — and that is exactly where a well-crafted chain-of-thought demonstration will have the greatest impact.
Think of it like a teacher who, instead of preparing lesson plans based on what they assume students find hard, first gives a diagnostic quiz. The questions students answer inconsistently reveal the real knowledge gaps — and that is where the teacher focuses their instruction.
When you randomly select questions for CoT annotation, you inevitably waste effort on questions the model can already handle and miss the ones where it truly falters. Active Prompting’s uncertainty metric acts as a triage system: it surfaces the cases where the model’s internal reasoning is most fragile, ensuring that every annotated example addresses a real weakness rather than reinforcing what the model already knows. This targeted approach yields stronger few-shot performance with fewer annotated examples.
The Active Prompting Process
Four stages from uncertainty measurement to optimized demonstrations
Sample Multiple Answers per Question
For each question in your dataset, query the model multiple times using non-zero temperature settings to introduce variation. Each query produces a different reasoning path and potentially a different final answer. Typically, five to ten samples per question provide a reliable signal without excessive cost.
Question: “A store had 47 apples. It sold 19 on Monday and received a shipment of 32 on Tuesday. How many apples does it have?” — Generate 8 independent responses at temperature 0.7, yielding answers like: 60, 60, 60, 56, 60, 60, 60, 60.
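The sampling step above can be sketched in a few lines of Python. Here `query_model` is a hypothetical placeholder for a real LLM API call made at temperature around 0.7; to keep the sketch runnable end to end, it simply replays the canned answers from the apple example.

```python
# Hypothetical stand-in for a real LLM call. A production version would call
# your model's API with temperature ~0.7 and parse out the final answer.
# Here it replays the canned answers from the apple example above.
CANNED = [60, 60, 60, 56, 60, 60, 60, 60]

def query_model(question: str, sample_idx: int) -> int:
    return CANNED[sample_idx % len(CANNED)]

def sample_answers(question: str, k: int = 8) -> list[int]:
    """Query the model k times and collect the final answers."""
    return [query_model(question, i) for i in range(k)]

answers = sample_answers(
    "A store had 47 apples. It sold 19 on Monday and received "
    "a shipment of 32 on Tuesday. How many apples does it have?",
    k=8,
)
print(answers)  # [60, 60, 60, 56, 60, 60, 60, 60]
```

In a real pipeline the only change is swapping `query_model` for an actual API call; everything downstream operates on the list of sampled answers.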
Measure Uncertainty and Disagreement
Calculate how much the model’s answers disagree across samples. The original paper uses metrics like disagreement ratio (fraction of answers that differ from the majority) or entropy across answer distributions. Questions where the model produces consistent answers have low uncertainty; questions that split the model reveal genuine difficulty.
The apple question above shows 7/8 agreement (low uncertainty — the model mostly gets it right). But a question about probability where the model splits 4-4 between two answers has high uncertainty — this is a prime candidate for annotation.
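Both metrics mentioned above are straightforward to compute from the sampled answers. This sketch implements the disagreement ratio and Shannon entropy, using the apple question and a hypothetical 4-4 probability split as inputs:

```python
import math
from collections import Counter

def disagreement(answers):
    """Fraction of samples that differ from the majority answer."""
    counts = Counter(answers)
    majority = counts.most_common(1)[0][1]
    return 1 - majority / len(answers)

def entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in Counter(answers).values())

apple = [60, 60, 60, 56, 60, 60, 60, 60]                    # mostly consistent
probability = ["1/6", "1/6", "1/3", "1/3",                  # even 4-4 split
               "1/6", "1/3", "1/6", "1/3"]

print(disagreement(apple))        # 0.125 -> low uncertainty
print(disagreement(probability))  # 0.5   -> high uncertainty
print(entropy(probability))       # 1.0 bit, the maximum for two answers
```

Either metric works as the ranking signal; entropy is more sensitive when answers fragment across many distinct values rather than splitting between two.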
Select and Annotate High-Uncertainty Questions
Rank all questions by their uncertainty score and select the top-k most uncertain ones. Human annotators then write detailed chain-of-thought reasoning for these specific questions — step-by-step explanations that show exactly how to arrive at the correct answer. This is the only step that requires human labor, and it is concentrated where it matters most.
From a pool of 200 math questions, the top 8 most uncertain are selected. An annotator writes: “Step 1: Identify the total number of combinations. Step 2: Calculate favorable outcomes. Step 3: Divide to get the probability…” for each one.
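The ranking and selection step reduces to a sort over uncertainty scores. A minimal sketch, using illustrative question IDs and sampled answers (not from the paper):

```python
from collections import Counter

def disagreement(answers):
    """Fraction of samples that differ from the majority answer."""
    counts = Counter(answers)
    return 1 - counts.most_common(1)[0][1] / len(answers)

def select_for_annotation(sampled, k):
    """Rank questions by disagreement and return the k most uncertain."""
    ranked = sorted(sampled, key=lambda q: disagreement(sampled[q]), reverse=True)
    return ranked[:k]

# Illustrative pool: question ID -> sampled answers
sampled = {
    "q_apples":      [60, 60, 60, 56, 60, 60, 60, 60],  # disagreement 0.125
    "q_probability": ["1/6"] * 4 + ["1/3"] * 4,          # disagreement 0.5
    "q_percent":     [15, 15, 15, 15, 15, 15, 15, 15],   # disagreement 0.0
}
print(select_for_annotation(sampled, k=1))  # ['q_probability']
```

Only the questions this function returns go to human annotators; the rest of the pool costs nothing beyond the initial sampling.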
Use Annotated Examples as Few-Shot Demonstrations
The annotated chain-of-thought examples become your few-shot demonstrations. When the model encounters new questions, these carefully targeted examples guide its reasoning through the exact patterns it previously struggled with. Because the demonstrations address the model’s actual weak points, they produce stronger performance gains than randomly selected examples of equal quantity.
The 8 annotated CoT examples are prepended as few-shot demonstrations. On the same test set, the model’s accuracy jumps from 78% (with random CoT examples) to 86% (with uncertainty-selected CoT examples) — same annotation budget, better results.
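Assembling the final prompt is simple string construction: each annotated exemplar becomes a question-reasoning-answer block, followed by the new question. The exemplar content below is illustrative, not from the paper:

```python
def build_prompt(exemplars, question):
    """Prepend uncertainty-selected CoT exemplars to a new question."""
    parts = []
    for q, cot, answer in exemplars:
        parts.append(f"Q: {q}\nA: {cot} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# One illustrative annotated exemplar (a real set would hold the top-k)
exemplars = [
    ("What is the probability of rolling a 3 on a fair die?",
     "A die has 6 equally likely faces and exactly one shows 3, "
     "so the probability is 1 favorable outcome out of 6.",
     "1/6"),
]
prompt = build_prompt(
    exemplars,
    "What is the probability of drawing a heart from a standard deck?",
)
print(prompt)
```

The trailing "A:" cues the model to continue with its own chain of thought, mirroring the reasoning style of the demonstrations.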
See the Difference
Why targeted annotation outperforms random example selection
Direct CoT (Random Examples)
Manually pick a few questions that seem representative. Write chain-of-thought annotations for them. Use those as few-shot demonstrations for all new questions.
Some examples target problems the model can already solve, while real failure modes go unaddressed. Annotation effort is spread thin across easy and hard cases alike, with no way to know which examples actually help.

Active Prompting
Sample multiple answers per question. Measure disagreement. Rank by uncertainty. Annotate CoT reasoning only for the questions where the model disagrees with itself most. Use those targeted examples as demonstrations.
Every annotated example addresses a verified model weakness. The demonstrations teach the model exactly the reasoning patterns it was missing. Same annotation budget produces measurably higher accuracy on the target task.
Active Prompting in Action
See how uncertainty-driven annotation improves reasoning quality
Question: “A farmer has 3 fields. The first yields 240 bushels per acre across 5 acres. The second yields 180 bushels per acre across 8 acres. The third yields 310 bushels per acre across 3 acres. What is the weighted average yield per acre?”
8 sampled answers: 222.5, 230.6, 222.5, 245.0, 222.5, 230.6, 222.5, 230.6
Disagreement: High — three distinct answers appear, indicating the model inconsistently handles weighted averaging versus simple averaging.
Step 1: Calculate total bushels per field.
Field 1: 240 × 5 = 1,200 bushels
Field 2: 180 × 8 = 1,440 bushels
Field 3: 310 × 3 = 930 bushels
Step 2: Sum all bushels: 1,200 + 1,440 + 930 = 3,570 bushels
Step 3: Sum all acres: 5 + 8 + 3 = 16 acres
Step 4: Weighted average = total bushels / total acres = 3,570 / 16 = 223.125 bushels per acre
Key insight: A weighted average divides total output by total units, not by averaging the per-unit rates directly. The answer is 223.125 bushels per acre.
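The worked steps can be verified numerically, and the contrast with the naive approach (averaging the per-acre rates directly) shows why the model's inconsistent samples land on different wrong values:

```python
fields = [(240, 5), (180, 8), (310, 3)]  # (bushels per acre, acres)

# Correct: weighted average = total bushels / total acres
total_bushels = sum(rate * acres for rate, acres in fields)  # 3,570
total_acres = sum(acres for _, acres in fields)              # 16
weighted_avg = total_bushels / total_acres

# Incorrect shortcut: averaging the per-acre rates directly
naive_avg = sum(rate for rate, _ in fields) / len(fields)    # ~243.3

print(weighted_avg)  # 223.125
```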
Question: “All managers attend the Monday meeting. Some engineers are managers. Does it follow that some engineers attend the Monday meeting?”
8 sampled answers: Yes, Yes, No, Yes, No, Yes, Yes, No
Disagreement: High — the model splits 5/3 between “Yes” and “No,” revealing confusion about syllogistic reasoning with quantifiers.
Step 1: Identify the premises.
Premise A: All managers attend the Monday meeting.
Premise B: Some engineers are managers.
Step 2: Apply the syllogism.
From Premise B, there exists at least one person who is both an engineer and a manager. Call this person X.
Step 3: Apply Premise A to X.
Since X is a manager, and all managers attend the Monday meeting, X attends the Monday meeting.
Step 4: Draw the conclusion.
Since X is an engineer who attends the Monday meeting, it follows that some engineers attend the Monday meeting.
Answer: Yes. This is a valid syllogism — the intersection of “some engineers are managers” with “all managers attend” guarantees that those engineer-managers attend.
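The syllogism can also be sanity-checked with sets: any world satisfying both premises forces the conclusion. The names below are an arbitrary illustrative model, not part of the original problem:

```python
# An arbitrary world satisfying both premises:
managers = {"alice", "bob"}
engineers = {"bob", "carol"}       # Premise B: some engineers are managers
attendees = managers | {"dana"}    # Premise A: all managers attend

# Premise B guarantees a non-empty engineer-manager intersection...
witness = engineers & managers
assert witness                     # here: {"bob"}
# ...and Premise A guarantees every member of it attends.
assert witness <= attendees
# Hence some engineers attend the meeting.
assert engineers & attendees
print("Conclusion holds:", engineers & attendees)
```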
Question: “Sarah put a frozen pizza in the oven and set a timer for 15 minutes. She then went to water her garden. When she came back inside, the kitchen was filled with smoke. What most likely happened?”
8 sampled answers: The pizza burned (3), She forgot to turn on the oven and something else caught fire (2), The timer was wrong and she was outside too long (2), The oven malfunctioned (1)
Disagreement: High — four distinct explanations with no clear majority, indicating the model cannot consistently prioritize the most likely causal chain.
Step 1: Identify the key facts.
A frozen pizza was placed in the oven. A 15-minute timer was set. Sarah went outside. She returned to a smoke-filled kitchen.
Step 2: Consider the most common cause of kitchen smoke.
Smoke in a kitchen with an active oven most commonly results from food burning or overheating. The oven was on and contained a pizza.
Step 3: Evaluate the timeline.
Watering a garden could easily take longer than 15 minutes, especially if Sarah did not hear the timer while outside. If she exceeded the cooking time significantly, the pizza would burn.
Step 4: Assess alternative explanations.
Oven malfunction is possible but less likely than the simplest explanation. Something else catching fire has no supporting evidence in the scenario.
Answer: The most likely explanation is that Sarah spent longer watering her garden than she realized, missed the timer, and the pizza burned in the oven. The simplest causal chain — food left in a hot oven too long produces smoke — is the most probable.
When to Use Active Prompting
Best for optimizing few-shot demonstration selection at scale
Perfect For
When building systems that rely on few-shot chain-of-thought demonstrations at scale, Active Prompting ensures your annotation budget is spent on the examples that will improve accuracy the most.
When you can only afford to annotate a small number of examples, uncertainty-driven selection guarantees that each annotation targets a genuine model weakness rather than a question it already handles well.
For domain-specific tasks like medical reasoning, legal analysis, or financial calculations where correctness is critical and you need to systematically identify and address failure modes.
Even if you do not use the full annotation pipeline, the uncertainty sampling step alone reveals which question types your model struggles with — valuable intelligence for prompt design.
Skip It When
If your model already performs well on the task without any few-shot examples, the overhead of uncertainty sampling and annotation is unnecessary. Modern LLMs handle many tasks without demonstrations.
Active Prompting is designed for systematic optimization across a dataset. For single questions or ad-hoc use, simply writing a good prompt or using zero-shot CoT is far more practical.
Active Prompting relies on measuring answer disagreement as a proxy for uncertainty. For tasks with no single correct answer — like creative writing, brainstorming, or opinion generation — disagreement is expected, not a signal of weakness.
Use Cases
Where Active Prompting delivers the most value
Medical Question Answering
Identify which clinical reasoning questions the model answers inconsistently, then annotate chain-of-thought demonstrations for those specific diagnostic patterns to improve reliability in healthcare AI systems.
Financial Calculation Pipelines
Surface the types of financial calculations where the model disagrees with itself — compound interest edge cases, tax bracket transitions, amortization schedules — and build targeted few-shot examples for each.
Standardized Test Preparation
Run uncertainty analysis across a bank of practice questions to find the problem types that need the most reasoning support, then create targeted CoT exemplars for tutoring systems.
Legal Contract Analysis
Discover which contract clause interpretations produce inconsistent model outputs, then annotate reasoning chains that clarify the legal logic for those ambiguous clause types.
Quality Assurance Automation
In automated testing pipelines, use uncertainty sampling to identify the edge cases where the model’s test generation or bug classification is unreliable, then focus annotation on those failure patterns.
Safety-Critical Classification
For content moderation or safety screening, identify the borderline cases where the model is most uncertain about classification, then provide clear CoT demonstrations that establish consistent decision boundaries.
Where Active Prompting Fits
Active Prompting bridges static CoT and adaptive example selection
Active Prompting and Self-Consistency are natural complements. Active Prompting uses disagreement across samples to select better demonstrations, while Self-Consistency uses disagreement across samples to select better final answers at inference time. You can use Active Prompting to build your demonstration set, then apply Self-Consistency when running those demonstrations to squeeze maximum accuracy from both the example selection and the answer selection stages.
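The inference-time half of that pairing, Self-Consistency, is just a majority vote over sampled final answers. A minimal sketch with illustrative sampled values:

```python
from collections import Counter

def self_consistency(answers):
    """Return the modal answer across samples (Self-Consistency voting)."""
    return Counter(answers).most_common(1)[0][0]

# At inference time, the prompt built from Active-Prompting-selected
# exemplars is itself sampled several times, and the modal answer wins.
sampled_final_answers = ["223.125", "223.125", "230.6", "223.125", "223.125"]
print(self_consistency(sampled_final_answers))  # 223.125
```

The two techniques reuse the same sampling machinery: disagreement selects demonstrations before deployment, and majority voting selects answers at inference.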