Active Prompting
Not all questions are equally hard for a model. Active Prompting identifies the questions where the model is most uncertain — where its answers disagree across multiple samples — and focuses human annotation effort on crafting chain-of-thought reasoning for exactly those cases. The result: better few-shot examples with less human work.
Introduced: Active Prompting was published in 2023 by Diao et al. The technique applies active learning principles to chain-of-thought prompting. Rather than randomly selecting questions for human-annotated CoT demonstrations, Active Prompting queries the model multiple times on each question, measures the disagreement across answers (uncertainty), and selects the most uncertain questions for annotation. Humans then write detailed chain-of-thought reasoning for only those high-uncertainty cases, producing a focused, high-impact set of few-shot exemplars.
Modern LLM Status: The core insight — that annotation effort should target where the model struggles most — remains a sound principle for optimizing few-shot CoT examples. However, modern LLMs like Claude, GPT-4, and Gemini have significantly improved at zero-shot reasoning, reducing the need for carefully curated few-shot demonstrations in many scenarios. Active Prompting is most relevant today when you are building a production pipeline that relies on few-shot CoT and want to maximize the return on human annotation investment. For everyday prompting tasks, simpler approaches often suffice.
Let the Model Tell You Where It Needs Help
Traditional chain-of-thought prompting uses a fixed set of hand-picked examples as few-shot demonstrations. The problem? Those examples are chosen by humans guessing which questions are “representative” — not by measuring where the model actually struggles. A question that seems difficult to a human may be trivially easy for the model, while a seemingly simple question might consistently trip it up.
Active Prompting flips the selection process. Instead of guessing which examples to annotate, you ask the model itself. By generating multiple answers to each question (using temperature sampling) and measuring how much those answers disagree, you get a direct signal of model uncertainty. High disagreement means the model is unsure — and that is exactly where a well-crafted chain-of-thought demonstration will have the greatest impact.
Think of it like a teacher who, instead of preparing lesson plans based on what they assume students find hard, first gives a diagnostic quiz. The questions students answer inconsistently reveal the real knowledge gaps — and that is where the teacher focuses their instruction.
When you randomly select questions for CoT annotation, you inevitably waste effort on questions the model can already handle and miss the ones where it truly falters. Active Prompting’s uncertainty metric acts as a triage system: it surfaces the cases where the model’s internal reasoning is most fragile, ensuring that every annotated example addresses a real weakness rather than reinforcing what the model already knows. This targeted approach yields stronger few-shot performance with fewer annotated examples.
The Active Prompting Process
Four stages from uncertainty measurement to optimized demonstrations
Sample Multiple Answers per Question
For each question in your dataset, query the model multiple times using non-zero temperature settings to introduce variation. Each query produces a different reasoning path and potentially a different final answer. Typically, five to ten samples per question provide a reliable signal without excessive cost.
Question: “A store had 47 apples. It sold 19 on Monday and received a shipment of 32 on Tuesday. How many apples does it have?” — Generate 8 independent responses at temperature 0.7, yielding answers like: 60, 60, 60, 56, 60, 60, 60, 60.
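The sampling step above can be sketched in a few lines of Python. Here `query_model` is a hypothetical placeholder for a real LLM API call made at temperature around 0.7; to keep the sketch runnable end to end, it simply replays the canned answers from the apple example.

```python
# Hypothetical stand-in for a real LLM call. A production version would call
# your model's API with temperature ~0.7 and parse out the final answer.
# Here it replays the canned answers from the apple example above.
CANNED = [60, 60, 60, 56, 60, 60, 60, 60]

def query_model(question: str, sample_idx: int) -> int:
    return CANNED[sample_idx % len(CANNED)]

def sample_answers(question: str, k: int = 8) -> list[int]:
    """Query the model k times and collect the final answers."""
    return [query_model(question, i) for i in range(k)]

answers = sample_answers(
    "A store had 47 apples. It sold 19 on Monday and received "
    "a shipment of 32 on Tuesday. How many apples does it have?",
    k=8,
)
print(answers)  # [60, 60, 60, 56, 60, 60, 60, 60]
```

In a real pipeline the only change is swapping `query_model` for an actual API call; everything downstream operates on the list of sampled answers.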
Measure Uncertainty and Disagreement
Calculate how much the model’s answers disagree across samples. The original paper uses metrics like disagreement ratio (fraction of answers that differ from the majority) or entropy across answer distributions. Questions where the model produces consistent answers have low uncertainty; questions that split the model reveal genuine difficulty.
The apple question above shows 7/8 agreement (low uncertainty — the model mostly gets it right). But a question about probability where the model splits 4-4 between two answers has high uncertainty — this is a prime candidate for annotation.
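Both metrics mentioned above are straightforward to compute from the sampled answers. This sketch implements the disagreement ratio and Shannon entropy, using the apple question and a hypothetical 4-4 probability split as inputs:

```python
import math
from collections import Counter

def disagreement(answers):
    """Fraction of samples that differ from the majority answer."""
    counts = Counter(answers)
    majority = counts.most_common(1)[0][1]
    return 1 - majority / len(answers)

def entropy(answers):
    """Shannon entropy (bits) of the empirical answer distribution."""
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in Counter(answers).values())

apple = [60, 60, 60, 56, 60, 60, 60, 60]                    # mostly consistent
probability = ["1/6", "1/6", "1/3", "1/3",                  # even 4-4 split
               "1/6", "1/3", "1/6", "1/3"]

print(disagreement(apple))        # 0.125 -> low uncertainty
print(disagreement(probability))  # 0.5   -> high uncertainty
print(entropy(probability))       # 1.0 bit, the maximum for two answers
```

Either metric works as the ranking signal; entropy is more sensitive when answers fragment across many distinct values rather than splitting between two.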
Select and Annotate High-Uncertainty Questions
Rank all questions by their uncertainty score and select the top-k most uncertain ones. Human annotators then write detailed chain-of-thought reasoning for these specific questions — step-by-step explanations that show exactly how to arrive at the correct answer. This is the only step that requires human labor, and it is concentrated where it matters most.
From a pool of 200 math questions, the top 8 most uncertain are selected. An annotator writes: “Step 1: Identify the total number of combinations. Step 2: Calculate favorable outcomes. Step 3: Divide to get the probability…” for each one.
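The ranking and selection step reduces to a sort over uncertainty scores. A minimal sketch, using illustrative question IDs and sampled answers (not from the paper):

```python
from collections import Counter

def disagreement(answers):
    """Fraction of samples that differ from the majority answer."""
    counts = Counter(answers)
    return 1 - counts.most_common(1)[0][1] / len(answers)

def select_for_annotation(sampled, k):
    """Rank questions by disagreement and return the k most uncertain."""
    ranked = sorted(sampled, key=lambda q: disagreement(sampled[q]), reverse=True)
    return ranked[:k]

# Illustrative pool: question ID -> sampled answers
sampled = {
    "q_apples":      [60, 60, 60, 56, 60, 60, 60, 60],  # disagreement 0.125
    "q_probability": ["1/6"] * 4 + ["1/3"] * 4,          # disagreement 0.5
    "q_percent":     [15, 15, 15, 15, 15, 15, 15, 15],   # disagreement 0.0
}
print(select_for_annotation(sampled, k=1))  # ['q_probability']
```

Only the questions this function returns go to human annotators; the rest of the pool costs nothing beyond the initial sampling.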
Use Annotated Examples as Few-Shot Demonstrations
The annotated chain-of-thought examples become your few-shot demonstrations. When the model encounters new questions, these carefully targeted examples guide its reasoning through the exact patterns it previously struggled with. Because the demonstrations address the model’s actual weak points, they produce stronger performance gains than randomly selected examples of equal quantity.
The 8 annotated CoT examples are prepended as few-shot demonstrations. On the same test set, the model’s accuracy jumps from 78% (with random CoT examples) to 86% (with uncertainty-selected CoT examples) — same annotation budget, better results.
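Assembling the final prompt is simple string construction: each annotated exemplar becomes a question-reasoning-answer block, followed by the new question. The exemplar content below is illustrative, not from the paper:

```python
def build_prompt(exemplars, question):
    """Prepend uncertainty-selected CoT exemplars to a new question."""
    parts = []
    for q, cot, answer in exemplars:
        parts.append(f"Q: {q}\nA: {cot} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# One illustrative annotated exemplar (a real set would hold the top-k)
exemplars = [
    ("What is the probability of rolling a 3 on a fair die?",
     "A die has 6 equally likely faces and exactly one shows 3, "
     "so the probability is 1 favorable outcome out of 6.",
     "1/6"),
]
prompt = build_prompt(
    exemplars,
    "What is the probability of drawing a heart from a standard deck?",
)
print(prompt)
```

The trailing "A:" cues the model to continue with its own chain of thought, mirroring the reasoning style of the demonstrations.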
See the Difference
Why targeted annotation outperforms random example selection
Direct CoT (Random Examples)
Manually pick a few questions that seem representative. Write chain-of-thought annotations for them. Use those as few-shot demonstrations for all new questions.
Some examples target problems the model can already solve, while real failure modes go unaddressed. Annotation effort is spread thin across easy and hard cases alike, with no way to know which examples actually help.

Active Prompting
Sample multiple answers per question. Measure disagreement. Rank by uncertainty. Annotate CoT reasoning only for the questions where the model disagrees with itself most. Use those targeted examples as demonstrations.
Every annotated example addresses a verified model weakness. The demonstrations teach the model exactly the reasoning patterns it was missing. Same annotation budget produces measurably higher accuracy on the target task.
Active Prompting in Action
See how uncertainty-driven annotation improves reasoning quality
Question: “A farmer has 3 fields. The first yields 240 bushels per acre across 5 acres. The second yields 180 bushels per acre across 8 acres. The third yields 310 bushels per acre across 3 acres. What is the weighted average yield per acre?”
8 sampled answers: 222.5, 230.6, 222.5, 245.0, 222.5, 230.6, 222.5, 230.6
Disagreement: High — three distinct answers appear, indicating the model inconsistently handles weighted averaging versus simple averaging.
Step 1: Calculate total bushels per field.
Field 1: 240 × 5 = 1,200 bushels
Field 2: 180 × 8 = 1,440 bushels
Field 3: 310 × 3 = 930 bushels
Step 2: Sum all bushels: 1,200 + 1,440 + 930 = 3,570 bushels
Step 3: Sum all acres: 5 + 8 + 3 = 16 acres
Step 4: Weighted average = total bushels / total acres = 3,570 / 16 = 223.125 bushels per acre
Key insight: A weighted average divides total output by total units, not by averaging the per-unit rates directly. The answer is 223.125 bushels per acre.
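The worked steps can be verified numerically, and the contrast with the naive approach (averaging the per-acre rates directly) shows why the model's inconsistent samples land on different wrong values:

```python
fields = [(240, 5), (180, 8), (310, 3)]  # (bushels per acre, acres)

# Correct: weighted average = total bushels / total acres
total_bushels = sum(rate * acres for rate, acres in fields)  # 3,570
total_acres = sum(acres for _, acres in fields)              # 16
weighted_avg = total_bushels / total_acres

# Incorrect shortcut: averaging the per-acre rates directly
naive_avg = sum(rate for rate, _ in fields) / len(fields)    # ~243.3

print(weighted_avg)  # 223.125
```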
Question: “All managers attend the Monday meeting. Some engineers are managers. Does it follow that some engineers attend the Monday meeting?”
8 sampled answers: Yes, Yes, No, Yes, No, Yes, Yes, No
Disagreement: High — the model splits 5/3 between “Yes” and “No,” revealing confusion about syllogistic reasoning with quantifiers.
Step 1: Identify the premises.
Premise A: All managers attend the Monday meeting.
Premise B: Some engineers are managers.
Step 2: Apply the syllogism.
From Premise B, there exists at least one person who is both an engineer and a manager. Call this person X.
Step 3: Apply Premise A to X.
Since X is a manager, and all managers attend the Monday meeting, X attends the Monday meeting.
Step 4: Draw the conclusion.
Since X is an engineer who attends the Monday meeting, it follows that some engineers attend the Monday meeting.
Answer: Yes. This is a valid syllogism — the intersection of “some engineers are managers” with “all managers attend” guarantees that those engineer-managers attend.
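The syllogism can also be sanity-checked with sets: any world satisfying both premises forces the conclusion. The names below are an arbitrary illustrative model, not part of the original problem:

```python
# An arbitrary world satisfying both premises:
managers = {"alice", "bob"}
engineers = {"bob", "carol"}       # Premise B: some engineers are managers
attendees = managers | {"dana"}    # Premise A: all managers attend

# Premise B guarantees a non-empty engineer-manager intersection...
witness = engineers & managers
assert witness                     # here: {"bob"}
# ...and Premise A guarantees every member of it attends.
assert witness <= attendees
# Hence some engineers attend the meeting.
assert engineers & attendees
print("Conclusion holds:", engineers & attendees)
```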
Question: “Sarah put a frozen pizza in the oven and set a timer for 15 minutes. She then went to water her garden. When she came back inside, the kitchen was filled with smoke. What most likely happened?”
8 sampled answers: The pizza burned (3), She forgot to turn on the oven and something else caught fire (2), The timer was wrong and she was outside too long (2), The oven malfunctioned (1)
Disagreement: High — four distinct explanations with no clear majority, indicating the model cannot consistently prioritize the most likely causal chain.
Step 1: Identify the key facts.
A frozen pizza was placed in the oven. A 15-minute timer was set. Sarah went outside. She returned to a smoke-filled kitchen.
Step 2: Consider the most common cause of kitchen smoke.
Smoke in a kitchen with an active oven most commonly results from food burning or overheating. The oven was on and contained a pizza.
Step 3: Evaluate the timeline.
Watering a garden could easily take longer than 15 minutes, especially if Sarah did not hear the timer while outside. If she exceeded the cooking time significantly, the pizza would burn.
Step 4: Assess alternative explanations.
Oven malfunction is possible but less likely than the simplest explanation. Something else catching fire has no supporting evidence in the scenario.
Answer: The most likely explanation is that Sarah spent longer watering her garden than she realized, missed the timer, and the pizza burned in the oven. The simplest causal chain — food left in a hot oven too long produces smoke — is the most probable.
When to Use Active Prompting
Best for optimizing few-shot demonstration selection at scale
Perfect For
When building systems that rely on few-shot chain-of-thought demonstrations at scale, Active Prompting ensures your annotation budget is spent on the examples that will improve accuracy the most.
When you can only afford to annotate a small number of examples, uncertainty-driven selection guarantees that each annotation targets a genuine model weakness rather than a question it already handles well.
For domain-specific tasks like medical reasoning, legal analysis, or financial calculations where correctness is critical and you need to systematically identify and address failure modes.
Even if you do not use the full annotation pipeline, the uncertainty sampling step alone reveals which question types your model struggles with — valuable intelligence for prompt design.
Skip It When
If your model already performs well on the task without any few-shot examples, the overhead of uncertainty sampling and annotation is unnecessary. Modern LLMs handle many tasks without demonstrations.
Active Prompting is designed for systematic optimization across a dataset. For single questions or ad-hoc use, simply writing a good prompt or using zero-shot CoT is far more practical.
Active Prompting relies on measuring answer disagreement as a proxy for uncertainty. For tasks with no single correct answer — like creative writing, brainstorming, or opinion generation — disagreement is expected, not a signal of weakness.
Use Cases
Where Active Prompting delivers the most value
Medical Question Answering
Identify which clinical reasoning questions the model answers inconsistently, then annotate chain-of-thought demonstrations for those specific diagnostic patterns to improve reliability in healthcare AI systems.
Financial Calculation Pipelines
Surface the types of financial calculations where the model disagrees with itself — compound interest edge cases, tax bracket transitions, amortization schedules — and build targeted few-shot examples for each.
Standardized Test Preparation
Run uncertainty analysis across a bank of practice questions to find the problem types that need the most reasoning support, then create targeted CoT exemplars for tutoring systems.
Legal Contract Analysis
Discover which contract clause interpretations produce inconsistent model outputs, then annotate reasoning chains that clarify the legal logic for those ambiguous clause types.
Quality Assurance Automation
In automated testing pipelines, use uncertainty sampling to identify the edge cases where the model’s test generation or bug classification is unreliable, then focus annotation on those failure patterns.
Safety-Critical Classification
For content moderation or safety screening, identify the borderline cases where the model is most uncertain about classification, then provide clear CoT demonstrations that establish consistent decision boundaries.
Where Active Prompting Fits
Active Prompting bridges static CoT and adaptive example selection
Active Prompting and Self-Consistency are natural complements. Active Prompting uses disagreement across samples to select better demonstrations, while Self-Consistency uses disagreement across samples to select better final answers at inference time. You can use Active Prompting to build your demonstration set, then apply Self-Consistency when running those demonstrations to squeeze maximum accuracy from both the example selection and the answer selection stages.
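The inference-time half of that pairing, Self-Consistency, is just a majority vote over sampled final answers. A minimal sketch with illustrative sampled values:

```python
from collections import Counter

def self_consistency(answers):
    """Return the modal answer across samples (Self-Consistency voting)."""
    return Counter(answers).most_common(1)[0][0]

# At inference time, the prompt built from Active-Prompting-selected
# exemplars is itself sampled several times, and the modal answer wins.
sampled_final_answers = ["223.125", "223.125", "230.6", "223.125", "223.125"]
print(self_consistency(sampled_final_answers))  # 223.125
```

The two techniques reuse the same sampling machinery: disagreement selects demonstrations before deployment, and majority voting selects answers at inference.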