Demonstration Ensembling
Any single set of few-shot examples carries its own biases — the phrasing, the order, the edge cases it happens to cover. Demonstration Ensembling neutralizes that fragility by running the same query across multiple different example sets and combining the results, producing predictions that are more robust, more consistent, and less dependent on the luck of which demonstrations you happened to pick.
Introduced: Demonstration Ensembling emerged from research in 2022 showing that LLM outputs can be surprisingly sensitive to which few-shot examples appear in the prompt. Two different sets of demonstrations — both perfectly valid — can produce different answers to the same question. This technique addresses that instability head-on: instead of betting on a single example set, you create multiple distinct sets from your available pool, run each one independently, and aggregate the results through majority voting or averaging. The ensemble approach borrows a proven principle from classical machine learning, where combining multiple weak learners consistently outperforms any single model.
Modern LLM Status: The principle of ensembling across different contexts remains a powerful reliability strategy, now commonly applied in evaluation pipelines and production systems where consistency matters more than speed. Modern implementations often combine demonstration ensembling with self-consistency sampling, temperature variation, or retrieval-augmented example selection. In high-stakes domains like medical triage, content moderation, and financial classification, ensembling across example sets is a standard practice for reducing variance and catching edge cases that any single prompt configuration might miss.
Why Varying Examples Reduces Bias
Every set of few-shot examples encodes implicit assumptions. If your three examples all happen to be short responses, the model learns “keep it brief.” If they all handle straightforward cases, the model may stumble on ambiguity. A single demonstration set is like asking one expert for their opinion — useful, but inherently limited by that expert’s perspective and experience.
Demonstration Ensembling treats examples like a jury rather than a single judge. By assembling multiple distinct sets of demonstrations — each drawing from different parts of your example pool — you expose the model to a broader range of patterns, edge cases, and response styles. When you aggregate the outputs, the biases of any individual set wash out. What survives the vote is the signal that persists across all contexts.
Think of it like surveying a landscape from multiple vantage points. Each viewpoint reveals features that others miss, and the composite picture is richer and more accurate than any single perspective could provide.
Research has shown that simply reordering the same few-shot examples can swing classification accuracy by 10–30 percentage points. Swapping one example for another from the same category can flip the model’s answer entirely. This isn’t a flaw in the model — it’s a consequence of how in-context learning works. Demonstration Ensembling doesn’t try to find the “perfect” example set (which may not exist). Instead, it embraces the variance and uses aggregation to extract the stable, reliable signal underneath.
The Demonstration Ensembling Process
Four stages from example pool to aggregated prediction
Create N Distinct Example Sets
Start with a pool of available few-shot examples and sample N different subsets from it. Each set should contain a representative mix of cases but draw from different examples. The goal is diversity — each set should expose the model to a slightly different slice of your problem space, varying in difficulty, phrasing, and edge case coverage.
From a pool of 20 labeled sentiment examples, create 5 sets of 3 examples each. Set A might include a sarcastic review, a straightforward positive, and a mixed sentiment. Set B draws a formal complaint, an enthusiastic endorsement, and a neutral description.
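The subset creation above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the pool contents and the fixed seed are stand-ins, and in practice you might sample stratified by label or difficulty rather than uniformly.

```python
import random

def make_example_sets(pool, n_sets=5, set_size=3, seed=42):
    """Sample n_sets distinct few-shot demonstration subsets from a labeled pool."""
    rng = random.Random(seed)
    sets, seen = [], set()
    while len(sets) < n_sets:
        idx = tuple(sorted(rng.sample(range(len(pool)), set_size)))
        if idx not in seen:                 # keep the subsets distinct
            seen.add(idx)
            sets.append([pool[i] for i in idx])
    return sets

# Stand-in pool: 20 labeled sentiment examples
pool = [(f"review text {i}", "Positive" if i % 2 else "Negative") for i in range(20)]
sets = make_example_sets(pool)
```

Sampling by index and deduplicating on the sorted index tuple guarantees no two sets contain exactly the same examples.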
Run the Same Query with Each Set Independently
Submit your target query to the model N times, each time paired with a different example set. The query itself stays identical — only the demonstrations change. Each prompt independently primes the model with its unique context, producing a response shaped by that particular set of examples. These runs can execute in parallel for efficiency.
The query “Classify this review: ‘The battery life is decent but the screen cracks too easily’” is sent 5 times, each preceded by a different set of 3 labeled examples. Each prompt independently produces a sentiment classification.
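A sketch of this fan-out step, assuming a hypothetical `call_llm` function as a stand-in for whatever model client you use. The demonstration texts are invented for illustration; only the structure matters: one fixed query, N prompts that differ solely in their few-shot block, executed in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def build_prompt(demos, query):
    """Identical query every time; only the few-shot block changes."""
    shots = "\n".join(f"Review: {text}\nLabel: {label}" for text, label in demos)
    return f"{shots}\nReview: {query}\nLabel:"

def call_llm(prompt):
    # Hypothetical stand-in for a real model API call; swap in your client here.
    return "Mixed"

query = "The battery life is decent but the screen cracks too easily"
example_sets = [
    [("Love it!", "Positive"), ("Never again.", "Negative"), ("Fine, I guess.", "Mixed")],
    [("Five stars.", "Positive"), ("Waste of money.", "Negative"), ("Good but pricey.", "Mixed")],
]
prompts = [build_prompt(s, query) for s in example_sets]
with ThreadPoolExecutor() as executor:
    responses = list(executor.map(call_llm, prompts))  # N independent, parallel runs
```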
Collect All N Responses
Gather the outputs from all N runs. For classification tasks, this yields N predicted labels. For generation tasks, you collect N distinct text outputs. For numerical predictions, you have N values. Each response reflects the model’s interpretation as influenced by its particular demonstration context — some may agree, others may diverge on edge cases.
The 5 runs return: Mixed (Set A), Negative (Set B), Mixed (Set C), Mixed (Set D), Negative (Set E). Three out of five agree on “Mixed” while two say “Negative.”
Aggregate Results via Majority Vote, Averaging, or Consensus
Combine the N responses into a single final answer. For classification, use majority voting — the label that appears most frequently wins. For numerical tasks, take the mean or median. For generation tasks, you can select the response most similar to the others, use an LLM to synthesize a consensus answer, or rank outputs by agreement. The aggregation step is where individual biases cancel out and the robust signal emerges.
Majority vote: 3 out of 5 responses say “Mixed,” so the final classification is “Mixed Sentiment” with 60% agreement confidence. The two “Negative” votes flag this as a borderline case worth human review.
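The voting logic in this stage is small enough to sketch directly. The 80% review threshold below is an illustrative choice, not part of the technique; pick whatever agreement level warrants human review in your domain.

```python
from collections import Counter

def majority_vote(labels, review_threshold=0.8):
    """Aggregate N ensemble labels; flag low-agreement cases for human review."""
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    confidence = votes / len(labels)
    return winner, confidence, confidence < review_threshold

labels = ["Mixed", "Negative", "Mixed", "Mixed", "Negative"]
winner, conf, needs_review = majority_vote(labels)
# winner == "Mixed", conf == 0.6, needs_review == True
```

Returning the agreement level alongside the winning label is what lets borderline cases (like this 3-of-5 split) be routed to a human instead of silently accepted.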
See the Difference
Why multiple example sets outperform a single demonstration
Single Example Set
Three few-shot examples are chosen for a support ticket classifier. All three happen to be billing-related complaints with angry tone. The model sees: refund request → Billing, overcharge dispute → Billing, payment failure → Billing.
New ticket: “I can’t log in to my account after the update.” The model classifies it as Billing because the examples biased it toward that category. The actual category should be Technical Support.
Demonstration Ensembling
Five different example sets are created, each mixing billing, technical, and account issues. The same login ticket is classified independently by each set. Results: Technical (Set 1), Technical (Set 2), Account (Set 3), Technical (Set 4), Technical (Set 5).
Majority vote: 4 out of 5 say Technical Support. The ensemble correctly identifies the category despite individual sets having different example compositions. The one “Account” vote is outweighed by the consensus.
Demonstration Ensembling in Action
See how ensembling across example sets improves reliability
“Hi team, I wanted to follow up on the quarterly numbers. Can we schedule a call this week to discuss projections and also loop in the design lead for the product review?”
Set A (examples: meeting request, project update, leave request):
Classification → Meeting Request
Set B (examples: data inquiry, scheduling, feedback):
Classification → Scheduling
Set C (examples: product review, team coordination, status check):
Classification → Meeting Request
Set D (examples: follow-up, escalation, scheduling):
Classification → Meeting Request
Set E (examples: introduction, meeting request, info request):
Classification → Meeting Request
Majority Vote (4/5): Meeting Request. The ensemble correctly identifies the primary intent despite the email containing multiple sub-intents including data discussion, scheduling, and cross-team coordination.
Write a product description for noise-cancelling headphones aimed at audiophiles and commuters. Three example sets each show different product descriptions as demonstrations.
Set A (examples: luxury tech products with aspirational tone):
Output emphasizes premium design aesthetics, lifestyle benefits, and brand prestige.
Set B (examples: practical office gear with feature-focused tone):
Output emphasizes battery life, call clarity, microphone quality, and comfort for 8-hour wear.
Set C (examples: balanced product descriptions mixing emotion and specs):
Output blends comfort claims with concrete specifications and realistic use scenarios.
Consensus Synthesis: An aggregation pass identifies common themes across all three outputs — noise cancellation for focus, all-day comfort, and clear call quality. The final description combines the practical specifics from Set B with the engaging framing from Set A and the balanced structure from Set C, producing a description more complete and well-rounded than any single output.
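For generation tasks like this one, a simple way to approximate consensus without a second LLM pass is to select the output that agrees most with its peers. The sketch below uses `difflib.SequenceMatcher` as a cheap textual-similarity proxy; embedding-based similarity would be the more common choice in production, and the draft strings are invented for illustration.

```python
from difflib import SequenceMatcher

def most_central(outputs):
    """Pick the generation that is most similar to all the others."""
    def agreement(i):
        return sum(SequenceMatcher(None, outputs[i], o).ratio()
                   for j, o in enumerate(outputs) if j != i)
    return max(range(len(outputs)), key=agreement)

drafts = [
    "Premium design with all-day comfort and clear calls.",
    "All-day comfort, clear calls, and reliable noise cancellation.",
    "Clear calls and lasting comfort with balanced, detailed sound.",
]
best = drafts[most_central(drafts)]
```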
Extract vendor name, invoice number, date, and total amount from a scanned invoice with imperfect OCR output: “Vndr: Acme Ccrp. Inv#: 2024-0847 Dt: 03/15/24 Ttl: $12,450.00”
Set A (examples: clean invoices with standard formatting):
Vendor: Acme Corp, Invoice: 2024-0847, Date: 03/15/24, Total: $12,450.00
Set B (examples: messy OCR invoices with abbreviations):
Vendor: Acme Corp., Invoice: 2024-0847, Date: 2024-03-15, Total: $12,450.00
Set C (examples: international invoices with varied date formats):
Vendor: Acme Ccrp, Invoice: 2024-0847, Date: March 15, 2024, Total: $12,450.00
Set D (examples: invoices with OCR correction patterns):
Vendor: Acme Corp, Invoice: 2024-0847, Date: 03/15/2024, Total: $12,450.00
Consensus: All four agree on invoice number and total. Three of four correct “Ccrp” to “Corp” — majority vote applies the correction. Date format is standardized to the most common output. The ensemble catches the OCR error that Set C missed.
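For structured extraction, the vote happens per field rather than per response, so one set's OCR slip on the vendor name is outvoted even if its other fields are fine. A minimal sketch, assuming each run returns a dict with the same keys (the date field is omitted here to keep the example short):

```python
from collections import Counter

def field_consensus(extractions):
    """Per-field majority vote across ensemble extraction outputs."""
    return {field: Counter(e[field] for e in extractions).most_common(1)[0][0]
            for field in extractions[0]}

outputs = [
    {"vendor": "Acme Corp",  "invoice": "2024-0847", "total": "$12,450.00"},
    {"vendor": "Acme Corp.", "invoice": "2024-0847", "total": "$12,450.00"},
    {"vendor": "Acme Ccrp",  "invoice": "2024-0847", "total": "$12,450.00"},  # uncorrected OCR
    {"vendor": "Acme Corp",  "invoice": "2024-0847", "total": "$12,450.00"},
]
final = field_consensus(outputs)
# final["vendor"] == "Acme Corp" — the majority repairs the OCR error
```

One caveat: fields that legitimately vary in formatting (like dates) should be normalized before voting, or the counts will fragment across equivalent spellings.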
When to Use Demonstration Ensembling
Best for tasks where consistency and reliability outweigh latency costs
Perfect For
Medical triage, content moderation, fraud detection — any domain where a single misclassification carries significant consequences and reliability trumps speed.
When inputs frequently fall between categories or have multiple valid interpretations, ensembling reveals whether disagreement is a feature of the input or an artifact of the examples.
When measuring LLM performance, ensembling across example sets produces more stable metrics that reflect true model capability rather than prompt sensitivity.
When you have far more labeled examples than fit in a single prompt, ensembling lets you leverage the full pool rather than discarding most of your data.
Skip It When
Each ensemble member adds an API call. If your use case requires sub-second responses — like autocomplete or real-time chat — the overhead of N parallel calls may be prohibitive.
If you only have 3–4 examples total, you cannot create meaningfully diverse subsets. The “ensembles” would overlap too heavily to provide independent signals.
Ensembling multiplies your API costs by N. For exploratory or low-value tasks where a single good-enough answer suffices, the cost-benefit ratio doesn’t justify the approach.
Use Cases
Where Demonstration Ensembling delivers the most value
Content Moderation
Run flagged content through multiple example sets spanning different violation types to reduce both false positives and false negatives in automated moderation pipelines.
Medical Triage
Classify patient symptoms across multiple demonstration sets covering different specialties, ensuring the urgency assessment isn’t biased by the particular clinical examples shown.
Document Classification
Sort incoming documents into categories using diverse example sets that cover different document styles, formats, and edge cases for each category.
Sentiment Analysis
Ensemble across example sets that emphasize different sentiment signals — sarcasm, understatement, cultural context — to produce more nuanced and consistent sentiment scores.
Intent Recognition
Classify user queries in chatbot systems using multiple example sets that cover different phrasings and contexts for each intent, reducing misrouting of ambiguous requests.
LLM Evaluation
Benchmark model performance using ensembled example sets to produce stable accuracy metrics that reflect true capability rather than sensitivity to prompt construction.
Where Demonstration Ensembling Fits
Bridging few-shot learning and systematic reliability engineering
Demonstration Ensembling and Self-Consistency target different sources of variance. Ensembling varies the context (which examples the model sees), while Self-Consistency varies the reasoning path (how the model thinks through the same context). Combining both — running multiple reasoning samples across multiple example sets — creates a two-dimensional ensemble that reduces variance from both sources simultaneously. In production systems handling critical decisions, this layered approach can push reliability close to human-level consistency.
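The two-dimensional ensemble described above can be sketched as a nested loop: the outer loop varies the demonstrations, the inner loop draws repeated samples from the same prompt. `call_llm` is again a hypothetical stand-in (a real call would pass a nonzero temperature to get sampling variation), and the demo strings are invented.

```python
from collections import Counter

def call_llm(prompt, temperature=0.7):
    # Hypothetical model call; returns a canned label so the sketch is runnable.
    return "Technical"

def two_dim_ensemble(example_sets, query, samples_per_set=3):
    """Vary the context (example sets) and the reasoning path (repeated sampling)."""
    votes = []
    for demos in example_sets:               # dimension 1: which examples are shown
        prompt = "\n".join(demos) + f"\nTicket: {query}\nCategory:"
        for _ in range(samples_per_set):     # dimension 2: self-consistency samples
            votes.append(call_llm(prompt, temperature=0.7))
    return Counter(votes).most_common(1)[0][0]

sets = [["refund request -> Billing"], ["login failure -> Technical"]]
label = two_dim_ensemble(sets, "I can't log in after the update")
```

Note the cost: N sets times M samples means N×M model calls, which is why this layered approach is reserved for critical decisions.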