Example Ordering
The same few-shot examples can produce wildly different results depending on their sequence. Example Ordering studies and optimizes the arrangement of demonstrations — turning a hidden source of variance into a deliberate lever for performance.
Introduced: Example Ordering was formally studied by Lu et al. in 2022 in the paper “Fantastically Ordered Prompts and Where to Find Them.” The research demonstrated that simply permuting the order of few-shot examples could swing accuracy from near-random to near-optimal on the same task with the same model — sometimes by more than 20 percentage points. This revealed that ordering is not a cosmetic detail but a first-class variable in prompt engineering that demands the same attention as example selection itself.
Modern LLM Status: Example Ordering remains an active and relevant technique. While modern LLMs such as Claude, GPT-4, and Gemini are somewhat more robust to ordering effects than earlier models, the phenomenon persists, especially for classification tasks, mathematical reasoning, and tasks with label imbalance. Research has consistently shown that recency bias, majority label bias, and category clustering effects continue to influence model outputs. Understanding ordering effects is essential for anyone doing serious few-shot prompting, and optimizing arrangement remains one of the simplest ways to improve results without changing any content.
Sequence Is a Silent Variable
When you provide few-shot examples to a language model, you probably focus on choosing the right examples. But there is a second, equally powerful variable hiding in plain sight: the order in which those examples appear. Two practitioners using identical examples for the same task can get dramatically different results simply because one arranged them differently.
The underlying mechanisms are well-documented. Recency bias causes models to weight later examples more heavily — the last example you provide disproportionately shapes the output. Majority label bias means that if recent examples cluster around one label, the model over-predicts that label. And category anchoring occurs when examples from the same class appear consecutively, causing the model to fixate on that pattern and under-represent others.
Think of it like a jury hearing witness testimony. The order in which witnesses present their evidence shapes the narrative the jury constructs — even when the underlying facts are identical. Example Ordering applies this same principle deliberately, arranging demonstrations to guide the model toward balanced, accurate predictions.
Lu et al. found that on some tasks, the worst ordering of the same examples performed at near-chance level while the best ordering achieved near state-of-the-art accuracy. This variance was not a marginal effect — it was the difference between a usable system and a broken one. The most alarming finding was that practitioners had no reliable intuition about which ordering would work best, making systematic approaches essential rather than optional.
The Example Ordering Process
Four stages from uncontrolled variance to optimized arrangement
Select Your Examples
Choose a set of few-shot demonstrations for the target task. Typically three to ten examples work well for standard tasks, though the exact number depends on the complexity of the domain and the diversity of categories you need to represent. Ensure the examples themselves are high-quality before worrying about their arrangement.
For a sentiment analysis task, select 6 product reviews — 3 positive and 3 negative — each clearly representative of its category and covering different product types.
Evaluate Ordering Sensitivity
Test multiple permutations of the same examples to measure how much the arrangement affects results. If accuracy swings more than 5 percentage points across different orderings, the task is ordering-sensitive and optimization will yield meaningful gains. This diagnostic step prevents wasted effort on tasks where ordering has minimal impact.
Run 10 random permutations of your 6 sentiment examples against a validation set of 50 reviews. If accuracy ranges from 68% to 93% across permutations, ordering sensitivity is high and optimization is warranted.
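The sensitivity check above can be sketched in a few lines. The permutation sampling is real; the accuracy scores are illustrative placeholders standing in for the results of actually running each ordering through a model against a validation set (the model call itself is not shown).

```python
import random

# The six sentiment examples from the walkthrough above.
examples = [
    ("Battery lasts forever", "Positive"),
    ("Great screen quality", "Positive"),
    ("Love the camera", "Positive"),
    ("Broke after a week", "Negative"),
    ("Terrible customer support", "Negative"),
    ("Overpriced for what you get", "Negative"),
]

def sample_orderings(examples, k, seed=0):
    """Draw k distinct random permutations of the example list."""
    rng = random.Random(seed)
    seen, orderings = set(), []
    while len(orderings) < k:
        perm = tuple(rng.sample(examples, len(examples)))
        if perm not in seen:
            seen.add(perm)
            orderings.append(list(perm))
    return orderings

def ordering_sensitivity(scores, threshold=0.05):
    """Flag the task as ordering-sensitive if the accuracy spread
    across permutations exceeds the threshold (5 points by default)."""
    spread = max(scores) - min(scores)
    return spread, spread > threshold

orderings = sample_orderings(examples, k=10)
# scores[i] would come from evaluating orderings[i] on the validation
# set; these numbers are illustrative, not real measurements.
scores = [0.68, 0.74, 0.71, 0.93, 0.80, 0.77, 0.69, 0.88, 0.72, 0.85]
spread, sensitive = ordering_sensitivity(scores)
```

With a spread of 25 points, this task would clear the 5-point threshold comfortably, and ordering optimization is worth the effort.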
Apply Ordering Heuristics
Use evidence-based strategies to arrange your examples. Alternate labels to prevent majority bias so the model does not see consecutive examples of the same class. Place the most representative or complex examples last to leverage recency advantage. Distribute diverse categories evenly throughout the sequence rather than clustering them together.
Instead of grouping all positive reviews first, interleave them: positive, negative, positive, negative, positive, negative. Place the most nuanced review — one with mixed signals that resolves to a clear label — in the final position.
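The interleaving heuristic can be implemented generically. This is a minimal sketch: it round-robins over label groups (largest group first, so a majority class spreads out rather than clustering) and optionally pins one chosen example to the final slot, per the recency advice above.

```python
from collections import defaultdict, deque

def interleave_by_label(examples, final=None):
    """Arrange (text, label) examples so consecutive items rarely
    share a label; `final`, if given, is pinned to the last slot."""
    groups = defaultdict(deque)
    for ex in examples:
        if ex is not final:
            groups[ex[1]].append(ex)
    # Cycle labels from most to least frequent so the majority class
    # spreads out instead of clustering at either end.
    labels = sorted(groups, key=lambda l: -len(groups[l]))
    ordered = []
    while any(groups[l] for l in labels):
        for label in labels:
            if groups[label]:
                ordered.append(groups[label].popleft())
    if final is not None:
        ordered.append(final)
    return ordered

reviews = [
    ("Battery lasts forever", "Positive"),
    ("Great screen quality", "Positive"),
    ("Love the camera", "Positive"),
    ("Broke after a week", "Negative"),
    ("Terrible customer support", "Negative"),
    ("Overpriced for what you get", "Negative"),
]
# Pin the most nuanced review (chosen by hand here) to the end.
arranged = interleave_by_label(reviews, final=reviews[-1])
```

With balanced labels this yields a strict positive/negative alternation; with imbalanced labels it is best-effort, and the leftover majority examples will eventually run consecutively.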
Validate with a Held-Out Set
Test the chosen ordering against examples the model has not seen during optimization. This confirms the arrangement generalizes beyond the validation set and is not overfitting to a narrow subset of inputs. Compare the optimized ordering against both random orderings and the reverse arrangement to quantify the improvement.
Evaluate your interleaved arrangement on a fresh set of 100 reviews. The optimized ordering achieves 91% accuracy versus 74% average across random orderings — a 17-point improvement from arrangement alone.
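A small helper makes the held-out comparison explicit. The accuracies below are illustrative placeholders mirroring the numbers in this step, not real measurements; in practice each would come from evaluating an ordering on the fresh review set.

```python
from statistics import mean

def compare_orderings(optimized_acc, random_accs, reversed_acc):
    """Summarize held-out results: optimized ordering vs. the mean of
    random orderings vs. the reversed arrangement."""
    baseline = mean(random_accs)
    return {
        "optimized": optimized_acc,
        "random_mean": round(baseline, 3),
        "reversed": reversed_acc,
        "gain_pp": round((optimized_acc - baseline) * 100, 1),
    }

# Illustrative held-out accuracies (placeholders, not measurements).
report = compare_orderings(0.91, [0.70, 0.74, 0.78, 0.74], 0.69)
```

Reporting the gain in percentage points against the random-ordering mean, rather than against a single baseline run, guards against cherry-picking an unusually bad comparison point.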
See the Difference
Why strategic arrangement outperforms random ordering
Random Ordering
Ex 1: “Battery lasts forever” → Positive
Ex 2: “Great screen quality” → Positive
Ex 3: “Love the camera” → Positive
Ex 4: “Broke after a week” → Negative
Ex 5: “Terrible customer support” → Negative
Ex 6: “Overpriced for what you get” → Negative
Model over-predicts “Negative” due to recency bias — all three negative examples appear last, anchoring the model toward that label for ambiguous inputs.
Optimized Ordering
Ex 1: “Battery lasts forever” → Positive
Ex 2: “Broke after a week” → Negative
Ex 3: “Great screen quality” → Positive
Ex 4: “Terrible customer support” → Negative
Ex 5: “Love the camera” → Positive
Ex 6: “Overpriced for what you get” → Negative
Labels alternate strategically, preventing any single sentiment from dominating the recency window. The model evaluates each new input on its own merits rather than defaulting to the last-seen pattern.
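Once an ordering is chosen, it has to be rendered into the actual prompt. A minimal sketch, assuming a plain "Review:/Sentiment:" template (the template format is an illustration, not a prescribed standard):

```python
def build_prompt(examples, query):
    """Render ordered (text, label) examples into a few-shot
    classification prompt, ending with the unlabeled query."""
    lines = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

# The optimized ordering from the comparison above.
optimized = [
    ("Battery lasts forever", "Positive"),
    ("Broke after a week", "Negative"),
    ("Great screen quality", "Positive"),
    ("Terrible customer support", "Negative"),
    ("Love the camera", "Positive"),
    ("Overpriced for what you get", "Negative"),
]
prompt = build_prompt(optimized, "Decent product but shipping was slow")
```

Keeping ordering logic separate from prompt rendering means you can swap arrangements without touching the template, which makes permutation experiments cheap.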
Example Ordering in Action
See how deliberate arrangement transforms few-shot performance
A team classifying product reviews noticed their few-shot prompt achieved only 72% accuracy on ambiguous reviews — cases where the language was mixed or subtle. The six examples they used were all high-quality, but three positive reviews appeared before three negative ones.
Problem identified: The three consecutive negative examples at the end created strong recency bias. When the model encountered an ambiguous review like “decent product but shipping was slow,” it defaulted to “Negative” because that was the most recently reinforced pattern.
Ordering fix: Interleave positive and negative reviews so labels alternate throughout the sequence. Place the most nuanced example — a review with mixed language that clearly resolves to one sentiment — in the final position to calibrate the model for borderline cases.
Result: Accuracy on ambiguous reviews jumped from 72% to 91%. The model stopped defaulting to the last-seen label and instead evaluated each review on its own linguistic signals. No examples were added or removed — only their order changed.
A math tutoring application provided four worked examples before each student's question. The examples ranged from basic arithmetic to multi-step algebra, but they appeared in random order. The model struggled with harder problems, often reverting to overly simplistic solution strategies.
Problem identified: When a simple arithmetic example appeared last, the model’s recency bias caused it to apply simplistic strategies even to complex algebra problems. The most sophisticated reasoning pattern was buried in the middle of the sequence where it had the least influence.
Ordering fix: Arrange examples in ascending complexity — basic arithmetic first, then simple algebra, then multi-step problems, with the most complex worked example in the final position. This leverages recency bias constructively by ensuring the most sophisticated reasoning pattern is freshest in context when the model encounters the target problem.
Result: The model’s success rate on multi-step problems improved from 58% to 79%. By placing the most complex demonstration last, the model approached new hard problems with the most advanced solution strategy readily available rather than defaulting to the simplest pattern it saw.
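The ascending-complexity fix reduces to a sort. In this sketch the complexity scores are hand-assigned by the prompt author (an assumption; they could equally come from a rubric such as step count), and the example problems are illustrative.

```python
# Hand-assigned complexity scores; higher means harder.
worked_examples = [
    {"problem": "Solve 2x + 3 = 11", "complexity": 2},
    {"problem": "What is 7 + 5?", "complexity": 1},
    {"problem": "Solve the system x + y = 10, x - y = 2", "complexity": 3},
    {"problem": "Factor x^2 - 5x + 6 and find its roots", "complexity": 4},
]

def order_by_complexity(examples):
    """Ascending complexity: the most sophisticated worked example
    lands last, where recency bias gives it the most influence."""
    return sorted(examples, key=lambda ex: ex["complexity"])

curriculum = order_by_complexity(worked_examples)
```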
A customer support system classified incoming tickets into five categories: billing, technical, shipping, returns, and general inquiry. The few-shot prompt grouped examples by category — two billing examples, then two technical, then two shipping, and so on. The model heavily over-predicted “general inquiry” because it was the last category in the sequence.
Problem identified: Grouping examples by category created a compounding effect. The model saw two consecutive “general inquiry” examples last, making that the dominant pattern in its recency window. Ambiguous tickets that could reasonably fall into billing or technical were consistently misclassified as general inquiry.
Ordering fix: Distribute examples from all five categories evenly throughout the sequence, ensuring no two consecutive examples share the same label. The arrangement followed a round-robin pattern: billing, technical, shipping, returns, general, billing, technical, shipping, returns, general.
Result: Over-prediction of the final category dropped from 38% to 19%, closely matching the true distribution. Classification accuracy across all five categories improved by 14 percentage points on average, with the largest gains on categories that had previously been under-represented in the recency window.
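With equal-sized category groups, the round-robin arrangement described in the fix is a one-liner. The example IDs below are placeholders standing in for real labeled tickets.

```python
from itertools import chain

# Two placeholder examples per ticket category (insertion order matters:
# it defines the round-robin cycle).
groups = {
    "billing":   ["billing_ex1", "billing_ex2"],
    "technical": ["technical_ex1", "technical_ex2"],
    "shipping":  ["shipping_ex1", "shipping_ex2"],
    "returns":   ["returns_ex1", "returns_ex2"],
    "general":   ["general_ex1", "general_ex2"],
}

# zip(*...) takes the i-th example from every category per round;
# chaining the rounds yields billing, technical, shipping, returns,
# general, billing, ... with no two consecutive examples sharing a label.
round_robin = list(chain.from_iterable(zip(*groups.values())))
```

Note that `zip` truncates to the shortest group, so this idiom silently drops surplus examples from over-represented categories; for imbalanced groups, a deque-based round-robin is the safer choice.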
When to Use Example Ordering
Best for few-shot tasks where consistency and accuracy matter
Perfect For
Tasks where you provide labeled examples across several classes — ordering determines whether the model gives balanced predictions or over-predicts whichever category it saw most recently.
When you notice that the same prompt produces different quality outputs across runs or slight modifications — ordering sensitivity is likely a contributing factor worth investigating.
Sentiment analysis, intent detection, and topic classification — any task where label distribution in examples directly influences the model’s prediction tendencies.
When results must be consistent and defensible — medical triage, legal classification, or financial categorization where ordering-induced variance is unacceptable.
Skip It When
When no examples are provided in the prompt, there is nothing to order. Example Ordering only applies when you are using few-shot demonstrations.
Tasks where all examples share the same label or output structure — such as summarization or translation — have no label bias to mitigate through ordering.
When using large numbers of demonstrations, ordering effects diminish with volume. The sheer quantity of examples naturally averages out positional biases that dominate in smaller sets.
Use Cases
Where Example Ordering delivers the most value
Sentiment Classification
Interleave positive and negative review examples to prevent recency bias from skewing predictions on ambiguous or mixed-sentiment inputs.
Medical Triage Categorization
Distribute urgency-level examples evenly so the model does not default to the most recently seen triage category when evaluating borderline patient cases.
Customer Intent Detection
Arrange intent examples in a round-robin pattern across categories to ensure balanced prediction rates for purchase, support, complaint, and inquiry intents.
Code Bug Classification
Alternate examples of different bug types — logic errors, syntax issues, performance problems — to prevent the model from over-classifying toward whichever category it saw last.
Content Moderation
Interleave safe and flagged content examples so the moderation model maintains calibrated thresholds rather than becoming either overly permissive or overly restrictive based on recency.
Survey Response Coding
Distribute thematic category examples evenly to ensure open-ended survey responses are coded across all relevant themes rather than clustering toward the final category in the sequence.
Where Example Ordering Fits
Example Ordering bridges example selection and volume-based approaches
Example Selection and Example Ordering are two sides of the same coin. Selection determines which demonstrations appear in your prompt, while ordering determines their arrangement. Optimizing one without the other leaves performance on the table. The strongest few-shot prompts apply both — first selecting high-quality, representative examples, then arranging them to minimize positional bias and maximize the model’s ability to generalize from the pattern.