In-Context Learning Technique

Example Ordering

The same few-shot examples can produce wildly different results depending on their sequence. Example Ordering studies and optimizes the arrangement of demonstrations — turning a hidden source of variance into a deliberate lever for performance.

Technique Context: 2022

Introduced: Example Ordering was formally studied by Lu et al. in the 2022 paper “Fantastically Ordered Prompts and Where to Find Them.” The research demonstrated that simply permuting the order of few-shot examples could swing accuracy from near-random to near-optimal on the same task with the same model — sometimes by more than 20 percentage points. This revealed that ordering is not a cosmetic detail but a first-class variable in prompt engineering that demands the same attention as example selection itself.

Modern LLM Status: Example Ordering remains an active and relevant technique. While modern LLMs such as Claude, GPT-4, and Gemini are somewhat more robust to ordering effects than earlier models, the phenomenon persists, especially for classification tasks, mathematical reasoning, and tasks with label imbalance. Research has consistently shown that recency bias, majority label bias, and category clustering effects continue to influence model outputs. Understanding ordering effects is essential for anyone doing serious few-shot prompting, and optimizing arrangement remains one of the simplest ways to improve results without changing any content.

The Core Insight

Sequence Is a Silent Variable

When you provide few-shot examples to a language model, you probably focus on choosing the right examples. But there is a second, equally powerful variable hiding in plain sight: the order in which those examples appear. Two practitioners using identical examples for the same task can get dramatically different results simply because one arranged them differently.

The underlying mechanisms are well-documented. Recency bias causes models to weight later examples more heavily — the last example you provide disproportionately shapes the output. Majority label bias means that if recent examples cluster around one label, the model over-predicts that label. And category anchoring occurs when examples from the same class appear consecutively, causing the model to fixate on that pattern and under-represent others.

Think of it like a jury hearing witness testimony. The order in which witnesses present their evidence shapes the narrative the jury constructs — even when the underlying facts are identical. Example Ordering applies this same principle deliberately, arranging demonstrations to guide the model toward balanced, accurate predictions.

Why Order Matters More Than You Think

Lu et al. found that on some tasks, the worst ordering of the same examples performed at near-chance level while the best ordering achieved near state-of-the-art accuracy. This variance was not a marginal effect — it was the difference between a usable system and a broken one. The most alarming finding was that practitioners had no reliable intuition about which ordering would work best, making systematic approaches essential rather than optional.

The Example Ordering Process

Four stages from uncontrolled variance to optimized arrangement

1. Select Your Examples

Choose a set of few-shot demonstrations for the target task. Typically three to ten examples work well for standard tasks, though the exact number depends on the complexity of the domain and the diversity of categories you need to represent. Ensure the examples themselves are high-quality before worrying about their arrangement.

Example

For a sentiment analysis task, select 6 product reviews — 3 positive and 3 negative — each clearly representative of its category and covering different product types.
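A set like this can be represented as simple (text, label) pairs before any ordering decisions are made. The sketch below is illustrative: the reviews, labels, and `format_prompt` helper are assumptions for this example, not a prescribed API.

```python
# Illustrative few-shot example set for sentiment classification.
# The reviews, labels, and the format_prompt helper are assumptions
# for this sketch, not part of any specific library.

examples = [
    ("Battery lasts forever", "Positive"),
    ("Great screen quality", "Positive"),
    ("Love the camera", "Positive"),
    ("Broke after a week", "Negative"),
    ("Terrible customer support", "Negative"),
    ("Overpriced for what you get", "Negative"),
]

def format_prompt(examples, query):
    """Render ordered examples plus the target input as one prompt string."""
    blocks = [f'Review: "{text}"\nSentiment: {label}' for text, label in examples]
    blocks.append(f'Review: "{query}"\nSentiment:')
    return "\n\n".join(blocks)
```

Keeping examples as structured pairs, rather than a hand-written prompt string, makes the reordering experiments in the later steps trivial to run.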

2. Evaluate Ordering Sensitivity

Test multiple permutations of the same examples to measure how much the arrangement affects results. If accuracy swings more than 5 percentage points across different orderings, the task is ordering-sensitive and optimization will yield meaningful gains. This diagnostic step prevents wasted effort on tasks where ordering has minimal impact.

Example

Run 10 random permutations of your 6 sentiment examples against a validation set of 50 reviews. If accuracy ranges from 68% to 93% across permutations, ordering sensitivity is high and optimization is warranted.
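The diagnostic above can be sketched as a small harness that shuffles the examples and reports the accuracy spread. The `evaluate` callable is an assumption here: in practice it would run the formatted prompt for each ordering against your validation set and return accuracy.

```python
import random
import statistics

def ordering_sensitivity(examples, evaluate, n_permutations=10, seed=0):
    """Score random permutations of the examples and report the spread.

    `evaluate` is a caller-supplied callable (an assumption in this
    sketch) that takes one ordered example list and returns validation
    accuracy in [0, 1].
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        order = list(examples)
        rng.shuffle(order)
        scores.append(evaluate(order))
    spread = max(scores) - min(scores)
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "spread": spread,
        "sensitive": spread > 0.05,  # >5-point swing = ordering-sensitive
    }
```

A fixed seed keeps the permutation sample reproducible, so repeated diagnostics compare like with like.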

3. Apply Ordering Heuristics

Use evidence-based strategies to arrange your examples. Alternate labels to prevent majority bias so the model does not see consecutive examples of the same class. Place the most representative or complex examples last to leverage recency advantage. Distribute diverse categories evenly throughout the sequence rather than clustering them together.

Example

Instead of grouping all positive reviews first, interleave them: positive, negative, positive, negative, positive, negative. Place the most nuanced review — one with mixed signals that resolves to a clear label — in the final position.
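The interleaving heuristic can be implemented mechanically. This is a minimal sketch, assuming examples are (text, label) pairs: it buckets examples by label and emits them round-robin, which for balanced classes guarantees adjacent labels always differ.

```python
from collections import defaultdict
from itertools import zip_longest

def interleave_by_label(examples):
    """Round-robin over labels so no class dominates any stretch of the
    sequence; with balanced classes, adjacent labels always differ."""
    buckets = defaultdict(list)
    for example in examples:
        buckets[example[1]].append(example)  # group by label
    ordered = []
    for batch in zip_longest(*buckets.values()):
        ordered.extend(ex for ex in batch if ex is not None)
    return ordered
```

Within each bucket the original order is preserved, so you can still hand-place the most nuanced example last inside its class.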

4. Validate with a Held-Out Set

Test the chosen ordering against examples the model has not seen during optimization. This confirms the arrangement generalizes beyond the validation set and is not overfitting to a narrow subset of inputs. Compare the optimized ordering against both random orderings and the reverse arrangement to quantify the improvement.

Example

Evaluate your interleaved arrangement on a fresh set of 100 reviews. The optimized ordering achieves 91% accuracy versus 74% average across random orderings — a 17-point improvement from arrangement alone.
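The comparison against random baselines can be expressed as a small helper. As in the earlier diagnostic, `evaluate` is an assumed caller-supplied scorer that runs one ordering against the held-out set.

```python
import random
import statistics

def ordering_gain(optimized, examples, evaluate, n_random=10, seed=0):
    """Improvement of the chosen ordering over the mean of random
    orderings on a held-out set. `evaluate` (an assumption here)
    scores one ordered example list."""
    rng = random.Random(seed)
    baselines = []
    for _ in range(n_random):
        order = list(examples)
        rng.shuffle(order)
        baselines.append(evaluate(order))
    return evaluate(optimized) - statistics.mean(baselines)
```

A positive gain on data the optimization never saw is the evidence that the arrangement generalizes rather than overfitting the validation set.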

See the Difference

Why strategic arrangement outperforms random ordering

Random Ordering

Example Sequence

Ex 1: “Battery lasts forever” → Positive
Ex 2: “Great screen quality” → Positive
Ex 3: “Love the camera” → Positive
Ex 4: “Broke after a week” → Negative
Ex 5: “Terrible customer support” → Negative
Ex 6: “Overpriced for what you get” → Negative

Result

Model over-predicts “Negative” due to recency bias — all three negative examples appear last, anchoring the model toward that label for ambiguous inputs.

Uncontrolled label distribution creates recency and majority biases, producing inconsistent results
VS

Optimized Ordering

Example Sequence

Ex 1: “Battery lasts forever” → Positive
Ex 2: “Broke after a week” → Negative
Ex 3: “Great screen quality” → Positive
Ex 4: “Terrible customer support” → Negative
Ex 5: “Love the camera” → Positive
Ex 6: “Overpriced for what you get” → Negative

Result

Labels alternate strategically, preventing any single sentiment from dominating the recency window. The model evaluates each new input on its own merits rather than defaulting to the last-seen pattern.

Strategic label alternation and diverse examples at key positions yield stable, reproducible results


Example Ordering in Action

See how deliberate arrangement transforms few-shot performance

The Challenge

A team classifying product reviews noticed their few-shot prompt achieved only 72% accuracy on ambiguous reviews — cases where the language was mixed or subtle. The six examples they used were all high-quality, but three positive reviews appeared before three negative ones.

Optimized Ordering

Problem identified: The three consecutive negative examples at the end created strong recency bias. When the model encountered an ambiguous review like “decent product but shipping was slow,” it defaulted to “Negative” because that was the most recently reinforced pattern.

Ordering fix: Interleave positive and negative reviews so labels alternate throughout the sequence. Place the most nuanced example — a review with mixed language that clearly resolves to one sentiment — in the final position to calibrate the model for borderline cases.

Result: Accuracy on ambiguous reviews jumped from 72% to 91%. The model stopped defaulting to the last-seen label and instead evaluated each review on its own linguistic signals. No examples were added or removed — only their order changed.

The Challenge

A math tutoring application provided four worked examples before asking students’ questions. The examples ranged from basic arithmetic to multi-step algebra, but they appeared in random order. The model struggled with harder problems, often reverting to overly simplistic solution strategies.

Optimized Ordering

Problem identified: When a simple arithmetic example appeared last, the model’s recency bias caused it to apply simplistic strategies even to complex algebra problems. The most sophisticated reasoning pattern was buried in the middle of the sequence where it had the least influence.

Ordering fix: Arrange examples in ascending complexity — basic arithmetic first, then simple algebra, then multi-step problems, with the most complex worked example in the final position. This leverages recency bias constructively by ensuring the most sophisticated reasoning pattern is freshest in context when the model encounters the target problem.

Result: The model’s success rate on multi-step problems improved from 58% to 79%. By placing the most complex demonstration last, the model approached new hard problems with the most advanced solution strategy readily available rather than defaulting to the simplest pattern it saw.
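The ascending-complexity fix reduces to a one-line sort once each worked example carries a difficulty score. In this sketch, `complexity` is an assumed caller-supplied scoring function; the problems and scores are illustrative.

```python
def order_by_complexity(examples, complexity):
    """Arrange worked examples from simplest to hardest so the most
    sophisticated reasoning pattern occupies the recency window.
    `complexity` is an assumed caller-supplied difficulty score."""
    return sorted(examples, key=complexity)
```

The difficulty score could be as simple as the number of solution steps in each worked example, or a hand-assigned rating.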

The Challenge

A customer support system classified incoming tickets into five categories: billing, technical, shipping, returns, and general inquiry. The few-shot prompt grouped examples by category — two billing examples, then two technical, then two shipping, and so on. The model heavily over-predicted “general inquiry” because it was the last category in the sequence.

Optimized Ordering

Problem identified: Grouping examples by category created a compounding effect. The model saw two consecutive “general inquiry” examples last, making that the dominant pattern in its recency window. Ambiguous tickets that could reasonably fall into billing or technical were consistently misclassified as general inquiry.

Ordering fix: Distribute examples from all five categories evenly throughout the sequence, ensuring no two consecutive examples share the same label. The arrangement followed a round-robin pattern: billing, technical, shipping, returns, general, billing, technical, shipping, returns, general.

Result: Over-prediction of the final category dropped from 38% to 19%, closely matching the true distribution. Classification accuracy across all five categories improved by 14 percentage points on average, with the largest gains on categories that had previously been under-represented in the recency window.
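Whether a multi-category arrangement is clustered or balanced can be checked with a simple position diagnostic. This is a sketch assuming (text, label) pairs: it computes the mean position of each label, and a wide gap between category means flags a class that is clustered late (or early) in the sequence.

```python
from collections import defaultdict

def label_positions(ordered):
    """Mean position of each label in the sequence; a wide spread
    between category means signals that some class is clustered
    late (or early) rather than distributed round-robin."""
    positions = defaultdict(list)
    for index, (_, label) in enumerate(ordered):
        positions[label].append(index)
    return {label: sum(idxs) / len(idxs) for label, idxs in positions.items()}
```

For the grouped ticket prompt above, the last category's mean position sits far beyond the first's; after the round-robin fix the means pull close together.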

When to Use Example Ordering

Best for few-shot tasks where consistency and accuracy matter

Perfect For

Few-Shot Classification with Multiple Categories

Tasks where you provide labeled examples across several classes — ordering determines whether the model gives balanced predictions or over-predicts whichever category it saw most recently.

Inconsistent Results with the Same Examples

When you notice that the same prompt produces different quality outputs across runs or slight modifications — ordering sensitivity is likely a contributing factor worth investigating.

Label-Sensitive Tasks

Sentiment analysis, intent detection, and topic classification — any task where label distribution in examples directly influences the model’s prediction tendencies.

High-Stakes Applications Requiring Reproducibility

When results must be consistent and defensible — medical triage, legal classification, or financial categorization where ordering-induced variance is unacceptable.

Skip It When

Zero-Shot Tasks

When no examples are provided in the prompt, there is nothing to order. Example Ordering only applies when you are using few-shot demonstrations.

Single Output Category or Format

Tasks where all examples share the same label or output structure — such as summarization or translation — have no label bias to mitigate through ordering.

Many-Shot Prompting with Hundreds of Examples

When using large numbers of demonstrations, ordering effects diminish with volume. The sheer quantity of examples naturally averages out positional biases that dominate in smaller sets.

Use Cases

Where Example Ordering delivers the most value

Sentiment Classification

Interleave positive and negative review examples to prevent recency bias from skewing predictions on ambiguous or mixed-sentiment inputs.

Medical Triage Categorization

Distribute urgency-level examples evenly so the model does not default to the most recently seen triage category when evaluating borderline patient cases.

Customer Intent Detection

Arrange intent examples in a round-robin pattern across categories to ensure balanced prediction rates for purchase, support, complaint, and inquiry intents.

Code Bug Classification

Alternate examples of different bug types — logic errors, syntax issues, performance problems — to prevent the model from over-classifying toward whichever category it saw last.

Content Moderation

Interleave safe and flagged content examples so the moderation model maintains calibrated thresholds rather than becoming either overly permissive or overly restrictive based on recency.

Survey Response Coding

Distribute thematic category examples evenly to ensure open-ended survey responses are coded across all relevant themes rather than clustering toward the final category in the sequence.

Where Example Ordering Fits

Example Ordering bridges example selection and volume-based approaches

Few-Shot Learning (Provide Examples): demonstrate the task with labeled inputs
Example Selection (Choose WHICH Examples): pick the most relevant demonstrations
Example Ordering (Arrange Optimally): sequence examples to maximize performance
Many-Shot (Overwhelm with Volume): use hundreds of examples to average out bias

The Ordering-Selection Synergy

Example Selection and Example Ordering are two sides of the same coin. Selection determines which demonstrations appear in your prompt, while ordering determines their arrangement. Optimizing one without the other leaves performance on the table. The strongest few-shot prompts apply both — first selecting high-quality, representative examples, then arranging them to minimize positional bias and maximize the model’s ability to generalize from the pattern.
