Vote-K Prompting
Not all unlabeled data is created equal. Vote-K lets the model itself vote on which examples from an unlabeled pool would be most valuable to annotate — concentrating expensive human effort on the demonstrations that will actually move the needle.
Introduced: Vote-K was proposed by Su et al. in 2022 (“Selective Annotation Makes Language Models Better Few-Shot Learners”) as an active learning strategy for in-context learning. The core idea is elegant: rather than randomly selecting which unlabeled examples to annotate for few-shot demonstrations, let the language model itself “vote” on which k candidates from an unlabeled pool would be most informative. The model evaluates each candidate’s potential contribution to the demonstration set, effectively identifying examples that sit near decision boundaries or cover underrepresented patterns. This transforms demonstration pool construction from a guessing game into a guided, model-informed selection process.
Modern LLM Status: The concept of model-guided data selection remains highly relevant for building high-quality demonstration pools, especially in specialized domains where annotation is expensive. While modern LLMs like Claude, GPT-4, and Gemini have reduced the need for carefully curated few-shot examples in general tasks, Vote-K’s principles shine in domain-specific applications — medical coding, legal classification, and technical support routing — where each annotated example costs real time and expert attention. The broader insight that strategic data curation outperforms brute-force data collection has become a foundational principle in applied AI.
Model-Guided Selective Annotation
When building a demonstration pool for few-shot prompting, the default approach is to annotate examples at random. But random selection wastes effort on easy, redundant cases that the model already handles well. The examples that matter most are the ones the model finds confusing, ambiguous, or novel — exactly the cases where a labeled demonstration would teach it something new.
Vote-K flips annotation from passive to active. Instead of blindly labeling data, you present the model with unlabeled candidates and ask it to vote on which k examples it most wants to see annotated. The model identifies its own uncertainty: examples where it is least confident, where multiple valid classifications seem possible, or where the input pattern differs from anything in the current demonstration set. These high-value votes guide human annotators toward the data points that will deliver the greatest learning impact per label.
Think of it like a student who, instead of being assigned random homework problems, gets to point at the specific problems they find most confusing — then a teacher works through those targeted examples to maximize learning efficiency.
A randomly selected demonstration pool often contains many similar, easy examples that teach the model nothing new — while leaving difficult edge cases unrepresented. Vote-K concentrates annotation effort on the examples the model finds most uncertain or informative, building a demonstration set where every labeled example carries maximum instructional weight. The result: better performance from fewer annotations, which directly translates to lower cost and faster deployment.
The Vote-K Process
Four stages from unlabeled data to an optimized demonstration pool
Start with a Labeled Seed Pool and Unlabeled Candidates
Begin with a small set of already-labeled examples (the seed pool) and a larger collection of unlabeled data. The seed pool gives the model a baseline understanding of the task, while the unlabeled candidates represent the raw material from which the model will select its most-wanted annotations.
You have 20 labeled customer support tickets across 5 categories, plus 500 unlabeled tickets waiting to be classified. The model can already handle straightforward cases but struggles with edge cases and ambiguous tickets.
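The starting state above can be sketched as two simple pools. This is a minimal Python sketch: the `Ticket` class, the ticket texts, and the category names are illustrative stand-ins, not a real dataset or API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category set for the support-ticket scenario
CATEGORIES = ["Billing", "Technical", "Account Access",
              "Feature Request", "General Inquiry"]

@dataclass
class Ticket:
    text: str
    label: Optional[str] = None  # None until a human annotates it

# Labeled seed pool (2 of the 20 shown, for illustration)
seed_pool = [
    Ticket("My card was charged twice this month.", "Billing"),
    Ticket("The app crashes when I open settings.", "Technical"),
]

# Unlabeled candidates awaiting selection (2 of the 500 shown)
unlabeled_pool = [
    Ticket("I was charged twice but also can't log in to check my invoice."),
    Ticket("The export button doesn't work. Can you add CSV export?"),
]
```

The only structural requirement is the distinction the process relies on: seed examples carry labels, candidates do not.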
Model Votes on K Most Useful Unlabeled Examples
The model evaluates unlabeled candidates against the current demonstration set and votes on which k examples would be most valuable to annotate. It prioritizes examples where its prediction confidence is lowest, where it detects novel patterns not covered by existing demonstrations, or where the input lies near the boundary between multiple categories.
The model reviews 500 unlabeled tickets and votes for 10 it finds most confusing — including a complaint that blends billing and technical issues, a request written in mixed language, and tickets from a product line not represented in the seed pool.
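One way to implement the voting step is to score each candidate by the entropy of the model’s predicted label distribution and keep the top k. A minimal sketch, assuming a `predict` callable that stands in for your actual LLM call (for example, reading token logprobs or sampling the classification several times and tallying the answers):

```python
import math

def entropy(dist):
    """Shannon entropy of a predicted label distribution {label: prob}.
    Higher entropy = the model is more torn between categories."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def vote_k(candidates, predict, k):
    """Rank unlabeled candidates by prediction entropy; return the top k.
    `predict(text)` is a stand-in for an LLM call that returns a label
    distribution for one candidate given the current demonstrations."""
    scored = [(entropy(predict(text)), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy stand-in: pretend the model is confident on single-intent tickets
# but torn between two categories on mixed-intent ones.
def fake_predict(text):
    if " and " in text:
        return {"Billing": 0.5, "Technical": 0.5}   # maximally uncertain
    return {"Billing": 0.95, "Technical": 0.05}     # confident

pool = ["Refund please", "Charged twice and cannot log in", "App crashes"]
print(vote_k(pool, fake_predict, k=1))  # prints ['Charged twice and cannot log in']
```

Entropy is only one scoring choice; the same skeleton works with margin-based uncertainty or a novelty score against the existing demonstrations.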
Human Annotates the Top-Voted Examples
A domain expert reviews and labels the k examples the model selected. Because these are the cases the model found most challenging, each annotation carries outsized instructional value. The human effort is concentrated exactly where it matters most, rather than spread thinly across easy cases the model already understands.
The support team lead labels the 10 voted tickets: the billing-plus-technical ticket gets classified as “Billing” with a note about routing, the mixed-language ticket gets its primary category, and the new product line tickets establish fresh category patterns.
Add to Demonstration Pool and Repeat
The newly annotated examples join the demonstration pool, expanding the model’s reference set. The process can repeat iteratively — each cycle the model votes on the next batch of most-valuable candidates, and each round of annotation targets a different frontier of the model’s uncertainty. Performance improves with each iteration as the demonstration pool becomes increasingly comprehensive and strategically composed.
After three rounds of Vote-K selection, the demonstration pool grows from 20 to 50 examples — but those 50 strategically chosen examples outperform a randomly selected pool of 200 because they cover the exact edge cases and boundary conditions where the model needed guidance.
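The full iterative cycle can be sketched as a short loop. `select` and `annotate` below are placeholders for the model-voting step and the human labeling step; the toy run mirrors the scenario above (20 seed examples, three rounds of 10):

```python
def vote_k_rounds(seed, unlabeled, select, annotate, rounds=3, k=10):
    """Iterative Vote-K: each round, the model votes for k candidates,
    a human labels them, and they join the demonstration pool.
    `select(pool, candidates, k)` and `annotate(examples)` are stand-ins
    for the model-voting and expert-annotation steps."""
    pool = list(seed)
    candidates = list(unlabeled)
    for _ in range(rounds):
        chosen = select(pool, candidates, k)        # model votes
        labeled = annotate(chosen)                  # expert labels the votes
        pool.extend(labeled)                        # pool grows strategically
        candidates = [c for c in candidates if c not in chosen]
    return pool

seed = [f"seed-{i}" for i in range(20)]
unlabeled = [f"ticket-{i}" for i in range(500)]
pool = vote_k_rounds(
    seed, unlabeled,
    select=lambda pool, cands, k: cands[:k],        # placeholder vote
    annotate=lambda chosen: [c + ":labeled" for c in chosen],
)
print(len(pool))  # prints 50
```

Each round re-runs selection against the grown pool, so later votes naturally target a different frontier of uncertainty than earlier ones.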
See the Difference
Why model-guided selection outperforms random annotation
Random Annotation
Randomly select 50 unlabeled examples from a pool of 500, send them to domain experts for labeling, and add all 50 to the demonstration set regardless of their instructional value.
35 of the 50 annotated examples are straightforward cases the model already classifies correctly. Only 15 cover novel patterns or edge cases. The annotation budget is largely spent on redundant demonstrations that add no new learning signal.
Vote-K Guided Annotation
The model votes on which 50 examples from the pool of 500 it finds most uncertain or informative. Only these model-selected candidates go to domain experts for labeling.
All 50 annotated examples target the model’s actual weaknesses — ambiguous edge cases, underrepresented categories, and boundary-straddling inputs. Every annotation teaches the model something new, producing a demonstration pool that delivers superior performance with fewer labels.
Vote-K in Action
See how model-guided selection optimizes annotation across domains
A support team needs to classify incoming tickets into categories: Billing, Technical, Account Access, Feature Request, and General Inquiry. They have 15 labeled examples but 800 unlabeled tickets. Budget allows labeling only 30 more.
Model votes for these high-value candidates:
Voted #1: “I was charged twice but also can’t log in to check my invoice.”
Why voted: Straddles Billing and Account Access — the model is uncertain which category takes priority when both apply.
Voted #2: “The export button doesn’t work. Can you add CSV export?”
Why voted: Blends Technical (bug report) with Feature Request — the model needs guidance on how to classify mixed-intent tickets.
Voted #3: “Hola, necesito ayuda con mi cuenta pero no hablo inglés bien.” (“Hi, I need help with my account but I don’t speak English well.”)
Why voted: Non-English input not represented in the seed pool — the model has no reference for how to handle tickets written in other languages.
Result: After annotating 30 model-selected tickets, classification accuracy improved by 23% compared to annotating 30 random tickets — with the biggest gains on multi-category and edge-case tickets.
A legal tech company needs to identify and classify clauses in contracts: Limitation of Liability, Indemnification, Termination, Confidentiality, and Force Majeure. They have 25 labeled clause examples from standard templates but 2,000 unlabeled clauses from diverse real-world contracts.
Model votes for these high-value candidates:
Voted #1: A clause that combines indemnification obligations with liability caps in a single paragraph.
Why voted: Overlaps two categories — the model cannot determine whether to classify it as Limitation of Liability or Indemnification without a labeled example showing the precedent.
Voted #2: A pandemic-specific termination clause referencing government shutdown orders.
Why voted: Blends Force Majeure language with Termination provisions — a pattern absent from the standard-template seed pool.
Voted #3: An informal clause from a startup contract written in plain English rather than legal terminology.
Why voted: The model’s seed pool contains only formal legal language — it has no reference for recognizing clause types when expressed colloquially.
Result: After two Vote-K rounds (40 total annotations), clause extraction F1 score reached 0.89 — a level that random annotation required 120 labeled examples to achieve.
An e-commerce company wants to classify product reviews by sentiment (Positive, Negative, Neutral) and extract specific aspect mentions (Quality, Price, Shipping, Customer Service). They have 30 labeled reviews but 5,000 unlabeled reviews across product categories.
Model votes for these high-value candidates:
Voted #1: “The product is amazing but the shipping was a nightmare and customer service was unhelpful. 3 stars.”
Why voted: Mixed sentiment across multiple aspects — the model cannot determine overall sentiment when individual aspects conflict.
Voted #2: “Returned it. Twice. Third one worked fine I guess.”
Why voted: Terse, ambiguous language with implied negative experience but a partially positive resolution — the model finds the sentiment boundary unclear.
Voted #3: A review that uses heavy sarcasm: “Oh sure, I love waiting three weeks for something that breaks on day one. Five stars for the entertainment value.”
Why voted: Sarcasm inverts surface-level positive language into negative sentiment — a pattern the seed pool does not address.
Result: Vote-K-guided annotation produced a demonstration pool where sentiment accuracy on edge cases improved by 31% over random selection, with sarcasm detection improving from near-chance to 78% accuracy.
When to Use Vote-K
Best for building high-quality demonstration pools with limited annotation budgets
Perfect For
Medical, legal, financial, or scientific domains where each labeled example requires costly expert time — Vote-K ensures every annotation delivers maximum value.
When you have thousands of unlabeled examples but can only afford to label a small fraction — Vote-K identifies the highest-impact subset to annotate.
When you want to systematically improve few-shot performance over multiple rounds — each Vote-K cycle targets a different frontier of model uncertainty.
When your task has many ambiguous or boundary-straddling examples that random sampling tends to miss — Vote-K surfaces exactly these hard cases.
Skip It When
If labeling examples is fast and inexpensive, the overhead of running Vote-K selection may not justify the efficiency gains — just label everything directly.
If you only have a handful of unlabeled examples, the selection overhead is unnecessary — simply annotate them all and skip the voting step.
If the model already performs well enough on the task without any demonstrations, building an elaborate demonstration pool provides diminishing returns.
Use Cases
Where Vote-K delivers the most value
Medical Record Coding
Identify the most ambiguous clinical notes for expert coding — focusing annotation on cases where diagnosis codes overlap or symptom descriptions are atypical, rather than wasting physician time on clear-cut records.
Contract Classification
Surface contract clauses that defy standard templates for legal expert review — hybrid provisions, unusual language, and jurisdiction-specific variations that the model cannot confidently categorize.
Support Ticket Routing
Let the model identify tickets it cannot confidently route between departments, then have team leads label those boundary cases to build a demonstration set that resolves the trickiest routing decisions.
Content Moderation
Focus moderator annotation on content the model finds ambiguous — sarcasm, cultural context, and borderline cases — rather than clearly benign or clearly violating posts the model already handles reliably.
Financial Transaction Tagging
Identify transactions the model struggles to categorize — unusual merchant names, cross-category purchases, and subscription patterns — then have accountants label those specific cases for maximum classification improvement.
Research Paper Tagging
Surface interdisciplinary papers that span multiple research domains for expert categorization, building a demonstration set that teaches the model to handle cross-field publications that defy single-category classification.
Where Vote-K Fits
Vote-K bridges passive data selection and intelligent demonstration engineering
Vote-K and Example Selection are natural partners. Use Vote-K to build a high-quality, strategically annotated demonstration pool, then use Example Selection techniques (similarity-based, diversity-based, or task-specific retrieval) to pick the best demonstrations from that pool for each new input at inference time. This two-stage pipeline — Vote-K for pool construction, Example Selection for runtime retrieval — maximizes both annotation efficiency and per-query performance.
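The second stage of that pipeline can be sketched as a retrieval function over the Vote-K-built pool. In this sketch similarity is a simple bag-of-words cosine to stay self-contained; a production system would use embedding similarity instead, and the pool contents are illustrative:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(pool, query, n=3):
    """Stage 2: pick the n demonstrations most similar to the query.
    `pool` holds (text, label) pairs annotated via Vote-K (stage 1)."""
    q = Counter(query.lower().split())
    ranked = sorted(
        pool,
        key=lambda ex: cosine(Counter(ex[0].lower().split()), q),
        reverse=True,
    )
    return ranked[:n]

# Pool assumed to come from earlier Vote-K annotation rounds
pool = [
    ("charged twice and cannot log in", "Billing"),
    ("app crashes on startup", "Technical"),
    ("please reset my password", "Account Access"),
]
demos = retrieve(pool, "I was charged twice for my subscription", n=1)
print(demos)  # prints [('charged twice and cannot log in', 'Billing')]
```

The retrieved demonstrations would then be formatted into the few-shot prompt for that query, so annotation effort (stage 1) and prompt relevance (stage 2) are optimized independently.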