Vote-K Prompting
Not all unlabeled data is created equal. Vote-K lets the model itself vote on which examples from an unlabeled pool would be most valuable to annotate — concentrating expensive human effort on the demonstrations that will actually move the needle.
Introduced: Vote-K was proposed by Su et al. in 2022 (“Selective Annotation Makes Language Models Better Few-Shot Learners”) as an active learning strategy for in-context learning. The core idea is elegant: rather than randomly selecting which unlabeled examples to annotate for few-shot demonstrations, let the language model itself “vote” on which k candidates from an unlabeled pool would be most informative. The model evaluates each candidate’s potential contribution to the demonstration set, effectively identifying examples that sit near decision boundaries or cover underrepresented patterns. This transforms demonstration pool construction from a guessing game into a guided, model-informed selection process.
Modern LLM Status: The concept of model-guided data selection remains highly relevant for building high-quality demonstration pools, especially in specialized domains where annotation is expensive. While modern LLMs like Claude, GPT-4, and Gemini have reduced the need for carefully curated few-shot examples in general tasks, Vote-K’s principles shine in domain-specific applications — medical coding, legal classification, and technical support routing — where each annotated example costs real time and expert attention. The broader insight that strategic data curation outperforms brute-force data collection has become a foundational principle in applied AI.
Model-Guided Selective Annotation
When building a demonstration pool for few-shot prompting, the default approach is to annotate examples at random. But random selection wastes effort on easy, redundant cases that the model already handles well. The examples that matter most are the ones the model finds confusing, ambiguous, or novel — exactly the cases where a labeled demonstration would teach it something new.
Vote-K flips annotation from passive to active. Instead of blindly labeling data, you present the model with unlabeled candidates and ask it to vote on which k examples it most wants to see annotated. The model identifies its own uncertainty: examples where it is least confident, where multiple valid classifications seem possible, or where the input pattern differs from anything in the current demonstration set. These high-value votes guide human annotators toward the data points that will deliver the greatest learning impact per label.
Think of it like a student who, instead of being assigned random homework problems, gets to point at the specific problems they find most confusing — then a teacher works through those targeted examples to maximize learning efficiency.
A randomly selected demonstration pool often contains many similar, easy examples that teach the model nothing new — while leaving difficult edge cases unrepresented. Vote-K concentrates annotation effort on the examples the model finds most uncertain or informative, building a demonstration set where every labeled example carries maximum instructional weight. The result: better performance from fewer annotations, which directly translates to lower cost and faster deployment.
The Vote-K Process
Four stages from unlabeled data to an optimized demonstration pool
Start with a Labeled Seed Pool and Unlabeled Candidates
Begin with a small set of already-labeled examples (the seed pool) and a larger collection of unlabeled data. The seed pool gives the model a baseline understanding of the task, while the unlabeled candidates represent the raw material from which the model will select its most-wanted annotations.
You have 20 labeled customer support tickets across 5 categories, plus 500 unlabeled tickets waiting to be classified. The model can already handle straightforward cases but struggles with edge cases and ambiguous tickets.
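The starting state above can be sketched as two simple pools. This is a minimal Python sketch: the `Ticket` class, the ticket texts, and the category names are illustrative stand-ins, not a real dataset or API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category set for the support-ticket scenario
CATEGORIES = ["Billing", "Technical", "Account Access",
              "Feature Request", "General Inquiry"]

@dataclass
class Ticket:
    text: str
    label: Optional[str] = None  # None until a human annotates it

# Labeled seed pool (2 of the 20 shown, for illustration)
seed_pool = [
    Ticket("My card was charged twice this month.", "Billing"),
    Ticket("The app crashes when I open settings.", "Technical"),
]

# Unlabeled candidates awaiting selection (2 of the 500 shown)
unlabeled_pool = [
    Ticket("I was charged twice but also can't log in to check my invoice."),
    Ticket("The export button doesn't work. Can you add CSV export?"),
]
```

The only structural requirement is the distinction the process relies on: seed examples carry labels, candidates do not.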
Model Votes on K Most Useful Unlabeled Examples
The model evaluates unlabeled candidates against the current demonstration set and votes on which k examples would be most valuable to annotate. It prioritizes examples where its prediction confidence is lowest, where it detects novel patterns not covered by existing demonstrations, or where the input lies near the boundary between multiple categories.
The model reviews 500 unlabeled tickets and votes for 10 it finds most confusing — including a complaint that blends billing and technical issues, a request written in mixed language, and tickets from a product line not represented in the seed pool.
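One way to implement the voting step is to score each candidate by the entropy of the model’s predicted label distribution and keep the top k. A minimal sketch, assuming a `predict` callable that stands in for your actual LLM call (for example, reading token logprobs or sampling the classification several times and tallying the answers):

```python
import math

def entropy(dist):
    """Shannon entropy of a predicted label distribution {label: prob}.
    Higher entropy = the model is more torn between categories."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def vote_k(candidates, predict, k):
    """Rank unlabeled candidates by prediction entropy; return the top k.
    `predict(text)` is a stand-in for an LLM call that returns a label
    distribution for one candidate given the current demonstrations."""
    scored = [(entropy(predict(text)), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy stand-in: pretend the model is confident on single-intent tickets
# but torn between two categories on mixed-intent ones.
def fake_predict(text):
    if " and " in text:
        return {"Billing": 0.5, "Technical": 0.5}   # maximally uncertain
    return {"Billing": 0.95, "Technical": 0.05}     # confident

pool = ["Refund please", "Charged twice and cannot log in", "App crashes"]
print(vote_k(pool, fake_predict, k=1))  # prints ['Charged twice and cannot log in']
```

Entropy is only one scoring choice; the same skeleton works with margin-based uncertainty or a novelty score against the existing demonstrations.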
Human Annotates the Top-Voted Examples
A domain expert reviews and labels the k examples the model selected. Because these are the cases the model found most challenging, each annotation carries outsized instructional value. The human effort is concentrated exactly where it matters most, rather than spread thinly across easy cases the model already understands.
The support team lead labels the 10 voted tickets: the billing-plus-technical ticket gets classified as “Billing” with a note about routing, the mixed-language ticket gets its primary category, and the new product line tickets establish fresh category patterns.
Add to Demonstration Pool and Repeat
The newly annotated examples join the demonstration pool, expanding the model’s reference set. The process can repeat iteratively — each cycle the model votes on the next batch of most-valuable candidates, and each round of annotation targets a different frontier of the model’s uncertainty. Performance improves with each iteration as the demonstration pool becomes increasingly comprehensive and strategically composed.
After three rounds of Vote-K selection, the demonstration pool grows from 20 to 50 examples — but those 50 strategically chosen examples outperform a randomly selected pool of 200 because they cover the exact edge cases and boundary conditions where the model needed guidance.
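The full iterative cycle can be sketched as a short loop. `select` and `annotate` below are placeholders for the model-voting step and the human labeling step; the toy run mirrors the scenario above (20 seed examples, three rounds of 10):

```python
def vote_k_rounds(seed, unlabeled, select, annotate, rounds=3, k=10):
    """Iterative Vote-K: each round, the model votes for k candidates,
    a human labels them, and they join the demonstration pool.
    `select(pool, candidates, k)` and `annotate(examples)` are stand-ins
    for the model-voting and expert-annotation steps."""
    pool = list(seed)
    candidates = list(unlabeled)
    for _ in range(rounds):
        chosen = select(pool, candidates, k)        # model votes
        labeled = annotate(chosen)                  # expert labels the votes
        pool.extend(labeled)                        # pool grows strategically
        candidates = [c for c in candidates if c not in chosen]
    return pool

seed = [f"seed-{i}" for i in range(20)]
unlabeled = [f"ticket-{i}" for i in range(500)]
pool = vote_k_rounds(
    seed, unlabeled,
    select=lambda pool, cands, k: cands[:k],        # placeholder vote
    annotate=lambda chosen: [c + ":labeled" for c in chosen],
)
print(len(pool))  # prints 50
```

Each round re-runs selection against the grown pool, so later votes naturally target a different frontier of uncertainty than earlier ones.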
See the Difference
Why model-guided selection outperforms random annotation
Random Annotation
Randomly select 50 unlabeled examples from a pool of 500, send them to domain experts for labeling, and add all 50 to the demonstration set regardless of their instructional value.
35 of the 50 annotated examples are straightforward cases the model already classifies correctly. Only 15 cover novel patterns or edge cases. The annotation budget is largely spent on redundant demonstrations that add no new learning signal.
Vote-K Guided Annotation
The model votes on which 50 examples from the pool of 500 it finds most uncertain or informative. Only these model-selected candidates go to domain experts for labeling.
All 50 annotated examples target the model’s actual weaknesses — ambiguous edge cases, underrepresented categories, and boundary-straddling inputs. Every annotation teaches the model something new, producing a demonstration pool that delivers superior performance with fewer labels.
Vote-K in Action
See how model-guided selection optimizes annotation across domains
A support team needs to classify incoming tickets into categories: Billing, Technical, Account Access, Feature Request, and General Inquiry. They have 15 labeled examples but 800 unlabeled tickets. Budget allows labeling only 30 more.
Model votes for these high-value candidates:
Voted #1: “I was charged twice but also can’t log in to check my invoice.”
Why voted: Straddles Billing and Account Access — the model is uncertain which category takes priority when both apply.
Voted #2: “The export button doesn’t work. Can you add CSV export?”
Why voted: Blends Technical (bug report) with Feature Request — the model needs guidance on how to classify mixed-intent tickets.
Voted #3: “Hola, necesito ayuda con mi cuenta pero no hablo inglés bien.” (“Hi, I need help with my account but I don’t speak English well.”)
Why voted: Non-English input not represented in the seed pool — the model has no reference for how to handle tickets written in other languages.
Result: After annotating 30 model-selected tickets, classification accuracy improved by 23% compared to annotating 30 random tickets — with the biggest gains on multi-category and edge-case tickets.
A legal tech company needs to identify and classify clauses in contracts: Limitation of Liability, Indemnification, Termination, Confidentiality, and Force Majeure. They have 25 labeled clause examples from standard templates but 2,000 unlabeled clauses from diverse real-world contracts.
Model votes for these high-value candidates:
Voted #1: A clause that combines indemnification obligations with liability caps in a single paragraph.
Why voted: Overlaps two categories — the model cannot determine whether to classify it as Limitation of Liability or Indemnification without a labeled example showing the precedent.
Voted #2: A pandemic-specific termination clause referencing government shutdown orders.
Why voted: Blends Force Majeure language with Termination provisions — a pattern absent from the standard-template seed pool.
Voted #3: An informal clause from a startup contract written in plain English rather than legal terminology.
Why voted: The model’s seed pool contains only formal legal language — it has no reference for recognizing clause types when expressed colloquially.
Result: After two Vote-K rounds (40 total annotations), clause extraction F1 score reached 0.89 — a level that random annotation required 120 labeled examples to achieve.
An e-commerce company wants to classify product reviews by sentiment (Positive, Negative, Neutral) and extract specific aspect mentions (Quality, Price, Shipping, Customer Service). They have 30 labeled reviews but 5,000 unlabeled reviews across product categories.
Model votes for these high-value candidates:
Voted #1: “The product is amazing but the shipping was a nightmare and customer service was unhelpful. 3 stars.”
Why voted: Mixed sentiment across multiple aspects — the model cannot determine overall sentiment when individual aspects conflict.
Voted #2: “Returned it. Twice. Third one worked fine I guess.”
Why voted: Terse, ambiguous language with implied negative experience but a partially positive resolution — the model finds the sentiment boundary unclear.
Voted #3: A review that uses heavy sarcasm: “Oh sure, I love waiting three weeks for something that breaks on day one. Five stars for the entertainment value.”
Why voted: Sarcasm inverts surface-level positive language into negative sentiment — a pattern the seed pool does not address.
Result: Vote-K-guided annotation produced a demonstration pool where sentiment accuracy on edge cases improved by 31% over random selection, with sarcasm detection improving from near-chance to 78% accuracy.
When to Use Vote-K
Best for building high-quality demonstration pools with limited annotation budgets
Perfect For
Medical, legal, financial, or scientific domains where each labeled example requires costly expert time — Vote-K ensures every annotation delivers maximum value.
When you have thousands of unlabeled examples but can only afford to label a small fraction — Vote-K identifies the highest-impact subset to annotate.
When you want to systematically improve few-shot performance over multiple rounds — each Vote-K cycle targets a different frontier of model uncertainty.
When your task has many ambiguous or boundary-straddling examples that random sampling tends to miss — Vote-K surfaces exactly these hard cases.
Skip It When
If labeling examples is fast and inexpensive, the overhead of running Vote-K selection may not justify the efficiency gains — just label everything directly.
If you only have a handful of unlabeled examples, the selection overhead is unnecessary — simply annotate them all and skip the voting step.
If the model already performs well enough on the task without any demonstrations, building an elaborate demonstration pool provides diminishing returns.
Use Cases
Where Vote-K delivers the most value
Medical Record Coding
Identify the most ambiguous clinical notes for expert coding — focusing annotation on cases where diagnosis codes overlap or symptom descriptions are atypical, rather than wasting physician time on clear-cut records.
Contract Classification
Surface contract clauses that defy standard templates for legal expert review — hybrid provisions, unusual language, and jurisdiction-specific variations that the model cannot confidently categorize.
Support Ticket Routing
Let the model identify tickets it cannot confidently route between departments, then have team leads label those boundary cases to build a demonstration set that resolves the trickiest routing decisions.
Content Moderation
Focus moderator annotation on content the model finds ambiguous — sarcasm, cultural context, and borderline cases — rather than clearly benign or clearly violating posts the model already handles reliably.
Financial Transaction Tagging
Identify transactions the model struggles to categorize — unusual merchant names, cross-category purchases, and subscription patterns — then have accountants label those specific cases for maximum classification improvement.
Research Paper Tagging
Surface interdisciplinary papers that span multiple research domains for expert categorization, building a demonstration set that teaches the model to handle cross-field publications that defy single-category classification.
Where Vote-K Fits
Vote-K bridges passive data selection and intelligent demonstration engineering
Vote-K and Example Selection are natural partners. Use Vote-K to build a high-quality, strategically annotated demonstration pool, then use Example Selection techniques (similarity-based, diversity-based, or task-specific retrieval) to pick the best demonstrations from that pool for each new input at inference time. This two-stage pipeline — Vote-K for pool construction, Example Selection for runtime retrieval — maximizes both annotation efficiency and per-query performance.
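The second stage of that pipeline can be sketched as a retrieval function over the Vote-K-built pool. In this sketch similarity is a simple bag-of-words cosine to stay self-contained; a production system would use embedding similarity instead, and the pool contents are illustrative:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(pool, query, n=3):
    """Stage 2: pick the n demonstrations most similar to the query.
    `pool` holds (text, label) pairs annotated via Vote-K (stage 1)."""
    q = Counter(query.lower().split())
    ranked = sorted(
        pool,
        key=lambda ex: cosine(Counter(ex[0].lower().split()), q),
        reverse=True,
    )
    return ranked[:n]

# Pool assumed to come from earlier Vote-K annotation rounds
pool = [
    ("charged twice and cannot log in", "Billing"),
    ("app crashes on startup", "Technical"),
    ("please reset my password", "Account Access"),
]
demos = retrieve(pool, "I was charged twice for my subscription", n=1)
print(demos)  # prints [('charged twice and cannot log in', 'Billing')]
```

The retrieved demonstrations would then be formatted into the few-shot prompt for that query, so annotation effort (stage 1) and prompt relevance (stage 2) are optimized independently.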