Active-Prompt: pick the few-shot examples that actually teach
Active-Prompt uses uncertainty signals to identify which examples should be human-annotated and added to your few-shot pool. The result: smaller example sets that outperform larger ones picked at random.
You're building a few-shot classifier and you have 200 unlabeled inputs you could potentially use as examples. Labeling all 200 takes a week. Picking 5 random ones gives you a mediocre prompt. Picking 5 by gut feeling gives you a slightly-less-mediocre prompt.
Active-Prompt is the smarter alternative. Run the model on each candidate with no examples, measure how uncertain it is on each, and annotate only the most uncertain ones. The inputs the model finds hardest are exactly the ones whose annotations teach the most. The result: 5 well-chosen examples beat 20 random ones.
The whole idea in one line
Measure where the model is uncertain, and spend your annotation budget there.
The mental model: active learning for prompts#
In supervised ML, "active learning" is a long-standing idea: instead of labeling random data, label the data your model is most uncertain about. Each new label resolves the most ambiguity, so you reach high accuracy with far less labeled data.
Active-Prompt applies the same logic to few-shot example selection. Your model is already "trained" (it's the base model). The few-shot examples are like additional training data injected at run time. So picking which ones to annotate is an active-learning problem — and uncertainty is the signal.
The Active-Prompt loop#
- Sample to measure uncertainty. For each unlabeled candidate, run the model N times at non-zero temperature. Different runs may produce different answers; that variance is your uncertainty signal.
- Rank by disagreement. Candidates where all N runs agree are easy (the model's already confident). Candidates where runs disagree are hard. Sort by disagreement, descending.
- Annotate the hardest. Take the top K most-uncertain candidates and have a human label them correctly.
- Add to few-shot pool. Use these annotated examples in your few-shot prompt going forward.
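A minimal sketch of that loop, assuming a `sample_answer` placeholder that wraps one model call at temperature 0.7; the constants and helper names are illustrative, not part of any published implementation:

```python
from collections import Counter

N_SAMPLES = 5   # runs per candidate at non-zero temperature
TOP_K = 5       # how many candidates to send for human annotation

def sample_answer(candidate: str) -> str:
    """Placeholder: one model call at temperature ~0.7 that returns
    just the predicted label. Wire this to whatever LLM API you use."""
    raise NotImplementedError

def disagreement(candidate: str) -> float:
    """Share of sampled answers that deviate from the majority vote.
    0.0 means all runs agreed; higher means more uncertainty."""
    answers = [sample_answer(candidate) for _ in range(N_SAMPLES)]
    majority = Counter(answers).most_common(1)[0][1]
    return 1.0 - majority / N_SAMPLES

def select_for_annotation(candidates: list[str]) -> list[str]:
    """Rank unlabeled candidates by disagreement, hardest first,
    and return the TOP_K to hand to a human annotator."""
    ranked = sorted(candidates, key=disagreement, reverse=True)
    return ranked[:TOP_K]
```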
Measuring uncertainty without logprobs#
The original research uses model logprobs to measure uncertainty directly. In production, most APIs don't expose logprobs (or expose them inconsistently). Three practical proxies that work without logprobs:
- Sampled disagreement. Run the model 5 times at temperature 0.7. Count how many distinct answers appear. More distinct answers = higher uncertainty.
- Self-reported confidence. Ask the model to output its confidence 1-5 alongside the answer. Use 1-3 as "uncertain". Less reliable than sampled disagreement (the same model judging its own confidence is biased) but cheap.
- Cross-model disagreement. Run the candidate through two different models. Where they disagree, the case is ambiguous and worth labeling.
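Given the answers collected from the N sampled runs (for example with the prompt template below), the first and third proxies reduce to a few lines. The helper names here are illustrative, not from the original paper:

```python
def sampled_disagreement(answers: list[str]) -> int:
    """Proxy 1: count distinct answers across the N sampled runs.
    1 means every run agreed; more distinct answers = more uncertain."""
    return len({a.strip().lower() for a in answers})

def cross_model_disagreement(answer_a: str, answer_b: str) -> bool:
    """Proxy 3: the same candidate sent to two different models.
    A mismatch flags the case as ambiguous and worth labeling."""
    return answer_a.strip().lower() != answer_b.strip().lower()
```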
# Run this prompt N=5 times per candidate at temperature 0.7
# Then count distinct answers. More distinct = more uncertain.
Classify the candidate input below as exactly one of:
{{categories}}
Input: {{candidate_input}}
Output the classification only — no explanation.
Classification:

When Active-Prompt earns its keep#
Active-Prompt vs. alternatives
| If your situation is… | Reach for… | Why |
|---|---|---|
| Annotation is expensive (subject-matter experts, complex labels) | Active-Prompt | Each annotation matters; pick the high-leverage ones |
| You have 100+ unlabeled candidates and budget for 5–10 labels | Active-Prompt | The selection problem is the bottleneck; uncertainty solves it |
| Subjective tasks (style, voice) where uncertainty is hard to measure | Skip — random or curated selection | Uncertainty proxies don't correlate well with subjective difficulty |
| Small candidate pool (<20 inputs) | Just label them all | Selection overhead exceeds the labeling savings |
| Annotation is cheap (the team can label fast) | Random sampling + diversity heuristics | Active-Prompt's ranking cost > the labeling savings |
| Few-shot prompt that's already plateaued | Active-Prompt to find new failure modes | High-uncertainty cases reveal what your current examples don't cover |
Going further: production patterns#
Rolling re-annotation#
As production data shifts, the cases that used to be high-uncertainty stop being so — and new ones emerge. Re-run Active-Prompt periodically (monthly for high-volume tasks) and rotate stale examples out. Keeps the few-shot set aligned with the current input distribution.
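One way to wire up the rotation, sketched below; the easy-score threshold and helper names are assumptions rather than a prescribed recipe:

```python
def refresh_pool(pool, fresh_inputs, uncertainty, easy_threshold=0.2, top_k=5):
    """pool: current few-shot examples as (input, human_label) pairs.
    fresh_inputs: recent unlabeled production inputs.
    uncertainty: callable mapping an input string to a disagreement score."""
    # Rotate out pool examples the model now finds trivially easy (stale).
    kept = [(x, y) for x, y in pool if uncertainty(x) > easy_threshold]
    # Queue the hardest fresh inputs for human annotation.
    to_annotate = sorted(fresh_inputs, key=uncertainty, reverse=True)[:top_k]
    return kept, to_annotate
```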
Combine uncertainty with diversity#
Pure top-K uncertainty can pick examples that are similar to each other (clustered in input space). Add a diversity penalty: from the top 20 most-uncertain, pick 5 that are most different from each other. The resulting examples cover more ground.
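One simple way to apply the penalty is greedy max-min selection over embeddings. The sketch below assumes an `embed` function (any text-embedding model will do) and a `candidates` list already sorted most-uncertain first:

```python
import numpy as np

def diverse_subset(candidates: list[str], embed, k: int = 5) -> list[str]:
    """From an uncertainty-ranked shortlist (e.g. the top 20), pick k
    candidates that are maximally spread out in embedding space."""
    vecs = [np.asarray(embed(c), dtype=float) for c in candidates]
    chosen = [0]  # seed with the single most uncertain candidate
    while len(chosen) < k:
        # Pick the candidate whose nearest already-chosen neighbour
        # is farthest away (max-min distance).
        best_i, best_d = None, -1.0
        for i in range(len(candidates)):
            if i in chosen:
                continue
            d = min(np.linalg.norm(vecs[i] - vecs[j]) for j in chosen)
            if d > best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return [candidates[i] for i in chosen]
```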
Per-domain example pools#
Maintain separate few-shot pools per input subdomain. Active-Prompt picks the hardest examples within each domain. At runtime, retrieve examples from the pool matching the current input. Compounds with few-shot dynamic selection patterns.
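A minimal sketch of the routing, assuming a `detect_domain` router (a small classifier or even a keyword rule) that maps each input to one of the pool keys:

```python
from collections import defaultdict

# domain -> (input, human_label) examples selected by Active-Prompt per domain
pools: dict[str, list[tuple[str, str]]] = defaultdict(list)

def build_prompt(user_input: str, detect_domain, k: int = 5) -> str:
    """Assemble a few-shot prompt from the pool matching the input's domain."""
    domain = detect_domain(user_input)
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in pools[domain][:k])
    return f"{shots}\n\nInput: {user_input}\nLabel:"
```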
Evaluate the selection itself#
Build an eval set independent of your few-shot pool. Compare downstream task accuracy with: (a) random examples, (b) Active-Prompt examples, (c) hand-curated examples. The lift from Active-Prompt over random measures whether the selection is actually working — sometimes it doesn't on tasks where uncertainty proxies are weak.
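A sketch of that comparison, assuming a `predict` wrapper around your few-shot prompt and an `eval_set` of (input, gold label) pairs held out from every example pool:

```python
def accuracy(examples, eval_set, predict) -> float:
    """examples: the few-shot pool under test; predict(examples, text) -> label."""
    hits = sum(predict(examples, text) == gold for text, gold in eval_set)
    return hits / len(eval_set)

# Usage sketch: run the same eval set against each selection strategy.
# for name, pool in [("random", random_pool), ("active_prompt", active_pool),
#                    ("hand_curated", curated_pool)]:
#     print(name, accuracy(pool, eval_set, predict))
# If active_prompt doesn't beat random, the uncertainty proxy probably
# isn't capturing real difficulty on this task.
```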
Common mistakes#
- Sampling at temperature 0. Same input → same output 5 times means uncertainty is always 0. Set temperature 0.7+ for the disagreement measurement.
- Using model self-confidence without validating it. Self-reported confidence is biased. Cross-check against actual accuracy on a labeled test set before trusting it as your uncertainty signal.
- Picking only the most uncertain examples. Adversarial / extreme cases can poison a few-shot prompt — the model treats the weird examples as the typical pattern. Mix in a few easy cases too.
- One-shot Active-Prompt with no re-runs. Production data drifts. The hard cases of Q1 aren't the hard cases of Q4. Re-run periodically.
Quick reference#
The 60-second summary
What it is: active learning applied to few-shot example selection. Annotate the inputs the model is most uncertain about.
Uncertainty proxy: sampled disagreement (run N times at temperature 0.7, count distinct outputs).
When it shines: annotation is expensive, candidate pool is large (100+), classification or extraction tasks.
When to skip: small candidate pool, cheap annotation, subjective tasks where uncertainty proxies are weak.
The discipline: mix uncertain + easy examples, refresh periodically as data drifts, evaluate the selection on a held-out test set.
What to read next#
Active-Prompt builds on top of few-shot prompting, the technique whose example-selection problem it solves. For automated prompt generation that pairs naturally with Active-Prompt, see Automatic Prompt Engineer. And for the eval workflow that validates whether selection is actually helping, see A/B testing prompts.