# A/B testing prompts: pick the winner with data, not vibes
Eyeballing two prompt outputs is not a test. Learn how to run a proper A/B test on prompts — across versions, across models — and lock in the winner with confidence.
Two engineers stand over a laptop. They run two prompts, look at the outputs, and the senior engineer says "yeah, B is clearly better." Two weeks later, they realize B was producing slightly worse summaries on customer-support emails — they only tested it on the easy ones, and B happened to use a phrasing one of them preferred.
Eyeballing two prompt outputs is not a test. It's taste. And taste lies — about your own preferences, about what "clearly better" means, about whether you actually saw enough cases. Real prompt A/B testing means running both versions on the same inputs, scoring them on the same rubric, and letting numbers decide.
Done well, it takes 20 minutes and saves you from shipping a prompt that's 15% worse than the one you already had. Done badly, it produces results-shaped feelings. This guide is the done-well version.
**The whole idea in one line:** Run both prompt versions on the same fixed inputs, score the outputs blind against a rubric you wrote before testing, and switch only when the margin beats the noise.
## The mental model: it's a science experiment, not a demo
Demos show one input. Tests show 20. Demos hit the happy path. Tests hit edge cases. Demos confirm what you wanted to believe. Tests tell you what's actually true.
A real A/B test is small science: a hypothesis (B is better than A), a control (the inputs, the model, the rubric), and a measurement (the score). The whole point is to remove your taste from the loop — because your taste was selectively trained on the cases you already saw.
## When A/B testing is worth the effort
Prompt A/B tests are worth running when:
- The prompt runs frequently. Optimizing a prompt that runs 5 times a week is a fine hobby; optimizing one that runs 500 times a day pays for itself in days.
- The output ships to customers. Internal drafts can be cleaned up by a human; customer-facing outputs cannot.
- You're considering a model switch. Different models reward different phrasing. Switching from Claude to GPT-4o without re-testing is a coin flip.
- Two team members disagree. Don't debate it in Slack. Test it.
Don't bother A/B testing one-off prompts you'll run twice. The math doesn't work.
## The 5-step prompt A/B testing recipe

### Step 1 — Define the rubric before you test
The biggest mistake people make is running the test, then deciding what "better" means. That is how confirmation bias creeps in. Write your rubric first.
For an email-summarization prompt, a good rubric might be:
- Accuracy — does the summary correctly represent the source? (1–5)
- Conciseness — is it under the word limit and free of filler? (1–5)
- Action clarity — does it correctly identify whether action is needed? (binary: yes/no)
Three to five rubric items is the sweet spot. More than five and you'll spend more time scoring than fixing the prompt.
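If you plan to score in a script or spreadsheet later, it helps to pin the rubric down as data rather than prose. A minimal Python sketch; the class and field names are illustrative, not from any framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RubricItem:
    name: str
    question: str
    scale: tuple[int, int]  # (min, max); use (0, 1) for binary items

# The email-summarization rubric above, written down *before* any test run.
RUBRIC = [
    RubricItem("accuracy", "Does the summary correctly represent the source?", (1, 5)),
    RubricItem("conciseness", "Is it under the word limit and free of filler?", (1, 5)),
    RubricItem("action_clarity", "Does it identify whether action is needed?", (0, 1)),
]
```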
### Step 2 — Build a fixed test set of 10–20 inputs
Lock in your inputs before you test, and use the same inputs for both versions. Sample them from real production traffic where possible — synthetic test inputs lie. Aim for 10 to 20 inputs: small enough to score by hand, large enough to expose variance.
**Pick adversarial inputs on purpose.** Include the long, messy, ambiguous cases that have historically caused problems. A test set of only easy inputs will pass both versions and tell you nothing.
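One lightweight way to lock the set is a JSONL file checked into version control, one real email per line, frozen before the first run. A sketch; the filename and fields here are assumptions:

```python
import json

def load_test_set(path: str = "test_set.jsonl") -> list[dict]:
    """Each line: {"id": "...", "email_body": "..."} sampled from production."""
    with open(path) as f:
        return [json.loads(line) for line in f]

inputs = load_test_set()
assert 10 <= len(inputs) <= 20, "small enough to hand-score, big enough to show variance"
```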
### Step 3 — Run both versions on every input
In the playground, run version A and version B against the same input back-to-back. Save the outputs. The order matters: alternate which version goes first to avoid a recency preference creeping into your scoring.
If you're also testing across models (e.g. Claude vs. GPT-4o), do the full grid: v_A × Claude, v_A × GPT-4o, v_B × Claude, v_B × GPT-4o. Same inputs everywhere.
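As a script, the run is a sweep over the full grid of inputs × versions × models. In the sketch below, `call_model` is a placeholder for your provider client and `inputs` is the frozen set from Step 2; everything else is bookkeeping:

```python
import itertools

PROMPTS = {"A": "...version A template...", "B": "...version B template..."}
MODELS = ["model-1", "model-2"]  # placeholder names; use a single model if not comparing

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire in your provider's client here")

results = []
for item in inputs:
    for version, model in itertools.product(PROMPTS, MODELS):
        prompt = PROMPTS[version].replace("{{email_body}}", item["email_body"])
        results.append({
            "input_id": item["id"],
            "version": version,
            "model": model,
            "output": call_model(model, prompt),
        })
```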
### Step 4 — Score blind, then reveal
Strip the version labels. Score every output against your rubric without knowing which version produced it. This is the step that catches confirmation bias most reliably.
If you have a teammate available, even better — have them score separately, then reconcile differences. The reconciliation conversation usually surfaces the most interesting findings.
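Blinding can be mechanical: shuffle the outputs, write them to a scoring sheet under anonymous IDs, and keep the ID-to-version key in a separate file you open only after scoring. A sketch continuing from the run above:

```python
import csv
import json
import random

random.shuffle(results)  # also kills any "which version ran first" signal

mapping = {}
with open("scoring_sheet.csv", "w", newline="") as sheet:
    writer = csv.writer(sheet)
    writer.writerow(["blind_id", "output", "accuracy", "conciseness", "action_clarity"])
    for i, r in enumerate(results):
        blind_id = f"out-{i:03d}"
        mapping[blind_id] = {"version": r["version"], "model": r["model"]}
        writer.writerow([blind_id, r["output"], "", "", ""])  # scorer fills the blanks

with open("key.json", "w") as key:
    json.dump(mapping, key)  # open only after every output is scored
```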
### Step 5 — Decide with a margin
After scoring, total each version and look at the gap. Three outcomes:
- Clear winner — version B beats A by 10%+ on every rubric item. Switch.
- Trade-off — B is better on accuracy, worse on conciseness. Now you have a real product decision: which matters more for this use case?
- Noise — B is 2% better, but it's within run-to-run variance. Don't switch. The change didn't move the needle; ship the simpler one.
**The 'noise' result is a real result.** A 2% gap inside run-to-run variance is not a failed test; it's the test telling you the change doesn't matter. Keep the simpler prompt and spend your iteration budget elsewhere.
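After rejoining the key, the decision rule fits in a few lines. Here `scores` is assumed to be the rejoined sheet, a list of dicts with a `version` plus the three rubric fields; the 10% margin is this guide's heuristic, not a statistical law:

```python
from statistics import mean

ITEMS = ["accuracy", "conciseness", "action_clarity"]

def item_means(scores: list[dict], version: str) -> dict:
    rows = [s for s in scores if s["version"] == version]
    return {item: mean(r[item] for r in rows) for item in ITEMS}

a, b = item_means(scores, "A"), item_means(scores, "B")
gaps = {item: (b[item] - a[item]) / a[item] for item in ITEMS if a[item]}

if all(g >= 0.10 for g in gaps.values()):
    print("Clear winner: B beats A by 10%+ on every item. Switch.")
elif any(g > 0 for g in gaps.values()) and any(g < 0 for g in gaps.values()):
    print(f"Trade-off: {gaps}. A product decision, not a math one.")
else:
    print(f"No decisive win for B: {gaps}. Don't switch; keep the simpler prompt.")
```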
## Multi-model comparisons: a special case
Different models reward different phrasing. The prompt that wins on Claude might lose on GPT-4o. Two heuristics:
- Test the prompt + model as a unit. A great prompt on the wrong model is worse than a mediocre prompt on the right one. Pick the pair, not the prompt.
- Watch out for default-model nudges. Claude is more likely to use bullet points without being asked; GPT-4o produces more flowery prose. If your rubric punishes one of those, you're testing model defaults, not prompt quality. Either fix the rubric or explicitly constrain the format.
## A worked example
Here are two versions of an email summarizer being A/B tested. Version A is the current production prompt; B is a candidate.
Version A (current production):

```
Summarize the email below in 3 bullet points.
Each bullet should be one sentence.

Email:
"""
{{email_body}}
"""
```

Version B (candidate):

```
Summarize the email below in 3 bullet points.
For each bullet, lead with a verb (e.g., "Confirms", "Asks").
End with one final line: "Action needed: <yes/no, what>".

Email:
"""
{{email_body}}
"""
```

Rubric: accuracy (1–5), conciseness (1–5), action clarity (yes/no, must match a hand-labeled answer).
Test set: 15 real customer-support emails.
Result (illustrative): Version B wins on action clarity (14/15 vs. 8/15) and ties on accuracy and conciseness. The action-needed line was the entire point — the verb constraint forced it. Promote B to current; pin v_A in case of regressions.
## Picking the right test pattern
**A/B test by context**
| If your situation is… | Reach for… | Why |
|---|---|---|
| Single prompt, two versions | Manual A/B test (this guide) | Fast, no infrastructure; the right starting point |
| Many prompts, frequent iteration | Automated eval framework (Promptfoo, Braintrust) | Manual scoring doesn't scale past ~50 inputs |
| Outputs are subjective (writing, code review) | LLM-as-judge with paired scoring | Human scoring is gold standard but slow; LLM judges scale |
| Outputs are extractable (JSON, classification) | Programmatic equality check | Compare against ground truth deterministically |
| Production traffic with measurable downstream metrics | Live A/B with metric tracking | The most rigorous; needs flag infrastructure |
| Disagreement between teammates | Manual A/B with both scoring blind | Reconciling disagreement is the actual learning |
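For the "outputs are extractable" row, the scorer can be fully deterministic: parse the model output and compare it to hand-labeled ground truth. A minimal sketch:

```python
import json

def exact_match(output: str, expected: dict) -> bool:
    """Parse a structured output and compare to ground truth; malformed output fails."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False

# e.g. exact_match('{"action_needed": true}', {"action_needed": True}) -> True
```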
## Going further: production A/B patterns

### LLM-as-judge for scaling beyond 50 inputs
Manual scoring doesn't scale past ~50–100 inputs. Past that, use a separate LLM call to score outputs against your rubric. The judge prompt, in essence: "Given input X and two outputs A and B, which better satisfies this rubric? Output your reasoning, then the winner." Pair it with human spot-checks on a sample to calibrate the judge.
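A paired judge call might look like the sketch below. `call_model` is again a placeholder for your client; the position flip matters because LLM judges show a measurable bias toward one slot:

```python
import random

JUDGE_TEMPLATE = """Given the input and two candidate outputs, decide which output
better satisfies the rubric. Explain your reasoning, then end with exactly
"WINNER: 1" or "WINNER: 2".

Rubric: {rubric}
Input: {input}
Output 1: {first}
Output 2: {second}"""

def judge_pair(input_text, out_a, out_b, rubric, call_model):
    flipped = random.random() < 0.5  # randomize positions so position bias averages out
    first, second = (out_b, out_a) if flipped else (out_a, out_b)
    verdict = call_model(JUDGE_TEMPLATE.format(
        rubric=rubric, input=input_text, first=first, second=second))
    first_wins = verdict.rstrip().endswith("WINNER: 1")
    if flipped:
        return "B" if first_wins else "A"
    return "A" if first_wins else "B"
```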
### Eval frameworks
For systematic prompt testing, use a dedicated framework — Promptfoo, Braintrust, LangSmith. They handle test set management, parallel execution, judge orchestration, and trend tracking over time. See tools for the curated list.
### Building a good eval set
The test set is the product. Curate it like one:
- Source from real production traffic (with permission).
- Include edge cases that have historically caused problems.
- Hold out a portion as a true holdout — never train (or few-shot) on it (see the sketch after this list).
- Re-curate every few months as production traffic shifts.
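The holdout rule is easiest to keep when it's enforced in code: split once with a fixed seed, and never show the holdout to prompt-writing or few-shot selection. A sketch:

```python
import random

def split_eval_set(cases: list, holdout_frac: float = 0.2, seed: int = 42):
    """Deterministic one-time split; the holdout is only ever scored, never studied."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]  # (working set, true holdout)
```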
### Regression testing on every change
Once you have a stable eval set, run it automatically before any prompt change ships to production. It's the same idea as a code test suite: it catches regressions immediately and makes refactors safe. Tie it to your version-control workflow so promoting v_next is gated on a passing eval. See versioning.
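One concrete shape for the gate is a test that fails CI whenever a candidate scores below the pinned baseline. `run_eval_set` stands in for your eval runner, and the numbers are placeholders:

```python
# test_prompt_regression.py -- runs in CI before any prompt version is promoted
from my_evals import run_eval_set  # hypothetical: your eval runner from the steps above

BASELINE_SCORE = 4.2   # pinned mean rubric score of the current production prompt
TOLERANCE = 0.05       # allowance for run-to-run variance

def test_candidate_does_not_regress():
    score = run_eval_set("v_next")  # returns the candidate's mean rubric score
    assert score >= BASELINE_SCORE - TOLERANCE, (
        f"v_next scored {score:.2f}, below baseline {BASELINE_SCORE:.2f}"
    )
```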
## Anti-patterns to avoid
- Testing on one input. One input is a demo, not a test. You'll see what you want to see.
- Scoring with the version label visible. Confirmation bias is automatic. Always score blind.
- Changing two things at once. If A and B differ in three places, you can't tell which change drove the result. Keep it to one variable per test, or accept that you're testing the package, not the components.
- Ignoring trade-offs. If B is better on accuracy but slower or pricier, that matters. The total is not always the right metric.
- Letting the same person write the rubric and score. The author's biases leak into both. Have someone else write or review the rubric when stakes are high.
## Quick reference

**The 60-second summary**
- **The 5 steps:** rubric first → fixed test set → run both → score blind → decide with margin.
- **When it's worth it:** high-volume prompts, customer-facing outputs, model switches, team disagreements.
- **Three outcomes:** clear winner (switch), trade-off (real product decision), noise (don't switch).
- **The discipline:** rubric before outputs (not after), 10 inputs minimum, score blind always.
- **Scaling up:** manual up to ~50 inputs, LLM-as-judge past that, eval frameworks for repeated testing, regression-gated promotion in production.
## What to read next
A/B testing tells you which prompt is better. To make sure you can roll back if a regression slips through, set up version control: Version control for prompts.
And to make sure the winner actually gets used by your team, read Build a team prompt library. For tools that automate the workflow, see tools.