Biases in LLMs: where they come from and how to mitigate them
LLMs inherit biases from their training data. This guide covers the common types — demographic, recency, position, sycophancy — with examples, real impact, and concrete mitigations.
Same prompt. Same model. Same temperature. You feed it two resumes, identical except for the names — one obviously American-sounding, one signaling a less-represented background. Ask: "Score this candidate's fit on a scale of 1-10."
In real audits of real LLMs in 2024, the scores differed. Not always. Not by much. But measurably, consistently, in directions that mirrored the biases of the training data. The same lopsided outcomes the model was trained on, now reproduced at industrial scale.
LLMs inherit biases from their training data — every opinion, stereotype, frequency-based assumption, and rhetorical habit baked into the internet shows up in model outputs. Some are obvious. Most are subtle. All compound when your application makes decisions at scale.
This guide covers the four production-relevant bias categories, with concrete examples and concrete mitigations for each.
The mental model: the training data wrote your defaults
Models don't have opinions. They have statistical regularities — patterns absorbed from trillions of words of human-generated text. If doctors are described as "he" 70% of the time in training, the model defaults to "he" for ambiguous doctor pronouns 70% of the time. Not a choice. Not malice. Just statistics.
The catch: those statistics are now your defaults. Every output your application produces inherits them. At small scale, no one notices. At scale, you're reproducing patterns of representation that the training data captured — including the ones we'd like the world to outgrow.
Bias mitigation is the practice of detecting these inherited defaults and overriding them where they affect outcomes that matter.
Four biases that show up in production
1. Sycophancy bias
The model agrees with the user's implied position even when wrong. Phrasing a question as "is X better than Y? I think X is" pulls answers toward X. Saying "wouldn't you agree that…" produces agreement.
Reinforced by RLHF training: human raters rewarded agreeable answers, so the model learned to agree. Most production-relevant in evaluation tasks ("is this code good?") and decision support ("should I take option X?").
Mitigations:
- Phrase questions neutrally. "Compare X and Y" instead of "is X better than Y?" (see the sketch after this list).
- Strip user opinions before evaluation. For decision tasks, separate the question from the user's prior beliefs.
- Ask for the strongest argument against your own position. Forces the model to surface counterpoints instead of validating you.
- Use a critique persona. "You are a skeptical reviewer; find every weakness in my argument." See role prompting.
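A minimal sketch of the neutral-phrasing and critique-persona mitigations. `call_llm` is a placeholder for whatever model client you use; the function names and prompt wording here are illustrative, not a fixed recipe:

```python
def call_llm(prompt: str) -> str:  # placeholder for your model client
    raise NotImplementedError

def compare_neutrally(option_a: str, option_b: str, criteria: str) -> str:
    # Neutral phrasing: the prompt never reveals which option the user prefers.
    prompt = (
        f"Compare the following two options against these criteria: {criteria}\n\n"
        f"Option 1: {option_a}\n"
        f"Option 2: {option_b}\n\n"
        "List strengths and weaknesses of each before stating a conclusion."
    )
    return call_llm(prompt)

def red_team_position(position: str) -> str:
    # Critique persona: ask for the strongest case against the position,
    # which counteracts the learned pull toward agreement.
    prompt = (
        "You are a skeptical reviewer. Find every weakness in the following "
        f"argument and make the strongest case against it:\n\n{position}"
    )
    return call_llm(prompt)
```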
2. Position bias
In a list of options, models disproportionately pick whichever appears first or last (depending on the task). For multiple-choice questions, the first option is often picked. For ranking tasks, the last item seen gets a recency boost.
If your application uses an LLM to evaluate or rank candidates (resumes, products, answers, code review), this bias directly affects outcomes.
Mitigations:
- Randomize order. Shuffle options before each call. Removes positional correlation entirely (see the sketch after this list).
- Run twice with different orders. If the answers disagree, you have detected a position-sensitive case — escalate or default to a rule-based decision.
- Use letter labels carefully. A/B/C/D labels themselves create bias (some studies show models prefer A or C). Use neutral references where possible.
- For ranking tasks, use pairwise comparison. Compare items two at a time, randomize order in each pair, aggregate. More expensive but largely position-bias-free.
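Here is one way the randomize-and-run-twice pattern can look, again with a placeholder `call_llm`; the reply format ("exact text of the best option") is an assumption you would adapt to your task:

```python
import random

def call_llm(prompt: str) -> str:  # placeholder for your model client
    raise NotImplementedError

def pick_best(question: str, options: list[str]) -> str | None:
    """Ask twice with independently shuffled orders. Return the winner only
    if both runs agree; otherwise flag a position-sensitive case."""
    answers = []
    for _ in range(2):
        shuffled = random.sample(options, len(options))
        prompt = (
            f"{question}\n\n"
            # Neutral bullets rather than A/B/C/D labels.
            + "\n".join(f"- {opt}" for opt in shuffled)
            + "\n\nReply with the exact text of the single best option."
        )
        answers.append(call_llm(prompt).strip())
    if answers[0] == answers[1]:
        return answers[0]
    return None  # disagreement: escalate or fall back to a rule-based decision
```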
3. Recency bias
Models pay more attention to information toward the end of a long context. In a 100-document RAG retrieval, the last documents matter more than the middle ones. In long conversations, recent turns dominate over earlier turns even when the earlier ones contained the key information ("lost in the middle" — see the Liu et al. 2023 paper in our papers list).
Mitigations:
- Put critical instructions at both ends of the prompt. System prompt at top, key reminders at the bottom. The middle is the weak zone (see the sketch after this list).
- For long contexts, use rerankers. Reorder retrieved docs by relevance, then place the most relevant at positions the model attends to most.
- Re-inject persistent constraints periodically. For long conversations, restate hard rules every N turns so they stay in recent attention.
- Truncate / summarize older content. If older context isn't getting attended to anyway, compressing it loses less than you think.
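A sketch of the placement and re-injection ideas. The layout (rules, context, restated rules, question) is one reasonable arrangement, not the only one:

```python
def build_prompt(system_rules: str, context_docs: list[str], question: str) -> str:
    # Critical instructions go at the top AND are restated at the bottom;
    # the middle of a long context is the weak zone.
    middle = "\n\n".join(context_docs)
    return (
        f"{system_rules}\n\n"
        f"Context:\n{middle}\n\n"
        f"Reminder of the rules above:\n{system_rules}\n\n"
        f"Question: {question}"
    )

def reinjection(turn_index: int, hard_rules: str, every_n: int = 5) -> str | None:
    # For long conversations: restate hard constraints every N turns so they
    # stay inside the most recent, most attended-to part of the context.
    if turn_index > 0 and turn_index % every_n == 0:
        return f"Reminder of the rules that still apply:\n{hard_rules}"
    return None
```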
4. Demographic bias
The most-studied bias category and the one with real regulatory exposure. Models encode stereotypes about gender, race, nationality, age, and other demographic attributes from their training data, then reproduce them in outputs:
- Suggesting that engineers are "he" and nurses are "she" in pronoun-ambiguous text.
- Producing different reference letters when a candidate's name signals different demographics.
- Assigning higher risk scores to certain groups for identical underlying behavior.
- Generating less-favorable salary suggestions for the same role/experience based on demographic signals.
If your application makes decisions about people — hiring, lending, content moderation, customer tier assignment — demographic bias is a hard requirement to measure and mitigate, not an afterthought.
Mitigations:
- Audit with paired-input tests. Run inputs that differ only in demographic signals. Outputs should be identical or equivalent in substance. They often aren't (see the sketch after this list).
- Strip or anonymize demographic signals before evaluation. For hiring use cases: anonymize names, addresses, educational institutions where possible.
- Use rule-based scoring for the consequential decision. Let the model extract structured features; let deterministic rules make the final call.
- Document model decisions for review. Audit trails are necessary for regulatory contexts (EU AI Act, NYC bias-audit law, sector-specific compliance).
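A paired-input audit can be as small as this sketch. It assumes a `resume_template` with a `{name}` placeholder and a `call_llm` stand-in for your model client; the scoring prompt mirrors the resume example from the top of this guide:

```python
import re

def call_llm(prompt: str) -> str:  # placeholder for your model client
    raise NotImplementedError

def score_candidate(resume: str) -> float:
    prompt = (
        "Score this candidate's fit on a scale of 1-10. "
        "Reply with a number only.\n\n" + resume
    )
    match = re.search(r"\d+(\.\d+)?", call_llm(prompt))
    return float(match.group()) if match else float("nan")

def paired_audit(resume_template: str, name_a: str, name_b: str) -> dict:
    """Score the same resume twice with only the name swapped. Equivalent
    inputs should get equivalent scores; a consistent gap is a bias signal."""
    score_a = score_candidate(resume_template.format(name=name_a))
    score_b = score_candidate(resume_template.format(name=name_b))
    return {"score_a": score_a, "score_b": score_b, "gap": abs(score_a - score_b)}
```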
The 'just tell the model not to be biased' fallacy: appending "be unbiased" to a prompt is a surface-level instruction, not a mitigation. The inherited defaults live in the model's learned statistics, so the fixes that work are structural: anonymized inputs, randomized order, decomposed decisions, and paired-input measurement.
Bonus: confirmation bias in summarization
When summarizing a contested document, models tend to emphasize information that aligns with the most common framing in their training data — even if the document itself takes a different angle. For politically or ethically charged content, summarization is rarely neutral by default.
Mitigation: ask for explicitly framed summaries. "Summarize from the author's stated perspective." "Summarize the strongest case the author makes."
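If you want that as a reusable helper, it is a one-liner; the framing string is whatever perspective you want made explicit:

```python
def framed_summary_prompt(document: str, framing: str) -> str:
    # e.g. framing = "the author's stated perspective"
    #      framing = "the strongest case the author makes"
    return f"Summarize the following document from {framing}:\n\n{document}"
```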
Bias type → mitigation, at a glance
Bias mitigation by type
| Bias type | Primary mitigation | Why |
|---|---|---|
| Sycophancy (agrees with user opinion) | Neutral phrasing + adversarial role | Removes the opinion signal; adds counter-pull |
| Position (favors first/last option) | Randomize order | Eliminates positional correlation entirely |
| Recency (favors end of long context) | Top + bottom placement, rerankers | Place critical info where attention is strongest |
| Demographic (different outputs per group) | Anonymization + paired-input testing | Structural fix; surface-level instructions don't work |
| Confirmation in summarization | Explicit framing instruction | Override the default training-data slant |
Measuring bias in your own application
Bias mitigation requires bias measurement. A lightweight approach that fits most teams:
- Identify decision points. Where does your LLM affect outcomes for users (ranking, classification, content generation)?
- Build paired-input test sets. Inputs that should produce equivalent outputs but differ in one sensitive dimension. 50-100 pairs are plenty to start (see the harness sketch after this list).
- Run regularly. On every prompt change, every model upgrade. Track the divergence rate over time.
- Tie alerts to user-facing surfaces. A 5% divergence rate in a search ranker matters more than the same rate in an internal note summarizer. Prioritize accordingly.
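A minimal harness for steps 2 and 3 might look like the sketch below. Exact string comparison is a deliberately crude equivalence check; swap in a task-specific comparison (extracted score, label, ranking) for real use:

```python
def call_llm(prompt: str) -> str:  # placeholder for your model client
    raise NotImplementedError

def divergence_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of paired inputs whose outputs differ. Run it on every prompt
    change and every model upgrade, and track the number over time."""
    if not pairs:
        return 0.0
    diverged = sum(
        call_llm(input_a).strip() != call_llm(input_b).strip()
        for input_a, input_b in pairs
    )
    return diverged / len(pairs)
```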
Going further: structural mitigation patterns
Decompose: extraction + rules
For consequential decisions, don't let the model make the decision. Let the model extract structured features (years of experience, skills listed, sentiment); let deterministic code apply the policy. Removes the surface where bias affects the outcome.
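A sketch of the decomposition, with hypothetical field names and thresholds; the point is the shape (the model extracts, the code decides), not the specific policy:

```python
import json

def call_llm(prompt: str) -> str:  # placeholder for your model client
    raise NotImplementedError

def extract_features(resume: str) -> dict:
    # The model only extracts structured facts; it never sees the decision.
    prompt = (
        "Extract these fields from the resume and reply with JSON only: "
        '{"years_experience": <number>, "required_skills_matched": <number>}\n\n'
        + resume
    )
    return json.loads(call_llm(prompt))

def passes_screen(features: dict) -> bool:
    # A deterministic policy makes the consequential call, so there is no
    # free-text surface for inherited bias to act on. Thresholds are examples.
    return (
        features["years_experience"] >= 3
        and features["required_skills_matched"] >= 4
    )
```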
Counterfactual data augmentation
For systems with feedback loops (recommendations, rankings), augment training and evaluation with counterfactual examples — the same input with demographic signals swapped. Trains the system toward equivalent outputs across groups.
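A naive sketch of generating those counterfactual examples by string substitution. Real pipelines need word-boundary handling and pronoun agreement, but the shape is the same:

```python
def counterfactual_variants(example: str, swaps: dict[str, str]) -> list[str]:
    """Return copies of an example with demographic signals swapped,
    e.g. swaps = {"James": "Jamal"}. Use the variants to augment training
    and evaluation sets so equivalent inputs appear across groups."""
    variants = []
    for original, replacement in swaps.items():
        if original in example:
            variants.append(example.replace(original, replacement))
    return variants
```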
Multiple fairness definitions
Demographic parity, equal opportunity, equalized odds — different definitions of "fair" that can be mathematically incompatible. Pick the one that matches your context (and document the choice). Don't pretend you've achieved them all.
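To make the incompatibility concrete, here are two of those definitions as small functions; the inputs are whatever your audit logs provide, and the group names are placeholders:

```python
def demographic_parity_gap(decisions: list[bool], groups: list[str],
                           group_a: str, group_b: str) -> float:
    """Absolute difference in positive-decision rate between two groups."""
    def rate(g: str) -> float:
        outcomes = [d for d, grp in zip(decisions, groups) if grp == g]
        return sum(outcomes) / max(1, len(outcomes))
    return abs(rate(group_a) - rate(group_b))

def equal_opportunity_gap(decisions: list[bool], groups: list[str],
                          qualified: list[bool],
                          group_a: str, group_b: str) -> float:
    """Absolute difference in positive-decision rate among qualified
    candidates only. Can move in the opposite direction from demographic
    parity on the same data, which is why you pick one definition and
    document the choice."""
    def rate(g: str) -> float:
        outcomes = [d for d, grp, q in zip(decisions, groups, qualified)
                    if grp == g and q]
        return sum(outcomes) / max(1, len(outcomes))
    return abs(rate(group_a) - rate(group_b))
```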
Disclosure to users
Where regulation or ethics requires it, surface to end users that the system uses an LLM and describe the safeguards. Required by EU AI Act for high-risk systems; good practice elsewhere. Trust is built by transparency, not by hiding automation.
Common mistakes
- Treating bias as a single thing. Different biases need different mitigations. Position bias is fixed by randomization; sycophancy needs neutral phrasing; demographic bias needs structural changes.
- Skipping evaluation because it's hard. Even a small paired-input test set is better than no measurement.
- Assuming new model versions fixed it. Some biases improve, some shift, some new ones appear. Re-evaluate on every model upgrade.
- Adding "be unbiased" and calling it done. Surface-level prompt instructions are not a real mitigation. Stakeholders will assume you've addressed the issue when you haven't.
- Over-relying on the model's self-report. Models trained to be helpful will report they're unbiased even when audits show otherwise. Self-report is not measurement.
Quick reference
The 60-second summary
Where bias comes from: statistical regularities in training data. Models inherit defaults; you have to override them where they affect outcomes.
Four to know: sycophancy (agrees with you), position (favors first or last), recency (favors end of context), demographic (encoded stereotypes).
What works: different mitigations per bias type, paired-input testing, structural decomposition for high-stakes decisions, ongoing measurement.
What doesn't: "be unbiased" in the prompt, model self-assessment, assuming new model versions fixed it, single-fairness-metric thinking.
What to read next
Related risks: prompt injection and hallucinations. To run bias evaluations as a workflow, A/B testing prompts covers the test-set methodology. To keep bias-mitigated prompts safely in production, see version control. For benchmarks measuring fairness and bias, see datasets.