Hallucinations: why models lie confidently and what to do about it
Hallucinations are the highest-impact failure mode in LLM applications. Learn why they happen, the four flavors that show up in production, and the techniques that measurably reduce them.
In May 2023, two New York lawyers filed a brief citing six federal court cases. Confident citations, plausible names, on-point holdings. Beautiful work. The judge could not find any of them — because none of them existed. The lawyers had asked ChatGPT for cases supporting their argument; ChatGPT had invented them, with full citations, in tone indistinguishable from real ones.
The lawyers were sanctioned. The story made every front page. And the underlying problem — hallucination — became the single most important reliability issue in production LLM applications.
A model that's slow, expensive, or stilted is fixable. A model that confidently invents facts that look real is the kind of problem that loses customer trust in one bad output. This guide covers why hallucination happens, the four flavors that show up in production, and the techniques that measurably reduce it.
The whole idea in one line
The mental model: probable text isn't true text#
LLMs are trained to produce probable next tokens, not true ones. When the most probable continuation isn't a known fact, the model produces something that sounds like a known fact — same shape, same vocabulary, fabricated content.
The model has no internal "is this true?" signal. It has an "is this plausible?" signal — a probability distribution over possible continuations. For topics it knows well, plausible and true coincide. For long-tail topics, plausible drifts away from true. The output looks the same regardless.
Three structural reasons hallucinations happen:
- The training data lied. If fabricated content (forum guesses, outdated info) was in training, the model learned it as truth.
- The training data was incomplete. For long-tail facts, the model never saw enough examples to learn correctly. It generalizes — badly — from what it did see.
- The tuning rewards confidence. Models tuned with RLHF often learn that confident answers get higher reward than hedged ones. "I'm not sure" is unrewarded behavior.
Four flavors you'll see in production#
1. Fabricated specifics#
The model invents specific details — a fake quote, a fake URL, a fake court case, a fake API method. Hardest to catch because they look authoritative. The lawyer story above is this flavor at maximum impact.
Most damaging in legal, medical, technical, and financial contexts where specifics are load-bearing.
2. Outdated facts#
The model produces information that was true at training time but is no longer current. Easy to spot if you know the topic; invisible if you don't. "The CEO of Twitter is Jack Dorsey" was true once; now it's obviously wrong; in five years it'll be non-obviously wrong because most readers won't have lived through the changes.
3. Misattribution#
The model conflates two similar things — attributes a quote to the wrong author, a feature to the wrong product, a study's findings to a different study. The pieces are all real; the combination is wrong. Often the hardest flavor to detect because every individual element checks out.
4. Rule-violation#
The model says something that contradicts an explicit rule in the prompt. "Never quote a price" → model quotes a price anyway. "Always cite a source" → model cites a source it didn't retrieve. Common in long conversations or under heavy context pressure.
Techniques that measurably reduce hallucinations#
Grounding via RAG#
The single highest-impact mitigation. Retrieve real documents, inject them into the prompt, instruct the model to answer only from them. See the RAG guide for the full pattern. The key prompt rule: "If the answer is not in the documents, say so. Do not infer."
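A minimal sketch of that rule wired into a prompt template. The document schema (an id plus a text field) is a placeholder for whatever your retrieval layer actually returns:

```python
def build_grounded_prompt(question: str, retrieved_docs: list[dict]) -> str:
    # Each doc is assumed to carry an "id" and a "text" field (placeholder schema).
    context = "\n\n".join(f"[{d['id']}] {d['text']}" for d in retrieved_docs)
    return (
        "Answer the question using ONLY the documents below.\n"
        "If the answer is not in the documents, say so. Do not infer.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the documents also sets up the cite-or-refuse pattern below: the model can cite [1], [2], and so on.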
Lower the temperature#
Higher temperatures produce more diverse — and more hallucinated — outputs. For factual tasks, drop to temperature 0 or 0.1. You sacrifice variety; you gain consistency. Most production Q&A and extraction tasks should run at low temperature by default.
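In most SDKs this is a one-parameter change. The sketch below assumes the OpenAI Python SDK; the model name is a placeholder, and the equivalent parameter exists in most chat APIs:

```python
from openai import OpenAI

client = OpenAI()

# Factual task: temperature 0 trades variety for consistency.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever your stack runs
    messages=[{"role": "user", "content": "When was our v2 API released?"}],
    temperature=0,
)
print(response.choices[0].message.content)
```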
Explicit uncertainty instructions#
Tell the model what to do when uncertain — and make that the rewarded behavior in your prompt.
```
Answer the question below.

Rules:
- If you don't know the answer with high confidence, say
  "I don't have a reliable answer for that."
- Never invent specific details (names, numbers, dates, URLs).
- For factual claims, indicate your confidence level: HIGH | MEDIUM | LOW.
- If asked for a specific source you can't cite, say so explicitly.

Question: {{user_question}}
```

Cite-or-refuse pattern#
For RAG-style applications, require every factual claim to cite a source from the retrieved context. If the model can't cite, it can't make the claim. Combine with output validation that strips any claim without a valid citation before showing to the user.
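A rough sketch of the validation side, assuming citations render as bracketed numbers like [1]; adjust the pattern to whatever format your prompt enforces:

```python
import re

CITATION = re.compile(r"\[\d+\]")  # assumes citations look like [1], [2], ...

def strip_uncited_claims(answer: str) -> tuple[str, list[str]]:
    """Keep sentences that carry a citation; return the rest for logging or review."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    kept, dropped = [], []
    for sentence in sentences:
        (kept if CITATION.search(sentence) else dropped).append(sentence)
    return " ".join(kept), dropped
```

If everything gets stripped, refuse and escalate rather than showing an empty answer.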
Self-check pass#
After generating an answer, run a second prompt that asks the model to verify the answer's claims against the retrieved context. Flag any unsupported claims. Worth the latency for high-stakes outputs. This is prompt chaining applied to grounding.
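A sketch of that second pass. `call_model` stands in for however you invoke your LLM; the prompt wording is illustrative, not prescriptive:

```python
VERIFY_PROMPT = """You are checking an answer against source documents.

Documents:
{context}

Answer to check:
{answer}

List every claim in the answer that is NOT supported by the documents.
If every claim is supported, reply with exactly: SUPPORTED"""

def self_check(answer: str, context: str, call_model) -> str | None:
    """Returns None if the answer passes, otherwise the unsupported claims."""
    verdict = call_model(VERIFY_PROMPT.format(context=context, answer=answer))
    return None if verdict.strip() == "SUPPORTED" else verdict
```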
Chain-of-Thought for reasoning hallucinations#
Reasoning errors often hide inside confident-sounding conclusions. Chain-of-Thought exposes the reasoning so errors are visible to users (and downstream validators). Pair with self-consistency for high-stakes reasoning.
Things that DON'T meaningfully help#
- "Don't hallucinate" in the prompt. The model has no concept of what hallucinating means. Tell it WHAT to do (cite sources, say don't-know), not what NOT to do.
- Bigger models alone. Larger models hallucinate less on average but still hallucinate on long-tail facts. Don't skip the system-level fixes just because you upgraded.
- Asking the model to grade itself. A model that confidently makes things up will confidently grade them as correct. Use a separate model or external validation for high-stakes checks.
- Verbose prompts that "remind" the model to be careful. Adds tokens; produces marginal effect. Tight structure beats verbose pleas.
Picking the right mitigation#
Mitigation by use case
| If your situation is… | Reach for… | Why |
|---|---|---|
| Q&A over your own docs (support, internal knowledge) | RAG + cite-or-refuse | Grounding kills the entire problem class |
| Open-domain Q&A where you can't retrieve | Explicit uncertainty + low temp | Train model to say "I don't know"; reduce sampling variance |
| Extraction from a known document | Strict prompt + structured outputs | Schema enforcement prevents fabricated fields |
| Reasoning tasks (math, logic, multi-step) | CoT + self-consistency | Expose reasoning so errors are visible |
| High-stakes (medical, legal, financial) | All of the above + human review | No automated mitigation is sufficient at these stakes |
| Creative writing where invention is the point | No mitigation | Hallucination IS the feature; just don't mix with factual claims |
Going further: production hallucination management#
Build a hallucination test set early#
Without measurement, hallucination rate is just a feeling. Build a test set of 50-100 inputs known to trigger hallucinations and a binary scoring rubric:
- Did the output contain any factually false claims?
- Were all specific details (names, numbers, quotes) verifiable?
- Did the model refuse appropriately when it lacked info?
Re-run after every prompt change, every model upgrade, every retrieval index change. See A/B testing prompts for the workflow.
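A sketch of the harness. `call_model` and `grade` are placeholders for your own LLM call and rubric check (scripted or human), and the test-case schema is whatever you choose:

```python
def hallucination_rate(test_cases, call_model, grade) -> float:
    """Binary rubric over a fixed test set.

    `grade(output, case)` returns False when the output fails the rubric:
    a false claim, an unverifiable specific, or a missing refusal.
    """
    failures = sum(
        1 for case in test_cases if not grade(call_model(case["input"]), case)
    )
    return failures / len(test_cases)
```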
Calibrated confidence in the UI#
Surface the model's uncertainty to users instead of hiding it. Show confidence levels visually (color, icon, label). Show citations inline. Let users distinguish "the model is sure" from "the model is guessing." Trust comes from making uncertainty visible, not from pretending it doesn't exist.
Ensemble agreement as a confidence signal#
Run the same query through two different models (or the same model with different prompts). If they agree, ship the answer. If they disagree, escalate to human review or refuse to answer. Catches a class of hallucinations where any single run looks confident but the consensus fails.
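A sketch of the pattern. `model_a`, `model_b`, and the `agree` check (string comparison, a similarity threshold, or a third-model judge) are all placeholders for your own calls:

```python
def answer_with_consensus(question, model_a, model_b, agree):
    """Ship an answer only when two independent runs agree."""
    a, b = model_a(question), model_b(question)
    if agree(a, b):
        return a
    return None  # disagreement: escalate to human review or refuse to answer
```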
Production monitoring#
In production, you can't hand-grade every output. Two cheap signals to monitor:
- Citation density. If your cite-or-refuse prompts suddenly produce outputs without citations, something is drifting.
- Refusal rate. A sudden drop in "I don't know" outputs often means the model is overreaching. A sudden spike means retrieval is failing.
Alert on changes; investigate before users notice.
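A sketch of both signals over a batch of recent outputs; the citation and refusal patterns are assumptions and should match whatever your prompts actually produce:

```python
import re

CITATION = re.compile(r"\[\d+\]")
REFUSAL = re.compile(r"don't have a reliable answer|don't know", re.IGNORECASE)

def monitoring_signals(outputs: list[str]) -> dict[str, float]:
    """Citation density and refusal rate over a sample of production outputs."""
    n = len(outputs) or 1
    cited = sum(1 for o in outputs if CITATION.search(o))
    refused = sum(1 for o in outputs if REFUSAL.search(o))
    return {"citation_rate": cited / n, "refusal_rate": refused / n}
```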
Calibrated honesty is a feature
Common mistakes#
- No measurement. Without a test set, hallucination rate is just a feeling. Get the test set.
- Trusting model self-assessment. Confidence scores produced by the same model making the claim are unreliable.
- Showing uncertainty inconsistently. If your UI shows confident answers next to confident hallucinations identically, users can't calibrate. Make uncertainty visible.
- Treating it as a one-time fix. Hallucinations re-emerge after every model upgrade, prompt change, or new corpus. Run the eval continuously.
- Fixing at the wrong layer. Most hallucination problems in RAG-style apps aren't prompt problems — they're retrieval problems (the right doc never arrived). Fix retrieval first.
Quick reference#
The 60-second summary
What it is: models producing confident, plausible-sounding output that isn't true.
Four flavors: fabricated specifics, outdated facts, misattribution, rule-violation.
What helps: RAG grounding, low temperature, explicit uncertainty instructions, cite-or-refuse, self-check passes, CoT for reasoning.
What doesn't: "don't hallucinate" in the prompt, bigger models alone, model self-assessment.
The non-negotiable: a real test set. Without measurement, all your mitigations are theatre.
What to read next#
For the highest-leverage mitigation, RAG. For related risks, prompt injection and biases. To lock in regression-free improvements, version control for prompts. For benchmarks specifically targeting hallucination (TruthfulQA, HaluEval), see datasets.