Agent memory: short-term, long-term, and the patterns that work
Agents need memory to be useful past a single turn. Learn the three memory types, the architectures behind them, and the patterns production agents use to retrieve the right context.
Open a bare model playground — any chat LLM called through its raw API, with no memory layer bolted on. Type "hi, my name is Sam." End the session. Start a new one. Type "what's my name?"
Without a memory layer, the model has no idea. Every API call is its own world. The model you're talking to has zero awareness that you said anything yesterday — or thirty seconds ago, in a different session. It's permanently amnesiac by default.
If your agent is going to be useful past a single response, you need to give it memory. The question is which kind, where to store it, and how to retrieve it without drowning the model. Three different memory types serve three different jobs. Production agents use all three.
The mental model: three storage tiers, three retrieval patterns
Human memory works in tiers. Working memory holds the current thought. Short-term memory keeps the conversation you're in. Long-term memory stores what you've learned about people and the world. Each tier has different capacity, different decay, different retrieval cost.
Agent memory borrows the same architecture because the same trade-offs apply. You can't fit everything in the prompt (capacity); you can't look up everything on every turn (cost); you can't treat all information equally (relevance). The three-tier split is the cleanest way to manage these trade-offs.
The three memory types
Working memory: this turn
The intermediate state during a single agent loop. Thoughts, tool calls, observations from the current run — everything that happens between the user's message and the agent's final response.
Lives in: the message array passed to the model on each step. Retrieval: already in context, no lookup needed. Lifetime: cleared at the end of the turn.
You don't typically "design" working memory — it just is. The agent loop manages it automatically. The only question is whether you trim or summarize it when it grows long within a single turn (rare but possible with tool-heavy agents).
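The loop's working memory can be sketched as a plain message list. This is a minimal sketch, not a particular framework's API — `call_model` and `run_tool` are hypothetical stand-ins for your model client and tool dispatcher:

```python
# Working memory is just the message array accumulated during one turn.
# `call_model` and `run_tool` are hypothetical stand-ins.

def run_turn(user_message, call_model, run_tool, max_steps=5):
    messages = [{"role": "user", "content": user_message}]  # working memory
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        if "tool_call" not in reply:
            # Final answer: the turn ends and this array is simply discarded.
            return reply["content"], messages
        observation = run_tool(reply["tool_call"])
        messages.append({"role": "tool", "content": observation})
    return messages[-1]["content"], messages
```

Nothing here persists: when `run_turn` returns, the array is gone unless something else (short-term memory) saves it.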
Short-term memory: this conversation
The chat history of the current session. Everything the user has said and the agent has replied since the conversation started.
Lives in: a database keyed by session ID. Retrieval: loaded on each new turn. Lifetime: the duration of the session, plus retention policy.
Has to be managed: long conversations exceed context windows. You need a strategy (truncation, summarization, sliding window — see below).
Long-term memory: across conversations
Facts about the user and previous interactions that should persist forever. The user's preferences, their past projects, what they told you their name was three weeks ago.
Lives in: a database — typically a vector store (for semantic retrieval) and/or structured records (for fixed facts like name, role, preferences). Retrieval: selectively, based on what the current message is about. Lifetime: indefinite, with user-controlled deletion.
Managing short-term memory at scale
Real conversations grow past context limits fast. Three strategies; pick based on what your users expect:
1. Truncation
Keep the last N turns; drop older ones. Simple, fast, lossy. Good when older context genuinely doesn't matter (one-off Q&A, rapid back-and-forth). Bad when users reference something they said earlier and the agent has no idea what they're talking about.
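A minimal truncation helper. The one subtlety worth keeping is that the system message, if present, should survive the cut:

```python
def truncate_history(messages, keep_last=10):
    """Keep the system message (if any) plus the last `keep_last` messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```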
2. Rolling summarization
When the conversation grows past a threshold, summarize the older portion into a few paragraphs and keep that summary plus the recent turns verbatim. Better preservation of context; slightly higher cost; some loss of fidelity in the summarized portion.
```
Summarize the conversation below for use as context in future turns.

Preserve:
- Decisions made
- Facts the user shared (preferences, names, dates, contexts)
- Open questions or pending threads
- Any commitments the assistant made

Skip:
- Filler exchanges
- Things the user later corrected

Conversation:
"""
{{older_messages}}
"""

Output as a tight 200-word summary. Use bullet points if helpful.
```

3. Sliding window with periodic summarization
Combination: keep the last N turns verbatim, plus a rolling summary of everything before. Best for long-running agents (support bots, coding assistants) where both recent context and conversation-level history matter. Most production-grade pattern of the three.
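A sketch of the combined pattern, assuming a `summarize(summary, older_messages)` helper that wraps an LLM call (for example, using the rolling-summary prompt from the previous section):

```python
def compact_history(messages, summary, summarize, window=8, trigger=20):
    """Sliding window: once history exceeds `trigger` messages, fold the
    older portion into `summary` and keep the last `window` verbatim.
    `summarize(summary, older_messages)` is a hypothetical LLM call."""
    if len(messages) <= trigger:
        return messages, summary
    older, recent = messages[:-window], messages[-window:]
    return recent, summarize(summary, older)

def build_prompt(summary, recent):
    """Prepend the running summary (if any) as a system message."""
    context = []
    if summary:
        context.append({"role": "system",
                        "content": f"Conversation so far: {summary}"})
    return context + recent
```

The summary compounds over time: each `summarize` call receives the previous summary plus the newly evicted messages, so conversation-level history survives indefinitely while recent turns stay verbatim.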
What to put in long-term memory
Long-term memory should contain durable facts, not conversation history. Two structures work well together:
- Structured profile. Key facts as fields: name, role, communication preferences, industry, project context. Loaded on every turn.
- Vector store of episodic memories. One entry per "memorable moment" from past conversations: "User mentioned they use Postgres", "User said they prefer bullet-point summaries", etc. Retrieved on-demand based on the current query's semantic similarity.
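The two structures side by side, as a minimal sketch (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class UserMemory:
    # Structured profile: fixed key facts, injected on every turn.
    profile: dict = field(default_factory=dict)
    # Episodic memories: one entry per memorable moment, retrieved on demand.
    episodes: list = field(default_factory=list)

    def remember_fact(self, key, value):
        self.profile[key] = value  # overwrite: the latest fact wins

    def remember_episode(self, text, embedding):
        self.episodes.append({"text": text, "embedding": embedding})
```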
Don't store every message in long-term memory: indiscriminate storage buries the facts that matter and degrades retrieval quality.
Extracting facts: the "what's worth remembering" prompt
A simple pattern that works: at the end of each conversation, run an extractor prompt that pulls out long-lasting facts.
```
Review the conversation below. Extract facts that should be
remembered for future conversations with this user.

Categories to extract:
- User identity: name, role, company, location (if mentioned)
- Preferences: communication style, format preferences, tools they use
- Context: projects they're working on, problems they're solving
- Decisions: anything the user committed to or established as a constraint

Output as a JSON array. Each fact has:
- "category": one of "identity", "preference", "context", "decision"
- "fact": the fact itself, in plain English
- "confidence": 1-5

Only include facts that are likely to matter past this conversation.
Skip filler, transient context, and things the user might change later.

Conversation:
"""
{{full_conversation}}
"""

Output:
```

Retrieving long-term memory selectively
You can't inject every long-term memory into every turn — you'd blow the context window. Two retrieval strategies, used together:
- Always-on profile. Inject the structured profile (name, role, top preferences) on every turn. Small, high-value, no retrieval needed.
- Query-conditional retrieval. For episodic memories, embed the current message and pull the top 3-5 most relevant past memories. Same pattern as RAG, just over your own memory store.
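A pure-Python sketch of the query-conditional half. In production an embedding model produces the vectors and a vector store does the ranking; the toy vectors here just make the mechanics concrete:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_memories(query_embedding, episodes, top_k=3):
    """Rank stored episodic memories by similarity to the current message.
    `episodes` is a list of {"text", "embedding"} dicts."""
    scored = sorted(episodes,
                    key=lambda e: cosine(query_embedding, e["embedding"]),
                    reverse=True)
    return [e["text"] for e in scored[:top_k]]
```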
Picking the right memory strategy
Memory strategy by use case
| If your situation is… | Reach for… | Why |
|---|---|---|
| One-off task agent (research, code generation, single doc Q&A) | Working only | Each task is its own session; no carryover needed |
| Multi-turn chat within a session | Working + short-term truncation | Simplest pattern that handles in-session continuity |
| Long-running chat (support bot, coding assistant) | Working + sliding window + summary | Preserves recent detail and conversation-level context |
| Personal assistant whose returning users expect to be remembered | All three tiers + extraction | Long-term memory is the whole product |
| Stateless tool (translate, classify, summarize) | No memory | No reason to remember anything |
| Multi-tenant app where users must never see each other's data | Strict per-user isolation + audit | Memory leaks across tenants are a security issue |
Going further: production memory patterns
Memory consolidation
Periodically review long-term memory and merge near-duplicates, archive stale entries, and promote frequently-retrieved entries to the always-on profile. Same maintenance discipline you'd apply to any growing data store. A weekly cron is enough for most agents.
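A sketch of the dedup pass, with `similar(a, b)` standing in for an embedding-similarity call; the first-stored copy of a fact wins:

```python
def consolidate(episodes, similar, threshold=0.9):
    """Merge near-duplicate episodic memories. `similar(a, b)` returns a
    0-1 similarity score (embedding cosine in practice)."""
    kept = []
    for episode in episodes:
        if not any(similar(episode["text"], k["text"]) >= threshold
                   for k in kept):
            kept.append(episode)
    return kept
```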
Time-based decay
Some facts age. "User is working on project X" might be true today and irrelevant in three months. Tag entries with timestamps; reduce retrieval weight for older entries; auto-archive after a configurable window. Prevents the long-tail of stale facts from polluting context.
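One common weighting is exponential decay on the retrieval score; the 90-day half-life below is an arbitrary illustration, not a recommendation:

```python
import math, time

def decayed_score(similarity, stored_at, now=None, half_life_days=90.0):
    """Downweight older memories: the score halves every `half_life_days`.
    `stored_at` and `now` are Unix timestamps in seconds."""
    now = time.time() if now is None else now
    age_days = max(0.0, (now - stored_at) / 86400)
    return similarity * 0.5 ** (age_days / half_life_days)
```

Rank retrieval candidates by `decayed_score` instead of raw similarity, and auto-archive anything whose weight falls below a floor.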
Privacy controls and user-facing memory
Long-term memory holds personal data. Provide a way for users to view, edit, and delete what the agent remembers about them. Required by GDPR and CCPA where applicable; good practice everywhere. Surface the "memory" concept in your UI — users trust agents that show their receipts.
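The user-facing surface can be tiny — a list endpoint and a hard-delete. Sketched here over an in-memory dict keyed by user ID; a real store would be your database:

```python
def list_memories(store, user_id):
    """Let users see exactly what the agent remembers about them."""
    return [{"id": i, "text": e["text"]}
            for i, e in enumerate(store.get(user_id, []))]

def delete_memory(store, user_id, memory_id):
    """Hard-delete a single memory on user request (GDPR/CCPA erasure)."""
    episodes = store.get(user_id, [])
    store[user_id] = [e for i, e in enumerate(episodes) if i != memory_id]
```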
Hybrid memory stores
Different memory types want different stores:
- Structured profile → Postgres or DynamoDB
- Episodic memories → vector store (pgvector, Pinecone, Qdrant)
- Recent conversation → Redis or in-app session state
Don't shoehorn everything into one store because "it's easier." Each store's strengths match a different memory type's access pattern.
Evaluating memory quality
Build a test set of multi-session conversations where the agent should remember something from an earlier session. Score whether retrieval surfaced the right memory, whether the agent used it, whether the user perceived continuity. Most teams skip memory eval and ship agents that feel forgetful in subtle ways. See A/B testing prompts for the workflow.
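A minimal harness for the recall check: each case pairs a session-two question with a fact planted in session one, and `agent_reply` is your agent under test (a hypothetical callable, not a real API):

```python
def score_memory_recall(test_cases, agent_reply):
    """Fraction of cases where the agent's reply mentions the planted fact.
    Each case: {"question": ..., "expected_fact": ...}."""
    hits = 0
    for case in test_cases:
        reply = agent_reply(case["question"])
        if case["expected_fact"].lower() in reply.lower():
            hits += 1
    return hits / len(test_cases)
```

Substring matching is crude; an LLM judge or semantic match catches paraphrased recall, but even this version separates "forgetful" agents from working ones.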
When you don't need long-term memory
Long-term memory adds complexity. You probably don't need it for:
- Single-task agents (research, code generation, document Q&A). Each task is its own session.
- Stateless tools (translation, summarization, classification). No reason to remember anything.
- Early-stage prototypes. Get short-term right first; long-term is rarely the bottleneck in v1.
Common mistakes
- Storing too much. Every message in long-term memory poisons retrieval. Be selective.
- Storing too little. No memory at all means re-asking the user their name every time. Start with a structured profile even if you skip episodic memory.
- Treating short-term and long-term as the same thing. Different storage, different retrieval, different update cadence. Don't conflate them.
- Not letting users edit / delete their memory. Long-term memory holds personal data. Privacy controls are not optional.
- Skipping memory eval. The most common "feels broken" complaint about chatty agents traces to bad memory retrieval. Measure it.
Quick reference
The 60-second summary
Three tiers: working (current turn), short-term (this conversation), long-term (across conversations).
Short-term strategies: truncation (simple), rolling summarization (better fidelity), sliding window with summary (production-grade).
Long-term storage: structured profile (always loaded) + vector store of episodic memories (retrieved selectively).
The discipline: store deliberately, retrieve selectively, give users edit/delete controls, evaluate quality with multi-session test sets.
What to read next
For the other half of agents, Agent tools. For retrieval mechanics that long-term memory borrows from, RAG. For production-grade agent prompts that need versioning, version control.