01. What Short-Term Memory Is
A single language model call is stateless. It sees only the messages you hand it at that moment, so without memory every request starts from scratch. That is why short-term memory matters. It reuses recent context from the same conversation, feeding the model a short window of previous turns, which makes the call stateful within a single thread. You keep that thread alive with a checkpointer that saves the conversation state. The trade-off is that the context window is limited, so you must sometimes drop old messages to stay under the budget.
Long-term memory works differently. It stores information across many separate conversations or sessions, using external storage with custom namespaces rather than thread-level persistence. An agent can recall that data at any time, in any thread. So short-term memory helps a single session feel continuous, while long-term memory helps the system remember you from one chat to the next. Both are useful, yet they solve different problems: one keeps the current flow going, and the other builds a lasting profile that outlives any single conversation.
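The difference in scope can be illustrated with a small library-independent sketch. Everything here (`short_term`, `long_term`, `remember_turn`, `recent_context`, `put_fact`, `get_fact`) is a hypothetical stand-in rather than part of any real API; it only shows that one store is keyed by thread while the other is keyed by namespace.

```python
from collections import defaultdict

# Short-term memory: one message list per thread (thread-level persistence).
short_term = defaultdict(list)

# Long-term memory: records keyed by (namespace, key), independent of any thread.
long_term = {}


def remember_turn(thread_id, role, content):
    """Append one turn to the history of a single thread."""
    short_term[thread_id].append({"role": role, "content": content})


def recent_context(thread_id, window=4):
    """Return the last `window` turns of this thread only."""
    return short_term[thread_id][-window:]


def put_fact(namespace, key, value):
    """Store a record under a namespace tuple, outside any thread."""
    long_term[(namespace, key)] = value


def get_fact(namespace, key):
    """Look up a record by namespace and key, from any thread."""
    return long_term.get((namespace, key))


remember_turn("thread-1", "user", "hi! I'm bob")
put_fact(("users", "bob"), "profile", {"name": "Bob"})

same_thread = recent_context("thread-1")   # sees the earlier turn
other_thread = recent_context("thread-2")  # a new thread starts empty
profile_anywhere = get_fact(("users", "bob"), "profile")  # readable from any thread
```

A new thread has no short-term context, yet the namespaced record is still reachable, which is the distinction the paragraph draws.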
<!-- mem:begin -->Generate it: Short-term memory reuses recent context from the same conversation, which makes the call s_______ within a single thread. (cue: s_______; answer: stateful)
Generate it: Long-term memory stores information across many separate conversations using external storage with custom n_________ rather than thread-level persistence. (cue: n_________; answer: namespaces)
Ask yourself: A single model call is stateless, yet short-term memory makes a thread feel stateful — what does it actually feed the model to create that illusion of continuity?
<!-- mem:end -->Recall check (try before reading the answer):
If short-term memory only reuses recent context, why must you sometimes drop old messages? Answer: The context window is limited, so you must sometimes drop old messages to stay under the budget.
What lets an agent recall stored data at any time, in any thread — and how is that different from short-term memory? Answer: Long-term memory stores information across many separate conversations using external storage with custom namespaces rather than thread-level persistence, so an agent can recall that data at any time, in any thread.
In one line, what distinct problem does each kind of memory solve? Answer: Short-term memory helps a single session feel continuous, while long-term memory helps the system remember you from one chat to the next.
Short-term memory uses a checkpointer and message-removal middleware to keep a stateful conversation within a thread.
from langchain.messages import RemoveMessage
from langchain.agents import create_agent, AgentState
from langchain.agents.middleware import after_model
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.runtime import Runtime


@after_model
def delete_old_messages(state: AgentState, runtime: Runtime) -> dict | None:
    """Remove old messages to keep conversation manageable."""
    messages = state["messages"]
    if len(messages) > 2:
        return {"messages": [RemoveMessage(id=m.id) for m in messages[:2]]}
    return None


agent = create_agent(
    "gpt-5-nano",
    tools=[...],
    system_prompt="Please be concise and to the point.",
    middleware=[delete_old_messages],
    checkpointer=InMemorySaver(),
)

config = {"configurable": {"thread_id": "1"}}

for event in agent.stream(
    {"messages": [{"role": "user", "content": "hi! I'm bob"}]},
    config,
    stream_mode="values",
):
    print(event["messages"])
Think of short-term memory like a small whiteboard you carry into a conversation. You can write the last few things said on it, so the next person can see them. But if the conversation goes on too long, you must erase old notes to make room for new ones—otherwise the board overflows and becomes useless.
Concretely, short-term memory makes a single language model call stateful within a conversation thread. Instead of starting from scratch each time, the agent reuses recent context from previous turns. The mechanism uses a checkpointer (like InMemorySaver) that saves the conversation’s state under a thread_id. When the agent is invoked again with the same thread ID, it reads that saved state automatically. Because the model’s context window is limited, you can trim old messages using middleware like @before_model to keep only the last few turns plus the first system message—removing everything else with RemoveMessage.
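The trimming rule described above (keep the first message plus the last few turns) can be sketched without any library. This is a minimal illustration using plain dictionaries; `trim_history` and `keep_last` are hypothetical names, and real middleware would express the same decision through RemoveMessage rather than by returning a new list.

```python
def trim_history(messages, keep_last=3):
    """Keep the first message plus the most recent `keep_last` messages."""
    if len(messages) <= keep_last + 1:
        # Already short enough: return a copy unchanged.
        return list(messages)
    return [messages[0]] + list(messages[-keep_last:])


history = [
    {"role": "system", "content": "Please be concise and to the point."},
    {"role": "user", "content": "hi! I'm bob"},
    {"role": "assistant", "content": "Hello Bob!"},
    {"role": "user", "content": "write a poem about cats"},
    {"role": "assistant", "content": "Soft paws at dawn..."},
    {"role": "user", "content": "what's my name?"},
]

trimmed = trim_history(history, keep_last=3)
kept_contents = [m["content"] for m in trimmed]
```

Note that this particular window drops the turn where the user introduced himself, so the model could no longer answer the final question. That is the cost of trimming: whatever falls outside the window is gone unless it is summarized or stored elsewhere.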
Without short-term memory, every request would be completely isolated. You would have to re‑introduce yourself, repeat instructions, and re‑state every detail each time. The model would forget your name, the topic, and any earlier decisions. That’s the failure a beginner would feel: frustration from having to start over again and again, as if the other person had amnesia after every sentence.
In the short-term memory subsystem, each graph step begins by reading the agent’s state. When the graph is invoked with a thread_id, the checkpointer (e.g., InMemorySaver) loads the persisted AgentState, which contains the message list, and the model receives those messages. The state is written back to the checkpointer after the model call and after each tool execution. If a step fails, for instance because the message list exceeds the LLM’s context window, no model output is produced and nothing from that step is written, so the thread stays at its last successfully completed step. Middleware decorated with @before_model runs before the model is called (e.g., to apply trim_messages), and @after_model middleware processes the model’s output. If trimming or summarization middleware raises, the graph run fails with that exception and the step’s changes are not saved.
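The read, call, write ordering can be made concrete with a toy loop. This is a sketch under stated assumptions, not LangGraph's internals: `DictCheckpointer`, `fake_model`, `run_step`, and `ContextLengthError` are all hypothetical, and the "context window" is simulated as a maximum message count.

```python
import copy


class ContextLengthError(Exception):
    """Stand-in for a provider's context-length error (hypothetical)."""


class DictCheckpointer:
    """Toy checkpointer: maps thread_id to a saved message list."""

    def __init__(self):
        self._store = {}

    def load(self, thread_id):
        return copy.deepcopy(self._store.get(thread_id, []))

    def save(self, thread_id, messages):
        self._store[thread_id] = copy.deepcopy(messages)


def fake_model(messages, max_messages=4):
    """Pretend model that rejects inputs longer than its 'context window'."""
    if len(messages) > max_messages:
        raise ContextLengthError("maximum context length exceeded")
    return {"role": "assistant", "content": f"seen {len(messages)} messages"}


def run_step(checkpointer, thread_id, user_text, before_model=None):
    messages = checkpointer.load(thread_id)      # 1. read state for this thread
    messages.append({"role": "user", "content": user_text})
    if before_model is not None:                 # 2. optional pre-model hook
        messages = before_model(messages)
    reply = fake_model(messages)                 # 3. may raise
    messages.append(reply)
    checkpointer.save(thread_id, messages)       # 4. write back only on success
    return reply


checkpointer = DictCheckpointer()
run_step(checkpointer, "1", "hi! I'm bob")
run_step(checkpointer, "1", "what's my name?")
try:
    run_step(checkpointer, "1", "one more question")
    third_call_failed = False
except ContextLengthError:
    third_call_failed = True

saved_count = len(checkpointer.load("1"))  # unchanged by the failed step
```

Passing a hook such as `before_model=lambda m: m[-3:]` lets the third call succeed, which mirrors the role trimming middleware plays in the description above.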
The design preserves thread resumability: once a state is persisted via the checkpointer, any later invocation with the same thread_id resumes from that saved state, or as the source puts it, “the thread can be resumed at any time.” Durability depends on the checkpointer’s backing store: InMemorySaver keeps state only for the lifetime of the process, so surviving a crash or restart requires a checkpointer backed by a database. Because state is read at the start of each step and written after each step, the agent sees the latest short-term memory for that conversation. In a correctly functioning graph, messages should be neither duplicated nor lost, since each step’s update is applied as a unit.
The key trade‑off is limiting the message list to stay within the LLM’s context window rather than keeping every message. The obvious alternative—keeping all messages—is explicitly rejected because “most LLMs still perform poorly over long contexts; they get ‘distracted’ by stale or off-topic content, all while suffering from slower response times and higher costs.” By rejecting the keep‑all approach, the system avoids the costs of degraded model performance and excessive token usage. Instead, the subsystem adopts either trimming (via trim_messages using RemoveMessage and REMOVE_ALL_MESSAGES) or summarizing (via SummarizationMiddleware) to condense the history.
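The summarizing alternative can be sketched as follows. This is an illustration only and does not reproduce SummarizationMiddleware: `approx_tokens`, `summarize_if_needed`, and the `summarize` callback are hypothetical, and the four-characters-per-token estimate is an assumption made for the example.

```python
def approx_tokens(messages):
    """Crude estimate: roughly one token per four characters (an assumption)."""
    return sum(len(m["content"]) for m in messages) // 4


def summarize_if_needed(messages, trigger_tokens, keep_last, summarize):
    """Replace older messages with one summary once the estimate hits the trigger."""
    if approx_tokens(messages) < trigger_tokens or len(messages) <= keep_last:
        return list(messages)
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarize(older)}
    return [summary] + list(recent)


def fake_summarize(older):
    # Placeholder for a call to a summarization model.
    return f"Summary of {len(older)} earlier messages"


long_history = [{"role": "user", "content": "x" * 40} for _ in range(10)]

condensed = summarize_if_needed(long_history, trigger_tokens=50,
                                keep_last=2, summarize=fake_summarize)
untouched = summarize_if_needed(long_history, trigger_tokens=1000,
                                keep_last=2, summarize=fake_summarize)
```

Compared with trimming, this keeps a compressed trace of older turns at the price of an extra model call whenever the threshold is crossed.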
A concrete failure mode occurs when the message list grows beyond the model’s token limit and no trimming middleware is active. The LLM call then raises a context-length error (e.g., “maximum context length exceeded”). The operator sees this as an API error from the LLM provider in the runtime logs, which indicates that the input exceeded the model’s capacity. The failed step’s output is never written back, so the thread stays at its last saved state. This follows from the source’s warning that “a full history may not fit inside an LLM’s context window, resulting in a context loss or errors.”
Summarization Token Threshold Not Reached
- Trigger: Messages accumulate, but the token count measured by SummarizationMiddleware stays below the configured trigger value of 4000, so no summarization occurs. If the message list nevertheless exceeds the context window of the gpt-5.5 model (for example, because the measured count underestimates the real one), the model call fails with a context-length error.
- Guard: None shown. SummarizationMiddleware has no fallback or proactive trimming when the token count is below the threshold.
- Posture: fail-hard. The model call aborts with an exception, stopping the graph run.
- Operator signal: An error from the model invocation, e.g. “maximum context length exceeded” or a similar token-limit error from the LLM provider.
- Recovery: No automatic retry. The operator must reduce message history manually or lower the trigger threshold to force earlier summarization.
InMemorySaver State Loss on Process Restart
- Trigger: The application process crashes or is restarted, destroying the in-memory dictionary that backs InMemorySaver().
- Guard: None shown. The source uses InMemorySaver with no disk or database persistence.
- Posture: fail-hard. All thread state is lost; subsequent invocations start with an empty state.
- Operator signal: Silent absence of previous context. The agent responds as if the user is new, e.g. not remembering the user’s name from earlier calls.
- Recovery: No automatic recovery. The operator must re-enter the conversation context manually (e.g. repeat previous messages).
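To show why a durable backing store avoids this failure, here is a toy thread store built on the standard library's sqlite3 and json modules. `SqliteThreadStore` is a hypothetical class written for this illustration; it is not a LangGraph checkpointer and implements none of that interface.

```python
import json
import os
import sqlite3
import tempfile


class SqliteThreadStore:
    """Toy durable store: one JSON-encoded message list per thread_id."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS threads ("
            "thread_id TEXT PRIMARY KEY, messages TEXT NOT NULL)"
        )
        self._conn.commit()

    def save(self, thread_id, messages):
        # Overwrite the saved history for this thread.
        self._conn.execute(
            "INSERT OR REPLACE INTO threads (thread_id, messages) VALUES (?, ?)",
            (thread_id, json.dumps(messages)),
        )
        self._conn.commit()

    def load(self, thread_id):
        row = self._conn.execute(
            "SELECT messages FROM threads WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def close(self):
        self._conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "threads.db")

store = SqliteThreadStore(db_path)
store.save("1", [{"role": "user", "content": "hi! I'm bob"}])
store.close()  # simulate the process exiting

reopened = SqliteThreadStore(db_path)  # simulate a restart
restored = reopened.load("1")
missing = reopened.load("does-not-exist")
reopened.close()
```

Because the history lives in a file rather than in process memory, the reopened store still returns the earlier turn.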
after_model Middleware Removes the Last Message
- Trigger: The validate_response function detects a STOP_WORD (such as "password" or "secret") in the AI message content, and returns {"messages": [RemoveMessage(id=last_message.id)]}. If that message is the only message in state["messages"], the state becomes empty.
- Guard: The validate_response function itself is the guard, but it contains no check that the removal would leave the message list empty.
- Posture: fail-soft. The graph continues executing but now has an empty messages list, leading to confusion or failure in subsequent model calls.
- Operator signal: The agent may output a nonsensical reply or throw an error when it tries to process an empty state["messages"].
- Recovery: No built-in retry or fallback. The operator must replay the interaction or add a guard that prevents removal when only one message remains.
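The missing guard could look something like the following sketch, written over plain dictionaries instead of real message objects. `filter_last_message` is a hypothetical helper, and redacting the content when only one message remains is an assumed policy chosen for the example, not something the source prescribes.

```python
STOP_WORDS = ("password", "secret")


def filter_last_message(messages):
    """Drop a flagged final message, but never leave the list empty."""
    if not messages:
        return []
    *earlier, last = messages
    flagged = any(word in last["content"].lower() for word in STOP_WORDS)
    if not flagged:
        return list(messages)
    if earlier:
        # Safe to remove: at least one message remains afterwards.
        return earlier
    # Only one message exists: redact it instead of emptying the list.
    return [{**last, "content": "[removed]"}]


two_messages = [
    {"id": "1", "role": "user", "content": "hi"},
    {"id": "2", "role": "assistant", "content": "The password is 1234"},
]
only_message = [{"id": "2", "role": "assistant", "content": "my secret plan"}]

after_removal = filter_last_message(two_messages)
after_redaction = filter_last_message(only_message)
```

Either branch leaves at least one message in place, so a later model call never receives an empty list.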
Summarization Model Call Failure
- Trigger: The SummarizationMiddleware attempts to invoke gpt-5.4-mini to create a summary when the token threshold (tokens >= 4000) is reached, but that model call fails (network error, rate limit, or model outage).
- Guard: None shown. There is no try/except or retry logic surrounding the summarization model invocation in the source code excerpt.
- Posture: fail-hard. The exception propagates and aborts the entire graph run.
- Operator signal: An error trace from the summarization model, e.g. RateLimitError, ConnectionError, or a timeout.
- Recovery: No automatic retry. The operator must resubmit the invocation, and the middleware will attempt summarization again on the next trigger.
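One way an operator might wrap a flaky summarization call is a small retry helper with exponential backoff. This is a generic sketch: `call_with_retry` and `flaky_summarize` are hypothetical, only built-in exception types are used, and nothing here is part of SummarizationMiddleware.

```python
import time


def call_with_retry(fn, attempts=3, base_delay=0.5,
                    retry_on=(ConnectionError, TimeoutError),
                    sleep=time.sleep):
    """Call fn(); on a retryable error, wait and try again up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the error propagate
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1.0s, 2.0s, ...


calls = {"count": 0}


def flaky_summarize():
    # Fails twice with a network error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "summary of earlier messages"


delays = []  # record requested waits instead of actually sleeping
result = call_with_retry(flaky_summarize, sleep=delays.append)
```

Injecting `sleep` keeps the example fast to run; in real use the default `time.sleep` would apply the waits.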
Thread ID Collision Across Different Users
- Trigger: Two distinct conversations (different users or sessions) use the same thread_id value (e.g. both set configurable: {"thread_id": "1"}). The InMemorySaver stores state under that single thread ID, mixing messages from both users.
- Guard: None shown. The source provides no validation or uniqueness enforcement for thread_id.
- Posture: fail-soft. The graph continues to run, but the state contains interleaved messages from different conversations.
- Operator signal: The agent refers to information from the wrong user, e.g. calling user “Bob” when the current user is someone else.
- Recovery: No automatic recovery. The operator must assign a unique thread_id to each session, since the checkpointer treats the thread ID purely as a storage key.
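A simple way to avoid collisions is to generate the thread ID rather than hard-code it. The sketch below builds a config in the same shape as the earlier example; `new_thread_config` and the `user_id:uuid` format are hypothetical conventions chosen for illustration.

```python
import uuid


def new_thread_config(user_id):
    """Build a config whose thread_id is unique for each new conversation."""
    thread_id = f"{user_id}:{uuid.uuid4()}"
    return {"configurable": {"thread_id": thread_id}}


first = new_thread_config("bob")
second = new_thread_config("bob")
other_user = new_thread_config("alice")
```

To resume a conversation later, the application has to store the generated thread_id and pass the same value back in.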