Short-Term Memory

7 chapters · read at your own pace

01. What Short-Term Memory Is

A single language model call is stateless. It sees only the messages you hand it at that moment, so without memory every request starts from scratch. That is why short-term memory matters. It reuses recent context from the same conversation, feeding the model a short window of previous turns, which makes the call stateful within a single thread. You keep that thread alive with a checkpointer that saves the conversation state. The trade is that the context window is limited, so you must sometimes drop old messages to stay under the budget. Long-term memory works differently. It stores information across many separate conversations or sessions, using external storage with custom namespaces rather than thread-level persistence. An agent can recall that data at any time, in any thread. So short-term memory helps a single session feel continuous, while long-term memory helps the system remember you from one chat to the next. Both are useful, yet they solve different problems. One keeps the current flow going. The other builds a lasting profile that outlives any single conversation.
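The difference in scope can be sketched with two plain dictionaries. This is a toy illustration under stated assumptions, not the LangGraph API: `short_term`, `long_term`, `remember_turn`, and the namespace tuple are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Short-term memory: one message list per thread_id (hypothetical structure).
short_term: dict[str, list[dict]] = defaultdict(list)
# Long-term memory: facts keyed by a namespace, readable from any thread.
long_term: dict[tuple[str, ...], dict[str, str]] = defaultdict(dict)


def remember_turn(thread_id: str, role: str, content: str) -> None:
    """Append one message to the history of a single thread."""
    short_term[thread_id].append({"role": role, "content": content})


remember_turn("thread-1", "user", "hi! I'm bob")
long_term[("users", "bob")]["name"] = "Bob"

# A new thread starts with an empty history, but the namespaced fact persists.
new_thread_history = short_term["thread-2"]
recalled_name = long_term[("users", "bob")]["name"]
```

The thread-keyed list is what keeps one session continuous; the namespace-keyed store is what carries a fact from one chat to the next.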

Generate it: Short-term memory reuses recent context from the same conversation, which makes the call s_______ within a single thread. (cue: s_______; answer: stateful)

Generate it: Long-term memory stores information across many separate conversations using external storage with custom n_________ rather than thread-level persistence. (cue: n_________; answer: namespaces)

Ask yourself: A single model call is stateless, yet short-term memory makes a thread feel stateful — what does it actually feed the model to create that illusion of continuity?

Recall check (try before reading the answer):

If short-term memory only reuses recent context, why must you sometimes drop old messages? Answer: The context window is limited, so you must sometimes drop old messages to stay under the budget.

What lets an agent recall stored data at any time, in any thread — and how is that different from short-term memory? Answer: Long-term memory stores information across many separate conversations using external storage with custom namespaces rather than thread-level persistence, so an agent can recall that data at any time, in any thread.

In one line, what distinct problem does each kind of memory solve? Answer: Short-term memory helps a single session feel continuous, while long-term memory helps the system remember you from one chat to the next.

Short-term memory uses a checkpointer and message-removal middleware to keep a stateful conversation within a thread.

python

from langchain.messages import RemoveMessage
from langchain.agents import create_agent, AgentState
from langchain.agents.middleware import after_model
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.runtime import Runtime

@after_model
def delete_old_messages(state: AgentState, runtime: Runtime) -> dict | None:
    """Remove old messages to keep conversation manageable."""
    messages = state["messages"]
    if len(messages) > 2:
        return {"messages": [RemoveMessage(id=m.id) for m in messages[:2]]}
    return None

agent = create_agent(
    "gpt-5-nano",
    tools=[...],
    system_prompt="Please be concise and to the point.",
    middleware=[delete_old_messages],
    checkpointer=InMemorySaver(),
)

config = {"configurable": {"thread_id": "1"}}
for event in agent.stream(
    {"messages": [{"role": "user", "content": "hi! I'm bob"}]},
    config, stream_mode="values",
):
    print(event["messages"])

ELI5 — the plain-language version

Think of short-term memory like a small whiteboard you carry into a conversation. You can write the last few things said on it, so the next person can see them. But if the conversation goes on too long, you must erase old notes to make room for new ones—otherwise the board overflows and becomes useless.

Concretely, short-term memory makes a single language model call stateful within a conversation thread. Instead of starting from scratch each time, the agent reuses recent context from previous turns. The mechanism uses a checkpointer (like InMemorySaver) that saves the conversation’s state under a thread_id. When the agent is invoked again with the same thread ID, it reads that saved state automatically. Because the model’s context window is limited, you can trim old messages with @before_model middleware that keeps only the first message plus the last few turns and removes everything else with RemoveMessage.

Without short-term memory, every request would be completely isolated. You would have to re‑introduce yourself, repeat instructions, and re‑state every detail each time. The model would forget your name, the topic, and any earlier decisions. That’s the failure a beginner would feel: frustration from having to start over again and again, as if the other person had amnesia after every sentence.

System design — mechanism, invariant, trade-off

In the short-term memory subsystem, the ordered mechanism begins with the agent’s state being read at the start of each graph step. When the graph is invoked with a thread_id, the checkpointer (e.g., InMemorySaver) loads the persisted AgentState containing the message list. The model then receives these messages. After the model call or after each tool execution, the state is written back to the checkpointer. On failure—for instance, if the message list exceeds the LLM’s context window—no model output is produced; the checkpointer does not update until a successful step completes, leaving the thread in its prior state. Middleware such as @before_model can intercept before the model runs (e.g., trim_messages), and @after_model middleware can process the output. If the trimming or summarization middleware fails, the graph raises an exception and the state remains unchanged.
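The load, call, and write-back cycle can be sketched in plain Python. `ToyCheckpointer`, `run_step`, and the two stand-in models are hypothetical; the sketch mirrors the description above rather than LangGraph's actual checkpointing granularity.

```python
import copy


class ToyCheckpointer:
    """Hypothetical stand-in for a checkpointer: state keyed by thread_id."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = {}

    def load(self, thread_id: str) -> list[dict]:
        return copy.deepcopy(self._store.get(thread_id, []))

    def save(self, thread_id: str, messages: list[dict]) -> None:
        self._store[thread_id] = copy.deepcopy(messages)


def run_step(cp: ToyCheckpointer, thread_id: str, user_text: str, model) -> str:
    """Load state, call the model, and save only if the call succeeds."""
    messages = cp.load(thread_id)
    messages.append({"role": "user", "content": user_text})
    reply = model(messages)  # may raise; in that case nothing is saved
    messages.append({"role": "assistant", "content": reply})
    cp.save(thread_id, messages)
    return reply


def echo_model(messages: list[dict]) -> str:
    return f"seen {len(messages)} message(s)"


def failing_model(messages: list[dict]) -> str:
    raise RuntimeError("maximum context length exceeded")


cp = ToyCheckpointer()
run_step(cp, "1", "hi! I'm bob", echo_model)
try:
    run_step(cp, "1", "what's my name?", failing_model)
except RuntimeError:
    pass  # the failed step leaves the saved thread untouched
state_after_failure = cp.load("1")
```

After the failed second step, the thread still holds only the two messages from the first successful step.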

The design preserves the invariant of thread resumability: once a state is persisted via the checkpointer, any subsequent invocation with the same thread_id resumes from that saved state, exactly as the source states: “the thread can be resumed at any time.” The checkpointer keeps the state available across invocations within the thread scope; surviving a process crash additionally requires a database-backed checkpointer, because InMemorySaver holds state only in process memory. Reads at the start of each step and writes after each step mean that the agent always sees the latest short-term memory from that conversation. No duplicate or lost messages occur within a correctly functioning graph, because the state is atomically updated.

The key trade‑off is limiting the message list to stay within the LLM’s context window rather than keeping every message. The obvious alternative—keeping all messages—is explicitly rejected because “most LLMs still perform poorly over long contexts; they get ‘distracted’ by stale or off-topic content, all while suffering from slower response times and higher costs.” By rejecting the keep‑all approach, the system avoids the costs of degraded model performance and excessive token usage. Instead, the subsystem adopts either trimming (via trim_messages using RemoveMessage and REMOVE_ALL_MESSAGES) or summarizing (via SummarizationMiddleware) to condense the history.

A concrete failure mode occurs when the message list grows beyond the model’s supported token limit and no trimming middleware is active. In that case, the LLM call raises a context‑length error (e.g., “maximum context length exceeded”). The operator would see this error in the runtime logs, clearly indicating that the input exceeded the model’s capacity. The signal is a direct API error from the LLM provider, and the thread remains unaltered because the state was never written back. This follows directly from the source’s warning that “a full history may not fit inside an LLM’s context window, resulting in a context loss or errors.”

Failure modes — what breaks, what catches it

Summarization Token Threshold Not Reached

Trigger — The trigger value of 4000 tokens (configured in SummarizationMiddleware) is not reached before the prompt outgrows the model’s context window, for example because the room left for messages is smaller than the trigger. No summarization occurs, and the message list eventually exceeds the context window of the gpt-5.5 model, causing a context-length error during the model call.
Guard — None shown. The SummarizationMiddleware has no fallback or proactive trimming when the token count is below the threshold.
Posture — fail-hard. The model call aborts with an exception, stopping the graph run.
Operator signal — An error from the model invocation, e.g. “maximum context length exceeded” or a similar token-limit error from the LLM provider.
Recovery — No automatic retry. The operator must reduce message history manually or lower the trigger threshold to force earlier summarization.

InMemorySaver State Loss on Process Restart

Trigger — The application process crashes or is restarted, destroying the in‑memory dictionary that backs InMemorySaver().
Guard — None shown. The source uses InMemorySaver with no disk or database persistence.
Posture — fail-soft. No error is raised, but all thread state is lost; subsequent invocations start with an empty state.
Operator signal — Silent absence of previous context. The agent responds as if the user is new, e.g. not remembering the user’s name from earlier calls.
Recovery — No automatic recovery. The operator must re‑enter the conversation context manually (e.g. repeat previous messages).
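One way to picture the missing guard is a store that lives on disk instead of in process memory. The sketch below uses only the standard library; `SqliteToyCheckpointer` and its table layout are hypothetical and are not the LangGraph checkpointer interface.

```python
import json
import os
import sqlite3
import tempfile


class SqliteToyCheckpointer:
    """Hypothetical file-backed store; not the LangGraph checkpointer API."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS threads "
            "(thread_id TEXT PRIMARY KEY, state TEXT)"
        )

    def save(self, thread_id: str, messages: list[dict]) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO threads VALUES (?, ?)",
            (thread_id, json.dumps(messages)),
        )
        self._conn.commit()

    def load(self, thread_id: str) -> list[dict]:
        row = self._conn.execute(
            "SELECT state FROM threads WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def close(self) -> None:
        self._conn.close()


path = os.path.join(tempfile.mkdtemp(), "threads.db")
first = SqliteToyCheckpointer(path)
first.save("1", [{"role": "user", "content": "hi! I'm bob"}])
first.close()  # simulate the process exiting

second = SqliteToyCheckpointer(path)  # simulate a restart
restored = second.load("1")
second.close()
```

Because the state is written to a file, the second instance can read what the first one saved, which an in-memory dictionary cannot offer after a restart.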

after_model Middleware Removes the Last Message

Trigger — The validate_response function detects a STOP_WORD (such as "password" or "secret") in the AI message content, and returns {"messages": [RemoveMessage(id=last_message.id)]}. If that message is the only message in state["messages"], the state becomes empty.
Guard — The validate_response function itself is the guard, but it contains no check that the removal would leave the message list empty.
Posture — fail-soft. The graph continues executing but now has an empty messages list, leading to confusion or failure in subsequent model calls.
Operator signal — The agent may output a nonsensical reply or throw an error when it tries to process an empty state["messages"].
Recovery — No built‑in retry or fallback. The operator must replay the interaction or add a guard that prevents removal when only one message remains.
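The guard suggested in the recovery step can be sketched as a pure function over plain dictionaries. The message shape and `ids_to_remove` are assumptions made for illustration; in real middleware each returned id would be wrapped in RemoveMessage.

```python
STOP_WORDS = ["password", "secret"]  # the stop words named in the scenario above


def ids_to_remove(messages: list[dict]) -> list[str]:
    """Return the id of the last message if it should be removed, else nothing."""
    if not messages:
        return []
    last = messages[-1]
    flagged = any(word in last["content"].lower() for word in STOP_WORDS)
    # Guard: never remove the only remaining message.
    if flagged and len(messages) > 1:
        return [last["id"]]
    return []


single = [{"id": "a1", "content": "the secret is 42"}]
pair = [
    {"id": "h1", "content": "hi"},
    {"id": "a1", "content": "the password is hunter2"},
]
single_result = ids_to_remove(single)
pair_result = ids_to_remove(pair)
```

With the length check in place, a flagged message is removed only when at least one other message remains in the state.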

Summarization Model Call Failure

Trigger — The SummarizationMiddleware attempts to invoke gpt-5.4-mini to create a summary when the token threshold (tokens >= 4000) is reached, but that model call fails (network error, rate limit, or model outage).
Guard — None shown. There is no try/except or retry logic surrounding the summarization model invocation in the source code excerpt.
Posture — fail-hard. The exception propagates and aborts the entire graph run.
Operator signal — An error trace from the summarization model, e.g. RateLimitError, ConnectionError, or a timeout.
Recovery — No automatic retry. The operator must resubmit the invocation, and the middleware will attempt summarization again on the next trigger.
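A retry wrapper of the kind the guard line says is missing might look like the sketch below. `summarize_with_retry` and `flaky_summarizer` are hypothetical helpers; the source does not show SummarizationMiddleware exposing such a hook, so this only illustrates the pattern.

```python
import time


def summarize_with_retry(summarize, messages, attempts=3, base_delay=0.0):
    """Call `summarize`, retrying on failure; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return summarize(messages)
        except Exception:
            if attempt < attempts - 1:
                # Exponential backoff between attempts.
                time.sleep(base_delay * (2 ** attempt))
    return None  # the caller can fall back to plain trimming


calls = {"count": 0}


def flaky_summarizer(messages):
    """Stand-in for a model call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary outage")
    return f"summary of {len(messages)} messages"


result = summarize_with_retry(flaky_summarizer, ["m1", "m2", "m3", "m4"])
```

Returning None instead of raising turns the fail-hard posture into a fail-soft one, at the cost of the caller having to handle the missing summary.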

Thread ID Collision Across Different Users

Trigger — Two distinct conversations (different users or sessions) use the same thread_id value (e.g. both set configurable: {"thread_id": "1"}). The InMemorySaver stores state under that single thread ID, mixing messages from both users.
Guard — None shown. The source provides no validation or uniqueness enforcement for thread_id.
Posture — fail‑soft. The graph continues to run, but the state contains interleaved messages from different conversations.
Operator signal — The agent refers to information from the wrong user, e.g. calling user “Bob” when the current user is someone else.
Recovery — No automatic recovery. The operator must assign unique thread_id values per session or use a different checkpointer that prevents collisions.
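Assigning a unique thread_id per conversation can be as small as the following sketch. `new_thread_config` and the `user_id:uuid` format are hypothetical choices; only the `{"configurable": {"thread_id": ...}}` shape comes from the example earlier in this chapter.

```python
import uuid


def new_thread_config(user_id: str) -> dict:
    """Build a config whose thread_id is unique per conversation."""
    thread_id = f"{user_id}:{uuid.uuid4()}"
    return {"configurable": {"thread_id": thread_id}}


config_a = new_thread_config("alice")
config_b = new_thread_config("bob")
```

Prefixing the random part with the user id keeps IDs readable in logs while the UUID makes accidental reuse very unlikely.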

02. Message History And Window

A conversation is held as an ordered list of messages inside a single thread. Each message carries a role, like human or assistant, along with its content. These messages alternate back-and-forth, so the list grows steadily longer over time. The model sees this whole list as its context. But that context has a hard limit called the context window, a fixed budget of tokens you cannot exceed. As the list grows, it eats up more and more of that budget. Keeping the full history can get expensive. Token-rich message lists drive up both latency and cost. Even when the window is big enough, models still perform poorly over very long contexts, because they get distracted by stale or off-topic content. So there is a trade. Keeping the full history preserves every detail of the thread, but it risks blowing the token budget and slowing every response. A bounded window discards old messages and keeps only the most recent ones, which saves tokens and keeps the model focused. The cost is that you lose information from earlier in the thread. This forces a choice between completeness and efficiency. Many applications gain from techniques that deliberately remove or forget stale information, so the conversation stays within budget and the model stays sharp.
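The trade between completeness and efficiency can be seen in a few lines of plain Python. `bounded_window` and `total_chars` are hypothetical helpers, and character count is used here only as a crude stand-in for token count.

```python
def total_chars(messages: list[dict]) -> int:
    """Crude size measure: total characters of content (a stand-in for tokens)."""
    return sum(len(m["content"]) for m in messages)


def bounded_window(messages: list[dict], max_messages: int) -> list[dict]:
    """Keep only the most recent `max_messages` messages."""
    return messages[-max_messages:] if max_messages > 0 else []


history = [
    {"role": "user", "content": "hi! I'm bob"},
    {"role": "assistant", "content": "Hello Bob, how can I help?"},
    {"role": "user", "content": "write a short poem about cats"},
    {"role": "assistant", "content": "Soft paws at dawn, a quiet purr."},
]
window = bounded_window(history, 2)
```

The window is smaller than the full history, but the user's name no longer appears anywhere in it, which is exactly the information loss described above.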

Generate it: That context has a hard limit called the context w_____, a fixed budget of tokens you cannot exceed. (cue: w_____; answer: window)

Generate it: A bounded window discards old messages and keeps only the most r_____ ones, which saves tokens and keeps the model focused. (cue: r_____; answer: recent)

Ask yourself: Even when the context window is big enough to hold the whole history, why might a longer message list still hurt the model's answers?

Recall check (try before reading the answer):

As the message list grows, what two costs rise — beyond just filling the token budget? Answer: Token-rich message lists drive up both latency and cost.

Why do models perform poorly over very long contexts even when the window fits? Answer: They get distracted by stale or off-topic content.

A bounded window trades away one thing to gain efficiency — what exactly is lost, and what is gained? Answer: A bounded window saves tokens and keeps the model focused, but the cost is that you lose information from earlier in the thread.

A middleware removes old messages to keep the conversation within a bounded window, discarding the earliest ones as the list grows.

python

from langchain.messages import RemoveMessage
from langchain.agents import create_agent, AgentState
from langchain.agents.middleware import after_model
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.runtime import Runtime

@after_model
def delete_old_messages(state: AgentState, runtime: Runtime) -> dict | None:
    """Remove old messages to keep conversation manageable."""
    messages = state["messages"]
    if len(messages) > 2:
        # remove the earliest two messages
        return {"messages": [RemoveMessage(id=m.id) for m in messages[:2]]}
    return None

agent = create_agent(
    "gpt-5-nano",
    tools=[],
    middleware=[delete_old_messages],
    checkpointer=InMemorySaver(),
)

ELI5 — the plain-language version

Imagine a conversation as a backpack that can only carry a limited number of sticky notes. Each note is a message (like “human says hi” or “assistant replies”). As you chat, the backpack fills up. The subsystem tracks every message in an ordered list within a single thread, where each message has a role (human or assistant) and content. But the backpack has a strict token budget called the context window. To stay within that budget, the system uses a trim mechanism, like @before_model middleware, which cuts old notes and keeps only the most recent ones (for example, the first message plus the last three or four). Without this trimming, the backpack would burst: the model call would either fail because the token limit is exceeded, or the model would struggle to focus, getting distracted by ancient remarks, while response times drag and costs climb. Even if the backpack were huge, the model still performs poorly with too many old notes. So you’d feel the failure directly: the agent errors out mid-conversation or rambles about stale topics, turning a helpful chat into a frustrating mess.

System design — mechanism, invariant, trade-off

The subsystem for message history and window management operates as a sequential pipeline within a single thread. The ordered mechanism begins when a human input arrives as a message, which is appended to the agent’s state, an ordered list of alternating messages. At each step, the runtime reads the full state from a checkpointer (e.g., InMemorySaver) to restore the conversation context. The state is then passed through middleware: first, @before_model middleware runs, where functions like trim_messages may inspect the message list and decide to truncate it by issuing RemoveMessage actions with REMOVE_ALL_MESSAGES to discard everything except the first message and the last few. After that, the model processes the surviving messages, generating a response. The response is appended, and then @after_model middleware can run to further process or store data. On failure, such as an LLM error from exceeding the context window, the graph step halts at the model call; the checkpointer keeps the last saved state, so the thread is not corrupted, and the operator can retry after adjusting the trimming.

The invariant preserved by this design is thread-scoped short-term memory: the conversation history can be resumed at any time. The source explicitly states: “State is persisted to a database using a checkpointer so the thread can be resumed at any time.” In practice this means that after a failed step the checkpointer still contains the last consistent state, so the next invocation restarts from that saved point rather than from scratch. The invariant prevents data loss across invocations. The token budget is a separate constraint: the message list must not exceed the LLM’s maximum supported context window. The system never silently truncates mid-step; instead it relies on explicit middleware to enforce the budget before the model sees the list.

The key trade-off is trimming versus summarizing to stay within the context window. The obvious alternative is to keep the full history and rely on the LLM’s capacity, but the source rejects this because “most LLMs still perform poorly over long contexts” and it leads to “slower response times and higher costs.” The chosen approach—trimming via trim_messages middleware—deliberately discards older messages, accepting information loss to avoid the cost of latency, expense, and degraded model accuracy. The alternative of summarization (provided by SummarizationMiddleware) is presented as a more sophisticated option that preserves information by compressing it, but it introduces extra LLM calls and complexity. The trimming path rejects both full retention and summarization for simpler, faster, cheaper operation at the expense of possibly forgetting details.
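The summarizing side of this trade-off can be sketched with a stub in place of the extra model call. `compact_history` and `stub_summarize` are hypothetical; this is not how SummarizationMiddleware is implemented, only an illustration of replacing older messages with one summary.

```python
def compact_history(messages: list[dict], keep_last: int, summarize) -> list[dict]:
    """Replace everything but the last `keep_last` messages with one summary."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(messages) <= keep_last:
        return list(messages)
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": "Summary: " + summarize(older)}
    return [summary] + recent


def stub_summarize(older: list[dict]) -> str:
    """Placeholder for a model call: joins what the user said."""
    return "; ".join(m["content"] for m in older if m["role"] == "user")


history = [
    {"role": "user", "content": "hi! I'm bob"},
    {"role": "assistant", "content": "Hello Bob!"},
    {"role": "user", "content": "write a poem"},
    {"role": "assistant", "content": "Roses are red."},
]
compacted = compact_history(history, 2, stub_summarize)
```

Unlike plain trimming, the compacted list still carries the user's introduction, but producing the summary costs an additional call in a real system.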

A concrete failure mode occurs when the trim_messages logic is misconfigured or the context window shrinks unexpectedly. For example, if the LLM’s maximum token count is updated downward and the trimming strategy (keeping only a fixed number of messages) still yields a token count above the new limit, the model call will fail with an error such as “context length exceeded”. The operator would see an exception from the model provider in the runtime logs; the exact exception type and wording depend on the provider. The checkpointer would still hold the state saved before the failed model call. The operator would then adjust the trim_messages strategy (e.g., keep fewer messages or switch to SummarizationMiddleware) and reinvoke.

Failure modes — what breaks, what catches it

Context Window Overflow

Trigger — The message list grows until the total token count exceeds the model’s strict context‑window budget.
Guard — SummarizationMiddleware configured with trigger=("tokens", 4000) fires at 4000 tokens to compress older messages and keep the total under the limit. If this middleware is not present, no guard exists.
Posture — Fail‑soft if the SummarizationMiddleware is active (history is summarized, details may be lost). Fail‑hard without it: the model call would raise a token‑limit error and abort the run.
Operator signal — With the middleware, a silent degradation of long‑term recall; without it, an explicit token‑limit error from the model API.
Recovery — The SummarizationMiddleware automatically invokes the summarization model on each step once the trigger token count is reached. If no guard is present, the operator must manually restart with a trimmed message list or add the middleware.

Poor Model Performance on Long Contexts

Trigger — The message list is long enough to still fit in the context window (no overflow) but distracts the model, causing degraded output quality, increased latency, and higher cost.
Guard — The same SummarizationMiddleware; by collapsing older messages into a summary it shortens the effective context length and reduces distraction. Alternatively, manual trimming with RemoveMessage can be applied in after_model middleware.
Posture — Fail‑soft. The model still responds, but the quality may be poor and costs/latency elevated until the guard reduces the context.
Operator signal — Observable as slower response times, higher per‑call token counts, and outputs that “get distracted by stale or off‑topic content” (as stated in the source).
Recovery — Automatic if SummarizationMiddleware or trimming middleware is in the pipeline; otherwise the operator must manually adjust the message history strategy.

Information Loss from Message Trimming

Trigger — The use of a trimming technique (e.g., RemoveMessage in after_model middleware) that discards older messages to stay within the context window.
Guard — No explicit guard is shown in the source for the loss of trimmed content. The source warns that “you may lose information from culling of the message queue.” The alternative SummarizationMiddleware is offered as a different approach, but it does not protect against loss if trimming is already chosen.
Posture — Fail‑soft. The conversation continues, but previously discussed facts or user preferences may be permanently absent from future responses.
Operator signal — The agent fails to recall information that was trimmed. No error is raised; the loss is silent.
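Recovery — Manual: the operator must switch from trimming to summarization (e.g., replace the RemoveMessage-based middleware with SummarizationMiddleware) or trim less aggressively so that more messages survive. With SummarizationMiddleware, raising the keep=("messages", 20) setting retains more recent messages verbatim.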

Thread ID Misconfiguration

Trigger — The configurable thread_id value in the RunnableConfig does not match the intended session (e.g., a typo, reuse of a stale ID, or no ID provided).
Guard — None. The source shows the thread_id is simply passed to the agent via config, and the InMemorySaver persists state per thread ID, but no validation or default fallback is described.
Posture — Fail‑soft. The agent starts a new thread (or retrieves an unrelated one) and loses access to previous short‑term memory, but continues executing without raising an error.
Operator signal — The agent responds as if it has no history of the current conversation. For example, asking “what’s my name?” after three turns no longer yields “Bob!”.
Recovery — Manual: the operator must correct the thread_id to match the original session. The InMemorySaver retains the original state under that ID for as long as the process keeps running.

Summarization Model Failure

Trigger — The SummarizationMiddleware calls its configured model (e.g., "gpt-5.4-mini") to produce a summary, but the call fails (API error, timeout, invalid response).
Guard — No guard is shown in the source. The middleware’s trigger and keep parameters are configuration options, but error handling (retry, fallback) is not mentioned.
Posture — Based on the absence of a guard, the failure likely propagates as an unhandled exception, causing the agent run to abort (fail‑hard).
Operator signal — An exception trace or error log from the model invocation, halting the agent’s execution mid‑turn.
Recovery — Manual: the operator must restart the run after the model API is restored, or switch to a different summarization model (e.g., change the model parameter of the middleware). Retry logic is not specified in the source.

03. Trimming The History

Trimming keeps only the most recent messages, so the prompt stays inside the token budget. Chat applications pile up a long list of messages over time, and that list grows with every single exchange. Context windows are limited and token-rich lists are costly. So developers trim away older turns and keep only the latest. But dropping early turns has a price. You lose information from the opening of the conversation. If a user shared their name or a key preference early on, that detail vanishes the moment you trim the message queue. The model can no longer refer back to that earlier context. It may simply forget what was said. The system prompt, however, is always kept, because it holds the instructions for the agent. Unlike human messages, it is not really part of the conversation history. It is the set of rules and guidance that tells the model how to behave. So even when old human turns are removed, the system prompt stays put. That keeps the agent on track. Trimming lets you manage the token budget, but you trade away the memory of earlier parts of the chat.

Generate it: Trimming keeps only the most recent messages, so the prompt stays inside the token b_____. (cue: b_____; answer: budget)

Generate it: The system prompt is always kept, because it holds the i____________ for the agent. (cue: i____________; answer: instructions)

Ask yourself: Trimming is cheap and keeps you under budget — so what concrete kind of early information are you risking when you drop old turns?

Recall check (try before reading the answer):

If a user shared their name early on, what happens to that fact when you trim — and why? Answer: That detail vanishes the moment you trim the message queue, because trimming drops older turns; the model can no longer refer back to that earlier context.

Why is the system prompt kept even when old human turns are removed? Answer: The system prompt is always kept because it holds the instructions for the agent; unlike human messages, it is not really part of the conversation history.

State the core trade-off of trimming in one line. Answer: Trimming lets you manage the token budget, but you trade away the memory of earlier parts of the chat.

Looking back: Two chapters ago we said a single model call is stateless — what makes a request start from scratch without memory? Answer: It sees only the messages you hand it at that moment, so without memory every request starts from scratch.

Trim message history with the @before_model decorator, keeping the first message and the most recent messages.

python

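from langchain.messages import RemoveMessage
from langgraph.graph.message import REMOVE_ALL_MESSAGES
from langchain.agents.middleware import before_model
from langchain.agents import AgentState
from langgraph.runtime import Runtime

@before_model
def trim_messages(state: AgentState, runtime: Runtime) -> dict | None:
    """Keep the first message and the most recent few; drop the rest."""
    messages = state["messages"]
    if len(messages) <= 3:
        return None  # short history: nothing to trim
    first_msg = messages[0]
    # Keep the last three messages when the count is even, otherwise the last four.
    recent_msg = messages[-3:] if len(messages) % 2 == 0 else messages[-4:]
    new_messages = [first_msg] + recent_msg
    # REMOVE_ALL_MESSAGES clears the stored history before the kept messages are re-added.
    return {
        "messages": [
            RemoveMessage(id=REMOVE_ALL_MESSAGES),
            *new_messages
        ]
    }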
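To see which messages that slicing rule keeps, the same logic can be applied to plain list indices. `kept_indices` is a hypothetical helper written for this walk-through; it reproduces only the slicing, not the middleware.

```python
def kept_indices(n: int) -> list[int]:
    """Indices kept by the slicing rule above for a history of n messages."""
    idx = list(range(n))
    if n <= 3:
        return idx  # short history: nothing is trimmed
    recent = idx[-3:] if n % 2 == 0 else idx[-4:]
    return [idx[0]] + recent


four = kept_indices(4)   # [0, 1, 2, 3]
five = kept_indices(5)   # [0, 1, 2, 3, 4]
six = kept_indices(6)    # [0, 3, 4, 5]
seven = kept_indices(7)  # [0, 3, 4, 5, 6]
```

With four or five messages the rule keeps everything, so messages only start to disappear once the history reaches six. The kept list is never longer than five messages.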

ELI5 — the plain-language version

Imagine you’re taking notes in a small notebook that only has room for the last three pages. Each time you write a new note, you have to tear out the oldest pages to make space, so the notebook never overflows. That’s exactly what trimming does for an AI agent’s conversation history. It keeps only the most recent messages—using a mechanism like the @before_model middleware that calls RemoveMessage on older turns—so the prompt stays inside the LLM’s strict token budget. Without trimming, the message list would keep growing with every exchange, eventually exceeding the context window and causing errors or wildly slow responses. But tearing out those early pages also means you lose the information written on them. If a user introduced themselves by name in the first turn, that detail vanishes the moment trimming chops off the beginning of the conversation. Later, when the agent is asked “what’s my name?”, it has no idea—just as you’d feel frustrated if you tore out the page where someone told you their name and then couldn’t remember it.

System design — mechanism, invariant, trade-off

The Trimming The History subsystem operates as a middleware step executed before every model invocation, enforced by the @before_model decorator applied to the trim_messages function. The ordered mechanism begins when the runtime calls trim_messages with the current AgentState and a Runtime instance. The function inspects state["messages"]: if the message count exceeds three, it preserves only the first message and the most recent messages (the final three when the message count is even, otherwise the final four) to keep the prompt concise. It then returns a dictionary that emits a RemoveMessage operation with the REMOVE_ALL_MESSAGES identifier, effectively replacing the entire message history with the truncated list. On the normal path, the reduced message set is written to the checkpointer (e.g., InMemorySaver) and passed to the model. On a failure, for instance when individual messages are long enough that the trimmed list still exceeds the model's context window, the model call raises a "context length exceeded" error, and the graph stops without updating the persisted state.

The design preserves the invariant that the message list always fits within the LLM’s maximum supported token budget, preventing context overflow and the degradation (poor performance, higher latency, increased cost) that comes from stuffing a full history into a finite window. This invariant is enforced by a simple count‑based heuristic (the len(messages) <= 3 guard), not by exact token counting, which means the guarantee is approximate but sufficient for many chat applications. The subsystem relies on the checkpointer to maintain the thread’s state across invocations, so the truncated history is the only state the model sees for that thread.

The key trade‑off is information loss versus context integrity: trimming discards early messages, including critical details like a user’s name or preferences, to avoid exceeding the token limit. The obvious alternative it rejects is keeping the full message list, which would eventually overflow the context window, triggering errors, high inference costs, and declining model quality as the model becomes “distracted” by stale content. By choosing trimming, the subsystem avoids the cost of processing an ever‑growing history (both latency and monetary expense) and the complexity of managing long contexts. A second alternative, SummarizationMiddleware, preserves information via summarization but adds model calls and complexity. Trimming is simpler and faster, accepting the loss of early turns as the price for stable, low‑cost operation.

A concrete failure mode occurs when the trimmed messages still exceed the LLM’s context window in tokens, even though at most five messages remain (the first message plus the last three or four). For example, if each message contains thousands of tokens, the naive count‑based trim might produce a prompt that is still too long. The signal an operator would see is an exception from the LLM provider stating that the model’s maximum context length was exceeded and asking for shorter messages; the exact wording and limits depend on the provider and model. The graph would fail at the model invocation step, and the operator would observe a failed agent.invoke() call with the error trace referencing the token limit, indicating that the trim_messages middleware’s heuristic was insufficient.
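A token-aware variant of trimming addresses this failure mode. The sketch below is a swapped-in technique, not the trim_messages middleware from this chapter: `estimate_tokens` uses a rough four-characters-per-token heuristic (an assumption), and `trim_to_budget` is a hypothetical helper operating on plain dictionaries.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (an assumption): about four characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the first message, then add recent messages newest-first while they fit."""
    if not messages:
        return []
    first, rest = messages[0], messages[1:]
    used = estimate_tokens(first["content"])
    kept: list[dict] = []
    for msg in reversed(rest):
        cost = estimate_tokens(msg["content"])
        if used + cost > max_tokens:
            break  # stop at the first message that no longer fits
        kept.append(msg)
        used += cost
    return [first] + kept[::-1]  # restore chronological order


history = [
    {"role": "user", "content": "hi! I'm bob"},
    {"role": "assistant", "content": "x" * 400},  # one very long message
    {"role": "user", "content": "short reply"},
    {"role": "user", "content": "what's my name?"},
]
trimmed = trim_to_budget(history, max_tokens=20)
```

Here the long message is dropped because of its size rather than its position, which a count-based rule cannot do. A production version would use the model's real tokenizer instead of the character heuristic.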

Failure modes — what breaks, what catches it

Loss of Early User-Provided Information

Trigger — The conversation accumulates many messages, and the trimming logic removes older turns without preserving key details such as a user’s name or a preference that was stated early.
Guard — None from source. The chapter only acknowledges the problem and points to SummarizationMiddleware as an alternative, not as a guard for trimming.
Posture — fail‑soft: the agent continues to respond, but with degraded memory; it may produce incorrect or nonsensical answers.
Operator signal — No error is raised. In the name example the reply is still "Your name is Bob!" because the first message survives trimming; when the fact was stated in a turn that has since been trimmed, the model instead replies with something like “I don’t know your name,” a silent absence of correct recall.
Recovery — No automatic recovery. The user must re‑state the missing information, or the operator must switch to a configuration that uses SummarizationMiddleware to retain summaries instead of trimming raw messages.

Context Window Overflow Despite Trimming

Trigger — The trimming policy is too lenient (e.g., it keeps too many messages or ignores token count), so the total token count still exceeds the LLM’s context limit after trimming.
Guard — None from source. The chapter mentions SummarizationMiddleware has a trigger=("tokens", 4000) parameter, but that applies to summarization, not to trimming. No token‑budget guard is shown for the trimming path.
Posture — fail‑hard: the LLM API call aborts with a context‑length error, stopping the run.
Operator signal — An API error such as "maximum context length exceeded" or a similar error field returned by the model provider.
Recovery — Manual step required: the operator must re‑configure the trimming parameters (e.g., reduce keep count) or adopt SummarizationMiddleware to avoid overflow.

Trimming Removes Messages Needed for Tool Execution

Trigger — A tool accesses the agent’s state via the runtime parameter (typed as ToolRuntime) and expects to find earlier messages (e.g., a user‑supplied value). However, those messages have already been trimmed.
Guard — None from source. The example shows tool and ToolRuntime but no validation that the required messages still exist before the tool runs.
Posture — fail‑soft: the tool executes but may receive an empty or incomplete state, leading to incorrect results or a runtime error.
Operator signal — The tool returns an unexpected output or raises an error such as KeyError when accessing a missing message. The operator sees no explicit log line about missing state.
Recovery — No automatic recovery. The operator must redesign the tool to not rely on trimmed messages, or increase the keep count to preserve the relevant messages.

Checkpoint Persistence Inconsistency After Trimming

Trigger — Trimming modifies the messages list in the agent’s state, but the InMemorySaver checkpointer may persist the state before the trimming operation completes, or the trimming happens in a context where the checkpoint is not updated. This causes the persisted state to differ from the actual runtime state.
Guard — None from source. The agent is created with checkpointer=InMemorySaver() but no explicit transaction or synchronization between trimming and checkpointing is shown.
Posture — fail‑soft: the agent continues, but when the thread is resumed later, the restored state may contain stale or missing messages, leading to confusing behavior.
Operator signal — On resuming a thread with a thread_id, the operator sees messages that do not match the last interaction (e.g., a message that was trimmed still appears, or a message that was kept is absent).
Recovery — Manual step: the operator must replay the conversation from scratch or manually correct the persisted state.

Unintended Removal of System Messages or Instructions

Trigger — The trimming logic does not distinguish between human/model messages and system‑level messages (e.g., a system prompt stored as a message). When trimming removes the oldest messages, it may delete the system message, stripping the agent of its instructions.
Guard — None from source. The chapter only refers to “messages alternate between human inputs and model responses” and does not show any filtering to protect system messages.
Posture — fail‑soft (or fail‑hard depending on the model): the agent continues without instructions, possibly generating incoherent or unsafe outputs.
Operator signal — The model produces responses that lack adherence to the original instructions (e.g., it ignores formatting rules or security constraints). No error is raised — the operator notices the behavioral change.
Recovery — Manual step: the operator must restart the thread and ensure that system messages are either excluded from trimming or re‑inserted after trim.
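The last failure mode above suggests excluding system messages from trimming. A minimal sketch of that idea follows, assuming messages are plain dicts with a "role" key; `trim_keep_system` is a hypothetical helper, not part of LangChain.

```python
Message = dict[str, str]  # {"role": ..., "content": ...}


def trim_keep_system(messages: list[Message], keep_recent: int) -> list[Message]:
    """Drop old conversational turns but never drop system messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    recent = turns[-keep_recent:] if keep_recent > 0 else []
    # System messages are placed first so the instructions always lead the prompt.
    return system + recent


history = [{"role": "system", "content": "Answer in one sentence."}]
for i in range(1, 4):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_keep_system(history, keep_recent=2)
```

The instructions survive no matter how many turns are removed; only the human and model messages compete for the remaining slots.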

04. Summarizing The History

Chats grow longer over time as messages move between human inputs and model replies. Context windows are limited, and longer message lists cost more. Even if a model supports the full context, it often does poorly with very long histories. The reason is that it gets distracted by stale or off-topic content. Response times slow down, and costs climb. Many apps gain from ways to remove or forget stale text. One simple way is to trim older messages, which saves tokens and cuts distraction. But the trade is losing detail. Key facts from earlier turns might vanish, and staleness becomes a problem too. If you cut old messages, the agent forgets what happened. A different way uses a running summary instead. This summary packs earlier turns into a short form that is added cheaply, keeping the gist at a much lower token cost. But a summary might miss details or drift out of date. You have to weigh these trade-offs. A summary keeps the main ideas but can quietly lose nuance, while plain trimming is simpler yet risks dropping context. Both ways help manage memory within a single thread. The goal is to keep the chat efficient without losing too much.

Generate it: A running summary packs earlier turns into a short form that is added cheaply, keeping the g___ at a much lower token cost. (cue: g___; answer: gist)

Generate it: A summary keeps the main ideas but can quietly lose n_____, while plain trimming is simpler yet risks dropping context. (cue: n_____; answer: nuance)

Ask yourself: Both trimming and summarizing fight the same problem — so what does a summary preserve that simple trimming throws away, and what does it risk in return?

Recall check (try before reading the answer):

How does a running summary keep the gist while costing far fewer tokens? Answer: The summary packs earlier turns into a short form that is added cheaply, keeping the gist at a much lower token cost.

Name the two ways a summary can fail you. Answer: A summary might miss details or drift out of date.

Contrast summarizing versus plain trimming in one sentence. Answer: A summary keeps the main ideas but can quietly lose nuance, while plain trimming is simpler yet risks dropping context.

To keep message history efficient, a running summary can be generated automatically with the built-in summarization middleware, which compresses older turns into a short version that is prepended to the conversation.

python

from langchain.agents import create_agent
from langchain.agents.middleware import SummarizationMiddleware
from langgraph.checkpoint.memory import InMemorySaver
from langchain_core.runnables import RunnableConfig

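agent = create_agent(
    "gpt-5-nano",
    tools=[...],  # placeholder: pass your own tool list here
    middleware=[
        # Summarize once the history passes about 4000 tokens, keeping the 20 newest messages.
        SummarizationMiddleware(
            model="gpt-5.4-mini",
            trigger=("tokens", 4000),
            keep=("messages", 20),
        ),
    ],
    checkpointer=InMemorySaver(),
)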

config: RunnableConfig = {"configurable": {"thread_id": "1"}}

ELI5 — the plain-language version

Imagine you’re jotting down notes from a long lecture rather than keeping every single word recorded. That’s what summarizing the history does for a conversation: it condenses the back‑and‑forth between a person and an AI into a tight summary, so the AI doesn’t have to hold onto every message. Concretely, the system uses a built‑in SummarizationMiddleware that calls a chat model to boil down the conversation’s key facts and decisions while discarding the exact wording. This saves tokens, keeps the context window from overflowing, and helps the AI stay focused on what matters. Without this mechanism, the conversation would either have to be brutally trimmed—losing important details like the user’s early preferences or instructions—or the full history would balloon, slowing responses, raising costs, and letting stale or off‑topic messages distract the AI. A beginner would feel the failure when the AI suddenly forgets that they said their name was “Bob” ten turns ago, or starts rambling about cats when the whole thread was about dogs. Summarizing keeps the essence alive without drowning the AI in noise.
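The note-taking idea above can be sketched in plain Python. This is a simplified model of the mechanism, not the middleware's actual code: `summarize_if_needed` and `fake_summarize` are hypothetical, the token estimate assumes roughly four characters per token, and the summarizer is a stub standing in for a chat model call.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def approx_tokens(messages: list[Message]) -> int:
    """Crude estimate of about four characters per token (an assumption)."""
    return sum(len(m["content"]) for m in messages) // 4


def summarize_if_needed(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    trigger_tokens: int = 4000,
    keep_messages: int = 20,
) -> list[Message]:
    """Replace older turns with one summary message once the history is too large."""
    if keep_messages <= 0:
        raise ValueError("keep_messages must be positive")
    if approx_tokens(messages) <= trigger_tokens or len(messages) <= keep_messages:
        return messages  # under the threshold, or nothing old enough to fold away
    older, recent = messages[:-keep_messages], messages[-keep_messages:]
    summary = {"role": "system", "content": "Summary of earlier turns: " + summarize(older)}
    return [summary] + recent


def fake_summarize(older: list[Message]) -> str:
    """Stand-in for a chat model call; a real summary would come from an LLM."""
    return f"{len(older)} earlier messages, starting with: {older[0]['content']}"


history = [{"role": "user", "content": "hi, my name is bob"}]
history += [{"role": "assistant", "content": "filler " * 200} for _ in range(30)]
compact = summarize_if_needed(history, fake_summarize)
```

The result holds one summary message plus the 20 newest messages. Whether the name survives depends entirely on what the summarizer chooses to keep, which is the nuance risk the chapter describes.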

System design — mechanism, invariant, trade-off

The subsystem for managing short-term memory during conversations relies on a middleware-driven pipeline that executes before each model invocation. When an agent is created with a checkpointer—for example, InMemorySaver()—the graph’s state, which contains the full message history, is read at the start of every step. The ordered mechanism begins with a @before_model middleware function, such as trim_messages, which inspects the AgentState’s "messages" list. If the list exceeds three messages, the function preserves only the first message and the most recent messages (the last three or four depending on parity), then returns a dict containing a RemoveMessage with REMOVE_ALL_MESSAGES and the new subset. This dict is applied to the state before the model processes the input. If no truncation is needed, the function returns None and the state passes unchanged. On failure—for instance, if the middleware crashes or the message list is malformed—the agent’s runtime halts with an exception, and the checkpointer ensures the prior checkpoint is not overwritten, allowing the thread to be resumed.

The invariant preserved by this design is thread-level persistence: the checkpointer saves the state so that the thread can be resumed at any time. With a durable, database-backed checkpointer this holds even across process failures; the InMemorySaver used in the example keeps state only for the life of the process. Every successful invocation leaves the graph in a consistent state, and partial updates (e.g., after a middleware mutation but before the model call) are committed only when the step completes successfully. The design rejects the naïve alternative of storing and resending all messages indefinitely, which would eventually overflow the context window and cause the LLM to become “distracted by stale or off-topic content, all while suffering from slower response times and higher costs.” Instead, it trades away historical detail for token efficiency. The cost avoided is the model’s performance degradation and the steady growth of input tokens: each call resends the whole history, so per-call input grows linearly with the number of turns and the cumulative total grows roughly quadratically, which eventually makes both latency and expense prohibitive.

The key trade-off is between preserving important facts and maintaining a manageable context window. Trimming older messages (as done by trim_messages) saves tokens and reduces distraction, but it can lose detail: an earlier user statement such as “hi, my name is bob” may be discarded, leaving the agent unable to answer “what's my name?” in a later turn. The more sophisticated alternative, SummarizationMiddleware, compresses history via a chat model instead of simply deleting messages, which preserves information at the cost of additional inference time and complexity. The design uses the simpler trimming approach as a baseline, because summarization adds an extra model call whenever its trigger fires, increasing both latency and cost on those steps. The trade-off is explicitly acknowledged in the source: “The problem with trimming or removing messages … is that you may lose information from culling of the message queue. Because of this, some applications benefit from a more sophisticated approach of summarizing.” Both options are exposed, leaving the choice to the application developer.

A concrete failure mode occurs when the trim_messages middleware culls too aggressively, removing the turn that contained the user’s name. In a thread with thread_id = "1", the agent first receives “hi, my name is bob”, then later “what's my name?”. After several exchanges, the middleware retains only the first message and the last three or four messages; if the first message is the system instruction rather than the user’s greeting, the name “bob” is lost entirely. The operator would see the agent respond with something like “I don’t recall your name” or a hallucinated name. Nothing is logged as an error: the only signal is an AIMessage that fails to reference the correct name, often accompanied by an increase in user retries or complaints. This failure can be detected by checking the agent’s responses against known user data stored in long-term memory, though that lies outside the scope of this subsystem.

Failure modes — what breaks, what catches it

Summarization Model Call Failure

Trigger – The token count exceeds trigger=("tokens", 4000), so the SummarizationMiddleware calls model="gpt-5.4-mini", and that call fails (network error, rate limit, or invalid credentials).
Guard – No explicit exception handler, retry, or fallback is shown in the source for the summarization model call.
Posture – fail‑hard – the unhandled exception propagates from the middleware, aborting the current invoke() run.
Operator signal – A Python traceback of a network error or API error (e.g., ConnectionError, RateLimitError) from the model call.
Recovery – No automatic retry; the operator must fix the underlying issue (network, credentials) and re‑invoke the agent.

keep Parameter Causing Loss of Critical Information

Trigger – After summarization, the middleware retains only the most recent keep=("messages", 20) messages plus the summary. An important fact that existed only in an earlier message is discarded.
Guard – No validation or fallback is shown; the SummarizationMiddleware simply applies the keep count.
Posture – fail‑soft – the agent continues to run but may produce incorrect answers because the lost fact is no longer in the state.
Operator signal – The agent’s responses contradict earlier user‑supplied information (silent absence of correct behavior).
Recovery – No automatic recovery; the operator must increase the keep count or improve the summarization model’s accuracy to preserve critical details.

InMemorySaver State Loss

Trigger – The checkpointer InMemorySaver stores state in volatile memory. A process restart, crash, or memory pressure destroys the saved state, including the conversation for thread_id "1".
Guard – No external persistence or recovery mechanism is shown; InMemorySaver is explicitly an in‑memory store.
Posture – fail-soft – the agent remains operational but the entire thread history is lost; subsequent invoke() calls with the same thread_id start from an empty state.
Operator signal – The agent replies as if it has no memory of the conversation (e.g., “I don’t know your name”).
Recovery – No automatic recovery; the operator must accept state loss or replace InMemorySaver with a durable checkpointer.

Token Trigger Never Fires, Causing Context Overflow

Trigger – The conversation accumulates messages but never reaches the threshold trigger=("tokens", 4000) (e.g., the threshold is too high relative to the context window of model="gpt-5.5").
Guard – No guard is shown that caps total tokens or forces summarization before the model rejects the input.
Posture – fail‑hard – eventually the model’s context limit is exceeded, and the call to invoke() fails with a “too long” error.
Operator signal – Increasing response times and costs, then a ValueError or model‑specific length error from the invoke() call.
Recovery – No automatic recovery; the operator must lower the trigger value or add a separate token‑limiting step.

Configuration Error – Missing thread_id

Trigger – The config dictionary is missing the key "thread_id" inside "configurable", e.g., RunnableConfig is {} or {"configurable": {}}.
Guard – No input validation for config is shown; config is passed directly to agent.invoke().
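Posture – fail-hard – the run aborts before the model is called, because the checkpointer cannot associate state with a thread.
Operator signal – A Python traceback that names the missing key. In LangGraph this typically surfaces as a ValueError saying the checkpointer requires thread_id in "configurable", rather than a bare KeyError; the exact exception type may vary by version.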
Recovery – No automatic recovery; the operator must supply a valid config with {"configurable": {"thread_id": "..."}}.

Summarization Model Returns Malformed Output

Trigger – The call to model="gpt-5.4-mini" returns output that the SummarizationMiddleware cannot parse (e.g., None, empty string, or non‑text).
Guard – No guard for malformed summary output is shown; the middleware likely expects a plain text str.
Posture – fail‑hard – the parsing error propagates as an unhandled exception, aborting the invoke() run.
Operator signal – A Python AttributeError or TypeError traceback (e.g., 'NoneType' object has no attribute 'content').
Recovery – No automatic retry; the operator must investigate the model response and possibly retry with a different model or prompt.
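The missing thread_id failure listed above can be caught before the agent is invoked. The following is a minimal sketch of such a pre-flight check; `require_thread_id` is a hypothetical helper written for illustration, not a LangChain or LangGraph function.

```python
from typing import Any, Mapping


def require_thread_id(config: Mapping[str, Any]) -> str:
    """Return the thread_id from a config mapping or raise a clear error."""
    configurable = config.get("configurable") or {}
    thread_id = configurable.get("thread_id")
    if thread_id is None or thread_id == "":
        raise ValueError('config must look like {"configurable": {"thread_id": "..."}}')
    return str(thread_id)


good = require_thread_id({"configurable": {"thread_id": "1"}})
```

Calling this on the config before agent.invoke() turns a failure deep in the persistence layer into an immediate, readable error at the call site.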

05. Threads And Checkpointers

A checkpointer saves the conversation state so the graph can pause and resume. This state is connected to a thread identifier. A single thread scopes all interactions within one session. That means messages and responses are kept together in one ongoing chat. The checkpointer writes state to a database. Because of this, the thread can be stopped at any point and started again later exactly where it left off. This makes it reliable. The state is read at the start of each step, so you can see what the graph knows at that moment. Durable checkpoints let you inspect the current state before continuing. You can also go back to an earlier checkpoint if you need to correct a mistake. That gives you both review and rollback. The trade-off is that saving state adds some work, but it gives you control over the conversation. Short-term memory stays within the thread, so each session is separate. This design keeps one conversation from mixing with another. It also helps manage context length because you can trim the list of messages while still having a saved history to fall back on. So a checkpointer is a simple but powerful tool for building reliable agents.

Generate it: A checkpointer saves the conversation state so the graph can pause and r_____. (cue: r_____; answer: resume)

Generate it: You can go back to an earlier checkpoint if you need to correct a mistake — that gives you both review and r________. (cue: r________; answer: rollback)

Ask yourself: A checkpointer adds extra work on every step — what two capabilities does that durable, thread-scoped state buy you in return?

Recall check (try before reading the answer):

After a thread is stopped, what lets it start again later exactly where it left off? Answer: The checkpointer writes state to a database, so the thread can be stopped at any point and started again later exactly where it left off.

Because short-term memory stays within the thread, what does that guarantee about separate sessions? Answer: Short-term memory stays within the thread, so each session is separate and one conversation does not mix with another.

How does the checkpointer make trimming safe? Answer: You can trim the list of messages while still having a saved history to fall back on.

Looking back: Back in "Trimming The History" — what is the single thing trimming keeps to stay under the token budget? Answer: Trimming keeps only the most recent messages, so the prompt stays inside the token budget.

A thread scopes interactions using a checkpointer that saves state for pause/resume.

python

from langchain.agents import create_agent
from langgraph.checkpoint.memory import InMemorySaver

def get_user_info() -> str:
    """Look up the current user's profile."""  # the docstring serves as the tool description
    return "No user profile on file."

agent = create_agent(
    model="google_genai:gemini-3.5-flash",
    tools=[get_user_info],
    checkpointer=InMemorySaver(),
)

thread_config = {"configurable": {"thread_id": "1"}}
response = agent.invoke(
    {"messages": [{"role": "user", "content": "Hi! My name is Bob."}]},
    thread_config,
)["messages"][-1].content

ELI5 — the plain-language version

Think of a video game save point: when you pause and turn off the console, a save file records your exact position, health, and inventory, so next time you load that file, you continue right where you stopped, not back at the starting screen. In this chapter, the “checkpointer” does the same for a conversation. It takes the agent’s current state (the list of messages exchanged so far) and writes it to a store, linking that state to a unique thread identifier. Production setups use a database for this; LangGraph’s InMemorySaver, used in the example, keeps the state in process memory, so it is more like a save file that disappears when the console loses power. A single thread acts as one continuous session: every message and response stays together in one ongoing chat. Each time the agent takes a step, it reads the state from that saved file, so it always knows exactly what has been said. This means you can stop the conversation at any moment and pick it up later without missing a beat.

Without this checkpointer, your agent would be like a game with no save feature. Every new question would arrive as a blank slate—it would forget your name, the last topic, and any instructions you gave. You’d have to repeat yourself constantly, and the conversation could never build on previous turns. The thread would be useless, because each interaction would exist in its own isolated bubble, and the agent would never learn from what came before. That failure is exactly what a beginner would feel: frustration from having to start over again and again.

System design — mechanism, invariant, trade-off

The subsystem operates through a precise ordered mechanism. When a graph is invoked with a thread_id, the checkpointer reads the persisted state from the database at the start of each step, making the current message history available to the agent. After the model call and any tool invocations, the state is updated and written back to the database, either when a step completes or when the graph is fully invoked. Middleware hooks like @before_model (e.g., the trim_messages function) and @after_model allow custom processing of the state at the beginning or end of the model call, respectively, but the fundamental sequence is: read state → process → write state. If a failure occurs during processing (e.g., a model call error or tool exception), the write does not happen; the previous checkpoint remains intact, so on the next invocation with the same thread_id, the graph resumes from the last successful state.
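The read → process → write sequence, and the rule that a failed step leaves the previous checkpoint untouched, can be modeled with a toy store. This is a sketch under stated assumptions: `ThreadStore` and `run_step` are hypothetical and only imitate the behavior described above; they are not LangGraph's checkpointer API.

```python
import copy
from typing import Callable

State = dict[str, list]


class ThreadStore:
    """Toy in-memory checkpoint store keyed by thread_id (hypothetical)."""

    def __init__(self) -> None:
        self._checkpoints: dict[str, State] = {}

    def read(self, thread_id: str) -> State:
        # Hand back a copy so a failed step cannot mutate the saved checkpoint.
        return copy.deepcopy(self._checkpoints.get(thread_id, {"messages": []}))

    def write(self, thread_id: str, state: State) -> None:
        self._checkpoints[thread_id] = copy.deepcopy(state)


def run_step(store: ThreadStore, thread_id: str, user_text: str, model: Callable[[list], str]) -> str:
    """Read state, process one turn, and write only if processing succeeded."""
    state = store.read(thread_id)
    state["messages"].append({"role": "user", "content": user_text})
    reply = model(state["messages"])  # may raise; nothing is written in that case
    state["messages"].append({"role": "assistant", "content": reply})
    store.write(thread_id, state)
    return reply


def echo_model(messages: list) -> str:
    return f"seen {len(messages)} messages"


def broken_model(messages: list) -> str:
    raise RuntimeError("model unavailable")


store = ThreadStore()
run_step(store, "1", "hi, my name is bob", echo_model)
try:
    run_step(store, "1", "what's my name?", broken_model)
except RuntimeError:
    pass  # the failed turn never reaches the store

saved = store.read("1")["messages"]
other = store.read("2")["messages"]
```

After the failed second turn, thread "1" still holds only the first exchange, and thread "2" is empty, which also shows how keying by thread_id keeps sessions apart.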

The design preserves the invariant of thread-scoped persistence. The message history, maintained as part of the agent’s state, is persisted using a checkpointer (such as the built-in InMemorySaver) so that the thread can be paused and resumed exactly where it left off. This means that all interactions within a single session, linked by the thread_id, are consistently recoverable: the message history survives between invocations for as long as the checkpointer’s backing store does. The state is read at the start of each step, ensuring that the agent always sees the conversation as it was at the moment of the last successful checkpoint.

The key trade-off is between storing the complete thread state forever and bounding the size of that state to fit within an LLM’s limited context window. The obvious rejected alternative is to keep all messages without any truncation. That approach is discarded because LLMs suffer from context loss or errors when the history exceeds the maximum token limit, leading to poor performance, slower response times, and higher costs. Instead, the system embraces techniques like message trimming via @before_model hooks—for instance, the trim_messages function uses RemoveMessage(id=REMOVE_ALL_MESSAGES) to discard all but the first and a few recent messages, keeping only first_msg + recent_messages. The cost avoided is the degradation of model accuracy and the expense of processing many tokens; the accepted cost is that older context may be forgotten, but the agent remains within the model’s safe operational window.

A concrete failure mode occurs when the token count of the message history surpasses the LLM’s maximum context window and no trimming middleware is installed. The operator would see context loss or errors in the model’s response—for example, the LLM might return an empty reply, produce a token-limit error from the API, or generate nonsensical output because it was “distracted” by stale content. If the trim_messages function is misconfigured (e.g., the threshold is too high or the strategy is wrong), the operator might observe the agent forgetting user-provided information from earlier in the same thread because the trimmed messages removed the critical context. The signal is a clear degradation in relevance or outright error messages from the model provider, such as a “maximum token length exceeded” exception.

Failure modes — what breaks, what catches it

Missing thread_id in configuration

Trigger — A user invokes the agent without setting "thread_id" inside the configurable key of the RunnableConfig.
Guard — None shown in the source. The example always supplies "thread_id": "1", but no validation or default is provided for a missing value.
Posture — Fail-hard: the invocation would abort because the checkpointer cannot associate state with a thread.
Operator signal — A ValueError or similar runtime error from the persistence layer, raised before the model is called, whose message contains the string "thread_id".
Recovery — Manual: the operator must re-run the agent.invoke call with a properly populated config object that includes the "thread_id" key.

InMemorySaver state loss across process restarts

Trigger — The process using InMemorySaver stops and later restarts, or the Python runtime is reset; all persisted state is lost.
Guard — None in the source. The example exclusively uses InMemorySaver, and no fallback to a database-backed checkpointer is shown.
Posture — Fail-soft: the agent continues but begins a new thread with empty state, effectively ignoring any prior conversation.
Operator signal — Silent absence of expected state: the next agent.invoke returns responses that do not reference earlier messages or stored user information.
Recovery — Manual: the operator must switch to a durable checkpointer (e.g., a database-backed saver) and re‑inject any lost data through a new thread.

SummarizationMiddleware model invocation failure

Trigger — The model specified in SummarizationMiddleware(model="gpt-5.4-mini", ...) returns an error or times out while the total token count exceeds the trigger threshold (4000 tokens).
Guard — None shown. The middleware has no retry logic, fallback model, or exception handler in the snippet.
Posture — Fail-hard: the entire agent step fails and the run aborts because the summarization call is required to manage the message history.
Operator signal — An HTTP‑level or API error (e.g., 5xx, timeout) raised by the language model client, propagated up uncaught.
Recovery — Manual: the operator must retry once the model is reachable again, use a different summarization model, or implement a fallback middleware.

after_model middleware IndexError on empty messages

Trigger — The validate_response middleware runs after a model call and the state["messages"] list is empty (for example, after all messages have been deleted by a previous step).
Guard — None. The code state["messages"][-1] directly indexes into the list with no length check.
Posture — Fail-hard: an unhandled IndexError crashes the agent invocation.
Operator signal — A Python IndexError: list index out of range traceback in the logs.
Recovery — Manual: the operator must add a guard (e.g., if not state["messages"]: return None) or ensure middleware ordering never leaves the message list empty.

Tool runtime state access when state is uninitialized

Trigger — A tool annotated with ToolRuntime (via the runtime parameter) attempts to read from AgentState (e.g., state["messages"]) before the graph has produced any initial state.
Guard — None shown. The excerpt only demonstrates the tool’s signature and comments, not any validation that the state exists.
Posture — Fail-hard: accessing AgentState fields on a None or empty state object raises an attribute or key error.
Operator signal — An AttributeError or KeyError referencing the missing state field, logged during tool execution.
Recovery — Manual: the operator must either initialize the graph’s state before tool invocation or add a condition inside the tool (e.g., if runtime.state is None: return "State not available").

06. Failure Modes To Watch

Short-term memory for AI agents has several common failure modes. As conversations grow, the list of messages gets longer and longer. Eventually, the full history may not fit inside the language model's limited context window, and that leads to lost information or outright errors. Even when the model technically supports the full length, it often struggles with long prompts. The model gets distracted by stale or off-topic content. Important facts mentioned early get buried in the middle, and the agent loses track of them. All of this drives slower response times and higher costs. So there is a real trade. You want to keep enough history for the agent to understand the context. But holding onto too many old messages makes the system less efficient and less accurate, and stale facts can lead the agent to act on outdated information. Many applications therefore use techniques to remove or forget stale messages, trimming the history down to only the recent or relevant parts. Thread-level persistence helps keep data separate between different conversations. But if you do not manage memory carefully, state can leak across threads, mixing up one user's context with another's. The key is to balance completeness with performance.

Generate it: Important facts mentioned early get b_____ in the middle, and the agent loses track of them. (cue: b_____; answer: buried)

Generate it: If you do not manage memory carefully, state can l___ across threads, mixing up one user's context with another's. (cue: l___; answer: leak)

Ask yourself: Two failures here pull in opposite directions — keeping too much history versus letting state cross threads. What goes wrong in each case?

Recall check (try before reading the answer):

When the model technically supports the full length, what still degrades its performance, and what gets lost? Answer: It often struggles with long prompts and gets distracted by stale or off-topic content; important facts mentioned early get buried in the middle and the agent loses track of them.

Beyond running out of window, why is keeping too many old messages actively risky for correctness? Answer: Stale facts can lead the agent to act on outdated information, making the system less accurate.

Thread-level persistence is supposed to isolate conversations — what failure happens if memory is not managed carefully? Answer: State can leak across threads, mixing up one user's context with another's.

This code shows how to automatically delete old messages from the conversation history to avoid context window overload and stale information.

python

from langchain.messages import RemoveMessage
from langchain.agents import create_agent, AgentState
from langchain.agents.middleware import after_model
from langgraph.checkpoint.memory import InMemorySaver  # needed for the checkpointer below
from langgraph.runtime import Runtime

@after_model
def delete_old_messages(state: AgentState, runtime: Runtime) -> dict | None:
    """Remove old messages to keep conversation manageable."""
    messages = state["messages"]
    if len(messages) > 2:
        # remove the earliest two messages
        return {"messages": [RemoveMessage(id=m.id) for m in messages[:2]]}
    return None

agent = create_agent(
    "gpt-5-nano",
    tools=[...],
    middleware=[delete_old_messages],
    checkpointer=InMemorySaver(),
)

ELI5 — the plain-language version

Imagine trying to hold a long conversation in a noisy room where every new word pushes the first words out of your memory. Something similar happens to an AI agent’s short-term memory when conversations grow too long. The agent keeps every user message and its own reply in a growing list, but the underlying language model has a strict context window, like a sticky note that can only hold so many words. When that list overflows, the model call either fails or loses earlier information; even when the list technically fits, the model gets distracted by stale, off-topic content. Important facts you mentioned at the start, like your name or a critical instruction, get buried in the middle and drop out of the agent’s awareness. Without a way to forget or summarise old messages, you would see the agent forget who you are mid-conversation, repeat questions, and take longer to respond because it is wading through a cluttered, oversized prompt, all while costing more to run. It’s like a friend forgetting the beginning of your story because they kept interrupting themselves.

System design — mechanism, invariant, trade-off

The short-term memory subsystem in LangChain’s agent runtime operates as a stateful graph managed by a checkpointer. The agent’s AgentState, which contains the full message history, is read at the start of every step. The ordered mechanism proceeds as follows: first, any @before_model middleware, such as the trim_messages function, can modify the state before the model call. Next, the language model processes the resulting messages. Then any @after_model middleware, such as the delete_old_messages example, runs on the model output. On failure, for instance if the message history exceeds the model’s context window, the system may throw an error or produce incoherent output. The checkpointer persists the state after each step, in process memory with InMemorySaver or in a database with a saver such as PostgresSaver, so the thread can be resumed later.
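The ordering of hooks, model call, and persistence can be sketched without any framework. Everything here is a hypothetical stand-in: `run_step`, the hook lists, and the `save` callable are illustrative names, not LangChain or LangGraph APIs.

```python
from typing import Callable, Dict, List

State = Dict[str, list]  # hypothetical stand-in for the agent state


def run_step(
    state: State,
    before_model: List[Callable[[State], State]],
    model: Callable[[list], str],
    after_model: List[Callable[[State], State]],
    save: Callable[[State], None],
) -> State:
    """Run one step: before-model hooks, model call, after-model hooks, persist."""
    for hook in before_model:
        state = hook(state)
    reply = model(state["messages"])
    state = {**state, "messages": state["messages"] + [reply]}
    for hook in after_model:
        state = hook(state)
    save(state)  # the checkpoint is written only once the step has finished
    return state


def keep_last_two(state: State) -> State:
    """Example before-model hook: keep only the two newest messages."""
    return {**state, "messages": state["messages"][-2:]}


def echo_model(messages: list) -> str:
    """Toy model that reports how many messages it received."""
    return f"saw {len(messages)} messages"


checkpoints: list = []
result = run_step({"messages": ["a", "b", "c"]}, [keep_last_two], echo_model, [], checkpoints.append)
```

Because the hook runs before the model, the toy model only ever sees two messages, and the saved checkpoint reflects the trimmed list plus the reply.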

The design preserves a thread-scoped persistence invariant: the agent’s short-term memory is strictly scoped to a single thread, identified by a thread_id in the RunnableConfig. This ensures that conversation history is isolated per session and can be resumed exactly where it left off. The guarantee is not exactly-once delivery but rather resumable state consistency—the full history is stored in the checkpointer, and the graph always starts each step from the persisted state, avoiding data loss across invocations within the same thread.
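A minimal sketch of the thread-scoping idea, assuming a plain in-process dictionary keyed by `thread_id`. The `ThreadScopedMemory` class is hypothetical and only illustrates isolation; it is not how a LangGraph checkpointer is implemented.

```python
from copy import deepcopy
from typing import Dict, List


class ThreadScopedMemory:
    """Hypothetical in-process store that keeps one message list per thread_id."""

    def __init__(self) -> None:
        self._threads: Dict[str, List[str]] = {}

    def append(self, thread_id: str, message: str) -> None:
        self._threads.setdefault(thread_id, []).append(message)

    def load(self, thread_id: str) -> List[str]:
        # Return a copy so callers cannot mutate another thread's history.
        return deepcopy(self._threads.get(thread_id, []))


memory = ThreadScopedMemory()
memory.append("thread-1", "hi, my name is Bob")
memory.append("thread-2", "hi, my name is Alice")
bob_history = memory.load("thread-1")
```

Every read and write names its thread explicitly, which is what keeps one user's context from appearing in another's. A shared list without that key is the simplest way for state to leak.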

The key trade-off is between complete history retention and practical model performance. The obvious alternative—keeping every message verbatim—is explicitly rejected because long histories cause “context loss or errors” when the prompt exceeds the LLM’s context window, and even when it fits, models “get distracted by stale or off-topic content” and suffer “slower response times and higher costs.” Instead, the subsystem uses middleware like SummarizationMiddleware (triggered at 4000 tokens, keeping the last 20 messages) or trim_messages to remove old messages, sacrificing perfect recall for reliability, speed, and lower cost.
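The trigger-and-keep behaviour can be approximated in a few lines. This is a sketch under stated assumptions: `maybe_summarize` and `count_tokens` are hypothetical helpers, the word-count estimate is crude, and the `summarize` callable stands in for a call to a smaller model. It does not reproduce SummarizationMiddleware's internals.

```python
from typing import Callable, List


def count_tokens(messages: List[str]) -> int:
    """Rough estimate (assumption): one token per whitespace-separated word."""
    return sum(len(m.split()) for m in messages)


def maybe_summarize(
    messages: List[str],
    summarize: Callable[[List[str]], str],
    trigger_tokens: int = 4000,
    keep_messages: int = 20,
) -> List[str]:
    """Replace everything but the newest keep_messages with one summary once triggered."""
    if count_tokens(messages) < trigger_tokens or len(messages) <= keep_messages:
        return messages
    split = len(messages) - keep_messages
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


chat = [f"turn {i} about the trip" for i in range(6)]
compact = maybe_summarize(
    chat,
    lambda old: f"summary of {len(old)} messages",
    trigger_tokens=10,
    keep_messages=2,
)
```

The example lowers both thresholds so the effect is visible on six short messages. With the defaults of 4000 and 20, the same list would be returned unchanged.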

A concrete failure mode occurs when a conversation grows long without truncation: the message list exceeds the model’s context window, and the system either silently truncates or throws an error. The signal an operator would actually see is an incorrect answer, such as the agent forgetting the user’s name—e.g., in the documented example, after three long exchanges, the agent fails to answer “what’s my name?” with “Bob” unless the middleware correctly trims or summarizes. The operator might observe a response like “I don’t know your name” or a model error logged in the runtime, indicating the history was too large or too noisy.

Failure modes — what breaks, what catches it

Context Window Overflow

Trigger – The conversation grows beyond the language model’s maximum context window. The chapter states: “the full history may not fit inside the language model's limited context window.”
Guard – SummarizationMiddleware with trigger=("tokens", 4000) is intended to prevent overflow by summarizing before it happens. No secondary guard (retry, fallback, validation) is shown in the source.
Posture – fail‑hard. Without a working summarization, the model call aborts with an error.
Operator signal – A model error (e.g., “context length exceeded”) – no exact string is given in the source – or “outright errors” as the chapter describes.
Recovery – None automatic. The operator must reduce the message list manually or adjust the trigger or keep parameters.

Information Loss Due to Aggressive Summarization

Trigger – SummarizationMiddleware fires when token count reaches 4000 and discards all but the most recent 20 messages (keep=("messages", 20)). Important facts from earlier in the conversation are removed.
Guard – The source shows no guard against loss; the middleware itself is the cause. No validation, exception handler, or fallback preserves removed content.
Posture – fail‑soft. The agent continues to respond, but with degraded recall of earlier facts.
Operator signal – Silent absence of information. The agent may later fail to answer questions about facts mentioned before the summarization.
Recovery – No automatic recovery. The operator can raise keep or lower the summarization threshold, or rely on long‑term memory (not shown in this chapter) for critical data.

Checkpointer State Loss on Restart

Trigger – The agent uses checkpointer = InMemorySaver(). This persists state only in memory. A process crash or restart destroys all short‑term memory for that thread.
Guard – No guard is shown. The source does not include a persistent checkpointer (e.g., SqliteSaver) or any data‑loss prevention.
Posture – fail‑hard. The thread’s entire state is lost; the agent cannot resume the conversation.
Operator signal – Upon restart, the agent has no memory of previous interactions – a silent absence of all prior state.
Recovery – Manual step: the user must restart the conversation from scratch, or the operator must deploy a durable checkpointer (not provided in the source).
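To make the durability gap concrete, here is a minimal sketch of state that survives a restart, using only the standard library's `sqlite3`. The `SqliteThreadStore` class is hypothetical and is not LangGraph's SqliteSaver; it only shows why a file-backed store can be reopened while an in-memory one cannot.

```python
import json
import os
import sqlite3
import tempfile
from typing import List


class SqliteThreadStore:
    """Hypothetical durable store: one JSON-encoded message list per thread_id."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS threads (thread_id TEXT PRIMARY KEY, messages TEXT)"
        )
        self._conn.commit()

    def save(self, thread_id: str, messages: List[str]) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO threads (thread_id, messages) VALUES (?, ?)",
            (thread_id, json.dumps(messages)),
        )
        self._conn.commit()

    def load(self, thread_id: str) -> List[str]:
        row = self._conn.execute(
            "SELECT messages FROM threads WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def close(self) -> None:
        self._conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "threads.db")
store = SqliteThreadStore(db_path)
store.save("thread-1", ["hi, my name is Bob"])
store.close()  # simulate the process stopping

reopened = SqliteThreadStore(db_path)  # simulate a fresh process
restored = reopened.load("thread-1")
reopened.close()
```

The second instance shares nothing with the first except the file on disk, yet it can still load the thread's history.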

Summarization Trigger Mismatch with Model Context Limit

Trigger – trigger=("tokens", 4000) is configured, but the actual model (e.g., model="gpt-5.5") may have a smaller context window (e.g., 2048 tokens). The middleware never fires before overflow occurs.
Guard – The source shows no validation that the trigger threshold is less than the model’s maximum context length. No guard exists.
Posture – fail‑hard. The model call fails because tokens exceed its limit.
Operator signal – A model error (likely “context length exceeded”) or a silent truncation, depending on the provider. The chapter mentions “errors” for such cases.
Recovery – Manual reconfiguration: the operator must set trigger to a value below the model’s actual limit.
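One way to close this gap is a startup check that compares the trigger with the model's limit. The sketch below is an assumption, not something the source provides: `validate_trigger` is a hypothetical helper, and the caller must supply the context limit, since this code does not look it up.

```python
def validate_trigger(trigger_tokens: int, context_limit: int, reserve_tokens: int = 0) -> None:
    """Raise ValueError unless summarization would fire before the model's limit."""
    if trigger_tokens <= 0:
        raise ValueError("trigger_tokens must be positive")
    # reserve_tokens leaves room for the system prompt and the model's reply.
    usable = context_limit - reserve_tokens
    if trigger_tokens >= usable:
        raise ValueError(
            f"trigger of {trigger_tokens} tokens is not below the usable limit of {usable}"
        )


# A 4000-token trigger is fine for a large window...
validate_trigger(4000, context_limit=128_000, reserve_tokens=2_000)

# ...but would be rejected for the 2048-token window from the example above.
try:
    validate_trigger(4000, context_limit=2048)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Running the check once at configuration time turns a late runtime failure into an immediate, explainable one.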

Summarization Model API Failure

Trigger – The summarization step uses a separate model (model="gpt-5.4-mini"). An API error (rate limit, network failure, authentication) prevents summary generation.
Guard – The source shows no exception handling, retry logic, or fallback for the summarization call. No try/except or alternative path is provided.
Posture – The source does not specify; likely fail‑soft (skip summarization and continue with full history, eventually leading to overflow). Could also be fail‑hard if the middleware halts the graph.
Operator signal – An error from the gpt-5.4-mini API (exact field unknown). If fail‑soft, the agent continues silently until later overflow errors occur.
Recovery – No automatic recovery shown. The operator must ensure the summarization model is available and may need to implement retry logic externally.
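Since the source shows no retry or fallback, an operator would have to add one externally. The following is a minimal sketch of that idea: `summarize_with_retry` is a hypothetical wrapper, and the `summarize` callable stands in for the real summarization call, whose actual exception types are not known from the source.

```python
import time
from typing import Callable, List, Optional


def summarize_with_retry(
    summarize: Callable[[List[str]], str],
    messages: List[str],
    attempts: int = 3,
    base_delay: float = 0.0,
) -> Optional[str]:
    """Try the summarizer a few times; return None so the caller can pick a fallback."""
    for attempt in range(attempts):
        try:
            return summarize(messages)
        except Exception:
            # Exponential backoff between tries; skipped after the final attempt.
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return None


calls = {"count": 0}


def flaky_summarizer(messages: List[str]) -> str:
    """Simulated summarizer that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated API failure")
    return f"summary of {len(messages)} messages"


summary = summarize_with_retry(flaky_summarizer, ["a", "b"])
```

Returning `None` instead of raising makes the posture an explicit choice: the caller can fall back to plain trimming or stop the run, rather than inheriting whatever the middleware happens to do.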

07. Testing And Operations

Managing token counts is a real concern. The source explains that message lists grow long over time, and context windows are limited. Token-rich lists become costly. So many applications use techniques to remove stale information. That is the trade-off: you save tokens but risk losing useful context.

Now consider testing recall. The source talks about long-term memory, which stores facts across sessions. To test if an agent remembers, you could ask for a fact many turns later. But the source does not describe a specific testing method. It does say that long-term memory is saved in custom namespaces and can be recalled at any time in any thread.
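Because the source leaves the method open, the "ask for a fact many turns later" idea can be sketched as a small harness. Both `recalls_fact` and `WindowedFakeAgent` are hypothetical; the fake agent stands in for a real agent call so the check can run without a model.

```python
from typing import Callable, List


def recalls_fact(
    send: Callable[[str], str],
    statement: str,
    filler_turns: List[str],
    question: str,
    expected: str,
) -> bool:
    """State a fact, send filler turns, then check whether the answer contains it."""
    send(statement)
    for turn in filler_turns:
        send(turn)
    return expected.lower() in send(question).lower()


class WindowedFakeAgent:
    """Toy agent that only 'sees' its last `window` user messages (window >= 1)."""

    def __init__(self, window: int) -> None:
        self.window = window
        self.seen: List[str] = []

    def send(self, text: str) -> str:
        self.seen = (self.seen + [text])[-self.window:]
        for message in self.seen:
            if message.startswith("my name is "):
                return "Your name is " + message[len("my name is "):]
        return "I don't know your name"


filler = ["tell me a joke", "another one", "thanks"]
wide = recalls_fact(WindowedFakeAgent(10).send, "my name is Bob", filler, "what's my name?", "Bob")
narrow = recalls_fact(WindowedFakeAgent(2).send, "my name is Bob", filler, "what's my name?", "Bob")
```

The same harness passes with a wide window and fails with a narrow one, which is the signal an operator would look for after changing trim or summarization settings.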

Finally, tracing the exact message list sent to the model makes the system observable and debuggable. The source explains that short-term memory is part of the agent's state, and that state is persisted using a checkpointer. This means you can see the full message history at each step. The source also mentions that the store lets you save and recall memories, which helps you understand what the model sees. By tracing which messages go in, you can spot problems. The source gives an example of removing messages with sensitive words, showing how you can inspect and modify the list. So tracing makes the memory system transparent.
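A minimal sketch of tracing, assuming the model is reachable as a plain callable. The `traced` wrapper is hypothetical and is not a LangChain tracing API; it only records the exact list each call receives.

```python
import copy
from typing import Callable, List


def traced(model: Callable[[List[dict]], str], log: List[List[dict]]) -> Callable[[List[dict]], str]:
    """Wrap a model callable so every message list it receives is recorded."""

    def wrapper(messages: List[dict]) -> str:
        # Snapshot the input so later mutations do not rewrite the trace.
        log.append(copy.deepcopy(messages))
        return model(messages)

    return wrapper


trace: List[List[dict]] = []
model = traced(lambda messages: f"{len(messages)} messages received", trace)

prompt = [{"role": "user", "content": "hi, my name is Bob"}]
reply = model(prompt)
prompt.append({"role": "user", "content": "added after the call"})
```

The deep copy matters: the trace still shows the single message the model actually saw, even though the caller's list changed afterwards.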

Generate it: Short-term memory is part of the agent's state, and that state is persisted using a c___________. (cue: c___________; answer: checkpointer)

Generate it: Tracing the exact message list sent to the model makes the system observable and d__________. (cue: d__________; answer: debuggable)

Ask yourself: If you can't see inside the model, how does persisting state with a checkpointer turn the message history into something you can actually inspect and debug?

Recall check (try before reading the answer):

What does tracing the exact message list sent to the model give you? Answer: Tracing the exact message list sent to the model makes the system observable and debuggable, so you can see the full message history at each step and spot problems.

The source gives a concrete example of inspecting and modifying the list — what is it? Answer: The source gives an example of removing messages with sensitive words, showing how you can inspect and modify the list.

Why can't you fully test recall from this source alone? Answer: The source does not describe a specific testing method, though it says long-term memory is saved in custom namespaces and can be recalled at any time in any thread.

Looking back: From "Threads And Checkpointers" — what does a checkpointer save, and what does that let the graph do? Answer: A checkpointer saves the conversation state so the graph can pause and resume.

This example removes old messages to keep the conversation manageable.

python

from langchain.messages import RemoveMessage
from langchain.agents import create_agent, AgentState
from langchain.agents.middleware import after_model
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.runtime import Runtime

@after_model
def delete_old_messages(state: AgentState, runtime: Runtime) -> dict | None:
    messages = state["messages"]
    if len(messages) > 2:
        return {"messages": [RemoveMessage(id=m.id) for m in messages[:2]]}
    return None

agent = create_agent(
    "gpt-5-nano",
    tools=[...],
    system_prompt="Please be concise and to the point.",
    middleware=[delete_old_messages],
    checkpointer=InMemorySaver(),
)

ELI5 — the plain-language version

Managing token counts is like clearing a messy desk by tossing old sticky notes: you keep the most important notes in front of you, but you risk losing details that were on the ones you threw away. In practice, a LangChain agent stores every message in a growing list, but LLMs have a limited context window, measured in tokens. The source shows two concrete mechanisms: trimming, which uses RemoveMessage to delete all but the first message and the three or four most recent ones, and summarizing, where a smaller model compresses old conversation into a short summary via SummarizationMiddleware. Without these techniques, a long chat would blow past the token limit, causing errors, slowdowns, or the model forgetting the name you gave earlier. For testing recall across sessions, the source mentions long-term memory that saves facts (e.g., “Bob”) to a custom namespace, but it doesn't give a specific test method, only that the memory persists. If you never trim or summarize, the agent’s context window fills up, it can’t process new messages, and you’d see an error such as “context length exceeded” instead of a helpful reply. And without cross-session memory, a new conversation starts with the model asking “What’s your name?” as if you had never met.

System design — mechanism, invariant, trade-off

The ordered mechanism for managing short-term memory begins when the graph is invoked: the agent’s state—containing the message list—is read at the start of each step. Before the model executes, the @before_model middleware runs, and in the provided example the trim_messages function checks the message count; if more than three messages exist, it keeps the first system‑style message and the last few (three or four) recent ones, then issues a RemoveMessage with the identifier REMOVE_ALL_MESSAGES before adding the retained messages back. The model then generates a response, after which the @after_model middleware can process the output. If tool calls are needed, control passes to tool nodes and loops back to the model. On failure—for instance, if the trimmed messages still exceed the LLM’s context window or the model produces an error—the checkpointer (e.g., InMemorySaver) allows the thread to be resumed from the last persisted checkpoint, but the lost portion of the history cannot be recovered.
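The keep-first-plus-recent rule can be illustrated on a plain list. This is a sketch, not the source's trim_messages function: `trim_history` is a hypothetical helper, and choosing three recent messages for an even-length list and four for an odd-length one is an assumption about how "three or four" is decided.

```python
from typing import List


def trim_history(messages: List[str]) -> List[str]:
    """Keep the first message plus the last three or four; short lists pass through."""
    if len(messages) <= 3:
        return messages
    # Assumption: parity of the list length decides between three and four.
    recent_count = 3 if len(messages) % 2 == 0 else 4
    return [messages[0]] + messages[-recent_count:]


six_turns = ["first", "m2", "m3", "m4", "m5", "m6"]
seven_turns = six_turns + ["m7"]
trimmed_even = trim_history(six_turns)
trimmed_odd = trim_history(seven_turns)
```

Note that whatever sits at index 0 is preserved unconditionally, so the rule protects an opening instruction only if that instruction really is the first message.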

The design preserves what the source calls “thread-level persistence”: state is persisted to a database using a checkpointer “so the thread can be resumed at any time.” Short-term memory is saved when the agent is invoked or a step completes, so within a single conversation thread the agent can pick up from the last saved checkpoint, even if a crash or error interrupts execution, provided the checkpointer's storage is durable. The save boundary is the graph step: work inside a step that did not complete is not part of the saved state. The source does not claim exactly-once semantics or describe atomicity guarantees beyond this.

The key trade‑off is between keeping a complete message history and fitting the LLM’s limited context window. The obvious rejected alternative is to retain all messages indefinitely, which the source explicitly warns against because it leads to context window overflow, poor model performance over long contexts, slower response times, and higher costs. Instead, the system accepts the risk of “losing useful context” by trimming or summarizing early messages. The SummarizationMiddleware (triggered on token count with the trigger=("tokens", 4000) option) and the trim_messages function are the mechanisms chosen to save tokens, avoiding the cost of unnecessarily large token bills and degraded latency. This rejection of a full‑history policy is the reason the middleware pattern exists: it trades perfect recall for operational efficiency.

A concrete failure mode occurs when the trim_messages function removes the very first system message or a critical early user input because it misidentifies the boundary. In the example code, the function keeps only messages[0] (the first message) plus three or four recent ones; if the first message is not a system instruction but a user greeting, the agent loses the proper system persona or an important directive. An operator would see the agent failing to follow a system‑level instruction or producing off‑topic responses when queried—for instance, the agent might no longer remember the user’s stated name after several turns, even though the name was in a message that was removed. The signal is an unexpected or contradictory answer, observable by comparing the agent’s output against known context. The source does not prescribe a testing method for long‑term memory across threads, but within the thread an operator can detect this failure by asking the agent to recall information given earlier in the same conversation.

Failure modes — what breaks, what catches it

1. Token Limit Exceeded

Trigger — The message list grows beyond the LLM’s context window, causing “context loss or errors” (exact phrase from the source).
Guard — SummarizationMiddleware with trigger=("tokens", 4000) and keep=("messages", 20).
Posture — Fail-soft: the middleware removes or summarizes older messages, degrading recall but allowing the conversation to continue.
Operator signal — The source states this failure directly produces “context loss or errors” in model responses; the operator would observe incomplete or incorrect answers.
Recovery — The SummarizationMiddleware automatically applies summarization when the token trigger is reached; no retry or backoff is shown. The operator can adjust the trigger or keep parameters to trade off token usage for recall.

2. Summarization Losing Critical Context

Trigger — The SummarizationMiddleware discards messages beyond the last 20 (or whatever keep is set to), removing early but still‑relevant information.
Guard — No guard exists for this loss; the same middleware that prevents token overload is the source of the failure.
Posture — Fail-soft: the agent continues, but with degraded memory of earlier turns.
Operator signal — The source notes that LLMs get “distracted by stale or off-topic content” when context grows long, but here the opposite problem occurs: the operator would observe missing details in later responses that were present in the truncated history.
Recovery — The source does not specify a recovery mechanism. The operator must manually adjust the keep parameter or implement a separate long‑term memory store to preserve critical facts.

3. InMemory Checkpointer Fails on Restart

Trigger — The process restarts or the in‑memory database is cleared; the InMemorySaver() checkpointer loses all persisted state.
Guard — No guard is provided in the source. The code uses InMemorySaver which has no durability across restarts.
Posture — Fail‑hard: the thread’s state (short‑term memory) is lost, and the agent run cannot be resumed from where it left off.
Operator signal — The source says state is persisted so “the thread can be resumed at any time”; a restart would silently break that promise—the operator would see a new thread with no conversation history.
Recovery — No automatic recovery. The operator must restart the conversation from scratch or switch to a persistent checkpointer (e.g., a database-backed saver) if resume capability is required.

4. Long-Term Memory Write Failure in the Hot Path

Trigger — While writing memories in the hot path (e.g., via store.put), the underlying store becomes unavailable or returns an error.
Guard — The source does not show any exception handler, retry, or validation around store.put or store.search.
Posture — Fail‑hard: because the write is in the hot path (synchronous during runtime), a failure would abort the agent’s step, potentially crashing the run.
Operator signal — The source does not specify a log line or error field. The operator would observe an unhandled exception or a non‑functional agent invocation.
Recovery — No recovery is documented. The operator must manually restart the agent and, if needed, re‑enter the lost memory data.
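A guard for this case could look like the sketch below. It is an assumption layered on top of the source: `safe_put` is a hypothetical wrapper, and the `put` callable stands in for a store's write method taking a namespace tuple, a key, and a value.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("memory")


def safe_put(
    put: Callable[[Tuple[str, ...], str, dict], None],
    namespace: Tuple[str, ...],
    key: str,
    value: dict,
) -> bool:
    """Attempt a memory write; log the failure and return False instead of raising."""
    try:
        put(namespace, key, value)
        return True
    except Exception:
        logger.exception("memory write failed for %s/%s", namespace, key)
        return False


written: dict = {}


def working_put(namespace, key, value):
    """Simulated healthy store."""
    written[(namespace, key)] = value


def broken_put(namespace, key, value):
    """Simulated store outage."""
    raise ConnectionError("simulated store outage")


ok = safe_put(working_put, ("users", "bob"), "profile", {"name": "Bob"})
failed = safe_put(broken_put, ("users", "bob"), "profile", {"name": "Bob"})
```

Swallowing the error changes the posture from fail-hard to fail-soft, so the boolean result and the log line are the only evidence that a memory was not saved. Whether that is acceptable depends on how critical the fact is.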

5. Long-Term Memory Search Failure Over Document Collections

Trigger — The agent’s long‑term memory is stored as a collection of documents; a search (e.g., via store.search) fails to locate the relevant fact because of an indexing issue, schema mismatch, or sheer volume.
Guard — The source does not provide a guard for search failures. It only notes that “search over the list” adds complexity.
Posture — Fail‑soft: the agent likely returns no memory or defaults to an empty result, continuing without the recalled information.
Operator signal — The source describes this as “complexity to memory search over the list”; the operator would observe missing recalled facts in responses, with no explicit error.
Recovery — No automatic recovery. The operator may need to manually inspect the store, re‑index memories, or tune the search logic—details not provided in the source.

Glossary — the domain terms, grounded in the code

16 terms, each defined from this subsystem’s real source.

short-term memory

Short-term memory is the part of the agent's state that remembers previous interactions within a single thread, persisted via thread-scoped checkpoints, updating when the agent is invoked or a step is completed and read at the start of each step.

Memory hook Short-term memory is the agent's per-thread sticky note, saved after each step and read at the next.

From langchain-short-term-memory.md

checkpointer

A checkpointer is a persistence mechanism passed to `create_agent` that saves agent state (short‑term memory) to memory (e.g., `InMemorySaver`) or a database (e.g., `PostgresSaver`), enabling a thread identified by `thread_id` to be resumed across invocations.

Memory hook A checkpointer is your agent's save point, bookmarking state so you can pick up the same thread later.

From langchain-short-term-memory.md

InMemorySaver

InMemorySaver is a checkpointer from LangGraph's checkpoint.memory module that persists the agent's short-term memory (state) to an in-memory dictionary, enabling the thread to be resumed across multiple invocations when passed as the checkpointer parameter to create_agent.

Memory hook InMemorySaver works like a game save point, freezing the agent’s state in memory so the thread can continue later.

From langchain-short-term-memory.md

PostgresSaver

PostgresSaver is a checkpointer from langgraph.checkpoint.postgres that persists agent state to a PostgreSQL database, automatically creating tables via setup() and used in production to allow threads to be resumed at any time.

Memory hook PostgresSaver is a database lifeguard that dives into PostgreSQL to keep agent state safe for thread resumption.

From langchain-short-term-memory.md

thread

A thread is a conversation session identified by a thread_id that organizes multiple interactions, with state persisted to a database via a checkpointer so the conversation can be resumed at any time.

Memory hook A thread is a conversation’s spool: its thread_id keeps state reeled in to resume anytime.

From langchain-short-term-memory.md

thread_id

thread_id is a configuration key passed to the agent's `invoke` method that identifies a unique conversation session, allowing the checkpointer (for example, `InMemorySaver`) to preserve and retrieve short-term memory (state) across multiple calls.

Memory hook thread_id is the conversation ID card you show at the agent's door so it remembers you from previous chats.

From langchain-short-term-memory.md

AgentState

AgentState is the base class used to define the agent's short-term memory (state) schema, which can be extended with custom fields like `user_id` and `preferences`, and is passed to `create_agent` via the `state_schema` parameter; tools access and modify this state through `runtime.state`.

Memory hook AgentState is the agent's memory locker: tools use runtime.state to store and fetch custom fields.

From langchain-short-term-memory.md

state_schema

state_schema is a parameter passed to create_agent that defines a custom state class (subclassing AgentState) with additional fields like user_id and preferences, enabling the agent to store and access those custom fields during invocation and tool execution.

Memory hook state_schema sketches custom slots like user_id into the agent's memory blueprint.

From langchain-short-term-memory.md

create_agent

create_agent is a function that initializes an agent by accepting a model, optional tools, a checkpointer like InMemorySaver, and optional middleware, and the resulting agent can be called with messages and a thread config to return a response while preserving conversation history across invocations.

Memory hook create_agent is the librarian that uses a thread_id card to recall your name from a memory cabinet.

From langchain-short-term-memory.md

messages

messages is a key in the agent's state holding the conversation history, accessed via state["messages"] and managed with the add_messages reducer, allowing operations like trimming or deleting individual messages using RemoveMessage.

Memory hook Messages are the conversation scroll in agent state – snip old lines with RemoveMessage.

From langchain-short-term-memory.md

conversation history

Conversation history is the most common form of short-term memory, stored as a list of alternating human and model messages within the agent's state, which is persisted via thread-scoped checkpoints so the agent can access the full context of a single thread.

Memory hook Conversation history is the agent's ping-pong rally of messages, saved in the thread's checkpoint to recall the whole chat.

From langchain-short-term-memory.md

context window

Context window is the limited number of tokens an LLM can process at once, which makes long conversation histories costly and prone to poor performance, so applications must remove or forget stale information to stay within that limit.

Memory hook Think of a context window as a tiny window — you must shove out stale messages to see new ones.

From langgraph-memory.md

persistence

Persistence is the storage of an agent's state, including short-term memory, to a database via a checkpointer, allowing the thread to be resumed at any time.

Memory hook Persistence is the checkpoint that saves the thread's state to a database, like a permanent bookmark for resuming later.

From langgraph-memory.md

langgraph.checkpoint.memory

langgraph.checkpoint.memory is the module that provides the InMemorySaver checkpointer, which keeps each conversation thread's state in process memory to enable short-term memory across invocations for as long as the process is running.

Memory hook Like a save point in a game, langgraph.checkpoint.memory holds thread state in memory for quick resume.

From langgraph-memory.md

langgraph-checkpoint-postgres

langgraph-checkpoint-postgres is a LangGraph library that provides the PostgresSaver checkpointer class, used in production to persist agent state to a PostgreSQL database via a connection string and automatic table setup.

Memory hook Langgraph-checkpoint-postgres: the save point for agent state, auto-creating tables in PostgreSQL.

From langchain-short-term-memory.md

checkpointer libraries

Checkpointer libraries are the available database backends (such as SQLite, Postgres, and Azure Cosmos DB) that implement the checkpointer interface for persisting each thread's short-term memory across invocations.

Memory hook Checkpointer libraries are database backends you plug in, like memory save points for your agent's threads.

From langchain-short-term-memory.md