Autonomous agents


Autonomy is a dial, not a switch. A basic agent already picks its own next step; an autonomous agent keeps doing that over hours or days — which is only safe when the run can survive a crash, pause for a human before irreversible actions, remember across conversations, and split work across specialized agents when one stops scaling.

This page builds on Agents with LangGraph — read that first for the agent loop, tools, and state. Here we turn the dial up: the autonomy spectrum, durable execution, human-in-the-loop guardrails, long-term memory, and multi-agent systems.

Each piece below leads with a plain-language ELI5, then the system-design detail, then real example code. Every part is generated by LlamaIndex, grounded in the official LangGraph and LangChain agents documentation — not paraphrased from memory.

Explain it like I'm 5

Think of an autonomous AI agent like a delivery driver given a route and a truck. Instead of calling the boss before every turn, the driver plans the route, drives, checks the map if lost, and adjusts. But you wouldn't give a new driver total freedom—you'd start with a limited route, require check-ins at key points, and provide a memory of past deliveries. Autonomy is a dial you turn up as the agent proves reliable, using save points, human override, and stored knowledge to prevent costly mistakes. Without these safeguards, the agent might take wrong turns or never finish the task efficiently.

The system-design view

At the core of production autonomy is the agent loop, a model calling tools in a loop until a task is complete, but reliable operation at scale demands three layered mechanisms beyond that basic cycle. First, durable execution is achieved through LangGraph's graph model: a workflow is expressed as a State, Nodes, and Edges, where Nodes do the work and Edges determine the next node. Execution proceeds in discrete super-steps, and after each super-step the graph state is saved by the persistence layer. Checkpoints are stored under a thread_id, which is what lets an agent “persist through failures and resume from where they left off”: a crashed or paused run resumes exactly where it stopped, with no recomputation of completed steps. Second, human-in-the-loop (HITL) builds on this by issuing an interrupt before a sensitive tool call. The interrupt halts execution, the graph state persists, and a human returns one of four decisions: approve, edit, reject, or respond. The decision then determines whether the action executes, is modified, is skipped with an explanation, or is replaced by a human message. This design ensures that the graph can safely pause for minutes or hours without losing context.

Long‑term memory and multi‑agent decomposition address the fundamental constraint of a fixed context window. Memory is provided by MemoryMiddleware which loads persistent instructions across sessions, while SkillsMiddleware surfaces domain knowledge on demand rather than stuffing everything into the prompt. For cases where a single agent’s tool set and context grow too large, the subagents pattern is used: a main agent coordinates subagents as tools, delegating sub‑tasks that each run in their own isolated context. This keeps the main agent’s context clean and allows parallel execution. The trade‑off is explicit: more model calls (higher latency and cost) in exchange for better context management and distributed development. The rejected alternative is a monolithic single agent “with the right (sometimes dynamic) tools and prompt”—the docs state that this can often achieve similar results, but when the tool set is too large or the context window would overflow, the subagent approach becomes necessary.

The autonomy spectrum runs from fixed workflows (predetermined code paths) to fully agentic systems (dynamic tool usage). LangGraph is designed to support both, mixing deterministic logic (Edges) with agentic behavior within Nodes. A critical failure mode that pushes designs back toward the workflow end is poor tool selection when a single agent has too many tools, leading to wrong decisions. Another is context overflow from accumulation of history, tool results, and intermediate steps—without summarization or memory, the agent breaks. HITL interrupts also introduce a failure mode: if the policy is too permissive, unsafe actions may execute; if too restrictive, the system stalls waiting for human input that may never come. Multi‑agent systems introduce their own edge cases: mis‑routing between subagents, stale state across threads, and the overhead of coordinating multiple LLM calls that can compound latency. These failure modes directly motivate using workflows (i.e., more deterministic edges) to constrain agent autonomy where appropriate, balancing flexibility against reliability.

In depth, piece by piece

Each piece below: the plain-language take, the system-design detail, then real example code from the official docs.

The autonomy spectrum

In plain terms. Think of it like planning a road trip versus exploring a new city. A workflow is a fixed route—you know every turn, exit, and stop before you leave. It’s predictable, cheap, and fast for routine trips. An agent is like wandering without a map, deciding each street based on what looks interesting—flexible but slower, costlier, and less predictable. Real systems blend both: they follow a safe route for most of the journey, then let the agent explore only when the destination is uncertain. Without that balance, you either waste time overcomplicating simple tasks or lose control on open-ended ones.

System design. The spectrum from workflows to agents is captured directly by LangGraph’s graph model: at one extreme, workflows have “predetermined code paths and are designed to operate in a certain order” (from langgraph-workflows-agents.md); at the other, agents are “dynamic and define their own processes and tool usage”. LangGraph provides the low‑level machinery to build both and everything in between by composing State, Nodes, and Edges. The concrete mechanism is a graph execution engine where “nodes do the work, edges tell what to do next” (langgraph-graph-api.md). Execution proceeds in discrete super‑steps: nodes that run in parallel belong to the same super‑step, while sequential nodes are in separate super‑steps. A node becomes active when it receives new messages on its incoming edges (channels). For workflows, edges are fixed (deterministic branches); for agents, edges can be conditional functions that call an LLM with a structured output schema (e.g., Route with Literal["poem", "story", "joke"]) to decide the next node. The Send API (langgraph-workflows-agents.md) allows dynamic creation of worker nodes with isolated state, enabling the orchestrator‑worker pattern—a hybrid where the orchestrator node is a predetermined workflow step that spawns autonomous workers.

The built-in trade-off is predictability versus flexibility. Workflows are cheap, deterministic, and easy to debug because every path is known when the graph is compiled. Agents, by contrast, incur higher latency and cost because each LLM call can branch arbitrarily. The practical guidance that follows from the workflow-versus-agent distinction (“Workflows have predetermined code paths … Agents are dynamic …”) is to start with the simplest design that works and add autonomy only where the task is open-ended. The performance comparison table in langchain-multi-agent.md quantifies the cost: for a multi-domain task, the router pattern uses 5 model calls and ~9K tokens, while the handoffs pattern uses 7+ calls and ~14K+ tokens. The router behaves like a workflow (the routing step is a fixed classification), whereas handoffs involve agent-to-agent transfers with higher overhead. The table's “Best fit” guidance points to subagents and router when predictability and cost are key.

A rejected alternative implied by the source is using a single monolithic agent with a large toolset instead of a multi‑agent architecture. The langchain-multi-agent.md document explicitly states: “a single agent with the right (sometimes dynamic) tools and prompt can often achieve similar results.” This alternative is rejected when the agent has “too many tools and makes poor decisions about which to use” or when tasks require “specialized knowledge with extensive context (long prompts and domain‑specific tools)”. The skills pattern is presented as a middle ground: a single agent loads specialized context on‑demand, staying in control while still limiting overhead. The subagents pattern is another alternative that centralizes routing through a main agent, sacrificing direct user interaction (rated “⭐” for direct user interaction) but gaining distributed development and parallelization.

A concrete failure mode is context-window overflow in an agent that accumulates too much history or intermediate state. The source in langchain-agents.md describes the problem: as an agent runs and accumulates history, tool results, and intermediate steps, “that window fills.” LangGraph addresses this with persistence (stateful agents with short-term and long-term memory), summarization middleware (SummarizationMiddleware compresses history before overflow), and memory middleware that loads persistent instructions at startup. Without these mitigations, a long-running agent would either hit the context limit and produce degraded results or fail completely. Another edge case is parallel-worker contention in the Send API: all workers write to a shared state key annotated with operator.add, which concatenates their outputs so that no worker discards another's result. That guarantee comes from the reducer, not from the Send API itself. If the key's reducer replaces the value instead of appending to it, each worker overwrites the previous one and only the last result survives, so a mis-designed state schema can cause silent data loss.
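The reducer point can be made concrete with a minimal, library-free sketch. It does not use LangGraph; `fold_updates` and `replace` are hypothetical names chosen for illustration, and the worker outputs are made up. It only contrasts an appending reducer (`operator.add`) with one that replaces the value.

```python
from operator import add


def fold_updates(initial, updates, reducer):
    """Apply each worker's update to the shared value using the given reducer."""
    value = initial
    for update in updates:
        value = reducer(value, update)
    return value


def replace(old, new):
    """A badly chosen reducer (hypothetical): keeps only the most recent write."""
    return new


# Three parallel workers each return a one-element list for the same state key.
worker_outputs = [["intro"], ["body"], ["summary"]]

appended = fold_updates([], worker_outputs, add)       # every result is kept
replaced = fold_updates([], worker_outputs, replace)   # earlier results are lost

print(appended)
print(replaced)
```

With the appending reducer all three sections survive; with the replacing reducer only the last worker's output remains, and nothing signals the loss.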

The conditional edge shows model-directed routing on top of a fixed workflow skeleton.

python

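from typing import Literal

from langgraph.graph import StateGraph, MessagesState, START, END

# llm_call and tool_node are node functions assumed to be defined elsewhere
# (the model call and the tool executor).

def should_continue(state: MessagesState) -> Literal["tool_node", END]: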
    """Decide if we should continue the loop or stop based upon whether the LLM made a tool call"""
    messages = state["messages"]
    last_message = messages[-1]
    if last_message.tool_calls:
        return "tool_node"
    return END


agent_builder = StateGraph(MessagesState)
agent_builder.add_node("llm_call", llm_call)
agent_builder.add_node("tool_node", tool_node)
agent_builder.add_edge(START, "llm_call")
agent_builder.add_conditional_edges(
    "llm_call",
    should_continue,
    ["tool_node", END]
)
agent_builder.add_edge("tool_node", "llm_call")
agent = agent_builder.compile()

Durable execution

In plain terms. Think of a checkpointer like a video game that auto-saves after every level. Without it, a crash or restart would force you back to the title screen. With it, the agent saves a snapshot of its current state—what it knows and what step it's on—under a unique thread ID (like a save file). When you resume with the same ID, the agent skips already-completed steps and re-runs only the ones that follow, including any LLM calls or tool uses. This means a power outage, a pause for human approval, or a node failure doesn’t lose progress. The saved state is the full state values and the next node to execute, so the agent picks up exactly where it left off, not from scratch.

System design. The persistence layer in LangGraph works by compiling a graph with a BaseCheckpointSaver (e.g., InMemorySaver or AsyncPostgresSaver). On each super-step — a single tick where all nodes scheduled for execution run, potentially in parallel — the checkpointer writes a full checkpoint (StateSnapshot) containing values (the current state dictionary), next (tuple of node names to execute next), metadata (including source, writes, and step counter), tasks (with per-node id, name, and any interrupts), and parent_config linking to the prior checkpoint. Additionally, as each node within a super-step finishes, its outputs are durably persisted as pending writes in the checkpointer’s checkpoint_writes table. These per-task writes are not full StateSnapshot objects; they serve as intermediate state so that if another node in the same super-step fails, the successful nodes’ writes are not recomputed on resume. When the graph is re‑invoked with the same thread_id (and optionally a specific checkpoint_id), the checkpointer loads the matching checkpoint via .get_tuple, skipping all nodes before that point. Nodes after the checkpoint are re‑executed exactly — including any LLM calls, API requests, or interrupts — which are always re‑triggered during replay. This mechanism is what enables crash recovery, pause/resume, and time travel.
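The save-and-resume behaviour described above can be sketched without any library. This is a toy model under stated assumptions, not LangGraph's checkpointer: `ToyCheckpointer`, `run`, and the step names are hypothetical, steps run strictly in sequence (one node per super-step), and the checkpoint holds only the state values and the index of the next step.

```python
import copy


class ToyCheckpointer:
    """In-memory stand-in for a checkpoint saver, keyed by thread_id."""

    def __init__(self):
        self._saved = {}

    def put(self, thread_id, values, next_index):
        self._saved[thread_id] = (copy.deepcopy(values), next_index)

    def get(self, thread_id):
        return self._saved.get(thread_id)


def run(steps, initial, thread_id, saver, fail_at=None):
    """Run steps in order, checkpointing after each; resume if a checkpoint exists."""
    saved = saver.get(thread_id)
    values, start = saved if saved else (dict(initial), 0)
    values = copy.deepcopy(values)
    for index in range(start, len(steps)):
        if index == fail_at:
            raise RuntimeError(f"simulated crash in step {index}")
        values = steps[index](values)
        saver.put(thread_id, values, index + 1)  # save values plus the next step
    return values


calls = []  # records which steps actually executed


def make_step(name):
    def step(values):
        calls.append(name)
        return {**values, "done": values["done"] + [name]}
    return step


steps = [make_step("fetch"), make_step("analyze"), make_step("report")]
saver = ToyCheckpointer()

try:
    run(steps, {"done": []}, "thread-1", saver, fail_at=2)  # crashes before "report"
except RuntimeError:
    pass

result = run(steps, {"done": []}, "thread-1", saver)  # same thread_id: resumes
print(result["done"])
print(calls)
```

The second call reuses the thread's checkpoint, so "fetch" and "analyze" are not executed again and only "report" runs.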

The design trades storage and checkpoint write overhead for fault tolerance and human‑in‑the‑loop capabilities. By persisting a snapshot after every super-step, the system can restart from the last consistent state without replaying the entire history. The addition of pending writes introduces a further trade‑off: they use extra disk space and write latency, but they ensure that even a partial failure within a super-step does not force re‑execution of already‑completed nodes. This is especially valuable in long‑running agents where a single node failure (e.g., a transient API timeout) would otherwise waste the work of all parallel nodes. The checkpoint also stores metadata like writes and step, which allows operators to inspect exactly what a node produced before it was committed. Without this layered approach, every crash would require restarting from the initial input, losing all intermediate computations and making human intervention impossible.

The documentation implicitly rejects the alternative of no persistence, that is, running a graph without a checkpointer. Without a checkpointer, there is no way to pause execution for human approval, no conversational memory across invocations, and no ability to resume after a crash. The documentation explicitly states that a checkpointer is required for human-in-the-loop, memory, time travel, and fault tolerance. Another rejected alternative would be a naïve approach that only saves state at the end of the entire graph run; that would lose all progress if a failure occurs mid-execution. The chosen design of per-super-step checkpoints plus pending writes provides granular recovery points while keeping the number of saved checkpoints manageable (one per super-step boundary, not per node).

A concrete failure mode arises from non‑deterministic nodes on replay. During replay, nodes after the checkpoint re‑execute unconditionally — including all LLM calls, API requests, and interrupts. If a node depends on mutable external state (e.g., a random number generator, a counter in an external database, or the current time), the replay may produce different outputs than the original run. This can lead to divergent state unless the node is designed to be idempotent or its side effects are carefully managed. Another edge case involves interrupts during replay: because interrupts are always re‑triggered, re‑invoking a graph that previously halted for human approval will pause again at the same node, even if the human already approved the action. This means replay from a checkpoint before an interrupt essentially forces the approval workflow to repeat, which may be undesirable in production monitoring scenarios. The system therefore expects that application logic handles replay carefully — for instance, by using deterministic seeds or caching external results — while the persistence layer itself provides a robust, idempotent checkpointing mechanism.
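One way to handle replay carefully is to cache the result of a non-deterministic call so that a replayed node sees the value from the original run. The sketch below is an assumption-laden illustration, not a LangGraph feature: `ResultCache`, `quote_price_node`, and the key scheme (thread ID plus task name) are hypothetical, and a random number stands in for an external API.

```python
import random


class ResultCache:
    """Hypothetical cache of external results, keyed by thread and task name."""

    def __init__(self):
        self._results = {}

    def get_or_compute(self, thread_id, task_name, compute):
        key = (thread_id, task_name)
        if key not in self._results:
            self._results[key] = compute()  # only the first execution hits the source
        return self._results[key]


cache = ResultCache()


def quote_price_node(state, thread_id):
    """Node whose external call would differ on every replay without the cache."""
    price = cache.get_or_compute(
        thread_id, "quote_price", lambda: random.randint(1, 10**9)
    )
    return {**state, "price": price}


first = quote_price_node({"item": "coffee"}, "thread-1")
replayed = quote_price_node({"item": "coffee"}, "thread-1")  # simulated replay
print(first == replayed)
```

In a real deployment the cache would need to be as durable as the checkpoints themselves; an in-memory dictionary is used here only to keep the example self-contained.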

Simple graph compiled with a checkpointer for durable execution, persisting state after each super-step under a thread_id.

python

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import InMemorySaver
from langchain_core.runnables import RunnableConfig
from typing import Annotated
from typing_extensions import TypedDict
from operator import add

class State(TypedDict):
    foo: str
    bar: Annotated[list[str], add]

def node_a(state: State):
    return {"foo": "a", "bar": ["a"]}

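def node_b(state: State):
    return {"foo": "b", "bar": ["b"]}

# Wire the nodes in sequence and compile with a checkpointer.
workflow = StateGraph(State)
workflow.add_node(node_a)
workflow.add_node(node_b)
workflow.add_edge(START, "node_a")
workflow.add_edge("node_a", "node_b")
workflow.add_edge("node_b", END)

graph = workflow.compile(checkpointer=InMemorySaver())

# Every super-step is checkpointed under this thread_id.
config: RunnableConfig = {"configurable": {"thread_id": "1"}}
graph.invoke({"foo": "", "bar": []}, config)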

Pausing for a human

In plain terms. Think of human-in-the-loop like a safety checkpoint before a teammate can send a sensitive email or move money. Without it, an automated assistant might act instantly on something risky. This guardrail pauses the system right before that action, shows you what it wants to do, and asks for your okay, a change, or a rejection. Because the system saves its place—like bookmarking a paused video—it can wait forever for your decision and then resume exactly where it left off, carrying out your choice. Without this checkpoint, you'd lose control and the assistant could run ahead on its own.

System design. Human-in-the-loop (HITL) is implemented as middleware (HumanInTheLoopMiddleware) that wraps the agent's tool-calling logic. The middleware is registered in the agent's middleware list during creation and configured with an interrupt_on dictionary mapping tool names to approval settings. Each entry can be True (interrupt with default decisions), False (auto-approve), or an InterruptOnConfig object that specifies allowed_decisions (e.g., 'approve', 'edit', 'reject', 'respond') and an optional when predicate. The when predicate is a callable that receives a ToolCallRequest and returns True to interrupt or False to auto-approve, so interruption can be gated on the call's arguments. Under the hood, when a tool call matches, the middleware calls LangGraph's interrupt() to halt execution. The graph's state, including the pending tool request, is persisted via a checkpointer (e.g., InMemorySaver for prototyping, AsyncPostgresSaver for production). The thread ID in the invocation config ties execution to a conversation thread, so the interruption and the later resumption are scoped to that thread.

The control flow after interruption is asymmetric. The human must provide decisions via Command(resume=...), where the resume value is a dictionary with a "decisions" key containing an ordered list of decision objects. Each decision has a "type" — approve, edit, reject, or respond — and, for edit, an "edited_action" with modified tool name/args; for reject or respond, a "message". The order of decisions must exactly match the order of tool calls in the interrupt batch. Upon resumption, the graph continues from the saved checkpoint, executing approved calls as normal, skipping rejected calls and injecting the human’s message as a ToolMessage, or substituting edited arguments. The middleware ensures only allowed decision types (from InterruptOnConfig) are accepted.
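The ordering rule is easier to see in code. The following is a plain-Python sketch of how decisions could be paired with pending tool calls by position; it is not the middleware's implementation. `apply_decisions` and the outcome tuples are hypothetical, while the decision shapes ("type", "edited_action", "message") follow the description above.

```python
ALLOWED = {"approve", "edit", "reject", "respond"}


def apply_decisions(tool_calls, decisions):
    """Pair each pending tool call with the decision at the same index."""
    if len(decisions) != len(tool_calls):
        raise ValueError("need exactly one decision per pending tool call")
    outcomes = []
    for call, decision in zip(tool_calls, decisions):
        kind = decision["type"]
        if kind not in ALLOWED:
            raise ValueError(f"unknown decision type: {kind}")
        if kind == "approve":
            outcomes.append(("execute", call["name"], call["args"]))
        elif kind == "edit":
            edited = decision["edited_action"]
            outcomes.append(("execute", edited["name"], edited["args"]))
        elif kind == "reject":
            # The tool does not run; the explanation goes back to the model.
            outcomes.append(("skip", call["name"], decision["message"]))
        else:  # respond: the human's message stands in for the tool result
            outcomes.append(("respond", call["name"], decision["message"]))
    return outcomes


pending = [
    {"name": "execute_sql", "args": {"query": "DELETE FROM logs"}},
    {"name": "send_email", "args": {"to": "ops@example.com"}},
]
decisions = [
    {"type": "reject", "message": "Do not delete logs"},
    {"type": "approve"},
]
outcomes = apply_decisions(pending, decisions)
print(outcomes)
```

Because pairing is purely positional, swapping the two decisions would approve the DELETE and reject the email, which is why the review UI has to preserve the order of the interrupt request.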

The key trade‑off is safety versus latency. Interrupting introduces human latency—a run can wait indefinitely for review—but prevents the model from autonomously executing side‑effecting operations (file writes, SQL mutations, email sends). The checkpointed state makes indefinite waiting feasible because no in‑memory runtime state is lost; the graph is fully serialized after each step. The price is operational complexity: you must manage a persistent checkpointer, handle the interrupt cycle in your application’s UI, and ensure the human reviewer receives enough context to make a decision. The design explicitly prioritises user‑in‑the‑loop guarantees over throughput, which is appropriate for high‑stakes actions but overkill for read‑only or idempotent operations—hence the when predicate can skip interruption for safe calls.
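A predicate that skips review for safe calls might look like the sketch below. It is only an illustration of the idea: the request is modelled as a plain dictionary with an "args" entry rather than the real ToolCallRequest object, `needs_review` is a hypothetical name, and the regular expression is a deliberately simple heuristic for spotting write statements in a SQL tool.

```python
import re

# Statements that modify data or schema; anything else is treated as read-only.
WRITE_STATEMENT = re.compile(
    r"^\s*(insert|update|delete|drop|alter|truncate)\b", re.IGNORECASE
)


def needs_review(request):
    """Return True to interrupt for a human, False to auto-approve."""
    query = request["args"].get("query", "")
    return bool(WRITE_STATEMENT.match(query))


read_call = {"name": "execute_sql", "args": {"query": "SELECT * FROM users"}}
write_call = {"name": "execute_sql", "args": {"query": "  delete from users"}}

print(needs_review(read_call))
print(needs_review(write_call))
```

A keyword check like this is easy to bypass (for example with a leading comment or a multi-statement query), so it suits a sketch better than a production guardrail.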

The source implies a rejected alternative: monolithic server‑side tool execution without middleware. Without HITL, every tool call runs automatically, trusting the model entirely. That pattern is simpler but offers no control over destructive actions. Another alternative visible in the docs is the headless‑tool pattern for browser‑based flows: a tool is defined with only a schema (name, description, args_schema) and no Python implementation; when the model calls it, the run interrupts, the frontend inspects the payload via SDK hooks, performs the action (e.g., a human clicking a button), then resumes with the result. This moves the human decision to the client side and treats the tool as a placeholder for interactive approval, a different architectural split that still relies on the same interrupt/resume mechanism.

A concrete failure mode is mismatched decision ordering when multiple tools are paused simultaneously. The middleware enforces that “decisions must be provided in the same order as the actions appear in the interrupt request.” If the application presents the actions to the reviewer in a different order, the wrong decision will be applied to the wrong tool. Another edge case is the warning about editing: “make changes conservatively. Significant modifications to the original arguments may cause the model to re‑evaluate its approach and potentially execute the tool multiple times or take unexpected actions.” This happens because the model sees the edited args as the actual tool result, which might trigger further reasoning loops. Additionally, using respond to deny a side‑effecting tool (e.g., “reject the SQL DELETE”) is incorrect—respond treats the message as a successful tool result, so the model believes the action completed. The documentation explicitly says “do not use respond to deny side‑effecting tools.”

Pause the agent before a sensitive tool call and resume with human approval.

python

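from langgraph.types import Command

# Both calls share one config: the thread_id identifies the paused run.
config = {"configurable": {"thread_id": "some_id"}}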
result = agent.invoke(
    {
        "messages": [{"role": "user", "content": "Delete old records from the database"}]
    },
    config=config,
    version="v2",
)
# result.interrupts shows the pending action (e.g., execute_sql)
agent.invoke(
    Command(resume={"decisions": [{"type": "approve"}]}),
    config=config,
    version="v2",
)

Long-term memory

In plain terms. Imagine a file drawer with a labeled folder for each person you talk to. Long-term memory works like that: the agent stores facts, preferences, and past experiences in a folder (a namespace) and can pull them out at the start of any new conversation, then add new notes during or after the chat. Short-term memory is just the current chat’s message history—it disappears when the thread ends. Without long-term memory, the agent would forget you completely each time you start a new conversation, making every interaction feel like meeting a stranger.

System design. An agent remembers across conversations through a long-term memory store — a persistent key-value document database organized by custom namespace (like a folder) and key (like a filename). In LangGraph, this is provided by a BaseStore implementation such as InMemoryStore (for development) or a production database adapter. Memories are JSON documents; cross-namespace searching is supported through content filters. At the start of a run, the agent reads relevant memories by calling store.get(namespace, key) or store.search(namespace) inside a node — for example, a call_model node that fetches stored instructions and injects them into a prompt template. Across runs (different threads with different thread_id), the same store is accessed, so facts, preferences, and past experiences are available in any session, unlike short-term memory which is only available within a single thread and is persisted as part of the agent’s state via a checkpointer.

Writing new memories can happen in the hot path or in the background. In the hot path, the agent decides to save a memory during runtime — typically via a tool call (e.g., ChatGPT’s save_memories tool) that upserts a content string into the store. This gives real‑time updates: the next interaction immediately sees the new memory. Alternatively, memories can be written as a background task, completely separate from the agent’s main execution. This eliminates latency in the primary application and decouples memory management from agent logic, but introduces a design decision: how frequently to write (e.g., every N minutes, on a cron schedule, or on specific triggers) and how to avoid redundant work. The official reference templates are memory-agent (hot path) and memory-service (background).
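The difference between the two write paths can be sketched with a dictionary-backed store and a queue. None of this is LangGraph's store API: `ToyStore`, `save_in_hot_path`, `queue_for_background`, and `flush_background` are hypothetical names, and the "background task" is simply a function called later.

```python
from collections import deque


class ToyStore:
    """Dict-backed stand-in for a namespaced memory store."""

    def __init__(self):
        self._items = {}

    def put(self, namespace, key, value):
        self._items[(namespace, key)] = value

    def get(self, namespace, key):
        return self._items.get((namespace, key))


store = ToyStore()
pending = deque()  # writes waiting for the background task


def save_in_hot_path(namespace, key, value):
    """Write immediately, during the agent's turn."""
    store.put(namespace, key, value)


def queue_for_background(namespace, key, value):
    """Defer the write; the agent's turn is not slowed down."""
    pending.append((namespace, key, value))


def flush_background():
    """What a scheduled job would do: drain the queue into the store."""
    written = 0
    while pending:
        store.put(*pending.popleft())
        written += 1
    return written


ns = ("my-user", "chitchat")
save_in_hot_path(ns, "tone", {"rule": "short, direct language"})
queue_for_background(ns, "language", {"rule": "English only"})

visible_before = store.get(ns, "language")  # not written yet
flushed = flush_background()
visible_after = store.get(ns, "language")
print(visible_before, flushed, visible_after)
```

The window between queuing and flushing is exactly the staleness risk of the background approach: another thread reading during that window would not see the new memory.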

The trade-off between hot path and background is classic latency vs. consistency. Hot path ensures newly learned information is immediately usable — critical for interactive agents that must adapt within a single conversation turn. But it forces the agent to multitask: while responding, it must also reason about what to save, increasing complexity and agent latency. Background writing avoids that overhead and lets the agent focus on its primary task, but risks staleness — if writes are too infrequent, other threads may proceed without relevant context. A concrete failure mode of the background approach is missing a critical memory because the background task has not yet triggered, leaving a user’s preference unremembered across sessions. Conversely, hot path failures occur when the agent incorrectly decides what to save (or fails to save) due to poor reasoning about memory relevance — a problem noted for tools like save_memories.

The alternative to long-term store-based memory is short-term thread-scoped memory, which maintains message history and other stateful data only for the current conversation thread. That history is part of the agent’s state, persisted by a checkpointer so the thread can be resumed later, but it is not shared across threads. Short-term memory is typically managed by trimming or summarizing messages when the context window is exceeded. The store‑based approach explicitly solves the cross‑thread recall problem: rather than relying on ever‑growing message lists, it stores structured facts, preferences, procedural rules, and even few‑shot examples as separate JSON documents. For instance, procedural memory (agent instructions) can be updated by a node like update_instructions that reads the current instructions from the store, asks an LLM to refine them based on conversation history, and writes the revised instructions back — all while keeping the core agent code unchanged. This clean separation of “memory management” from “application logic” is a key architectural advantage over stuffing everything into message history.

Storing user facts and preferences in a namespace-organized memory store for cross-conversation recall.

python

from langgraph.store.memory import InMemoryStore
from langgraph.store.base import IndexConfig

def embed(texts):
    return [[1.0, 2.0] for _ in texts]

store = InMemoryStore(index=IndexConfig(embed=embed, dims=2))
user_id = "my-user"
application_context = "chitchat"
namespace = (user_id, application_context)
store.put(namespace, "a-memory", {
    "rules": ["User likes short, direct language", "User only speaks English & python"],
    "my-key": "my-value",
})
item = store.get(namespace, "a-memory")
items = store.search(namespace, filter={"my-key": "my-value"}, query="language preferences")

Many agents, one system

In plain terms. Imagine a kitchen where one chef has every recipe and tool—soon they get confused, use the wrong knife, and mess up orders. Multi-agent architectures solve this by splitting the workload: a head chef (supervisor) delegates tasks to specialized cooks (subagents), each with a limited set of tools. When a task changes hands, the chef passes control directly (handoffs), keeping some shared notes like what’s in the oven but letting each worker work without seeing every other recipe. Without this, a single agent overloaded with too many tools or too much information makes poor decisions, just like the overwhelmed chef.

System design. A multi-agent architecture decomposes a monolithic agent into specialized components that coordinate via structured control flows. The primary mechanism is the subagents pattern: a main agent (supervisor/orchestrator) holds a set of tools, each of which is itself a full agent with its own isolated tools, prompt, and context. Routing passes through the main agent, which decides which subagent to invoke, e.g., a search subagent vs. a code_generation subagent. The subagent executes with its own context and returns a result to the main agent, which synthesizes the final response. This is implemented via middleware such as SubAgentMiddleware (from deepagents) and TodoListMiddleware. An alternative is handoffs, where agents transfer control to each other directly via tool calls (e.g., handoff_to_coffee_agent): no main agent sits in the middle, and each agent can respond to the user or hand off to another. The skills pattern keeps a single agent but loads specialized prompts and knowledge on demand, avoiding multiple agent instances while still isolating context. The router pattern uses a dedicated classification step to direct input to one of several specialized agents, then synthesizes the results.
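Context isolation in the subagents pattern can be illustrated with ordinary functions. This is a toy sketch, not the deepagents API: the subagents are plain functions, `route` is a keyword rule standing in for the main agent's model choosing a tool, and all names and strings are made up.

```python
def search_subagent(task):
    """Works in its own context and returns only a short answer."""
    private_context = [f"task: {task}", "tool call: search", "tool result: 40 raw hits"]
    return f"Top result for '{task}'", private_context


def code_subagent(task):
    private_context = [f"task: {task}", "tool call: run_tests", "tool result: 12 passed"]
    return f"Patch drafted for '{task}'", private_context


SUBAGENTS = {"search": search_subagent, "code_generation": code_subagent}


def route(request):
    """Stand-in for the main agent's model deciding which subagent tool to call."""
    return "code_generation" if "bug" in request.lower() else "search"


def main_agent(request):
    history = [("user", request)]
    name = route(request)
    answer, _private = SUBAGENTS[name](request)  # private context is discarded
    history.append((name, answer))               # only the result enters the history
    return answer, history


answer, history = main_agent("Fix the bug in the parser")
print(answer)
print(history)
```

The main agent's history grows by one entry per delegation, however much intermediate work the subagent did, which is what keeps its context clean.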

The trade-off is between centralized control with stateless isolation on one side and stateful efficiency on the other. Subagents provide strong context isolation (each subagent sees only its domain, keeping the main agent's context clean) and enable parallel execution (subagents can run concurrently) and distributed development (teams own independent subagents). However, this comes at a cost: each request incurs extra model calls (main agent routing plus subagent execution), and subagents are stateless by design, so each invocation repeats the full flow, even for identical repeat requests. The source shows that for a repeated “Buy coffee” query, subagents use 4 calls per turn (8 total), while stateful handoffs or skills use only 2-3 calls per turn (5 total). Handoffs sacrifice parallelization and isolation for lower latency on multi-turn conversations, because agent state persists across handoffs.
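The gap between stateless and stateful patterns widens with every repeat turn, which a few lines of arithmetic make visible. The sketch uses the figures quoted above (4 calls per turn for subagents; 5 calls over two turns for handoffs or skills). Splitting those 5 calls into 3 on the first turn and 2 on each repeat is an assumption made here for illustration, and `total_calls` is a hypothetical helper.

```python
def total_calls(first_turn, repeat_turn, turns):
    """Model calls over a conversation: one first turn plus identical repeat turns."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    return first_turn + repeat_turn * (turns - 1)


# Two turns, matching the totals quoted from the comparison table.
subagents_two = total_calls(first_turn=4, repeat_turn=4, turns=2)
stateful_two = total_calls(first_turn=3, repeat_turn=2, turns=2)

# Extrapolation to ten turns under the same per-turn assumptions.
subagents_ten = total_calls(first_turn=4, repeat_turn=4, turns=10)
stateful_ten = total_calls(first_turn=3, repeat_turn=2, turns=10)

print(subagents_two, stateful_two)
print(subagents_ten, stateful_ten)
```

Under these assumptions the stateless pattern costs 3 extra calls after two turns and 19 extra calls after ten, so the penalty grows linearly with conversation length.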

The rejected alternative is the single-agent approach—loading all tools and context into one agent. The source explicitly states: “a single agent with the right (sometimes dynamic) tools and prompt can often achieve similar results” and that multi-agent is needed when “a single agent has too many tools and makes poor decisions about which to use.” The skills pattern is a lighter alternative that still keeps one agent but avoids dumping all context upfront. Subagents go further by creating separate agent instances entirely—appropriate when context isolation is paramount or parallel subtasks are required.

A concrete failure mode is tool overload in the main agent when it holds too many subagent tools. The main agent itself can suffer from the same problem it was meant to solve: too many tools degrade its routing decisions. The source warns that multi-agent is valuable when “a single agent has too many tools and makes poor decisions about which to use.” If the main agent's tool list (each tool being a subagent) becomes large, it may struggle to choose correctly. Another edge case is that repeat requests with subagents lead to wasted cost: each invocation reinitializes the subagent (stateless), even though the main agent retains conversation history. The source's performance comparison shows subagents at 4 calls per turn, every turn. For high-frequency identical queries, handoffs or skills would be far more efficient because they reuse loaded state or avoid handoff overhead entirely. Additionally, mixing patterns is supported (e.g., a subagent internally uses skills), but a misdesign, such as using a stateful agent as a stateless subagent, could cause inconsistent behavior.

Delegating tasks to subagents with SubAgentMiddleware for parallel execution.

python

from deepagents.backends import StateBackend
from deepagents.middleware import FilesystemMiddleware
from deepagents.middleware.subagents import SubAgentMiddleware
from langchain.agents import create_agent
from langchain.agents.middleware import TodoListMiddleware
from langchain.tools import tool

@tool
def search(query: str) -> str:
    """Search for a query and return a short summary."""
    return f"Search results for: {query}"

Keeping autonomy honest

In plain terms. Imagine an autonomous agent as a delivery robot. Without safeguards, it could drive in circles forever (unbounded loop), burn battery (cost), forget where it’s going after a long trip (drifting off‑task), make a dangerous move without asking permission, or lose all its progress if it crashes. Controls act like safety features: step limits give the robot a maximum number of turns to complete delivery; recursion limits stop it from repeating the same block eternally; summarization continuously compresses its roadmap so only the most relevant instructions stay visible; human approval gates require a person to press “OK” before any risky action; and durable checkpoints save the robot’s state after every street corner so it can resume exactly where it left off after a reboot. Without these, the agent quickly becomes unreliable and expensive.

System design. Unbounded agent loops and runaway cost are controlled through explicit fault-tolerance middleware: Retries, Fallbacks, and Call Limits (from langchain-agents.md "Fault tolerance"). The performance comparison tables in langchain-multi-agent.md quantify cost as model calls and total tokens per task. For example, the Subagents pattern makes 5 calls for a multi-domain query, and the repeated request ("Buy coffee again") shows that subagents are "stateless by design": each turn re-executes the full flow, so calls accumulate with nothing in the pattern itself to bound them. The control mechanism is a hard step limit enforced at the agent or middleware level. A call limit on the number of LLM invocations per session prevents the system from running indefinitely. Without it, a single buggy agent could spawn a cascade of tool calls (e.g., a subagent that loops by calling itself through a misunderstood tool) until quota is exhausted.
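The idea of a per-session call limit can be sketched without any framework. This is a plain-Python illustration, not the LangChain call-limit middleware: `CallBudget`, `CallLimitExceeded`, and `run_loop` are hypothetical names, and the "model call" is simulated by a counter.

```python
class CallLimitExceeded(RuntimeError):
    """Raised when a session tries to exceed its model-call budget."""


class CallBudget:
    """Counts model calls for one session and refuses calls past the limit."""

    def __init__(self, max_calls: int) -> None:
        if max_calls < 1:
            raise ValueError("max_calls must be at least 1")
        self.max_calls = max_calls
        self.used = 0

    def charge(self) -> None:
        """Record one model call, or raise if the budget is already spent."""
        if self.used >= self.max_calls:
            raise CallLimitExceeded(f"call limit of {self.max_calls} reached")
        self.used += 1


def run_loop(budget: CallBudget, is_done) -> str:
    """Toy agent loop: one simulated model call per iteration."""
    step = 0
    while True:
        try:
            budget.charge()
        except CallLimitExceeded:
            return "stopped: call limit reached"
        step += 1
        if is_done(step):
            return f"finished after {step} calls"


# A buggy agent that never decides it is done is still bounded.
print(run_loop(CallBudget(max_calls=5), is_done=lambda step: False))
# A well-behaved agent finishes inside the budget.
print(run_loop(CallBudget(max_calls=5), is_done=lambda step: step == 3))
```

Charging the budget before each call, rather than after, means the loop can never make call number `max_calls + 1`, even if the stop condition is broken.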

Drifting off-task on long horizons is addressed by context management middleware. SummarizationMiddleware compresses history before the context window overflows ("summarization compresses history before overflow hits"), which keeps the model's attention on the current task. MemoryMiddleware loads persistent instructions at startup, and SkillsMiddleware surfaces domain knowledge on demand rather than loading everything upfront. The Skills pattern in particular ("Skills: A single agent loads specialized prompts and knowledge on-demand while staying in control") implements focused context: the agent only sees relevant documentation per call (~2000 tokens per skill). The rejected alternative is the Router pattern, which classifies input once and delegates to a specialized agent. If that routing decision is wrong, or the classification drifts over time, requests may repeatedly be sent to the wrong specialized agent and work from irrelevant context.
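Compressing history before overflow can be illustrated with a small framework-free sketch. It is not the SummarizationMiddleware implementation: `compress_history` and `stub_summarize` are hypothetical helpers, the token count is approximated by counting words, and a real system would call a model to write the summary.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def compress_history(
    messages: List[str],
    max_tokens: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace older messages with one summary once the budget is exceeded."""
    total = sum(approx_tokens(m) for m in messages)
    split = len(messages) - keep_recent
    if total <= max_tokens or split <= 0:
        return list(messages)  # under budget, or nothing old enough to fold
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent


def stub_summarize(older: List[str]) -> str:
    """Placeholder summarizer; a real one would call a model."""
    return f"[summary of {len(older)} earlier messages]"


# Six messages of 12 words each: 72 "tokens" in total.
history = [f"message {i} " + "word " * 10 for i in range(6)]

compressed = compress_history(history, max_tokens=40, keep_recent=2,
                              summarize=stub_summarize)
print(len(compressed))
print(compressed[0])
```

Keeping the most recent messages verbatim preserves the details the model needs for its next step, while the summary stands in for everything older.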

Irreversible actions are gated by steering middleware: "Human-in-the-loop approval before high-impact actions" (from the "Steering" card in langchain-agents.md). LangGraph further provides human-in-the-loop capabilities that let developers "inspect and modify agent state at any point". The concrete mechanism is a tool or middleware that pauses execution, surfaces the intended action to a human operator, and proceeds only upon explicit approval. The Subagents pattern, where "results flow back through the main agent", inherently adds a coordination step that could serve as a natural halting point for oversight. A failure mode occurs when a subagent is granted direct user interaction, which the pattern is not designed for (the "Direct user interaction" column in the choosing-pattern table shows Subagents at ⭐, meaning they generally do not talk to the user). In such a misconfiguration, a high-impact action (e.g., deleting a production database) could execute without passing a human gate. The control is never to expose such tools without an approval middleware layer.
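The gating logic itself is simple, as the following plain-Python sketch suggests. It is synchronous and framework-free, so it does not show the pause-and-persist behavior of a real interrupt; `HIGH_IMPACT_TOOLS`, `execute_tool`, and the tool names are hypothetical, and the `approve` callback stands in for a human operator.

```python
from typing import Any, Callable, Dict

# Hypothetical set of tools that must never run without approval.
HIGH_IMPACT_TOOLS = {"delete_database", "send_payment"}


def execute_tool(
    name: str,
    args: Dict[str, Any],
    tools: Dict[str, Callable[..., str]],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> str:
    """Run a tool, asking for approval first if it is high-impact."""
    if name in HIGH_IMPACT_TOOLS and not approve(name, args):
        return f"rejected: {name} was not approved"
    return tools[name](**args)


# Stub tools for the demonstration; nothing is actually deleted.
tools = {
    "delete_database": lambda db: f"deleted {db}",
    "search": lambda query: f"results for {query}",
}


def deny(name: str, args: Dict[str, Any]) -> bool:
    """Stand-in for a human who rejects the action."""
    return False


def allow(name: str, args: Dict[str, Any]) -> bool:
    """Stand-in for a human who approves the action."""
    return True


print(execute_tool("delete_database", {"db": "prod"}, tools, approve=deny))
print(execute_tool("delete_database", {"db": "prod"}, tools, approve=allow))
print(execute_tool("search", {"query": "coffee"}, tools, approve=deny))
```

Placing the check inside the single function that executes tools means every caller, including a subagent, passes through the same gate.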

Lost work on infrastructure failure is mitigated by Persistence in LangGraph: "Build agents that persist through failures and can run for extended periods, resuming from where they left off." The langgraph-durable-execution.md file shows concrete state snapshots with thread_id, checkpoint_id, and parent config: a checkpointed state that can be restored after a crash. LangGraph also supports streaming, and debugging with LangSmith to trace failures. The rejected alternative is an in-memory agent that loses all state on a crash. The trade-off is that persistence adds latency (writing checkpoints) and storage cost; if the agent produces a large intermediate state (e.g., a long file from FilesystemMiddleware), a checkpoint may become too large to restore quickly. Checkpointing is designed for long-running agents (LangGraph's "durable execution"), and pairing it with SummarizationMiddleware helps keep state size manageable.
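The checkpoint-and-resume idea can be shown with the standard library alone. This sketch is not LangGraph's persistence layer: `run_with_checkpoints` is a hypothetical function, the JSON file stands in for a checkpoint store, and the file name only mimics a thread identifier.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def run_with_checkpoints(
    steps: List[str], path: Path, crash_at: Optional[str] = None
) -> dict:
    """Run named steps in order, saving state after each one.

    If a checkpoint file already exists, completed steps are skipped.
    """
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"done": [], "log": []}
    for name in steps:
        if name in state["done"]:
            continue  # finished before the crash; do not recompute
        if name == crash_at:
            raise RuntimeError(f"simulated crash before {name}")
        state["log"].append(f"ran {name}")
        state["done"].append(name)
        path.write_text(json.dumps(state))  # checkpoint after every step
    return state


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "thread-1.json"
    try:
        # First run crashes before the last step.
        run_with_checkpoints(["fetch", "plan", "act"], checkpoint, crash_at="act")
    except RuntimeError:
        pass
    # Second run resumes from the saved state.
    final = run_with_checkpoints(["fetch", "plan", "act"], checkpoint)

print(final["log"])
```

Each step appears in the log once, which is the property that matters: the resumed run continues from the last saved step instead of repeating completed work.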

RemainingSteps exposes how many steps are left before the recursion limit, so the graph can return a partial result and stop before it hits that limit, preventing runaway cost.

python

from langgraph.graph import StateGraph, START, END
from langgraph.managed import RemainingSteps
from typing import Annotated, Literal, TypedDict

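class State(TypedDict):
    # Reducer: concatenate new messages onto the existing list.
    messages: Annotated[list, lambda x, y: x + y]
    # Managed value filled in by LangGraph: steps left before the recursion limit.
    remaining_steps: RemainingSteps

def agent_with_monitoring(state: State) -> dict:
    remaining = state["remaining_steps"]
    # Wrap up early instead of running into the recursion limit.
    if remaining <= 2:
        return {"messages": ["Approaching limit, returning partial result"]}
    return {"messages": [f"Processing... ({remaining} steps remaining)"]}

def route_decision(state: State) -> Literal["agent", END]:
    # Same threshold as the node: end the run once steps are nearly exhausted.
    if state["remaining_steps"] <= 2:
        return END
    return "agent"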

builder = StateGraph(State)
builder.add_node("agent", agent_with_monitoring)
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", route_decision)
graph = builder.compile()
