Lesson 5 of 9 in Phase 6 · LangChain & LangGraph

LangGraph Multi-Agent Topologies: Supervisor, Swarm & Hierarchical Teams

Phase 6 · LangChain & LangGraph · Intermediate · ~8 min read
Recommended prerequisite: #46 LangGraph Human-in-the-Loop: Interrupts, Approval Gates & Time Travel
Previous: LangGraph Human-in-the-Loop: Interrupts, Approval Gates & Time Travel
Next: LangGraph Checkpointing on Cloudflare D1: Durable State & Resumable Graphs

One model with twenty tools degrades fast: the prompt bloats, tool selection gets noisy, and a single failure poisons everything. The fix is to split the work across several specialized agents and give them a coordination structure. LangGraph expresses each agent as a node (often a subgraph) and the coordination as edges, so a multi-agent system is just a larger graph built on the same node-and-edge model covered in LangGraph. This lesson covers the three canonical topologies (supervisor, swarm, and hierarchical teams) and the handoff mechanics that connect them, deployed on Cloudflare Workers. For the general theory see Multi-Agent Systems and Agent Orchestration; for single-agent design see Agent Architectures.

Mental Model

What problem does it solve?

A monolithic agent must hold every tool, instruction, and constraint in one context window. As capability grows, accuracy falls: the model confuses similar tools, exceeds token budgets, and cannot be evaluated per skill. Decomposition restores accuracy by narrowing each agent's scope: give a "researcher" only retrieval tools, a "coder" only the sandbox, and a "writer" only formatting. Each has a short, sharp prompt and is independently testable. The remaining problem (who works next, and how control passes) is what the topology answers.

The org-chart analogy

Three structures map to org charts. A supervisor is a manager who reads each result and assigns the next worker; workers never talk to each other. A swarm is a flat team of peers who hand the task directly to whoever is most relevant — no manager, control flows laterally. A hierarchy nests these: a top supervisor delegates to team supervisors, each running their own sub-team. Pick the flattest structure that still keeps routing decisions tractable.

A supervisor in ~12 lines

python
from typing import Literal

from langgraph.graph import StateGraph, START, END
from langgraph.types import Command

# State, route_llm, researcher_subgraph and writer_subgraph are defined elsewhere.

def supervisor(state: State) -> Command[Literal["researcher", "writer", "__end__"]]:
    # The Literal annotation declares the nodes this function may route to.
    nxt = route_llm(state["messages"])           # "researcher" | "writer" | "FINISH"
    if nxt == "FINISH":
        return Command(goto=END)                 # END is the "__end__" sentinel
    return Command(goto=nxt)

g = StateGraph(State)
g.add_node("supervisor", supervisor)
g.add_node("researcher", researcher_subgraph)
g.add_node("writer", writer_subgraph)
g.add_edge(START, "supervisor")
g.add_edge("researcher", "supervisor")          # always report back
g.add_edge("writer", "supervisor")
app = g.compile()

The supervisor returns a Command(goto=...) that names the next node; adding update={...} to the same Command would also write state, so routing is a value, not a side effect. In this snippet the supervisor only routes. Workers loop back to the supervisor, which decides again until FINISH. The diagram shows the topology.

Core Concepts

Handoffs via Command

A handoff is one agent transferring control and a payload to another. In LangGraph the carrier is Command(goto="agent_b", update={...}): it routes and writes state atomically, so the receiving agent sees the context the sender chose to pass. Handoffs can be implemented as tools (transfer_to_writer) the model calls, which lets the LLM itself decide routing — the basis of the swarm topology.
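The handoff-as-tool idea can be sketched without the library. The block below is a plain-Python model, not LangGraph code: `Handoff` is a hypothetical stand-in for `Command(goto=..., update=...)`, and `make_handoff_tool` and `apply_handoff` are illustrative helper names. It shows the one property that matters, namely that the route and the payload travel in a single value.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Handoff:
    """Hypothetical stand-in for Command(goto=..., update=...)."""
    goto: str
    update: dict[str, Any] = field(default_factory=dict)


def make_handoff_tool(sender: str, target: str) -> Callable[[str], Handoff]:
    """Build a transfer_to_<target> tool that the sender's model could call."""
    def transfer(task: str) -> Handoff:
        # Route and payload are returned together, so the receiver never
        # starts without the framing the sender chose to pass.
        return Handoff(goto=target, update={"task": task, "sender": sender})

    transfer.__name__ = f"transfer_to_{target}"
    return transfer


def apply_handoff(state: dict[str, Any], handoff: Handoff) -> tuple[str, dict[str, Any]]:
    """Merge the update into state and return the next node in one step."""
    return handoff.goto, {**state, **handoff.update}


transfer_to_writer = make_handoff_tool("researcher", "writer")
next_node, state = apply_handoff(
    {"messages": ["found 3 sources"]},
    transfer_to_writer("Summarize the 3 sources in one paragraph"),
)
```

After the call, `next_node` is `"writer"` and `state` carries both the earlier messages and the task framing, which is the behavior a real handoff tool would delegate to `Command`.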

Shared vs private state

Every agent reads and writes a shared state object, but flooding it with one agent's scratch work confuses the others. The discipline: keep a small shared channel (the task, the running answer) and scope verbose intermediate reasoning to a subgraph's private state that does not propagate up. This message-filtering choice is the single biggest driver of multi-agent reliability, and it ties directly into Context Engineering.
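As a minimal illustration of that discipline, the sketch below keeps a worker's scratch notes in a private state and returns only the shared keys. It is plain Python rather than a LangGraph subgraph, and the names `SharedState`, `ResearcherState`, `SHARED_KEYS`, and `run_researcher` are assumptions made for the example.

```python
from typing import TypedDict


class SharedState(TypedDict):
    task: str      # what the team is working on
    answer: str    # the running answer every agent may read


class ResearcherState(SharedState, total=False):
    scratch: list[str]   # private: verbose intermediate notes


SHARED_KEYS = ("task", "answer")


def run_researcher(shared: SharedState) -> SharedState:
    """Work in a private state, then write back only the shared keys."""
    private: ResearcherState = {**shared, "scratch": []}
    private["scratch"].append(f"query drafted for: {private['task']}")
    private["scratch"].append("skimmed 12 documents, kept 2")
    private["answer"] = "2 relevant documents found"
    # Filter at the boundary: scratch work never reaches the other agents.
    return {key: private[key] for key in SHARED_KEYS}


result = run_researcher({"task": "find prior art", "answer": ""})
```

The returned dictionary contains `task` and `answer` only; the two scratch entries stay inside the worker.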

How It Works

Swarm: peer handoff without a manager

In a swarm there is no router node. Each agent has handoff tools to its peers and decides, after acting, whether to finish or pass control. Control flows along whatever path the agents choose at runtime; the graph just provides the edges. Swarms minimize latency (no manager round-trip) but make global behavior harder to reason about — favor them when agents are few and roles are crisp.
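A swarm's control flow can be modeled as a loop in which each agent returns the name of the next peer, or `None` to finish. This is a plain-Python sketch, not the LangGraph runtime: the agents `triage`, `support`, and `billing`, the keyword rules that stand in for model decisions, and the `max_hops` guard are all assumptions for illustration.

```python
from typing import Any, Callable, Optional

State = dict[str, Any]
Agent = Callable[[State], Optional[str]]


def triage(state: State) -> Optional[str]:
    state["log"].append("triage")
    # A keyword rule stands in for the model choosing a handoff tool.
    return "billing" if "invoice" in state["task"] else "support"


def support(state: State) -> Optional[str]:
    state["log"].append("support")
    if "refund" in state["task"]:
        return "billing"                 # lateral handoff to a peer
    state["answer"] = "resolved by support"
    return None                          # this agent owns "done"


def billing(state: State) -> Optional[str]:
    state["log"].append("billing")
    state["answer"] = "refund issued"
    return None


AGENTS: dict[str, Agent] = {"triage": triage, "support": support, "billing": billing}


def run_swarm(start: str, state: State, max_hops: int = 6) -> State:
    """Follow handoffs until an agent finishes or the hop budget runs out."""
    current = start
    for _ in range(max_hops):
        nxt = AGENTS[current](state)
        if nxt is None:
            return state
        current = nxt
    raise RuntimeError("swarm did not terminate within max_hops")


final = run_swarm("triage", {"task": "refund please", "log": []})
```

No node in this loop decides for the others; the path `triage`, `support`, `billing` emerges from each agent's own return value.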

Hierarchical teams

When even the supervisor's routing prompt gets unwieldy, nest. A top supervisor delegates to team supervisors; each team is itself a supervisor-plus-workers subgraph. This bounds every routing decision to a small option set at each level, scaling to dozens of agents — the structure behind complex research agents and the orchestration patterns in Agent SDKs.
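The bounded option set is easiest to see in a two-level router. The sketch below is illustrative plain Python, with keyword rules standing in for the routing models; the team and worker names in `TEAMS` are hypothetical.

```python
TEAMS: dict[str, list[str]] = {
    "research": ["searcher", "reader"],
    "writing": ["drafter", "editor"],
}


def top_supervisor(task: str) -> str:
    """Choose a team; the option set is only the team names."""
    return "research" if "find" in task else "writing"


def team_supervisor(team: str, task: str) -> str:
    """Choose a worker; the option set is only this team's workers."""
    first, second = TEAMS[team]
    keyword = {"research": "read", "writing": "edit"}[team]
    return second if keyword in task else first


def route(task: str) -> tuple[str, str]:
    team = top_supervisor(task)
    return team, team_supervisor(team, task)


path = route("edit the draft for tone")
```

Each decision here picks among two options even though the system has four workers; adding a third team grows the top-level choice by one without touching any team's internal routing.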

Runtime Internals

Each agent node runs its own Pregel super-steps; a subgraph's internal steps are invisible to the parent except through the state keys it writes back. On Cloudflare, parallel branches (a supervisor fanning to two independent workers) execute as concurrent fetch calls bounded by max_concurrency, and a single checkpointer keyed by the parent thread_id records the whole tree so the entire multi-agent run is resumable and replayable — the same persistence used for human-in-the-loop gates between agents.
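The concurrency bound can be illustrated with a generic asyncio sketch. It is not the LangGraph or Cloudflare implementation: `run_branches` is a hypothetical helper, `asyncio.sleep` stands in for a worker's model or tool call, and the semaphore plays the role that `max_concurrency` plays in the description above.

```python
import asyncio


async def run_branches(
    workers: dict[str, float], max_concurrency: int
) -> tuple[dict[str, str], int]:
    """Run workers in parallel, never more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)
    active = 0
    peak = 0

    async def run_one(name: str, delay: float) -> tuple[str, str]:
        nonlocal active, peak
        async with sem:                    # wait for a free slot
            active += 1
            peak = max(peak, active)       # record the highest overlap seen
            await asyncio.sleep(delay)     # stand-in for the worker's call
            active -= 1
            return name, f"{name} done"

    pairs = await asyncio.gather(*(run_one(n, d) for n, d in workers.items()))
    return dict(pairs), peak


results, peak = asyncio.run(
    run_branches({"researcher": 0.01, "coder": 0.01, "writer": 0.01}, max_concurrency=2)
)
```

All three workers complete, but the recorded peak never exceeds the bound of two.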

Failure isolation and degraded modes

The strongest argument for decomposition is not accuracy — it is blast radius. In a monolith, one bad tool result or a model that loops corrupts the entire run. In a topology, a failing agent fails locally: the supervisor sees an error result and can retry that agent, route around it to a fallback, or return a partial answer with the failed sub-task flagged. This requires designing every agent boundary as a contract — a typed result that explicitly encodes success, partial, or failure — rather than assuming each agent always returns clean output. A research agent that times out should hand back {status: "partial", found: [...]}, letting the supervisor decide whether the writer can proceed with what exists or must escalate. Degraded-but-useful beats all-or-nothing for user-facing systems, and the per-agent result contract is also what makes each agent independently testable, the evaluation property that connects multi-agent design to Agent Evaluation and the safety gates in LangGraph Human-in-the-Loop. The same boundary that isolates failure also isolates cost: a runaway agent is capped by its own step budget instead of consuming the whole run.
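A result contract of this kind might look like the following sketch. The `AgentResult` fields mirror the `{status: "partial", found: [...]}` example above; the `error` field, the `supervisor_decide` function, and its routing policy are assumptions chosen for illustration, not a prescribed design.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

Status = Literal["success", "partial", "failure"]


@dataclass
class AgentResult:
    """Typed boundary contract: every agent reports how it ended."""
    status: Status
    found: list[str] = field(default_factory=list)
    error: Optional[str] = None


def supervisor_decide(result: AgentResult, retries_left: int) -> str:
    """Map a worker's result to the next step."""
    if result.status == "success":
        return "writer"
    if result.status == "partial" and result.found:
        return "writer"        # degraded but useful: proceed with what exists
    if retries_left > 0:
        return "researcher"    # retry the failing agent locally
    return "escalate"          # out of retries: flag the sub-task as failed


timed_out = AgentResult(status="partial", found=["source A", "source B"])
decision = supervisor_decide(timed_out, retries_left=1)
```

Because the contract is a plain value, each branch of `supervisor_decide` can be unit-tested without running any agent, which is the per-agent testability the paragraph describes.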

Common Pitfalls

Shared-state pollution. Writing every agent's chain-of-thought to the shared channel collapses accuracy; filter aggressively.

Supervisor ping-pong. A vague router prompt loops the supervisor between two workers forever; add a step counter and a FINISH bias.

Over-decomposition. Five agents for a two-step task adds latency and failure surface with no gain.

Lost handoff context. Command(goto=...) without an update hands off with no instructions; always pass the task framing.

No global termination. Swarms with no agent owning "done" never end; designate a terminal path.
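The step counter and FINISH bias suggested for supervisor ping-pong can be sketched as a guard around the router. This is plain Python under stated assumptions: `guarded_route` and `vague_router` are hypothetical names, the `steps` and `answer` state keys are invented for the example, and the bias is implemented as a simple rule (finish as soon as an answer exists) rather than as prompt wording.

```python
from typing import Any, Callable

State = dict[str, Any]


def guarded_route(state: State, route_fn: Callable[[State], str], max_steps: int = 6) -> str:
    """Wrap a router with a FINISH bias and a hard step budget."""
    if state.get("answer"):
        return "FINISH"                  # FINISH bias: stop once an answer exists
    state["steps"] = state.get("steps", 0) + 1
    if state["steps"] > max_steps:
        return "FINISH"                  # step counter: budget spent
    return route_fn(state)


def vague_router(state: State) -> str:
    """Simulates a router that alternates between two workers forever."""
    return "researcher" if state["steps"] % 2 else "writer"


def run(max_steps: int) -> list[str]:
    state: State = {"steps": 0}
    visited: list[str] = []
    while True:
        nxt = guarded_route(state, vague_router, max_steps)
        if nxt == "FINISH":
            return visited
        visited.append(nxt)


visited = run(max_steps=4)
```

Without the guard this loop would never return; with it, the run stops after four worker visits no matter what the router proposes.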

Comparison

Supervisor versus swarm: the supervisor centralizes control for predictability and easy evaluation at the cost of a manager round-trip per step; the swarm is faster and more flexible but harder to constrain and audit. Hierarchy versus flat: hierarchy scales routing past what one prompt can handle, at the cost of more nodes and deeper latency. Versus a single mega-agent from Agent Architectures, any multi-agent topology trades coordination overhead for sharper per-role prompts, isolated failures, and per-agent evaluation — worth it once one agent can no longer hold the whole job reliably.
