First Principles

🧱 The agentic-sales platform from the ground up — first the problem (many sales workflows on one shared runtime), then the three planes that solve it: control (graph identity), data (worker pools & routing), and observability (one trace), each with the real code.

This is a first-principles primer on the agentic-sales platform — the "why, from the ground up" companion to the Agents & Workflows reference. It starts from the problem the platform exists to solve and derives the architecture.

The problem. Many independent sales workflows — lead discovery, contact and company enrichment, classification and scoring, outreach email, durable campaigns — need to run without standing up a service per flow. A microservice per workflow would pay the platform tax N times, so they all share one LangGraph runtime. The deciding axis on every trade-off is failure-domain cost, not lines of code saved.

The three planes. The control plane is one registry that owns graph identity — a stable name maps to one graph definition and one input shape — and stays cheap to load. The data plane is per-capability worker pools reached through one routing contract, so a noisy graph's blast radius is its pool, not the platform. The observability plane is one distributed run tree across the TypeScript↔Python hop, so one user action is one debuggable thing.

Each piece below leads with a plain-language ELI5, then the system-design detail; the code-backed pieces also show the real code they come from. Every part is generated by LlamaIndex, grounded in the platform's own specs and control/data-plane source — not paraphrased from memory.

Explain it like I'm 5

Think of a shared commercial kitchen instead of each chef building their own backyard grill. The agentic-sales platform runs all those independent sales workflows—finding leads, classifying them, writing outreach—in one system. The control plane is like the head chef who knows every recipe and routes each order to the right station. The data plane groups workers by job (prep, grill, plating) so if one station catches fire, the rest keep running. The observability plane is the manager who can trace an entire order from ticket to plate. Sharing one kitchen avoids duplicating stoves, fridges, and cleanup—and keeps the whole operation debuggable and resilient.

The system-design view

The platform organizes its shared LangGraph runtime into three non-negotiable planes defined by the constitution in specs/agentic-sales/mission.md. The control plane owns graph identity and the routing contract: every workflow has one stable name mapping to exactly one graph definition and one expected input shape, and every call must pass through that contract — "There is no other way in." The concrete mechanism is a pure routing matrix lifted from apps/agentic-sales/src/lib/langgraph-client.ts (the WORKER_ROUTES / routeFor block at lines 53–114) into apps/ai-engineer-roadmap/roadmap-kg/kg/route_for.py. This file is deliberately pure: no httpx, no environment variables, no Workers globals, so that non-TypeScript callers (e.g., the ai-engineer-roadmap Cloudflare Worker) can reach the graph fleet via the unchanged /runs/wait contract without importing TypeScript. The control plane stays cheap to load by forbidding module-level imports inside graph submodules; the JSON generator builds the registry without compiling fifty graph definitions, and there are no LLM or database dependencies at load time.

The data plane implements per-capability worker pools: EMAIL, CLASSIFY, DISCOVERY (extendable). The DISCOVERY pool owns the consultancy, GitHub, Ashby, and YouTube graphs plus the company/product/contact intelligence surface. A noisy or expensive graph stays inside its own pool; the blast radius is at the pool level, not the platform. The routing contract in route_for.py directs incoming requests to the correct pool by matching public-core prefixes (/runs, /threads, /assistants, /linkedin/) after authentication, while internal prefixes (/_ml/*, /_research/*) return 403 before even the auth check, so an unauthenticated probe never learns the internal surface exists. Adding a new sub-worker is a single wrangler-vars edit on the dispatcher — no Vercel redeploy required.

The observability plane ensures that one user action shows up as a single debuggable thing across the TypeScript ↔ Python hop. The mechanism is a distributed run tree carried by four header families (not invented for this codebase; they are standards LangGraph, OpenTelemetry, and LangSmith already understand). All instrumentation is gated by environment variables (OTEL_EXPORTER_OTLP_ENDPOINT, LANGSMITH_TRACING); when unset, the tracer is a strict no-op with zero per-request latency overhead, zero network, and zero PII egress. The failure-mode catalog in observability.md documents concrete breakages:

Worker run shown as standalone root: getCurrentRunTree(true) returned undefined because no LangSmith run was open in the Next.js scope. The fix is to ensure the route opens a traceable() context.
Two separate trace IDs for one user action: a load balancer or proxy stripped the traceparent header. The fix is to verify that the headers arrive on the worker by checking that parse_inbound returns a non-null result.
New HTTP client bypasses propagation: a teammate writes a raw fetch(/runs/wait...) instead of going through runGraphWithMeta. The audit should show zero such occurrences in apps/agentic-sales/src/.
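
The environment-variable gating described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's tracer: only the two variable names come from the text, while `NoopTracer`, `RecordingTracer`, and `make_tracer` are hypothetical names introduced here.

```python
from contextlib import contextmanager
from typing import Iterator, Mapping

# Variable names taken from the text; everything else is hypothetical.
TRACING_ENV_VARS = ("OTEL_EXPORTER_OTLP_ENDPOINT", "LANGSMITH_TRACING")


class NoopTracer:
    """Does nothing: no network call, no buffering, nothing recorded."""

    enabled = False

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        yield


class RecordingTracer:
    """Stand-in for a real exporter; it only keeps span names in memory."""

    enabled = True

    def __init__(self) -> None:
        self.spans: list[str] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        self.spans.append(name)
        yield


def make_tracer(env: Mapping[str, str]):
    """Return a recording tracer only when a gating variable is set.

    Assumption: any non-empty value counts as "set". A real gate would
    probably also treat values such as "false" as disabled.
    """
    if any(env.get(var) for var in TRACING_ENV_VARS):
        return RecordingTracer()
    return NoopTracer()
```

Calling `make_tracer({})` yields the no-op variant, so instrumented code can always write `with tracer.span("run"):` without checking whether tracing is on.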

The core first principle is failure-domain cost, not lines of code saved. Each plane owns exactly one kind of risk: the control plane can fail independently of graph modules (no imports, no LLM, no DB), the data plane contains noisy graphs within their pool, and the observability plane is purely optional at runtime. The rejected alternative — standing up a microservice per workflow — would pay the platform tax N times: every new flow would need its own deployment pipeline, authentication, health-check endpoints, telemetry configuration, and routing setup. By sharing a single control plane and a small set of worker pools, the platform pays that tax exactly once, and the cost of adding a new graph is one entry in WORKER_ROUTES (or one wrangler-vars edit) plus membership in an existing or new pool. The trade-off is increased coordination complexity in the routing contract and stricter discipline around graph imports, but that overhead is bounded and inspectable, whereas a per-service approach would multiply both operational surface and blast radius.

The platform, from first principles

Each piece below: the plain-language take, then the system-design detail. The code-backed pieces also show the real control- and data-plane source.

The problem — many flows, one platform

In plain terms. Imagine one workshop with a central worktable and a single set of tools (paint, glue, saws) for all your projects. Each project (lead discovery, email, scoring) comes to the same table, grabs what it needs, and runs. The alternative is a separate workshop for every project: each gets its own table, tools, and cleanup routine, which multiplies the cost of maintaining every workbench. The deciding factor isn’t how many lines of instructions you save; it’s what a failure costs and how far it spreads. The shared workshop means one set of tools to maintain, and it only works if a mess at one station stays at that station.

System design. The mechanism is a three-plane architecture where graph identity and routing are owned by a single control plane, not per-workflow. Graph identity is a stable name for each workflow, and the routing contract (WORKER_ROUTES in apps/agentic-sales/src/lib/langgraph-client.ts) is the single rule that turns an incoming request into a graph invocation: which graph runs, on which worker pool, with which inputs. The pure routing matrix route_for.py is the single source of truth for per-assistant routing across the LangGraph fleet, lifted from the TypeScript source so non-TS callers can reach the graph fleet by speaking the unchanged /runs/wait contract. The data plane then owns per-capability worker pools (EMAIL, CLASSIFY, DISCOVERY – extendable), each containing multiple graphs. For example, DISCOVERY owns the consultancy, GitHub, Ashby, and YouTube graphs plus the company/product/contact intelligence surface. The observability plane provides a distributed run tree that crosses the TypeScript ↔ Python hop so one user action shows up as one debuggable thing across all three planes. Adding a new sub-worker is one wrangler-vars edit on the dispatcher, not a Vercel redeploy.

The trade-off is failure-domain cost, not lines of code saved. The rejected alternative – a microservice per workflow – would pay the platform tax N times: each service would require its own deployment pipeline, monitoring, auth, observability, and routing. Instead, the shared LangGraph runtime centralizes the routing contract and isolates failure at the pool level. A noisy graph stays inside its own pool; the blast radius is at the pool level, not the platform. The control plane is deliberately cheap to load – no LLM dependencies, no database dependencies – because the JSON generator builds the registry without compiling fifty graph modules. Module-level imports inside graph submodules are FORBIDDEN. This keeps the control plane fast and the blast radius manageable: if one graph crashes, only its pool is affected, not the entire platform.

Per-capability worker pools are the mechanism that avoids a microservice per workflow and its N-times platform tax. The three planes are the constitution every feature must respect, and each pool is a LangGraph worker (e.g., services/agentic-sales-*-worker/), not a dedicated service per workflow. The routing matrix route_for.py is pure (no httpx, no env, no Workers globals), so its parity tests run unchanged under CPython, using the same harness as the backend dispatcher test. That testability is a direct consequence of not scattering routing logic across N microservices: the alternative would force every new workflow to implement its own dispatcher, auth, and health endpoints, whereas the shared approach adds a new sub-worker through a single wrangler-vars edit.

A concrete failure mode: if a new sub-worker is added but the wrangler vars on the dispatcher are not updated, no route claims that worker's assistant_ids, and requests intended for the pool will 404 or fall through to the default CORE container instead of the intended pool. That breaks the routing contract: a DISCOVERY request would run outside its pool, or fail because the graph is not defined where the request landed. Another edge case: if someone bypasses the control plane and invokes a graph directly (e.g., by calling the worker's /runs/wait endpoint without going through the dispatcher), the observability plane loses the single debuggable tree across the TypeScript ↔ Python hop, and the blast radius expands because the request never hit the routing contract. Every call must go through that contract: "There is no other way in." The observability plane also has its own non-negotiable: one user action = one debuggable thing across the TS ↔ Python hop. A direct invocation would create orphaned spans, violating that guarantee.

Graph identity — the control plane

In plain terms. Think of the registry like a restaurant’s menu book. Every dish (workflow) has one unique name, and next to it is the exact recipe card (graph definition) and the ingredients it expects. That menu is the control plane—the single place the kitchen checks to route an order. Without it, a waiter could shout “steak!” and cooks would argue over which pan to use. The menu is deliberately kept light—no pre-chopping or pre-heating—so the host can flip it open instantly without firing up every burner. A duplicate dish name would break the whole system: two recipes for the same order, and nobody knows which steak to cook.

System design. Graph identity is the invariant that every workflow bound to the platform resolves to exactly one stable name—the assistant_id—which maps one-to-one with a graph definition (its builder module, compiled attribute, and input shape). That mapping lives in the control plane, which owns both the registry and the routing contract. The registry is built, per mission.md, by a JSON generator that constructs it without compiling any graph submodules: “Module-level imports inside graph submodules are FORBIDDEN.” This means the generator can read a static description of all graphs (the allowlists in route_for.py, the DEFAULT_CLASSIFY_WORKER_ASSISTANTS and DEFAULT_DISCOVERY_WORKER_ASSISTANTS constants) and produce a pure, lightweight lookup table. The routing contract is then the rule (“which graph runs, on which worker pool, with which inputs”) that every inbound request must satisfy. route_for.py encodes that contract as a set of WorkerRoute dataclass instances, each with a prefix, a url, a secret, and a frozenset[str] of allowed assistant_ids. All calls pass through this contract; there is no other way in.

The trade-off that justifies this design is startup cost versus correctness. If the registry required importing every graph module at load time, the control plane would become heavy: each graph could pull in LLM clients, database connectors, or 50+ dependency trees. That would defeat the goal—stated in mission.md—that the control plane “is cheap to load — no LLM dependencies, no database dependencies.” By forbidding top-level imports inside graph submodules and relying on a precomputed JSON registry, the control plane stays fast enough to serve health checks and route decisions without spinning up any graph logic. The same cheapness allows the JSON generator to run in CI or during deployment to produce the registry artifact, verifying that every assistant_id is unique and that no two graphs claim the same name, without ever invoking the graphs themselves.
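
The import-free registry idea can be sketched as follows. This is an illustration under assumptions, not the contents of registry.py: the `GraphSpec` field names, the `attr` default, and the helper names are hypothetical, and the `"module:attr"` string is only assumed to be the form the generated JSON uses.

```python
import importlib
import json
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class GraphSpec:
    # Hypothetical field names; the source only shows two positional values.
    name: str
    module: str
    attr: str = "graph"


# Plain data only: no graph module is imported when this file loads.
GRAPHS = [GraphSpec("topic_outreach", "graphs.topic_outreach_graph")]


def build_registry_json(specs: Iterable[GraphSpec]) -> str:
    """Render the name -> "module:attr" table without importing any graph."""
    graphs = {spec.name: f"{spec.module}:{spec.attr}" for spec in specs}
    return json.dumps({"graphs": graphs}, indent=2, sort_keys=True)


def load_graph(spec: GraphSpec) -> Any:
    """Import a graph module lazily, only when that graph is needed."""
    module = importlib.import_module(spec.module)
    return getattr(module, spec.attr)
```

Because `build_registry_json` touches only strings, it can run in CI on a machine that has none of the optional LLM or database packages installed; the import cost is paid inside `load_graph`, one graph at a time.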

The rejected alternative is implicit in the history recorded in route_for.py. That file’s header explains that the routing matrix “used to live in apps/agentic-sales/src/lib/langgraph-client.ts (the WORKER_ROUTES / routeFor block at lines 53–114) and was lifted here.” The TS-only version tied the routing contract to the Vercel-deployed Next.js frontend; adding a new sub-worker required a full Vercel redeploy. By lifting the matrix into a pure Python module with zero HTTP or environment dependencies, route_for.py can be read by the Cloudflare Worker, by bricks (CLI tools), and by future Rust binaries, all speaking the unchanged LangGraph /runs/wait contract. The trade-off is a small duplication of the assistant allowlists (they must stay in sync with the dispatcher’s environment variables) but buys a single source of truth for routing logic that any caller can parse without TypeScript tooling.

A concrete failure mode that would break the design is a duplicate assistant_id appearing in the registry. Because the control plane guarantees “one stable name that maps to one graph definition and one expected input shape,” a collision would make the routing contract ambiguous: two graphs would claim the same key, so a request targeting that assistant_id could hit either worker pool—or worse, silently map to the wrong graph on the wrong pool. The observability plane, described in observability.md, depends on langgraph.assistant_id as the “join key for rolling cost / latency up by capability.” A duplicate would fragment those metrics and break the distributed run tree’s ability to correlate traces across the TS↔Python hop. During operation, a malformed or unrecognised assistant_id surfaces as a LangGraphError with kind "client" (status 400–499) because the input shape or assistant ID fails the routing contract’s check. Such errors are caught by callers, which then decide retry or fallback—but a duplicate would be silent data corruption, not a clear error, making it a particularly dangerous edge case that the registry’s cheap, import-free construction must explicitly guard against.
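
A guard against that collision might look like the sketch below. The function names are hypothetical, and the check works on bare assistant_id strings so that it stays as import-free as the registry itself.

```python
from collections import Counter
from typing import Iterable


def find_duplicate_ids(assistant_ids: Iterable[str]) -> list[str]:
    """Return every assistant_id that appears more than once, sorted."""
    counts = Counter(assistant_ids)
    return sorted(name for name, seen in counts.items() if seen > 1)


def assert_unique_ids(assistant_ids: Iterable[str]) -> None:
    """Fail loudly instead of letting a collision route silently."""
    duplicates = find_duplicate_ids(assistant_ids)
    if duplicates:
        raise ValueError("duplicate assistant_id(s): " + ", ".join(duplicates))
```

Running `assert_unique_ids` in the same CI step that generates the registry turns the silent-corruption case into a build failure.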

The module docstring of registry.py defines it as the single source of truth for graph identity, mapping each assistant_id to a graph spec, and mandates a dependency-free import structure to keep the control plane cheap to load.

python

"""Single source of truth for the agentic-sales LangGraph registry.

Both runtimes (the local ``langgraph dev`` server on :8002 and the FastAPI/
Cloudflare Containers app at ``core/app.py``) read graph identity from this
file. ``core/langgraph.json`` is generated from ``GRAPHS`` via
``backend/scripts/gen_langgraph_json.py``; ``core/app.py`` imports ``GRAPHS``
directly and compiles each spec at lifespan startup.

To add a graph: drop a row in ``GRAPHS`` and run ``make gen-langgraph-json``.
That is the only edit needed.

Keep this module dependency-free — it must import nothing from
``agentic_sales.*_graph`` at module top level so the JSON generator can build the
registry without compiling 50+ graphs (and without dragging in optional
LLM/DB deps that some graph modules import at import time).
"""

The routing contract — the only way in

In plain terms. Think of a reception desk in a large office building. Each department keeps a list of authorized visitors. When someone arrives, the receptionist checks the person's name against each department's list, in a fixed order. The first department that has that name on its list and is open gets the visitor. If no department claims them, they go to the main office by default. Every visitor must go through this single desk—there’s no secret side entrance. Without this rule, a visitor could end up in the wrong department, lost, or turned away, causing confusion and duplicate work.

System design. The routing contract is implemented by the pure Python module route_for.py, which was extracted from the TypeScript block WORKER_ROUTES / routeFor at lines 53–114 of apps/agentic-sales/src/lib/langgraph-client.ts. Its job is to map an incoming request (identified by an assistant_id) to a specific worker pool by iterating over a static matrix of worker routes. Each route carries a URL and an allowlist of assistant IDs; the first route whose URL is set and whose allowlist contains the requested assistant_id wins. If no route matches, the request falls through to the default CORE container (the baseline pool). This lookup is performed before any LLM or database dependency is touched—the control plane is deliberately “cheap to load” because the JSON generator pre-builds the registry without compiling the fifty graph submodules. The same matrix also governs path prefixes (/runs, /threads, /assistants, /linkedin/) for authentication and forwarding, making it a single decision point for both graph identity and network routing.
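
The first-match lookup described above can be sketched like this. The `WorkerRoute` fields follow the description in the text (prefix, url, secret, allowed assistant_ids), but the `Decision` fields, the `core_url` parameter, and the example URLs are assumptions made for illustration; the real signatures in route_for.py may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WorkerRoute:
    prefix: str
    url: Optional[str]        # unset when the pool is not configured
    secret: Optional[str]
    assistants: frozenset[str]


@dataclass(frozen=True)
class Decision:
    # Hypothetical shape: which pool won, and where to send the request.
    target: str
    url: str


def route_for(assistant_id: str, routes: Sequence[WorkerRoute], core_url: str) -> Decision:
    """Return the first configured route that allowlists the assistant_id."""
    for route in routes:
        if route.url and assistant_id in route.assistants:
            return Decision(route.prefix, route.url)
    # No route claimed the id: fall through to the default CORE container.
    return Decision("CORE", core_url)
```

Route order matters here: because the first match wins, an id that appears in two allowlists always resolves to whichever route is listed first.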

Why every call goes through this one contract—there is no other way in. The constitution (from mission.md) mandates that “every call goes through that contract. There is no other way in.” Centralising the mapping in route_for.py decouples route changes from the service redeploy cycle: adding a new sub-worker now requires only a wrangler-vars edit on the dispatcher, not a Vercel redeploy. This keeps the control plane “cheap to load—no LLM dependencies, no database dependencies” and limits the blast radius of a misconfigured route to a single pool rather than the entire platform. The trade-off is that the allowlist logic is static and manually maintained; there is no in-code switch for differential sampling or per-assistant override, as noted in the observability deep-dive (observability.md, section 10).

The rejected alternative is the previous approach where the routing matrix lived inline in the TypeScript file (apps/agentic-sales/src/lib/langgraph-client.ts). That forced any new worker addition to require a full Vercel deploy, making the dispatch layer brittle and tying the routing contract to a single language runtime. By lifting the matrix to route_for.py—a pure module with no HTTP, env, or Workers globals—the parity tests under tests/test_route_for.py can run unchanged in CPython, mirroring the same harness that used to live in apps/agentic-sales/backend/tests/test_dispatcher_logic.py. This also allows “non-TS callers” (like the ai-engineer-roadmap Cloudflare Worker, bricks, or future Rust binaries) to speak the unchanged /runs/wait contract instead of importing TypeScript.

Concrete failure mode: if an assistant_id appears in no allowlist, the fallback to the default CORE container may route the request to a pool that does not define the requested graph, causing a LangGraphError of kind client (400–499) because the input shape is mismatched. If the id is in the wrong allowlist (e.g., gh_lead_research accidentally placed in the EMAIL pool’s list), the request hits the wrong worker pool, the graph definition does not exist there, and the run either fails with a backend error (500–599) or silently executes a different graph, producing semantically wrong results. In either case, the observability plane will see a run tree that appears to originate from the correct assistant (via langgraph.assistant_id attribute) but the actual graph invoked is wrong—a failure that cannot be caught by a 200 status code alone. The only fix is to correct the allowlist entry in route_for.py and update the dispatcher’s wrangler vars.
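
Since a misplaced id cannot be detected from the HTTP status, one option is a static check that compares the allowlists against an expected ownership table. The sketch below assumes such a table exists; `expected_pool` and `misrouted_ids` are hypothetical names, and nothing in the source says the platform ships this check.

```python
from typing import Mapping


def misrouted_ids(
    allowlists: Mapping[str, frozenset[str]],
    expected_pool: Mapping[str, str],
) -> list[str]:
    """List every allowlist entry that is unknown or sits in the wrong pool."""
    problems = []
    for pool in sorted(allowlists):
        for assistant_id in sorted(allowlists[pool]):
            owner = expected_pool.get(assistant_id)
            if owner is None:
                problems.append(f"{assistant_id}: in {pool} allowlist but not a known graph")
            elif owner != pool:
                problems.append(f"{assistant_id}: in {pool} allowlist but belongs to {owner}")
    return problems
```

An empty result means every allowlisted id is known and sits in the pool that is supposed to own it.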

The path-level routing decision that returns the default CORE container for public endpoints, forming the fallback part of the contract.

python

def path_decision(method: str, path: str, *, has_auth: bool, auth_matches: bool) -> tuple[int, str]:
    if path in HEALTH_PATHS:
        return (200, "OK")
    if any(path.startswith(p) for p in INTERNAL_PREFIXES):
        return (403, "INTERNAL_ONLY")
    if not has_auth:
        return (401, "MISSING_AUTH")
    if not auth_matches:
        return (401, "BAD_AUTH")
    if any(path.startswith(p) for p in PUBLIC_CORE_PREFIXES):
        return (200, "CORE")
    return (404, "NOT_FOUND")
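
To make the ordering of those checks concrete, here is a self-contained version that can be run on its own. The function body is copied from the excerpt above. The prefix values come from the prose earlier in this document, while `HEALTH_PATHS` is a hypothetical value, since the source does not list the health paths.

```python
# Prefixes as described in the text; HEALTH_PATHS is an assumed placeholder.
HEALTH_PATHS = frozenset({"/health"})
INTERNAL_PREFIXES = ("/_ml/", "/_research/")
PUBLIC_CORE_PREFIXES = ("/runs", "/threads", "/assistants", "/linkedin/")


def path_decision(method: str, path: str, *, has_auth: bool, auth_matches: bool) -> tuple[int, str]:
    # Health checks are answered first and need no credentials.
    if path in HEALTH_PATHS:
        return (200, "OK")
    # Internal prefixes are rejected before auth is even inspected.
    if any(path.startswith(p) for p in INTERNAL_PREFIXES):
        return (403, "INTERNAL_ONLY")
    if not has_auth:
        return (401, "MISSING_AUTH")
    if not auth_matches:
        return (401, "BAD_AUTH")
    # Authenticated public traffic falls back to the CORE container.
    if any(path.startswith(p) for p in PUBLIC_CORE_PREFIXES):
        return (200, "CORE")
    return (404, "NOT_FOUND")
```

With these values, an unauthenticated request to `/_ml/embed` gets `(403, "INTERNAL_ONLY")` rather than a 401, which matches the stated goal of rejecting internal paths before the auth check runs.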

Worker pools and blast radius — the data plane

In plain terms. Think of the data plane like separate workstations in a busy kitchen. Instead of one huge team doing everything at one counter, you have a station for emails, one for classifying requests, and one for discovery tasks (like researching companies). Each station has its own list of specific jobs it can handle. If the discovery station catches on fire—say a graph goes haywire—the email and classify stations keep working. The damage stays at that workstation, not the whole kitchen. Without this separation, one failing job could crash everything. The trade-off is more moving parts to manage, but you gain safety: one messy station never takes down the whole restaurant.

System design. The data plane consists of three per-capability worker pools (EMAIL, CLASSIFY, and DISCOVERY), as defined in the mission document. Each pool owns a distinct set of workflows: EMAIL handles outbound email composition and related graphs; CLASSIFY processes classification tasks; DISCOVERY owns the consultancy, GitHub, Ashby, and YouTube graphs plus the company/product/contact intelligence surface. The pools are physically separated in the codebase under services/agentic-sales-*-worker/. Requests are routed to the correct pool by the control plane, which uses the WORKER_ROUTES table (defined in apps/agentic-sales/src/lib/langgraph-client.ts and mirrored in route_for.py in Python) to map an assistant_id to a specific worker pool. A graph’s assistant_id is embedded in every span as the langgraph.assistant_id attribute, making it the join key for cost and latency analysis per capability. The control plane’s routing contract is the only entry point: each incoming invocation is directed to the appropriate pool based on this static mapping, so a graph in DISCOVERY never accidentally runs in the EMAIL worker.

The trade-off of partitioning into per-capability pools is operational complexity versus blast-radius control. A single shared pool would be simpler to deploy and scale (one worker service to manage, one set of resource limits), but any noisy or failing graph—say, a GitHub lead-research graph that enters an infinite loop or exhausts memory—would degrade performance for all workflows. The product surface would experience timeouts or errors across email, classification, and intelligence simultaneously. With per-capability pools, the blast radius is contained at the pool level: DISCOVERY can crash or saturate its own resources while EMAIL and CLASSIFY continue serving. The mission document explicitly states “A noisy graph stays inside its own pool. The blast radius is at the pool level, not the platform.” This principle is non-negotiable; the data plane is built to enforce this isolation at the cost of running multiple worker fleets and maintaining separate deployment pipelines.

The rejected alternative is a single, monolithic worker pool that runs every graph regardless of capability. Although this would reduce infrastructure overhead and simplify the routing contract (one destination for all runs), it fails the blast-radius requirement. The observability plane would still trace across graphs, but a single failure could bring down the entire platform. The architecture implicitly chooses isolation at the granularity of capability domains (EMAIL, CLASSIFY, DISCOVERY) rather than per-graph or fully shared. This granularity is chosen because capability boundaries align with independent business functions—email composition does not share systemic dependencies with classification—making them natural failure containment units. The mission further notes that the data plane is “extendable,” meaning new domain pools can be added without refactoring the control plane or exposing existing pools to new risks.

Concrete failure mode: suppose the DISCOVERY pool contains the gh_lead_research graph, and a bug in the LangGraph execution causes repeated unhandled exceptions (status 500), flooding the worker’s HTTP handler. In a shared-pool architecture, the health check or connection pool could be exhausted, causing EMAIL’s email_compose graph to return 504 timeouts as well. With per-capability pools, the DISCOVERY worker’s traces show a spike in gh_lead_research failures (a cluster of backend-kind LangGraphError objects), but the EMAIL worker continues to serve its own workflows without degradation. If trace propagation also breaks, those failures additionally appear as orphaned spans, the “Worker run shows as standalone root (not nested under TS span)” entry in the observability failure catalog; that is a tracing symptom, not a sign of containment. The blast radius stays within the pool, and the control plane’s routing contract continues directing email_compose to the healthy EMAIL worker. Remediation involves scaling or fixing only the DISCOVERY pool without taking down the rest of the platform.
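
The containment argument can be illustrated with a toy capacity model. This is not how the workers are implemented; `WorkerPool`, its slot counts, and `saturate` are hypothetical and exist only to show why a shared pool spreads the damage while separate pools do not.

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    """Toy pool with a fixed number of run slots (hypothetical model)."""

    name: str
    capacity: int
    in_flight: int = 0

    def try_start(self) -> bool:
        """Claim a slot if one is free; a stuck run never gives it back."""
        if self.in_flight >= self.capacity:
            return False
        self.in_flight += 1
        return True


def saturate(pool: WorkerPool) -> None:
    """Simulate a runaway graph that occupies every slot in its pool."""
    while pool.try_start():
        pass
```

With separate `EMAIL` and `DISCOVERY` pools, calling `saturate` on the discovery pool leaves `try_start()` on the email pool returning `True`. With one shared pool of the same total capacity, the same runaway graph leaves no slot for email at all.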

route_for.py exports the default allowlist constants for the per-capability worker pools (CLASSIFY, DISCOVERY), the WorkerRoute and Decision types, and the routing functions, defining the data plane's blast radius boundary.

python

__all__ = [
    "DEFAULT_CLASSIFY_WORKER_ASSISTANTS",
    "DEFAULT_DISCOVERY_WORKER_ASSISTANTS",
    "PUBLIC_CORE_PREFIXES",
    "INTERNAL_PREFIXES",
    "HEALTH_PATHS",
    "WorkerRoute",
    "Decision",
    "build_routes",
    "route_for",
    "path_decision",
]

One action, one trace — the observability plane

In plain terms. Imagine a single customer order that splits into two separate packages, each with its own tracking number. You'd never see the full journey. That's the problem the observability plane solves. When a user action starts in one language and continues in another—like TypeScript handing off to Python—it must stay connected as one traceable path. Without that connection, you get orphaned spans: a bunch of isolated events that belong to the same request but can't be linked. It's like losing the thread of a relay race; you can't tell if the baton was dropped or just delayed.

System design. The observability plane exists because a single user action in the agentic-sales platform traverses a TypeScript Next.js handler, crosses a wire to a Python Pyodide worker, and fans out across per-capability worker pools. Without a distributed run tree that stitches those hops into one debuggable thing, a request that fails semantically—e.g. a model answers incorrectly—would appear to succeed at the transport layer. The mechanism is concrete: the TypeScript side at apps/agentic-sales/src/lib/langgraph-client.ts opens a LangSmith run, then the wire carries up to four header families (LangGraph, OTel, LangSmith, and traceparent) across the boundary. On the Python side services/_shared/tracing/_tracing.py parses the inbound headers via parse_inbound, re-attaches the same trace context, and emits spans that nest under the original Next.js root. The two non-negotiables ensure this works: (1) one user action maps to one debuggable thing across the TS↔Python hop, and (2) zero per-request latency overhead when env vars like OTEL_EXPORTER_OTLP_ENDPOINT or LANGSMITH_TRACING are unset—strict no-op, no network, no PII egress.
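
One of the header families, the W3C `traceparent` header, has a simple fixed format that can be parsed and re-emitted with the standard library alone. The sketch below is illustrative: the text names `parse_inbound` but does not show its signature, so the functions here use different, hypothetical names and cover only the version `00` layout.

```python
import re
from typing import Mapping, NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context).
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(headers: Mapping[str, str]) -> Optional[TraceContext]:
    """Return the inbound trace context, or None if absent or malformed."""
    raw = next((v for k, v in headers.items() if k.lower() == "traceparent"), None)
    if raw is None:
        return None
    match = _TRACEPARENT.fullmatch(raw.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def inject_traceparent(ctx: TraceContext, span_id: str) -> dict[str, str]:
    """Build the outbound header so the next hop joins the same trace."""
    flags = "01" if ctx.sampled else "00"
    return {"traceparent": f"00-{ctx.trace_id}-{span_id}-{flags}"}
```

A `None` result from the parser is the situation behind the "two separate trace IDs" failure: the receiving side has nothing to attach to and starts a fresh trace.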

The trade-off is deliberate: separating observability into its own plane—distinct from the control plane (graph identity + routing contract via WORKER_ROUTES) and the data plane (per-capability worker pools like EMAIL, CLASSIFY, DISCOVERY)—forces the correlation logic to live in shared infrastructure rather than being duplicated per graph. The control plane is cheap to load (no LLM dependencies) but knows nothing about trace shape; the data plane isolates noisy graphs to their own pool but has no cross-pool visibility. Making the observability plane a first-class citizen means the same langsmith_run_id can be persisted on a D1 row and later used for async feedback (e.g. submit_feedback(env, run_id, "reply_received", score=1.0) in chapter 12 of the audio guide). The cost is that every new graph must mirror the pattern; the gain is that a single trace tree can cross all three planes without manual joining.

The rejected alternative is the traditional HTTP health-check model. As the source states, "Traditional HTTP health checks tell you the API returned 200. They say nothing about whether the model answered the user's question correctly. That gap—between a transport-layer success and a semantic-layer failure—is the problem the observability plane exists to close." Instead of treating success as a binary HTTP status, the system uses LangSmith's run tree to carry the full LLM call chain, prompt versions, and cost data. This is not a simple request-id header; it requires active propagation of the OTel trace context and the LangSmith-specific run_id across the language boundary. The wire protocol does not invent new headers but reuses standards LangGraph + OTel + LangSmith already understand.

Concrete failure modes are cataloged in section 11.1 of the observability reference. The most common is a propagation breakage where the worker run shows as a standalone root (not nested under the TS span). The cause: getCurrentRunTree(true) returned undefined—no LangSmith run was open in the Next.js scope. This produces an orphaned span that cannot be traced back to the originating user action. Another failure occurs when a load balancer or proxy strips the traceparent header, causing two separate trace IDs for one user action. A third case is when a teammate adds a fetch-shaped wrapper that bypasses propagation.inject—the symptom is a flat orphan span tree. The audit rule is strict: apps/agentic-sales/src/ must contain zero raw fetch(/runs/wait...) calls; every cross-process hop must go through runGraphWithMeta. Without these safeguards, the "one debuggable thing" principle collapses into disconnected fragments that a senior engineer cannot follow end to end.
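
The audit rule lends itself to an automated check. The sketch below is a line-based heuristic under assumptions: the regular expression, the `allowed` exemption set, and the function name are hypothetical, and a call split across several lines would slip past it.

```python
import re
from typing import Mapping

# Matches a fetch( call whose argument list mentions /runs/wait on one line.
RAW_RUNS_WAIT = re.compile(r"\bfetch\s*\([^)]*?/runs/wait")


def find_raw_fetches(
    sources: Mapping[str, str],
    allowed: frozenset[str] = frozenset(),
) -> list[tuple[str, int]]:
    """Return (path, line number) for every raw fetch to /runs/wait.

    `sources` maps a file path to its text. Paths in `allowed` are skipped,
    for example the one client module that is meant to own the call.
    """
    hits = []
    for path in sorted(sources):
        if path in allowed:
            continue
        for lineno, line in enumerate(sources[path].splitlines(), start=1):
            if RAW_RUNS_WAIT.search(line):
                hits.append((path, lineno))
    return hits
```

An empty list corresponds to the "zero occurrences" state the audit expects; anything else points at the file and line to move behind runGraphWithMeta.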

Durable workflows — campaigns that resume

In plain terms. Think of it like saving progress in a long video game vs. playing a one‑shot arcade game. For durable workflows like a multi‑step campaign or follow‑up sequence, you need "save points" (checkpoints) so that if the server restarts mid‑step, it picks up where it left off instead of starting over. That's why resumable=True exists on those graphs. A stateless one‑shot graph—like a quick lookup—doesn't need that overhead; it just runs and finishes. Without checkpoints, a multi‑step campaign that gets interrupted would lose all its work, forcing a full restart and potentially duplicating actions or missing delays. Durability is opt‑in per graph to keep cheap graphs cheap and be explicit about which workflows need the safety net.

Topic-outreach graph opts out of checkpointing (resumable=False), showing that durability is per‑graph and why multi‑step campaigns would need it.

python

GraphSpec("topic_outreach", "graphs.topic_outreach_graph"),
    # ... resumable=False (long-running streaming job, invoked via CLI or /runs/wait, not checkpointed).
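
The save-point idea from the plain-language description can be shown with a generic sketch. It is a concept illustration only: it does not use LangGraph's checkpointer, and `run_campaign`, the step list, and the dictionary store are all hypothetical stand-ins.

```python
from typing import Callable, MutableMapping, Sequence

Step = Callable[[], None]


def run_campaign(
    campaign_id: str,
    steps: Sequence[Step],
    checkpoints: MutableMapping[str, int],
) -> None:
    """Run steps in order, recording a save point after each one completes.

    On a rerun, steps that already completed are skipped, so an interrupted
    campaign resumes instead of repeating earlier actions.
    """
    start = checkpoints.get(campaign_id, 0)
    for index in range(start, len(steps)):
        steps[index]()
        # Persist progress only after the step finished successfully.
        checkpoints[campaign_id] = index + 1
```

A one-shot graph gains nothing from this bookkeeping, which is the argument for keeping durability opt-in per graph.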
