Agentic Sales — Audio Guide

📄 Full transcript · 12 chapters · read at your own pace — every chapter paired with the real agentic-sales source code.

The complete narration, chapter by chapter — each one paired with the real agentic-sales source it describes. New here? Start with the Agents & Workflows architecture → Want to listen along instead? Open the audio guide →

01. Why this system exists

This system runs ten independent sales workflows on shared infrastructure. Each one discovers contacts, enriches data, scores fit, reaches out, and learns from the result. A new workflow ships as a graph plus a prompt, not a fresh service. Three goals drove the design. Uniform observability everywhere. No single flow able to sink the rest. And no cross team coordination to add a flow.

The architecture rests on three planes. The control plane owns graph identity and the routing contract. It stays cheap to load, free of any model or database. The data plane runs one worker pool per capability, covering email, enrichment, classification, and discovery. A jammed graph stays inside its own pool, so the blast radius equals the pool, never the platform. The observability plane builds a distributed run tree. That tree crosses the hop from TypeScript to Python, turning one user action into one debuggable thing.

The trade off decided the shape. A monolith with branching prompts has the smallest footprint, but its failure domains collapse into one, so any change risks every flow. Microservices per flow give clean isolation, yet each flow pays a platform tax. So operations cost grows with every addition. The chosen design is a shared runtime fronted by a registry plus per capability workers. Isolation sits at the pool level, and growth is purely additive. The deciding axis was failure domain cost, not lines of code saved.

Adding a capability means three small edits. You add one registry row, one builder, and one route. No service scaffolding, no per team wiring, no coordination call.

One failure mode is worth naming. The registry can become a merge bottleneck as the team count climbs. The first symptom shows up as conflicting pull requests on a single shared file.

Code references

Adding a sales workflow is one row in a frozen dataclass tuple — no new service, no new wiring. That tuple is the whole fleet.

apps/agentic-sales/backend/infra/registry.py

Graph identity lives in exactly one frozen dataclass. Both runtimes (FastAPI + langgraph dev) read it; the resumable flag decides whether a graph gets the shared checkpointer.

python

@dataclass(frozen=True)
class GraphSpec:
    assistant_id: str  # public id used in /runs/wait, langgraph.json, TS client
    module: str  # dotted import path, e.g. "graphs.email_compose_graph"
    compiled_attr: str = "graph"  # module-level symbol referenced in langgraph.json
    # Callable(checkpointer) -> CompiledGraph. ``None`` means the module only
    # exposes a pre-compiled ``compiled_attr`` instance (built at import time
    # with no checkpointer); the FastAPI runtime uses that instance directly
    # and the graph runs without persistence. Most graphs implement
    # ``build_graph(checkpointer)``; the ones that don't need durability use
    # the precompiled form.
    builder_attr: str | None = "build_graph"
    # True only for graphs that genuinely need to resume from a checkpoint —
    # i.e. ones invoked with a stable ``thread_id`` so a SIGKILL mid-run can
    # pick up where it left off. Every other graph is invoked with a random
    # UUID thread_id (cron + /runs/wait), so its checkpoint rows are written
    # and never read. ``core/app.py:_compile_one`` passes ``checkpointer=None``
    # when this is False, which keeps ``checkpoint_blobs`` / ``checkpoint_writes``
    # from blowing past the Neon storage cap.

apps/agentic-sales/backend/infra/registry.py

The fleet is one tuple of 35 GraphSpecs (first rows shown). To add a capability you add a row — that is the only edit.

python


GRAPHS: tuple[GraphSpec, ...] = (
    # ── Email graphs ────────────────────────────────────────
    GraphSpec("email_compose", "graphs.email_compose_graph"),
    GraphSpec("email_opportunity", "graphs.email_opportunity_graph"),
    GraphSpec("email_reply", "graphs.email_reply_graph"),
    GraphSpec("email_outreach", "graphs.email_outreach_graph"),
    # Durable-thread campaign engine (reactive, CF-cron driven). One thread per
    # (campaign, contact) with a stable ``campaign-<cid>-<contactId>`` thread_id:
    # check_reply → guard → generate_touch (email_outreach) → send_touch → record
    # → schedule_next → interrupt(wake_at). resumable=True — the D1 checkpointer
    # (infra/checkpointer.py) persists state between touches so a CF cron can
    # resume the thread days later via Command(resume). See
    # specs/2026-06-04-durable-campaign-engine/.
    GraphSpec(
        "campaign",
        "graphs.campaign_graph",
        resumable=True,
        builder_attr="build_campaign_graph",
    ),

02. Beyond traditional observability

A 200 status code with a wrong answer is the hardest bug to catch. The dashboard looks healthy, yet the model gave the wrong output. Traditional observability misses this kind of semantic failure.

The design fix is to capture the model decision on every call. Then it links that decision to the transport trace. This needs tracing for latency and graph shape. It needs structured run metadata holding the model output. It also needs cross process propagation, so the trace survives the hop into the Python backend.

The core component is a client that wraps every backend call in a named span. Inside that span it injects W3C trace context into the outgoing headers. The backend forwards those headers to whichever worker runs the graph. The caller trace stays joined end to end. After the call returns, the client reads the run identifier and the peer trace id. It attaches them as span attributes. An operator can then pivot from a trace into the full model decision.

The whole setup bootstraps once per server runtime. It wires the trace exporter, the propagators, and global fetch instrumentation. So no per call site code is needed. When the exporter endpoint is unset, the setup returns immediately. Local development pays zero overhead.

The trade off is clear. Logging alone is cheapest but blind to model output. Tracing alone catches latency but loses the decision. The chosen approach combines both with cross process propagation. The constraint was reaching the model decision without adding request latency.

One failure mode dominates. A load balancer that strips trace headers breaks the cross process link. The backend then starts a fresh trace. The signal is a missing peer trace id on the caller span. The blast radius is every request through that balancer.

Start with structured logging when your team cannot maintain propagation. Add tracing only after cost containment exists.

Code references

The hop from TypeScript to Python carries the same trace. Every /runs/wait response also hands back the ids to pivot trace ⇄ LangSmith run — and emits nothing when telemetry is off.

apps/agentic-sales/backend/infra/tracing_headers.py

A pure, side-effect-free header builder. Returns {} when neither OTel nor LangSmith has anything to contribute — the response is then byte-identical to having no middleware.

python

def build_link_headers() -> dict[str, str]:
    """Assemble the cross-link headers available for the current request.

    Pure / side-effect-free so the header set is unit-testable without an ASGI
    round-trip. Returns ``{}`` when neither OTel nor LangSmith has anything to
    contribute — the response is then byte-identical to having no middleware."""
    headers: dict[str, str] = {}
    trace_id = _otel_trace_id_hex()
    if trace_id:
        headers["x-trace-id"] = trace_id
    run_id = _langsmith_root_run_id()
    if run_id:
        headers["x-langsmith-run-id"] = run_id
        url = _langsmith_run_url(run_id)
        if url:
            headers["x-langsmith-run-url"] = url
    return headers

03. The run tree

One user action can fan out into dozens of graph steps, model calls, and tool invocations. You need to debug that whole cascade as a single unit. The run tree is how this platform makes that possible.

Three designs competed. Flat logs sharing one correlation identifier are simple to query, but they lose all structure between the root call and its children. Passing the original call identifier by hand at every site stays auditable, but any new code path can forget it. The chosen design is a tree of typed runs under one root. It buys richer debugging at the cost of a coherent identifier scheme that must hold everywhere.

The clever part is that the tree builds itself. The web layer auto instruments every outgoing call inside the process, so parent and child links flow through request headers with no work at each site. At the one network boundary into the Python backend, the client injects the active trace context exactly once. The backend reads those headers and continues the same trace, so its run nests under the caller rather than starting orphaned. Engineers never thread parent identifiers manually, because the framework does it and the single boundary cannot be forgotten.

The response carries two identifiers back. One is the run identifier and one is the trace identifier. The client records both on the active span. Now an operator can query by run, by trace, by assistant, or by status, and reconstruct the full debug graph.

One trade off is worth naming. Some work must not block the user. A background memory write fires the graph and never waits for it, because a slow or disabled memory backend must never stall the email pipeline. That run still exists in the tree, but the caller sees nothing synchronously.

The main failure mode is the orphaned span. If a header is dropped on one hop, child runs appear with no parent. The blast radius stays small, isolated to that one invocation, and the signal is clear: root spans under the web service with no parent mean propagation broke somewhere.

Reach for this model when one action spawns many operations across async boundaries. Avoid it when your runtime cannot propagate context across hops, or when the storage cost of deep trees is more than you can bear.

Code references

One user action becomes one tree: the TS client opens a span and injects W3C context on the wire; the backend resolves the spec and compiles the graph as a child of that span.

apps/agentic-sales/src/lib/langgraph/index.ts

A single shared SDK client. Trace context is injected per-request in onRequest, at fetch time inside this call's active context — so the headers reflect THIS request even though the client is shared.

typescript

const langgraphClient = new Client({
  apiUrl: LANGGRAPH_DISPATCHER_URL,
  apiKey: null,
  defaultHeaders: LANGGRAPH_DISPATCHER_SECRET
    ? { Authorization: `Bearer ${LANGGRAPH_DISPATCHER_SECRET}` }
    : {},
  callerOptions: { maxRetries: 0, maxConcurrency: MAX_CONCURRENCY },
  onRequest: (_url, init) => {
    // Forward W3C trace context (traceparent/tracestate) so the dispatcher
    // routes it to whichever sub-worker / container runs the graph, joining this
    // caller's trace; plus LangSmith trace context when a run tree is active so
    // the worker's run nests under it. Built fresh per request.
    const carrier: Record<string, string> = {};
    propagation.inject(context.active(), carrier);
    injectLangSmithHeaders(carrier);
    if (Object.keys(carrier).length === 0) return init;
    const headers = new Headers(init.headers);
    for (const [k, v] of Object.entries(carrier)) headers.set(k, v);
    return { ...init, headers };
  },
});

apps/agentic-sales/backend/app.py

Server side: import + compile one graph, cached. The shared D1 checkpointer is wired only when the spec is resumable; everything else compiles with None.

python

async def _compile_one(assistant_id: str) -> Any:
    """Import + compile one graph (cached). Raises on import/compile failure."""
    cached = _COMPILED.get(assistant_id)
    if cached is not None:
        return cached
    spec = _SPECS[assistant_id]
    mod = importlib.import_module(spec.module)
    checkpointer = await get_checkpointer() if spec.resumable else None
    if spec.builder_attr is not None and hasattr(mod, spec.builder_attr):
        graph = getattr(mod, spec.builder_attr)(checkpointer)
    else:
        # Precompiled module-level instance (no checkpointer wired).
        graph = getattr(mod, spec.compiled_attr)
    _COMPILED[assistant_id] = graph
    return graph

04. Spans and sampling

A trace from months ago must still explain itself. But recording everything bloats storage and slows every query. So the design keeps only three classes of attribute on each span, and drops the rest.

The first class is request identity. It joins one run to other systems. The client reads response headers and writes the LangSmith run link and a peer trace identifier. An operator can then pivot straight into the deeper tool. The second class is call shape, which lets you replay the exact model call. The load bearing field here is the prompt version. Without it, you recorded the call but cannot reproduce it, because the template may have shifted. The third class is cost. Token counts let you sum spend by model or by prompt, with no pricing lookup needed.

The key privacy decision is what stays off the span. The graph never writes the email body or the raw claim text. It serializes only scores, counts, and a grounded flag. The trade off is plain. You lose the literal output, but the trace stops being a compliance risk. One failure mode still bites. If you rewrite a template yet keep the old version number, the span lies, and the call cannot be reproduced.

At high volume you cannot keep every trace. The design composes three sampling strategies rather than choosing one. Head sampling decides at the start. It is cheap but biased toward average runs, so an outage can vanish. Tail sampling buffers the whole run, then always keeps anything that errored, ran slow, or cost too much. That gives full failure coverage. The price is buffering and an eviction policy. Tagged sampling always keeps experiment runs, so evaluation cohorts survive.

The two layers share one trick. Both hash the trace identifier the same way, so an in process tail sampler and an upstream collector never drop the same span twice. Combine head and tail sampling once volume swamps your budget. Stick to the three attribute classes, and every trace stays cheap and still debuggable.

Code references

Keep every error and every slow trace; sample the routine successes deterministically by trace-id hashing — no random(), so the same trace decides the same way everywhere.

apps/agentic-sales/backend/infra/otel_setup.py

The tail decision itself: always keep errors (1) and slow spans (2); for routine traces, keep/drop deterministically by hashing the low 64 bits of the trace-id against the base-rate bound — no random(), so the same trace decides the same way everywhere.

python

    def _should_export(self, span: Any) -> bool:
        try:
            from opentelemetry.trace import StatusCode

            # 1. Always keep error spans.
            if span.status and span.status.status_code == StatusCode.ERROR:
                return True

            # 2. Always keep slow spans.
            if span.end_time and span.start_time:
                if (span.end_time - span.start_time) >= self._threshold_ns:
                    return True

            # 3. Routine trace: deterministic keep/drop via trace-ID hash.
            trace_id: int = span.context.trace_id if span.context else 0
            low64: int = trace_id & 0xFFFF_FFFF_FFFF_FFFF
            return low64 < self._id_upper_bound
        except Exception:  # noqa: BLE001 — never drop a span on instrumentation error
            return True

05. Datasets and evaluators

A graph stays a black box until you pin down its inputs and outputs. Without a typed boundary, a change on either side stays silent. The damage surfaces only when a production call breaks. So the design treats the dataset as a contract. Every input and output validates against a versioned schema, and golden examples replay against it.

The contract sits between graphs as a typed model. Each model carries a version string and a strict policy on extra fields. Input models forbid unknown fields, so a stale producer fails loudly rather than dropping data. Output models ignore unknown fields, so a newer producer can add output ahead of its consumers. The version lives in the payload alone. With no duplicate header, there is no split brain between header and body.

The trade off is rollout discipline. A major version bump means a wire incompatible change, so both sides redeploy in lock step. A minor bump is additive and ships on one side first. This pays off once more than two services call each other. Below that, when you can redeploy every caller in one cycle, a shared state dictionary is enough.

Evaluators share one simple shape. Each takes a run output and a reference, and returns a single score. The function must stay pure, with no side effects and no network calls. The same input always yields the same score. That purity lets one evaluator run in both continuous integration and production scoring.

Three families fit the shape. Deterministic checks cover format and exact match for almost no cost. Reference comparison measures drift from a known good answer. Model as judge handles subjective quality at higher cost and added bias. All three map one output plus one reference to one score.

The sharpest failure mode is a hidden side effect. A contributor sneaks a write or a shared counter into an evaluator. Continuous integration then turns nondeterministic, and the same run scores differently across retries. You catch it by running each evaluator in isolation and watching for unstable scores.

Code references

Typed dataset contracts per graph, and pure code evaluators that return the same score whether they run in CI or watch production.

apps/agentic-sales/backend/eval/langsmith_datasets.py

final_response datasets, one per core graph, versioned and named by convention (agentic-sales:<graph>:final_response) so they group in the LangSmith UI. Fixtures only — strip_pii scrubs anything real.

python


INBOUND_EMAIL_VALID_LABELS = (
    "interested",
    "not_interested",
    "auto_reply",
    "bounced",
    "info_request",
    "unsubscribe",
    "spam",
    "partnership",
    "meeting_scheduled",
)

apps/agentic-sales/backend/eval/text_to_sql_evaluators.py

A pure evaluator: (run, example) -> {key, score, reason}. Deterministic — same input, same score — so one function both gates merges and audits prod.

python

def select_only_evaluator(run: Any, example: Any = None) -> dict[str, Any]:
    """Assert that the SQL is read-only (SELECT-only, no write/DDL keywords).

    Checks:
      1. Non-empty SQL.
      2. Starts with SELECT or WITH (CTE prefix).
      3. No write/DDL keyword anywhere (regex matches at word boundaries).

    Returns a dict compatible with LangSmith EvaluationResult:
      {"key": "select_only", "score": 1.0|0.0, "reason": str}
    """
    sql = _extract_sql(run)

    if not sql:
        return {"key": "select_only", "score": 0.0, "reason": "empty SQL"}

    head = sql.lstrip("(").lower()
    if not (head.startswith("select") or head.startswith("with")):
        return {
            "key": "select_only",
            "score": 0.0,
            "reason": f"SQL does not begin with SELECT/WITH: {sql[:60]!r}",
        }

    m = _WRITE_RE.search(sql)
    if m:
        return {
            "key": "select_only",
            "score": 0.0,
            "reason": f"Forbidden keyword {m.group()!r} found at position {m.start()}",
        }

    return {"key": "select_only", "score": 1.0, "reason": "SELECT-only confirmed"}

06. LLM as a judge

Some outputs cannot be graded by deterministic checks. An email draft is a judgment call. So this platform uses one LLM to grade the work of another. That judge becomes a versioned, monitored part of the system, not a casual afterthought.

The core design decision is to use a judge from a different model family. If the production model is one vendor, the judge comes from another. A judge of the same family tends to bless its own kind of mistakes. The trade off is real cost and a second provider. The payoff is a score you can trust across versions.

A judge drifts when its provider updates the model. So agreement against human labels gets measured on a schedule, with a floor. A score under that floor means the judge is treated as noise until it is recalibrated. The judge prompt is pinned to a version on every call, so comparisons across time stay honest.

The second big decision is pairwise comparison over absolute scoring. You do not ask for a score from one to ten. You run each input through both versions and ask which output is better. The same judge grades both in the same call, so anchor drift cancels out. The result is a win rate, the fraction of cases where one version wins.

A single win rate can mislead, so it is bootstrapped. The dataset is resampled many times, and a confidence interval is read off the results. A new version ships only when the lower bound clears one half. The team also segments by persona, because a judge can favor longer answers or one slice. Every critical segment must clear the bar, not just the average.

The whole thing runs in a harness with a few knobs: how many evaluators run at once, and how many repetitions smooth out a flaky system. Each evaluator is wrapped so one crash never loses the run.

One failure mode dominates the concurrency design. Set the limit too high and you trigger a retry storm. The fix is a hard ceiling plus gentle additive backoff, never exponential backoff. That keeps the evaluation honest and the rate limit respected.

Code references

When a check needs judgment rather than a rule, an LLM scores it — but the judge is opt-in behind its own marker and drawn from a different model family, so it never blesses its own output.

apps/agentic-sales/backend/pyproject.toml

LLM-judge evals are gated markers, not part of the default run. The fast CI gate is pytest -m 'not llm'; the deepeval judge runs only with EVAL=1 + a gateway token.

toml

markers = [
    "hypothesis: property-based tests using the Hypothesis library",
    "live: live e2e probes against a deployed LangGraph URL — gated by RUN_LIVE_E2E=1",
    "leadlive: live calls to academic-paper APIs (papers_clients/) — opt in via -m leadlive",
    "deepeval: DeepEval-based LLM-judge evals — gated by EVAL=1 + CF_AIG_TOKEN + deepeval installed",
    "redteam: DeepTeam adversarial red-teaming of the agent graphs — gated by REDTEAM=1 + CF_AIG_TOKEN + deepteam installed",
    "llm: tests that call a real LLM (enrich/features LangSmith judge evals) — skipped unless EVAL=1; CI fast gate: pytest -m 'not llm'",
]

07. Closing the feedback loop

A feedback loop turns downstream outcomes back into tuning signals. Three designs competed for where feedback should live. You could leave it as application data, but then the loop never truly closes. You could run a separate feedback service, but a small team pays a steep operational cost. We chose the third path. Feedback becomes a first class part of graph state. The orchestrator routes on prior outcomes, and interface actions emit feedback events right inside the running graph.

The trade off is simple. Embedding feedback in state keeps the loop tight and the latency low, since the graph already runs on every interaction. The risk is schema coupling. Many heterogeneous services without a strong integration pattern would break this approach. The one failure mode worth naming is late feedback. A meeting booked weeks after a run must still find its origin, so the system attaches every record by thread identifier.

That feedback then becomes a reward signal for the lead ranker. A booked meeting is the ideal outcome, but it arrives too slowly to guide outreach. Fast proxies like opens and replies arrive quickly, yet they are easy to game. Train only on opens and the model overfits, so meeting rates fall. The answer is a weighted blend of events, with weights tuned online from sparse labels. It converges slower, but it resists the failure of any single signal.

The most dangerous failure here is positive scarcity. When real replies are rare, updates pile onto negative examples and the ranker stops responding to good news. Weight drift is the cousin problem, where one feature dominates and the ranker goes blind to engagement quality.

A final guard sits inline as an online quality gate. It scores each record on the fly, with no reference answer and a budget of a few milliseconds. It raises flags, adjusts the score, and passes only records that clear a floor. Alerts fire on rolling rates, never single records, which holds back alert fatigue. Together these three layers turn raw production traffic into a learning system.

Code references

Production outcomes — bounces, replies, results — are joined back to the classifier decision that picked the entity, so we can ask whether its confidence was calibrated against reality.

apps/agentic-sales/backend/scripts/feedback_common.py

The three-polarity outcome model in one pure function: every raw label maps to POSITIVE / NEGATIVE / NEUTRAL. Unknown/missing → NEUTRAL ('no signal'), so it's excluded from positive-rate denominators — missing data, not evidence the pick was bad.

python

def outcome_polarity(label: str | None) -> str:
    """Map a raw outcome label to POSITIVE / NEGATIVE / NEUTRAL.

    Unknown or missing labels are NEUTRAL — treated as "no signal", not as a
    bad outcome, so they never drag a classifier's measured positive-rate down.
    """
    if not label:
        return NEUTRAL
    key = label.strip().lower()
    if key in _POSITIVE_LABELS:
        return POSITIVE
    if key in _NEGATIVE_LABELS:
        return NEGATIVE
    return NEUTRAL

apps/agentic-sales/backend/scripts/feedback_common.py

The three canonical polarity constants outcome_polarity resolves to; calibration_report later groups (verdict, confidence, polarity) into confidence bands, where a high-confidence + low positive-rate band is the decision reality is contradicting.

python


POSITIVE = "positive"
NEGATIVE = "negative"
NEUTRAL = "neutral"

08. Drift and the CI gate

Distribution drift costs money before any business metric moves. So scoring needs a per record quality signal that catches input shift the moment it arrives. The design here is a single composite score for each company record. It blends two axes. Completeness measures how many fields are present. Freshness measures how stale the data has become.

The trade off is deliberate. An older approach computed drift per feature on a nightly batch. It needed a baseline window and many samples. It was expensive and slow. The composite runs inline, on a single database row, with no outside dependencies. It costs a few milliseconds. The result maps to standard PSI bands. A low value is stable. A middle value asks for investigation. A high value signals retraining.

One failure mode matters most. The half lives are tuned for a single upstream source. A field that updates quarterly looks stale under a short decay window, even when it stays healthy. A new source with a different cadence will misfire the freshness score and raise false alarms.

The second pillar is the gate that guards merges. A prompt or graph change can reach production in minutes and break a critical metric for everyone. So every pull request runs the full eval suite before merge. If any critical metric regresses past its threshold, the merge is blocked. Override stays possible, but it demands written justification in the audit log. Three designs competed. Manual review is thorough yet easy to skip. Advisory checks get ignored under deadlines. The hard block wins because it is enforced and traceable.

The deepest lesson came from a segmentation incident. An aggregate metric stayed flat while one valuable segment suffered a large slowdown. The average buried it. So segmentation became a first class primitive. Every evaluator reports per segment and in aggregate. A regression on any single slice blocks the merge. That guards against the regression hiding in plain sight.

Two guards keep segments honest. Enforce a minimum sample size, so a tiny slice cannot fire on noise. Version each segment definition, so yesterday's valuable tier still compares like for like today.

Code references

That same calibration check is an enforceable gate. Where the other feedback scripts report, this one exits non-zero when production contradicts a classifier — so a regression blocks the merge.

apps/agentic-sales/backend/scripts/feedback_gate.py

The gate's exit contract: a source it couldn't fully evaluate returns 2, any real flag returns 1 (so the merge is blocked), clean returns 0 — and --warn-only downgrades a flag to a warning. No new analysis; it just turns the calibration reports into a CI signal.

python

    if errors and not flags:
        return 2  # could not fully evaluate; distinct from a real failure
    if flags and not args.warn_only:
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())

09. Safe rollout

Shipping a new graph version safely needs two phases. First you prove it works without risking users. Then you shift live traffic in slow steps. Both lean on the same version registry in the backend.

Shadow mode comes first. The new version runs on real production input, but its output is never returned. You capture results and compare them offline against the live version. The comparison runs pairwise. The same input flows through both, and a judge picks a winner. You promote only when the new version wins across every critical segment, with the lower bound of the confidence interval clearing one half.

The big design choice here is fidelity against load. Full shadow mode duplicates live traffic in real time, so it catches drift. But it doubles every external API call. That leads to the failure you must watch. Both versions hit the same upstream service, the request rate doubles, and throttling errors spike on the real path. Sample a smaller slice when that risk is too high.

Once shadow mode passes, the canary takes over. You register the new version beside the old one, each with its own assistant identity. The typed client routes to the chosen version by hashing the request. The new version starts at one percent of traffic and climbs in committed steps, toward five, ten, twenty five, then full.

Every step gates on per segment metrics. If one regresses past its threshold, automatic rollback flips the share back a step. That rule is fixed before the ramp begins, so nobody negotiates it mid incident. This catches scaling faults that only surface past a traffic threshold, like contention in a connection pool at twenty five percent.

The key decision was a registry of versions over a dynamic flag system. A flag store needs a network lookup on every request, adding real latency at high request rates. The registry keeps routing deterministic and cheap, using a simple hash with no external call. The trade is sticky sessions, which a hashed split cannot promise.

Code references

Routing is an allowlist, not a deploy. Point one assistant's traffic at a new sub-worker to canary it; clear that URL to roll back. One env edit on the dispatcher — no Vercel redeploy.

services/agentic-sales-langgraph-dispatcher/src/route_for.py

One per-prefix sub-worker entry. url=None means the route is inactive and traffic falls through.

python

@dataclass(frozen=True)
class WorkerRoute:
    """One per-prefix sub-worker entry. ``url=None`` means inactive (fall through)."""

    prefix: str
    url: str | None
    secret: str | None
    assistants: frozenset[str]

services/agentic-sales-langgraph-dispatcher/src/route_for.py

First route whose URL is set AND whose allowlist contains the assistant wins; otherwise the FastAPI container (CORE). Flipping a URL ramps or rolls back a single assistant in isolation.

python

@dataclass(frozen=True)
class Decision:
    url: str
    token: str | None
    prefix: str  # "CLASSIFY" | "DISCOVERY" | "CORE"


def route_for(
    assistant_id: str,
    *,
    default_url: str,
    default_token: str | None,
    routes: list[WorkerRoute],
) -> Decision:
    """Pick the downstream for a /runs/wait dispatch.

    Mirrors langgraph-client.ts:104–114 — first sub-worker whose URL is set AND
    whose allowlist contains ``assistant_id`` wins; otherwise the default
    (container) route applies.
    """
    for r in routes:
        if r.url and assistant_id in r.assistants:
            return Decision(url=r.url, token=r.secret, prefix=r.prefix)
    return Decision(url=default_url, token=default_token, prefix="CORE")

10. Cost as a metric

Cost is a first class metric here, not an afterthought. Without hard limits, one buggy loop or eager retry can burn hundreds of dollars in minutes. The damage lands before anyone notices. So the system watches and enforces cost at three timescales: per request, per workflow each day, and per acquired outcome.

The first defense is a global kill switch. One environment flag halts every model call when set. It is checked at the two hubs where all calls funnel through, in both backends. Flip one variable and a cost spiral stops in seconds, with no deploy. The trade off is that you trust a single flag to fail safe.

The second layer captures cost at the source. Every model call passes through a wrapper that returns the result plus a small telemetry record. That record holds tokens, dollars, and latency. Each step rolls these up by graph, by node, and by user. So you can slice spend by any attribute, with no gaps in the billing data.

The third layer enforces a daily budget per workflow. Alerts fire and spend shuts off automatically when a workflow crosses its limit. So one runaway flow cannot starve the others. Crucially, that alert routes to the same channel as on call paging, never a quiet dashboard.

The number that truly matters is cost per acquired outcome. That means total spend divided by replies or meetings booked. The same call is cheap or costly depending on what it produces.

One failure mode anchors the design. A refine step that keeps failing validation could retry forever. The defense is a hard ceiling on token spend per request, not just a retry count. It tries twice, then falls back to the draft. The blast radius stays inside one request.

Code references

Every graph's terminal node already records token + cost totals. One annotation turns that into per-graph, per-feature, per-vertical spend on every span — with zero per-graph wiring.

apps/agentic-sales/backend/infra/cost_telemetry.py

The attribute schema: GenAI semantic-convention keys plus agentic_sales.* custom dimensions, stamped from graph_meta() so cost-by-graph / feature / vertical needs no per-graph code.

python

    gen_ai.usage.input_tokens
    gen_ai.usage.output_tokens
    gen_ai.usage.total_tokens
    gen_ai.request.model
    agentic_sales.graph            <- per-graph cost attribution dimension
    agentic_sales.feature          <- per-feature (pillar) cost attribution dimension
    agentic_sales.vertical         <- per-vertical cost attribution dimension (O40)
    agentic_sales.cost_usd
    agentic_sales.llm_latency_ms
    agentic_sales.llm_calls

apps/agentic-sales/backend/infra/cost_telemetry.py

The feature dimension is never null — unmatched graphs fall through to 'uncategorized', so a dashboard group by feature surfaces unclassified spend instead of dropping it.

python

def feature_for_graph(graph: str | None) -> str:
    """Best-effort map a graph name to its product pillar (the ``feature``
    cost-attribution dimension). Returns ``"uncategorized"`` when no prefix
    matches so the dimension is never null on a span."""
    if not graph:
        return FEATURE_UNCATEGORIZED
    name = graph.strip().lower()
    for prefix, feature in _FEATURE_PREFIXES:
        if name.startswith(prefix):
            return feature
    return FEATURE_UNCATEGORIZED

11. Prompts humans and design

A prompt has two owners. The engineer owns the structural template, the format, and the tool protocol. A non engineer owns the variable content, the persona, and the tone. The design choice splits this ownership and pins a version on every run. Non engineers edit through a user interface. Engineers edit through code review. Both edits land in one versioned artifact. The trade off is real. Pure code blocks the writers. A pure database bypasses review. Splitting ownership serves both groups, but it needs strict field rules. Each run records its prompt version, so a later trace can reconstruct exactly what ran. The main failure mode is a content edit sneaking past review. The fix restricts editable fields to variable content only, so structural changes still need engineers.

Human review is a first class part of the graph, not a side queue. Safe actions run on their own. Low confidence or policy gated actions pause for a person. The design choice is a checkpointed pause inside the graph. State persists to the checkpointer, backed by a stateless rest database. Later an API call resumes the run with the human input. Context stays intact, and the loop closes inside the system. The trade off is latency. This pattern suits decisions that must survive restarts, not reviews that need an answer within one second. The key failure mode is an orphaned checkpoint after a graph revision. You version the checkpoint schema and migrate when you read it. A safety gate also guards every send against a central do not contact list.

A platform that holds together is designed, not assembled. Seven through lines run across every layer. The run tree is the unit of debug. The checkpoint is the unit of resumability. The prompt version is the unit of provenance. Each primitive has one job, and changing one never touches the others. The user interface follows the same shape, from tokens to recipes to components to pages. The design choice favors orthogonal primitives over one giant coupled library. A small team cannot maintain sixty tangled components where every feature breaks another. The failure mode is bureaucracy. Through lines exist, but nobody uses them, and the system looks healthy while delivering nothing. The telemetry signal is the count of unused registrations. When it climbs, the abstraction has become friction.

Code references

Prompts are versioned, classifiers must return strict JSON, and untrusted text never reaches the system role — inbound-derived memory is allowed only where a human reviews before send.

apps/agentic-sales/backend/graphs/email_compose_graph.py

PROMPT_VERSION is stamped on every run and persisted alongside the email, so any prompt change is traceable to the output it produced.

python

GRAPH_NAME = "email_compose"

PROMPT_VERSION = "compose-v6-2026-05"  # bumped: stage-aware CTA + calendar link insertion (V83)

apps/agentic-sales/backend/memory/email_memory.py

Trust tiers. Only first-party sources are interpolated, and only into the user block as data. Memory distilled from raw inbound email is allowed only where a human reviews before send.

python


INJECTABLE_SOURCES = frozenset({"sent_email", "human_note", "graph_derived"})
# Sources additionally allowed only where a human reviews before send.
HUMAN_REVIEWED_SOURCES = INJECTABLE_SOURCES | {"inbound_unverified"}

apps/agentic-sales/backend/graphs/country_classify_graph.py

Classifiers are contract-first: a fixed JSON shape, no prose — so the output is parseable and scorable by the deterministic evaluators above.

python

SYSTEM_PROMPT = (
    "You are a strict location-to-country classifier. Given a LinkedIn "
    "company HQ string, return the ISO 3166-1 alpha-2 country code in "
    "uppercase, or null if the string is not a real location or you cannot "
    "determine the country with high confidence.\n\n"
    "Rules:\n"
    "- Return only valid ISO 3166-1 alpha-2 codes (e.g. US, DE, GB, FR, NL).\n"
    "- US states (full name or 2-letter code) imply US.\n"
    "- UK regions (England, Scotland, Wales, Northern Ireland) imply GB.\n"
    "- Non-English country names (Deutschland, España, Italia) map to the "
    "  standard ISO code (DE, ES, IT).\n"
    "- Strip parenthetical suffixes like '(remote-friendly)' before deciding.\n"
    "- Return null for: 'Remote', 'Worldwide', 'Global', industry strings, "
    "  empty/whitespace input, or anything you cannot confidently map.\n\n"
    "Examples:\n"
    "'San Francisco, California' → US\n"
    "'San Francisco, California, United States' → US\n"
    "'Berlin, Berlin, Germany' → DE\n"
    "'Cambridge, MA (remote-friendly)' → US\n"
    "'Karlsruhe, Baden-Württemberg' → DE\n"
    "'Courbevoie, Île-de-France' → FR\n"
    "'London, England' → GB\n"
    "'Cork, Munster' → IE\n"
    "'Remote (global)' → null\n"
    "'Technology, Information and Internet' → null\n\n"
    "Return STRICT JSON, no prose: "
    '{"country": "XX" or null, "confidence": number_between_0_and_1, '
    '"reasons": [<=2 short strings]}'
)

12. Orchestration and edge cache

Two patterns hold this platform together. The first is orchestration. The second is an edge cache. Both reuse the primitives built earlier.

The orchestration layer is a state graph. It has durable checkpointing and conditional routing. The email orchestrator shows the shape. It hydrates context, checks a safety gate, recalls memory, then decides an action. A skip reason short circuits the run early. A compose action drops into a smaller subgraph that drafts and refines. Each branch depends on a runtime value, so routing must be dynamic.

The key component is the checkpointer. It saves state after every node. Even a process kill resumes from the last completed step. This buys resilience without rerunning earlier work.

The big design decision was decomposition. The team considered one giant graph. They rejected it. The constraint was team size. A graph with fifty nodes is hard to test and review. Named phases let several engineers work in parallel. Each phase carries its own contract and evaluator. The trade off is more moving parts for far easier reasoning.

Reach for this pattern when a workflow branches and every step must survive a crash. Skip it for a simple linear pipeline. There a plain function chain is cheaper to write and debug.

The second pattern is caching at the gateway edge. Much inbound model traffic repeats work already done. Paying full token cost for answered prompts wastes money and time. Caching short circuits identical requests.

The platform once ran its own proxy worker with two cache layers. A scoped token leak forced the smarter semantic layer out, leaving only an exact match key value layer. That custom worker is now retired. Model traffic routes through the provider's native AI gateway instead, which handles caching and observability as a managed layer, so the platform carries no cache code of its own.

The trade off the old design taught still holds. An exact match layer misses paraphrases. That is fine when repeat prompts are truly identical. It fails when users rephrase the same intent. Then you need embeddings, which add inference cost and complexity.

One principle outlived the custom worker. A cache must never block real traffic, so every path fails open. A cache fault falls through to the upstream model uncached rather than failing the request. The system stays simple, cheap, and durable.

Code references

One Cloudflare Worker fronts the whole fleet: it authenticates with a constant-time hash, forwards trace headers downstream, routes per assistant — and the TS client enforces per-graph timeout budgets in one auditable place.

services/agentic-sales-langgraph-dispatcher/src/entry.py

Trace context is forwarded downstream and LangSmith ids are echoed back. The auth secret is compared as a SHA-256 via constant-time compare; the caller's token never leaves this worker.

python

VERSION = "1"
SERVICE = "agentic-sales-langgraph-dispatcher"

SECRET_VAR = "LANGGRAPH_DISPATCHER_SECRET_SHA256"


_FORWARD_INBOUND_HEADERS = (
    "traceparent",
    "tracestate",
    "baggage",
    "langsmith-trace",
    "x-trace-id",
)

# Headers echoed verbatim from downstream → caller (LangSmith cross-link).
_ECHO_RESPONSE_HEADERS = (
    "x-langsmith-run-id",
    "x-langsmith-run-url",
    "x-trace-id",
)

apps/agentic-sales/src/lib/langgraph/index.ts

Per-graph timeout budgets, centralized so the SLA is auditable in one place instead of scattered as magic numbers across wrappers.

typescript

const TIMEOUT = {
  background: 20_000, // fire-and-forget memory reflect, inbound classify
  retrieval: 30_000, // rag_retrieve (no LLM)
  default: 60_000, // single-shot LLM (gh_quick_brief, company_problems, …)
  standard: 90_000, // find_decision_maker, gh_repo_pitch
  heavy: 120_000, // enrichment + extraction + discovery LLM fan-out
  discovery: 180_000, // wide network fan-out (contact_discovery)
  thinking: 180_000, // two-pass DeepSeek thinking compose/orchestrate
} as const;

System design themes behind the LangGraph fleet

One FastAPI process boots ~35 LangGraph graphs on Render's free, memory-capped host — and stays fast and resilient. It does so not through cleverness but through a handful of system-design principles that recur at every layer. The boot mechanics below (render.yaml → backend/app.py → backend/infra/ registry.py) are just those principles made concrete.

Single source of truth. Graph identity lives in exactly one place: registry.py's GRAPHS tuple of GraphSpec dataclasses (assistant_id, dotted module, builder_attr, resumable). The FastAPI runtime imports it directly; the local langgraph dev server on :8002 reads a langgraph.json that is generated from the same tuple. One list, two runtimes, no drift — and a duplicate assistant_id fails loudly with an assert at import time rather than silently shadowing a route.

Defer the expensive work (lazy loading). Boot does almost nothing. registry.py is deliberately dependency-free — it imports none of the graphs.*_graph modules — so uvicorn app:app builds _SPECS = {assistant_id: spec} with an empty _COMPILED = {} and stops. The server knows which graphs exist but has compiled zero. Each graph is imported and compiled only on its first POST /runs/wait (via _compile_one), then cached. Expensive LLM/DB deps load lazily, per-graph, never at startup.

Contain the blast radius (fault isolation). Because compilation is per-graph and deferred, a broken or heavy graph module breaks only its own assistant_id — on the first call to that graph — and never the boot or the other 34. Failure is localized by construction, not by error handling.

Graceful degradation. Anything non-essential is allowed to be absent. Optional observability (init_langsmith(), init_otel()) is wrapped in try/except that only logs a warning, so tracing never blocks boot. The bearer-auth and trace-header middleware are no-ops when their env vars are unset, so the same code runs unchanged in pure-local dev. The system always comes up; optional capabilities light up only when their config is present.

Statelessness and idempotency. Each run gets a fresh random thread id ({"configurable": {"thread_id": str(uuid4())}}), so runs share no state and restart cleanly. Durable checkpointing is opt-in: _compile_one wires the shared D1 checkpointer only when spec.resumable is true — every other graph compiles with None, keeping checkpoint tables from blowing past the storage cap. You pay for persistence exactly where you need to resume, nowhere else.

Design for the constraint. The free, memory-capped host is not an afterthought — it shapes every choice above: light boot, lazy compile, isolated failures, opt-in persistence. And liveness is decoupled from readiness: GET /health (also /ok, /info) returns {"status":"ok","graphs":N,"compiled":M} immediately, independent of whether any heavy graph has compiled yet, so the orchestrator's health check never waits on cold work.

Uniform interface (contract-first). Every graph is reached the same way — POST /runs/wait {assistant_id, input} → flat final state — the standard LangGraph contract shared by the TypeScript client, the Render FastAPI backend, and Studio. New graphs join the fleet by adding one GraphSpec row; no new surface, no new wiring.

The global theme: the same principles — single source of truth, lazy work, isolation, graceful degradation, opt-in state, designing for the constraint — are what every guide on this site keeps returning to, from the run tree to observability to orchestration. System design is less a topic than the lens.