STAR Stories

⭐ The platform retold for the interview room: seven Situation/Task/Action/Result stories (a trade-off, a hard constraint, cross-language debugging, a deployment bottleneck, a near-miss, evals, cost), each grounded in the real registry, dispatcher, and observability source so every claim survives a follow-up.

This page turns the agentic-sales platform into STAR-method behavioral stories — Situation, Task, Action, Result — ready to deliver in an interview. Each story comes in three layers, mirroring the First Principles primer: the STAR beats you actually say, then the system-design view that survives senior-level follow-ups, then the real code the claim is grounded in (registry.py, route_for.py, langgraph/index.ts, _tracing.py, the eval suite, llm-guard.ts).

How to use it. Aim for ~90 seconds per story: one breath of Situation, one of Task, most of your time on Action, and end on a measurable Result. When the interviewer digs in, you drop one layer down — the system-design view first, the named code artifact if they keep pushing.

1 · The architectural trade-off

Answers: "Tell me about a difficult architectural decision." · "Describe a trade-off you made and how you decided." · "When did you push back on the obvious design?"

Situation. I was building a sales-automation platform that kept growing new AI workflows — lead discovery, contact and company enrichment, classification and scoring, outreach email, durable multi-touch campaigns. Each one was an independent LangGraph graph, and the fleet grew past fifty graphs.

Task. Decide the shape of the platform: one monolith with branching prompts, a microservice per workflow, or something in between — with a small team and no appetite for paying a platform tax N times over.

Action. I made failure-domain cost, not lines of code saved, the deciding axis. A monolith was smallest but collapsed every flow into one failure domain; per-graph microservices isolated cleanly but meant standing up ingress, auth, observability, and deployment N times. I chose the middle: one shared LangGraph runtime fronted by a registry, partitioned into per-capability worker pools (EMAIL, CLASSIFY, DISCOVERY — each its own worker service). Isolation lands at the pool level: a discovery graph stuck in a retry loop can degrade its own pool, but email and classification keep running.

Result. Adding a workflow became purely additive — one registry row, one builder, one route, no new service scaffolding and no cross-team coordination. The platform runs 50+ graphs on shared infrastructure, and no single graph has ever taken down another capability.

Sound bite: "I priced the options in failure-domain cost instead of code size — and that one change of axis made the answer obvious."

The system-design view. The platform decomposes into three planes with clear ownership boundaries. The control plane owns graph identity and the routing contract: every workflow has one stable assistant_id that maps to one graph definition and one expected input shape, and every call enters through one routing rule — there is no other way in. The data plane is the per-capability worker pools (EMAIL, CLASSIFY, DISCOVERY), each a separate services/agentic-sales-*-worker/, so a noisy graph's blast radius is its pool, never the platform. The observability plane stitches one user action into one distributed run tree across the TypeScript↔Python hop. The rejected alternative is explicit in the platform's forbidden patterns: "Per-graph microservices — pay the platform tax N times. Use the shared LangGraph runtime + worker pools." The accepted cost: graphs within one pool share memory and concurrency limits — but the pool is itself the blast-radius boundary, so that is a contained failure domain.

The real code. Graph identity is a frozen dataclass and one tuple in backend/infra/registry.py — to add a graph you "drop a row in GRAPHS and run make gen-langgraph-json. That is the only edit needed":

python

from dataclasses import dataclass


@dataclass(frozen=True)
class GraphSpec:
    assistant_id: str   # public id used in /runs/wait, langgraph.json, TS client
    module: str         # dotted import path, e.g. "graphs.email_compose_graph"
    compiled_attr: str = "graph"
    builder_attr: str | None = "build_graph"
    resumable: bool = False

GRAPHS: tuple[GraphSpec, ...] = (
    GraphSpec("email_compose", "graphs.email_compose_graph"),
    GraphSpec("email_opportunity", "graphs.email_opportunity_graph"),
    GraphSpec("email_reply", "graphs.email_reply_graph"),
    GraphSpec("email_outreach", "graphs.email_outreach_graph"),
    GraphSpec("campaign", "graphs.campaign_graph",
              resumable=True, builder_attr="build_campaign_graph"),
    GraphSpec("email_orchestrator", "graphs.email_orchestrator_graph"),
    GraphSpec("email_followup", "graphs.email_followup_graph", resumable=True),
    # ... 50+ rows
)

2 · Designing for a hard constraint

Answers: "Tell me about working within a tough constraint." · "How do you design for limited resources?" · "Describe optimizing a system's startup or footprint."

Situation. The Python backend — one FastAPI process serving the whole LangGraph fleet — had to run on a free, memory-capped host. Importing and compiling fifty-plus graph modules at boot would drag in every optional LLM and database dependency at once and blow the memory budget before serving a single request.

Task. Keep one process serving the entire fleet, stay inside the cap, and keep the orchestrator's health checks fast.

Action. I made boot do almost nothing. The registry is a dependency-free tuple of frozen GraphSpec dataclasses — it imports none of the graph modules, so startup builds a name→spec map with an empty compile cache and stops. Each graph is imported and compiled lazily on its first POST /runs/wait, then cached. Durable checkpointing is opt-in per graph via a resumable flag, so checkpoint tables only grow for the workflows that genuinely resume. And I decoupled liveness from readiness: GET /health returns {"status":"ok","graphs":N,"compiled":M} immediately, independent of whether any heavy graph has compiled.

Result. The process boots having compiled exactly zero graphs, health checks never wait on cold work, and a broken or heavy graph module breaks only its own assistant id on its own first call — never the boot, never the other graphs. The free host stopped being a limitation and became the design's forcing function.

Sound bite: "The constraint didn't shrink the design — it produced the design: lazy everything, isolation by construction."

The system-design view. Three principles interlock here. Defer the expensive work: the registry module is deliberately import-clean so the JSON generator can enumerate the fleet without compiling it; expensive LLM/DB deps load lazily, per graph, never at startup. Contain the blast radius: because compilation is per-graph and deferred, failure is localized by construction, not by error handling. Opt-in state: every non-resumable run gets a fresh random thread_id, so those runs share no state; the checkpointer is wired only where resumable=True (the durable campaign engine and the follow-up composer, which resume stable thread ids like campaign-<cid>-<contactId> after a SIGKILL or a days-later cron wake). Every other graph compiles with checkpointer=None, which keeps checkpoint_blobs/checkpoint_writes from blowing past the storage cap. The follow-up question this invites is "how do you know a bad row can't break routing?" The answer: a duplicate assistant_id fails loudly with an assert at import time rather than silently shadowing a route.

The real code. The constraint is written into registry.py's module docstring and enforced at import time:

python

"""Keep this module dependency-free — it must import nothing from
``agentic_sales.*_graph`` at module top level so the JSON generator can build
the registry without compiling 50+ graphs (and without dragging in optional
LLM/DB deps that some graph modules import at import time)."""


# ... runtimes would otherwise silently route to the last-registered builder.
assert len({g.assistant_id for g in GRAPHS}) == len(GRAPHS), (
    "duplicate assistant_id in GRAPHS"
)
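
The excerpt above shows the registry guard but not the lazy compile itself, so here is a minimal sketch of how that behavior could look. `LazyGraphCache` and its method names are hypothetical, the `GraphSpec` here is trimmed to the fields the sketch needs, and stdlib modules stand in for graph modules so the example runs anywhere; the real runtime may be organized differently.

```python
import importlib
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GraphSpec:
    """Trimmed stand-in for a registry row (only the fields this sketch uses)."""
    assistant_id: str
    module: str
    compiled_attr: str = "graph"


class LazyGraphCache:
    """Hypothetical helper: import each graph on first use, then cache it."""

    def __init__(self, specs: tuple[GraphSpec, ...]) -> None:
        # Boot cost is one dict build; no graph module is imported here.
        self._specs = {spec.assistant_id: spec for spec in specs}
        self._compiled: dict[str, Any] = {}

    def get(self, assistant_id: str) -> Any:
        if assistant_id not in self._compiled:
            spec = self._specs[assistant_id]  # unknown id raises KeyError
            # A broken module raises here, on its own first call only.
            module = importlib.import_module(spec.module)
            self._compiled[assistant_id] = getattr(module, spec.compiled_attr)
        return self._compiled[assistant_id]

    def health(self) -> dict[str, Any]:
        # Never triggers a compile, so it stays fast on a cold process.
        return {
            "status": "ok",
            "graphs": len(self._specs),
            "compiled": len(self._compiled),
        }


# Stand-ins: a stdlib module plays the healthy graph, a missing one the broken graph.
cache = LazyGraphCache((
    GraphSpec("healthy", "json", compiled_attr="dumps"),
    GraphSpec("broken", "module_that_does_not_exist"),
))
```

Calling `cache.get("broken")` raises `ModuleNotFoundError` for that id alone, while `cache.get("healthy")` and `cache.health()` keep working, which is the "isolation by construction" claim in miniature.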

3 · Debugging across a language boundary

Answers: "Walk me through a hard production debugging story." · "How do you make a distributed system debuggable?" · "Tell me about improving observability."

Situation. One user action crossed three layers — a TypeScript Next.js app, a routing dispatcher, and Python LangGraph workers. When a run failed, the evidence was scattered: the TS side saw a generic timeout, the Python side had spans nobody could connect to the originating request.

Task. Make one user action show up as one debuggable thing across the TypeScript↔Python hop, without adding latency or a new vendor dependency.

Action. I built the observability plane on standards both sides already understood: W3C Trace Context (traceparent/tracestate) plus LangSmith run-tree headers (langsmith-trace/baggage), injected by the TS client per request and parsed on the Python side to continue the caller's run tree — or self-root if the headers are absent. I classified failures in the client into distinct kinds — auth, timeout, cancelled, backend — so a rotated worker secret (401/403) reads differently from a model call exceeding its abort window. Exceptions are recorded onto the span before re-throwing, so even abandoned runs leave a recoverable tree. All of it is env-var-gated: unset means strict no-op, zero network calls, zero PII egress.

Result. A failure that used to mean grepping two codebases became: open the trace, walk the tree, read the failing span's assistant id. Credential issues and semantic timeouts stopped being confused for each other, and dev environments pay zero overhead because instrumentation off means off.

Sound bite: "I didn't add a tool — I made the existing hop speak one trace language, so one click answers 'what happened.'"

The system-design view. The wire protocol carries two header families that LangGraph, OTel, and LangSmith already understand, so neither side learns a proprietary format. On the TS side, buildTraceCarrier() builds the carrier fresh from the active context per call, and the same carrier is shared by the SDK onRequest hook and the raw-fetch streaming path so both join the caller's trace identically. On the worker side, parse_inbound(headers) extracts the trace id and parent run id; trace_run builds the run tree in memory and flushes children plus root in a single POST /runs/batch (or, on the durable path, POSTs a pending root and PATCHes it later, guarding against eviction mid-run). The two stores cross-link via x-langsmith-run-id response headers. Crucially, when tracing is disabled the context manager yields a no-op stub — the graph code never branches on whether tracing is on.
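
To make the worker-side half of that hop concrete, here is a minimal sketch of the "continue the caller's trace or self-root" decision for the W3C `traceparent` header. It is not the project's `parse_inbound`: the function name `parse_traceparent`, the `InboundTrace` type, and the choice to handle only version `00` are assumptions for illustration, and the LangSmith headers are left out.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Mapping, Optional

# W3C Trace Context, version 00:
#   00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class InboundTrace:
    trace_id: str
    parent_span_id: Optional[str]  # None means this worker started the trace


def parse_traceparent(headers: Mapping[str, str]) -> InboundTrace:
    """Continue the caller's trace if the header is valid, otherwise self-root.

    Assumes header names have already been lowercased by the web framework.
    """
    raw = headers.get("traceparent", "").strip().lower()
    match = _TRACEPARENT_RE.match(raw)
    if match:
        trace_id, parent_id, _flags = match.groups()
        # The spec treats all-zero trace and parent ids as invalid.
        if set(trace_id) != {"0"} and set(parent_id) != {"0"}:
            return InboundTrace(trace_id, parent_id)
    # Absent or malformed header: start a fresh trace rather than fail the run.
    return InboundTrace(secrets.token_hex(16), None)
```

Falling back to a fresh trace id on bad input keeps tracing from ever becoming a reason a run fails, which matches the "self-root if the headers are absent" behavior described in the Action beat.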

The real code. The TS side of the hop, from apps/agentic-sales/src/lib/langgraph/index.ts — and the error taxonomy callers branch on:

typescript

function buildTraceCarrier(): Record<string, string> {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);  // traceparent / tracestate
  injectLangSmithHeaders(carrier);                // langsmith-trace / baggage
  return carrier;
}

export type LangGraphErrorKind =
  | "auth"       // 401/403 — secret missing or rotated
  | "timeout"    // abort window exceeded
  | "cancelled"  // caller aborted
  | "backend";   // everything downstream
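
A non-TypeScript caller would need the same four-way split. The sketch below mirrors the taxonomy in Python; `classify_failure` and the mapping from Python exception types to kinds are assumptions for illustration, not code from the repository.

```python
import asyncio
from typing import Optional

AUTH_STATUSES = frozenset({401, 403})


def classify_failure(
    status: Optional[int] = None,
    exc: Optional[BaseException] = None,
) -> str:
    """Map an HTTP status and/or exception to one of the four error kinds."""
    # Auth is checked first: a rotated secret should never read as a timeout.
    if status in AUTH_STATUSES:
        return "auth"
    # asyncio.TimeoutError is a separate class before Python 3.11.
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return "timeout"
    if isinstance(exc, asyncio.CancelledError):
        return "cancelled"
    # Everything else is attributed to the downstream worker.
    return "backend"
```

The ordering is the design choice worth defending: checking auth before timeout is what keeps credential problems and slow model calls from being confused for each other.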

4 · Removing a deployment bottleneck

Answers: "Tell me about improving team velocity." · "Describe paying down technical debt with measurable payoff." · "When did you redesign something that was technically 'working'?"

Situation. The rule that decides which worker pool serves each workflow — the routing contract — originally lived as a WORKER_ROUTES table inside the TypeScript client (langgraph-client.ts, lines 53–114). Every routing change, even just rebalancing one workflow to a different pool, required a full frontend deploy. And no non-TypeScript caller could dispatch correctly at all.

Task. Make routing changes cheap and make the contract usable from any language, without breaking the invariant that the routing contract is the only way in.

Action. I lifted the routing decision into a standalone, pure Python module: route_for() takes an assistant id and a list of route objects and returns a decision — no I/O, no environment reads, no framework imports. Configuration (which assistants belong to which pool) moved into env vars read once at the worker's edge. Because the function is pure, I backed it with a parity test suite mirroring the original TypeScript behavior, so the lift was provably behavior-preserving.

Result. Adding or rebalancing a worker pool went from a full frontend redeploy to a one-line worker config edit. Non-TS callers could now reach the graph fleet through the same unchanged /runs/wait contract. The parity tests caught drift before it shipped, and the contract stayed single-source-of-truth.

Sound bite: "I turned a deploy-gated table into a pure function with parity tests — routing changes went from a release to a config edit."

The system-design view. Purity is the load-bearing decision. Because route_for touches no env and no network, the parity tests under tests/test_route_for.py run unchanged in plain CPython, mirroring the harness of the backend test suite it superseded. The matching rule is first-route-wins: the first sub-worker whose URL is set and whose allowlist contains the assistant_id takes the dispatch; otherwise it falls through to the default CORE route. That fall-through is also the contract's known failure mode — a graph registered in GRAPHS but missing from every allowlist silently lands on CORE, so the registry and the allowlists are a deliberate, documented coordination point. The module's own docstring states the payoff: adding a new sub-worker is "one wrangler-vars edit on the dispatcher, not a Vercel redeploy."

The real code. The whole contract, from services/agentic-sales-langgraph-dispatcher/src/route_for.py:

python

def route_for(
    assistant_id: str,
    *,
    default_url: str,
    default_token: str | None,
    routes: list[WorkerRoute],
) -> Decision:
    """Pick the downstream for a /runs/wait dispatch.

    Mirrors langgraph-client.ts:104-114 — first sub-worker whose URL is set
    AND whose allowlist contains ``assistant_id`` wins; otherwise the default
    (container) route applies.
    """
    for r in routes:
        if r.url and assistant_id in r.assistants:
            return Decision(url=r.url, token=r.secret, prefix=r.prefix)
    return Decision(url=default_url, token=default_token, prefix="CORE")
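
Because the function is pure, the fall-through failure mode can be checked mechanically. The sketch below repeats `route_for` so it runs on its own. The `WorkerRoute` and `Decision` field names are inferred from the attribute accesses in the excerpt, but their types are assumptions, and the `unrouted` helper, the example URLs, and the `classify_lead` id are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class WorkerRoute:
    prefix: str
    url: Optional[str]
    secret: Optional[str]
    assistants: frozenset


@dataclass(frozen=True)
class Decision:
    url: str
    token: Optional[str]
    prefix: str


def route_for(assistant_id, *, default_url, default_token, routes):
    # First route with a URL set and a matching allowlist wins.
    for r in routes:
        if r.url and assistant_id in r.assistants:
            return Decision(url=r.url, token=r.secret, prefix=r.prefix)
    return Decision(url=default_url, token=default_token, prefix="CORE")


def unrouted(graph_ids: Iterable[str], routes: Iterable[WorkerRoute]) -> list:
    """Registered ids that no live sub-worker claims, so they land on CORE."""
    claimed = set()
    for r in routes:
        if r.url:  # a route with no URL never wins, so its allowlist doesn't count
            claimed |= set(r.assistants)
    return sorted(set(graph_ids) - claimed)


ROUTES = [
    WorkerRoute("EMAIL", "https://email.example", "s1",
                frozenset({"email_compose", "email_reply"})),
    WorkerRoute("CLASSIFY", None, "s2", frozenset({"classify_lead"})),  # URL unset
]
```

Running `unrouted(...)` over the registry ids in CI would turn the silent CORE fall-through into a visible list, without touching the routing function itself.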

5 · The near-miss: PII on a trace span

Answers: "Tell me about a mistake or near-miss." · "Describe a time you caught a security/privacy issue." · "How do you balance debuggability against privacy?"

Situation. While instrumenting the observability plane, the obvious move was to attach model outputs to OpenTelemetry spans so traces would be self-explanatory. The catch: this is a sales platform — a generated email routinely contains names, phone numbers, and addresses. Raw outputs on searchable span attributes would have meant PII flowing to an external trace backend on every run.

Task. Keep traces debuggable without making them a PII exhaust — and make the safe pattern the default so the next engineer doesn't relearn it the hard way.

Action. I drew a hard line on what a span may carry: outputs go in run.outputs — which LangSmith ingests with full PII controls — never on searchable span attributes; prompts are referenced by version, never by text; attribute keys stay stable so cardinality can't blow up the observability bill. I wrote the rule into the observability deep-dive as a named "what you must NOT record" list, and made the egress posture structural: all instrumentation is env-var-gated, and unset means zero network and zero PII egress by definition. Because instrumentation lives in one shared tracing module rather than scattered per service, the audit surface was one file.

Result. The platform kept self-explanatory traces without putting PII on searchable span attributes, and the non-negotiable "zero PII egress from the worker" held wherever tracing is unset. The hazard is now a written, reviewable rule rather than tribal knowledge, and the shared-module design means enforcing it is one diff, where a microservice-per-flow world would have meant N audits.

Sound bite: "Debuggability and privacy stopped competing once I split the channels: searchable spans carry metadata, controlled ingestion carries content."

The system-design view. The deeper principle is channel separation by sensitivity. The OTel span store is optimized for search — attributes are indexed and broadly visible, so anything on a span should be treated as public-within-the-org metadata: ids, prompt versions, token counts. The LangSmith run store carries content (run.outputs) behind PII controls, and the worker records token usage in outputs.usage_metadata but deliberately no dollar amounts — LangSmith multiplies tokens × model rate at query time, so prices can't rot in stored spans. The cross-link between the two stores is the x-langsmith-run-id header, so an operator can pivot from a searchable span to the controlled content view in one click — full debuggability, no PII on the searchable side.

The real code. The rule as written in the observability deep-dive (specs/agentic-sales/deep-dives/observability.md, "What you must NOT record"):

Raw model output on the span. PII leak risk — a generated email might contain a phone number, address, or SSN. Outputs go in run.outputs (which gets ingested by LangSmith with full PII controls), never on searchable span attributes.

Prompt text as an attribute. Use the prompt version instead.

Per-prompt-variant attribute keys. Keep the attribute name stable, put the variation into the value. Otherwise cardinality blows up your observability bill.
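
One way to make a rule like this hard to break is an allowlist at the point where span attributes are set. The sketch below is an assumption about how that could look, not the shared tracing module itself: `split_for_tracing`, `ALLOWED_SPAN_KEYS`, and the key names are all hypothetical.

```python
from typing import Any, Mapping

# Hypothetical allowlist: ids, versions, and counts only. The key set is fixed,
# so variation goes into values and cardinality stays bounded.
ALLOWED_SPAN_KEYS = frozenset({
    "assistant_id",
    "run_id",
    "prompt_version",
    "input_tokens",
    "output_tokens",
})


def split_for_tracing(record: Mapping[str, Any]) -> tuple[dict, dict]:
    """Return (span_attributes, run_outputs) for one run record.

    Anything not explicitly allowlisted is treated as content and routed to
    the controlled channel, so a new field is private by default.
    """
    span_attrs = {k: v for k, v in record.items() if k in ALLOWED_SPAN_KEYS}
    run_outputs = {k: v for k, v in record.items() if k not in ALLOWED_SPAN_KEYS}
    return span_attrs, run_outputs
```

Defaulting unknown keys to the content side means the failure mode of forgetting to update the allowlist is a less searchable trace, not a leaked phone number.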

6 · Shipping LLM changes safely

Answers: "How do you ensure quality in ML/LLM systems?" · "Tell me about building an evaluation or testing strategy." · "How do you ship model changes without regressions?"

Situation. The platform's workflows live or die on LLM output quality — enrichment profiles, classifications, outreach drafts, generated SQL. Prompt and model changes were landing on vibes, and an improvement on average could silently wreck a critical segment.

Task. Build an evaluation lifecycle that catches regressions before customers do, including the ones averages hide.

Action. I layered defenses. First, deterministic code evaluators: pure functions following the LangSmith contract (run, example → {key, score, reason}) with no LLM calls, so the same evaluator runs identically in CI and production. Second, for the judgments code can't make, an LLM-as-judge drawn from a different model family than the model under test, so the judge can't bless its own family's mistakes. Third, a per-record drift score (completeness × freshness mapped to PSI bands) that catches input shift inline. Fourth, a CI gate that blocks any merge regressing a critical segment, not just the average. Finally, rollout goes shadow → canary with automatic per-segment rollback.

Result. Model and prompt changes became routine instead of risky: regressions surface as a blocked merge or an automatic rollback, not a customer report. The segment-level gate catches exactly the failures that averages hide, and the deterministic evaluators double as runtime safety gates.

Sound bite: "Gate on segments, judge with a different model family, keep evaluators pure — quality stopped being a vibe and became a contract."

The system-design view. The key property is evaluator purity: same input, same score, no API calls. That's what lets one evaluator serve three roles (CI gate, production monitor, and runtime guard) without drift between them. The text-to-SQL lane shows the pattern end to end: select_only_evaluator applies the same write-keyword regex as the runtime validate_sql node in the graph ("kept in sync deliberately so the evaluator catches exactly what the runtime gate catches"), and valid_sqlite_evaluator runs EXPLAIN on an in-memory SQLite connection, distinguishing "no such table" (valid syntax, missing schema) from real syntax errors. Dataset design encodes the segment thinking: the LangSmith datasets judge bands (lead-score band, propensity band) rather than raw 9-way labels, so the gate measures what the business actually branches on.
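
The "gate on segments, not the average" idea fits in a few lines. This is a sketch under stated assumptions rather than the project's CI gate: the function name `segment_gate`, the segment names, the scores, and the 0.02 tolerance are all made up for illustration.

```python
from typing import Iterable, Mapping


def segment_gate(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    critical: Iterable[str],
    tolerance: float = 0.02,
) -> list:
    """Return the critical segments whose score dropped by more than `tolerance`.

    An empty list means the merge may proceed; the overall mean is never consulted.
    """
    return [
        segment
        for segment in sorted(critical)
        if candidate[segment] < baseline[segment] - tolerance
    ]


# The candidate improves the mean but regresses the "hot" band.
baseline = {"hot": 0.90, "warm": 0.80, "cold": 0.70}
candidate = {"hot": 0.82, "warm": 0.88, "cold": 0.80}

regressions = segment_gate(baseline, candidate, critical={"hot"})
```

Here the candidate's mean score is higher than the baseline's, yet `regressions` contains `"hot"`, so an average-only gate would have let the change through while this one blocks it.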

The real code. From backend/eval/text_to_sql_evaluators.py — the pure-evaluator contract:

python

"""Both follow the LangSmith code-evaluator contract:
  (run: Any, example: Any) -> {"key": str, "score": float 0-1, "reason": str}

They are fully standalone: with LANGSMITH_TRACING unset, LangSmith upload is
a no-op; the functions themselves run pure-Python with no API calls."""

def select_only_evaluator(run: Any, example: Any = None) -> dict[str, Any]:
    sql = _extract_sql(run)
    if not sql:
        return {"key": "select_only", "score": 0.0, "reason": "empty SQL"}
    # Mirrors _WRITE_RE in text_to_sql_graph — kept in sync deliberately so
    # the evaluator catches exactly what the runtime gate catches.
    ...
    return {"key": "select_only", "score": 1.0, "reason": "SELECT-only confirmed"}
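
The companion check described above, running EXPLAIN against an empty in-memory database, can be sketched with the standard library alone. This is an illustration, not the body of `valid_sqlite_evaluator`: the function name `explain_check` and the decision to score "no such table" as a pass are assumptions.

```python
import sqlite3
from typing import Any


def explain_check(sql: str) -> dict[str, Any]:
    """Score SQL syntax by asking SQLite to plan it, without running it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"EXPLAIN {sql}")
    except sqlite3.OperationalError as exc:
        message = str(exc)
        if message.startswith("no such table"):
            # The statement parsed; only the schema is missing in the empty DB.
            return {"key": "valid_sqlite", "score": 1.0, "reason": message}
        # Anything else (for example a syntax error) fails the check.
        return {"key": "valid_sqlite", "score": 0.0, "reason": message}
    finally:
        conn.close()
    return {"key": "valid_sqlite", "score": 1.0, "reason": "EXPLAIN succeeded"}
```

Because the database is created fresh per call and nothing leaves the process, the function stays pure in the sense the story relies on: same SQL in, same score out.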

7 · Cost as a first-class metric

Answers: "Tell me about managing costs or budgets." · "How do you keep an LLM system from running away financially?" · "Describe adding guardrails to a production system."

Situation. Fifty-plus LLM workflows sharing one platform means one buggy loop — a retrying enrichment graph, a runaway campaign — can spend real money fast. Cost lived in a monthly invoice, far from the code that incurred it.

Task. Make spend observable and bounded per workflow, with a way to stop everything instantly.

Action. I promoted cost to a first-class metric alongside latency and quality: token usage is recorded per run in the shape the trace backend prices at query time, and tracked per workflow rather than per platform, with per-workflow daily budgets. Then I added a single env-gated kill switch enforced at every hub that LLM calls funnel through (the TS graph client, the direct gateway client, and the Python backend's make_llm()), so one env flip halts every model call in the platform with no code change and no redeploy. The same opt-in philosophy bounds storage: checkpointing is wired only for graphs marked resumable, so checkpoint tables can't silently grow for workflows that never resume.

Result. A runaway workflow now hits its own budget ceiling instead of the credit card; a cost incident is a one-action response (LLM_KILL_SWITCH=1) that API routes surface as a clean 503; and storage stayed inside the host's caps. Cost conversations moved from "what happened last month" to "which workflow, today."

Sound bite: "Budgets per workflow, one kill switch at every hub, persistence only where it's needed — spend became an engineering signal, not an accounting surprise."

The system-design view. Two design choices make the kill switch trustworthy. Enforce at the hubs, not the call sites: every LLM path in the app already funnels through two TS hubs (runGraph() and getAiGatewayClient()) plus the Python make_llm(), so gating those three points covers the platform by construction — including cron and CLI runs that never touch the web app. Read at call time, not import time: the flag is checked per call, so flipping the platform env var takes effect immediately without a restart. The price-rot insight matters in follow-ups too: the worker records token counts, never dollar amounts, because stored prices rot as vendor pricing changes — the trace backend multiplies tokens × current rate at query time.
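
The per-workflow daily budget is described but not shown, so here is one way it could work. Everything in this sketch is an assumption: the class name `DailyTokenBudget`, budgets denominated in tokens (chosen to match the "record tokens, never dollars" rule), and in-memory counters, where a real deployment would need a store shared across workers.

```python
from collections import defaultdict
from datetime import date


class BudgetExceeded(RuntimeError):
    """Raised when a workflow would exceed its daily token ceiling."""


class DailyTokenBudget:
    def __init__(self, limits: dict[str, int]) -> None:
        self._limits = limits
        # Keyed by (workflow, day), so a new day starts from zero automatically.
        self._used: dict[tuple[str, date], int] = defaultdict(int)

    def charge(self, workflow: str, tokens: int, today: date) -> int:
        """Record usage and return the tokens remaining for that workflow today."""
        key = (workflow, today)
        limit = self._limits[workflow]
        if self._used[key] + tokens > limit:
            # Refuse before spending, so one runaway loop stops at its own ceiling.
            raise BudgetExceeded(
                f"{workflow} would exceed {limit} tokens on {today}"
            )
        self._used[key] += tokens
        return limit - self._used[key]
```

Keying the counter by workflow is what moves the cost conversation to "which workflow, today": one graph exhausting its budget leaves every other graph's allowance untouched.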

The real code. From apps/agentic-sales/src/lib/llm-guard.ts:

typescript

/** True when LLM_KILL_SWITCH is set to a truthy value. Read at call time so
 * the switch can be flipped via the platform env without a code change. */
export function isLlmKilled(): boolean {
  const v = process.env.LLM_KILL_SWITCH;
  return typeof v === "string" && TRUTHY.has(v.trim().toLowerCase());
}

/** Throw `LlmDisabledError` when the kill switch is engaged; no-op otherwise.
 * Call at the top of every LLM hub. */
export function assertLlmEnabled(): void {
  if (isLlmKilled()) throw new LlmDisabledError();  // routes map this to 503
}
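
The Python hub would need the same read-at-call-time behavior. The sketch below is an assumption about how a guard in front of `make_llm()` could look, not the backend's actual code: the truthy value set, the function names, and the `make_llm_guarded` wrapper are hypothetical.

```python
import os

# Assumed truthy spellings; the real TRUTHY set is not shown in the excerpt.
_TRUTHY = frozenset({"1", "true", "yes", "on"})


class LlmDisabledError(RuntimeError):
    """Raised when LLM_KILL_SWITCH is engaged."""


def is_llm_killed() -> bool:
    # Read the environment on every call so a flip takes effect immediately.
    return os.environ.get("LLM_KILL_SWITCH", "").strip().lower() in _TRUTHY


def assert_llm_enabled() -> None:
    if is_llm_killed():
        raise LlmDisabledError("LLM calls are disabled by LLM_KILL_SWITCH")


def make_llm_guarded(factory):
    """Wrap a client factory so the switch is checked per call, not at import."""
    def guarded(*args, **kwargs):
        assert_llm_enabled()
        return factory(*args, **kwargs)
    return guarded
```

Wrapping the factory rather than each call site keeps the guarantee structural: any code path that obtains a client, including cron and CLI runs, passes through the check.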

Delivery notes

Lead with the axis, not the inventory. Every story above turns on one named principle — failure-domain cost, lazy work, single source of truth, channel separation, evaluator purity, opt-in state. Name it early; the detail then sounds like evidence instead of recitation.
Drop one layer at a time. STAR beats first; the system-design view when they probe; the named file and function only if they keep pushing. Going straight to code reads as rehearsed.
Survive the follow-up. Each story cites real mechanisms (GraphSpec, route_for, buildTraceCarrier, select_only_evaluator, LLM_KILL_SWITCH). If pressed, go one level deeper into the same artifact rather than sideways into a new one.
Mind the seniority dial. For senior/staff loops, spend the Result beat on second-order effects (audit surface, coordination cost, what the design made impossible); for mid-level, spend it on the concrete metric.
One story, many prompts. The trade-off story (#1) also answers "disagree and commit"; the PII story (#5) also answers "attention to detail." Map prompts to stories before the interview, not during it.