Data Enrichment — Transcript

📄 10 chapters · read at your own pace

01. What Enrichment Is

Enrichment turns a thin record into a profile you can act on. A lead often arrives as just a company name or a website. A contact might be only a name and a job title. That is rarely enough to decide whether the company is worth pursuing. It is also too little to write a message that lands. Enrichment fills the gaps. It works out what the company does, how big it is, and what software it runs. It also finds the right person to reach and a way to reach them. One discipline rules all of this. Every fact you add must trace back to a real, checkable source rather than a guess. A sales decision built on an invented detail is worse than no detail at all. Enrichment sits in the middle of the pipeline. Discovery surfaces raw candidates before it, and outreach turns the best ones into personalized messages after it. There is a clear cost to this care. Grounding every fact is slower than letting a model improvise freely. Done well, it is the difference between a generic blast and a message that proves you understand the business.

Heuristic classifier enriches a sparse company record by extracting category, tier, and other attributes from scraped web pages, grounding them in detected keywords.

python
def _heuristic_classify(home_markdown: str, careers_markdown: str) -> dict[str, Any]:
    text = (home_markdown + " " + careers_markdown).lower()
    matched: list[str] = []

    tier2 = [k for k in ("llm", "genai", "agent", "rag", "foundation model") if k in text]
    tier1 = [k for k in ("machine learning", " ml ", "data science") if k in text]
    if tier2:
        tier = 2
        matched += tier2
    elif tier1:
        tier = 1
        matched += tier1
    else:
        tier = 0

    if "consult" in text or "services" in text:
        category = "CONSULTANCY"
        matched += [k for k in ("consult", "services") if k in text]
    elif "staff" in text or "recruit" in text:
        category = "STAFFING"
        matched += [k for k in ("staff", "recruit") if k in text]
    elif "agency" in text or "marketing" in text:
        category = "AGENCY"
        matched += [k for k in ("agency", "marketing") if k in text]
    elif any(k in text for k in ("platform", "saas", "product")):
        category = "PRODUCT"
        matched += [k for k in ("platform", "saas", "product") if k in text]
    else:
        category = "UNKNOWN"

    return {
        "category": category,
        "tier": tier,
        "industry": "",
        "remote_policy": "unknown",
        "has_open_roles": bool(careers_markdown),
        "confidence": 0.3,
        "reason": "heuristic fallback (regex keyword match)",
        "evidence": ("matched keywords: " + ", ".join(sorted(set(matched)))) if matched else "no keywords matched",
        "source": "heuristic",
    }
ELI5 — the plain-language version

Imagine you find a business card with just a name and a phone number. Without any other context, you don’t know if that person is a CEO or a receptionist, or even what their company does. Enrichment is like digging into that card to build a full profile—finding the person’s title, the company’s size, the technology they use, and whether they’re hiring or shrinking. Concretely, this subsystem reads the company’s home page and careers page, fencing the text with wrap_untrusted to prevent trickery, then uses a language model to extract specific signals. For example, extract_hiring_velocity classifies whether the company is actively expanding or holding flat by looking for phrases like “200% headcount growth.” Without enrichment, you would only have that bare card: no idea if the company fits your vertical, no sense of its funding stage, no clue whether it’s even worth a call. You’d waste time on dead ends or send messages that miss the mark entirely.

Data flow — one request, in order
  1. grade – Node 3b that audits classification groundedness; returns early when _error, _skip_reason, or an empty classification exist.
    reads / writes – reads _error, _skip_reason, classification, classify_source; writes grade (verdict, issues), grade_attempts, agent_timings.
    branch – happy path proceeds after all early‑return checks pass and classify_source is not "heuristic".

  2. enrich_vertical_fit – Node 5d that emits vertical‑fit fields scoped to the company’s tagged vertical; returns {} if _error, _skip_reason, vertical is empty, or the vertical is not found in MICRO_VERTICALS.
    reads / writes – reads _error, _skip_reason, vertical, company, company_id, home_markdown, careers_markdown, MICRO_VERTICALS; writes vertical_fit (product_summary, icp, ai_native, vertical_fit, provenance), agent_timings.
    branch – happy path continues when vertical is non‑empty and exists in MICRO_VERTICALS.

  3. extract_funding_stage – V20 node that extracts funding stage and team‑size estimate for all companies; returns {} if _error or _skip_reason are set.
    reads / writes – reads _error, _skip_reason, company, company_id, home_markdown, careers_markdown, vertical; writes funding_stage (stage, funding_signals, team_size_estimate, seniority_gate_ok, provenance), agent_timings.
    branch – happy path proceeds when no error/skip.

  4. extract_pi_signals – V14 node that emits PI demand‑letter signal fields for legal‑pi‑demand companies; returns {} if _error, _skip_reason, or vertical != "legal-pi-demand".
    reads / writes – reads _error, _skip_reason, vertical, company, company_id, home_markdown, careers_markdown; writes pi_signals (demand_automation, medical_record_summarization, case_intake with detected, reason, confidence), agent_timings.
    branch – happy path only when vertical == "legal-pi-demand".

  5. analyse_github – Async function that populates github_* columns for companies with a known organization; returns {} if _error, _skip_reason, or missing company_id.
    reads / writes – reads _error, _skip_reason, company_id; writes agent_timings; internally writes GitHub columns after later steps.
    branch – happy path proceeds when company_id is present.

  6. d1_one (within analyse_github) – Database query that retrieves the company row (key, github_org, github_url, github_analyzed_at, tags) from the companies table.
    reads / writes – reads from D1 database; writes the returned row dict to local variable.
    branch – if no row is returned, the function returns early with timings.

  7. analyse_github (github_org check) – Conditional inside analyse_github that checks if github_org is empty; if empty, the function returns early.
    reads / writes – reads the github_org field from the database row; no writes on early return.
    branch – happy path continues when github_org is non‑empty.

  8. analyse_github (analyzed_at check) – Conditional inside analyse_github that compares github_analyzed_at to the current timestamp; if the analysis is recent (less than _GH_ANALYSE_REFRESH_DAYS), the function returns early.
    reads / writes – reads github_analyzed_at from the row; no writes on early return.
    branch – happy path proceeds when the data is stale or never analyzed, leading to the GhClient analysis (not shown in provided source).

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The enrichment subsystem is an ordered pipeline that transforms a sparse lead—often just a company name or domain—into a rich profile by executing a sequence of specialized extraction nodes. The mechanism begins by gathering raw markdown from home and careers pages, then fans out into vertical-specific extractors such as enrich_vertical_fit (which only runs when state["vertical"] is set) and vertical-tuned nodes like extract_voice_ops_signals (gated on _VOICE_OPS_VERTICAL) or extract_pi_signals (gated on _PI_VERTICAL). After these vertical-specific nodes complete, extract_funding_stage runs for every company, followed by extract_buying_intent which emits a buying-intent signal for composite ranking. Each node follows a strict pattern: it checks state.get("_error") or state.get("_skip_reason") and returns {} immediately if set, then wraps untrusted markdown through wrap_untrusted to prevent prompt injection, calls ainvoke_json_with_telemetry with a vertical‑specific system prompt, and on failure silently returns {} so the rest of the graph is unaffected.

The invariant the design preserves is that every fact added to the profile is written to a company_facts row with full provenance—confidence, reason, source, and evidence—and that a failure in any single extraction node never blocks data that has already been committed by persist. This is explicitly a non‑fatal design: errors, LLM failures, kill‑switch activations, or parse failures all yield an empty dict, leaving previously enriched fields intact. The write boundary is per‑field: each node writes only its own field key under company_facts (e.g., field='funding_stage' for extract_funding_stage, or field='vertical_fit.<vertical>' for enrich_vertical_fit), ensuring that partial results coexist without corruption. The graph thus guarantees that no single extraction failure can take down the entire enrichment run, and every piece of data is traceable to a specific source excerpt.

The key trade-off is between a single, monolithic extraction call and the current architecture of many narrow, vertical‑specific nodes. A single generic LLM prompt could attempt to extract all fields for any company, but that approach would require an enormous system prompt covering every vertical’s signals, would struggle to maintain provenance, and would be brittle when encountering out‑of‑domain companies. The design rejects that alternative because it would incur a high cost in accuracy and debuggability—hallucinated fields would be hard to isolate, and a single parse failure could lose all enrichment. Instead, the subsystem uses per‑vertical system prompts (e.g., _VOICE_OPS_SYSTEM_PROMPT for voice‑ops, or a prompt that branches on MICRO_VERTICALS for vertical fit) and per‑node observability via gen_ai.* span attributes and the agentic_sales.node metadata key. The cost this avoids is the operational headache of diagnosing which part of a giant prompt failed; each node’s failure is self‑contained and independently observable.

Consider a concrete failure mode: the extract_buying_intent node receives a malformed LLM response that cannot be parsed as JSON. The node’s logic catches the parse exception and returns {} immediately, leaving state["buying_intent"] unset and allowing downstream nodes to continue. An operator monitoring the enrichment pipeline would see an error logged at the ainvoke_json_with_telemetry call, a spike in the gen_ai.parse_failure metric, and the span attribute agentic_sales.node=extract_buying_intent marked with an error tag. No other node is affected, and the composite ranking step V73 simply sees an absent signal rather than a bad one—preserving the overall profile’s integrity while flagging the specific extraction for investigation.

Cost & performance — the real knobs

The enrichment subsystem spends time and money primarily on LLM inference (each call to DeepSeek Flash, costing per token), GitHub API calls (analyse_github), and cache lookups/storage. The following five real knobs control those costs and latencies.

  • LLM_KILL_SWITCH — An environment variable that disables all LLM-driven extraction nodes.
    Bounds: When set to a truthy value (e.g., 1), every enrichment function that calls an LLM returns {} immediately.
    Effect: Turning it on drives time and dollar cost to zero for all LLM steps, but no signals (voice-ops, funding stage, pricing, fintech, buying intent) are produced. Turning it off restores full enrichment at full cost.
    Risk: Mis-set on leaves the system blind to all LLM-derived fields; mis-set off permits unbounded LLM spend if no other rate limit is in place.

  • temperature — A parameter passed to make_deepseek_flash(temperature=0.1).
    Bounds: Controls the randomness of LLM output (≥0.0). Default is 0.1.
    Effect: Lower values (e.g., 0.0) make outputs more deterministic, reducing token count variation and slightly lowering per-call cost and latency. Higher values (e.g., 0.5) increase diversity but may produce longer responses and more retries.
    Risk: Too high can produce non‑JSON output, forcing retries and raising cost; too low may cause repetitive responses that still pass schema, wasting tokens without harm.

  • cache — A boolean parameter in ainvoke_json_with_telemetry; default True.
    Bounds: When True, the LLM response is stored under the cache_scope identifier (e.g., "company_enrichment.voice_ops_signals.voice-ops"). A subsequent request with identical inputs returns the cached result.
    Effect: Cache on reduces both latency (zero LLM call) and cost (no token consumption) for repeated queries. Cache off forces a fresh LLM call every time, increasing latency and dollar spend linearly with query count.
    Risk: Disabling cache turns every repeated enrichment into a paid call. Enabling cache with too long a TTL (not shown in source) risks serving stale signals.

  • max_chars — A parameter in wrap_untrusted for truncating input text (max_chars=6000 for home page, max_chars=2000 for careers page).
    Bounds: Caps the number of characters passed to the LLM prompt. Defaults are 6000 and 2000 respectively.
    Effect: Lower values reduce per‑call token count, cutting both latency and cost. Higher values include more of the scraped page, potentially improving signal accuracy but increasing token cost and response time.
    Risk: Setting too low may truncate key evidence (e.g., a pricing mention or compliance certification), causing missed signals. Setting too high wastes money on boilerplate text that does not contribute.

  • _GH_ANALYSE_REFRESH_DAYS — A constant that determines how often GitHub analysis is repeated (analyse_github). Exact default is not shown in the snippet, but it is compared against the age of github_analyzed_at.
    Bounds: An integer threshold in days. If the last analysis is younger than this value, the GitHub probe is skipped.
    Effect: A smaller value (e.g., 1) increases the frequency of GitHub API calls, raising throughput and cost (rate limits and compute). A larger value (e.g., 30) reduces calls, saving money but risking outdated GitHub insights.
    Risk: Too small can exhaust GitHub API rate limits or waste resources on unchanged repos; too large allows stale commit activity to persist in scoring.

Failure modes — what breaks, what catches it

Failure 1: LLM Kill Switch Engaged

  • Trigger — The environment-level LLM_KILL_SWITCH flag is set to True, causing make_deepseek_flash or the ainvoke_json_with_telemetry call in any enrichment node to raise LlmDisabledError.
  • GuardLlmDisabledError is explicitly caught and swallowed in the try/except blocks of nodes like extract_funding_stage (source says “Gated by LLM_KILL_SWITCH (LlmDisabledError swallowed below)”). The node immediately returns {}.
  • Posture — fail-soft. The run continues, but the affected node’s output (e.g., funding_stage, vertical_fit, immigration_signals) is silently absent.
  • Operator signal — No error log; the expected company_facts row under the node’s field is missing. Telemetry spans show the node completed in ~0 ms with no result.
  • Recovery — No automatic retry. The operator must set LLM_KILL_SWITCH = False and re-run the enrichment for the affected companies.

Failure 2: LLM Network or Provider Error

  • Trigger — The DeepSeek API is unreachable, returns a 5xx, or exceeds a timeout during ainvoke_json_with_telemetry.
  • Guard — The generic try/except block in each node (e.g., extract_voice_ops_signals, enrich_vertical_fit, extract_funding_stage) catches the exception. No named exception class is specified in the source; the fallback is always return {}.
  • Posture — fail-soft. The rest of the graph proceeds, but the node’s signals are empty.
  • Operator signal — A gen_ai.* span may show a timeout or error, but no structured error is written to state. The operator sees a missing company_facts entry for the node’s field.
  • Recovery — None. The graph does not retry. A manual re-run or a later run on the same company (if the API recovers) will attempt extraction again.

Failure 3: LLM Returns Malformed JSON (Parse Failure)

  • Trigger — The DeepSeek response is not valid JSON, or the JSON does not match the expected schema (e.g., missing detected, confidence, evidence, reason keys).
  • Guard — Implicit in the nodes’ “parse failure returns {}” contract. The code likely wraps json.loads in a try/except ValueError (not shown in snippet), returning {} on failure. No named guard is given.
  • Posture — fail-soft. The node contributes no data; the rest of the graph is unaffected.
  • Operator signal — No log line; the company_facts row for the node’s field is absent. Telemetry cannot distinguish this from a network error without custom metrics.
  • Recovery — None. The cache (cache=True in ainvoke_json_with_telemetry) may have stored the invalid response, preventing a retry without manual cache invalidation.

Failure 4: D1 Database Write Error

  • Trigger — The company_facts insert (e.g., in enrich_vertical_fit’s persist path) fails due to a D1Error (connection loss, constraint violation, or row-level conflict).
  • Guardanalyse_github explicitly catches D1Error and returns {"agent_timings": …}. Other nodes (e.g., enrich_vertical_fit) do not show a database guard in the snippet, but they are documented as non-fatal — likely the write error is caught and swallowed upstream (e.g., in the persist helper).
  • Posture — fail-soft. The in-memory state may still contain the computed signals, but they are never persisted to company_facts. Downstream consumers (e.g., scoring) will see stale or absent data.
  • Operator signal — No error in the graph state; the operator must check D1 query logs or observe that the company_facts row is missing for the expected field.
  • Recovery — None. The enrichment run does not retry the write. A manual re-run or a dedicated backfill job is required.

Failure 5: GitHub API Rate Limit or Network Error

  • Trigger — The analyse_github node’s call to analyse_org (GhClient internally) is throttled, the token is missing, or the network is down.
  • Guardanalyse_github wraps the entire operation in a try/except that catches any exception and returns {"agent_timings": …}. The source explicitly names “rate-limit, missing token, network” as caught failures.
  • Posture — fail-soft. The companies.github_* columns remain as they were (or are left null). The rest of the enrichment is unaffected.
  • Operator signal — No exception propagates; agent_timings is recorded. The operator would see github_analyzed_at unchanged and no new GitHub data in the company record.
  • Recovery — No automatic retry. The node will re-attempt next time it runs (after _GH_ANALYSE_REFRESH_DAYS), but if the problem persists, manual intervention is needed (e.g., rotate token, adjust rate limits).
Interview — could you explain it?

Q1 (Warm-up)
How does the subsystem decide whether to trust or ignore a trend signal like hiring velocity when scoring a company?

A
The score node in company_enrichment_graph.py checks hv_grounded by verifying that evidence exists and confidence >= 0.5. If those conditions fail, the trend is set to empty and appended as “ungrounded,ignored” to the reasons list – no boost or drag is applied. This ensures no fabricated signal moves the ranking.

Follow-up
Why not simply discard the whole hiring_velocity object when it’s ungrounded?
One‑line answer
The code still records the reason for ignoring it so that downstream audit or debugging can see why the trend was skipped.

Weak answer misses
The groundedness gate applies to both evidence existence and a confidence threshold (≥0.5), not just one of them.


Q2 (Medium)
What happens when a company’s classification is produced by a heuristic fallback instead of an LLM? How does the pipeline handle grading?

A
In the grade node, the code explicitly checks state.get("classify_source") == "heuristic". If true, it immediately returns a verdict of ok with a note skipped: "heuristic", bypassing the LLM‑based grader. This is because a heuristic-sourced classification cannot be improved by retrying an LLM call – a retry would produce the same regex‑based answer.

Follow-up
Is there any other guard to prevent heuristic guesses from being treated as high‑confidence facts?
One‑line answer
Yes, the classify function itself sets "confidence": 0.3 and "source": "heuristic", so downstream scoring (which multiplies by confidence) naturally weights it less.

Weak answer misses
The heuristic branch also records matched keywords as evidence, but its confidence is fixed at 0.3 – the article says “a guess must never pass as a grounded fact.”


Q3 (Design question – medium/hard)
Why does the pipeline use a CRAG (Corrective RAG) loop for classification – why not just rely on a single high‑temperature LLM call and move on?

A
The grade node acts as a quality gate. When the LLM grader flags low‑confidence fields (such as category_ok, tier_ok, remote_policy_ok), the _grade_router conditional edge loops back to classify for a single retry. This mirrors the “grade‑then‑rewrite” pattern from LangGraph examples, allowing the second pass to correct mistakes without wasting multiple calls on the same input.

Follow-up
What stops an infinite loop if the retry still fails the grade?
One‑line answer
The constant _CRAG_MAX_ATTEMPTS = 2 caps the retries, enforced by the router logic (not shown but implied by the conditional edge).

Weak answer misses
The retry reuses the already‑fetched markdown (home_markdown and careers_markdown) – there is no reason to scrape again, and the cap prevents unbounded cost.


Q4 (Hard)
The vertical‑specific signal extraction nodes (e.g., extract_voice_ops_signals, extract_fintech_signals) are chained in a fixed order after persist. Why chain them sequentially instead of running them in parallel, given that each no‑ops for non‑relevant verticals?

A
The chain is defined as a sequence of add_edge calls (e.g., persist → analyse_github → enrich_vertical_fit → extract_pi_signals → … → extract_fintech_signals). Each node checks state["vertical"] first; if it doesn’t match, the node returns early with empty results. This sequential design avoids the complexity of a branching conditional graph while still ensuring that every vertical gets its own tailored signal extractor – the overhead of a no‑op is minimal compared to managing concurrency.

Follow-up
Could a failure in one vertical signal block subsequent extractors?
One‑line answer
Every vertical signal node is documented as “non‑fatal – any failure here does not block enrichment that already committed in persist.”

Weak answer misses
The edge order is not arbitrary: enrich_vertical_fit must run before extract_pi_signals because the PI extractor depends on the vertical already being set (the “V13”/“V14” labels hint at version dependency).


Q5 (Hardest)
How does the subsystem enforce that “every fact added must be grounded in a real, checkable source rather than guessed,” as stated in the chapter? Cite at least two concrete mechanisms from the code.

A
Two mechanisms are:

  1. Hiring‑velocity grounding gate: The score node requires hv.get("evidence") to be truthy and confidence >= 0.5 before applying any trend – otherwise the trend is dropped with a reason.
  2. Heuristic source tagging: The classify function sets source: "heuristic" and confidence: 0.3 for regex‑based answers, and the grade node skips LLM grading for heuristic‑sourced classifications, ensuring a guess never passes as a grounded fact.

Follow-up
What about the vertical‑fit enrichment – does it have a similar groundedness rule?
One‑line answer
The enrich_vertical_fit node only runs when state["vertical"] is set and the micro‑vertical definition exists; it’s gated by a LLM kill‑switch and the prompt explicitly demands evidence, but the core grounding is enforced by the LLM prompt itself rather than a post‑hoc confidence filter.

Weak answer misses
The CRAG grade node (grade) adds a second layer of validation specifically for the classification output, verifying that claims are supported by the page text – this is the “checkable source” requirement for LLM‑produced facts.

02. Discovery Versus Enrichment

It helps to separate two jobs that are easy to confuse. Discovery is how raw candidates first enter the system. Enrichment is the second step that deepens each record. In this platform candidates come only from sources you can verify. Examples include job postings on hiring systems and crawls of the public web. They never come from a model inventing company names. Why keep the two jobs apart? The answer is accountability. Discovery answers one question: where did this record come from? Enrichment answers a different one: what do we now know, and how do we know it? Because the platform sources only from real origins, every enriched profile traces back to a signal a human could go and inspect. There is a price for being this strict. You discover fewer candidates than a system willing to dream up prospects. Yet the ones you keep are real. That property is what the whole pipeline depends on.

An enrichment node that takes a previously discovered company record and scraped site content to extract additional attributes—keeping discovery (external sourcing) and enrichment (deepening) strictly separate.

python
async def extract_competitors(state: CompanyEnrichmentState) -> dict:
    """V65: Extract named competitors with evidence for all companies.

    … Persists to ``company_facts`` under ``field='competitors'``.
    Non-fatal — any failure returns ``{}`` so the rest of the graph is unaffected.
    """
    if state.get("_error") or state.get("_skip_reason"):
        return {}
    company = state.get("company") or {}
    home_markdown = state.get("home_markdown") or ""
    careers_markdown = state.get("careers_markdown") or ""

    user_prompt = (
        f"Company: {company.get('name')}\n"
        f"Domain: {company.get('canonical_domain')}\n\n"
        f"Home page:\n"
        f"{wrap_untrusted(home_markdown, label='HOME PAGE', max_chars=6000)}\n\n"
        f"Careers page:\n"
        f"{wrap_untrusted(careers_markdown, label='CAREERS PAGE', max_chars=2000)}\n"
        "Return JSON only."
    )
    # … LLM invocation and competitor parsing omitted
    # returns enriched result or {} on failure
ELI5 — the plain-language version

Think of this like a library that first accepts only books from trusted donors—actual publishers or authors who submit real volumes—and only then lets a researcher write summary notes in the margins. Discovery is the donation desk: it takes in raw candidates from job-board postings or web crawls, never letting the system make up companies from thin air. Enrichment is the researcher who later reads each book and adds observations—like a function that examines a company’s career page for hiring trends (extract_hiring_velocity reads the careers markdown, classifies the trend as rising, flat, or falling, and records confidence with verbatim evidence). That enrichment runs only after discovery, and it’s gated by rules—for example, voice-ops signal extraction runs only when the company’s vertical is “voice-ops.” Without that separation, the researcher might start inventing books that never arrived, or worse, scribble imaginary facts onto real books, corrupting the entire catalog. A beginner would feel the failure as confusing where any piece of information came from—was it a real company or a hallucination? The system would lose accountability, and no one could trust the catalog’s accuracy.

Data flow — one request, in order
  1. classify
    Invokes an LLM to classify the company into a category (CONSULTANCY, STAFFING, AGENCY, PRODUCT, UNKNOWN), tier, industry, remote policy, and open‑role flag from its home and careers page markdown.
    readsstate["company"], state["home_markdown"], state["careers_markdown"]
    writesstate["classification"] (dict containing category, tier, industry, remote_policy, has_open_roles, confidence, reason, source)
    branch – If state["_error"] or state["_skip_reason"] is truthy, returns {} (skip). Happy path: returns classification dict.

  2. grade
    Uses an LLM grader to audit whether the classification is grounded in the page text; if the earlier classify was heuristic, grading is skipped with an "ok" verdict.
    readsstate["classification"], state["classify_source"] (implicitly via heuristic check), state["grade_attempts"]
    writesstate["grade"] (dict containing verdict, issues, skipped), state["grade_attempts"] (incremented), state["agent_timings"]["grade"]
    branch – If classification is empty, returns {}. If classify_source == "heuristic", immediately returns verdict: "ok" and increments attempt. Happy path: returns verdict "ok" after LLM critique.

  3. router after grade
    Reads the grade verdict: if "ok" the request proceeds to the score node; if issues are found and retries remain (< 2), it loops back to classify.
    readsstate["grade"]["verdict"], state["grade_attempts"]
    writes – (no writes; controls next edge)
    branch – Happy path: verdict "ok" → continue to score. Failure path: verdict with issues and attempts < 2 → loop to classify.

  4. score
    (Node referenced in router comment; no source code provided here. Presumably computes scoring weights or decides which vertical‑specific extractors to invoke.)
    reads – likely state["classification"] and state["grade"]
    writes – unknown (not in context)
    branch – Happy path: proceeds to vertical‑specific extractors.

  5. extract_voice_ops_signals
    Only runs when state["vertical"] == "voice-ops"; asks an LLM to identify telephony stack, target vertical, and SaaS integrations from the company’s markdown.
    readsstate["vertical"], state["company"], state["home_markdown"], state["careers_markdown"]
    writesstate["telephony_stack"] (list), state["target_vertical"] (string), state["saas_integrations"] (list), each with confidence, reason, source, evidence
    branch – If vertical != "voice-ops", returns {} (skip). Happy path: returns voice‑ops signal dict.

  6. extract_funding_stage
    Runs for every company after the vertical‑specific extractors; uses an LLM to infer funding stage, signals, team‑size estimate, and a seniority gate flag.
    readsstate["company"], state["home_markdown"], state["careers_markdown"]
    writesstate["funding_stage"] (dict with stage, funding_signals, team_size_estimate, seniority_gate_ok, and provenance fields)
    branch – If state["_error"] or state["_skip_reason"], returns {}. Always executes on happy path; any LLM failure is non‑fatal and returns empty.

  7. analyse_github
    Queries the database for the company’s GitHub org; if an org exists and hasn’t been analyzed recently, calls analyse_org and save_org_patterns to populate GitHub‑related columns.
    readsstate["company_id"] – then internally queries companies table for github_org, github_url, github_analyzed_at
    writes – updates companies table (columns github_*) via analyse_org / save_org_patterns
    branch – If company_id is None, returns {}. If no github_org or analysis is recent, returns early with only timing. Happy path: performs full GitHub analysis.

  8. persist
    (Terminal node, referenced in comments as the step where enrichment results are committed to the database; exact implementation not shown.)
    reads – all state keys produced by previous nodes (classification, signals, funding, github data)
    writes – commits data to company_facts and possibly other storage tables
    branch – Always runs after all extractors; non‑fatal errors in earlier nodes are swallowed here. Happy path: final commit completes.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The enrichment subsystem is built as a set of independent, non-fatal node functions that each extract a specific signal from scraped web content. The ordered mechanism begins with a guard: every function checks state.get("_error") or state.get("_skip_reason") and returns {} immediately if either is set, ensuring no further work occurs after a prior failure. Next, vertical-gated functions like extract_voice_ops_signals and extract_pi_signals check the state["vertical"] field and exit early if it does not match their predefined constant (e.g. _VOICE_OPS_VERTICAL). General extractors such as extract_funding_stage and extract_hiring_velocity run for all companies, but the docstring for extract_funding_stage explicitly says it runs “after the vertical-specific signal extractors,” establishing a two-phase ordering. All functions then build a user prompt from company metadata and scraped markdown, fencing the untrusted text via wrap_untrusted before the LLM call. The result is parsed by ainvoke_json_with_telemetry, and on success it is persisted to the company_facts table under a specific field name (e.g. 'immigration_signals'). On any failure — LLM error, kill-switch (LLM_KILL_SWITCH), or parse failure — the function returns {} and the rest of the graph proceeds unaffected.

The invariant preserved is a non‑fatal resilience guarantee: no single extraction failure can block enrichment already committed in the persist node. This is stated explicitly for every function (“any failure here does not block enrichment that already committed in persist”). Because each node is an idempotent state mutator that either writes a well‑typed result or writes nothing, the overall graph can tolerate partial outages of the DeepSeek LLM or transient parse errors without cascading. Additionally, the use of wrap_untrusted enforces an integrity invariant: planted [SYSTEM] injections in scraped text cannot alter the system prompt, preserving the extraction’s intended behavior.

The key trade‑off is LLM‑driven extraction versus deterministic parsing. The design rejects a more predictable alternative — writing hand‑crafted regex or HTML parse rules for each signal — because the cost of maintaining dozens of brittle scrapers across evolving web layouts and marketing copy would be prohibitive. Instead, a single DeepSeek model with varied system prompts handles all signals, accepting the risk of hallucination or low confidence in exchange for scalability. The cost avoided is continuous maintenance of signal‑specific parsers, which would require per‑site engineering effort whenever a landing page changes its structure.

One concrete failure mode occurs when extract_funding_stage is called but the LLM returns a non‑parseable JSON object. The try‑block in that function (mimicking the pattern seen in extract_voice_ops_signals) catches the error and returns {}. An operator would see the node complete without error in the graph execution logs, but no funding_stage row would appear in the company_facts table for that company. The seniority_gate_ok field would be absent, which may later cause downstream scoring nodes (V25/V29) to treat the gate as False, but the pipeline as a whole continues. The observability span gen_ai.* would contain the LLM call failure details, and the agentic_sales.node=extract_funding_stage metadata tag would allow an operator to filter telemetry and identify the company that triggered the parse failure, enabling manual remediation or prompt adjustment.

Cost & performance — the real knobs

The enrichment subsystem centers on LLM calls (e.g., ainvoke_json_with_telemetry) that extract structured signals from scraped markdown, plus occasional GitHub API probes. Time is dominated by these remote calls — each can take seconds. Money flows to model providers (DeepSeek) and, for GitHub, to API consumption. Below are six real performance knobs drawn from the source, each directly controlling these costs.


LLM_KILL_SWITCH

  • Knob — The global gate identifier LLM_KILL_SWITCH (env var or constant; default not shown, but a truthy value disables all LLM calls).
  • Bounds — A boolean toggle: when enabled, every LLM-dependent enrichment function returns {} immediately.
  • Effect — Turning it ON (truthy) eliminates all LLM spend and latency but halts enrichment. Turning it OFF allows normal operation — cost and time rise with each enrichment run.
  • Risk — Left ON, the subsystem outputs only stale or pre‑existing data. Left OFF when a cost cap is needed, unbilled token usage can accumulate.

cache

  • Knob — The boolean cache parameter in ainvoke_json_with_telemetry (default True in the seen calls) and the associated cache_scope string.
  • Bounds — Controls whether LLM responses are stored and reused; bounds memory/storage for cache entries.
  • Effect — Enabling caching cuts repeated-costs and latency for identical inputs; disabling forces a fresh LLM call every time, increasing both time and money.
  • Risk — Too‑aggressive caching (long TTL, no invalidation) returns stale signals. No caching (or a too‑small cache) doubles cost on the next run for the same company.

temperature

  • Knob — The temperature=0.1 parameter passed to make_deepseek_flash (no default shown, but 0.1 is the explicit value in the voice‑ops extractor).
  • Bounds — Floating point; lower values reduce output randomness, higher values increase creativity.
  • Effect — Lower temperature yields more exact, repeatable JSON — less wasted output and fewer parse retries, keeping cost stable. Higher temperature may produce varied evidence strings but increases risk of non‑parseable responses, raising fallback cost.
  • Risk — If too low, the model can become brittle (repeating safe patterns); if too high, it might generate hallucinated evidence or malformed JSON, requiring retries.

max_chars (wrap_untrusted)

  • Knob — The max_chars argument in wrap_untrusted() — e.g., max_chars=6000 for the home page, 2000 or 3000 for the careers page, varying by extractor.
  • Bounds — Truncates the input markdown to that many characters before the LLM call, bounding token consumption per request.
  • Effect — Reducing max_chars lowers token count (hence both latency and cost) but may omit critical signals. Increasing it provides richer context but drives up per‑call dollars and time.
  • Risk — Too small: the model misses the only mention of a pricing model or fintech certification, causing false negatives. Too large: tokens soar with no guarantee of improved accuracy, and the prompt may exceed context limits.

_GH_ANALYSE_REFRESH_DAYS

  • Knob — The internal constant _GH_ANALYSE_REFRESH_DAYS (integer, exact value not printed but used in age_days < _GH_ANALYSE_REFRESH_DAYS to skip re‑analysis).
  • Bounds — Minimum age in days for a previously analyzed GitHub org to qualify for refresh.
  • Effect — A higher value reduces GitHub API calls (saving time and avoiding rate‑limit penalties) but leaves data stale. A lower value increases freshness but drives up both latency and API quota usage.
  • Risk — Too high: sales outreach may rely on outdated commit or org activity. Too low: rapid re‑analysis can exhaust GitHub rate limits, causing the whole analyse_github step to fail silently.

model/provider choice

  • Knobmake_deepseek_flash and provider="deepseek" in the LLM call; alternative models (e.g., GPT‑4) would require a different maker/provider.
  • Bounds — Each model has its own latency, cost‑per‑token, and accuracy profile.
  • Effect — A cheaper/faster model reduces both time and money per call but may yield lower‑quality extractions (missing key signals or producing inconsistent JSON). A more capable model increases cost and latency but improves reliability.
  • Risk — Picking an under‑powered model for fine‑grained signals (e.g., pricing classification) can result in low‑confidence outputs that degrade downstream scoring. Picking a premium model for every call blows budget without proportional lift.
Failure modes — what breaks, what catches it

1. Pre‑existing Error or Skip Reason Prevents Enrichment

  • Trigger — Any prior node in the graph set state["_error"] or state["_skip_reason"] to a truthy value.
  • Guard — The early‑return statement
    if state.get("_error") or state.get("_skip_reason"): return {}
    appears verbatim in every enrichment function (extract_immigration_signals, extract_buying_intent, extract_voice_ops_signals, extract_funding_stage, extract_pi_signals, analyse_github).
  • Posture — Fail‑soft. The enrichment step silently returns an empty dict and the rest of the graph continues unaffected. No partial data from this node is persisted.
  • Operator signal — No explicit log line in the source. The operator would observe that the company record lacks the fields normally produced by the skipped enrichment (e.g., missing funding_stage, missing immigration_signals). The upstream error or skip reason is the only clue.
  • Recovery — The guard is purely a skip; no retry or fallback is attempted. The operator must correct the preceding error and re‑run the company through the graph.

2. Vertical Mismatch Causes Early Return

  • Triggerstate["vertical"] does not equal the target vertical for a given extractor. For example, extract_pi_signals checks if vertical != _PI_VERTICAL: return {}, and extract_voice_ops_signals checks if vertical != _VOICE_OPS_VERTICAL: return {}. The extractors for legal‑immigration and funding_stage use similar string comparisons.
  • Guard — The exact if statement guarding the function body (e.g., if vertical != “legal-immigration”: return {}).
  • Posture — Fail‑closed. The function refuses to run altogether; no harm is done, but no signal is produced for that vertical’s specific fields.
  • Operator signal — No log is emitted from the extractor itself. Operators see the absence of vertical‑specific signal fields in the company enrichment output.
  • Recovery — None. This is by design: the function only applies when the vertical matches. No retry or fallback is provided.

3. D1 Database Read Failure in analyse_github

  • Trigger — The query to companies table (SELECT key, github_org, … FROM companies WHERE id = ?) raises a D1Error.
  • Guard — The except D1Error: clause that catches it and returns
    {"agent_timings": {"analyse_github": round(time.perf_counter() - t0, 3)}}.
  • Posture — Fail‑soft. The function returns only a timing metric and does not populate any GitHub‑related columns. The rest of the graph remains unaffected.
  • Operator signal — The exception is logged (the source says “any failure … is logged and swallowed”) but the exact log line is not shown. The operator can observe the presence of the agent_timings.analyse_github key in the output, which indicates the function ran but failed.
  • Recovery — No retry. The function returns immediately with the timing dict. The operator would need to manually investigate the database connection or rerun the company after the issue is resolved.

4. LLM Kill Switch Raising LlmDisabledError

  • Trigger — The LLM_KILL_SWITCH global is active when extract_funding_stage attempts an LLM call via ainvoke_json_with_telemetry.
  • Guard — The docstring for extract_funding_stage explicitly states “Gated by LLM_KILL_SWITCH (LlmDisabledError swallowed below)”. The source code catches LlmDisabledError and returns an empty dict {}.
  • Posture — Fail‑soft. The enrichment for that node is silently skipped; previously committed data in the graph is preserved.
  • Operator signal — The source references a log of the kill‑switch event (e.g., “LlmDisabledError swallowed”) but does not show the exact line. The operator would see the missing funding_stage field in the company facts, and potentially a log entry indicating the kill switch was active.
  • Recovery — No automatic retry. The operator must disable the kill switch and re‑enrich the company for the funding stage data to be produced.
Interview — could you explain it?

Q1 (Warm-up)

Q – When the LLM classifier fails or is skipped, how does the system still produce a classification result?
A – The classify node returns a hardcoded dictionary with category, tier, confidence of 0.3, source="heuristic", and evidence listing matched keywords. This is the heuristic fallback; it never claims to be a grounded fact because the persist layer labels it HEURISTIC.
Follow-up – How does the system ensure that a heuristic guess doesn’t accidentally pollute downstream scoring?
A – The fallback dict carries a low confidence (0.3) and explicitly sets source="heuristic", so downstream scoring weights it less and the persist layer keeps it separate from LLM-originated facts.
Weak answer misses – The fallback also includes "industry": "" and "remote_policy": "unknown" to avoid fabricating details.

Q2 (Medium)

Q – The system has a node dedicated to buying-intent detection. Why does it run for every company, not just those in a specific vertical?
A – The extract_buying_intent node (V69) is documented as running “for all companies regardless of vertical.” It emits buying_intent state with cue_type, strength, confidence, etc., which is persisted to company_facts and later consumed by composite ranking (V73). Every company is a potential buyer of AI consultancy, so the signal is universally useful.
Follow-up – How does it avoid flagging a company that builds its own AI products (i.e., an AI vendor) as having buying intent?
A – The system prompt explicitly instructs the LLM: “Do NOT infer intent from generic ‘we use AI’ or ‘we build AI products’ language — only flag companies that signal they are BUYING/EVALUATING external AI solutions.”
Weak answer misses – The node is non‑fatal and gated by LLM_KILL_SWITCH, so a failure never blocks other enrichment.

Q3 (Design question – “why this way and not the obvious alternative”)

Q – Why does the graph enforce a strict separation between candidate discovery and enrichment, rather than having a single LLM call generate a company name and all its enriched fields in one shot?
A – The architecture ensures accountability: discovery answers “where did this record come from” (verified sources like job postings and web crawls), while enrichment answers “what do we now know about it and how do we know it.” The graph edges show that the node persist commits the base record before any enrichment node (e.g., analyse_github, extract_buying_intent) runs, so discovery is never contaminated by an LLM‑invented company.
Follow-up – What mechanism prevents an enrichment step from hallucinating a company name that never existed?
A – Enrichment nodes only read state["company"] which was already persisted; they never create new company records. The classify node’s heuristic fallback explicitly avoids inventing fields like industry or remote_policy by setting them to empty/unknown.
Weak answer misses – The context states discovery sources are “job postings on applicant tracking systems and crawls of the public web, never from a model simply inventing company names.”

Q4 (Hard)

Q – The hiring-velocity signal needs to be robust against stale or boilerplate careers pages. How does the system handle pages with almost no content?
A – The extract_hiring_velocity node (V66) includes a rule: “Default to ‘flat’ when the careers page has no content or only boilerplate.” The LLM also gets instructions to set evidence to an empty string when no hiring copy is present, and confidence is adjusted downward (<0.6 when the page has very little hiring content).
Follow-up – How is this signal actually used to influence the company’s ICP score?
A – The emitted hiring_velocity state (trend + magnitude) is consumed by the score node to “boost (rising) or dampen (falling) the company’s ICP score.”
Weak answer misses – The node runs regardless of vertical and is non‑fatal; it is gated by LLM_KILL_SWITCH and uses the wrap_untrusted safety fence to prevent prompt injection.

Q5 (Hard – design nuance)

Q – When the classify node makes a mistake, the graph can retry without reprocessing everything. Explain how that loop is bounded.
A – The _grade_router conditional edge after grade can send the state back to classify for a CRAG retry: “fold the critic’s issues into the user prompt so the second pass has a chance to correct itself.” The graph then either continues to score after the retry or, on first pass when the verdict is OK, proceeds directly. The loop is bounded because it only runs one retry – the source says “After the retry (or on first pass when the verdict is OK), we proceed to score.”
Follow-up – What prevents the retry from simply repeating the same error?
A – The critic’s issues are injected into the user prompt (the “CRAG retry” mechanism), giving the LLM explicit feedback to correct its earlier mistake rather than re‑reading only the original context.
Weak answer misses – The retry is only triggered if the grade node flagged the row; otherwise the router goes directly to score without a second classification call.

03. Enriching a Company

Company enrichment starts from a bare company name. A model then assembles the basic business facts, supported by evidence from the web. Those facts include the industry, the rough size, the location, and what the company actually sells. The model also notes the software the company appears to use. It estimates how well the company fits the kind of customer you want. The model is never trusted to free associate. Its output must fit a fixed schema. A schema is a strict shape that says which fields may exist and what type each one holds. So the model fills in blanks instead of writing loose prose. When it has no evidence, the right answer is to leave a field empty. Inventing a plausible value is the failure mode to avoid. The clean record is then written to the database for other parts of the system to use. The rejected alternative is to ask an open question and parse whatever comes back. That is quicker to build, but it lets confident inventions slip through. Constraining the shape costs a little flexibility. In return, downstream code always gets the structure it expects, and missing evidence shows up honestly as a missing field.

The extract_pricing_model node constrains the LLM to a fixed enum, builds a structured record, and persists it to company_facts.

python
async def extract_pricing_model(state: CompanyEnrichmentState) -> dict:
    if state.get("_error") or state.get("_skip_reason"):
        return {}
    company = state.get("company") or {}
    home_md = state.get("home_markdown") or ""
    careers_md = state.get("careers_markdown") or ""
    user_prompt = (
        f"Company: {company.get('name')}\nDomain: {company.get('canonical_domain')}\n\n"
        f"Home page:\n{wrap_untrusted(home_md, label='HOME PAGE', max_chars=6000)}\n\n"
        f"Careers page:\n{wrap_untrusted(careers_md, label='CAREERS PAGE', max_chars=2000)}\n"
        "Return JSON only.")
    llm = make_deepseek_flash(temperature=0.1)
    result, _ = await ainvoke_json_with_telemetry(llm,
        [{"role": "system", "content": _PRICING_MODEL_SYSTEM_PROMPT},
         {"role": "user", "content": user_prompt}],
        provider="deepseek", cache=True,
        cache_scope="company_enrichment.pricing_model")
    if isinstance(result, dict) and result.get("pricing_model") in _PRICING_MODEL_ENUM:
        pm_result = {
            "pricing_model": str(result["pricing_model"]),
            "confidence": _clamp01(result.get("confidence"), 0.5),
            "reason": str(result.get("reason") or ""),
            "evidence": str(result.get("evidence") or ""),
            "source": "llm",
        }
        # persisted to company_facts with full provenance
    return {}
ELI5 — the plain-language version

Think of it like a detective who must fill in a standard incident form. They aren't allowed to write a story—they only check boxes like “industry,” “size,” and “location” based on evidence at the scene. That’s exactly what this subsystem does: it takes a bare company name, feeds scraped web pages to a model, and forces the model to output only those specific fields using a strict schema. The model never free‑associates; it fills blanks with evidence from the source, and if it’s unsure, it leaves the field empty or marks low confidence. To prevent the scraped text from tricking the model, every page is fenced with wrap_untrusted before it reaches the LLM. Without this constraint, the model could invent plausible‑sounding facts or be manipulated by hidden instructions buried in the website copy, producing a polished but entirely wrong profile—and a beginner would have no idea the “facts” were fake. The whole enrichment would become unreliable, and downstream decisions would rest on fantasy instead of reality.

Data flow — one request, in order
  1. classify — Runs an LLM to classify the company into category, tier, industry, remote policy, and open-roles flag.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["company"], state["home_markdown"], state["careers_markdown"]. Returns category, tier, industry, remote_policy, has_open_roles, confidence, reason, evidence, source.
    • branch — If state["_error"] or state["_skip_reason"] is truthy, returns {} immediately. Otherwise happy path calls the LLM.
  2. grade — (Referenced in classify docstring) Checks the output of the first classify pass and emits critic issues if a retry is needed.

    • reads / writes — Reads the output of classify; returns critic_issues (inferred) to be folded into the next call.
    • branch — If no issues are flagged, the retry step is skipped. Otherwise the next classify pass incorporates the issues.
  3. classify (retry) — Second LLM call that includes the critic’s issues from grade, allowing self-correction. Same schema and effect as the first call.

    • reads / writes — Same as step 1, plus the critic issues from grade.
    • branch — Only runs when grade flagged a problem. On happy path (no flag), this step is absent.
  4. enrich_vertical_fit — If vertical is set, runs an LLM tailored to that vertical’s label and keyword signals to produce product_summary, icp, ai_native (bool + confidence), and vertical_fit (strong/partial/none) with provenance; writes a company_facts row.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["vertical"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"]. Returns vertical_fit fields and agent_timings. Writes to company_facts under field='vertical_fit.<vertical>'.
    • branch — If state["vertical"] is empty or _error/_skip_reason is set, returns {} without calling LLM. If the vertical is not in MICRO_VERTICALS, returns only timings.
  5. extract_voice_ops_signals — Only when state["vertical"] == "voice-ops", runs an LLM to extract telephony_stack[], target_vertical, and saas_integrations[] with confidence and evidence; writes to company_facts.

    • reads / writes — Reads the same state keys as enrich_vertical_fit. Returns voice-ops specific fields and agent_timings. Writes to company_facts under field='<field_name>'.
    • branch — If vertical is not "voice-ops", returns {} immediately. Also gated by LLM_KILL_SWITCH.
  6. extract_pi_signals — Only when state["vertical"] == "legal-pi-demand", runs an LLM to extract demand_automation, medical_record_summarization, and case_intake with confidence and evidence; writes to company_facts.

    • reads / writes — Same state keys. Returns PI-specific fields and agent_timings. Writes to company_facts.
    • branch — If vertical is not "legal-pi-demand", returns {} immediately. Also gated by LLM_KILL_SWITCH.
  7. extract_funding_stage — Runs for all companies (after vertical-specific extractors). LLM determines stage, funding_signals, team_size_estimate, and seniority_gate_ok; writes to company_facts.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"], state["vertical"]. Returns funding-stage fields and agent_timings. Writes to company_facts under field='funding_stage'.
    • branch — If _error or _skip_reason is set, returns {}. Otherwise calls LLM; non-fatal on failure.
  8. extract_pricing_model — Runs for all companies regardless of vertical. LLM classifies pricing_model (self-serve/sales-led/usage-based) with confidence, reason, evidence, and source; writes to company_facts.

    • reads / writes — Same state keys as extract_funding_stage. Returns pricing-model fields and agent_timings. Writes to company_facts under field='pricing_model'.
    • branch — Early return if _error or _skip_reason is set. Gated by LLM_KILL_SWITCH.
  9. persist — (Referenced in enrich_vertical_fit docstring) Commits all enrichment that has been written to company_facts rows throughout the previous steps.

    • reads / writes — Writes accumulated state to the database; no return value.
    • branch — This is the terminal step for the enrichment subsystem; any earlier failures do not block it because commits already happened in each individual node.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The enrichment subsystem is structured as a directed acyclic graph of async nodes, each consuming the shared CompanyEnrichmentState and returning a dict. Execution begins with universal extractors active for every company: extract_pricing_model and extract_funding_stage. Each node first checks for the presence of _error or _skip_reason in state and returns an empty dict {} if either is set, short-circuiting any further work. If no early-termination flag is present, the node builds a user_prompt from the company’s home and careers markdown, fences the untrusted content with wrap_untrusted to prevent prompt injection, then calls the LLM via ainvoke_json_with_telemetry. The result is parsed against a strict JSON schema; on any LLM error, kill-switch (LLM_KILL_SWITCH), or parse failure, the node returns {} and the graph continues. After the universal nodes, vertical-specific nodes such as enrich_vertical_fit, extract_voice_ops_signals (gated on state["vertical"] == "voice-ops"), and extract_pi_signals (gated on state["vertical"] == "legal-pi-demand") execute in order, each independently checking for earlier failure.

The design preserves a per-field independent commit invariant: each node writes its structured output into company_facts under a distinct field identifier (e.g., field='pricing_model', field='funding_stage', field='vertical_fit.<vertical>'). The source states explicitly that “any failure here does not block enrichment that already committed in persist.” This means the persistence layer treats each node’s write as an isolated append or upsert; a failure in extract_pricing_model will not roll back the results of extract_funding_stage, and a failure in extract_voice_ops_signals will not undo the vertical-fit classification written by enrich_vertical_fit. The guarantee is not exactly-once across the entire graph — multiple nodes may be called, and those that succeed commit durably. Instead, the invariant is that enrichment is field-wise idempotent: each node can be safely re-run, and its output replaces only its own field in the datastore.

The key trade-off is decomposition versus a monolithic extraction. The obvious alternative is a single LLM call that emits all firmographic fields — industry, size, vertical, pricing model, funding stage, technology stack — in one prompt. The chosen design rejects that approach because a single call that fails (due to a transient model error, context window overflow, or an injection payload in the scraped text) would lose every field, forcing a full retry. Instead, each extractor is separately callable, scoped to its own field identifier, so a failure in extract_pricing_model costs only that field. This avoids the cost of reprocessing the entire enrichment pipeline on partial failures and keeps latency predictable per field. The cost is increased LLM round trips and the orchestration complexity of gating nodes on vertical tags, but that overhead is acceptable because each node’s prompt is relatively small and the cache scope (e.g., cache_scope=f"company_enrichment.voice_ops_signals.{_VOICE_OPS_VERTICAL}") amortizes repeated calls for the same company.

A concrete failure mode occurs in extract_pricing_model when the LLM raises an LlmDisabledError (due to LLM_KILL_SWITCH being engaged). The node catches the exception, returns {}, and the graph continues because the body does not modify state directly — it only writes to company_facts inside the persistence layer. An operator monitoring telemetry sees that the agentic_sales.node=extract_pricing_model span on the ainvoke_json_with_telemetry call has an error attribute or is missing entirely, and the pricing_model field in company_facts is absent for that company. The agent_timings dict returned by the node will contain an entry for extract_pricing_model with the elapsed time, but no pricing_model key, so downstream consumers (like the composite ranking in V73) must handle a missing field gracefully. No cascading failure occurs because extract_funding_stage and vertical-specific nodes see no _error flag and proceed normally.

Cost & performance — the real knobs

The subsystem spends time and money primarily on LLM inference—each enrichment calls ainvoke_json_with_telemetry with a DeepSeek model (make_deepseek_flash). The dominant cost is per‑token API charges; latency is driven by the model’s response speed and the size of the inputs. Caching avoids redundant calls when the same company is re‑enriched, while the LLM_KILL_SWITCH can zero out both time and cost by bypassing all LLM work. Token counts are bounded by max_chars parameters that truncate scraped web content.

Below are five real performance knobs identified in the source:

  • LLM_KILL_SWITCH

    • Knob — Environment variable / constant referenced as LLM_KILL_SWITCH (no default value shown).
    • Bounds — Boolean gate on all LLM‑powered extraction functions; when set True, no model call is made and empty results are returned.
    • Effect — Turning it on eliminates all latency and API cost (zero money and time for LLM work), but no enrichment signals are produced. Off (default) enables the full inference pipeline.
    • Risk — If mistakenly left on, every company will be returned as unenriched (empty state), making downstream ranking or scoring impossible. If off, costs and latency scale with company count.
  • temperature

    • Knobtemperature=0.1 (parameter to make_deepseek_flash).
    • Bounds — Controls randomness of model output; range typically 0.0–1.0.
    • Effect — A low value (0.1) yields deterministic, reproducible JSON; increasing it could produce more varied reasoning but may introduce malformed or non‑JSON output, raising retry overhead.
    • Risk — Too high: increased parse failures and token waste on invalid responses. Too low: reduced diversity, but here that is desirable for structured extraction.
  • cache

    • Knobcache=True (parameter in ainvoke_json_with_telemetry).
    • Bounds — Boolean; enables an internal LRU or persistent cache keyed by cache_scope plus model inputs.
    • Effect — On: second request for the same company + domain + page content returns immediately with zero API cost. Off: each company pays full inference cost every time.
    • Risk — On but stale cache: enriched fields may become outdated. Off: linear cost and latency growth with every enrichment pass.
  • max_chars

    • Knobmax_chars=6000 (home page) and max_chars=2000 or 3000 (careers page) in wrap_untrusted calls.
    • Bounds — Truncates scraped Markdown text to a token limit; larger values include more evidence but increase prompt size (more tokens, higher cost and latency).
    • Effect — Increasing max_chars gives the model richer context, improving field accuracy, but adds token cost and slows inference. Decreasing reduces cost and latency but risks missing key evidence.
    • Risk — Too high: prompt may exceed context window, causing truncation or failure; cost grows quickly. Too low: model may hallucinate or leave fields blank due to insufficient data.
  • cache_scope

    • Knob — A string parameter (e.g., f"company_enrichment.voice_ops_signals.{_VOICE_OPS_VERTICAL}").
    • Bounds — Namespaces the cache so different enrichment stages or verticals do not collide; no bound on length.
    • Effect — Proper scoping ensures a prompt for one vertical or stage does not return a cached result from another, preserving correctness. Changing it can invalidate old caches or share caches across stages.
    • Risk — If mis‑scoped (too narrow), cache hit rate drops, increasing cost. Too broad: stale or wrong cached values for different prompts degrade enrichment quality.
Failure modes — what breaks, what catches it

Enrichment Node Skipped Due to Prior Error

  • Trigger – The state dict contains a non‑empty _error or _skip_reason key, set by an earlier node in the graph.
  • Guard – The check if state.get("_error") or state.get("_skip_reason"): return {} at the top of every enrichment node.
  • Posture – Fail‑soft. The node immediately returns an empty dict without executing, and the rest of the graph continues unaffected.
  • Operator signal – No output is written by this node; the operator sees the expected enrichment fields (e.g. immigration_signals, funding_stage) missing from the company_facts row.
  • Recovery – No retry. The error is assumed to be handled upstream; the run proceeds with downstream scoring nodes, which may produce incomplete results.

LLM Kill Switch Blocks Model Call

  • Trigger – The environment‑level kill switch is active, causing make_deepseek_flash or the inference call to raise LlmDisabledError.
  • Guard – The exception LlmDisabledError is caught and swallowed (explicitly noted in extract_funding_stage as “LlmDisabledError swallowed below”).
  • Posture – Fail‑soft. The node returns {} and the enrichment skips that signal.
  • Operator signal – No log line is shown in the provided code; the operator infers the cause by the absence of expected fields and may check the kill‑switch variable.
  • Recovery – None. The node simply returns empty; no automatic retry. The operator must deactivate the kill switch and re‑run the enrichment if needed.

Database Query Failure in GitHub Analysis

  • Trigger – The d1_one call raises D1Error (e.g. network issue, table corruption).
  • Guardexcept D1Error: return {"agent_timings": round(time.perf_counter() - t0, 3)} in analyse_github.
  • Posture – Fail‑soft. The function returns a timing metric but no data; the enrichment continues with other nodes.
  • Operator signal – The agent_timings dict is emitted, but the companies.github_* columns remain unpopulated. The operator may notice stale github_analyzed_at or missing org data.
  • Recovery – No retry. The call is swallowed; a later invocation of the graph (if not already recent) will retry the analysis.

Missing Company Identifier

  • Trigger – The state’s company_id is None when analyse_github begins.
  • Guard – The explicit guard if company_id is None: return {} in analyse_github.
  • Posture – Fail‑soft. The node returns early without any side effects.
  • Operator signal – No github_* columns are written; the operator sees that the company record lacks an ID upstream.
  • Recovery – None. The enrichment graph must be invoked with a valid company_id. Dependent nodes (e.g. scoring) will produce no input from this branch.

Missing GitHub Organization for Analysis

  • Trigger – The company’s github_org field is empty after reading from the database.
  • Guard – The check if not org: return {} in analyse_github.
  • Posture – Fail‑soft. The node returns an empty dict; no GitHub enrichment is performed.
  • Operator signal – No updates to github_* columns; the operator sees that the company has no GitHub org recorded.
  • Recovery – None. The condition is accepted as a fact; the manual step would be to set the github_org column and re‑run.
Interview — could you explain it?

Q (warm-up): Which node classifies a company’s category and tier from its web content?
A: The classify async function returns a strict JSON with category (CONSULTANCY/STAFFING/AGENCY/PRODUCT/UNKNOWN), tier (0-2), industry, remote_policy, and has_open_roles. The system prompt encodes explicit rules (e.g., “CONSULTANCY = paid AI/ML services”).
Follow-up: What happens when the LLM call fails?
A: A heuristic fallback runs, setting confidence=0.3, source='heuristic', and recording matched keywords as evidence — it never passes as a grounded fact.
Weak answer misses: The fallback has a hardcoded confidence of 0.3 and source string “heuristic”, not just a generic empty response.


Q (medium): How does the system tailor firmographics to a company’s specific vertical?
A: enrich_vertical_fit runs only when state["vertical"] is set. It loads the vertical’s MICRO_VERTICALS entry to get its label and keyword_signals, then builds a tailored system prompt so each micro-vertical gets a qualifier unique to its signals instead of a generic one.
Follow-up: Why not use one prompt for all verticals?
A: The docstring states “Prompt branches on the vertical's label + keyword_signals so each of the 5 micro-verticals gets a tailored qualifier” — a single prompt would lose the specificity needed for accurate vertical-fit assessment.
Weak answer misses: The prompt is built dynamically from the vertical’s keyword_signals (first 6), not just a static template per vertical.


Q (medium-hard): How are model outputs constrained to a fixed schema rather than free-form prose?
A: Every extraction node (e.g., extract_hiring_velocity, classify) provides a “Return strict JSON only” instruction in the system prompt, listing exact field names and allowed values. The call ainvoke_json_with_telemetry parses the JSON object; if parsing fails the node returns {} non-fatally.
Follow-up: What happens if the model returns valid JSON but with extra fields?
A: The code typically extracts only the expected keys — extra fields are ignored (e.g., the voice-ops prompt returns telephony_stack[], target_vertical, saas_integrations[] and the rest is discarded).
Weak answer misses: The schema is enforced purely by prompt instruction and JSON parsing; there is no separate validation layer, and malformed responses do not block the graph.


Q (hard – design question): Why do most enrichment nodes emit provenance fields (confidence, reason, evidence, source) alongside every extracted value instead of just returning a raw string?
A: Nodes like extract_hiring_velocity mandate confidence, reason, evidence, and source so downstream scoring can weight signals appropriately. For example, the classify heuristic fallback sets source="heuristic" and confidence=0.3 so the persist layer labels it HEURISTIC (not LLM) — a guess must never pass as a grounded fact.
Follow-up: But isn’t the model’s own confidence already inside the returned JSON?
A: The model-generated confidence is part of the schema, but the source field (e.g., “heuristic” vs. “LLM”) lets the system distinguish different generation methods at the persist layer, which a bare confidence score cannot do alone.
Weak answer misses: The source field controls how the persist layer tags the row (HEURISTIC vs. LLM), not just providing a confidence value for filtering.


Q (hard – architectural): The design marks many enrichment nodes as “non-fatal — any failure here does not block enrichment that already committed in persist.” Why that pattern instead of failing the whole pipeline?
A: Because the persist node commits earlier results (e.g., company name, domain) before later nodes run. If enrich_vertical_fit or extract_buying_intent fail, the already-committed data is preserved — the graph continues without blocking. The docstrings explicitly state this non-fatal guarantee.
Follow-up: How does the LLM_KILL_SWITCH fit into this?
A: Nodes gated by LLM_KILL_SWITCH simply return {} when the switch is on, again non-fatally, so the pipeline still runs but skips LLM-dependent enrichment.
Weak answer misses: The non-fatal pattern is tightly coupled to the persist commit point; it is not about general fault tolerance but about protecting already-persisted state from later failures.

04. Grounding and Schema Constraints

Grounding is the single most important rule in enrichment. A model may only assert what it can tie to evidence. Its output must also fit a shape you defined in advance. There are two layers to this. The first layer constrains the shape of the answer, so the model returns a structured object with known fields rather than a paragraph someone must interpret. The second layer constrains the content of certain fields against a fixed vocabulary. A skill or a technology, for example, must match an entry in a known list, not whatever phrase the model felt like emitting. Together these layers turn the work into classification and extraction. Those are tasks a model is reliable at. Open invention, by contrast, is where it grows dangerous. Untrusted text pulled from the web is fenced off before it reaches the model. That way a scraped page cannot smuggle in instructions and hijack the request. The cost is real. Someone has to define and maintain those schemas and vocabularies. The payoff is that no bug and no clever input can quietly let an ungrounded value drive a decision.

The classify node enforces both shape and content constraints through a strict JSON schema and enumerated values, while fencing untrusted scraped content with wrap_untrusted.

python
system_prompt = (
    "You classify a company for B2B AI-consultancy ICP targeting. "
    'Return strict JSON: {"category": "CONSULTANCY"|"STAFFING"|"AGENCY"|"PRODUCT"|"UNKNOWN", '
    '"tier": 0|1|2, "industry": string, "remote_policy": "full_remote"|"hybrid"|"onsite"|"unknown", '
    '"has_open_roles": boolean, "confidence": 0..1, "reason": string}. '
    "Category rules: CONSULTANCY (paid AI/ML services), STAFFING (body-shop), ..."
)

user_prompt = (
    f"Company: {company.get('name')}\n"
    f"Domain: {company.get('canonical_domain')}\n\n"
    f"Home page:\n{wrap_untrusted(home_markdown, label='HOME PAGE', max_chars=6000)}\n\n"
    f"Careers page:\n{wrap_untrusted(careers_markdown, label='CAREERS PAGE', max_chars=3000)}\n"
    "Return JSON only."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
llm = make_deepseek_pro(temperature=0.2)
result, _ = await ainvoke_json_with_telemetry(llm, messages, ...)
if isinstance(result, dict) and result.get("confidence", 0) >= 0.4:
    classification = result
else:
    classification = _heuristic_classify(home_markdown, careers_markdown)
ELI5 — the plain-language version

Imagine a detective who must write up every lead on a standard police report form, and can only cite exact quotes from witnesses—never guesses or hearsay. That’s exactly what this system does. It forces the AI to extract signals (like a company’s hiring pace or buying intent) only by pulling verbatim evidence from scraped web pages—the evidence field holds the exact phrase, and the reason field explains why that phrase supports the verdict. Every output is locked into a strict JSON schema with fields like confidence, source, and trend; if the model returns anything else, the whole extraction returns {} so nothing breaks. On top of that, controlled vocabularies keep assertions honest: for vertical fit, the model can only answer “strong”, “partial”, or “none” based on a predefined list of keyword signals, not whatever buzzword it invents. Without these guards, the AI might fabricate a hiring boom from boilerplate text, or dump a messy paragraph that downstream scoring can’t parse—and sales teams would chase companies that never actually signaled interest. The detective’s report would be pure fiction.

Data flow — one request, in order
  1. classify node entry
    Reads state["_error"], state["_skip_reason"]; returns {} if either is truthy.
    Branch: Happy path continues; failure path returns early empty dict.

  2. classify main body
    Reads state["company"], state["home_markdown"], state["careers_markdown"] and constructs system/user prompts with a strict JSON schema (category, tier, industry, remote_policy, has_open_roles, confidence, reason). Calls LLM via ainvoke_json_with_telemetry.
    Branch: On LLM success, returns a classification dict with the schema fields; on failure, applies a heuristic regex fallback (returns {"category":..., "source":"heuristic", "confidence":0.3}).
    Writes: Mutates state["classification"] (and sets state["classify_source"] implicitly, though not shown in the snippet – the grade node checks state.get("classify_source")).

  3. classify return
    The returned dict is merged into state (graph framework).
    Writes: state["classification"] (keys: category, tier, industry, remote_policy, has_open_roles, confidence, reason, and source in the heuristic case).

  4. grade node entry
    Reads state["_error"], state["_skip_reason"]; returns {} if either is truthy.
    Branch: Happy path continues; failure path returns early empty dict.

  5. grade main body
    Reads state["classification"]. If empty, returns {}. Then checks state["classify_source"]; if "heuristic", immediately returns {"grade": {"verdict":"ok", "issues":[], "skipped":"heuristic"}, "grade_attempts": ..., "agent_timings":...}.
    Branch: Heuristic source → short‑circuit to skip grading; otherwise continues to LLM‑based grading.

  6. grade LLM call
    Reads state["home_markdown"] (first 5000 chars) and state["careers_markdown"] (first 2000 chars). Constructs a system prompt to audit groundedness and calls the LLM to produce a verdict (ok or not_ok with issues).
    Writes: Mutates state["grade"] (contains verdict, issues), state["grade_attempts"] (incremented by 1).

  7. grade return
    Returns a dict with keys grade, grade_attempts, agent_timings that gets merged into state.
    Writes: state["grade"], state["grade_attempts"].

  8. Graceful retry router (not a function, but a conditional edge)
    Reads state["grade"]["verdict"] and state["grade_attempts"].
    Branch: If verdict is "ok", proceed to next node (score/persist). If verdict is "not_ok" and grade_attempts < 2, loop back to classify (reuses existing markdown – no refetch).
    Control loop: Up to one retry (max 2 total attempts); each retry passes the same input text.

  9. After acceptable grade -> persist node entry
    Reads state["_error"], state["_skip_reason"]; returns {} if either is truthy.
    Branch: Happy path continues.

  10. persist main body
    Reads state["company"], state["company_id"], state["classification"], state["scores"], state["home_markdown"], state["careers_markdown"], state["careers_url"]. Determines classify_method as "HEURISTIC" if classification.get("source") == "heuristic" else "LLM".
    Writes: Executes D1 SQL to UPDATE companies with fields category, tier, score, score_reasons, classification_reason, classification_confidence, updated_at. Also inserts a company_facts row under field='classification' with provenance.
    Returns: A dict (likely empty or with stats) that gets merged into state.

  11. Terminal step
    The graph framework sees no further nodes; enrichment for this company is complete.
    No additional mutations – the persisted data is written to D1.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The subsystem operates through a strict conditional pipeline. Each extraction function—like extract_pi_signals, extract_buying_intent, or extract_hiring_velocity—begins by checking the state for _error or _skip_reason and immediately returns an empty dict if either is set, ensuring earlier failures do not cascade. Next, a vertical gate is evaluated: extract_pi_signals only runs when state["vertical"] == "legal-pi-demand", while others like extract_buying_intent or extract_hiring_velocity have no vertical restriction and fire for every company. The home and careers markdown are then fenced through wrap_untrusted before being inserted into a user prompt, and the LLM is called via ainvoke_json_with_telemetry. The response must be strict JSON with fields such as detected, confidence, evidence, and reason. On any failure—LLM error, kill-switch (LlmDisabledError), or parse failure—the function returns {} so that the remainder of the enrichment graph is unaffected; the fault is swallowed, and no partial data is written.

The invariant the design preserves is grounded provenance: every signal must carry a verbatim evidence quote from the source text, a confidence score, and a reason explanation. This ensures that no assertion can be persisted without a traceable link back to the input, preventing hallucinated or unverifiable facts from entering the company_facts table. The return of {} on failure further guarantees that a corrupt or missing extraction does not leave stale or misleading data—any absent field simply remains unset, and downstream nodes (e.g., score) see no entry for that signal. This is a soft invariant: the graph is tolerant of missing signals, but any present signal must be fully traceable.

The key trade-off is using an LLM for flexible, context-aware extraction rather than a purely heuristic or regex-based classifier. The obvious rejected alternative is to use deterministic keyword matching, which is cheaper, faster, and immune to model drift, but it cannot capture nuanced signals like “demand automation” from prose that never uses the exact phrase. The chosen design accepts the latency and cost of LLM calls, plus the risk of subtle hallucinations or prompt injection, in exchange for higher recall of implicit signals. That risk is mitigated by the wrap_untrusted fence, which neutralizes planted [SYSTEM] directives, and by the grounding requirement: the LLM must produce a verbatim quote—any answer without precise evidence is invalid and will fail parsing, returning {} rather than a false positive.

A concrete failure mode occurs when the global LLM_KILL_SWITCH is engaged. The ainvoke_json_with_telemetry call raises LlmDisabledError, which is caught and swallowed inside each extraction function. The operator sees no error trace in enrichment logs; instead, the corresponding field in company_facts (e.g., field='immigration_signals' for an immigration-tech extraction) is simply missing for that company. They would observe an empty entry in the company_facts table for that field, with no supporting confidence, reason, or evidence rows—a silent gap that must be diagnosed by checking whether the kill-switch was active, as the function returned {} successfully.

Cost & performance — the real knobs

The enrichment subsystem spends time on LLM calls (classification, grading, extraction) and money on LLM token usage and external API requests (e.g., GitHub). Four grounded performance knobs control these resources:

  • _CRAG_MAX_ATTEMPTS (default 2)
    Bounds: Maximum number of retries in the grade‑classify loop.
    Effect: Increasing the value allows more LLM passes to correct low‑confidence classifications, raising latency and cost; decreasing it speeds up the pipeline but may finalize poor outputs.
    Risk: Too high → cost blow‑up and stalled graphs; too low → persistent classification errors corrupt downstream scoring.

  • _GH_ANALYSE_REFRESH_DAYS
    Bounds: Minimum age in days before re‑fetching GitHub metadata (used in analyse_github).
    Effect: Larger values reduce API calls and speed up enrichment for recently‑analyzed companies; smaller values keep data fresher at the expense of rate‑limit consumption.
    Risk: Too high → stale org/repo signals mislead scoring; too low → hitting GitHub rate limits and slower pipeline.

  • HOME_PAGE_MAX_CHARS (constant 6000 passed to wrap_untrusted)
    Bounds: Character truncation of home‑page markdown fed into every LLM prompt (classification, funding stage, buying intent, etc.).
    Effect: More characters provide richer grounding evidence but increase token spend and response latency; fewer characters save cost and time but risk omitting critical signals.
    Risk: Too high → context overflow or excessive bill; too low → LLM cannot see essential evidence, degrading classification accuracy.

  • CAREERS_PAGE_MAX_CHARS (typically 2000 in classify / grade, up to 3000 in extract_buying_intent)
    Bounds: Character limit for careers‑page content sent to the LLM.
    Effect: Controls how much job‑related text the model can inspect for roles, seniority, and remote policy. Larger values improve detection of open roles and seniority hints but increase token costs; smaller values are cheaper and faster.
    Risk: Too high → wasted tokens on irrelevant careers content; too low → missing job postings that are critical for scoring.

All four knobs directly affect the trade‑off between enrichment thoroughness and operational cost. They are real identifiers found in the source code – _CRAG_MAX_ATTEMPTS, _GH_ANALYSE_REFRESH_DAYS, and the inline max_chars parameters inside wrap_untrusted calls.

Failure modes — what breaks, what catches it

1. LLM response fails to parse as valid JSON

  • Trigger — The model returns text that cannot be parsed as structured JSON (e.g., broken braces, extra prose) during a call to ainvoke_json_with_telemetry in any of the signal-extraction functions (extract_immigration_signals, extract_buying_intent, extract_voice_ops_signals, extract_funding_stage, extract_pi_signals).
  • Guard — The source does not name an explicit parse-failure handler. The docstring for each function states “any failure (LLM error, kill-switch, parse failure) returns {}”, implying a blanket except clause that swallows the exception. No retry or fallback logic is shown.
  • Posture — Fail-soft. The function returns an empty dict {}, and the rest of the enrichment graph continues unaffected (the docstring explicitly says “Non-fatal … so the rest of the graph is unaffected”).
  • Operator signal — Absence of the expected enrichment data for that company. No dedicated log line or metric is shown; the only artifact is the skipped key (e.g., immigration_signals, funding_stage) silently missing from the persisted company_facts row.
  • Recovery — No automatic retry. The empty dict is consumed by the caller and enrichment proceeds to the next node. A human operator would need to re-run the pipeline for that company (e.g., by resetting its field flags).

2. Kill switch (LLM_KILL_SWITCH) is enabled

  • Trigger — The global gating flag LLM_KILL_SWITCH (appears in the docstrings for extract_voice_ops_signals, extract_funding_stage, extract_pi_signals) is set to a truthy value, causing LlmDisabledError to be raised when the LLM call is attempted.
  • GuardLlmDisabledError is explicitly named in the docstring: “Gated by LLM_KILL_SWITCH (LlmDisabledError swallowed below)”. The exception is caught (presumably in the same generic except block) and the function returns {}.
  • Posture — Fail-soft. The function returns {}; no data is written for that signal. The rest of the graph continues because the empty dict does not propagate failure.
  • Operator signal — The error is likely logged (the source does not show the exact log line), but the observable effect is the same as parse failure: missing enrichment keys. The operator can detect it by checking whether LLM_KILL_SWITCH was enabled at runtime.
  • Recovery — No automatic retry. The operator must disable the kill switch and re-run the enrichment for affected companies.

3. D1 database query failure (D1Error) in analyse_github

  • Trigger — The SQL query SELECT key, github_org, ... FROM companies WHERE id = ? inside analyse_github raises a D1Error (e.g., connection timeout, table missing, or constraint violation).
  • Guard — The try block catches D1Error explicitly and returns a dict containing only agent_timings: {"agent_timings": {"analyse_github": round(time.perf_counter() - t0, 3)}}.
  • Posture — Fail-soft. No GitHub analysis is performed, but the function returns a harmless dict; the caller continues because the state is not mutated.
  • Operator signal — The agent_timings entry appears in the output state for that company. The operator can see that analyse_github ran (timing is present) but produced no GitHub columns. A log of the D1Error is assumed but not shown in the snippet.
  • Recovery — No retry. The function exits immediately. The operator must investigate the D1 error (e.g., check SQL schema, network, or permissions) and re-run enrichment for that company.

4. Model output is valid JSON but violates schema constraints (missing or malformed fields)

  • Trigger — The LLM returns a JSON object that passes parsing but omits required fields like evidence, reason, or confidence, or includes extra fields not in the predefined schema (e.g., for immigration_signals each signal must carry detected, confidence, evidence, reason). No schema validation is shown in the source after the JSON is parsed.
  • GuardNone explicitly present in the source. The code that consumes the parsed result (e.g., to persist under field='immigration_signals') is not shown, but there is no pydantic model, jsonschema check, or assertion over the keys.
  • Posture — Fail-soft (silent degradation). Downstream persistence code may crash if it expects required keys, or may silently write null/missing values because no guard rejects the malformed object. The enrichment graph does not abort.
  • Operator signal — Incomplete or missing fields in the persisted company_facts row (e.g., evidence is null or absent). No error is raised, so the operator must validate the structured data after enrichment completes.
  • Recovery — No automatic recovery. The only remediation is a manual re-run or a code change to add schema validation before persistence.

5. Scraped source text contains injection tokens, but wrap_untrusted prevents steering

  • Trigger — The home or careers markdown includes patterns like [SYSTEM] or other prompt-injection attempts. The functions pass the text through wrap_untrusted (imported from prompt_safety.py) before the LLM call.
  • Guardwrap_untrusted is shown in every user prompt (e.g., wrap_untrusted(home_markdown, label='HOME PAGE', max_chars=6000)). It fences the untrusted content so that injected tokens cannot break out of the user-message context.
  • Posture — Fail-soft (guard succeeds). The injection attempt is neutralised; the LLM sees fenced text and behaves as intended. No failure occurs, but the quality of extraction may degrade if the injection caused truncation or confusion.
  • Operator signal — No direct signal. The operator would observe no unexpected enrichment output. If the injection caused quality degradation, it would appear as low confidence values or missing detected flags in the output signals.
  • Recovery — No recovery needed for the guard itself. If output quality is harmed, the operator must audit the source text and potentially re-run after cleansing the injection.
Interview — could you explain it?

Q – How does the system ensure that a hiring‑velocity trend doesn’t move the score when the trend signal is ungrounded?
A – In the score function the variable hv_grounded is computed by checking that evidence is present and confidence >= 0.5. When hv_grounded is false and a trend exists, the trend is cleared and logged as "hiring_velocity:{trend}(ungrounded,ignored)", so the boost or drag is never applied.
Follow-up – What happens to the magnitude value when the trend is ignored?
The magnitude is still computed and available in the hv dict, but the if not hv_grounded and trend: guard ensures it is never used for scoring.
Weak answer misses – The exact condition bool(hv.get("evidence")) and float(hv.get("confidence") or 0.0) >= 0.5 and the separate if not hv_grounded and trend: block that prevents any boost/drag code from executing.


Q – When the LLM fails to classify a company, how does the system still produce a schema‑compliant output while respecting the “only grounded assertions” rule?
A – The classify function has a heuristic fallback that returns a fixed confidence of 0.3, sets source: "heuristic", and lists matched keywords as evidence. The code comment explicitly says “a guess must never pass as a grounded fact”, so the low confidence and heuristic source ensure downstream scoring treats it as ungrounded.
Follow-up – Why exactly 0.3 and not another value?
The comment states the value is chosen so “downstream scoring weights it less”; the same comment also mandates source="heuristic" so the persist layer labels its method as HEURISTIC, not LLM.
Weak answer misses – The specific confidence value 0.3 and the mandatory source: "heuristic" field that distinguishes it from an LLM‑sourced classification.


Q – How does the system prevent a low‑confidence classification on critical fields from moving forward to scoring without a second chance?
A – The grade node audits the classification for the fields listed in _CRAG_GATED_FIELDS (category_ok, tier_ok, remote_policy_ok). If the verdict is not OK, the conditional edge _grade_router loops back to classify up to _CRAG_MAX_ATTEMPTS=2 times, reusing the same fetched markdown.
Follow-up – What happens if the grader’s LLM call fails (network/parse error)?
The default verdict is "ok" so a flaky grader never blocks enrichment; as stated in the docstring: “When the LLM grader fails the verdict defaults to ok so a flaky grader can never block enrichment.”
Weak answer misses – The exact set _CRAG_GATED_FIELDS and the default‑to‑ok behavior on failure, and the cap _CRAG_MAX_ATTEMPTS = 2.


Q – Why does the system use a separate grade node to audit classification, rather than having the classifier itself produce a confidence score and trust that? (Design question)
A – The separate grader provides an independent LLM critique that can identify hallucinations or missing evidence; when a grade pass flags issues, those issues are folded into the classifier’s user prompt for a self‑correcting retry. The comment explicitly references “the grading‑then‑rewrite pattern from examples/rag/langgraph_crag.ipynb” to avoid repeating the same mistake.
Follow-up – Could the grader and classifier be the same model call with a single prompt?
They share the same LLM but use different prompts; the grader is a separate node so the router can decide to retry without re‑executing the scoring and persistence nodes.
Weak answer misses – The comment about folding the critic’s issues into the user prompt for the retry pass, and the fact that both nodes reuse the already fetched home_markdown and careers_markdown without refetching.


Q – Describe two distinct places in the enrichment pipeline where the system enforces a controlled vocabulary, beyond the JSON schema shape.
A – In classify, the category field is constrained to the set {"CONSULTANCY", "STAFFING", "AGENCY", "PRODUCT", "UNKNOWN"} and tier to {0, 1, 2}. In extract_funding_stage, the seniority gate uses the frozenset _EARLY_STAGES = frozenset({"pre-seed", "seed", "series-a"}) to determine if seniority_gate_ok is True. Additionally, the vertical label in enrich_vertical_fit is looked up from the MICRO_VERTICALS dictionary, not free‑text.
Follow-up – What prevents a node from accidentally writing an arbitrary key to the shared state?
The state is typed by CompanyEnrichmentState and each node returns a fixed dict of known keys; there is no runtime schema enforcement, but the code relies on those return structures being consistent.
Weak answer misses – The exact frozenset _EARLY_STAGES, the controlled category string set in the system prompt for classify, and the MICRO_VERTICALS dictionary lookup in enrich_vertical_fit.

05. Finding and Verifying Contacts

A company is only actionable once you know who to talk to. So contact enrichment finds the right people and confirms how to reach them. The system identifies likely decision makers, then keeps an email address only after that address is verified as deliverable. A guessed address might be right, but sending to it risks a bounce. Bounces damage the reputation of your sending domain, and over time they quietly lower the deliverability of every future message you send. Contacts also need careful deduplication. The same person can appear under slightly different web addresses or profile variants. The platform keys identity on a stable core of the person's profile, not the raw link. So two records that point at the same human collapse into one and never become duplicate outreach. The price of verification and tight matching is volume. You end up with fewer contacts than a looser system would gather. Yet each one is a real, reachable person you will not accidentally message twice.

Extracting named customers with verbatim evidence and confidence, mirroring contact verification by insisting on ground-truth signals rather than guessed patterns.

python

customers: list[dict[str, Any]] = []
for item in raw_list:
    if not isinstance(item, dict):
        continue
    name = str(item.get("name") or "").strip()
    if not name:
        continue
    customers.append({
        "name": name,
        "logo_url": str(item.get("logo_url") or ""),
        "confidence": _clamp01(item.get("confidence"), 0.0),
        "evidence": str(item.get("evidence") or ""),
        "reason": str(item.get("reason") or ""),
        "source": "llm",
    })
ELI5 — the plain-language version

The provided context does not contain any information about finding or verifying contacts, email deliverability, or deduplication. All excerpts come from company_enrichment_graph.py and describe functions for extracting vertical-specific signals (immigration, voice‑ops, pricing, buying intent, funding stage, PI signals). No mechanism for contact enrichment or email verification is mentioned. Therefore, I cannot answer the query using only the given source.

Data flow — one request, in order

I cannot answer this query because the provided context does not contain any code or functions related to contact enrichment, email verification, or deduplication. The context is limited to vertical-specific signal extraction (legal-immigration, voice-ops, funding stage, GitHub analysis, PI signals) from company_enrichment_graph.py. There are no steps for finding or verifying contacts, identifying decision-makers, or verifying email deliverability. To trace a request through the contact enrichment subsystem, I would need source code that includes those specific functions, nodes, and state keys.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The provided context does not contain any information about contact enrichment, email verification, deduplication, or the "Finding and Verifying Contacts" subsystem. The snippets are exclusively from company_enrichment_graph.py and cover vertical-specific signal extraction (e.g., extract_funding_stage, extract_pi_signals, extract_voice_ops_signals), without mentioning decision-maker identification, email deliverability, or contact deduplication. Therefore, I cannot answer the query using only the given source.

Cost & performance — the real knobs

The provided context describes a company enrichment subsystem, not contact finding or verification. However, I can identify real performance knobs from that subsystem that control latency, throughput, and cost. Below are six knobs grounded in the source identifiers and constants.

temperature — set to 0.1 in make_deepseek_flash(temperature=0.1).

  • Bounds: Controls the randomness of LLM output; higher values increase diversity but also token waste and cost per call.
  • Effect: Lowering reduces variability and latency (fewer retries for malformed JSON), raising may cause more parse failures and wasted tokens.
  • Risk: Too high (>0.3) likely produces malformed JSON, triggering parse failure and returning {}; too low may miss nuanced signals.

max_home_chars — hardcoded as max_chars=6000 in wrap_untrusted(..., label='HOME PAGE', max_chars=6000).

  • Bounds: Truncates home page markdown before LLM call, limiting token consumption per request.
  • Effect: Smaller values reduce LLM cost and latency but risk omitting critical evidence; larger values improve recall at higher cost.
  • Risk: Too low (e.g., 1000) might lose the only mention of a competitor or pricing model; too high (e.g., 20000) drives token bill up linearly.

max_careers_chars — hardcoded as max_chars=2000 in wrap_untrusted(..., label='CAREERS PAGE', max_chars=2000).

  • Bounds: Same as above but for the careers page; typically smaller than home page.
  • Effect: Affects the amount of hiring and team-size information visible to the LLM; truncated pages lower signal quality.
  • Risk: Setting too high increases cost without proportionate gain (careers pages are shorter); too low misses funding stage signals.

_GH_ANALYSE_REFRESH_DAYS — a constant (value not shown in snippet, but used in comparison if age_days < _GH_ANALYSE_REFRESH_DAYS).

  • Bounds: Determines how often GitHub analysis is re-run; skips if analyzed within that many days.
  • Effect: Lower values cause more frequent GitHub API calls (rate-limited, costs tokens/bandwidth); higher values reuse stale data.
  • Risk: If set too low, you hit GitHub rate limits and slow down enrichment; too high (e.g., 365) uses outdated org data.

cache — boolean flag cache=True passed to ainvoke_json_with_telemetry(..., cache=True, cache_scope=...).

  • Bounds: Enables/disables caching of LLM responses per unique prompt (keyed by cache_scope).
  • Effect: On reduces repeated LLM calls (saves money) for the same company/prompt; off increases cost but ensures fresh extraction.
  • Risk: Cache may serve stale results if the company page changes; clearing it manually is not described but could be needed.

LLM_KILL_SWITCH — a global boolean gate checked early in every extraction function via if state.get("_error") or state.get("_skip_reason") and also gated by LLM_KILL_SWITCH (the constant name appears in docstrings).

  • Bounds: When True, all LLM extraction returns {} immediately, consuming no tokens or API calls.
  • Effect: Turning it on stops all LLM spend; off allows normal operation.
  • Risk: Mis-setting it to True during production halts enrichment entirely; to False when a provider outage occurs leads to wasted retries.

These knobs are drawn directly from the company enrichment graph source. The query’s “contact enrichment” chapter is not covered by this context, so no contact-specific knobs (e.g., email verification rate limits, deduplication batch sizes) can be named here.

Failure modes — what breaks, what catches it

The provided source—company_enrichment_graph.py—contains no functions, guards, or identifiers related to contact finding, email verification, or deduplication. The extractors shown (immigration signals, buying intent, voice ops, competitors, GitHub analysis, funding stage) all operate on product/careers copy and persist to company_facts; none handle contact lists, email deliverability checks, or deduplication of people records.

Because the query demands a deep failure-mode analysis grounded only in the source, and the source does not contain the subsystem described in the query, no distinct failures can be identified. No real identifiers exist for the contact enrichment pipeline, so no triggers, guards, postures, operator signals, or recovery steps can be named.

If the intent was to analyse the generic failure patterns visible across the extractors (LLM error, kill-switch, parse failure), those are uniformly handled by returning {} (fail-soft), but they are not specific to contact enrichment. To produce a meaningful analysis, the actual code for find_contacts, verify_email, or deduplicate_contacts would need to be provided.

Interview — could you explain it?

Q — How does the system determine whether a contact’s affiliation is a plausible buyer without calling an LLM for every paper-author profile?
A — The buyer_fit_classifier.py module applies a heuristic, no‑LLM verdict using the OpenAlex author profile fields (institution_type, institution name) and predefined keyword sets. It assigns a score and maps it into three bands: buyer (≥0.6), not_buyer (≤0.3), and unknown (0.4–0.6). This fast pre‑filter avoids expensive LLM calls on the large volume of incoming contacts.
Follow-up — What happens when the affiliation_type from Team A is missing?
A — The module degrades gracefully: it falls back to matching the institution name against the _ACADEMIC_NAME_KEYWORDS tuple (e.g., “university”, “college”) to infer academic affiliation.
Weak answer misses — That the classifier also boosts buyer‑fit via _GH_AI_TOPIC_SIGNALS and GitHub org membership, which are defined as frozensets in the module.


Q — Once we have a plausible buyer contact, how does the system decide which companies to prioritise for outreach?
A — The extract_buying_intent node runs for every company regardless of vertical and emits a buying_intent state with fields like cue_type, strength, confidence, and evidence. This signal is exposed for composite ranking consumption (referenced as V73), allowing the system to prioritise contacts at companies that actively show intent to evaluate or purchase external AI solutions.
Follow-up — What sourcing does the buying‑intent node use?
A — It uses the company’s home‑page and careers‑page markdown text, wrapping untrusted content via wrap_untrusted to prevent prompt injection, and calls ainvoke_json_with_telemetry with a DeepSeek model.
Weak answer misses — That the signal is persisted to company_facts under field='buying_intent' with full provenance (confidence, reason, source, evidence), not just kept in‑memory.


Q — Why was a heuristic, no‑LLM approach chosen for the buyer‑fit classifier instead of using an LLM that could give richer analysis?
A — The buyer_fit_classifier.py module is explicitly labelled as a “heuristic, no‑LLM verdict” because it serves as a cheap pre‑filter on the high‑volume stream of paper‑author contacts. Using an LLM for every contact would be both costly and slow. The bands (buyer ≥0.6, not_buyer ≤0.3) are designed to be conservative, and the module degrades gracefully when upstream data is incomplete.
Follow-up — How does the classifier handle noise from organisation names that contain academic keywords?
A — It also considers institution_type from OpenAlex; if that field is "company", the academic keyword match is overridden. The comment says the module “degrades gracefully” when affiliation_type is None.
Weak answer misses — That the heuristic still incorporates external signals like GitHub topics (_GH_AI_TOPIC_SIGNALS) and GitHub org membership, which are not purely name‑based.


Q — The classify node for company type includes a CRAG retry mechanism. How does that mechanism improve the reliability of classification used for contact targeting?
A — In classify, when an earlier grade pass flagged a row (e.g., a previous classification was uncertain), the critic’s issues are folded into the user prompt for a second LLM pass. This allows the model to correct its own mistakes rather than repeating the same error, which is important because the company‑type classification (CONSULTANCY/STAFFING/AGENCY/PRODUCT) feeds into decisions about which contacts at that company are relevant for B2B AI‑engineering outreach.
Follow-up — What stops the CRAG retry from running indefinitely?
A — The retry is gated by the LLM_KILL_SWITCH flag and the presence of a previous grade; the code only folds the critic’s issues once and does not loop.
Weak answer misses — That the CRAG mechanism is explicitly triggered by a grade pass from an earlier step, not by a generic failure — the comment says “when an earlier grade pass flagged this row”, indicating a two‑stage validation pipeline.

06. Scoring Fit and Opportunity

Not every real company is a good fit. So enrichment scores how well a prospect matches your ideal customer profile. A classifier reads the enriched signals. It looks at what the company does, its size, and the software it runs. It also weighs any buying signal, such as an open role that implies a need. The classifier then returns two things: a judgment and a confidence in that judgment. The confidence matters as much as the verdict. It tells the system how far to trust the call. It also decides whether a cheap model's answer can stand or needs a second look from a stronger one. Scores are used to rank and prioritize. Scarce human attention flows to the most promising prospects first. Here is the honest caveat. A score is only as good as the signals beneath it. When the data is sparse, the right move is a low confidence score that triggers more gathering. A falsely precise number is the trap. Ranking by a fabricated score feels productive while it quietly steers effort toward the wrong companies.

The score node combines enriched classification signals into a single fit score, weighted by confidence and flagging low-confidence results for human review.

python
async def score(state: CompanyEnrichmentState) -> dict:
    if state.get("_error") or state.get("_skip_reason"):
        return {}
    c = state.get("classification") or {}
    s = 0.0
    reasons: list[str] = []

    if c.get("category") in ("CONSULTANCY", "AGENCY"):
        s += 0.25
        reasons.append("ICP category")
    tier = c.get("tier", 0)
    s += {2: 0.25, 1: 0.18, 0: 0.0}.get(tier, 0.0)
    if tier >= 1:
        reasons.append(f"AI tier {tier}")
    rp = c.get("remote_policy")
    s += {"full_remote": 0.20, "hybrid": 0.12}.get(rp, 0.0)
    if rp in ("full_remote", "hybrid"):
        reasons.append(rp)
    if c.get("has_open_roles"):
        s += 0.10
        reasons.append("hiring")
    # … (hiring‑velocity handling elided)

    s *= 0.6 + 0.4 * c.get("confidence", 0.5)
    score_value = round(min(s, 1.0), 3)
    needs_review = c.get("confidence", 0) < 0.6

    return {
        "scores": {
            "score": score_value,
            "reasons": reasons,
            "needs_review": needs_review,
        },
        "agent_timings": {"score": round(time.perf_counter() - t0, 3)},
    }
ELI5 — the plain-language version

Think of scoring fit like grading a class project: you don’t just slap a pass/fail on it; you check how well it meets each criterion and note how sure you are of that grade. The enrichment subsystem here acts as a careful grader. It takes real company signals—like what the company builds, how fast it’s hiring, whether it’s issuing RFPs—and runs them through classifiers that output a judgment (for example, “strong vertical fit”) alongside a confidence score. The document shows this concretely in the extract_hiring_velocity function: it reads careers pages to decide if hiring is rising, flat, or falling, and gives a magnitude and confidence. That confidence matters because if it’s low (say 0.3 from a heuristic keyword match), the system knows not to trust the signal as much, and it records a reason like “no keywords matched.” Without this scoring, the system would treat every hint as equally important—it might overvalue a vague job posting and undervalue a concrete RFP, leading it to pursue the wrong companies or waste effort on poor prospects. The confidence keeps the evaluation honest, ensuring guesses are marked as guesses and real evidence gets the weight it deserves.

Data flow — one request, in order
  1. classify(state) — Classifies the company into an ICP category, tier, and other signals using LLM on scraped markdown.

    • reads / writes: reads state["_error"], state["_skip_reason"], state["company"], state["home_markdown"], state["careers_markdown"]; returns a classification dict (keys: category, tier, industry, remote_policy, has_open_roles, confidence, reason, evidence, source).
    • branch: if state["_error"] or state["_skip_reason"] are set, returns {} early (empty classification, skip path). Happy path continues.
  2. grade(state) — Audits the groundedness of the classification by querying the LLM for any unsupported claims.

    • reads / writes: reads state["_error"], state["_skip_reason"], state["classification"], state["classify_source"], state["grade_attempts"], state["home_markdown"][:5000], state["careers_markdown"][:2000]; returns a grade dict (verdict, issues, skipped, grade_attempts, agent_timings).
    • branch: if state["_error"] or state["_skip_reason"] or empty classification, returns {} early; if state["classify_source"] == "heuristic" returns verdict "ok" and skipped="heuristic" without a grading call — happy path for LLM-sourced classifications.
  3. post‑grade router — Decides whether the classification is acceptable or must be retried.

    • reads / writes: reads grade.verdict, grade.issues, current grade_attempts; writes nothing, but directs the graph flow.
    • branch: if grade.verdict contains issues and grade_attempts < _CRAG_MAX_ATTEMPTS (2), loop back to classify with a corrected prompt. Otherwise continue to the next enrichment stage. On first pass the request enters the loop; after the second classify call the router sees no issues and proceeds.
  4. classify(state) (second pass) — Reruns classification with the critic’s issues folded into the user prompt so the model can correct itself.

    • reads / writes: same as step 1, but now also reads the issues carried from grade; returns updated classification.
    • branch: identical precondition checks; always happy path here because the loop was only entered with valid state.
  5. grade(state) (second pass) — Regrades the corrected classification.

    • reads / writes: same as step 2, but grade_attempts is now 1; returns clean verdict="ok".
    • branch: no early‑return branches fire; grade passes without issues.
  6. post‑grade router (second pass) — Now sees verdict="ok" and no issues, so exits the loop and moves to the fixed next node: extract_funding_stage.

    • reads / writes: same as step 3; no writes.
    • branch: route to extraction nodes.
  7. extract_funding_stage(state) — Extracts funding stage, signals, and team‑size estimate for all companies.

    • reads / writes: reads state["_error"], state["_skip_reason"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"], state["vertical"]; returns a funding_stage dict (stage, funding_signals, team_size_estimate, seniority_gate_ok, confidence, reason, evidence, source).
    • branch: if state["_error"] or state["_skip_reason"] are set, returns {} early. Runs for every company regardless of vertical — happy path.
  8. extract_pi_signals(state) — Extracts PI‑specific signals only for the legal-pi-demand vertical.

    • reads / writes: reads state["_error"], state["_skip_reason"], state["vertical"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"]; returns PI‑signal fields (demand_automation, medical_record_summarization, case_intake with nested detected, confidence, reason, evidence).
    • branch: if state["vertical"] != "legal-pi-demand" returns {} early (skip). For a non‑PI vertical the request ends here — this is the terminal step for this trace.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The scoring subsystem operates in a strict ordered pipeline beginning with the classify node, which takes the enriched home and careers markdown along with company metadata and emits a JSON verdict containing category, tier, remote_policy, has_open_roles, and a confidence score. Immediately after, the grade node (the CRAG quality gate) audits that classification for groundedness in the source text, returning a verdict of "ok" or listing issues. If the verdict is not "ok" for any of the _CRAG_GATED_FIELDS — which are "category_ok", "tier_ok", and "remote_policy_ok" — the router loops back to classify for a retry, capped at _CRAG_MAX_ATTEMPTS=2. On each retry the grader’s issues are folded into the user prompt so the second pass can correct itself. Heuristic-sourced classifications (those produced by regex keyword matching with source "heuristic") skip grading entirely because there is no LLM output to critique, and a retry would just reproduce the same heuristic answer. This design ensures that every final classification has either passed an LLM groundedness check or is explicitly flagged as a low-confidence guess.

The invariant the design preserves is that a guess must never pass as a grounded fact, enforced by the interplay between the heuristic fallback and the gated grading mechanism. When the heuristic path fires, it sets confidence=0.3 and marks source="heuristic" so the persist layer labels its method HEURISTIC, not LLM. The grade function itself defaults the verdict to "ok" on any failure (network or parse error) specifically so that a flaky grader can never block enrichment; this is a deliberate choice that subordinates perfect correctness to pipeline continuity. The system also maintains a write boundary: different enrichment nodes like extract_pi_signals and extract_buying_intent are declared non-fatal — any failure returns {} and leaves the rest of the graph unaffected, so an error in scoring a single signal cannot corrupt already-committed data in the persist step.

The key trade-off is spending an extra LLM call (the grader) per classification to achieve groundedness, against the alternative of a single pass that trusts the first classify output without critique. That alternative is rejected because it would allow hallucinated or ungrounded fields to propagate into downstream ranking and scoring, degrading the entire fit-assessment process. The cost avoided is the silent corruption of the ICP scoring layer with false signals. To keep this cost bounded, the retry cap is set to a maximum of two attempts — "there's no point spending more than one extra LLM call on the same input" — and the grading verdict defaults to "ok" on failure so a transient grader glitch never stalls the pipeline. The system also uses a heuristic fallback for companies whose pages are too short or match no clear pattern, accepting low precision (confidence 0.3) to avoid blocking enrichment on low-signal prospects, while clearly segregating those guesses from LLM-grounded outputs.

A concrete failure mode is a persistent LLM grader failure (e.g., a network outage or a malformed response that cannot be parsed). In that case, grade returns a default verdict of "ok" with no issues, and the classification proceeds to the next node without retry. The operator would see no error raised in the pipeline logs, but would observe that classifications with low confidence or suspicious reasoning are never flagged for a second pass, potentially leading to ungrounded ICP scores. A second visible failure mode arises in the extract_buying_intent node: if that LLM call fails (kill-switch, timeout, parse error), it returns {} and the buying_intent field is simply missing for that company. An operator monitoring enrichment output would detect that certain companies lack a buying_intent signal entirely, which is a clear indicator that the scoring sub-graph for that field silently aborted.

Cost & performance — the real knobs

The subsystem spends time on multiple synchronous LLM invocations (classify, grade, and vertical signal extractors) and money on the token cost of those calls. The grade node may re‑invoke classify up to a configurable limit, compounding both time and cost. Heuristic fallback (when LLM fails) saves money but produces low‑confidence output. Token‑truncation parameters cap the number of input tokens per call, trading lower cost/latency against the risk of missing relevant evidence. The following four to six real knobs directly control these trade‑offs.


  • _CRAG_MAX_ATTEMPTS
    Knob — constant _CRAG_MAX_ATTEMPTS=2 (source line).
    Bounds — limits the number of classify retries triggered by the grade node.
    Effect — increasing it adds more LLM calls per row, raising latency and dollar cost; decreasing it reduces retries but may lower final confidence.
    Risk — too low → poor classification can’t self‑correct; too high → wasted spend on stubborn inputs.

  • LLM_KILL_SWITCH
    Knob — environment variable or boolean switch (source references).
    Bounds — when True, all LLM‑based nodes (extract, classify, grade) silently return {} without calling any model.
    Effect — turning it on eliminates LLM cost and latency but skips all signal extraction; turning it off restores full enrichment.
    Risk — accidentally enabled during production → all enrichment is essentially empty; disabled during high load → no cost control.

  • home_markdown truncation (max_chars=6000 in extract nodes)
    Knob — parameter max_chars=6000 passed to wrap_untrusted in extract_fintech_signals, extract_funding_stage, extract_pricing_model, extract_buying_intent.
    Bounds — caps the number of characters from the home page fed to the LLM.
    Effect — raising it includes more page content (potentially better signal), but increases input tokens → higher cost and latency; lowering it saves tokens but may discard critical evidence.
    Risk — too low → missing compliance cues or buying signals; too high → blows up token budget per company.

  • careers_markdown truncation (max_chars=2000/3000 in extract nodes)
    Knob — parameter max_chars=2000 (or 3000 in extract_buying_intent).
    Bounds — same as above but for the careers page content.
    Effect — smaller limit reduces cost; larger limit improves detection of open roles and seniority gating.
    Risk — too low → missing hiring signals that enable tier/funding‑stage classification; too high → unjustified spend for companies with long careers pages.

  • grade home truncation ([:5000])
    Knob — explicit substring (state.get("home_markdown") or "")[:5000] in the grade node.
    Bounds — independent of the extract‑node limit; caps home page text used for grading at 5000 characters.
    Effect — lowering it speeds up grading and reduces cost, but may make the grader unable to verify evidence; raising it improves grading accuracy at the expense of another LLM call’s tokens.
    Risk — too low → validator frequently requires a retry (cycle back to classify) because it can’t see the evidence; too high → defeats the purpose of grading as a lightweight check.

  • heuristic fallback confidence (constant 0.3)
    Knob — literal value "confidence": 0.3 in the heuristic return dict.
    Bounds — this is the confidence assigned to a classification when no LLM answer is available.
    Effect — raising it (e.g., to 0.5) would make heuristic outputs weigh more in downstream scoring, reducing the incentive to retry; lowering it further would make they are almost ignored.
    Risk — too high → low‑quality heuristic guesses can dominate ranking; too low → even valid keyword‑matched classifications are discarded, causing missed prospects.

Failure modes — what breaks, what catches it

1. LLM inference failure (network / model error)

  • Trigger — The DeepSeek API call inside any signal-extraction function (extract_immigration_signals, extract_buying_intent, extract_pi_signals, extract_funding_stage, classify) returns an HTTP 5xx, times out, or the model crashes mid-response.
  • Guard — The docstring‑level contract “any failure (LLM error, kill-switch, parse failure) returns {}” – realised as an implicit try/except Exception (not named in the snippet) that swallows the exception and returns an empty dict.
  • PostureFail‑soft — the rest of the graph is unaffected because the return value is {} and no fields are persisted for that node.
  • Operator signal — The gen_ai.* span carries an error attribute; no immigration_signals, buying_intent, funding_stage, or classification row appears in company_facts for that company.
  • Recovery — None. No retry is implemented; the empty dict is used downstream, and the missing signal is silently ignored. Manual re‑run or a separate retry‑wrapper would be needed.

2. D1 database query failure (transient / network / schema)

  • Trigger — The d1_one call inside analyse_github fails (e.g., connection refused, deadlock, schema mismatch).
  • Guardexcept D1Error in analyse_github – the exception is caught and the function returns only {"agent_timings": ...} with no GitHub‑related state.
  • PostureFail‑soft — the error is logged and swallowed; the analyse_github step is skipped but the enrichment graph continues.
  • Operator signal — A logged D1Error trace; the github_* columns in companies are not updated (or remain stale); agent_timings shows a short elapsed time with no data.
  • Recovery — None. The function does not retry the DB call. The row will be re‑examined on the next pass when github_analyzed_at expires (after _GH_ANALYSE_REFRESH_DAYS).

3. Heuristic fallback activation (after LLM failure)

  • Trigger — The classify function’s LLM call fails (parse error, kill switch, or low‑confidence output that causes the critic to reject the result), and no valid {category, tier, ...} JSON is produced.
  • Guard — The inline heuristic block that returns {"category": ..., "tier": ..., "confidence": 0.3, "reason": "heuristic fallback (regex keyword match)", "evidence": ..., "source": "heuristic"}. (Identifiable by the literal string "heuristic fallback (regex keyword match)" in the source.)
  • PostureFail‑soft — a low‑confidence (0.3) classification is persisted, preventing a None score from blocking the ranking pipeline.
  • Operator signal — A company_facts row with source = "HEURISTIC" and confidence = 0.3; the reason field contains "heuristic fallback (regex keyword match)".
  • Recovery — None. The heuristic output is used as‑is. A downstream human‑in‑the‑loop system (not shown) could override it.

4. Kill switch engagement (LLM_KILL_SWITCH)

  • Trigger — The global LLM_KILL_SWITCH is enabled, causing ainvoke_json_with_telemetry to raise LlmDisabledError before making the API call.
  • Guardexcept LlmDisabledError (explicitly named in extract_funding_stage’s docstring: “Gated by LLM_KILL_SWITCH (LlmDisabledError swallowed below)”), which catches the exception and returns {}.
  • PostureFail‑soft — the signal node produces no output, but the graph continues because the empty dict propagates harmlessly.
  • Operator signal — No LLM call is made; the gen_ai.* span is absent or shows a disabled status; the respective signal field (funding_stage, immigration_signals, etc.) is missing from the persisted company_facts.
  • Recovery — None. The kill switch must be manually turned off before the next run, or the company is processed without that enrichment.

5. Low‑confidence classification triggering CRAG retry

  • Trigger — The initial classify LLM output has low confidence (or the critic grading pass flags it as incorrect), but the function is configured to run a CRAG retry.
  • Guard — The CRAG retry logic (described in the source comment: “when an earlier grade pass flagged this row, fold the critic's issues into the user prompt so the second pass has a chance to correct itself”). This is a retry‑with‑feedback mechanism, not a first‑line guard.
  • PostureFail‑soft — the retry attempts to improve the result; if the second pass also fails, the heuristic fallback (failure #3) is used.
  • Operator signal — Two consecutive gen_ai.* spans for the same company; the second span has a modified user prompt containing the critic’s issues. If both fail, the operator sees a heuristic output as described in failure #3.
  • Recovery — One retry with augmented context. No exponential backoff is shown; the retry is immediate. After retry, the system accepts the second answer or falls back to heuristic.
Interview — could you explain it?

Pair 1 (Warm-up)

  • Q: How does the score node turn multiple signals into the final ICP score for a company?
  • A: The node starts with a raw score s, adjusts it by the hiring-velocity trend—adding a boost of 0.10 * (0.5 + 0.5 * magnitude) for "rising" or subtracting a drag of 0.08 * (0.5 + 0.5 * magnitude) for "falling"—then multiplies the result by 0.6 + 0.4 * confidence from the classifier, clips at 1.0, and sets needs_review when confidence is below 0.6.
  • Follow-up: What prevents a low-confidence hiring-velocity trend from affecting the score?
    A: Before any adjustment, the code checks hv_grounded = bool(hv.get("evidence")) and float(hv.get("confidence") or 0.0) >= 0.5; if false, the trend is replaced with an empty string and logged as "ungrounded,ignored".
  • Weak answer misses: The exact threshold (0.5) and the requirement that evidence must be non-empty, not just confidence.

Pair 2 (Design question — why this way instead of an obvious alternative)

  • Q: Why does the buyer_fit_classifier.py module use a heuristic, no-LLM approach for deciding if a contact’s affiliation is a plausible B2B AI buyer, rather than calling an LLM for every contact?
  • A: The classifier is designed to be cheap and fast: it reuses the OpenAlex institution_type and a small set of name keywords (_ACADEMIC_NAME_KEYWORDS) plus GitHub topic signals (_GH_AI_TOPIC_SIGNALS) to produce a banded verdict (buyer, not_buyer, unknown) without any LLM cost. This is appropriate because the majority of contacts can be resolved by simple rules, and a wrong call can be caught downstream by the confidence band.
  • Follow-up: What happens when the OpenAlex institution_type field is empty?
    A: The classifier falls back to lowercased substring matching against _ACADEMIC_NAME_KEYWORDS (e.g. "university", "college") to flag academic institutions, and otherwise treats the affiliation as unknown.
  • Weak answer misses: The existence of _GH_AI_TOPIC_SIGNALS and the distinction between buyer (score ≥ 0.6), not_buyer (score ≤ 0.3), and unknown (0.4-0.6) bands — not just a binary pass/fail.

Pair 3 (Harder — handling signal reliability in scoring)

  • Q: How does the score node ensure that a weak or ungrounded hiring-velocity signal does not falsely move a company’s rank?
  • A: The node extracts hv from state, checks hv_grounded (evidence present and confidence ≥ 0.5), and if false, strips any trend that was set and adds "ungrounded,ignored" to the reasons list. Only after this guard does it apply the rising boost or falling drag to the score s.
  • Follow-up: What fields inside the hiring_velocity state block are required for the node to consider the signal grounded?
    A: The "evidence" key must be truthy and the "confidence" key must be a float ≥ 0.5; if either is missing or below threshold, the trend is ignored.
  • Weak answer misses: The exact confidence threshold (0.5) and the fact that trend is cleared to "" — not just left in place—so it cannot accidentally affect other logic.

Pair 4 (Hard — vertical-fit weighting and confidence)

  • Q: How does the enrich_vertical_fit node produce a vertical-fitness judgment that the scoring pipeline can trust, and how is low confidence handled downstream?
  • A: The node runs only when a vertical is tagged in MICRO_VERTICALS; it sends a tailored prompt (using the vertical’s label and keyword_signals) to an LLM and returns a structured result that includes a confidence value and a source (e.g., "heuristic" or "llm"). Downstream, the score node multiplies the raw score by 0.6 + 0.4 * c.get("confidence", 0.5), so a low-confidence vertical-fit result is automatically down-weighted.
  • Follow-up: What prevents the enrich_vertical_fit node from blocking the rest of the pipeline if it fails?
    A: Its design is explicitly non-fatal: any error, kill-switch trigger, or missing vertical returns {}, and the graph continues because the node is wired as an unconditional edge after analyse_github.
  • Weak answer misses: The prompt branches per vertical (not a generic prompt) and the requirement that vertical is non-empty and exists in MICRO_VERTICALS.

Pair 5 (Hard — confidence-driven review flag)

  • Q: Why does the score node output a needs_review flag, and how is it determined?
  • A: The flag is set to True when c.get("confidence", 0) < 0.6, meaning the classifier’s own confidence in its judgment (category, tier, etc.) is too low to trust the final score without human oversight. This allows downstream processes to route low-confidence enrichments to a manual review queue rather than acting on a potentially noisy score.
  • Follow-up: What does the needs_review flag imply about the reliability of the hiring-velocity adjustment when the classifier confidence is low?
    A: Even if hiring-velocity signal is strong and grounded, the entire score is scaled by the low classifier confidence factor (0.6 + 0.4 * confidence), so the absolute score is dampened; the needs_review flag then signals that the whole compound score should be treated as tentative.
  • Weak answer misses: The exact threshold 0.6 and the fact that the flag is computed from the classifier’s confidence (c.get("confidence")) — not from any vertical-fit or hiring-velocity confidence.

07. Semantic Matching With Embeddings

Some enrichment questions are about similarity, not exact matches. That is where embeddings come in. An embedding turns a piece of text, such as a company description, into a list of numbers called a vector. The numbers are arranged so that texts with similar meaning sit near each other in that space. With embeddings you can ask fuzzy questions a keyword search cannot answer. You can ask which companies resemble your strongest customer. You can also ask whether two differently worded records describe the same business. This powers semantic deduplication and matching by meaning rather than exact words. The cost is real infrastructure. A model has to compute the vectors, and a store has to hold them and search them quickly. So what happens when that capability is down? A well built pipeline degrades gracefully, falling back to the checks it can still make instead of failing outright. Enrichment keeps producing grounded results even without the similarity layer.

When the embedding subsystem is unavailable, the pipeline degrades gracefully with a no-op node that preserves topology.

python
async def embed_profile(state: CompanyEnrichmentState) -> dict:
    """No-op terminal node — profile embeddings were decommissioned (2026-06).
    ...
    """
    if state.get("_error") or state.get("_skip_reason"):
        return {}
    if state.get("company_id") is None:
        return {}
    return {"agent_timings": {"embed_profile": 0.0}}
ELI5 — the plain-language version

The provided source material does not contain any information about semantic matching with embeddings, vectors, or similarity‑based deduplication. Every extract describes LLM‑driven classification (pricing model, buying intent, company category, health signals, competitors) and heuristic keyword fallbacks — none of which involve embedding techniques. Without relevant context, I cannot write the grounded explanation you requested.

Data flow — one request, in order

I cannot trace a request through a “Semantic Matching With Embeddings” subsystem because the provided source code contains no embedding functions, vector similarity comparisons, or embedding-based deduplication. The only files given are company_enrichment_graph.py (enrichment graph nodes) and buyer_fit_classifier.py (heuristic, keyword‑based classifier).

If you intended to trace a request through the company enrichment graph itself, the steps (using only the real identifiers from the context) are below. Note that no embedding step exists; the entire classification relies on LLM or heuristic keyword matching.

  1. async def classify(state: CompanyEnrichmentState) -> dict
    Reads state["company"], state["home_markdown"], state["careers_markdown"]. Calls LLM (or heuristic fallback) to produce a classification dict with keys category, tier, remote_policy, has_open_roles, confidence, reason, evidence, source.
    reads / writes – reads state.get("_error"), state.get("_skip_reason"), state.get("company"), state.get("home_markdown"), state.get("careers_markdown"). Returns classification dict (and – by convention – the LangGraph node writes it to state["classification"]).
    branch – early return {} if state["_error"] or state["_skip_reason"] is truthy (failure path). Happy path: proceeds to classification.

  2. async def grade(state: CompanyEnrichmentState) -> dict
    Reads state["classification"] and state["classify_source"]. If classify_source == "heuristic" (i.e., the previous classify used keyword fallback), it skips grading: returns grade with verdict: "ok", issues: [], skipped: "heuristic" and increments grade_attempts. Otherwise it runs an LLM grader to validate groundedness and returns grade dict with verdict, issues, grade_attempts.
    reads / writes – reads state["classification"], state.get("classify_source"), state.get("home_markdown"), state.get("careers_markdown"), state.get("grade_attempts"). Returns grade dict (writes to state["grade"]). Also writes state["grade_attempts"] (+1).
    branch – early return {} on _error or _skip_reason (failure). If classification is empty, returns {}. Heuristic source → returns ok (no retry). Otherwise proceeds to LLM grading.

  3. Graph router (not a named function in source; inferred from comments)
    After grade returns, a router (LangGraph conditional edge) checks grade.verdict. If verdict != "ok" and grade_attempts < _CRAG_MAX_ATTEMPTS (2), it loops back to classify for a retry. On each retry the LLM classify receives a “CRAG” prompt (folded critic issues) to correct mistakes.
    branch – happy path: verdict == "ok" → continue downstream. Fail path: if max attempts exceeded and still not ok, the router may still proceed (the grade defaults to ok on failure, so blocking is avoided). Also if classification came from heuristic, it never loops.

  4. async def extract_funding_stage(state: CompanyEnrichmentState) -> dict
    Runs for every company after classification. Reads state["company"], state["home_markdown"], state["careers_markdown"], state.get("vertical"). Calls LLM to extract funding_stage dict containing stage, funding_signals, team_size_estimate, seniority_gate_ok, plus full provenance (confidence, reason, source, evidence).
    reads / writes – reads from state. Returns funding_stage (writes to state["funding_stage"]).
    branch – early return {} if _error or _skip_reason (failure). Non‑fatal: errors are swallowed (LlmDisabledError), returning {} to not block enrichment.

  5. async def extract_buying_intent(state: CompanyEnrichmentState) -> dict
    Runs for all companies (vertical‑agnostic). Reads state["company"], state["home_markdown"], state["careers_markdown"]. LLM extracts buying‑intent signals and returns buying_intent dict with cue_type, strength, confidence, reason, evidence, source.
    reads / writes – reads state. Returns buying_intent (writes to state["buying_intent"]).
    branch – early return {} on _error/_skip_reason. Non‑fatal: any LLM/kill‑switch failure returns {} (no blocking).

  6. async def persist(state: CompanyEnrichmentState) -> dict
    Reads state["company"], state["company_id"], state["classification"], state["scores"], state["home_markdown"], state["careers_markdown"], state["careers_url"]. Writes to the database: updates companies table with category, tier, score, score_reasons, classification_reason, classification_confidence, updated_at. Inserts into company_facts and company_careers_url. Also records classification method (HEURISTIC or LLM).
    reads / writes – reads from state. No return value (writes to DB); returns an empty dict or similar (not shown).
    branch – early return {} if _error/_skip_reason. Failure in d1_run raises exception (non‑fatal to other nodes, but persist is terminal).

Total steps on happy path (no retry loop):
1 → 2 (heuristic skip or grading) → 4 → 5 → 6

With one retry loop:
1 (classify) → 2 (grade fails) → 1 (classify retry) → 2 (grade ok) → 4 → 5 → 6

Maximum with full retries:
1 → 2 → 1 → 2 → 1 → 2 (max attempts exhausted) → 4 → 5 → 6

The buyer_fit_classifier module is not part of this graph; it is a standalone heuristic module used elsewhere for contact enrichment. No embedding or vector‑similarity step appears in any of the provided code.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The provided context does not contain any information about embeddings, semantic matching, or similarity-based deduplication. The functions and mechanisms described in the source code—such as classify, extract_buying_intent, extract_health_signals, extract_fintech_signals, and extract_competitors—operate through heuristic keyword matching, LLM calls, and explicit category rules, not through vector-based similarity. No embedding pipeline, vector store, or similarity-search mechanism is referenced.

Therefore, a system-design explanation of the “Semantic Matching With Embeddings” chapter cannot be grounded in the given source. An operator would need to consult a different part of the codebase or documentation to find the actual embedding subsystem—its ordered mechanism, invariants (such as idempotent vector writes or exactly-once indexing), trade-offs (e.g., accuracy vs. latency, rejection of exact-match-only approach), and failure modes (e.g., a missing embedding index leading to zero similarity results). The source material here is silent on those topics.

Cost & performance — the real knobs

The provided context does not describe a Semantic Matching With Embeddings subsystem; it covers the company enrichment graph, which relies on LLM calls and heuristic fallbacks. Time and money in this subsystem are spent primarily on LLM API invocations (DeepSeek calls) and markdown scraping. The following performance knobs are directly identifiable from the source. Each is a real identifier with documented default or explicit value.

  • LLM_KILL_SWITCH

    • Knob — Environment variable (no default shown; presence gates LLM calls).
    • Bounds — When set, disables every LLM-based extractor (e.g., extract_funding_stage, extract_pricing_model).
    • Effect — Turning it on eliminates all LLM latency and dollar cost; off allows normal LLM spending.
    • Risk — If left on permanently, no AI-derived fields populate (category, tier, competitors, funding stage, etc.) and downstream scoring fails.
  • wrap_untrusted(…, max_chars=6000) (home page truncation)

    • Knob — Parameter max_chars=6000 in the wrap_untrusted call for home_markdown.
    • Bounds — Caps input token count for the LLM prompt; trades off context completeness against per-request cost.
    • Effect — Reducing it lowers token consumption and latency per call but risks omitting crucial signal text. Increasing it improves recall at higher dollar cost and slower inference.
    • Risk — Too low: missing evidence for classification/cues. Too high: blowing context windows, incurring extra cost with marginal gain.
  • wrap_untrusted(…, max_chars=2000) (careers page truncation)

    • Knob — Parameter max_chars=2000 in the wrap_untrusted call for careers_markdown (seen in extract_funding_stage). Some extractors use 3000 (extract_buying_intent).
    • Bounds — Same as above but applied to careers page content.
    • Effect — Smaller value reduces cost and latency for career‑derived signals (open roles, seniority cues); larger value may catch more job‑listing evidence.
    • Risk — Careers text often contains role descriptions; too aggressive truncation loses evidence for has_open_roles and seniority_gate_ok.
  • confidence: 0.3 (heuristic fallback)

    • Knob — Fixed value in the heuristic fallback dict returned when LLM call is skipped or fails.
    • Bounds — Hard‑coded at 0.3; no external control, but its presence weights downstream scoring differently than LLM‑derived confidence.
    • Effect — A lower value (0.3) reduces the impact of heuristic guesses on composite scores, saving potential ranking mistakes. A higher value would let guesses dominate.
    • Risk — Cannot be tuned per company; if the heuristic is overly accurate for some verticals, 0.3 may underweight good evidence, or if inaccurate, no immediate harm from low weight.
  • confidence: 0.9+ / 0.6-0.89 (LLM signal criteria)

    • Knob — Instructions in system prompts (e.g., fintech signals: “0.9+ when page explicitly names capability, 0.6-0.89 for implied signals”). Also appears in pricing model extraction.
    • Bounds — These are soft guidelines; no hard constant, but they implicitly control the LLM’s confidence output.
    • Effect — Tighter thresholds (e.g., require 0.95+ for “detected”) reduces false positives but increases missed signals. Looser thresholds capture more weak evidence at the cost of noise.
    • Risk — Overly strict: many valid cues get low confidence and are ignored. Overly loose: garbage signals inflate downstream rankings.
  • CRAG retry (implicit retry parameter)

    • Knob — Described as “CRAG retry: when an earlier grade pass flagged this row, fold the critic's issues into the user prompt”. No explicit count or backoff is shown, but the mechanism is present.
    • Bounds — Limits the number of retry passes (typically one retry per failing row, but not configurable from the snippet).
    • Effect — Enabling a retry increases latency and cost per row (second LLM call) but can correct classification mistakes. Disabling retry saves cost at the risk of persisting errors.
    • Risk — Too many retries could explode cost and time; too few leaves errors uncorrected. No retry at all would remove the second chance for misclassified entries.

These knobs directly influence how much time (latency per row) and money (LLM token cost) the enrichment pipeline consumes. None belong to a separate embedding subsystem, because the provided context does not reference embeddings at all.

Failure modes — what breaks, what catches it

LLM API Failure (DeepSeek timeout or non-JSON response)

  • Trigger — The DeepSeek call inside ainvoke_json_with_telemetry times out, returns a non‑JSON body, or raises an HTTP error under load or due to network interruption.

  • Guard — No explicit except clause is shown in the provided code. The docstrings of extract_immigration_signals, extract_buying_intent, extract_competitors, extract_funding_stage, and classify all state that “any failure (LLM error, kill‑switch, parse failure) returns {}”. The guard is therefore the unwritten blanket exception handler (presumably except Exception) that returns an empty dictionary. The only named exception mentioned is LlmDisabledError, which is for kill‑switch failures, not generic LLM errors.

  • Posture — fail‑soft. An empty {} is returned, so the graph continues but the enrichment signal is missing. Downstream scoring runs with a null or default value.

  • Operator signal — The missing signal may be visible in the company_facts table as absent rows for the relevant field. Telemetry spans under gen_ai.* may record the error, but the exact log line is not provided.

  • Recovery — No automatic retry. The function returns immediately with {}. Manual re‑run of the enrichment is required (e.g., by resetting the state and re‑entering the graph).

D1 Database Query Failure (D1Error)

  • Trigger — The d1_one call in analyse_github fails due to a transient DB error (connection timeout, constraint violation, or D1 service outage).

  • Guardexcept D1Error: return {"agent_timings": {"analyse_github": round(time.perf_counter() - t0, 3)}}

  • Posture — fail‑soft. The exception is caught, timings are still emitted, and the function returns an empty dictionary (only with the timing key). No enrichment data is lost; the analyse_github step is simply skipped.

  • Operator signal — The D1Error is logged by the except block (log call not shown), and the returned dict contains no github_* keys. An operator may notice that the github_analyzed_at timestamp was not updated.

  • Recovery — No automatic retry. The function exits early. On the next run, the staleness check (_GH_ANALYSE_REFRESH_DAYS) will attempt the query again if the analyzed timestamp is old enough.

Kill‑Switch Active (LLM_KILL_SWITCH raises LlmDisabledError)

  • Trigger — The operator sets the global LLM_KILL_SWITCH to disable LLM calls. When any enrichment function (e.g., extract_funding_stage) calls ainvoke_json_with_telemetry, the underlying client throws LlmDisabledError.

  • Guardexcept LlmDisabledError: return {} (swallowed inside the enrichment function, as stated in the docstring of extract_funding_stage: “Gated by LLM_KILL_SWITCH (LlmDisabledError swallowed below)”).

  • Posture — fail‑soft. The function returns {} and the graph continues. No LLM‑generated signals are produced, but all heuristic or database‑driven steps (e.g., analyse_github) still run.

  • Operator signal — The LlmDisabledError is silently caught. The operator can observe that all LLM‑derived fields (immigration_signals, competitors, funding_stage, etc.) are missing or show empty values in the final state.

  • Recovery — Manual: The operator must clear LLM_KILL_SWITCH and re‑trigger enrichment for the affected companies. No automatic retry is attempted while the kill‑switch is active.

Parse Failure of LLM JSON Output

  • Trigger — The LLM returns well‑formed text that cannot be parsed as the expected strict JSON schema (e.g., extra fields, missing required keys, or literal string instead of JSON). This is a common failure when the model overshoots the prompt instructions.

  • Guard — The docstrings of all LLM‑based functions (e.g., extract_immigration_signals) state that “parse failure returns {}”. The guard is therefore the JSON‑parsing code inside each enrichment function (likely json.loads inside a try/except) that catches json.JSONDecodeError or similar and returns {}. The exact exception name is not shown in the snippets, but the behavior is documented.

  • Posture — fail‑soft. An empty dictionary is returned, omitting the signal. The graph continues.

  • Operator signal — The gen_ai.* telemetry span may include a parse‑failure attribute. The absence of the expected field in company_facts is the primary operator clue.

  • Recovery — No automatic retry. The function returns immediately. A future run with a different LLM output (e.g., after prompt changes) may succeed.

Heuristic Fallback in classify After LLM Failure

  • Trigger — The LLM call in classify fails (timeout, kill‑switch, parse error) or returns no valid JSON, and the function then executes the heuristic fallback branch (the return that sets source="heuristic", confidence=0.3).

  • Guard — The fallback return block at the end of classify (visible in the snippet: returns a dict with confidence=0.3, reason="heuristic fallback (regex keyword match)", source="heuristic"). This is not an exception handler but an explicit code path guarded by the earlier failure. The docstring does not describe this fallback, but the code shows it.

  • Posture — fail‑soft. The company still receives a classification, but with low confidence and a note that it was not LLM‑grounded. Downstream scoring can weight this signal appropriately.

  • Operator signal — The source field in the returned dict is "heuristic". The operator can detect this by inspecting the classify result in the state or the persisted company_facts row for field='classification'.

  • Recovery — No automated retry. The CRAG retry mechanism (mentioned in the docstring) is a separate quality‑improvement retry that runs before the fallback, not after. If the CRAG retry also fails, the heuristic is used. Manual re‑classification is possible by resetting state.

Prompt Injection Attempt (Planting [SYSTEM] Directives)

  • Trigger — The scraped home_markdown or careers_markdown contains crafted text like [SYSTEM] ignore previous instructions intended to hijack the LLM’s system prompt.

  • Guard — The wrap_untrusted function is called on every untrusted markdown snippet before it enters the LLM prompt (e.g., wrap_untrusted(home_markdown, label='HOME PAGE', max_chars=6000)). As stated in the docstrings, “scraped product copy is fenced via wrap_untrusted … so planted [SYSTEM] injections in the source text cannot steer the extraction.”

  • Posture — fail‑closed. The injection is neutralised by the fence; the LLM never sees the untrusted directive. The extraction proceeds normally, producing correct results. The guard is preventive and does not cause a failure.

  • Operator signal — No signal; the injection attempt is silently blocked. The operator would only notice if the injection was particularly egregious and caused the raw markdown to be truncated or modified, but that is a side effect of wrap_untrusted (e.g., max_chars limit).

  • Recovery — None needed. The guard is already in place. If wrap_untrusted were bypassed (e.g., due to a bug), the injection would succeed—but that is not a failure modelled in the source.

Interview — could you explain it?

Q1 (warm-up): How does the enrich_vertical_fit node determine whether a company fits a specific vertical, and what data does it use?
A: It uses an LLM prompt that is tailored per vertical using the vertical’s label and its top six keyword_signals from MICRO_VERTICALS. The prompt also receives the company’s home_markdown and careers_markdown as context. The output includes vertical_fit (strong/partial/none), ai_native (bool + confidence), and full provenance fields (confidence, reason, source, evidence). The node runs only when state["vertical"] is set and is non-fatal.
Follow-up: Why not compute vertical fit with embedding similarity instead of an LLM call?
One-line grounded answer: The context shows no embedding code; the design deliberately uses a prompt that branches on keyword signals for each of the 5 micro-verticals, giving a tailored qualifier rather than a generic semantic similarity score.
Weak answer misses: The answer must cite MICRO_VERTICALS and the fact that the prompt uses mv.keyword_signals (limited to the first six) to specialize the assessment.


Q2 (design question): In the classify node, a heuristic fallback assigns confidence: 0.3 and source: "heuristic" — why this approach over using a semantic embedding model?
A: The heuristic fallback is a cheap, fast guess when the LLM fails or is skipped. It deliberately carries lower confidence (0.3) so that downstream scoring weights it less, and marks source="heuristic" so the persist layer labels its method HEURISTIC — a guess must never pass as a grounded fact. This design protects data quality without blocking enrichment.
Follow-up: What happens if no keywords are matched in that fallback?
One-line grounded answer: The evidence field becomes "no keywords matched" (see the else branch in the fallback return).
Weak answer misses: The answer must mention that the heuristic logic performs regex keyword matching against predefined tuples (e.g., AI_NAME_HINTS, AI_FRAMEWORKS) and that the evidence string explicitly lists matched keywords or the fallback phrase.


Q3 (medium): The buyer_fit_classifier provides a heuristic, no-LLM verdict on whether a contact’s affiliation is a plausible B2B AI-engineering buyer. How does it handle incomplete or missing affiliation_type from Team A?
A: It degrades gracefully by checking the institution name against a set of academic keywords (e.g., “university”, “institute of technology”) when affiliation_type is None. It also uses OpenAlex fields like institution_type, institution_country, and GitHub AI/ML topic signals (_GH_AI_TOPIC_SIGNALS) to compute a score band (buyer ≥0.6, not_buyer ≤0.3, unknown 0.4–0.6).
Follow-up: Why does it avoid calling an LLM for this classification?
One-line grounded answer: It is intentionally a pure heuristic module — no-LLM — to keep the buyer-fit verdict cheap, deterministic, and independent of LLM latency or cost.
Weak answer misses: The answer must cite the specific constant _ACADEMIC_NAME_KEYWORDS and the fallback logic when affiliation_type is None (stated in the docstring).


Q4 (harder): The extract_buying_intent node uses an LLM prompt to detect buying-intent signals. How does the prompt prevent the model from falsely inferring intent from generic “we use AI” language?
A: The system prompt explicitly forbids inferring intent from generic “we use AI” or “we build AI products” language — it only flags companies that signal they are buying or evaluating external AI solutions. The prompt instructs the model to choose the strongest single cue_type (e.g., RFP, proof-of-concept) grounded in a short verbatim phrase from the provided text, and to set strength and confidence accordingly.
Follow-up: What mechanism protects against adversarial text injections in the scraped markdown?
One-line grounded answer: The text is fenced via wrap_untrusted (documented in the extract_hiring_velocity node) before being placed into the LLM prompt.
Weak answer misses: The answer must refer to the exact prompt rules: “Do NOT infer intent from generic 'we use AI' or 'we build AI products' language” and the requirement that evidence is a verbatim phrase ≤40 words.


Q5 (hard): The extract_hiring_velocity node emits a trend (rising/flat/falling) and magnitude (0.0–1.0). How is this different from simply counting open job postings, and how does it affect the final ICP score?
A: It is an LLM-based classifier that interprets language from both the home and careers pages — including phrases like “200% headcount growth” — to derive a trend and a numeric magnitude with confidence, reason, and evidence. A simple count would miss qualitative signals and vague language. The emitted trend and magnitude are later consumed by the score node to boost (rising) or dampen (falling) the company’s ICP score.
Follow-up: What is the default trend when the careers page has no content or only boilerplate?
One-line grounded answer: The default is 'flat' (explicitly stated in the prompt rules).
Weak answer misses: The answer must cite the specific rule “Default to 'flat' when the careers page has no content or only boilerplate” and the fact that magnitude is 0.0 for essentially absent signals.

08. Multi-Model Routing and Cost

Calling a large language model for every field of every record costs real money and time. So the pipeline routes work to match the difficulty of each task. The strategy is to try cheap first. A smaller, faster model handles the routine cases. The system escalates to a stronger, costlier model only when the cheap one returns low confidence. It also escalates when the task is genuinely hard. This mirrors how a team works, where a junior handles the common cases and asks a senior only for the tricky ones. All model traffic also flows through a single gateway that acts as a shared front door. It centralizes the provider keys, the caching, and the monitoring in one place. So individual call sites never hold secrets, and repeated identical requests can be served straight from cache. The cost is added complexity in deciding when to escalate. There is also a risk that the cheap model is confidently wrong. That risk is exactly why confidence scores and grounding rules matter. Tuned well, routing cuts cost sharply while keeping quality where it counts.

The grade node acts as a cost‑aware router: after a cheap initial classification it checks groundedness via a separate LLM call, and only escalates to a costly retry on the first detected failure.

python
_CRAG_GATED_FIELDS = ("category_ok", "tier_ok", "remote_policy_ok")
_CRAG_MAX_ATTEMPTS = 2

async def grade(state: CompanyEnrichmentState) -> dict:
    if state.get("_error") or state.get("_skip_reason"):
        return {}
    classification = state.get("classification") or {}
    if not classification or state.get("classify_source") == "heuristic":
        return {"grade": {"verdict": "ok", "issues": [], "skipped": "heuristic"}, …}

    # … build user_prompt with classification assertions …
    verdict = "ok"
    issues: list[str] = []
    try:
        llm = make_deepseek_flash(temperature=0.0)          # centralised model gateway
        result, _ = await ainvoke_json_with_telemetry(       # caching + observability
            llm, [system_msg, user_msg], provider="deepseek", cache=True, …
        )
        if isinstance(result, dict):
            any_bad = any(result.get(k) is False for k in _CRAG_GATED_FIELDS)
            issues = [str(x) for x in (result.get("issues") or []) if isinstance(x, str)]
            attempts_so_far = int(state.get("grade_attempts") or 0)
            # Escalate to a retry only on the first grading pass → costly rerun
            if any_bad and attempts_so_far == 0:
                verdict = "retry"
    except Exception:
        verdict = "ok"   # don’t block pipeline on grader failure

    return {"grade": {"verdict": verdict, "issues": issues, …}, …}
ELI5 — the plain-language version

Think of the pipeline like a busy clinic: a triage nurse handles the straightforward cases with a quick checklist, and only the complex ones are sent to the doctor. Here, the system uses a cheap heuristic (keyword matching) first to classify companies. If that’s enough, it saves the expense of calling the large language model. But when the heuristic is unsure, the job escalates to the LLM for a deeper look. After the LLM answers, a separate grading node (like a second nurse) checks whether the answer is reliable—if confidence is low, the case loops back to the LLM for a second attempt. Without this cheap-first routing, every company would trigger the costly LLM, burning budget and slowing the pipeline. The result would be wasted money on simple tasks and longer processing times, making the entire enrichment process unaffordable for real‑world use.

Data flow — one request, in order
  1. classify(state) – The node calls the LLM (primary, expensive model) to produce a structured classification from the company’s homepage and careers markdown.

    • reads / writes – Consumes state["company"], state["home_markdown"], state["careers_markdown"]; writes state["classification"] dict with keys category, tier, industry, remote_policy, has_open_roles, confidence, reason, evidence, source. Also sets state["classify_source"] to "heuristic" if the fallback path was used, otherwise "LLM".
    • branch – Happy path uses the LLM. If the LLM fails or the input is too sparse, a heuristic fallback (regex keyword match) returns a classification with confidence=0.3, source="heuristic". The fallback is a cheap, no-LLM path.
  2. grade(state) – The node inspects the classification for groundedness using another LLM call (still an LLM, but possibly a cheaper or faster model for critiquing).

    • reads / writes – Reads state["classification"] and state["classify_source"]; writes state["grade"] dict with verdict ("ok" or "not_ok"), issues list, and skipped reason. Increments state["grade_attempts"] by 1.
    • branch – If classify_source == "heuristic", the grader immediately returns {"verdict": "ok", "skipped": "heuristic"} without an LLM call—avoiding an unnecessary expense. The happy path (LLM-sourced classification) proceeds to a full LLM grading.
  3. Router (implicit control flow after grade) – The pipeline inspects grade["verdict"] and state["grade_attempts"] to decide whether to retry classification.

    • reads / writes – Reads state["grade"], state["grade_attempts"]; no direct writes, but determines the next node.
    • branch – If verdict == "not_ok" and grade_attempts < _CRAG_MAX_ATTEMPTS (default 2), control loops back to classify. Otherwise (verdict ok or max attempts reached), control flows to the next stage (score). The retry limit caps total LLM cost per record.
  4. Second classify call (retry) – The node re-runs the primary LLM, now with an improved prompt that folds in the critic’s issues from the grade output, giving the model a chance to correct its own mistakes.

    • reads / writes – Same as step 1, but the user_prompt is augmented with the issues from grade["issues"]. Overwrites state["classification"] and updates state["classify_source"].
    • branch – Still may fall back to heuristic if the second LLM call fails. The retry reuses the same fetched markdown—no extra network cost for re-scraping.
  5. Second grade call (retry) – The grader critiques the second classification output.

    • reads / writes – Same as step 2. Overwrites state["grade"] and increments state["grade_attempts"].
    • branch – Happy path: verdict "ok" ends the loop. If still not ok but attempts < 2, another loop would occur; after the second grade, grade_attempts reaches 2, so the router forces continuation regardless.
  6. Router (after retries exhausted) – Since _CRAG_MAX_ATTEMPTS is 2, after two grade failures the pipeline moves on to the score node with the best available classification (the last one from the retry).

    • reads / writes – No writes; decision is to proceed to score.
    • branch – No further retries; the system accepts the classification even if confidence is low, avoiding infinite cost on a single record.
  7. score node (implied) – The enrichment pipeline computes an overall score using the classification, confidence, and evidence, weighting heuristic results lower than LLM-sourced ones.

    • reads / writes – Reads state["classification"] (especially confidence and source); writes state["score"] or similar scoring fields.
    • branch – Not shown in provided context, but the code comments mention “downstream scoring weights it less” for heuristic sources.
  8. extract_funding_stage node (V20) – A separate LLM call that runs for all companies (vertical‑independent) to extract funding stage, signals, and team-size estimate.

    • reads / writes – Reads state["company"], state["home_markdown"], state["careers_markdown"]; writes state["funding_stage"] dict with stage, funding_signals, team_size_estimate, seniority_gate_ok, and provenance.
    • branch – Non‑fatal; if the LLM fails (e.g., LlmDisabledError) the node silently returns an empty dict and the enrichment continues without blocking.
  9. enrich_vertical_fit node (V13) – For the specific vertical (e.g., legal-pi-demand), an LLM classifies the company’s fit and writes a structured vertical_fit result.

    • reads / writes – Reads state["company"], state["home_markdown"], state["careers_markdown"]; writes state["vertical_fit"] dict.
    • branch – Only runs if state["vertical"] matches a configured vertical; otherwise skipped. Non‑fatal on error.
  10. extract_pi_signals node (V14) – For legal-pi-demand vertical companies, this node calls a dedicated LLM to detect three specific personal‑injury signals (demand_automation, medical_record_summarization, case_intake).

    • reads / writes – Same inputs as above; writes individual signal fields (e.g., demand_automation) with confidence, evidence, etc.
    • branch – Gated by state["vertical"] == "legal-pi-demand"; if not, returns early. Non‑fatal; LLM kill‑switch is respected.
  11. analyse_github node – For companies with a known GitHub org, this node probes activity using the GitHub API (no LLM call), populating github_* columns.

    • reads / writes – Reads state["company_id"]; writes state["agent_timings"]; updates companies table via d1_one.
    • branch – Skips if no github_org or if recently analyzed (cached within _GH_ANALYSE_REFRESH_DAYS). Non‑fatal on rate‑limit or missing token.
  12. persist node (last) – All collected classification, signals, and scores are written to the database, completing enrichment for this record.

    • reads / writes – Reads accumulated state fields; writes to company_facts and other tables.
    • branch – No conditional; always executes. If earlier nodes failed, default or null values are persisted. Terminal step.

Cost‑control pattern summary:

  • A cheap heuristic (regex, no LLM) handles low‑value cases first.
  • The primary LLM call is the “junior” model; the grader LLM (possibly cheaper) critiques it.
  • On failure, the same primary LLM is retried with an improved prompt (no additional scraping cost).
  • Maximum 2 retries per record limits total LLM spend.
  • Separate vertical‑specific LLM nodes are only called when needed, and all are kill‑switch‑gated.
  • Non‑LLM nodes (GitHub analysis) use free API calls.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The subsystem described in the source is an enrichment pipeline that applies a single large language model (LLM) to extract structured signals from company web pages. It does not implement a cheap‑first multi‑model router; instead it relies on a two‑stage mechanism: first the classify node calls the LLM, then the grade node audits that output for groundedness. If the grader finds low confidence on any of the gated fields (_CRAG_GATED_FIELDS = ("category_ok", "tier_ok", "remote_policy_ok")), the router loops back to classify for one retry (capped by _CRAG_MAX_ATTEMPTS = 2). On failure of any LLM call — network error, parse failure, or LLM_KILL_SWITCH being active — the node returns {}, which propagates as a non‑fatal skip so that already‑committed enrichment is preserved. Heuristic‑sourced classifications (state["classify_source"] == "heuristic") skip grading entirely because there is no LLM output to critique.

The invariant the design preserves is idempotent non‑fatal enrichment. Every signal‑extraction node (e.g. extract_funding_stage, extract_buying_intent, extract_hiring_velocity) is gated by LLM_KILL_SWITCH and swallows LlmDisabledError; any failure yields {} and does not block downstream nodes that have already persisted data. This guarantees that a transient LLM outage or a kill‑switch activation never corrupts previously committed company facts. The graph is built to continue where it can and silently skip what it cannot — a deliberate design to make the pipeline resilient against the high cost and latency of LLM calls in production.

The central trade‑off is LLM accuracy versus cost and latency. The obvious alternative would be to call the LLM once per field without any quality gate, accepting whatever confidence the model outputs. The source rejects that approach because a low‑confidence hallucination could pollute downstream scoring. Instead it pays for a second LLM call (the grader) only when the first output is suspect — but still uses the same expensive model for both passes. The cost this rejection avoids is the manual cleanup of bad data that would otherwise require re‑enrichment. The cheaper alternative of a purely heuristic classification is used as a fallback for the classify node, but it carries a deliberate low confidence of 0.3 and is labelled source: "heuristic" so downstream consumers can discount it — a guess must never pass as a grounded fact.

A concrete failure mode is an LLM timeout during classify. The operator would see no entry for classification in the output state; instead the node returns {} and the grade node’s router sees an empty classification, causing it to return {"grade": {...}} with verdict maybe missing. The signal visible in logs or metrics is a spike in “Node classify returned {}” or a silent skip in the enrichment output. Because the design is non‑fatal, the operator must monitor a separate “enrichment skip count” metric to detect degraded coverage — there is no hard error to alert on.

Cost & performance — the real knobs

Multi-Model Routing and Cost

The subsystem spends time on repeated LLM calls (e.g., extract_funding_stage, extract_pricing_model, grade, extract_hiring_velocity) and money on API charges per token and per request. The code uses a single model (DeepSeek Flash) but conserves resources through caching, input truncation, a kill switch, and a retry cap. These are the real performance knobs visible in the source:

_CRAG_MAX_ATTEMPTS

  • Knob_CRAG_MAX_ATTEMPTS = 2 (hardcoded constant in company_enrichment_graph.py).
  • Bounds — Retries for low‑confidence LLM classifications (gated fields only). Limits the number of times grade can bounce back to classify.
  • Effect — Raising it increases total LLM calls (more time and cost) per company, but can salvage a borderline classification. Lowering it saves time/cost at the risk of accepting a poor verdict.
  • Risk — Too high: wasted API calls on the same input with diminishing returns. Too low: final classification may be under‑critiqued, hurting downstream scoring quality.

max_chars parameters (6000 for home, 2000 for careers)

  • Knobmax_chars=6000 (home page) and max_chars=2000 (careers page) in wrap_untrusted() calls inside extract_funding_stage, extract_pricing_model, extract_hiring_velocity.
  • Bounds — Maximum characters of scraped markdown fed to the LLM. Truncates the context window.
  • Effect — Lower values reduce tokens processed (saving time and $ per call) but may discard relevant evidence, lowering classification accuracy. Higher values give the model more signal but increase latency/token cost.
  • Risk — Too low: important clues (e.g., explicit funding stage text) may be cut off, leading to uncertain or wrong output. Too high: over‑paying for irrelevant boilerplate, slower per‑company enrichment.

cache flag and cache_scope

  • Knobcache=True and cache_scope="company_enrichment.pricing_model" (hardcoded in extract_pricing_model; same pattern likely in other extractors).
  • Bounds — Enables a semantic cache for identical or near‑identical LLM prompts across companies/vertical runs. The scope string partitions the cache.
  • Effect — On cache hit, the LLM call is skipped entirely (zero time and cost). On miss, normal cost/time applies. Higher cache reuse reduces total spend/makespan.
  • Risk — Too aggressive (e.g., too broad a scope) can return stale or wrong results for genuinely different companies. Disabling cache increases cost and latency linearly with company count.

temperature=0.1 (in extract_pricing_model via make_deepseek_flash(temperature=0.1))

  • Knobtemperature=0.1 (float parameter passed to the LLM constructor).
  • Bounds — Controls output randomness. 0.1 is near‑deterministic; higher values produce more diverse but less reproducible text.
  • Effect – Lower temperature reduces token‑wasting variation (faster, cheaper) but may lock onto a single “safe” answer even when ambiguous. Higher temperature increases risk of verbose or off‑topic output (more tokens = more cost and time).
  • Risk – Too high: excessive creative output wastes tokens and may break JSON parsing, triggering retries. Too low: cannot express uncertainty, leading to overconfident but wrong classifications.

_GH_ANALYSE_REFRESH_DAYS

  • Knob_GH_ANALYSE_REFRESH_DAYS (constant used in analyse_github; default value not shown but appears in an age comparison).
  • Bounds — Minimum number of days between GitHub‑analysis runs for the same company. Companies re‑encountered within that window are skipped.
  • Effect — Lower threshold increases update frequency (more API calls to GitHub, more time spent) but keeps data fresher. Higher threshold reduces load (time/cost) but risks stale org metrics.
  • Risk – Too low: repeated unnecessary calls waste budget and hit rate limits. Too high: stale github_analyzed_at causes missed hiring/funding signals from repo activity.

LLM_KILL_SWITCH

  • KnobLLM_KILL_SWITCH (environment variable or global flag, referenced in extract_funding_stage, extract_pricing_model, extract_hiring_velocity).
  • Bounds – When active, all LLM calls in those extractors throw LlmDisabledError, which is swallowed (returns empty result). Silently bypasses any expensive model request.
  • Effect – Activating the kill switch cuts all LLM‑associated time and cost to zero for those nodes. Deactivating restores normal operation.
  • Risk – Left on: all companies skip enrichment for that field, starving downstream scorers of critical evidence (e.g., funding stage, pricing model). Left off: no emergency brake for cost blow‑ups or model outages.
Failure modes — what breaks, what catches it

I cannot perform a failure-mode analysis of the "Multi-Model Routing and Cost" subsystem because the provided context contains no code, identifiers, or documentation related to model routing, cheap-first strategy, or model escalation.

The snippets you supplied are from company_enrichment_graph.py and cover:

  • Immigration-signal extraction (extract_immigration_signals) – runs only for legal-immigration.
  • GitHub analysis (analyse_github) – non-fatal, checks refresh age.
  • Funding-stage extraction (extract_funding_stage) – runs for all verticals.
  • PI-signal extraction (extract_pi_signals) – runs only for legal-pi-demand.
  • Company classification (classify) with a CRAG-grade loop (grade).

None of these functions implement a multi-model router, a cheap-first chain, or a cost-gateway. The only LLM calls visible are single-model invocations (via ainvoke_json_with_telemetry), and the only retry logic is the classifygrade loop (capped at 2 attempts), which is a quality‑gate for classification, not a cost‑driven escalation.

Because the source you provided does not contain the subsystem you asked about, I am unable to list failures, guards, postures, operator signals, or recovery steps – there is no real identifier or exception handler in the context to ground such an analysis.

Interview — could you explain it?

1. Warm-up

  • Q: How does the pipeline decide to use a cheaper heuristic instead of calling the LLM for company classification?
  • A: The classify function first attempts an LLM call via a system prompt. If that fails or is skipped (e.g., due to the kill switch), a heuristic fallback uses regex keyword matching and returns a result with confidence 0.3 and source "heuristic". This ensures that cheap, low-confidence guesses are used only when the expensive LLM path cannot execute.
  • Follow-up: How does the system prevent that heuristic output from being mistaken for a high-confidence fact?
  • A: The fallback explicitly sets "confidence": 0.3 and "source": "heuristic", so downstream scoring weights it less and the persist layer marks its method as HEURISTIC (not LLM).
  • Weak answer misses: The specific confidence value of 0.3 and the "source": "heuristic" label; a shallow answer might say “it uses a low confidence” but omit the exact threshold and provenance tag.

2. Medium

  • Q: Why does the hiring‑velocity logic require both evidence and a confidence threshold before applying a trend boost or drag?
  • A: The code checks hv_grounded = bool(hv.get("evidence")) and float(hv.get("confidence") or 0.0) >= 0.5. If either condition fails, the trend is ignored and the reason "ungrounded,ignored" is appended. This prevents unsubstantiated signals from moving the score, avoiding fabricated signals in the ranking.
  • Follow-up: What happens if confidence is exactly 0.5?
  • A: It is accepted because the condition is >= 0.5 (not strictly greater).
  • Weak answer misses: The requirement for both evidence (a truthy value) and confidence ≥ 0.5; many might think only confidence matters.

3. Moderate-hard

  • Q: How does the pipeline ensure that an LLM failure does not block the entire enrichment graph?
  • A: Multiple extractors (e.g., extract_funding_stage, extract_buying_intent) are explicitly non‑fatal: they return {} on any LLM error, and they catch LlmDisabledError raised by the kill switch. This allows downstream nodes to proceed even when the LLM is unavailable.
  • Follow-up: What global mechanism allows operators to completely disable LLM calls without changing code?
  • A: The LLM_KILL_SWITCH variable – when it raises LlmDisabledError, the extractors swallow it and return empty results, effectively bypassing all LLM calls.
  • Weak answer misses: The exact exception name LlmDisabledError and the pattern of catching it inside each extractor; a shallow answer might say “they return empty” but miss the kill‑switch gating.

4. Hard (design question)

  • Q: Why does the pipeline use a cheap‑first routing strategy with confidence gating, rather than calling a single powerful model for every field?
  • A: The system avoids paying for an expensive LLM on routine cases by using a heuristic fallback in classify (confidence 0.3) and grounding checks on signals like hiring_velocity (confidence ≥ 0.5 with evidence). Only when the cheap path returns low confidence or the task is genuinely hard does it escalate – for example, the needs_review flag (confidence < 0.6) flags outputs that might require human review, not a stronger model, but the principle of gating on confidence controls cost.
  • Follow-up: Where is the exact confidence threshold for flagging outputs as “needs_review” defined?
  • A: In the scoring function at the end of the hiring‑velocity snippet: needs_review = c.get("confidence", 0) < 0.6.
  • Weak answer misses: The 0.6 threshold is applied to the composite score’s confidence field, not directly to individual LLM outputs; a shallow answer might confuse it with an LLM‑call threshold.

5. Hardest (advanced mechanism)

  • Q: When the LLM classifier produces a low‑confidence result, how does the pipeline give it a chance to correct itself on a retry, instead of just re‑running the same prompt?
  • A: The classify function implements CRAG retry: if an earlier “grade” pass flagged this row, the critic’s issues are folded into the user prompt of the second pass. This means the model sees its previous mistake and can correct it, rather than repeating the same error.
  • Follow-up: What is the key difference between a naive retry and the CRAG approach used here?
  • A: CRAG modifies the input prompt by including the critic’s feedback, while a naive retry resubmits the identical input.
  • Weak answer misses: The exact name “CRAG retry” and the fact that it alters the user prompt; a shallow answer might say “it retries with a different prompt” but not cite the critic‑feedback mechanism.

09. Evaluation and the Accuracy Gate

Enrichment leans on models and prompts. Both are easy to change and hard to judge by eye. So the platform refuses to ship a change on vibes. Every change to a prompt or a model is measured against a fixed set of examples. Each example has a known correct answer. Together they form an evaluation suite. A change must clear an accuracy bar before it is allowed to land. Here that bar is eighty percent. This turns a vague sense that the new prompt feels better into a number you can defend or reject. The discipline is to measure first. You decide how you will judge success before you start tinkering. So real improvements stand out, and regressions are caught the moment they appear. The cost is the upfront work of building those labeled examples. Keeping the bar honest takes ongoing effort too. The payoff is the freedom to iterate quickly without fear. The gate, not hope, protects the quality of every record the system produces.

The grade node enforces an accuracy gate for each classification, checking groundedness and retrying once before scoring.

python
_CRAG_GATED_FIELDS = ("category_ok", "tier_ok", "remote_policy_ok")
_CRAG_MAX_ATTEMPTS = 2

async def grade(state: CompanyEnrichmentState) -> dict:
    classification = state.get("classification") or {}
    # ... skipped if heuristic or empty
    home_markdown = (state.get("home_markdown") or "")[:5000]
    careers_markdown = (state.get("careers_markdown") or "")[:2000]

    system_prompt = (
        "You audit a company classification for groundedness. … "
        "Return strict JSON: "
        '{"category_ok": boolean, "tier_ok": boolean, '
        '"remote_policy_ok": boolean, "issues": [string]}. …'
    )
    user_prompt = (
        "Proposed classification:\n" + json.dumps({
            "category": classification.get("category"),
            "tier": classification.get("tier"),
            "remote_policy": classification.get("remote_policy"),
            "confidence": classification.get("confidence"),
        }) + "\n\nHome page:\n" + wrap_untrusted(home_markdown, ...)
        + "\n\nCareers page:\n" + wrap_untrusted(careers_markdown, ...)
        + "\n\nReturn JSON only."
    )

    verdict = "ok"
    issues: list[str] = []
    try:
        llm = make_deepseek_flash(temperature=0.0)
        result, _ = await ainvoke_json_with_telemetry(
            llm,
            [{"role": "system", "content": system_prompt},
             {"role": "user", "content": user_prompt}],
            …
        )
        if isinstance(result, dict):
            any_bad = any(result.get(k) is False for k in _CRAG_GATED_FIELDS)
            issues = [str(x) for x in (result.get("issues") or []) if isinstance(x, str)]
            attempts_so_far = int(state.get("grade_attempts") or 0)
            if any_bad and attempts_so_far == 0:
                verdict = "retry"
    except Exception:
        verdict = "ok"                    # transient failures never block enrichment

    return {
        "grade": {"verdict": verdict, "issues": issues},
        "grade_attempts": int(state.get("grade_attempts") or 0) + 1,
    }
ELI5 — the plain-language version

Think of the system like a chef who has a second cook taste-test every dish before it leaves the kitchen. Nobody trusts a new recipe on gut feeling alone, so every prompt or model change must be checked against a fixed set of examples with known answers—a test suite—and must clear an accuracy bar before it can be served.

The real mechanism here is the grade node: it reads the AI’s classification alongside the original web page text and asks, “Is this verdict actually supported by the evidence?” If the grade returns low confidence, the system sends the work back for one retry (the CRAG loop), capped at two attempts so a flaky grader can never block progress forever. And if the grader itself fails (network error), the verdict defaults to “ok” to keep the pipeline moving.

Without this gate, a misclassification that sounds plausible but is completely invented would sail straight into scoring, biasing the company ranking and sending sales teams after the wrong targets. The process would run on vibes—nice-sounding but untruthful labels that waste everyone’s time.

Data flow — one request, in order
  1. classify — Uses an LLM to classify the company into one of five categories (CONSULTANCY, STAFFING, AGENCY, PRODUCT, UNKNOWN) and assigns a tier, industry, remote policy, and confidence.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["company"], state["home_markdown"], state["careers_markdown"]. Writes state["classification"] (the JSON verdict), state["classify_source"] (either "llm" or "heuristic"), and state["agent_timings"]["classify"].
    • branch — Returns {} immediately if state["_error"] or state["_skip_reason"] is truthy. On LLM or parse failure, falls back to the heuristic path that sets confidence=0.3, source="heuristic", and writes an evidence string of matched keywords.
  2. grade — Audits the classification for groundedness by having a separate LLM judge whether each high-priority field (category_ok, tier_ok, remote_policy_ok) is supported by the page text.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["classification"], state["classify_source"], state["home_markdown"], state["careers_markdown"], state["grade_attempts"]. Writes state["grade"] (dict with verdict, issues, category_ok), increments state["grade_attempts"], writes state["agent_timings"]["grade"].
    • branch — Returns early {} if state["_error"] or state["_skip_reason"] is set, or if state["classification"] is empty. If state["classify_source"] == "heuristic", the node skips the LLM call and returns verdict: "ok" with skipped: "heuristic". If the LLM grader itself fails (exception), it defaults to verdict: "ok" so the flaky grader cannot block enrichment. Otherwise it checks each gated field for low confidence; if any is low and state["grade_attempts"] is 0, it sets verdict: "retry".
  3. _grade_router — Conditional edge that decides whether to loop back to classify or proceed to score.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["grade"]["verdict"], state["grade_attempts"]. Writes nothing (returns a string).
    • branch — If state["_error"] or state["_skip_reason"] is truthy, returns "score" (skip loop). Otherwise, if verdict=="retry" and grade_attempts < _CRAG_MAX_ATTEMPTS (2), returns "classify"; else returns "score". This creates the CRAG retry loop, with at most one retry attempt because the second grade always yields "ok" by design.
  4. classify (retry pass) — Second execution triggered when grade returned "retry". The user prompt now includes the critic's issues from the first grade so the LLM can correct itself.

    • reads / writes — Same as step 1, but now state["grade"] is present and its issues are folded into the prompt. The same early-exit branches apply.
    • branch — Same kill-switch and heuristic fallback as before.
  5. grade (second pass) — Re-grades the updated classification after retry.

    • reads / writes — Same as step 2, but state["grade_attempts"] is now 1. Writes a new verdict (always "ok" because the retry condition only triggers on attempt 0).
    • branch — No retry can occur now; even if the classification is still weak, the verdict will be "ok", pushing the request forward.
  6. _grade_router (post-retry) — Routes to "score" because the verdict is no longer "retry" or attempts have reached the cap.

    • reads / writes — Same as step 3.
    • branch — Always returns "score" now.
  7. score (inferred node, referenced in router) — Computes a final score and priority for the company based on the classification, grade verdict, and any signal fields. (Exact implementation not shown in provided snippets, but it is the standard next node after grading.)

    • reads / writes — Reads state["classification"], state["grade"], state["vertical"]. Writes scoring results (e.g., state["score"], state["priority"]).
    • branch — Assumed to be unconditional; no early returns are visible.
  8. extract_pi_signals — Only runs when state["vertical"] == "legal-pi-demand". Extracts three product‑capability signals (demand_automation, medical_record_summarization, case_intake) with full provenance.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["vertical"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"]. Writes the three signal sub‑dicts into state (e.g., state["demand_automation"], state["medical_record_summarization"], state["case_intake"]) plus state["agent_timings"]["extract_pi_signals"].
    • branch — Returns {} immediately if vertical is not "legal-pi-demand". Non‑fatal; any failure inside (including LlmDisabledError) returns {} and logs a warning.
  9. extract_funding_stage — Runs for all verticals. Emits funding stage, funding signals, team‑size estimate, and the seniority_gate_ok flag used by downstream scoring.

    • reads / writes — Reads state["_error"], state["_skip_reason"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"], state["vertical"]. Writes state["funding_stage"] (a dict with stage, funding_signals, team_size_estimate, seniority_gate_ok, plus provenance fields), and state["agent_timings"]["extract_funding_stage"].
    • branch — Returns {} if state["_error"] or state["_skip_reason"] is set. The LLM call is gated by LLM_KILL_SWITCH; if that switch is active, the exception is swallowed and the node returns {} so the rest of the graph continues.
  10. Persist (implied by the comment “any failure here does not block enrichment already committed in persist”) — Writes the accumulated facts (classification, signals, funding stage) to the company_facts D1 table.

    • reads / writes — Reads all state keys that need persisting; writes to an external database.
    • branch — Non‑fatal; a failure in this node is logged but does not cause the request to fail, because the enrichment graph has already finished computation.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The subsystem is implemented as a set of async extractor functions in company_enrichment_graph.py that run sequentially or conditionally within a graph. Each function—such as extract_pi_signals, extract_funding_stage, extract_buying_intent, and extract_hiring_velocity—follows an ordered mechanism: it first checks state.get("_error") or state.get("_skip_reason"), returning {} if either is set. If no early exit is triggered, vertical‑specific functions (e.g., extract_pi_signals for _PI_VERTICAL or the immigration‑signal extractor for "legal-immigration") gate on state["vertical"] and return {} when the vertical does not match. On passing these gates, the function wraps scraped product copy via wrap_untrusted before calling the LLM, ensuring planted [SYSTEM] injections cannot steer the signal. On success, the result is persisted to company_facts under a specific field (e.g., 'immigration_signals', 'funding_stage', 'buying_intent', 'hiring_velocity') with full provenance (confidence, reason, source, evidence). On any failure—LLM error, kill‑switch (LLM_KILL_SWITCH), parse failure—the function returns {} and the rest of the graph continues unaffected; failures are explicitly non‑fatal.

The invariant preserved by this design is a write boundary: enrichment that has already been committed in the persist node must never be rolled back or blocked by a later failure. The source text states for extract_funding_stage: “Non‑fatal — any failure here does not block enrichment already committed in persist.” Similarly, every extractor guarantees that a failure returns {} so the graph’s downstream flow—especially the persist step—sees only a no‑op. This idempotency‑like contract ensures that partial enrichment is safe: once data is persisted, no subsequent extractor’s error can invalidate it.

The key trade‑off is LLM‑based extraction versus a purely heuristic alternative. The design explicitly includes a heuristic fallback inside classify, which runs a regex keyword match and returns confidence: 0.3, source: "heuristic", and reason: "heuristic fallback (regex keyword match)". The obvious alternative—running only heuristic rules across all companies—is rejected because it cannot capture nuanced signals such as petition drafting or buying‑intent cues. The LLM approach incurs latency and cost per call, and the system guards against prompt injection through wrap_untrusted. However, the cost is bounded by the non‑fatal {} return: a single LLM failure does not cascade, and the LLM_KILL_SWITCH mechanism (LlmDisabledError swallowed) allows operators to globally disable LLM calls without breaking the graph. The rejection of a pure‑heuristic path avoids the cost of false negatives (missed opportunities) when signals are implicit, at the expense of greater operational complexity.

A concrete failure mode is when the DeepSeek call in extract_funding_stage hits the LLM_KILL_SWITCH. The LlmDisabledError is caught and the function returns {} for all companies, regardless of vertical. An operator monitoring the system would observe that company_facts rows for field 'funding_stage' are missing for every company in that run. The observability span attributes (e.g., agentic_sales.vertical=legal-immigration) would show the call was skipped, but no error is propagated to the graph state. The signal to the operator is a sudden absence of funding‑stage data in the output, backed by logs from the kill‑switch. This shows how the non‑fatal invariant is both a strength (no blocking) and a weakness (silent data gaps), requiring separate monitoring to detect.

Cost & performance — the real knobs

Based solely on the provided source snippets, the subsystem spends time and money primarily on LLM API calls (for funding stage, pricing model, buying intent, and grade nodes) and on GitHub API calls (analyse_github). The following five real performance knobs directly control these expenditures:

LLM_KILL_SWITCH

  • KnobLLM_KILL_SWITCH (environment variable, default not shown in source)
  • Bounds — Binary switch that completely disables all LLM calls within the enrichment graph. When active (thrown as LlmDisabledError), nodes like extract_funding_stage and extract_pricing_model skip the LLM call entirely.
  • Effect — Setting it to “on” (killed) eliminates all LLM‑related dollar costs and latency, but enrichment fields that depend on LLM inference (funding stage, pricing model, buying intent) will return empty results or fall back to heuristic defaults. Turning it off (normal operation) restores full enrichment at the cost of every LLM invocation.
  • Risk — If left on inadvertently, the platform loses all LLM‑derived enrichment, breaking downstream scoring that expects those fields. If left off when costs must be strictly controlled, the absence of this kill switch allows unbounded LLM spending.

_GH_ANALYSE_REFRESH_DAYS

  • Knob_GH_ANALYSE_REFRESH_DAYS (module‑level constant, exact value not shown but compared with age_days in analyse_github)
  • Bounds — Minimum number of days that must elapse before the same GitHub org is re‑analyzed. The current timestamp of github_analyzed_at is checked against this threshold.
  • Effect — Lower values increase the frequency of GitHub API calls (rate‑limited, token‑gated), raising both network latency and potential API costs. Higher values reduce request volume, saving time and money but allowing the stored GitHub data (stars, patterns) to become stale.
  • Risk — Set too low, the platform may exceed GitHub hourly rate limits or waste tokens on unchanged repos. Set too high, the github_* columns used by scoring could be months out of date, degrading recommendation quality.

_CRAG_MAX_ATTEMPTS

  • Knob_CRAG_MAX_ATTEMPTS (constant equal to 2)
  • Bounds — Maximum number of times the grade node can loop back to classify (the CRAG retry pattern). The retry reuses the same fetched markdown and does not fetch new data.
  • Effect — Raising this value (e.g., to 3 or 4) increases the maximum number of LLM calls per company for the classification‑grade cycle, directly adding cost and latency for every company that fails the first grade. Lowering it (to 1) eliminates retries, saving money but allowing low‑confidence classifications to proceed unchecked, potentially degrading scoring accuracy.
  • Risk — Too high inflates LLM spend without proportional benefit (the source notes “there’s no point spending more than one extra LLM call”). Too low (or set to 0) defeats the accuracy gate, letting ungrounded classifications into scoring.

wrap_untrusted max_chars

  • Knobwrap_untrusted(home_markdown, label='HOME PAGE', max_chars=6000) and wrap_untrusted(careers_markdown, label='CAREERS PAGE', max_chars=2000) (two separate parameters, each a literal integer)
  • Bounds — Maximum number of characters from the scraped page text that are fed into the user prompt for LLM nodes (e.g., extract_funding_stage, extract_pricing_model). Content beyond this limit is truncated.
  • Effect — Increasing max_chars (e.g., to 8000 for home page) passes more context to the LLM, potentially improving inference quality but increasing token count per call (and thus latency and cost). Decreasing it reduces token consumption, speeding up each LLM reply and lowering per‑call expenses, but may cause the model to miss relevant signals.
  • Risk — Set too high, a single very long page could exceed the model’s context window (e.g., 32k tokens) or cause disproportionate cost for minimal accuracy gain. Set too low, the prompt loses critical evidence, forcing the LLM to guess or return low‑confidence answers.

cache=True (LLM response cache)

  • Knobainvoke_json_with_telemetry(…, cache=True, cache_scope="company_enrichment.pricing_model") (boolean flag plus a scope string)
  • Bounds — Enables or disables caching of LLM responses keyed by the exact prompt (including system and user messages). The cache_scope partitions the cache per node type.
  • Effect — When enabled, repeated calls for the same company (e.g., same home page + careers markdown) return a cached response instantly, eliminating the LLM’s latency and cost entirely for that duplicate. When disabled, every call hits the model provider, spending money and time even on identical inputs.
  • Risk — Stale cache: if the website content changes but the cache key (prompt) remains identical, the enriched fields will be stale until the cache expires or is evicted. Over‑aggressive caching (too long TTL) hides real changes; disabling it wastes resources on repeated identical queries.
Failure modes — what breaks, what catches it

Because the provided source excerpts (all from company_enrichment_graph.py) do not contain the evaluation suite, accuracy bar, or prompt-change gate described in the query, the following analysis treats the grade node and its surrounding CRAG retry logic as the accuracy gate present in the source. Every failure mode, guard, and signal is drawn strictly from the code.


1. LLM grader fails (network error, parse failure)

  • Trigger – The grade function attempts to call an LLM to produce a JSON verdict, but the call fails (network timeout, API error) or returns unparseable text.
  • Guard – The code states: “When the LLM grader fails (network, parse error) the verdict defaults to ok so a flaky grader can never block enrichment.” This fallback is implemented via an implicit except that returns {"grade": {"verdict": "ok"}}.
  • PostureFail-soft. The node returns a default ok and the enrichment graph continues without aborting.
  • Operator signal – No explicit log line in the source. The operator would see grade.verdict in state as ok, but without provenance that the grader actually ran. The agent_timings dictionary may be missing or incomplete for the grade step.
  • Recovery – None. The classification proceeds with the unvetted verdict. The system does not retry because the default is accepted immediately.

2. Heuristic classification bypasses the accuracy gate

  • Trigger – The preceding classify node produced a classification with source set to "heuristic" (because the LLM call failed or fell through to regex). The classify_source state key is "heuristic".
  • Guard – Inside grade, the check if state.get("classify_source") == "heuristic": returns a canned {"grade": {"verdict": "ok", "issues": [], "skipped": "heuristic"}} and increments grade_attempts without any LLM call.
  • PostureFail-closed in the sense that the gate is intentionally bypassed for heuristic data. The system labels it as a guess (confidence=0.3, source="heuristic") so downstream scoring can deprioritise it, but no accuracy check is performed.
  • Operator signal – The grade state will contain "skipped": "heuristic". The operator can also inspect classification.confidence (0.3) and classification.source.
  • Recovery – No retry occurs. The heuristic classification is accepted as-is. Manual re‑analysis of companies with source: heuristic may be required.

3. LLM grader produces a false‑positive ok verdict

  • Trigger – The grader’s LLM call completes successfully but incorrectly returns "verdict": "ok" on a classification that is actually ungrounded.
  • GuardNone. The code has no secondary validator for the grader’s own output. It trusts the LLM’s judgement and does not, for example, compare the classification against the source text deterministically.
  • PostureFail-soft (the run continues with bad data). The enriched information enters the database unchanged.
  • Operator signal – Silent. The grader’s verdict appears normal (verdict: ok). There is no metric or field that flags a misgraded row.
  • Recovery – No automatic recovery. Detection would require offline audit or downstream scores that expose inconsistencies. Manual revert is the only option.

4. Persistent low‑confidence classification after retry attempts exhausted

  • Trigger – The grade node returns issues (non‑empty issues list) and the router sends the state back to classify. The second classification still receives a low‑confidence grade. After the second attempt, the counter grade_attempts reaches _CRAG_MAX_ATTEMPTS (2).
  • Guard – The constant _CRAG_MAX_ATTEMPTS = 2 caps the retry loop. The router (not shown in the provided source, but referenced by the CRAG pattern) must stop cycling after two passes.
  • PostureFail-soft (the graph proceeds with the last classification, even if graded as low confidence). The source does not specify a fallback to heuristic or a forced stop at that point.
  • Operator signal – The state field grade_attempts will be 2 (or greater). The grade dictionary will contain the persistent issues. The operator can monitor grade_attempts across runs.
  • Recovery – None in the code. The enrichment continues with whatever classification was last produced. Manual re‑evaluation of the company is the only remedy.

5. LLM grader timeout or indefinitely slow response

  • Trigger – The external LLM API call inside grade hangs or takes longer than a reasonable timeout. The grade function is an async coroutine with no explicit timeout or fallback for a stuck request.
  • GuardNone. The source does not implement a timeout or cancellation for the grader’s LLM call. The function will await indefinitely unless a higher‑level orchestrator enforces a deadline.
  • PostureFail-soft if the orchestrator eventually times out the whole node; otherwise fail-closed (the graph stalls and blocks downstream nodes). The default ok fallback only applies on network/parse errors, not on timeouts that do not raise exceptions.
  • Operator signal – The grade node would appear as running for an extended period. No progress or error logs are emitted. The operator may observe a stalled enrichment pipeline.
  • Recovery – Not specified. Manual intervention (killing the process or restarting the run) is required. Adding a timeout to the LLM client (outside this file) would be the fix.

6. Missing or malformed state keys cause early silent exit

  • Trigger – The grade function expects state["classification"] to be a non‑empty dict. If it is None, {}, or missing altogether (e.g., because classify was skipped), the guard if not classification: return {} fires.
  • Guard – The early return if not classification: return {} prevents a KeyError or type error. No error is raised.
  • PostureFail-soft. The function returns an empty dict, so no grade state is written. Downstream nodes that depend on grade.verdict may receive None or behave differently.
  • Operator signal – Silent. The grade key in state remains {}. The agent_timings dict still contains the grade timing entry, but there is no log or metric indicating the gate was skipped.
  • Recovery – None automatic. The graph continues, but accuracy gating effectively did not run for that company. Manual inspection of companies with empty grade state is needed.
Interview — could you explain it?

Q (warm-up):
What is the purpose of the grade node in the enrichment graph, and how does it interact with the classify node?

A:
The grade node acts as a CRAG (Corrective Retrieval Augmented Generation) quality gate. It audits the classification output from classify for groundedness in the page text. If it finds low confidence in critical fields (category_ok, tier_ok, remote_policy_ok), the router (_grade_router) sends the state back to classify for a single retry, folding the grader’s issues into the user prompt so the second pass can correct mistakes. This ensures ungrounded or low‑confidence classifications are caught before scoring.

Follow-up:
How does the graph prevent infinite retry loops?

A:
It uses a hard cap: _CRAG_MAX_ATTEMPTS = 2. After one retry, the state always proceeds to score, because re‑evaluating the same fetched markdown more than once is wasteful.

Weak answer misses:
The exact constant _CRAG_MAX_ATTEMPTS and that the maximum number of retry cycles is 2 (not 1 or unlimited).


Q (medium):
What happens when the grade node receives a heuristic‑sourced classification?

A:
Heuristic‑sourced classifications skip grading entirely. The grade function checks if state.get("classify_source") == "heuristic": return {"grade": {"verdict": "ok", "skipped": "heuristic"}}. Because the verdict is immediately “ok,” the router proceeds straight to score without any retry. This avoids pointless LLM critique of a deterministic regex‑based result.

Follow-up:
Why would a heuristic source ever be used instead of the LLM classifier?

A:
When the LLM call fails or is kill‑switched, the system falls back to a heuristic that matches keywords and assigns a confidence of 0.3 (as shown in the return block with "confidence": 0.3). The low confidence ensures downstream scoring weights it less, and source="heuristic" prevents the result being stored as a grounded fact.

Weak answer misses:
The specific confidence value 0.3 and the explicit source label "heuristic" that distinguishes it from LLM‑sourced classifications.


Q (hard):
Why does the score node scale the final score by 0.6 + 0.4 * confidence, and how does this relate to the needs_review flag?

A:
The scaling s *= 0.6 + 0.4 * c.get("confidence", 0.5) dampens the score when the classifier is uncertain. At confidence=0 the raw score is reduced to 60%; at confidence=1.0 it stays unchanged. Separately, needs_review is set to True when confidence < 0.6. This two‑layer approach ensures that low‑confidence outputs are both weighted down and flagged for human inspection — preventing a weak signal from moving the ranking while still giving visibility.

Follow-up:
Why 0.6 and 0.4 specifically? Could that be a bug or a deliberate hyperparameter?

A:
These are fixed hyperparameters in the scoring logic (visible in s *= 0.6 + 0.4 * c.get("confidence", 0.5)). The context does not justify the exact values, but the pattern shows a deliberate design: the confidence can at most reduce the score to 60% of its raw value, and full confidence passes the score through unchanged.

Weak answer misses:
The exact formula (0.6 + 0.4 * confidence) and the threshold for needs_review being < 0.6 (not ≤ or a different number).


Q (design – “why this way and not the obvious alternative”):
Why does the system have a separate grade gate rather than just relying on the self‑reported confidence from the LLM classifier to reject low‑quality results?

A:
Confidence is self‑reported by the classifier and can be inflated (e.g., a confident hallucination). The grade node provides an independent LLM audit of groundedness in the source text. If grade finds issues even when classification confidence is high, the state is sent back to classify for a retry. This decoupling catches failures that the classifier’s own confidence would miss. The grader also has a safety default: on network/parse errors it returns verdict “ok” so a flaky grader never blocks enrichment.

Follow-up:
What are the three fields that the grader specifically evaluates?

A:
The constant _CRAG_GATED_FIELDS = ("category_ok", "tier_ok", "remote_policy_ok"). Only these drive the retry decision; industry and has_open_roles are not gated because they don’t directly influence the ICP score.

Weak answer misses:
The exact tuple _CRAG_GATED_FIELDS and that industry/has_open_roles are explicitly excluded from retry triggering.

10. From Enriched Profile to Outreach

Enrichment exists to serve the step after it. So the final job is handing a complete, grounded profile to outreach. A good message references something specific. It might name what a company does, the role it is hiring for, or the software it runs. That only works because enrichment gathered and verified those details first. The enriched record becomes the ground truth that personalization draws on. This is why every fact in it had to be real. A message that cites an invented detail is not just useless. It actively destroys trust with the exact person you were trying to win. Sends go only to verified addresses. They go out at a controlled pace. Suppression rules keep the system from contacting anyone who should be left alone. The whole arc runs from a raw candidate, through grounded enrichment, to a careful and personal send. It is one pipeline with a single throughline. Do less, but make every record and every message something you could stand behind if the recipient asked you to prove it.

The analyse_org function produces the enriched profile — a single grounded record containing AI, activity, and hiring signals ready for outreach personalisation.

python
async def analyse_org(client: GhClient, org: str, max_repos: int = 30) -> dict[str, Any]:
    repos = await client.org_repos(org, per_page=max_repos) or []
    tech_stack = await aggregate_tech_stack(client, org, repos)
    ai_signals = detect_ai_signals(tech_stack, repos)
    ai_score = score_ai(ai_signals, tech_stack)
    activity = await summarise_activity(client, org, repos)
    activity_score = score_activity(activity)
    hiring_signals = await detect_hiring_signals(client, org, repos)
    hiring_score = score_hiring(hiring_signals, tech_stack)
    return {
        "org": org,
        "ai_score": ai_score,
        "activity_score": activity_score,
        "hiring_score": hiring_score,
        "tech_stack": tech_stack,
        "ai_signals": ai_signals,
        "hiring_signals": hiring_signals,
        "activity": activity,
    }
ELI5 — the plain-language version

Think of the enrichment system like a sous chef prepping every ingredient for the head chef. The head chef (outreach) can only cook a perfect, personalized meal if the sous chef has washed, chopped, and verified each item first. This subsystem does exactly that: it gathers real, grounded facts about a company—for example, by running extract_funding_stage to check whether the company is pre-seed, seed, or series-A, then storing that stage along with seniority_gate_ok so outreach knows the seniority level. It also detects buying-intent signals like a company issuing an RFP for an AI tool. Every detail comes from actual content on the company’s home or careers page, not guesswork. Without this prepping, outreach would be cooking blind—sending messages that cite made‑up technologies or roles. That destroys trust with the very person you’re trying to win. A prospect who reads an invented detail won’t just ignore the message; they’ll dismiss the sender as careless or deceptive. The sous chef’s careful verification is what makes the final outreach credible and personal.

Data flow — one request, in order
  1. classify (not in provided source, but its outputs are consumed by the next nodes) — determines the company’s vertical and writes the initial classification and vertical state keys.

    • reads / writes: consumes raw page text; writes state["classification"], state["vertical"], state["classify_source"]
    • branch: always runs first; on success the graph continues to the grade node, on error it sets _error and short‑circuits.
  2. grade — node 3b: LLM grader audits the classification for groundedness in the fetched page text.

    • reads: state["classification"], state["classify_source"], state["home_markdown"][:5000], state["careers_markdown"][:2000]
    • writes: state["grade"] (contains verdict, issues, optionally skipped), state["grade_attempts"]
    • branch:
      • if state["classify_source"] == "heuristic" → immediate ok verdict, skip the LLM (happy path continues)
      • else, the LLM grader is called; if the verdict is not ok, the graph router loops back to classify (up to _CRAG_MAX_ATTEMPTS retries); on the happy path (verdict ok) the request falls through to the vertical‑specific nodes.
  3. enrich_vertical_fit — node 5d: emits vertical‑fit fields scoped to the company’s tagged vertical.

    • reads: state["vertical"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"]
    • writes: state["vertical_fit"] (contains product_summary, icp, ai_native (bool+confidence), vertical_fit (strong/partial/none), and provenance); also writes a row to company_facts under field='vertical_fit.<vertical>'
    • branch:
      • if state["vertical"] is empty or if _error/_skip_reason is set → returns empty and does nothing
      • on happy path, uses the micro‑vertical’s label and keyword signals, then calls the LLM.
  4. extract_voice_ops_signals — emits telephony stack and voice‑ops signals for companies tagged voice‑ops.

    • reads: state["vertical"], state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"]
    • writes: state["telephony_stack"], state["target_vertical"], state["saas_integrations"] (with confidence and evidence)
    • branch:
      • if state["vertical"] != "voice-ops" → returns empty (this step is skipped for other verticals)
      • on happy path, calls ainvoke_json_with_telemetry with a system prompt distinguishing applied voice‑ops from raw telephony.
  5. extract_funding_stage — V20: emits funding stage, signals, team‑size estimate, and the seniority gate for all companies.

    • reads: state["company"], state["company_id"], state["home_markdown"], state["careers_markdown"], state["vertical"]
    • writes: state["funding_stage"] (contains stage, funding_signals, team_size_estimate, seniority_gate_ok, provenance); persists to company_facts under field='funding_stage'
    • branch:
      • if _error or _skip_reason set → return empty; otherwise runs unconditionally
      • seniority_gate_ok is set to True when the stage is in _EARLY_STAGES (pre‑seed, seed, series‑a).
  6. persist (mentioned in comments but not shown in provided source) — commits all enriched state to the database.

    • reads: the accumulated company_facts rows written by previous nodes
    • writes: final database records (not shown in source)
    • branch: no conditional in provided source; it is the terminal node that hands the complete, grounded profile to the outreach system.

The request fans out at step 4: only one of the vertical‑specific extractors (extract_voice_ops_signals, extract_pi_signals, or possibly others) runs, based on the value of state["vertical"]. The grade loop (step 2) causes at most one extra pass through classify and grade when the verdict is not ok, before normal flow resumes.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The enrichment subsystem operates as a sequential pipeline where each node—extract_funding_stage, extract_buying_intent, extract_voice_ops_signals, extract_pi_signals, and extract_immigration_signals—runs independently but follows a uniform ordered mechanism. Every node first checks state.get("_error") or state.get("_skip_reason"); if either flag is set, the node returns an empty {} immediately, leaving the rest of the graph untouched. Next, the node inspects its vertical gate—for example, extract_voice_ops_signals returns {} unless state["vertical"] == "voice-ops"—while extract_funding_stage and extract_buying_intent run unconditionally. The core extraction then calls ainvoke_json_with_telemetry on a DeepSeek model, wrapping untrusted product copy via wrap_untrusted before the LLM call to prevent injected [SYSTEM] directives from steering the output. On success, the parsed JSON is persisted under a specific field in the company_facts D1 table (e.g., field='funding_stage' or field='buying_intent'). On any failure—LLM error, kill-switch, parse failure—the node returns {}, isolating the fault so prior enrichment (already committed in persist) is preserved.

The design’s central invariant is fault isolation: no single node’s failure can block enrichment already committed by upstream nodes or cascade into downstream steps. This guarantee is enforced by every node returning {} on error and by the LLM_KILL_SWITCH mechanism, which gates all LLM calls and silently swallows LlmDisabledError. Combined with the early exit on _error or _skip_reason, the system ensures that a poisoned or failing node cannot corrupt the shared state. The use of company_facts with per-field keys provides idempotent writes—if a node runs twice (e.g., due to a retry), it overwrites the same row, keeping the profile consistent.

The key trade-off is LLM-based extraction over rule-based heuristics. The obvious alternative is a set of hand-written regex or keyword matchers for each signal (e.g., detecting “Series A” in a careers page). That approach avoids LLM latency and cost, and eliminates the risk of hallucinated evidence. The subsystem rejects that simplicity because product copy is too variable—a company might write “we just raised our A round” or “post-series A growth stage”—making regex brittle and expensive to maintain across hundreds of verticals. Instead, the LLM provides flexibility: a single prompt can detect demand automation, RFE drafting, or buying intent from diverse phrasing, and the confidence and evidence fields ground its output in verbatim source text. The cost of this flexibility is higher per-inference latency, potential for hallucinated low-confidence signals, and reliance on an external model with kill-switch risk. The subsystem manages those costs by using telemetry through ainvoke_json_with_telemetry (capturing gen_ai.* span attributes) and by caching LLM responses with a vertical-specific cache_scope to avoid repeated calls for the same company.

A concrete failure mode occurs when the DeepSeek model returns a malformed JSON—for example, a truncated response or a string that fails json.loads. In extract_funding_stage, the ainvoke_json_with_telemetry call would raise a ParseError, which is caught by the function’s try/except block. The node then logs the error via telemetry spans and returns {}. An operator monitoring dashboards would see an empty funding_stage entry in company_facts for that company—specifically, no row with field='funding_stage' would be written—and in the observability stack, the gen_ai.node=extract_funding_stage span would carry an error tag with the parse failure message. The LLM_KILL_SWITCH row in the kill-switch table (if enabled) would not fire because the error is not a disabled model, but the operator would still see a gap in the enriched profile. Because the node returns {}, the rest of the graph (including extract_buying_intent for the same company) proceeds normally, preserving the invariant of non-blocking isolation.

Cost & performance — the real knobs

LLM_KILL_SWITCH

  • Knob — Env var / constant LLM_KILL_SWITCH (boolean, default not shown but presumed False).
  • Bounds — Global on/off for every LLM call in the enrichment pipeline.
  • Effect — When set to True, all LLM invocations are skipped and the corresponding enrichment steps return empty {}. Dollar cost drops to zero (no API usage), but every LLM‑dependent signal (funding stage, pricing model, buying intent, voice‑ops stacks) becomes missing, gutting personalization quality.
  • Risk — Left on accidentally → all downstream outreach messages lose their company‑specific details, breaking the entire “real fact” requirement. Off by default is safe, but forgetting to toggle it during a cost‑saving exercise silently destroys enrichment output.

temperature

  • Knobtemperature=0.1 passed to make_deepseek_flash() (and presumably to all other DeepSeek calls in the graph).
  • Bounds — Controls the randomness (creativity) of LLM token sampling.
  • Effect — Lower values (0.1) produce more deterministic, factual outputs, reducing the chance of hallucinated signals and the need for retries. Higher values increase variety but also increase token cost (more verbose, more parse failures) and latency (retries).
  • Risk — Set too high → outputs become creative / inconsistent, causing gating failures in grade that waste the _CRAG_MAX_ATTEMPTS budget. Set too low (e.g., 0.0) → the model may refuse structured output and waste a call.

cache (with cache_scope)

  • Knobcache=True in ainvoke_json_with_telemetry together with the cache_scope parameter (e.g., "company_enrichment.voice_ops_signals.voice-ops").
  • Bounds — Reuses cached LLM responses for identical prompts within the given scope (vertical / node).
  • Effect — Hits the cache → zero latency, zero token cost. Misses → full LLM cost. A narrower scope increases misses but ensures freshness; a broader scope increases hit rate but risks serving stale facts.
  • Risk — Scope too broad → an old classification (e.g., “seed” funding stage) persists even after the company has raised Series B, poisoning seniority‑gate logic. Too narrow → cache practically never used, defeating its purpose and inflating both latency and spend.

max_chars (in wrap_untrusted)

  • Knobmax_chars=6000 for home page and max_chars=2000 (or 3000) for careers page in the LLM user prompts (e.g., in extract_funding_stage).
  • Bounds — Truncates the scraped markdown before injecting it into the LLM context. Limits tokens per call.
  • Effect — Smaller values → cheaper and faster calls, but may omit critical evidence (e.g., a “Series‑A” mention only on a deep sub‑page). Larger values → better recall but higher cost and longer processing time.
  • Risk — Too low → missing key signals drive down confidence, causing the gating logic in grade to loop through all _CRAG_MAX_ATTEMPTS retries, wasting time and money. Too high → unnecessary token spend on irrelevant boilerplate.

_GH_ANALYSE_REFRESH_DAYS

  • Knob — Constant _GH_ANALYSE_REFRESH_DAYS (integer, exact default not in snippet but used in age check).
  • Bounds — Minimum days between re‑analyses of a GitHub organisation (rate‑limit throttle).
  • Effect — A larger value reduces GitHub API calls (saving both cost and rate‑limit quota) but risks using stale org activity patterns (stars, commit activity). A smaller value keeps signals fresh but incurs more network requests and possible rate‑limiting errors.
  • Risk — Set too high → outreach may reference outdated repos or missing hiring signals. Set too low → repeated rate‑limit hits (429) cause the analyse_github node to fail silently, returning empty timings.

_CRAG_MAX_ATTEMPTS

  • Knob — Constant _CRAG_MAX_ATTEMPTS = 2 in grade.
  • Bounds — Maximum number of times the grade node may bounce back to classify before giving up.
  • Effect — One extra LLM call per gated field when the initial classification has low confidence (the retry may produce a better‑grounded answer). Setting it to 0 would skip all retries, saving cost but accepting lower‑quality classifications; setting it higher could waste budget on hopeless inputs.
  • Risk — Too high → infinite‑loop risk (though capped at 2 here) and multiplied latency/cost for noisy pages. Too low → a mis‑classification that could have been fixed on retry propagates to scoring, potentially mis‑directing outreach.
Failure modes — what breaks, what catches it

LLM Inference Failure

  • Trigger — The DeepSeek call made via ainvoke_json_with_telemetry times out, returns an HTTP error, or produces text that does not contain valid JSON (e.g., a hallucinated schema).
  • Guard — The try block surrounding each LLM call (e.g., in extract_voice_ops_signals or extract_funding_stage) catches the exception and the function returns {} — see the docstrings: “any failure (LLM error, kill-switch, parse failure) returns {}”. No per-call retry is shown; the guard is the unconditional return of an empty dict.
  • Posturefail-soft. The enrichment step degrades by producing no signals for that facet (e.g., no immigration_signals, no competitors), but the graph continues to later nodes.
  • Operator signal — No explicit log line is visible in the snippets; the gen_ai.* span attrs would record the error status. The operator sees company_facts rows missing the expected field values for that company.
  • Recovery — The downstream outreach step receives an empty or partial profile. No automatic retry is coded; the enrichment must be re-triggered manually or via the next scheduled run.

JSON Parse Failure

  • Trigger — The LLM returns valid JSON that does not match the expected schema (missing required keys like detected, confidence, evidence, reason), or the JSON is truncated.
  • Guard — Same generic try/except that returns {}. The code in extract_voice_ops_signals checks if isinstance(result, dict) before proceeding, but a schema mismatch is not validated – any odd dict passes through, possibly leading to downstream KeyErrors. The only real guard is the empty-dict fallback on exception.
  • Posturefail-soft when an exception is raised; fail-open when a malformed dict is accepted (false data reaches the profile).
  • Operator signal — No dedicated error metric; the operator would see unexpected values in company_facts (e.g., a string in a confidence field) or an AttributeError in later nodes.
  • Recovery — No built-in recovery. An invalid dict propagates to the persist step; if it crashes there, the persist function’s own handling (not shown) may swallow the error. Manual cleanup required.

LLM Kill-Switch Activation

  • Trigger — The environment-level LLM_KILL_SWITCH variable or feature flag is engaged, causing make_deepseek_flash to raise LlmDisabledError or the function to short‑circuit before calling the LLM. Several docstrings explicitly mention “Gated by LLM_KILL_SWITCH”.
  • Guard — The exception LlmDisabledError is caught and swallowed (the docstring says “LlmDisabledError swallowed below” for extract_funding_stage), and the node returns {}. No per-node retry.
  • Posturefail-soft. All LLM-dependent enrichment nodes return empty dicts; the graph completes without any AI-derived signals.
  • Operator signal — The LlmDisabledError would appear in the gen_ai.* span as an error, and the node’s telemetry counter “agentic_sales.node” would show zero successful calls.
  • Recovery — The kill-switch must be manually disarmed. No automatic fallback beyond empty enrichment; outreach then relies solely on scraped-but-unanalyzed data.

D1 Database Network Error in analyse_github

  • Trigger — A transient network or Cloudflare D1 error during the d1_one query (e.g., timeout, connection reset), raising D1Error.
  • Guard — The except D1Error clause in analyse_github catches it and returns a dict containing only agent_timings, leaving github_* columns unpopulated.
  • Posturefail-soft. The function completes but produces no GitHub signals; the rest of the enrichment graph is unaffected (the docstring says “any failure … is logged and swallowed”).
  • Operator signal — The D1Error is logged at error level (not shown, but implied by “logged and swallowed”). The operator would see github_analyzed_at still NULL in the companies table for that row.
  • Recovery — No retry within this run. The next run of the enrichment graph will attempt analyse_github again (if the cache expiry has passed, checked via _GH_ANALYSE_REFRESH_DAYS).

Missing GitHub Organization (Silent Skip)

  • Trigger — The company’s github_org column is empty or whitespace-only after a prior enrichment step failed to extract it.
  • Guard — The explicit check if not org: return {"agent_timings": ...} in analyse_github. No error is raised; the function returns a timing-only dict.
  • Posturefail-soft. The GitHub enrichment is simply not performed; the github_* columns remain NULL.
  • Operator signal — No log or metric: the function returns silently. The operator would notice only by observing github_analyzed_at still missing for a company that was expected to have a known GitHub org.
  • Recovery — No automatic recovery. The missing github_org must be corrected in the companies table (manual update or re-running the org-extraction stage), then a fresh enrichment run will pick it up.
Interview — could you explain it?

Q — Warm-up: What does the extract_funding_stage node contribute that directly affects how outreach messages are personalized for early-stage companies?
A — It emits the Boolean field seniority_gate_ok=True when the company is pre-seed, seed, or series‑A, which is consumed by V25/V29 seniority‑fit scoring to lower the seniority bar for outreach. That means a message to a seed‑stage startup can address a junior decision‑maker instead of insisting on the C‑level, because enrichment has grounded the stage in source text.
Follow-up — How does the system ensure that seniority_gate_ok is not silently missing if the LLM call fails?
A — The function is marked non‑fatal – any failure (including LlmDisabledError) returns {} without raising, so the rest of the graph continues and later scoring simply sees no gate signal.
Weak answer misses — The explicit constant _EARLY_STAGES (frozenset of "pre-seed", "seed", "series-a") that defines the gate and the provenance fields (confidence, reason, source, evidence) that allow trust calibration.


Q — Medium: How does enrich_vertical_fit produce the specific product detail (e.g., “AI‑native ICP”) that an outreach message would cite, and what guarantees that detail is actually grounded?
A — It returns a structured vertical_fit dict containing product_summary, icp, and ai_native (Boolean + confidence) along with full provenance (confidence, reason, source, evidence). The JSON is extracted from an LLM prompt that is tailored per micro‑vertical using MICRO_VERTICALS.get(vertical) to apply the right keyword signals, and it writes a company_facts row under field='vertical_fit.<vertical>' so downstream outreach can retrieve both the value and its evidence.
Follow-up — Why does the node require a specific vertical to be set (state["vertical"]) before it runs?
A — Because the prompt branches on the vertical’s label and keyword_signals to produce a tailored qualifier; if the vertical is empty the node returns {} immediately, preventing a generic guess that would contaminate outreach.
Weak answer misses — The node runs after analyse_github in the graph edge order (builder.add_edge("analyse_github", "enrich_vertical_fit")), ensuring the vertical tag is already established before fit is assessed.


Q — Design question: extract_voice_ops_signals exists solely for the "voice-ops" vertical. Why was it implemented as a separate node that checks vertical == "voice-ops" and no-ops for others, rather than folding that logic into the generic enrich_vertical_fit prompt?
A — Separating it allows a specialized system prompt (_VOICE_OPS_SYSTEM_PROMPT) that distinguishes applied voice‑ops products from raw telephony infrastructure, which a generic prompt could not reliably tease apart. The node emits telephony_stack[], target_vertical, and saas_integrations[] with evidence grounded in source text – a level of detail that would dilute the single‑prompt approach and risk hallucinating infrastructure‑type signals for non‑voice companies.
Follow-up — How does the graph guarantee this node runs in the correct order relative to the other vertical signals?
A — The edge chain is explicit: extract_health_signalsextract_voice_ops_signalsextract_fintech_signals, and only the voice‑ops node checks for vertical != "voice-ops" to short‑circuit, so the order is deterministic.
Weak answer misses — The _VOICE_OPS_VERTICAL constant and the use of ainvoke_json_with_telemetry with provider "deepseek" and a dedicated cache scope (f"company_enrichment.voice_ops_signals.{_VOICE_OPS_VERTICAL}").


Q — Hard: The classify node returns a category field (CONSULTANCY/STAFFING/AGENCY/PRODUCT/UNKNOWN) that influences outreach targeting, but it also includes confidence and reason. How does the subsystem handle cases where the LLM classification is uncertain, and what happens to that uncertainty when the profile is handed to outreach?
A — The classification dict includes a confidence (0..1) and a reason string; the node uses a CRAG retry path: if the grade block flags the output as low‑quality, _grade_router re‑routes back to "classify" for a second pass that includes the critic’s issues in the user prompt. After retry, the final record is written by persist with a method label (LLM or HEURISTIC) so outreach can decide whether to trust a low‑confidence fact or fall back to a different personalization strategy.
Follow-up — What happens if even the retry produces a low‑confidence?
A — The graph does not block – it proceeds to score and persist; the low confidence value (e.g., <0.5) is stored in company_facts, and downstream scoring systems weight it accordingly, which means outreach may simply skip that field rather than use a weak signal.
Weak answer misses — The heuristic fallback function (inside classify) that returns confidence=0.3, source="heuristic" when regex keywords are matched, ensuring even a non‑LLM classification is labelled as a guess and never passed as grounded fact.


Q — Final: The whole enrichment graph culminates in extract_buying_intent, which emits buying‑intent signals (cue_type, strength, confidence). Why must outreach rely on the persisted company_facts record rather than re‑querying the company’s homepage at send time?
A — The enrichment graph runs once per company and persists each fact (funding stage, vertical fit, buying intent, etc.) with full provenance. Re‑querying at send time would be expensive, introduce race conditions, and risk producing inconsistent facts (e.g., the homepage changed). By relying on the persisted company_facts row under field='buying_intent', outreach gets a deterministic, timed snapshot with evidence that can be cited – crucial because an invented detail in a message destroys trust.
Follow-up — How does the graph guarantee that extract_buying_intent has access to all earlier‑extracted context without re‑reading the homepage?
A — It runs last in the chain (builder.add_edge("extract_pricing_model", "extract_buying_intent")) and its prompt uses only the home_markdown and careers_markdown that were already loaded into state; it does not re‑fetch URLs, so it depends on the same source text the earlier nodes used.
Weak answer misses — The explicit source field in every enrichment output (e.g., "source": "heuristic" or "source": "llm"), and the fact that extract_buying_intent returns {} on any failure so that a missing buying‑intent signal never blocks outreach from using other facts.

Glossary — the domain terms, grounded in the code

15terms, each defined from this subsystem’s real source.

CompanyEnrichmentState

CompanyEnrichmentState is the typed state schema imported from schemas.state that is passed through each node of the five-node linear enrichment graph (load → fetch → classify → score → persist), holding fields such as company_id, company, classification, scores, home_markdown, vertical, and error/skip flags.

Memory hook CompanyEnrichmentState is the shared backpack carried through every node of the enrichment pipeline.

From company_enrichment_graph.py

fetch

fetch is an async function that, given a CompanyEnrichmentState, builds URLs for a company's home and career pages, fetches them in parallel via asyncio.gather, and returns a dict containing the home and careers markdown, the careers URL, and a timing record.

Memory hook Fetch uses asyncio.gather to gather home and career pages simultaneously.

From company_enrichment_graph.py

classify

classify is a node (Node 3 in the enrichment pipeline) that produces a classification dictionary (containing fields such as category, tier, and confidence) by either invoking an LLM with cache and memory or falling back to a keyword-based heuristic function (`_heuristic_classify`).

Memory hook Classify chooses a category by LLM recall or keyword guess.

From company_enrichment_graph.py

grade

In this subsystem, grade is an async LLM-based grader that evaluates whether a company classification is grounded in the scraped page text, returning a verdict of "ok" or "retry" along with any issues; the router then uses this verdict to either continue to the score node or loop back to the classify node for a single retry with the critic's issues folded into the prompt.

Memory hook Grade grades the classification's grounding, giving a retry pass if the LLM's first answer flunks.

From company_enrichment_graph.py

score

score is an async function that computes a numerical score from `CompanyEnrichmentState` by summing contributions from the classification’s category, tier, remote_policy, and `has_open_roles`, and optionally adjusts it with a hiring‑velocity signal conditionally grounded on evidence and confidence.

Memory hook Score sums category, tier, remote, and hiring points, then optionally adds a verified velocity lift.

From icp_fit_scorer.py

persist

persist is a graph node implemented as an async function that writes enrichment results—including classification, confidence, score, reasons, and timestamps—to the companies table via an UPDATE statement, running after the score node and before the analyse_github node.

Memory hook Persist saves the enrichment score and classification to the companies table with an UPDATE, making it permanent.

From company_enrichment_graph.py

_grade_router

_grade_router is a conditional edge function that returns "classify" to retry the LLM classification when the grade verdict is "retry" and fewer than two attempts have been made, otherwise returns "score" to continue enrichment.

Memory hook _grade_router bounces a retry verdict back to classify for a single do-over, then moves to score.

From company_enrichment_graph.py

_heuristic_classify

_heuristic_classify is a function that uses keyword matching on home and careers markdown text to classify a company into category, tier, and other fields, returning a low-confidence verdict with source "heuristic" to distinguish it from grounded LLM classifications, and its outputs are used downstream where heuristic-sourced classifications skip the grading step.

Memory hook _heuristic_classify guesses keywords fast, marks itself 'heuristic' so downstream graders ignore its low-confidence hits.

From company_enrichment_graph.py

CRAG

CRAG is a grading-then-rewrite pattern where the `grade` function issues a verdict that the router uses to either proceed to `score` or loop back to `classify` once; the mechanism limits retries to a maximum of 2 attempts (`_CRAG_MAX_ATTEMPTS`) and gates only the high-priority fields `category_ok`, `tier_ok`, and `remote_policy_ok` (`_CRAG_GATED_FIELDS`).

Memory hook CRAG is the grade-and-retry loop that lets a shaky classification climb back up the crag for one more try.

From company_enrichment_graph.py

_FRESHNESS_DAYS

_FRESHNESS_DAYS is a threshold constant used in the freshness skip gate to determine if a company's classification is recent enough to avoid re-enrichment, where age_days is computed from the stored updated_at timestamp and compared to this constant.

Memory hook _FRESHNESS_DAYS is the freshness expiry: if age_days is under it, classification is still fresh and avoids re-enrichment.

From company_enrichment_graph.py

classification

**classification** – A structured dictionary (with keys such as `category`, `tier`, `remote_policy`, `confidence`, and `reason`) produced by an LLM call (or a heuristic fallback) that describes a company; it is later graded for groundedness, scored, and persisted to the `companies` table.

Memory hook Classification is the LLM's structured verdict on a company, graded for groundedness before being saved.

From company_enrichment_graph.py

company_facts

In this subsystem, `company_facts` is a database table that stores enrichment results—such as `buying_intent`, `classification.home`, and careers data—with fields like `company_id`, `field`, `value_json`, `confidence`, and `extractor_version`, and it is written to during the persist phase so that rows from this Python enricher coexist with those from a separate Rust enricher.

Memory hook Company_facts is a shared table where Python and Rust enrichers each deposit enrichment findings with a version stamp.

From company_enrichment_graph.py

extractor_version

extractor_version is a constant identifier for the version of the extraction logic that is persisted alongside each fact row (e.g., in the list with company_id, “LLM”, and status 200) to track which extractor version produced the data.

Memory hook Like a batch number on canned goods, extractor_version stamps each fact with its extraction recipe version.

From company_enrichment_graph.py

hiring_velocity

hiring_velocity is a structured LLM-extracted classification of a company's hiring trend (one of "rising", "flat", "falling") with magnitude, confidence, reason, and evidence, which is used to adjust a score (adding a boost for "rising" or subtracting a drag for "falling") only when the evidence and confidence are sufficient (grounded), and is persisted to the `company_facts` table with full provenance.

Memory hook Hiring velocity shifts the score up or down like a gear shift, but only when the clutch of evidence is engaged.

From company_enrichment_graph.py

wrap_untrusted

wrap_untrusted is a function that fences scraped product or careers copy before an LLM call, preventing planted ``[SYSTEM]`` injections from steering the extraction; it is used on the ``home_markdown`` and ``careers_markdown`` strings with a label and character limit.

Memory hook wrap_untrusted fences scraped copy with a label and char limit to block [SYSTEM] injection attacks.

From company_enrichment_graph.py