01. What Enrichment Is
Enrichment turns a thin record into a profile you can act on. A lead often arrives as just a company name or a website. A contact might be only a name and a job title. That is rarely enough to decide whether the company is worth pursuing. It is also too little to write a message that lands. Enrichment fills the gaps. It works out what the company does, how big it is, and what software it runs. It also finds the right person to reach and a way to reach them. One discipline rules all of this. Every fact you add must trace back to a real, checkable source rather than a guess. A sales decision built on an invented detail is worse than no detail at all. Enrichment sits in the middle of the pipeline. Discovery surfaces raw candidates before it, and outreach turns the best ones into personalized messages after it. There is a clear cost to this care. Grounding every fact is slower than letting a model improvise freely. Done well, it is the difference between a generic blast and a message that proves you understand the business.
Heuristic classifier enriches a sparse company record by extracting category, tier, and other attributes from scraped web pages, grounding them in detected keywords.
def _heuristic_classify(home_markdown: str, careers_markdown: str) -> dict[str, Any]:
text = (home_markdown + " " + careers_markdown).lower()
matched: list[str] = []
tier2 = [k for k in ("llm", "genai", "agent", "rag", "foundation model") if k in text]
tier1 = [k for k in ("machine learning", " ml ", "data science") if k in text]
if tier2:
tier = 2
matched += tier2
elif tier1:
tier = 1
matched += tier1
else:
tier = 0
if "consult" in text or "services" in text:
category = "CONSULTANCY"
matched += [k for k in ("consult", "services") if k in text]
elif "staff" in text or "recruit" in text:
category = "STAFFING"
matched += [k for k in ("staff", "recruit") if k in text]
elif "agency" in text or "marketing" in text:
category = "AGENCY"
matched += [k for k in ("agency", "marketing") if k in text]
elif any(k in text for k in ("platform", "saas", "product")):
category = "PRODUCT"
matched += [k for k in ("platform", "saas", "product") if k in text]
else:
category = "UNKNOWN"
return {
"category": category,
"tier": tier,
"industry": "",
"remote_policy": "unknown",
"has_open_roles": bool(careers_markdown),
"confidence": 0.3,
"reason": "heuristic fallback (regex keyword match)",
"evidence": ("matched keywords: " + ", ".join(sorted(set(matched)))) if matched else "no keywords matched",
"source": "heuristic",
}
Imagine you find a business card with just a name and a phone number. Without any other context, you don’t know if that person is a CEO or a receptionist, or even what their company does. Enrichment is like digging into that card to build a full profile—finding the person’s title, the company’s size, the technology they use, and whether they’re hiring or shrinking. Concretely, this subsystem reads the company’s home page and careers page, fencing the text with wrap_untrusted to prevent trickery, then uses a language model to extract specific signals. For example, extract_hiring_velocity classifies whether the company is actively expanding or holding flat by looking for phrases like “200% headcount growth.” Without enrichment, you would only have that bare card: no idea if the company fits your vertical, no sense of its funding stage, no clue whether it’s even worth a call. You’d waste time on dead ends or send messages that miss the mark entirely.
-
grade – Node 3b that audits classification groundedness; returns early when
_error,_skip_reason, or an emptyclassificationexist.
reads / writes – reads_error,_skip_reason,classification,classify_source; writesgrade(verdict, issues),grade_attempts,agent_timings.
branch – happy path proceeds after all early‑return checks pass andclassify_sourceis not"heuristic". -
enrich_vertical_fit – Node 5d that emits vertical‑fit fields scoped to the company’s tagged vertical; returns
{}if_error,_skip_reason,verticalis empty, or the vertical is not found inMICRO_VERTICALS.
reads / writes – reads_error,_skip_reason,vertical,company,company_id,home_markdown,careers_markdown,MICRO_VERTICALS; writesvertical_fit(product_summary, icp, ai_native, vertical_fit, provenance),agent_timings.
branch – happy path continues whenverticalis non‑empty and exists inMICRO_VERTICALS. -
extract_funding_stage – V20 node that extracts funding stage and team‑size estimate for all companies; returns
{}if_erroror_skip_reasonare set.
reads / writes – reads_error,_skip_reason,company,company_id,home_markdown,careers_markdown,vertical; writesfunding_stage(stage, funding_signals, team_size_estimate, seniority_gate_ok, provenance),agent_timings.
branch – happy path proceeds when no error/skip. -
extract_pi_signals – V14 node that emits PI demand‑letter signal fields for legal‑pi‑demand companies; returns
{}if_error,_skip_reason, orvertical != "legal-pi-demand".
reads / writes – reads_error,_skip_reason,vertical,company,company_id,home_markdown,careers_markdown; writespi_signals(demand_automation, medical_record_summarization, case_intake with detected, reason, confidence),agent_timings.
branch – happy path only whenvertical == "legal-pi-demand". -
analyse_github – Async function that populates
github_*columns for companies with a known organization; returns{}if_error,_skip_reason, or missingcompany_id.
reads / writes – reads_error,_skip_reason,company_id; writesagent_timings; internally writes GitHub columns after later steps.
branch – happy path proceeds when company_id is present. -
d1_one (within
analyse_github) – Database query that retrieves the company row (key, github_org, github_url, github_analyzed_at, tags) from thecompaniestable.
reads / writes – reads from D1 database; writes the returned row dict to local variable.
branch – if no row is returned, the function returns early with timings. -
analyse_github (github_org check) – Conditional inside
analyse_githubthat checks ifgithub_orgis empty; if empty, the function returns early.
reads / writes – reads thegithub_orgfield from the database row; no writes on early return.
branch – happy path continues whengithub_orgis non‑empty. -
analyse_github (analyzed_at check) – Conditional inside
analyse_githubthat comparesgithub_analyzed_atto the current timestamp; if the analysis is recent (less than_GH_ANALYSE_REFRESH_DAYS), the function returns early.
reads / writes – readsgithub_analyzed_atfrom the row; no writes on early return.
branch – happy path proceeds when the data is stale or never analyzed, leading to the GhClient analysis (not shown in provided source).
The enrichment subsystem is an ordered pipeline that transforms a sparse lead—often just a company name or domain—into a rich profile by executing a sequence of specialized extraction nodes. The mechanism begins by gathering raw markdown from home and careers pages, then fans out into vertical-specific extractors such as enrich_vertical_fit (which only runs when state["vertical"] is set) and vertical-tuned nodes like extract_voice_ops_signals (gated on _VOICE_OPS_VERTICAL) or extract_pi_signals (gated on _PI_VERTICAL). After these vertical-specific nodes complete, extract_funding_stage runs for every company, followed by extract_buying_intent which emits a buying-intent signal for composite ranking. Each node follows a strict pattern: it checks state.get("_error") or state.get("_skip_reason") and returns {} immediately if set, then wraps untrusted markdown through wrap_untrusted to prevent prompt injection, calls ainvoke_json_with_telemetry with a vertical‑specific system prompt, and on failure silently returns {} so the rest of the graph is unaffected.
The invariant the design preserves is that every fact added to the profile is written to a company_facts row with full provenance—confidence, reason, source, and evidence—and that a failure in any single extraction node never blocks data that has already been committed by persist. This is explicitly a non‑fatal design: errors, LLM failures, kill‑switch activations, or parse failures all yield an empty dict, leaving previously enriched fields intact. The write boundary is per‑field: each node writes only its own field key under company_facts (e.g., field='funding_stage' for extract_funding_stage, or field='vertical_fit.<vertical>' for enrich_vertical_fit), ensuring that partial results coexist without corruption. The graph thus guarantees that no single extraction failure can take down the entire enrichment run, and every piece of data is traceable to a specific source excerpt.
The key trade-off is between a single, monolithic extraction call and the current architecture of many narrow, vertical‑specific nodes. A single generic LLM prompt could attempt to extract all fields for any company, but that approach would require an enormous system prompt covering every vertical’s signals, would struggle to maintain provenance, and would be brittle when encountering out‑of‑domain companies. The design rejects that alternative because it would incur a high cost in accuracy and debuggability—hallucinated fields would be hard to isolate, and a single parse failure could lose all enrichment. Instead, the subsystem uses per‑vertical system prompts (e.g., _VOICE_OPS_SYSTEM_PROMPT for voice‑ops, or a prompt that branches on MICRO_VERTICALS for vertical fit) and per‑node observability via gen_ai.* span attributes and the agentic_sales.node metadata key. The cost this avoids is the operational headache of diagnosing which part of a giant prompt failed; each node’s failure is self‑contained and independently observable.
Consider a concrete failure mode: the extract_buying_intent node receives a malformed LLM response that cannot be parsed as JSON. The node’s logic catches the parse exception and returns {} immediately, leaving state["buying_intent"] unset and allowing downstream nodes to continue. An operator monitoring the enrichment pipeline would see an error logged at the ainvoke_json_with_telemetry call, a spike in the gen_ai.parse_failure metric, and the span attribute agentic_sales.node=extract_buying_intent marked with an error tag. No other node is affected, and the composite ranking step V73 simply sees an absent signal rather than a bad one—preserving the overall profile’s integrity while flagging the specific extraction for investigation.
The enrichment subsystem spends time and money primarily on LLM inference (each call to DeepSeek Flash, costing per token), GitHub API calls (analyse_github), and cache lookups/storage. The following five real knobs control those costs and latencies.
-
LLM_KILL_SWITCH — An environment variable that disables all LLM-driven extraction nodes.
Bounds: When set to a truthy value (e.g.,1), every enrichment function that calls an LLM returns{}immediately.
Effect: Turning it on drives time and dollar cost to zero for all LLM steps, but no signals (voice-ops, funding stage, pricing, fintech, buying intent) are produced. Turning it off restores full enrichment at full cost.
Risk: Mis-set on leaves the system blind to all LLM-derived fields; mis-set off permits unbounded LLM spend if no other rate limit is in place. -
temperature — A parameter passed to
make_deepseek_flash(temperature=0.1).
Bounds: Controls the randomness of LLM output (≥0.0). Default is 0.1.
Effect: Lower values (e.g., 0.0) make outputs more deterministic, reducing token count variation and slightly lowering per-call cost and latency. Higher values (e.g., 0.5) increase diversity but may produce longer responses and more retries.
Risk: Too high can produce non‑JSON output, forcing retries and raising cost; too low may cause repetitive responses that still pass schema, wasting tokens without harm. -
cache — A boolean parameter in
ainvoke_json_with_telemetry; defaultTrue.
Bounds: WhenTrue, the LLM response is stored under thecache_scopeidentifier (e.g.,"company_enrichment.voice_ops_signals.voice-ops"). A subsequent request with identical inputs returns the cached result.
Effect: Cache on reduces both latency (zero LLM call) and cost (no token consumption) for repeated queries. Cache off forces a fresh LLM call every time, increasing latency and dollar spend linearly with query count.
Risk: Disabling cache turns every repeated enrichment into a paid call. Enabling cache with too long a TTL (not shown in source) risks serving stale signals. -
max_chars — A parameter in
wrap_untrustedfor truncating input text (max_chars=6000for home page,max_chars=2000for careers page).
Bounds: Caps the number of characters passed to the LLM prompt. Defaults are 6000 and 2000 respectively.
Effect: Lower values reduce per‑call token count, cutting both latency and cost. Higher values include more of the scraped page, potentially improving signal accuracy but increasing token cost and response time.
Risk: Setting too low may truncate key evidence (e.g., a pricing mention or compliance certification), causing missed signals. Setting too high wastes money on boilerplate text that does not contribute. -
_GH_ANALYSE_REFRESH_DAYS — A constant that determines how often GitHub analysis is repeated (
analyse_github). Exact default is not shown in the snippet, but it is compared against the age ofgithub_analyzed_at.
Bounds: An integer threshold in days. If the last analysis is younger than this value, the GitHub probe is skipped.
Effect: A smaller value (e.g., 1) increases the frequency of GitHub API calls, raising throughput and cost (rate limits and compute). A larger value (e.g., 30) reduces calls, saving money but risking outdated GitHub insights.
Risk: Too small can exhaust GitHub API rate limits or waste resources on unchanged repos; too large allows stale commit activity to persist in scoring.
Failure 1: LLM Kill Switch Engaged
- Trigger — The environment-level
LLM_KILL_SWITCHflag is set toTrue, causingmake_deepseek_flashor theainvoke_json_with_telemetrycall in any enrichment node to raiseLlmDisabledError. - Guard —
LlmDisabledErroris explicitly caught and swallowed in thetry/exceptblocks of nodes likeextract_funding_stage(source says “Gated byLLM_KILL_SWITCH(LlmDisabledError swallowed below)”). The node immediately returns{}. - Posture — fail-soft. The run continues, but the affected node’s output (e.g.,
funding_stage,vertical_fit,immigration_signals) is silently absent. - Operator signal — No error log; the expected
company_factsrow under the node’sfieldis missing. Telemetry spans show the node completed in ~0 ms with no result. - Recovery — No automatic retry. The operator must set
LLM_KILL_SWITCH = Falseand re-run the enrichment for the affected companies.
Failure 2: LLM Network or Provider Error
- Trigger — The DeepSeek API is unreachable, returns a 5xx, or exceeds a timeout during
ainvoke_json_with_telemetry. - Guard — The generic
try/exceptblock in each node (e.g.,extract_voice_ops_signals,enrich_vertical_fit,extract_funding_stage) catches the exception. No named exception class is specified in the source; the fallback is alwaysreturn {}. - Posture — fail-soft. The rest of the graph proceeds, but the node’s signals are empty.
- Operator signal — A
gen_ai.*span may show a timeout or error, but no structured error is written to state. The operator sees a missingcompany_factsentry for the node’s field. - Recovery — None. The graph does not retry. A manual re-run or a later run on the same company (if the API recovers) will attempt extraction again.
Failure 3: LLM Returns Malformed JSON (Parse Failure)
- Trigger — The DeepSeek response is not valid JSON, or the JSON does not match the expected schema (e.g., missing
detected,confidence,evidence,reasonkeys). - Guard — Implicit in the nodes’ “parse failure returns
{}” contract. The code likely wrapsjson.loadsin atry/except ValueError(not shown in snippet), returning{}on failure. No named guard is given. - Posture — fail-soft. The node contributes no data; the rest of the graph is unaffected.
- Operator signal — No log line; the
company_factsrow for the node’s field is absent. Telemetry cannot distinguish this from a network error without custom metrics. - Recovery — None. The cache (
cache=Trueinainvoke_json_with_telemetry) may have stored the invalid response, preventing a retry without manual cache invalidation.
Failure 4: D1 Database Write Error
- Trigger — The
company_factsinsert (e.g., inenrich_vertical_fit’spersistpath) fails due to aD1Error(connection loss, constraint violation, or row-level conflict). - Guard —
analyse_githubexplicitly catchesD1Errorand returns{"agent_timings": …}. Other nodes (e.g.,enrich_vertical_fit) do not show a database guard in the snippet, but they are documented as non-fatal — likely the write error is caught and swallowed upstream (e.g., in the persist helper). - Posture — fail-soft. The in-memory state may still contain the computed signals, but they are never persisted to
company_facts. Downstream consumers (e.g., scoring) will see stale or absent data. - Operator signal — No error in the graph state; the operator must check D1 query logs or observe that the
company_factsrow is missing for the expectedfield. - Recovery — None. The enrichment run does not retry the write. A manual re-run or a dedicated backfill job is required.
Failure 5: GitHub API Rate Limit or Network Error
- Trigger — The
analyse_githubnode’s call toanalyse_org(GhClientinternally) is throttled, the token is missing, or the network is down. - Guard —
analyse_githubwraps the entire operation in atry/exceptthat catches any exception and returns{"agent_timings": …}. The source explicitly names “rate-limit, missing token, network” as caught failures. - Posture — fail-soft. The
companies.github_*columns remain as they were (or are left null). The rest of the enrichment is unaffected. - Operator signal — No exception propagates;
agent_timingsis recorded. The operator would seegithub_analyzed_atunchanged and no new GitHub data in the company record. - Recovery — No automatic retry. The node will re-attempt next time it runs (after
_GH_ANALYSE_REFRESH_DAYS), but if the problem persists, manual intervention is needed (e.g., rotate token, adjust rate limits).
Q1 (Warm-up)
How does the subsystem decide whether to trust or ignore a trend signal like hiring velocity when scoring a company?
A
The score node in company_enrichment_graph.py checks hv_grounded by verifying that evidence exists and confidence >= 0.5. If those conditions fail, the trend is set to empty and appended as “ungrounded,ignored” to the reasons list – no boost or drag is applied. This ensures no fabricated signal moves the ranking.
Follow-up
Why not simply discard the whole hiring_velocity object when it’s ungrounded?
One‑line answer
The code still records the reason for ignoring it so that downstream audit or debugging can see why the trend was skipped.
Weak answer misses
The groundedness gate applies to both evidence existence and a confidence threshold (≥0.5), not just one of them.
Q2 (Medium)
What happens when a company’s classification is produced by a heuristic fallback instead of an LLM? How does the pipeline handle grading?
A
In the grade node, the code explicitly checks state.get("classify_source") == "heuristic". If true, it immediately returns a verdict of ok with a note skipped: "heuristic", bypassing the LLM‑based grader. This is because a heuristic-sourced classification cannot be improved by retrying an LLM call – a retry would produce the same regex‑based answer.
Follow-up
Is there any other guard to prevent heuristic guesses from being treated as high‑confidence facts?
One‑line answer
Yes, the classify function itself sets "confidence": 0.3 and "source": "heuristic", so downstream scoring (which multiplies by confidence) naturally weights it less.
Weak answer misses
The heuristic branch also records matched keywords as evidence, but its confidence is fixed at 0.3 – the article says “a guess must never pass as a grounded fact.”
Q3 (Design question – medium/hard)
Why does the pipeline use a CRAG (Corrective RAG) loop for classification – why not just rely on a single high‑temperature LLM call and move on?
A
The grade node acts as a quality gate. When the LLM grader flags low‑confidence fields (such as category_ok, tier_ok, remote_policy_ok), the _grade_router conditional edge loops back to classify for a single retry. This mirrors the “grade‑then‑rewrite” pattern from LangGraph examples, allowing the second pass to correct mistakes without wasting multiple calls on the same input.
Follow-up
What stops an infinite loop if the retry still fails the grade?
One‑line answer
The constant _CRAG_MAX_ATTEMPTS = 2 caps the retries, enforced by the router logic (not shown but implied by the conditional edge).
Weak answer misses
The retry reuses the already‑fetched markdown (home_markdown and careers_markdown) – there is no reason to scrape again, and the cap prevents unbounded cost.
Q4 (Hard)
The vertical‑specific signal extraction nodes (e.g., extract_voice_ops_signals, extract_fintech_signals) are chained in a fixed order after persist. Why chain them sequentially instead of running them in parallel, given that each no‑ops for non‑relevant verticals?
A
The chain is defined as a sequence of add_edge calls (e.g., persist → analyse_github → enrich_vertical_fit → extract_pi_signals → … → extract_fintech_signals). Each node checks state["vertical"] first; if it doesn’t match, the node returns early with empty results. This sequential design avoids the complexity of a branching conditional graph while still ensuring that every vertical gets its own tailored signal extractor – the overhead of a no‑op is minimal compared to managing concurrency.
Follow-up
Could a failure in one vertical signal block subsequent extractors?
One‑line answer
Every vertical signal node is documented as “non‑fatal – any failure here does not block enrichment that already committed in persist.”
Weak answer misses
The edge order is not arbitrary: enrich_vertical_fit must run before extract_pi_signals because the PI extractor depends on the vertical already being set (the “V13”/“V14” labels hint at version dependency).
Q5 (Hardest)
How does the subsystem enforce that “every fact added must be grounded in a real, checkable source rather than guessed,” as stated in the chapter? Cite at least two concrete mechanisms from the code.
A
Two mechanisms are:
- Hiring‑velocity grounding gate: The
scorenode requireshv.get("evidence")to be truthy andconfidence >= 0.5before applying any trend – otherwise the trend is dropped with a reason. - Heuristic source tagging: The
classifyfunction setssource: "heuristic"andconfidence: 0.3for regex‑based answers, and thegradenode skips LLM grading for heuristic‑sourced classifications, ensuring a guess never passes as a grounded fact.
Follow-up
What about the vertical‑fit enrichment – does it have a similar groundedness rule?
One‑line answer
The enrich_vertical_fit node only runs when state["vertical"] is set and the micro‑vertical definition exists; it’s gated by a LLM kill‑switch and the prompt explicitly demands evidence, but the core grounding is enforced by the LLM prompt itself rather than a post‑hoc confidence filter.
Weak answer misses
The CRAG grade node (grade) adds a second layer of validation specifically for the classification output, verifying that claims are supported by the page text – this is the “checkable source” requirement for LLM‑produced facts.