Lead Discovery — Transcript

📄 11 chapters · read at your own pace

01. What Discovery Is

Discovery is the very first stage of the lead pipeline. It scans real hiring sources for companies that are actively building teams. Those sources are applicant tracking boards, startup launch feeds, and a large web crawl. A seed query can steer the search toward a niche when you supply one. But the system never trusts a name a language model simply invented. Any candidate that came only from brainstorming is dropped before it can reach the database. That write boundary is what keeps the stored records honest.

Discovery sits at the top of the funnel, so everything later depends on it. Each candidate must show a genuine hiring signal, such as a minimum number of open artificial intelligence roles. A company that fails the check is quietly skipped. The trade-off is deliberate. We accept fewer leads in exchange for trusting every one of them. A short list of real companies beats a long list of guesses, because every downstream step rests on facts rather than fiction.

Persisting seed-query discovery candidates with deduplication, confidence, and evidence.

python
async def _persist_seed_candidates(state: DiscoveryState) -> tuple[list[int], list[int], int]:
    scored = state.get("scored") or []
    if not scored:
        return [], [], 0
    vertical = state.get("vertical")
    geography = state.get("geography")
    profile_tags = _profile(state).tags
    inserted_ids: list[int] = []
    existing_ids: list[int] = []
    now = datetime.now(timezone.utc).isoformat()
    for c in scored:
        tags_list = ["discovery-candidate"]
        if vertical: tags_list.append(vertical)
        if geography: tags_list.append(geography)
        # … language/market tag handling …
        tags_list.extend(profile_tags)
        key = _slugify(c["domain"])[:200]
        meta = await d1_run(
            """INSERT INTO companies (key,name,canonical_domain,website,category,tier,
               tags,score,score_reasons,created_at,updated_at)
               VALUES (?,?,?,?,?,?,?,?,?,?,?)
               ON CONFLICT (key) DO NOTHING""",
            [key, c["name"], c["domain"], f"https://{c['domain']}", "UNKNOWN", 0,
             json.dumps(tags_list), c["pre_score"],
             json.dumps([c["why"], f"confidence:{float(c.get('confidence') or 0.0):.2f}",
                         f"evidence:{str(c.get('evidence') or '').strip()[:200]}",
                         "source:seed_query"]),
             now, now],
        )
        if int(meta.get("changes") or 0) >= 1:
            new_id = int(meta.get("last_row_id") or 0)
            if new_id: inserted_ids.append(new_id)
            continue
        existing = await d1_one("SELECT id FROM companies WHERE key = ?", [key])
        if existing: existing_ids.append(int(existing["id"]))
    return inserted_ids, existing_ids, 0
ELI5 — the plain-language version

Picture a scout who visits different farmers’ markets—applicant tracking boards, startup launch feeds, and a huge web crawl—to find real farms that are actively growing crops (companies that are building teams). He carries a shopping list (a seed query) if you want to aim at a specific niche, but he never buys produce that a storyteller simply claimed existed; any candidate that came only from imagination gets thrown out before it reaches the cart. This subsystem is the very first stage of the lead pipeline, built to find real companies from honest sources and nothing else.

The scout works step by step. He first scans boards where employers post jobs (ATS sources like commoncrawl), then browses recent startup announcements (launchfeed, the default source), and also pulls from the Common Crawl archive of the web. If you give him a seed query, he runs it through a function called _brainstorm_direction that asks DeepSeek to suggest real companies—this path is marked as a sanctioned SEED_SOURCE and is allowed to persist. But there's a separate, older channel called brainstorm that fabricates candidates rather than finding them. The code keeps them apart using a frozenset named _SYNTHETIC_SOURCES that holds only "brainstorm". Any candidate that emerged from that fake channel is hard-stripped in the discover step before it ever reaches the database.

The trickiest part is why even the seed‑query path, which also asks an LLM for suggestions, is considered honest. The key edge is that the seed_query source is not in _SYNTHETIC_SOURCES—it expands from a real user-provided query and is gated differently. Meanwhile, the legacy brainstorm channel is excluded by a write boundary: the _SYNTHETIC_SOURCES set ensures that no invented company can slip through. Without this subsystem, the pipeline would eagerly store fabricated companies, filling the database with names that don't exist and causing every later step—outreach, scoring, everything—to waste time chasing ghosts. That’s the concrete failure a beginner would feel: sending emails to businesses that never were.

Data flow — one request, in order
  1. plan_targets is called with the incoming DiscoveryState; it resolves the list of active directions via micro_verticals.resolve_directions(state.get("directions")) and, unless the source set is limited to seed‑query‑only (the _ATS_PATH_SOURCES check), reads and advances the per‑direction sub‑niche cursor from a D1 cursor table, returning a _direction_sub_niches dict mapping each direction to the chosen sub‑niche string.

    • reads: state.get("directions"), state.get("sub_niche"), state.get("sources")
    • writes: {"_direction_sub_niches": dict[str, str]}
    • branch: if the active sources are not ATS‑path sources (i.e., only seed_query is present), it short‑circuits and returns an empty _direction_sub_niches — the rotation logic is skipped.
  2. discover is invoked with the state that now includes _direction_sub_niches; it first calls _resolve_sources(state) to determine the active data channels and, if any synthetic source (like brainstorm) was explicitly passed, logs a warning and strips it from the channel list.

    • reads: state.get("sources"), state.get("directions"), state.get("concurrency"), state.get("per_host"), state.get("cc_warc"), state.get("cc_max")
    • writes: internal sources set (list of channels), sets up a HostLimiter and httpx.AsyncClient
    • branch: synthetic sources are silently dropped; only real‑data channels move forward.
  3. discover resets the list of active directions by calling micro_verticals.resolve_directions(state.get("directions")) and, for each direction, prepares to run Wave 1 — the LLM brainstorm for that micro‑vertical. The per‑direction exclude_domains from the registry (not shown fully in snippets) is merged with the run‑level state.get("exclude_domains").

    • reads: state.get("exclude_domains"), merged with registry per‑vertical exclude list
    • writes: no return yet; the merged exclude list is used per direction
  4. _brainstorm_direction(mv, exclude_domains=..., sub_niche=...) is called for each MicroVertical object mv with the merged exclude list and the sub‑niche string from _direction_sub_niches (if any). It constructs the brainstorm payload with seed_query set to f"micro-vertical:{mv.vertical}", vertical, keywords from mv.keyword_signals, geography hard‑coded to "remote", and the exclude_domains and sub_niche.

    • reads: mv.vertical, mv.keyword_signals, mv.sub_niches (via earlier cursor), exclude_domains parameter
    • writes: returns a list of candidate dicts with keys name, domain, why, why_in_vertical, confidence, evidence, vertical
  5. Inside _brainstorm_direction, the actual LLM call goes to the external brainstorm function; a failed call (exception) logs a warning and returns an empty list.

    • branch: on any exception from brainstorm, the function returns [] — the entire direction yields no candidates.
  6. For each candidate returned by brainstorm, _brainstorm_direction canonicalises the domain via blocklist.canonicalize_domain and discards any candidate whose domain is empty or lacks a dot; surviving candidates are appended to the result list with the fields renamed to match the internal format (e.g., why_in_vertical is populated from either why_in_vertical or why in the raw response).

    • reads: raw candidate fields domain, name, why_in_vertical, why, confidence, evidence
    • writes: result list of cleaned dicts
    • branch: a candidate with an invalid domain is silently dropped.
  7. Back in discover, the output of all _brainstorm_direction calls for the active directions is collected into the local brainstorm list (Wave 1 result).

    • writes: local brainstorm list (list of dicts)
  8. After all wave 1 calls complete, discover continues with subsequent waves (Common Crawl, launchfeed — not detailed in this part), then calls _persist_seed_candidates(state) which writes the deduplicated candidates to the companies table.

    • reads: state (including brainstorm field)
    • writes: seed_inserted_ids, _seed_existing, seed_blocked
    • branch: if _persist_seed_candidates raises a D1Error, discover returns {"_error": str(e)} immediately instead of the normal result.
  9. In the normal path, the _persist_seed_candidates step grounds the write: it refuses LLM‑invented (synthetic) rows — only real‑data sources (launchfeed, Common Crawl) are allowed; the brainstorm channel (synthetic) is stripped, meaning the brainstorm list from Wave 1 is not persisted. (This is the guard described in the “Grounding guard” comment.)

  10. discover returns the final dict containing the seed_inserted_ids, _seed_existing, seed_blocked, and an agent_timings map; the terminal step is the return of that dict to the graph runner.

    • reads: internal results from persistence
    • writes: final return dict with _error (on failure) or normal keys.

Control flow summary: The request loops (fans out) over the list of directions in step 4, each direction being processed independently via _brainstorm_direction. The happy path proceeds through all branches that yield valid candidates, the deduplication (via exclude_domains in the brainstorm payload) happening per‑direction and later at persistence. Synthetic brainstorm results are ultimately stripped before writing to the company database.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The discovery subsystem begins with plan_targets, which resolves the active micro-verticals and, for each, reads and advances a per-vertical sub‑niche cursor stored in the discovery_cursors table. That cursor selects a single sub‑niche from the vertical’s sub_niches tuple, rotating the framing on each tick to force the LLM to explore fresh pockets of the vertical. After that, per direction, _brainstorm_direction is called with a seed query, keyword signals, geography, and the current sub_niche plus an exclude_domains list built from already-known companies. It returns a list of candidate dictionaries containing name, domain, why_in_vertical, confidence, and evidence. Those candidates are then handed to _persist_seed_candidates. If that D1 operation raises a D1Error, the node immediately returns a dict with an "_error" key holding the string representation of the exception, along with agent_timings; otherwise the normal state is returned and the pipeline continues.

The central invariant is the “Grounding guard” — a design rule that the write boundary (the companies table) must refuse LLM‑invented (synthetic) rows. Only real‑data sources such as launchfeed and CommonCrawl pass through untouched; the brainstorm channel is the sole synthetic path and its output must never directly enter the table without additional verification or a separate staging layer. This guarantees that scoring and outreach decisions always operate on validated, non‑hallucinated company records, even though the brainstorm is used to generate candidate suggestions.

The key trade‑off is the choice of one parameterised graph over several separate workers. The obvious alternative would be a dedicated worker per micro‑vertical, each with its own logic. Instead, the entire discovery flow is driven by a frozen configuration table (MicroVertical dataclasses) that differs only in seed query, keyword signals, and sub‑niche list. This eliminates per‑vertical code duplication and the maintenance cost of keeping multiple near‑identical workers in sync. The rejected alternative would also introduce deployment overhead (more workers to monitor, more DDL schemas) and amplify the blast radius of any bug—a single parameterised graph can be fixed in one place.

A concrete failure mode occurs when _persist_seed_candidates encounters a database constraint violation (e.g., duplicate domain or invalid vertical tag) and raises a D1Error. In that case the node returns {"_error": str(exc), "agent_timings": ...}. An operator monitoring the discovery logs would see a warning line prefixed with the direction name and the textual error message from the D1 runtime—no further candidate insertion is attempted for that tick, and the state proceeds without the seed candidates. The pipeline does not retry; it moves on to the next direction or ends, relying on the next scheduled tick to attempt those candidates again (if they survive deduplication).

Cost & performance — the real knobs

Concurrency

  • Knobconcurrency input (overrides DEFAULT_CONCURRENCY = 6).
  • Bounds — Global parallelism cap, shared across all hosts and channels.
  • Effect — Higher values lower wall-clock time by running more fetches in parallel, but increase load on external sources and the local connection pool.
  • Risk — Too high triggers rate‑limiting, timeouts, or connection exhaustion; too low serializes work, raising latency.

Per‑host limits

  • Knobper_host input (overrides DEFAULT_PER_HOST = {"commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2}).
  • Bounds – Per‑source fan‑out; prevents one throttled host from starving others or causing retry storms.
  • Effect — Raising a host’s cap allows it to absorb more concurrent requests, reducing that source’s share of wall‑clock time; lowering the cap conserves its capacity but delays its completion.
  • Risk — Too high on a rate‑limited host (e.g. Common Crawl) causes retry storms and wasted time; too low serializes that channel, increasing overall latency.

Common Crawl max domains

  • Knobcc_max input (overrides constant CC_MAX_DOMAINS = 40).
  • Bounds — Number of CDX lookups and page extractions per run.
  • Effect — Higher values harvest more candidate domains from CC, increasing both discovery breadth and runtime; lower values reduce bandwidth and processing cost.
  • Risk — Too high may hit CC quotas or slow the run; too low misses potential leads.

Sources

  • Knobsources input (overrides DEFAULT_SOURCES = ("launchfeed",); brainstorm is always stripped by _resolve_sources).
  • Bounds — Which discovery channels are enabled: launchfeed, commoncrawl, and optionally seed_query (but not brainstorm).
  • Effect — Adding channels increases the candidate pool and runtime proportionally; removing channels reduces data volume and cost.
  • Risk — Enabling brainstorm is futile (synthetic data is stripped at persist), wasting cycles; excluding real channels (e.g. launchfeed) may starve the pipeline of recent signals.
Failure modes — what breaks, what catches it

LLM API Exception

  • Trigger — Network timeout, rate limit exhaustion, or service unavailability during the call to brainstorm.
  • Guard — The except Exception as exc clause in _brainstorm_direction.
  • Posture — Fail-soft: the function returns [] and does not propagate the exception upward.
  • Operator signal — A warning log line: log.warning("brainstorm failed direction=%s: %r", mv.vertical, exc).
  • Recovery — No retry or backoff is implemented; the direction produces zero candidates, and the discovery graph continues with other directions if any.

LLM Returns Empty or Missing Candidates

  • Trigger — The brainstorm call succeeds (no exception) but the returned dictionary either lacks a candidates key, or its value is None or an empty list.
  • Guardout.get("candidates") or [] — no exception handler is engaged; the code silently treats the absence as an empty list.
  • Posture — Fail-soft: degrades without any log or error signal.
  • Operator signal — No log line is emitted; the operator would see only that the direction yielded no candidate records.
  • Recovery — The function returns []; downstream steps (dedupe, scoring) receive nothing for this direction.

Candidate Lacking a Valid Domain

  • Trigger — A candidate dict contains no domain field, an empty string, or a string without a . character (e.g. "example").
  • Guard — The inline check if not dom or "." not in dom: continue after calling blocklist.canonicalize_domain(...).
  • Posture — Fail-soft: the invalid candidate is silently skipped; the loop proceeds to the next candidate.
  • Operator signal — No log or metric; the candidate is simply absent from the returned list.
  • Recovery — No retry; the remaining candidates are processed and returned.

Missing exclude_domains Input Leading to Duplicate Suggestions

  • Trigger — The caller does not supply an exclude_domains argument (or supplies an empty list) while some of the potential candidates already exist in the database.
  • Guard — No guard exists inside _brainstorm_direction; the exclude_domains parameter defaults to [] when None. The later persist step for the seed‑query path uses ON CONFLICT DO NOTHING (in _persist_seed_candidates), but the brainstorm channel itself is not persisted.
  • Posture — Fail-soft for the brainstorm channel: duplicates are returned but are stripped later. For the seed‑query path, duplicates are silently ignored at insert time.
  • Operator signal — No immediate signal; the operator may notice that fewer new companies are inserted than expected.
  • Recovery — No retry; the duplicate companies are not persisted, and the discovery continues normally.

Blocklist Canonicalization Exception (e.g. Malformed Input)

  • Triggerblocklist.canonicalize_domain raises an exception (unexpected) while processing a candidate’s domain string.
  • Guard — The outer except Exception as exc catches any exception from inside the for c in cands loop, including this one.
  • Posture — Fail-soft for the entire direction: the function returns [] and logs a warning.
  • Operator signal — Same warning log: log.warning("brainstorm failed direction=%s: %r", mv.vertical, exc).
  • Recovery — No retry; all candidates for that direction are discarded, and the graph continues with other directions.
Interview — could you explain it?

Q — Walk me through the two different discovery channels that involve an LLM. How does the system decide which one to use and what makes each one distinct?
A — The sanctioned, persist‑eligible path is the seed_query channel, identified by the constant SEED_SOURCE = "seed_query". It runs through expand_seedbrainstormdedupepre_score and is driven by the lead‑gen pipeline via {seed_query, industry}. In contrast, the legacy brainstorm channel is listed in _SYNTHETIC_SOURCES = frozenset({"brainstorm"}) and is hard‑stripped in the discover function so it cannot be re‑enabled per run — it is treated as synthetic data that invents companies.
Follow-up — How does the code guarantee that the synthetic brainstorm channel never leaks into the output?
Answer — It is explicitly stripped in the discover function and the persist logic; the module docstring states it is “hard‑stripped in discover” and the _SYNTHETIC_SOURCES frozenset is used to filter it out.
Weak answer misses — The SEED_SOURCE constant and the _SYNTHETIC_SOURCES frozenset, as well as the “hard‑stripped” comment in the source.


Q — The AI brainstorm is supposed to return up to twenty candidates, but the same vertical is called over multiple ticks. How does the system avoid mining the same well‑known companies each time?
A — The plan_targets function uses a per‑vertical sub_niche cursor (_read_and_advance_cursor) to rotate through MicroVertical.sub_niches one per tick. The chosen sub_niche is stored in direction_sub_niches[direction] and passed to _brainstorm_direction as the sub_niche parameter, which changes the framing so the LLM explores a fresh pocket. If the caller already pinned a sub_niche, the cursor logic is skipped.
Follow-up — What happens if a MicroVertical has no sub_niches defined?
Answer — The plan_targets function checks if not niches: continue, so that direction gets no sub_niche, and the brainstorm uses only the seed_query/keyword_signals framing.
Weak answer misses — The exact cursor mechanics (_read_and_advance_cursor, direction_sub_niches dict) and the sub_niches field on MicroVertical.


Q (design question) — Why does the code treat the seed_query brainstorm as a non‑synthetic, persist‑eligible source, while the legacy brainstorm channel is classified as synthetic and forbidden? They both call an LLM.
A — The seed_query channel is anchored to a real user query and passes through expand_seed (LLM facet extraction) before the brainstorm; it is explicitly called “the sanctioned, persist‑eligible company‑discovery brainstorm path” (docstring of SEED_SOURCE). The legacy brainstorm channel, however, has no seed query from a user — it is a generic synthetic micro‑vertical generator and is therefore listed in _SYNTHETIC_SOURCES and hard‑stripped so it “cannot be re‑enabled per run” (comment in the module).
Follow-up — How does the _resolve_sources function distinguish them?
Answer_resolve_sources returns a set of active sources; the code then checks whether SEED_SOURCE is in that set to run expand_seed, while the brainstorm source is always filtered out by _SYNTHETIC_SOURCES.
Weak answer misses — The existence of _SYNTHETIC_SOURCES as a frozenset and the module‑level comment about the brainstorm channel being “hard‑stripped”.


Q — The user‑facing description says the AI returns a confidence score. Where in the source is that score actually captured and passed forward?
A — In the _brainstorm_direction function, after calling the brainstorm LLM, each candidate dict is built with a "confidence" key taken directly from the LLM output field c.get("confidence"). That dict is appended to the result list and eventually stored as part of the discovery state.
Follow-up — What happens if the LLM call raises an exception during _brainstorm_direction?
Answer — The function catches all exceptions with except Exception as exc and returns an empty list [], failing soft so the whole discovery wave isn’t blocked.
Weak answer misses — The exact "confidence" key assignment and the fail‑soft except block in _brainstorm_direction.


Q (tough) — When multiple real sources (commoncrawl, launchfeed) are active, how does the system control concurrency to avoid overwhelming hosts or starving slower channels?
A — A shared httpx.AsyncClient is injected into every fetcher, and a HostLimiter gates the fan‑out with a global cap (DEFAULT_CONCURRENCY = 6) plus per‑host caps defined in DEFAULT_PER_HOST (e.g., "commoncrawl": 6, "launchfeed": 2). Per‑host caps “stop one throttling host from causing retry storms or starving the others.” All caps are overridable via the run input.
Follow-up — Why is the launchfeed cap set lower than commoncrawl’s?
Answer — The comment says launchfeed caps are 2 (while commoncrawl is 6), likely because launchfeed endpoints are more sensitive to load, but that specific rationale is not further explained in the snippet.
Weak answer misses — The HostLimiter class (implicit), DEFAULT_CONCURRENCY, and DEFAULT_PER_HOST dictionary with per‑host values.

02. Only Real Companies

Every candidate company has to be real and checkable. The pipeline never saves a name that a language model dreamed up. Instead it leans on three live data streams. The first is an applicant tracking system, which we will call A T S, and it lists actual job openings. The second is a launch feed that follows brand-new startups. The third is a large web crawl. All three surface genuine, operating businesses.

The system can still run a brainstorm step, where a model proposes possible names. That path is treated as synthetic, so its output is not trusted on its own. At the database write boundary, a guard removes every candidate that came from brainstorming. Only names that arrived through the tracking system, the launch feed, or the web crawl are kept. Here is the honest trade-off. We may miss a promising company that has not yet shown up in any of those feeds. In return, every lead we store is anchored in real hiring activity, never in a lucky guess.

Guard filtering brainstorm candidates: only domains with a dot pass.

python
async def _brainstorm_direction(
    mv: "micro_verticals.MicroVertical",
    *,
    exclude_domains: list[str] | None = None,
    sub_niche: str | None = None,
) -> list[dict[str, Any]]:
    # … (brainstorm call elided)
    result: list[dict[str, Any]] = []
    for c in cands:
        dom = blocklist.canonicalize_domain(str(c.get("domain") or ""))
        # Guard: require a real domain with a dot, skip synthetic names
        if not dom or "." not in dom:
            continue
        result.append({
            "name": c.get("name"),
            "domain": dom,
            "why": c.get("why_in_vertical") or c.get("why"),
            "why_in_vertical": c.get("why_in_vertical") or c.get("why"),
            "confidence": c.get("confidence"),
            "evidence": c.get("evidence"),
            "vertical": mv.vertical,
        })
    return result
ELI5 — the plain-language version

Imagine a museum curator who gets leads from explorers' stories but only adds an artifact to the collection after verifying it matches known patterns and isn't already in the vault. This subsystem is designed to find real, operating businesses for a sales outreach pipeline, ensuring no made-up names from a language model ever get saved.

The pipeline starts with a brainstorm step where a model suggests company names based on a seed query and keyword signals—like an explorer's tip. But these suggestions are treated as synthetic: the system doesn't trust them blindly. Each candidate is run through a filter that checks against a list of keyword signals (must-haves for the vertical) and negative signals (automatic exclusions). The _brainstorm_direction function extracts each company, validates its domain with canonicalize_domain, and discards any without a real dot-com structure. An exclusion list of already-known domains prevents duplicates. Only after passing these checks do the candidates proceed toward the database write.

The trickiest guard is sub-niche rotation. Instead of reusing the same brainstorm prompt each time, the system cycles through a set of sub-niches stored in the MicroVertical dataclass. A cursor in the discovery_cursors table tracks which sub-niche to use next, advancing and wrapping automatically. This forces the model to explore fresh pockets of the vertical, avoiding the same well-known names on every run. Without this subsystem, a single brainstorm would repeatedly return the same few companies or, worse, hallucinate nonexistent domains—leaving the pipeline with fake leads, wasted outreach, and a database full of fantasy.

Data flow — one request, in order
  1. _brainstorm_direction is invoked with mv (MicroVertical), exclude_domains, sub_niche.

    • reads / writes: consumes mv.vertical, mv.keyword_signals, exclude_domains, sub_niche; no writes yet.
    • branch: happy path proceeds to call brainstorm; failure path: if any exception is raised in the outer try, returns [].
  2. Inside the try, calls brainstorm with a dict containing seed_query (formatted as f"micro-vertical:{mv.vertical}"), vertical, keywords, geography, exclude_domains, sub_niche.

    • reads / writes: consumes the dict; writes the return value into out.
    • branch: if brainstorm raises an exception, the outer catch returns []; happy path continues.
  3. out.get("candidates") retrieves the list of candidate dicts.

    • reads: out key "candidates".
    • branch: if cands is None or empty, the for-loop does nothing and the function returns [].
  4. For each candidate c in cands, blocklist.canonicalize_domain(str(c.get("domain") or "")) extracts a canonical dom.

    • reads: c["domain"].
    • branch: if dom is falsy or "." not in dom, that candidate is skipped (the domain-dot guard).
  5. For a kept candidate, a dict is appended to result with keys name, domain, why, why_in_vertical, confidence, evidence, vertical (where vertical comes from mv.vertical).

    • writes: appends to result list.
  6. Returns the result list from _brainstorm_direction.

    • writes: the return value.
  7. The returned list becomes the "candidates" value in the discovery graph state. The dedupe node reads state.get("candidates").

    • reads: state["candidates"].
    • branch: if the list is empty, returns {"filtered": [], "skipped_existing": 0} and stops.
  8. dedupe extracts a flat list of domains from each candidate dict and queries D1 with in_clause to find existing canonical_domain values in the companies table.

    • reads: each candidate’s "domain" key.
    • writes: populates existing set from database rows.
    • branch: on D1Error, returns {"_error": f"dedupe: {e}"}; happy path continues.
  9. dedupe filters kept = [c for c in candidates if c["domain"] not in existing] and returns {"filtered": kept, "skipped_existing": count}.

    • writes: "filtered" list and "skipped_existing" count into the return dict.
  10. pre_score node reads state.get("filtered"), state.get("vertical"), state.get("sub_niche").

    • reads: "filtered", "vertical", "sub_niche".
    • branch: if "filtered" is empty, the function likely returns an empty scored list early (not shown in snippet but implied).
  11. After scoring, the state holds "scored" candidates. The persist node (referenced as _persist_seed_candidates in the seed-query path; for brainstorm the persist is different) reads state.get("scored") and also state.get("brainstorm").

    • reads: "scored", "brainstorm".
    • branch: because "brainstorm" is non‑empty (the synthetic channel), the grounding guard in persist refuses to write these rows — no INSERT occurs.
  12. The persist function returns (inserted_ids, existing_ids, blocked_skipped). For the brainstorm path, inserted_ids is empty, indicating no synthetic company reached the database. This is the terminal step of the subsystem.

    • writes: the return tuple with zero inserted IDs.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The subsystem enforces a strict write boundary that prevents any LLM-invented company from entering the companies table. The ordered mechanism begins with the brainstorming step: for a given micro‑vertical, _brainstorm_direction calls the LLM, then each returned candidate undergoes a domain‑validity check via blocklist.canonicalize_domain. If the domain string is empty or lacks a ., the candidate is discarded immediately. Surviving candidates transition to the persist node, which first applies a blocklist filter (dropping domains on the explicit blocklist) and then performs an INSERT with ON CONFLICT DO NOTHING. The persist function returns (inserted_ids, existing_ids, blocked_skipped); failures from a D1Error surface as _error in the state, while the upstream discover node’s guard (the “Grounding guard”) independently strips any rows that originated from the synthetic brainstorm channel.

The invariant preserved is the write boundary that refuses LLM‑invented rows. The source explicitly calls this the “Grounding guard”: the companies table feeds scoring and outreach decisions, so the persistence layer itself must reject synthetic data (“never rely solely on the upstream discover strip”). The design identifies _SYNTHETIC_SOURCES = frozenset({"brainstorm"}) as the only synthetic channel; launchfeed and commoncrawl are real‑data sources that pass through untouched, while the sanctioned seed_query path (ex‑company_discovery) is also allowed because it follows the same domain‑validity and blocklist checks. No magic field or heuristic is trusted—only the channel identity and the physical domain‑dot test.

The key trade‑off is sacrificing potential coverage from LLM‑hallucinated leads to avoid polluting the scoring and outreach pipeline with fabricated companies. The obvious alternative is to trust the LLM output and insert every candidate that appears credible, relying on later human review or downstream validation. That rejection avoids the cost of false positives propagating into sales decisions (e.g., sending outreach to a non‑existent company), which would waste human effort and possibly damage partner trust. The cost of the chosen design is that some genuine but very new or obscure companies whose domain lacks a dot or who are mistakenly blocklisted will be lost—but this is deemed acceptable because the launchfeed and commoncrawl channels provide a real‑signal baseline.

A concrete failure mode occurs when the LLM returns a candidate with an empty or malformed domain (e.g., null or "company"). In _brainstorm_direction, the line if not dom or "." not in dom: continue silently drops the candidate. The operator sees a log warning like "brainstorm failed direction=<vertical>: <exception>" only if the entire LLM call fails; for a single invalid domain, the candidate simply vanishes from the returned list with no explicit error. The downstream persist node then sees no scored items for that direction and returns empty inserts, so no new company appears. An operator monitoring inserted_ids counts would notice zero inserts for a direction that normally yields candidates, and could inspect the brainstorm logs to confirm the LLM provided a domain without a dot.

Cost & performance — the real knobs

DEFAULT_CONCURRENCY — global cap of 6 concurrent connections across all hosts.

  • Knob: DEFAULT_CONCURRENCY = 6 (overridable per run via state["concurrency"]).
  • Bounds: Total parallel HTTP requests issued at once.
  • Effect: Raising it increases wall-clock parallelism (faster discovery) but also multiplies simultaneous calls to LLM APIs, Common Crawl CDX, and launchfeed endpoints, raising peak network and API‑cost burn rate. Lowering it throttles throughput, protecting against rate limits and reducing concurrency‑driven expense spikes.
  • Risk: Too high → easy rate‑limit triggers, connection‑pool exhaustion, or retry storms when a host slows; too low → discovery becomes sequential and latency multiplies across sources.

DEFAULT_PER_HOST — per‑host concurrency caps that constrain each source independently.

  • Knob: DEFAULT_PER_HOST: dict[str, int] with values "commoncrawl":6, "llm":4, "d1":6, "launchfeed":2 (overridable via state["per_host"]).
  • Bounds: Fan‑out per individual host/endpoint; prevents one throttling service (e.g. a slow LLM) from starving others.
  • Effect: Lowering a host’s cap reduces its share of the total concurrency, smoothing response latency for that host at the cost of slower per‑host throughput. Raising it lets that host drive more parallelism, potentially burning through its quota or triggering server‑side limits faster.
  • Risk: Set too high per‑host → that source becomes a bottleneck or gets blacklisted; too low → the whole discovery stalls waiting for a low‑concurrency host to finish.

CC_MAX_DOMAINS — ceiling on Common Crawl domain lookups per run.

  • Knob: CC_MAX_DOMAINS = 40 (overridable via state["cc_max"]).
  • Bounds: Maximum number of unique domains that will be queried against Common Crawl’s CDX index and cached pages.
  • Effect: Raising it increases the volume of crawled data (more org facts/contacts) and extends total wall‑clock time, but also drives up CDX request volume and D1 write costs. Lowering it restricts the crawl to fewer domains, reducing both latency and expense.
  • Risk: Too high → excessive D1 writes and CDX requests may push past quota or cause timeouts; too low → many real‑world signals are missed.

httpx client timeout (20.0 seconds) — per‑request timeout for all discovery HTTP calls.

  • Knob: httpx.AsyncClient(timeout=20.0, ...) in discover(); overridable only by modifying the code (no run‑input override shown).
  • Bounds: Maximum wait for a single response from any source (LLM API, CDX, launchfeed endpoint).
  • Effect: Shortening it drops slow responses faster, reducing stall latency but increasing retries (and hence costs) if the service is legitimately slow. Lengthening it tolerates more latency, improving success rate at the expense of longer tail waits.
  • Risk: Too short → reliable sources with occasional spikes get aggressively dropped, losing data; too long → a single slow host can hold the entire concurrency slot hostage, delaying discovery completion.

_LAUNCH_LOOKBACK_DAYS (90) — window for surfacing recently launched companies from launch‑feed.

  • Knob: _LAUNCH_LOOKBACK_DAYS = 90 (constant in discovery_graph.py).
  • Bounds: How many days backward from “now” to consider a launch as “recent”.
  • Effect: Increasing it expands the candidate pool (more launches to evaluate), raising the number of D1 inserts and LLM pre‑scoring calls — both time‑ and cost‑wise. Decreasing it narrows the window, reducing volume but potentially missing valuable new entrants.
  • Risk: Too large → stale or irrelevant launches flood the pipeline, wasting scoring budget; too small → true recent signals are omitted.

_LAUNCH_SIGNAL_DECAY_DAYS (180) — decay horizon for product‑launch provenance signals.

  • Knob: _LAUNCH_SIGNAL_DECAY_DAYS = 180 (constant).
  • Bounds: How long a “product_launch” signal remains weighted in company scoring/outreach decisions.
  • Effect: Extending it keeps older launches scoring higher for longer, increasing the effective candidate set (more time spent on pre‑scoring and outreach). Shortening it discards weaker signals sooner, reducing downstream work but possibly aging out companies that still have present‑day relevance.
  • Risk: Too long → stale launches consume scoring/outreach budget; too short → valuable leads from a product launch plateau are dropped prematurely.
Failure modes — what breaks, what catches it

LLM Brainstorm Unavailable

  • Trigger — The _brainstorm_direction function calls brainstorm() and that call raises any exception (e.g. missing API key, network timeout, rate limit).
  • Guard — The bare except Exception clause (captured as exc) inside _brainstorm_direction.
  • PostureFail‑soft — the function returns [] and the discovery run continues with zero candidates from that direction.
  • Operator signal — A log.warning line: "brainstorm failed direction=%s: %r" with the vertical name and the exception repr.
  • Recovery — No retry; the run proceeds with an empty candidate list for that vertical. The operator can later re‑trigger the run manually.

LLM‑Invented Domain Passes Dot‑Check

  • Trigger — The seed_query path’s brainstorm() returns a candidate whose domain contains a dot (e.g. "invented-company.io") but does not correspond to a real company.
  • Guard — Only the dot‑presence test in _brainstorm_direction: if not dom or "." not in dom: continue. There is no further existence check (DNS, live fetch) in the persist chain for the seed_query source.
  • PostureFail‑open — the fake domain passes validation and enters the database; no guard refuses the write.
  • Operator signal — No error or log; the company appears silently in the companies table.
  • Recovery — Manual review of discovered companies and deletion of any synthetic rows; no automatic remediation.

Dedupe D1Error

  • Trigger — The dedupe node runs a SELECT DISTINCT canonical_domain FROM companies query and the D1 database raises a D1Error (e.g. transient timeout, connectivity loss).
  • Guard — The except D1Error as e: clause inside dedupe.
  • PostureFail‑hard — the function returns {"_error": f"dedupe: {e}"}, which sets state._error; subsequent nodes (pre_score, persist) see the error and return early, aborting the run.
  • Operator signal — The _error field in the returned state (visible in logs or the run’s final status).
  • Recovery — The run terminates; the operator must retry the entire discovery run. No automatic retry or backoff is implemented.

Persist Seed Candidates D1Error

  • Trigger_persist_seed_candidates executes an INSERT ON CONFLICT DO NOTHING and the D1 database throws a D1Error (e.g. constraint violation unrelated to ON CONFLICT, or storage failure).
  • Guard — The except D1Error as e: clause inside the persist‑candidates block.
  • PostureFail‑hard — the error is returned as {"_error": str(e), "agent_timings": {"persist": ...}} and the run stops.
  • Operator signal — The _error field in the state, with the database error message.
  • Recovery — Manual inspection of the database and re‑run; no automatic retry.

Blocklist Bypass for LLM Brainstorm Path

  • Trigger — The _brainstorm_direction function (used in the legacy micro‑vertical brainstorm channel) returns a domain that is blocklisted (e.g. a known spam or competitor domain).
  • Guard — No blocklist check exists in the LLM brainstorm path. The comment in _persist_seed_candidates explicitly states “the LLM brainstorm path bypasses the blocklist that pipeline_graph.run_discover applies on the explicit path.”
  • PostureFail‑open — the blocklisted domain is allowed through to the database if the run does not also strip the brainstorm source (though _SYNTHETIC_SOURCES hard‑strips brainstorm in the consolidated graph, so this failure only applies if the run incorrectly re‑enables brainstorm or if the seed_query path’s brainstorm is not covered).
  • Operator signal — No signal; the blocklisted domain appears without any warning or filtering.
  • Recovery — Manual audit of discovered companies and removal of any blocklisted entries; adjustments to the run configuration to ensure brainstorm source stays stripped.
Interview — could you explain it?

Q — How does the system guarantee that only real-world companies are written to the database, especially when the discovery step uses an LLM to brainstorm candidates?
A — Two hardened gates enforce this. First, inside _brainstorm_direction, every LLM‑generated candidate is run through blocklist.canonicalize_domain and then dropped if the result lacks a dot (e.g. if not dom or "." not in dom). Second, at the persist boundary, any row whose source is the synthetic channel (_SYNTHETIC_SOURCES = frozenset({"brainstorm"})) is refused — the comment says “the write boundary itself must refuse LLM‑invented (synthetic) rows — never rely solely on the upstream discover strip”.
Follow-up — Why have two separate guards when one domain‑check in _brainstorm_direction appears sufficient?
A — Defense in depth: the domain check catches missing or malformed domains early, but a determined LLM could still produce a domain field that passes the dot test (e.g. “example.com”) yet corresponds to a fully invented company. The second guard at persist, using _SYNTHETIC_SOURCES, ensures that even such crafted rows are never committed because the entire brainstorm channel is classified as synthetic and hard‑stripped.
Weak answer misses — That the persist guard checks the source tag (_SYNTHETIC_SOURCES) not the domain content, and that the seed‑query path (tagged SEED_SOURCE) is explicitly exempt from this synthetic strip.


Q — The system has two brainstorm‑like signals: the legacy micro‑vertical brainstorm and the seed‑query brainstorm. Why is one considered synthetic and stripped from the database while the other is persist‑eligible?
A — The context separates them by source classification: the legacy brainstorm channel (in _SYNTHETIC_SOURCES) is hard‑stripped because “it invents candidate companies (synthetic data)”. In contrast, the seed‑query channel (SEED_SOURCE) is “the sanctioned, persist‑eligible company‑discovery brainstorm path” — it is driven by a user‑supplied seed_query and goes through expand_seed (LLM facet extraction) before the brainstorm, making it grounded in a real query rather than a purely synthetic generation.
Follow-up — If both use an LLM to produce company names, how does the seed‑query path avoid inventing fake companies?
A — The seed‑query path still uses LLM brainstorm, but its output passes through dedupe and pre_score nodes, and more importantly the write boundary does not strip it because its source tag is SEED_SOURCE, not brainstorm. The design trusts that the seed‑query invocation, combined with real‑world deduplication and scoring, produces candidates that correspond to actual companies the user is looking for.
Weak answer misses — That the seed‑query path’s brainstorm is still an LLM call; the difference is purely the source tag and the additional validation nodes in the graph, not an inherent guarantee of reality.


Q — How does the system prevent duplicate companies from being added when the discovery graph runs multiple ticks for the same micro‑vertical?
A — Two mechanisms work together. First, each call to _brainstorm_direction accepts an exclude_domains list — the context says it “tells the model which companies we already have so it stops re‑suggesting them”. Second, the seed‑query graph includes a dedicated dedupe node (listed in the seed‑query path: “expand_seed → brainstorm → dedupe → pre_score → persist”) that removes any candidates whose domain already exists in the database.
Follow-up — What happens if the LLM ignores the exclude_domains hint and suggests a duplicate anyway?
A — The dedupe node (in the seed‑query path) and the domain‑canonicalization step (blocklist.canonicalize_domain) act as a hard filter; even if the LLM returns a known domain, it will be identified and dropped before persist. The context also shows that _brainstorm_direction itself does not remove duplicates after the LLM response, so the downstream dedupe node is essential.
Weak answer misses — That exclude_domains is only a prompt‑level hint with no enforcement; the actual deduplication relies on the explicit dedupe node and domain canonicalization.


Q — Why does the _brainstorm_direction function not perform deduplication internally, instead relying on the upstream exclude_domains hint and a separate dedupe node?
A — This is a deliberate separation of concerns. _brainstorm_direction is a “fail‑soft” LLM call meant to be best‑effort; it returns raw candidate suggestions. Deduplication is a cross‑source, data‑integrity concern that belongs in a dedicated graph node (dedupe), so it applies uniformly to all channels (seed‑query, commoncrawl, launchfeed) and can be improved without touching the LLM interaction. The context shows the seed‑query path explicitly includes a dedupe step after the brainstorm.
Follow-up — But doesn’t that mean a deduplication failure in _brainstorm_direction could waste LLM tokens on known companies?
A — Yes, it wastes tokens, but the design accepts that inefficiency for correctness. The exclude_domains parameter reduces redundancy for free, and the LLM cost is considered acceptable compared to the risk of accidentally dropping a real new candidate or introducing a duplicate due to a flawed dedup inside the brainstorm function.
Weak answer misses — That _brainstorm_direction is explicitly annotated as “best‑effort” and “fail‑soft” (except Exception: return []), so it is intentionally simple and leaves robust operations to the graph pipeline.


Q — (Design) Why does the system define DEFAULT_SOURCES = ("launchfeed",) yet still include a commoncrawl channel that is real‑data but opt‑in? Why not make all real sources active by default?
A — The context explains that “discovery decisions (scoring/outreach) must rest on companies sourced from real signals”. Launchfeed (YC + ProductHunt) provides immediately actionable, recently‑launched candidates with high relevance for B2B lead gen. Common Crawl, while real, is “opt‑in only” — likely because its CDX queries are more expensive and yield lower‑quality or older domains (the context mentions CC_MAX_DOMAINS = 40). Defaulting to the highest‑signal, lowest‑overhead source avoids overwhelming the pipeline while keeping richer sources available.
Follow-up — How does the multi‑source engine prevent a slow Common Crawl request from blocking the faster launchfeed results?
A — The concurrency model runs all active channels under a single asyncio.gather per wave, so wall‑clock time is “≈ slowest host, not the sum”. Additionally, per‑host caps (DEFAULT_PER_HOST includes "commoncrawl": 6, "launchfeed": 2) prevent one throttling host from stalling others.
Weak answer misses — That the default source choice is a deliberate trade‑off between signal quality and operational cost, not a simple technical limitation; and that per‑host caps (DEFAULT_PER_HOST) are separate from the global concurrency limit.

03. The Verifiable Sources

Discovery draws on two kinds of sources, and you can verify both yourself. The first kind is public job boards that companies host on applicant tracking systems. Two common ones are Ashby and Greenhouse. The system keeps a table of short company names, the ones that appear in a board web address. It then polls those boards directly. If a name is wrong or stale, the board just returns an empty page, so nothing breaks. Every company we surface this way has live, open roles you can go and read right now.

The second kind is a seed query route. It starts a language model brainstorm, fed a focus niche and a region. The model returns real companies whose web addresses actually resolve, and an exclusion list stops it from repeating ones we already know. You can take any address it offers and open the job page to confirm it. The trade-off is clear. The board route is slower but certain, while the brainstorm route is fast but needs that verification step. Both paths point at public, live, independently checkable hiring.

The system constructs public, verifiable job board URLs for each applicant tracking system vendor using the company's unique slug.

python
def _job_board_url(vendor: str, slug: str) -> str:
    if vendor == "greenhouse":
        return f"https://boards.greenhouse.io/{slug}"
    if vendor == "lever":
        return f"https://jobs.lever.co/{slug}"
    if vendor == "workable":
        return f"https://apply.workable.com/{slug}"
    if vendor == "rippling":
        return f"https://ats.rippling.com/{slug}"
    if vendor == "smartrecruiters":
        return f"https://jobs.smartrecruiters.com/{slug}"
    if vendor == "teamtailor":
        return f"https://career.teamtailor.com/{slug}"
    return f"https://jobs.ashbyhq.com/{slug}"
ELI5 — the plain-language version

Imagine you're a savvy shopper looking for rare ingredients in a big city. You use two methods: first, you check the store's own signs and shelves directly (if they have a board outside, you read it), and second, you ask a knowledgeable friend who suggests hidden specialty shops based on what you're looking for. This system does exactly that for finding companies—it is built to discover real businesses that are actively hiring in very specific AI fields.

The first method is polling public job boards that companies host on applicant tracking systems like Ashby and Greenhouse. The system keeps a simple table of short company names that appear in board web addresses, then asks those boards directly for open roles. If a name is wrong or stale, the board just returns an empty page—no harm done. The second method is a brainstorm: the system feeds a focus niche (like “AI month‑end close automation”) to a language model, along with keywords and exclusion lists from keyword_signals and exclude_domains. The model suggests real companies it believes fit that niche, and each suggestion includes a domain that gets validated. To keep the suggestions fresh, the system rotates through sub‑niches stored in a discovery_cursors table: a cursor per vertical advances one step each tick, so the friend never re‑mines the same old recommendations.

The trickiest part is that without that rotation, the language model would keep suggesting the same well‑known companies, missing smaller, newer startups. The sub_niche_idx in the D1 table ensures the focus shifts every cycle, exploring every pocket of the vertical over time. Without this entire subsystem, you'd either get stale, repetitive results from the brainstorm or miss companies altogether because the direct board poll alone can't find hidden AI‑focused startups. You'd end up with a shallow list and no confidence that the companies are still actively hiring.

Data flow — one request, in order
  1. expand_seed node
    Reads state["seed_query"], state["vertical"], state["keywords"], and the profile’s default_seed_query.
    Returns {"seed_query": ...} or {"vertical": ..., "geography": ..., "size_band": ..., "keywords": ...}.
    Branch: If SEED_SOURCE is not in the resolved sources set, it returns an empty dict immediately (happy path for multi-source runs). If caller already supplied both vertical and keywords, it skips facet extraction and only returns the seed query. Otherwise it attempts LLM extraction; on exception it returns {"_error": ...}.

  2. plan_targets node
    Reads state (implicitly the resolved sources set) and state.get("directions").
    Returns {"_direction_sub_niches": dict} populated with the next sub‑niche identifier for each direction that has sub‑niches.
    Branch: If no ATS source is active (_resolve_sources(state) & _ATS_PATH_SOURCES is false), it short‑circuits and returns an empty dict – this is the seed‑only happy path. For ATS‑enabled runs, it iterates over directions, reads and advances the cursor per vertical.

  3. brainstorm node (referred to in the comment expand_seed → brainstorm → dedupe → pre_score)
    Iterates over the resolved set of MicroVertical directions. For each direction it calls _brainstorm_direction (the per‑direction implementation).
    Aggregates the returned candidate lists into a single "candidates" key in the state.
    Branch: No early return; empty candidate lists per direction are simply added.

  4. _brainstorm_direction function (called inside the brainstorm loop)
    Reads mv.vertical, mv.keyword_signals, the exclude_domains parameter (list of already‑known domains) and the optional sub_niche string.
    Constructs a payload dict with keys "seed_query", "vertical", "keywords", "geography", "exclude_domains", "sub_niche".
    Returns a list of candidate dicts – each with name, domain, why_in_vertical, confidence, evidence, vertical – after canonicalizing domains.
    Branch: If the external brainstorm call fails (any exception, including missing API key), the function logs a warning and returns an empty list. Only candidates with a non‑empty domain containing a dot are kept; others are skipped.

  5. dedupe node (referenced in the seed‑query path comment)
    Consumes the aggregated candidate list, removes entries whose domain already appears in the exclude_domains set (passed from earlier ticks) or that are duplicates.
    Returns a deduplicated candidate list.
    Branch: If the candidate list is already empty, this node is a no‑op.

  6. pre_score node (referenced in the seed‑query path comment)
    Reads the deduplicated candidates and the MicroVertical’s pre_score_keywords (or falls back to the profile’s score_tiers). For each candidate, it computes a confidence score based on matched keywords.
    Returns candidates with a "confidence" field (float between 0 and 1).
    Branch: If a candidate has no matched keywords and no fallback tier, its confidence may remain low; no early return.

  7. _persist_seed_candidates function
    Reads state.get("candidates") and the seed_query source flag.
    Inserts each candidate into the companies table, distinguishing new inserts from existing rows and blocked domains.
    Returns a tuple (seed_inserted_ids, _seed_existing, seed_blocked).
    Branch: If a D1Error occurs, it returns {"_error": str(e), "agent_timings": ...} and does not proceed. On success, the inserted IDs are used downstream.

  8. _emit_channel_spans function
    Reads the environment variable OTEL_EXPORTER_OTLP_ENDPOINT; if empty, returns immediately. Otherwise reads by_vertical (a dict of per‑vertical counts), ats_results, cc_results, and sources set.
    Emits one OpenTelemetry span per (channel, vertical) pair with count attributes.
    Branch: This function is called after all source channels have been processed. It is a best‑effort telemetry hook that never blocks the main flow.

  9. Terminal step – state return
    The graph returns a final DiscoveryState containing (at minimum) candidates, _error (if any), agent_timings, and the optional _direction_sub_niches from plan_targets.
    No further node processes the state; the caller (e.g., an HTTP handler) reads the state for downstream use.
    Branch: If any node set _error, the terminal step propagates that error; otherwise the output is the full state with all successful fields.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The discovery system’s “Verifiable Sources” chapter is implemented inside discovery_graph.py through a two-phase pipeline that separates synthetic hypothesis generation from ground-truth ingestion. The ordered mechanism begins with plan_targets(), which reads the configured micro‑vertical directions from micro_verticals.resolve_directions and, for each direction that has sub_niches, advances a per‑vertical cursor stored in the discovery_cursors table (created by _ensure_cursor_table()). This cursor rotation ensures successive ticks explore a fresh sub‑niche framing. Next, for each direction, _brainstorm_direction() is called—it issues a DeepSeek brainstorm seeded with the micro‑vertical’s seed_query, keyword_signals, an optional sub_niche, and a list of already‑known domains to discourage re‑suggestion. The brainstorm returns candidate companies; any that pass domain canonicalization are collected. Finally, _persist_seed_candidates() writes the brainstorm results into the companies table. Critically, the actual verifiable sources (ATS job boards, Common Crawl, launch feeds) are not produced by the brainstorm node; they flow through a separate discover path that is referenced only indirectly. The code states that launchfeed and commoncrawl are “real‑data sources and pass through untouched,” whereas the brainstorm channel is explicitly tagged as “the only synthetic channel” and must be scrutinized at the write boundary.

The central invariant is a write boundary called the “grounding guard” in the source: the companies table—which feeds scoring and outreach—must “refuse LLM‑invented (synthetic) rows.” The guard is enforced not by the brainstorm node itself but by the persist node (_persist_seed_candidates), which is the only place where synthetic data can enter. Real‑data sources bypass this check because they carry inherent verifiability: anyone can independently replay the ATS job‑page request using the company’s slug, or inspect the Common Crawl snapshot. This guarantee means that the downstream scoring and outreach system never acts on fictional entities, even if the LLM hallucinates plausible‑sounding companies.

The key trade‑off is that the system rejects using only verifiable sources for discovery, and instead intentionally leans on an LLM brainstorm to generate candidate ideas. The obvious alternative would be to restrict discovery exclusively to known real‑data feeds (ATS, Common Crawl, launch feeds)—which would guarantee zero synthetic rows. That alternative is rejected because it would be slow to surface emerging micro‑verticals or novel niches that no public feed yet indexes. By using the LLM to explore fresh applied pockets (rotated via sub_niches), the system gains breadth, but at the cost of introducing a synthetic channel that requires the grounding guard. The cost that this rejection avoids is the missed‑opportunity cost of never discovering untracked segments—a cost that would grow as the taxonomy evolves. The trade‑off is managed by the explicit write boundary: synthetic candidates are allowed in, but only the persist node decides whether to actually store them, using the same guard that rejects obviously fake rows.

A concrete failure mode occurs when the DeepSeek API is unavailable (e.g., no key set in a dry run). In that case, _brainstorm_direction() catches the exception with except Exception as exc and logs a warning: "brainstorm failed direction=%s: %r" with the mv.vertical tag. The function returns an empty list, and no candidate is written for that direction. An operator would see that log line in the backend output, along with the direction name (e.g., legal-pi-demand), and could verify that the OTEL_EXPORTER_OTLP_ENDPOINT variable is unset—so the telemetry spans (_emit_channel_spans) would also be suppressed, but the log remains the primary signal. No companies enter the database for that cycle, and the rotation cursor for that vertical is still advanced (since plan_targets runs before the brainstorm), so the next tick will skip to the following sub‑niche, preventing infinite retries on a dead endpoint.

Cost & performance — the real knobs

The subsystem allocates time and money to concurrent network requests to Common Crawl, LLM brainstorm endpoints, and launch-feed APIs, plus D1 database writes. The four tunable performance knobs explicitly present in the source are:

  • DEFAULT_CONCURRENCY (default 6). This global ceiling controls how many simultaneous httpx.AsyncClient connections can be open across all sources. Raising it increases parallel throughput and reduces wall‑clock latency but raises network bandwidth cost and risks overwhelming rate‑limiting or downstream services. Lowering it avoids connection throttling but serializes work, lengthening run duration. Mis‑setting too high triggers HTTP 429s or timeouts; too low starves the pipeline.

  • DEFAULT_PER_HOST (dictionary with per‑host keys: commoncrawl:6, llm:4, d1:6, launchfeed:2). These per‑host caps prevent a single throttling source from blocking others. Increasing commoncrawl yields more parallel CDX queries, accelerating domain extraction but increasing Common Crawl’s cost and the risk of being blocked. Decreasing llm caps the number of concurrent LLM brainstorm calls, saving token cost ($) per run at the expense of slower candidate generation. Mis‑setting a host cap too high causes repeated failures from that host; too low idles the pipeline on that source.

  • CC_MAX_DOMAINS (default 40). This constant caps the number of Common Crawl lookups per run. Raising it retrieves more candidate domains, increasing the chance of finding relevant companies but also linearly increasing WARC extraction time and CDX API costs. Lowering it curbs spending and latency but may miss valuable leads. Setting it above the host’s acceptable request volume triggers rate‑limiting; setting it to zero disables Common Crawl entirely.

  • _LAUNCH_LOOKBACK_DAYS (default 90) and _LAUNCH_SIGNAL_DECAY_DAYS (default 180). These constants control the time window for surfacing new launches and the decay of product‑launch signals. A larger lookback retrieves more launch‑feed candidates, increasing memory and processing time, and may include stale leads that reduce scoring accuracy. A shorter window narrows the funnel, saving compute but possibly missing slower‑to‑appear companies. Mis‑setting the decay too high retains outdated signals; too low discards nascent leads prematurely.

Failure modes — what breaks, what catches it

Failure 1: D1Error During Seed-Query Persistence

  • Trigger — A database write failure when executing _persist_seed_candidates (e.g., connection timeout, constraint violation, or D1 unavailability).
  • Guard — The except D1Error as e: clause that immediately follows the await _persist_seed_candidates(state) call, returning {"_error": str(e), "agent_timings": {...}}.
  • Posturefail-soft: The graph receives an error field in the returned dictionary instead of re-raising; subsequent steps check for state.get("_error") and short-circuit, so the run continues but with no new companies persisted for that tick.
  • Operator signal — The _error field in the returned dict (e.g., "_error: D1Error: ..."). No explicit log line is shown, but the error surfaces in the downstream state and could be collected by the caller.
  • Recovery — No automatic retry or backoff is present. The run must be manually re-triggered after the underlying D1 issue is resolved.

Failure 2: D1Error During Domain Deduplication

  • Trigger — The dedupe function’s D1 SELECT query fails (e.g., transient network error or database overload).
  • Guard — The except D1Error as e: return {"_error": f"dedupe: {e}"} inside the dedupe coroutine.
  • Posturefail-soft: The deduplication step returns an error dictionary, which is propagated as state["_error"]. Later functions (pre_score, _persist_seed_candidates, etc.) all short-circuit when they detect state.get("_error"), so the entire discovery tick is gracefully aborted without crashing the worker.
  • Operator signal — The returned _error field contains "dedupe: <D1Error message>". Again, no explicit log, but the caller can inspect the state.
  • Recovery — Manual re-run after the D1 outage or query issue is fixed. No retry logic is implemented.

Failure 3: Common Crawl Persistence Exception (Unhandled)

  • Trigger — The call cc.persist_crawl_async(cid, cco) inside _persist_launch_candidate raises an exception (e.g., D1 error, invalid crawl object, or network failure).
  • GuardNo guard exists. The coroutine does not wrap this call in a try-except. If the exception is raised, it propagates out of _persist_launch_candidate and, because asyncio.gather does not set return_exceptions=True, it will abort the entire gather and cause the parent conductor to fail hard.
  • Posturefail-hard: The exception escapes and the whole discovery run terminates with a traceback.
  • Operator signal — An unhandled exception traceback in the worker logs; the run stops before completing.
  • Recovery — The run must be restarted manually. No automatic retry or fallback exists in the source.

Failure 4: Launch-Feed Upsert / Signal Failure (Unhandled)

  • Trigger_upsert_company(key, ...) or _emit_launch_signal(cid, lc) inside _persist_launch_candidate raises a D1Error or other exception.
  • GuardNo guard is shown for these calls. The only guard is the if not cid: return 0,0,0,0 check when _upsert_company returns falsy (e.g., because of a blocklist rejection), but that is a validation guard, not an exception handler. Exceptions from _upsert_company or _emit_launch_signal will bubble up and, as in Failure 3, cause asyncio.gather to fail hard.
  • Posturefail-hard: The exception terminates the gather and the entire discovery tick.
  • Operator signal — Unhandled exception traceback in logs.
  • Recovery — Manual restart. No retry or fallback is implemented.

Failure 5: Blocklist-Domain Rejection (Silent Skip)

  • Trigger — A candidate domain matches a blocklist entry, so blocklist.canonicalize_domain(dom) returns None or the resulting key is rejected. The code then executes cid = await _upsert_company(...) which either returns 0 or the if not cid: branch is taken.
  • Guard — The explicit if not cid: return 0, 0, 0, 0 in _persist_launch_candidate. This is a validation guard that prevents the synthetic or blocked domain from being written.
  • Posturefail-soft: The candidate is silently skipped; the overall persist step continues with remaining candidates. The function returns zero counts.
  • Operator signal — No log line or error is produced in the shown source. The operator would see fewer persisted companies than expected (e.g., net_new count lower than the number of launch candidates). No metric is emitted.
  • Recovery — No action required; the rejection is intentional. If a domain was wrongly blocklisted, an operator must update the blocklist manually and re-run discovery.
Interview — could you explain it?

Q — What are the two real, verifiable sources of company discovery in this subsystem, and how do they differ from the synthetic brainstorm channel?

A — The two real sources are Common Crawl (CDX query + cached-page extraction) and launchfeed (YC + ProductHunt), as identified by the constant _ATS_PATH_SOURCES = frozenset({"commoncrawl", "launchfeed"}) and the default DEFAULT_SOURCES = ("launchfeed",). They produce independently checkable data from public archives and launch announcements, whereas the brainstorm channel is marked as synthetic (_SYNTHETIC_SOURCES = frozenset({"brainstorm"})) and hard-stripped in discover so that discovery decisions rest only on real signals.

  • Follow-up: Does the system ever use LLM-generated candidates for final discoveries?
    Answer: Yes, via the seed_query channel (ex-company_discovery), which is sanctioned and persist-eligible; it is distinct from the legacy synthetic brainstorm channel and uses a different pipeline (expand_seed → brainstorm → dedupe → pre_score).

  • Weak answer misses: The distinction between seed_query (sanctioned, non-synthetic) and brainstorm (synthetic, forbidden) is critical; a shallow answer might lump all LLM paths together, ignoring the explicit _SYNTHETIC_SOURCES set and the separate role of SEED_SOURCE.


Q — Why does the system impose per-host concurrency caps (DEFAULT_PER_HOST) in addition to the global DEFAULT_CONCURRENCY ceiling?

A — The HostLimiter in discovery_graph.py gates fan-out with a global cap plus per-host caps (e.g., "commoncrawl": 6, "llm": 4, "launchfeed": 2). This prevents one throttling host from causing retry storms or starving the others, so wall-clock time approximates the slowest host rather than the sum. All caps are overridable via run inputs (concurrency, per_host).

  • Follow-up: What happens if a host cap is hit while the global ceiling is not?
    Answer: The host limiter blocks that specific host’s requests, allowing other hosts to proceed up to their own caps, controlled per wave via asyncio.gather.

  • Weak answer misses: The per-host caps are defined per run and affect the asyncio.gather per wave; a shallow answer might omit that DEFAULT_PER_HOST is a dictionary keyed by host name, not a single number.


Q — How does the launchfeed source handle signal decay to avoid surfacing stale candidates?

A — Constants _LAUNCH_LOOKBACK_DAYS = 90 and _LAUNCH_SIGNAL_DECAY_DAYS = 180 control recency; only launches within 90 days are surfaced, and product_launch signals decay slowly over 180 days. This is defined under the "V60: launch-feed channel constants" section in discovery_graph.py.

  • Follow-up: What mechanism ensures that a company appearing in both launchfeed and Common Crawl isn't double-counted?
    Answer: The dedupe node (part of the seed-query path) and the overall discovery graph’s state management deduplicate candidates across sources.

  • Weak answer misses: The distinction between lookback (what is fetched) and decay (signal weight) is separate; a shallow answer might conflate them or miss that _LAUNCH_SIGNAL_DECAY_DAYS affects scoring persistence, not just recency.


Q — Given that both brainstorm and the seed_query path use LLM to generate candidate companies, why is the legacy brainstorm channel hard-stripped while seed_query is sanctioned as a persist-eligible source?

A — The _SYNTHETIC_SOURCES set explicitly forbids the brainstorm channel, and discover hard-strips it. In contrast, seed_query is declared "NOT synthetic" and is the sanctioned path for the lead-gen pipeline (ex-company_discovery). The key difference is that seed_query follows a controlled pipeline: expand_seed (LLM facet parse) → brainstorm (DeepSeek 12-20 real companies) → dedupepre_score, whereas the legacy brainstorm channel invented companies without that verification chain.

  • Follow-up: What concrete identifier in the code distinguishes the two brainstorm invocations?
    Answer: In _brainstorm_direction, the brainstorm call uses a marker "seed_query": f"micro-vertical:{mv.vertical}" to gate the node, whereas the legacy brainstorm channel lacks that seed-query framing and is routed differently.

  • Weak answer misses: The codebase has two separate brainstorm invocations (direction brainstorm per micro-vertical vs. seed-query brainstorm); a shallow answer would treat them as identical, ignoring the distinct _direction_sub_niches cursor logic and the seed_query marker.

04. The Merged Graph

Discovery used to run as two separate systems. One was a seed-query path. It took a fuzzy request, like find consultancies in Europe, and asked a language model to brainstorm real companies. The other was a multi-source engine that scraped many tracking boards, such as Ashby and Greenhouse. Engineers folded both into a single discovery graph behind one identity. A shared state and a single resolver now decide which channels switch on for each run.

The seed-query path is still the approved way to find and store real companies. The board side now polls across five hiring directions. Each direction lives in a plain table of settings, with no special code of its own. So one flexible graph can serve every direction, in place of five near-identical workers. Merging carries a real trade-off. A single graph is harder to reason about than five tiny ones. But it stops the two old systems from drifting apart, and it gives every run one source of truth to trust.

The channel resolver unifies seed_query and multi-source paths, always stripping the synthetic brainstorm channel.

python
def _resolve_sources(state: DiscoveryState) -> set[str]:
    # Resolve effective discovery channels for this run.
    raw = state.get("sources")
    if raw:
        srcs = set(raw)
    elif state.get("seed_query") and not state.get("directions"):
        srcs = {SEED_SOURCE}
    else:
        srcs = set(DEFAULT_SOURCES)
    return srcs - _SYNTHETIC_SOURCES
ELI5 — the plain-language version

Think of a restaurant that used to have two separate kitchens: one for cooking meals from orders scribbled on paper, and another for prepping ingredients from delivery trucks. Now the manager merged them into one kitchen that decides which stove to turn on based on what’s coming in. This subsystem finds new businesses to contact for sales by mixing customer requests with automatic scanning of job boards.

The old paper-order kitchen (the seed‑query path) took a fuzzy request like “find AI consultancies in Europe” and had a chef ask a language model to brainstorm real restaurant names. The delivery‑truck kitchen (the multi‑source engine) scraped job boards like Ashby and Greenhouse to find companies that are hiring. Now the merged system uses a switchboard called _resolve_sources that checks what type of run is happening: if a seed query is provided, it lights up the brainstorm stove; if it’s a board scan, it polls across five hiring directions. The state keeps track of what’s been found so far, and a function called expand_seed parses the request into facets like geography and keywords before brainstorming. The approved path for saving real companies is still the seed‑query route—not the board one.

One tricky rule: the brainstorm node gates on a marker called seed_query. For the legacy micro‑vertical path, the code injects a fake seed query like "micro‑vertical:legal‑pi‑demand" so that the brainstorm runs even when there’s no human query. But the persist step strips these synthetic sources—they are not saved as real discoveries. Also, the engine excludes already‑known companies up to 1000 domains to avoid re‑suggesting the same names, and rotates through sub_niches each tick so the language model explores fresh pockets instead of re‑mining the same well‑known companies. Without this merged graph, the two systems would run independently, double‑finding companies or missing ones that only appear on one channel. A beginner would feel the pain when the same business gets contacted twice from two different lists, or when a promising startup hired a key engineer but was never spotted because the job‑board scanner and the brainstormer didn’t share notes.

Data flow — one request, in order
  1. expand_seed node — Extracts B2B lead-gen facets (vertical, geography, size_band, keywords) from the user’s seed query using the DeepSeek LLM.

    • reads / writes — Reads state["seed_query"], state["_error"], and state["vertical"]/state["keywords"] to skip facet extraction if already provided. Writes state["seed_query"] (passed through), and conditionally writes state["vertical"], state["keywords"], state["geography"], or state["_error"] on failure.
    • branch — If SEED_SOURCE is not in the set returned by _resolve_sources(state), the node returns {} immediately (empty path). If the LLM call raises an exception, it writes state["_error"] and returns the error dict (failure path). Otherwise, it returns the extracted facets (happy path).
  2. brainstorm node — Generates up to 20 candidate companies by calling DeepSeek with the seed query, vertical, keywords, and an optional sub-niche for rotation, then canonicalizes domains and enriches each candidate with why_in_vertical, confidence, and evidence.

    • reads / writes — Reads state["seed_query"], state["vertical"], state["keywords"], state["geography"], state["exclude_domains"], and state["sub_niche"]. Writes state["candidates"] (list of candidate dicts) or leaves it empty.
    • branch — If the LLM call fails (e.g., API key missing), the node logs a warning and returns an empty list (fail‑soft). It also skips any candidate whose domain cannot be canonicalized (branch: skip that entry).
  3. dedupe node — Deduplicates the candidate list against already ingested companies and removes any entries with blocklisted domains before passing the list onward.

    • reads / writes — Reads state["candidates"] (from previous step). Writes state["candidates"] (deduplicated). May also read an internal blocklist or existing company store (not shown in snippet but implied by exclude_domains logic).
    • branch — If candidates is empty, the node returns the empty list unchanged (no‑op). If a candidate’s domain is blocklisted, it is dropped (skip branch).
  4. pre_score node — Scores each candidate using per‑keyword calibrated weights from the vertical’s pre_score_keywords (or falls back to IndustryProfile.score_tiers) and produces a scored list with confidence values capped at 1.0.

    • reads / writes — Reads state["candidates"], state["vertical"] (to look up the MicroVertical definition). Writes state["scored"] (list of scored candidate dicts with added confidence field).
    • branch — If candidates is empty, the node returns an empty scored list (no‑op). If the vertical has no pre_score_keywords, it falls back to the default scoring path (branch in internal logic, but still writes scored).
  5. _persist_seed_candidates function — Inserts the scored candidates into the companies table using ON CONFLICT DO NOTHING, separating newly inserted IDs, existing IDs, and the count of blocklisted‑domain skips.

    • reads / writes — Reads state["scored"], state["vertical"], state["geography"], and state["profile_tags"] (from _profile(state).tags). Writes nothing to state; returns a tuple (inserted_ids, existing_ids, blocked_skipped).
    • branch — If scored is empty, returns ([], [], 0) immediately (empty path). On D1Error exception, the caller returns an _error dict (failure path). Blocklisted domains are dropped before the INSERT (skip branch).
  6. persist node (graph‑level) — Orchestrates the seed‑query persist by calling _persist_seed_candidates and collecting agent timings; it is the terminal step of the seed‑query channel.

    • reads / writes — Reads state["scored"], state["vertical"], state["geography"]. Writes state["inserted_ids"], state["existing_ids"], state["blocked_skipped"], and state["agent_timings"]["persist"] (or state["_error"] on failure).
    • branch — If _persist_seed_candidates raises a D1Error, the node returns a dict with _error (failure path). Otherwise, it returns the persist summary (happy path). Brainstorm‑synthetic candidates are hard‑stripped here and never written (branching on source channel).
  7. _resolve_sources helper (called inside expand_seed) — Determines which discovery channels (seed‑query, micro‑vertical, ATS, etc.) are active for the current run, based on the state and configuration.

    • reads / writes — Reads state keys that indicate source preference (exact key not shown, likely sources or profile). Writes nothing; returns a set of source identifiers.
    • branch — Returns a set that may or may not contain SEED_SOURCE; the calling node uses this to decide whether to run its logic.
  8. _profile helper (called inside expand_seed and _persist_seed_candidates) — Resolves the focus profile from state, used to supply default seed queries and profile tags.

    • reads / writes — Reads state["profile"] or equivalent. Writes nothing; returns a profile object with default_seed_query and tags.
    • branch — If the state has no profile, it may fall back to a default (not shown; assumed stable).
  9. ainvoke_json_with_telemetry call (inside expand_seed) — Invokes the DeepSeek LLM with a system prompt and the user’s seed query, returning structured JSON and telemetry.

    • reads / writes — Reads the constructed prompt string, seed_query. Writes the LLM response as a dict (payload).
    • branch — If the LLM raises an exception, the calling node writes _error and returns early (failure path).
  10. brainstorm internal _brainstorm_direction call (for seed‑query path, this is the actual brainstorm logic) — For each candidate returned by the LLM, canonicalizes the domain, assembles a candidate dict with vertical, why_in_vertical, confidence, and evidence, and appends it to the result list.

    • reads / writes — Reads the raw candidate list from the LLM response. Writes the result list (domains, reasons, confidence).
    • branch — If a candidate has no valid domain (empty or no dot), it is skipped. The overall function returns an empty list if the LLM response contains no candidates key (empty path).
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The merged graph operates as a unified orchestration layer where a single DiscoveryState dictionary flows through sequenced nodes, each returning declared fields to prevent silent data loss. Execution begins with plan_targets, which reads the discovery_cursors table to rotate sub-niche indices per vertical, advancing the cursor atomically so successive ticks explore fresh framings. A resolver then picks active channels from DEFAULT_SOURCES (normally just launchfeed) while explicitly forbidding the synthetic brainstorm source — _SYNTHETIC_SOURCES is a frozen set ensuring LLM-fabricated candidates are never persisted. For the seed-query path, _brainstorm_direction accepts a MicroVertical with its seed_query, keyword_signals, and optional sub_niche, calling the DeepSeek model to brainstorm up to twenty candidates; failure there returns an empty list with a log warning, maintaining a fail-soft posture. Real-data sources like launchfeed and commoncrawl then run in parallel waves, with per-host concurrency caps (DEFAULT_PER_HOST) that let different hosts run simultaneously while capping each individually to avoid resource exhaustion.

The invariant the design preserves is a write boundary against synthetic data: the companies table refuses LLM-invented rows, and the brainstorm channel is hard-stripped in the discover node so it cannot be re-enabled per-run. Only the seed_query channel—ex‑company_discovery—is sanctioned for persist-eligible brainstorming because it is grounded by the resolver’s source selection. Additionally, the cursor’s atomic read-and-advance (_read_and_advance_cursor) ensures each vertical’s sub-niche rotation is exactly-once per tick, avoiding repeated recommendations of the same niche. The state dictionary’s declared return fields act as a structural invariant: any node that returns an undeclared field is caught at composition time, eliminating the old problem of silently dropped data when two subsystems serialized different shapes.

The key trade-off is unified state machine vs. separate isolated graphs. The merged design rejects the obvious alternative of keeping two independent pipelines—one for seed-query lead gen and one for vertical-scraping—which would have required costly inter‑graph coordination (e.g., shared databases, reconciliation jobs) to avoid duplicate companies or conflicting scoring. Instead, a single DiscoveryState and a resolver that selects channels based on a set of sources (_ATS_PATH_SOURCES for multi-source waves vs. pure seed-query) allow the system to reuse deduplication, scoring, and persistence logic across both paths. The cost this rejection avoids is the operational overhead of synchronization: separate graphs would need explicit idempotency keys, cross‑system dedup tables, and manual error propagation—each a source of eventual inconsistency. The merged design pays the price of a slightly larger state object and more complex cursor management (the discovery_cursors DDL and per-vertical rotation), but eliminates whole classes of silent drop and double‑write bugs.

A concrete failure mode is a stuck sub‑niche cursor caused by a D1 write error during _read_and_advance_cursor. If the discovery_cursors table update fails (e.g., a transient D1Error), the cursor index is never advanced, and on the next tick plan_targets will read the same index again, re‑emitting the same sub‑niche for that vertical. The operator would see repeated log.info lines in the discovery graph output with identical sub_niche strings for the same direction across successive runs, accompanied by a warning from the DDL attempt: "discovery_cursors DDL failed (continuing): ...". Meanwhile, the brainstorm node would receive the same framing, producing stale candidate suggestions that the deduplication layer would then reject as already persisted, leading to a gradual decline in fresh company discovery for that vertical. The fail‑soft design means the graph continues, but the direction’s coverage stagnates until the cursor table issue is repaired.

Cost & performance — the real knobs

The discovery graph spends time and money primarily on network I/O (HTTP requests to Common Crawl CDX, launch feeds, and LLM endpoints) and database writes (D1 upserts). The sink of money is LLM inference tokens (for brainstorming companies), external API calls (Common Crawl, ProductHunt, YC), and D1 write operations. The following knobs directly control how much parallelism, how many external requests are fired, and how many candidates are persisted, thereby governing latency and dollar cost.

  • DEFAULT_CONCURRENCY — The exact identifier is the constant DEFAULT_CONCURRENCY = 6, overridable via the concurrency input. It caps the global number of concurrent HTTP connections across all sources (Common Crawl, launch feed, LLM brainstorm). Raising it increases parallelism and reduces wall‑clock time, but multiplies the number of in‑flight requests, which can spike API billable calls per second. Setting it too low serializes work and lengthens run duration; too high risks hitting remote rate limits and causing retry storms that waste both time and money.

  • DEFAULT_PER_HOST — This is a dictionary ({"commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2}) that limits per‑host concurrency below the global ceiling. Bounds parallelism toward each specific source (e.g., only 4 concurrent LLM calls). Increasing a per‑host cap allows that source to fan out more aggressively, reducing the time spent waiting on that host, but raises the chance of being throttled (causing retries and added cost). Decreasing it protects a fragile host but may make that source the bottleneck, stretching total runtime.

  • CC_MAX_DOMAINS — The constant CC_MAX_DOMAINS = 40, overridable via the cc_max input. It limits the number of Common Crawl domain lookups performed per run. Bounds the volume of CDX index queries and cached‑page fetches (WARC extraction). Raising it yields more candidate domains for that data source, improving coverage at the cost of more HTTP requests and longer processing time. Setting it too low may miss valuable candidates; too high can cause runaway cost from thousands of archived page retrievals.

  • _LAUNCH_LOOKBACK_DAYS — The constant _LAUNCH_LOOKBACK_DAYS = 90 controls how many days into the past the launch‑feed scanner fetches recent product launches (YC, ProductHunt). Bounds the number of candidates surfaced per run. A wider lookback pulls in more launches, increasing the pool of candidates but also the number of API calls and subsequent pre‑scoring/persist work, raising both time and API costs. A narrower lookback reduces latency and expense but may miss promising new companies.

  • _LAUNCH_SIGNAL_DECAY_DAYS — The constant _LAUNCH_SIGNAL_DECAY_DAYS = 180 sets a decay window for product‑launch signals; candidates older than this threshold are considered stale and may be deprioritised or dropped. Bounds memory and storage: a longer decay keeps more historical candidates in play, increasing the set of companies that need to be scored and persisted, which costs D1 writes and downstream processing. A shorter decay prunes the candidate set aggressively, saving time and money but potentially discarding relevant older companies.

  • timeout — The httpx.AsyncClient is created with timeout=20.0 seconds (hard‑coded, not overridable in the shown context). Bounds how long the system waits for a single HTTP response before considering it failed. A shorter timeout reduces the time spent waiting on slow endpoints, freeing connections for other work, but increases the chance of false‑positive timeouts on legitimate but slow services, triggering retries and wasting budget. A longer timeout reduces retry frequency but can stall the entire run waiting for a slow host.

Failure modes — what breaks, what catches it

1. Brainstorm LLM Unavailability

  • Trigger — The DeepSeek API call inside _brainstorm_direction times out, returns a non‑JSON response, or the service is unreachable. Also applies to the micro‑vertical brainstorm path.
  • Guardexcept Exception as exc in _brainstorm_direction. The function catches any exception, logs it, and returns an empty list (fail‑soft).
  • Posture — Fail‑soft. The code explicitly states “Fail‑soft: returns [] if the LLM is unavailable.” The graph continues with an empty candidate list for that direction.
  • Operator signal — The log line log.warning("brainstorm failed direction=%s: %r", mv.vertical, exc). No error field is set in the state; the missing candidates are silent in the final output.
  • Recovery — No retry or backoff is implemented. The node returns [] immediately, and the calling wave proceeds without those candidates. The next tick will try again if the LLM becomes available.

2. Seed‑Query Facet Extraction Failure

  • Trigger — The expand_seed node calls ainvoke_json_with_telemetry on DeepSeek (Pro), and the call fails (e.g., network error, malformed response, or the LLM returns non‑JSON).
  • Guardexcept Exception as e in expand_seed. The entire try‑block catches any exception and returns {"_error": f"expand_seed: {e}"}.
  • Posture — Fail‑hard. Setting _error in the state propagates upward; the caller (presumably the graph runner) sees the error and stops the tick. No further nodes execute.
  • Operator signal — The _error field in the DiscoveryState dictionary, containing the string "expand_seed: <exception message>". No log line is written (the exception is swallowed into the state).
  • Recovery — No automatic retry. An operator must inspect the error, fix the root cause (e.g., API key, model name), and re‑run the discovery.

3. Persist Database Write Error (D1Error)

  • Trigger — The _persist_seed_candidates function (used for the seed‑query path) attempts an INSERT into the companies D1 table and encounters a D1Error, such as a schema mismatch, constraint violation, or database connection loss.
  • Guardexcept D1Error as e in the persist block (shown in the source snippet). The caught error causes a return of {"_error": str(e), "agent_timings": {"persist": round(time.perf_counter() - t0, 3)}}.
  • Posture — Fail‑hard. The run aborts by writing an _error field into the state, just like the expand_seed failure. No candidates are persisted.
  • Operator signal — The _error field contains the string representation of the D1Error. No separate log line is shown; the error surfaces through the state dictionary.
  • Recovery — Manual intervention required: diagnose the D1 issue (e.g., table not created, duplicate key, insufficient permissions) and re‑run after fixing it. No retry or backoff is coded.

4. Invalid Domain from Brainstorm Candidates

  • Trigger — The LLM returns a candidate with a domain that is empty, None, or lacks a dot (e.g., "example" or ""). This can happen because the LLM hallucinates malformed domains.
  • Guardif not dom or "." not in dom: continue inside the _brainstorm_direction loop, after blocklist.canonicalize_domain(str(c.get("domain") or "")).
  • Posture — Fail‑soft. The malformed candidate is silently skipped; the rest of the candidates are processed. The function returns a list that excludes the bad entry.
  • Operator signal — No log, warning, or metric is emitted for the skipped domain. The operator only sees a lower‑than‑expected number of candidates in the direction output.
  • Recovery — Automatic: the loop continues to the next candidate. No retry of the brainstorm call itself.

5. Seed‑Query Source Not Activated

  • Trigger — The _resolve_sources(state) function does not include SEED_SOURCE (because a multi‑source run supplies only Common Crawl, launchfeed, or brainstorm). The expand_seed node therefore gates on if SEED_SOURCE not in _resolve_sources(state): return {}.
  • Guard — The explicit return {} when the seed‑query source is inactive. No error is raised; the node simply produces an empty dict.
  • Posture — Fail‑soft. The seed‑query path is skipped entirely, and other active channels (e.g., Common Crawl) continue to produce candidates.
  • Operator signal — The summary dictionary (built later) includes "sources": sorted(_resolve_sources(state)). If "seed_query" is absent from the list, the operator knows the path was not activated. No error field or log is produced.
  • Recovery — No action needed; the tick uses whatever sources were configured. If seed‑query was intended, the operator must adjust the run input to include a seed_query field or change the source selection logic.
Interview — could you explain it?
  • Q — How does the merged graph decide which discovery channels to activate during a run, given it now combines the seed-query path and the multi‑source ATS engine?
    A — The node plan_targets calls _resolve_sources(state) and checks _ATS_PATH_SOURCES; if no ATS source is active it short‑circuits, leaving the seed‑query path free to run. expand_seed similarly calls _resolve_sources and only proceeds if the SEED_SOURCE is present. The resolver itself uses the run‑time sources input, defaulting to ("launchfeed",).
    Follow-up — What happens if both a seed_query and ATS sources are provided in the same state?
    A — They run in parallel because the state keys that each node writes are disjoint (seed‑query writes candidates, ATS writes _direction_sub_niches and later launchfeed_results), and the resolver returns both channels as active.
    Weak answer misses — The specific constant SEED_SOURCE = "seed_query" and the short‑circuit check in plan_targets (“if not (_resolve_sources(state) & _ATS_PATH_SOURCES): return {"_direction_sub_niches": {}}”) are the exact mechanism; a shallow answer might say “it just checks a flag” without naming _resolve_sources.

  • Q — Why does the seed‑query path extract facets from the original query using an LLM instead of a simpler regex or keyword‑based parser?
    A — Inside expand_seed, the LLM call is structured by the system prompt “Extract B2B lead‑gen facets from a user query. Return strict JSON: {"vertical", "geography", "size_band", "keywords"}”. This allows a single turn to convert a fuzzy phrase into structured fields that drive the downstream brainstorm (which uses vertical, keywords, and geography). A regex‑based parser would require a fixed vocabulary and miss implicit facets like “remote” vs “US‑only”.
    Follow-up — What happens if the LLM call fails—does the seed‑query path halt?
    A — No; expand_seed catches the exception and writes {"_error": f"expand_seed: {e}"}, which the graph treats as a soft error and continues (the next nodes check for _error and return early).
    Weak answer misses — The explicit JSON schema in the system prompt and the temperature setting (temperature=0.2) are key details; a shallow answer might claim the LLM just “guesses” without mentioning the structured output contract.

  • Q — The merged graph uses a single DiscoveryState dict with typed fields. How does it prevent data from being silently dropped when two nodes write to the same key?
    A — Each node returns a dict that is merged into the state by the graph runner; the state type is declared in schemas.state (the canonical DiscoveryState). Nodes that could conflict—like expand_seed and plan_targets—write to distinct keys (seed_query vs _direction_sub_niches). Additionally, the _emit_channel_spans function shows that per‑vertical counts are aggregated by a separate dictionary (by_vertical) before being emitted, avoiding direct writes to a shared key.
    Follow-up — What specific field in the state indicates that the brainstorm channel produced a synthetic candidate that must be stripped?
    A — There is no separate field; the constant _SYNTHETIC_SOURCES = frozenset({"brainstorm"}) is used in discover to strip results from that channel. The seed‑query path’s brainstorm writes to the brainstorm key, but it is the only channel allowed to persist via _persist_seed_candidates because the code explicitly checks state.get("brainstorm") and writes those rows separately from ATS‑sourced rows.
    Weak answer misses — The comment in the code that states “the write boundary itself must refuse LLM‑invented rows” and the fact that brainstorm is the only synthetic channel while seed_query is explicitly declared “NOT synthetic” are essential; a shallow answer might say “all LLM output is blocked” without distinguishing the two paths.

  • Q — Why does the ATS path use a cursor to rotate sub_niche selections for each direction every tick, rather than always using the same brainstorming framing?
    A — In plan_targets, the code reads the current cursor index from a D1 table (_read_and_advance_cursor), uses it to pick a sub_niche from mv.sub_niches, and then advances the cursor. This ensures that over successive runs the brainstorm (which calls _brainstorm_direction with that sub_niche parameter) explores a fresh “applied pocket” instead of repeatedly mining the same well‑known companies—as stated in the design note for MicroVertical.sub_niches.
    Follow-up — What happens if a direction has no sub_niches defined—does the cursor logic still run?
    A — No; plan_targets explicitly checks if not niches: continue before touching the cursor, so directions without sub_niches use only the seed query and keyword signals for brainstorming.
    Weak answer misses — The D1 table (mentioned in await _ensure_cursor_table()) and the per‑direction index (direction_sub_niches[direction] = niches[idx]) are the actual persistence mechanism; a shallow answer might say “it cycles through a list” without explaining how the index persists across graph runs.

  • Q — (Design) Why did the team merge two separate systems—company_discovery (seed‑query brainstorm) and micro‑vertical discovery (multi‑source ATS)—into one graph, rather than keeping them as independent workflows?
    A — The code comment in micro_verticals.py states: “Exactly why one parameterised graph beats several separate workers.” The merged graph uses a unified DiscoveryState and a shared set of nodes (e.g., both paths use the same _brainstorm_direction function, keyed on seed_query vs ATS‑derived directions). This avoids duplicating the dedupe, pre‑score, and persist logic. The resolver (_resolve_sources) activates only the relevant path(s), so a single run‑time configuration can handle pure seed‑query, pure ATS, or hybrid runs without code duplication.
    Follow-up — How does the graph prevent the seed‑query path from interfering with the ATS path’s state when both are active?
    A — The seed‑query nodes check if not (_resolve_sources(state) & _ATS_PATH_SOURCES) and skip their multi‑source logic, while ATS nodes like plan_targets make the opposite check. Additionally, the two paths write to different state keys: seed‑query populates candidates (via _persist_seed_candidates), and ATS populates _direction_sub_niches, launchfeed_results, etc.
    Weak answer misses — The constants SEED_SOURCE and _ATS_PATH_SOURCES are the explicit identifiers the resolver uses; also the comment in expand_seed (“Seed‑query path only — ATS / multi‑source runs skip facet extraction”) shows the separation is enforced per‑node, not just at the graph level.

05. Inside A Discovery Run

A discovery run starts by planning what to look for. The system is built around five narrow verticals, each one a specific hiring direction. Every vertical carries its own seed queries, keyword signals, and a rotating list of sub-niches. A cursor steps through those sub-niches one at a time, so the search never goes stale.

Next comes the concurrent fetch. The run polls real job boards through tracking system connectors like Ashby, Greenhouse, and Workable. At the same moment, a model brainstorms fresh names from the vertical seed query. To avoid repeats, it skips up to a thousand domains it already knows.

Then the survivors are filtered. A dedup step compares each new name and domain against the existing database, and only truly new ones pass. Each survivor gets a quick pre-score from weighted keywords, and weak matches fall away. The winners are written to the database, tagged with their vertical, and counted. The trade-off is that scoring early can drop a real company that simply hides its keywords, but it saves a lot of wasted outreach later. The run ends with one plain summary line.

The discovery graph’s seed-query pipeline stages as documented in the source code.

python
""" ...
  * **seed_query** (ex-``company_discovery``) — a fuzzy seed query
    (e.g. "AI consultancies in Europe doing RAG") → ``expand_seed`` (LLM facet
    parse) → ``brainstorm`` (DeepSeek 12-20 real companies) → ``dedupe`` →
    ``pre_score`` → persist.
"""
ELI5 — the plain-language version

Think of a librarian who searches for rare books across five distinct genres. Each genre has a list of sub‑topics to explore in rotation, like mystery's "locked‑room" or "cozy" plots. The librarian uses a rotating index to pick a new sub‑topic each time, so they never check the same shelf twice in a row. That is the core idea of this discovery planning subsystem: it decides which specific sub‑niche to investigate next, keeping the search fresh and efficient.

The system defines each genre (vertical) as a MicroVertical object—a frozen config carrying seed queries and keyword signals. At run start, the plan_targets node reads a persistent cursor for each vertical and advances it to select the next sub‑niche from that vertical's tuple. For example, if the vertical "legal-pi-demand" has sub‑niches like "contract review" and "e-discovery", the cursor rotates through them one per tick. To make smarter choices, the steering mechanism reorders the list using _steering_priority, which queries discovery_yield_history for recent net‑new counts and cost. This pushes cheap, productive sub‑niches to the front while starving expensive ones sink to the bottom.

The trickiest part is the steering weight formula: it computes a priority weight from trailing yield with a cost penalty. If a sub‑niche's cost‑per‑net‑new exceeds a ceiling (e.g., a certain dollar amount), its weight is slashed—even if it found many companies. Sub‑niches with zero net‑new get weight 0.0, so they sort last and the cursor never picks them until all others are exhausted. Without this subsystem, the librarian would either mine the same sub‑topic until barren or blindly rotate ignoring past performance—leading to stale results and wasted budget on expensive searches that yield nothing.

Data flow — one request, in order
  1. plan_targets – Entry node; reads state, checks _resolve_sources(state) for ATS path sources.

    • reads / writes: consumes state (keys: directions, sub_niche); writes _direction_sub_niches (dict of direction → sub_niche string).
    • branch: if no ATS sources (_ATS_PATH_SOURCES), short‑circuits with empty _direction_sub_niches (happy path for seed‑query‑only runs goes this way); otherwise loops over micro_verticals.resolve_directions(state.get("directions")).
  2. _ensure_cursor_table (called inside plan_targets) – Ensures the D1 cursor table exists (no return).

    • reads / writes: writes to D1 cursor table (implied).
    • branch: no branch; always runs before cursor read.
  3. _read_and_advance_cursor (called per direction inside plan_targets) – Reads the current sub‑niche index for the direction from D1 and advances it (returns index).

    • reads / writes: reads cursor table by direction key; writes updated index back.
    • branch: loops over each direction in directions (fan‑out). Happy path: index obtained; if no sub‑niches defined (mv.sub_niches empty), skips.
  4. expand_seed – Seed‑query path node; checks _resolve_sources(state) for SEED_SOURCE.

    • reads / writes: reads state keys: seed_query, vertical, keywords, _error; writes seed_query (updated), vertical, keywords, _error on failure.
    • branch: if SEED_SOURCE not active, returns empty dict (no‑op). Happy path: calls _profile(state) then make_deepseek_pro + ainvoke_json_with_telemetry to extract facets. On exception, writes _error and returns early.
  5. _brainstorm_direction (called per direction, typically from a concurrent node) – Builds the prompt dict with seed_query, vertical, keywords, exclude_domains, sub_niche and calls external brainstorm.

    • reads / writes: reads its arguments (derived from state); writes list of candidate dicts (keys: name, domain, why, why_in_vertical, confidence, evidence, vertical).
    • branch: if brainstorm raises any exception, returns empty list (fail‑soft). Happy path: iterates over returned candidates, canonicalizes domain via blocklist.canonicalize_domain, skips invalid domains.
  6. dedupe (node in the seed‑query path) – Reads the accumulated candidate list and removes entries whose domain already exists in the D1 companies table.

    • reads / writes: reads candidates from state; writes deduplicated list back to candidates.
    • branch: no explicit branch shown; happy path continues if any candidates remain.
  7. pre_score (node in the seed‑query path) – Applies per‑keyword calibrated weights (from MicroVertical.pre_score_keywords) to each candidate, capping confidence at 1.0.

    • reads / writes: reads candidates; writes scored list (appended to state).
    • branch: no explicit branch; empty input yields empty scored.
  8. _persist_seed_candidates – Sanctioned seed‑query persist; reads scored and writes each candidate to D1 companies using ON CONFLICT DO NOTHING.

    • reads / writes: reads scored, vertical, geography, _profile(state).tags; writes to D1, returns (inserted_ids, existing_ids, blocked_skipped).
    • branch: if scored is empty, returns early with empty results. On D1Error, writes _error and returns error dict.
  9. persist (the broader persist node, which includes grounding guard) – Strips synthetic brainstorm rows before writing; for seed‑query path, delegates to _persist_seed_candidates.

    • reads / writes: reads state; writes _error on failure.
    • branch: no explicit branch in provided context; always runs after pre_score.
  10. _emit_channel_spans (final telemetry step) – Emits one OTel span per (channel, vertical) pair using per‑vertical counts from the discovery waves.

    • reads / writes: reads by_vertical dict (built from results), ats_results, cc_results, sources; writes OTel spans (no state mutation).
    • branch: if OTEL_EXPORTER_OTLP_ENDPOINT env var is unset, no‑op; otherwise imports opentelemetry and creates spans.

(Total 10 steps; the request passes through these, with fan‑out over directions at step 3 and per‑direction _brainstorm_direction calls. Control then converges back to dedupe and pre_score for the entire batch.)

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The discovery run begins with the plan_targets node, which for each configured MicroVertical (e.g., legal-pi-demand) reads and advances a per-vertical rotation cursor over the sub_niches tuple, returning a _direction_sub_niches dict. This cursor is only used when the run includes _ATS_PATH_SOURCES (i.e., commoncrawl or launchfeed); seed‑query runs short‑circuit to avoid touching the rotation. Next, the concurrent discovery step fires _brainstorm_direction for each direction. It calls the brainstorm function with a seed_query formed from the vertical tag, the keyword_signals, and the rotated sub_niche (if any), requesting 8–12 candidates. The result is a list of dicts with domain, name, why_in_vertical, and confidence. The call is best‑effort: any exception is logged and [] is returned. The candidates then flow into a deduplication step inside _persist_seed_candidates, which looks up each domain in the Cloudflare D1 database and drops any that already exist, returning seed_inserted_ids, _seed_existing, and seed_blocked. Finally, the surviving rows enter pre_score, which applies the per‑keyword calibrated weights from pre_score_keywords (or the fallback IndustryProfile.score_tiers path) to produce a confidence score capped at 1.0, after which they are persisted into the companies table.

The design enforces a single hard invariant: the write boundary must refuse LLM‑invented (synthetic) rows. This is implemented by marking brainstorm as a _SYNTHETIC_SOURCE and hard‑stripping it in the discover node so that the brainstorm channel can never be re‑enabled per‑run. Only real‑data sources (launchfeed, commoncrawl) and the sanctioned SEED_SOURCE path (which uses LLM brainstorming but is explicitly labelled non‑synthetic) may persist entries. This guarantee protects the companies table from fabricated companies that would corrupt downstream scoring and outreach decisions.

The key trade‑off is running a single parameterised graph over many directions instead of one dedicated worker per vertical. The alternative—a separate worker for each micro‑vertical—would duplicate infrastructure and require per‑direction logic updates; the chosen design centralises all direction‑specific data (seed query, keyword signals, sub‑niches) in the frozen MicroVertical dataclass and lets discovery_graph.py treat every direction identically. The cost of this rejection is avoided: no separate deployment, no duplicated DEFAULT_CONCURRENCY or DEFAULT_PER_HOST tuning, and no need to synchronise per‑vertical cursor state across multiple workers. The sub‑niche rotation cursor further avoids re‑mining the same well‑known names by windows through a fresh pocket each tick.

A concrete failure mode occurs when the DeepSeek call inside _brainstorm_direction raises an exception (e.g., a network timeout or missing API key). The node catches it with except Exception and returns an empty list. The operator sees the warning log: "brainstorm failed direction=%s: %r" (e.g., brainstorm failed direction='legal-pi-demand': ConnectionTimeout(...)). No error is propagated to the pipeline; the run continues with zero candidates from that direction, and the operator inspects the log to determine whether to retry or adjust the API configuration.

Cost & performance — the real knobs

Based solely on the provided source, the discovery subsystem reveals three explicit performance knobs that control where time and money are spent. No retry counts, backoff, batch sizes, caches, model choices, or retrieval top‑k are present in the context. Each knob below is a real identifier from the code, with its default value and bounds drawn from the source.

  • DEFAULT_CONCURRENCY (overridable via the concurrency input; default = 6)
    Bounds: Limits the total number of concurrent HTTP connections (parallelism) across all hosts.
    Effect: Increasing it lets more candidate fetches happen in parallel, reducing wall‑clock time but raising simultaneous request volume and potential API cost. Decreasing it serializes work, increasing latency.
    Risk: Too high risks rate‑limiting or connection timeouts; too low starves throughput and extends runtime.

  • DEFAULT_PER_HOST (overridable via the per_host input; default = {"commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2})
    Bounds: Per‑host concurrency caps prevent any single source (e.g., the LLM or D1 database) from dominating the global slot pool.
    Effect: Lower per‑host caps throttle that specific host’s requests – slowing its stage but protecting others from starvation. Higher caps speed up that source but risk overloading it and causing errors.
    Risk: A too‑high cap on a slow host (e.g., llm) can cause retry storms and waste money; a too‑low cap on a fast host (e.g., d1) serializes database writes and lengthens persist time.

  • CC_MAX_DOMAINS (overridable via the cc_max input; default = 40)
    Bounds: Caps the number of Common‑Crawl lookups per run.
    Effect: Increasing it allows more candidate domains to be enriched from Common‑Crawl, improving data quality but adding WARC extraction time and HTTP request volume. Decreasing it limits enrichment and reduces cost.
    Risk: Too high may exhaust budget or hit CDX API rate limits; too low leaves many candidates without crawl data, weakening scoring/outreach decisions.

These three knobs are the only real performance identifiers the source provides. The context also contains constants like _LAUNCH_LOOKBACK_DAYS = 90 and _LAUNCH_SIGNAL_DECAY_DAYS = 180, but they control signal decay, not runtime concurrency or cost trade‑offs, and are not exposed as runtime overrides.

Failure modes — what breaks, what catches it

Failure 1: LLM Unavailable (API Timeout / Missing Key)

  • Trigger — The _brainstorm_direction function calls brainstorm with a seed_query and vertical, but the DeepSeek API is unreachable (e.g., network failure, expired API key, or rate-limit exhaustion in a dry run).
  • Guard — The try/except Exception as exc block in _brainstorm_direction catches any exception (identified by the bare except Exception as exc clause, with comment # noqa: BLE001 — brainstorm is best-effort).
  • PostureFail-soft: the function returns an empty list [] and continues, allowing the discovery pipeline to proceed with zero candidates from this direction. The run is not aborted.
  • Operator signal — A warning log line emitted by log.warning("brainstorm failed direction=%s: %r", mv.vertical, exc). The operator sees the vertical tag and the exception repr (e.g., “Connection refused”).
  • Recovery — No retry is implemented. The discovery tick proceeds without any candidate companies from that direction. On the next tick (rotating to a new sub_niche), the same direction will be retried automatically because the loop windows through sub_niches by offset.

Failure 2: Malformed LLM Response (Missing or Invalid Domain)

  • Trigger — The LLM returns a candidate dictionary that lacks a domain key, or the domain does not contain a dot (e.g., "domain": "" or "acme").
  • Guard — The for c in cands: loop inside _brainstorm_direction applies an inline validation: if not dom or "." not in dom: continue (where dom = blocklist.canonicalize_domain(str(c.get("domain") or ""))).
  • PostureFail-soft: the invalid candidate is silently skipped; other valid candidates are still processed. The function returns whatever valid entries remain.
  • Operator signal — No explicit log or metric is emitted for the skipped entry. The operator would observe that the final candidate count is lower than expected (e.g., 5 instead of 12), but the reason is invisible without ad hoc debugging.
  • Recovery — No action required; the system continues with fewer candidates. The operator might re-run the discovery with a different seed query or sub-niche to compensate.

Failure 3: Duplicate Domain Already in Database

  • Trigger — The _persist_seed_candidates function attempts to insert a company whose domain already exists in the Cloudflare D1 companies table. The INSERT uses ON CONFLICT DO NOTHING.
  • Guard — The ON CONFLICT DO NOTHING clause in the SQL statement, combined with the changes return value to distinguish fresh inserts from collisions. The caller receives (inserted_ids, existing_ids, blocked_skipped) — duplicates appear in existing_ids.
  • PostureFail-soft: the duplicate is not inserted, and the function continues normally. No exception or error is raised.
  • Operator signal — The state returned from _persist_seed_candidates includes _seed_existing (the count of duplicates). The operator can see this in the run’s output (e.g., print(state["_seed_existing"]) in logs). No log warning is generated.
  • Recovery — No action needed. The duplicate is silently ignored, and the existing record remains unchanged. The discovery run completes successfully.

Failure 4: Blocklisted Domain in LLM Candidates

  • Trigger — The _brainstorm_direction function returns a candidate whose domain matches an entry in the blocklist (e.g., a known spam or excluded company). The LLM path bypasses the pre-persist blocklist that the explicit pipeline_graph.run_discover applies.
  • Guard — Inside _persist_seed_candidates, before the INSERT, the code explicitly drops blocklisted domains: “Drop blocklisted domains before INSERT — the LLM brainstorm path bypasses the blocklist that pipeline_graph.run_discover applies on the explicit path.” (This is stated in the docstring of the persist function.)
  • PostureFail-closed: the blocked domain is prevented from being written to the database. The candidate is skipped, and the persisted state includes blocked_skipped count.
  • Operator signal — The _persist_seed_candidates returns blocked_skipped as an integer; the operator can observe this in the run state (e.g., state["_persist_result"]["blocked_skipped"]). No separate log line is guaranteed.
  • Recovery — The operator must manually review the blocklist and either remove the false positive or update the seed query to avoid suggesting that domain. No automatic retry occurs.

Failure 5: D1 Database Write Error (D1Error)

  • Trigger — The _persist_seed_candidates function calls the D1 database and encounters a D1Error (e.g., connection failure, constraint violation unrelated to ON CONFLICT, or resource exhaustion).
  • Guard — The try/except D1Error as e block (exact identifier D1Error) in the persist segment catches the error. The handler returns a state dictionary with {"_error": str(e)} and a "agent_timings" field.
  • PostureFail-hard: the entire persist step returns early with an error state, which propagates upward and likely halts the discovery run. No companies are written.
  • Operator signal — The returned state contains _error with the string representation of the exception. The operator sees this in the run logs as a top-level failure. The timing metric is also recorded as "persist": round(time.perf_counter() - t0, 3).
  • Recovery — No automatic retry is shown in the source. The operator must investigate the D1Error (e.g., database capacity, schema mismatch) and re-run the discovery after remediation.

Failure 6: Synthetic Data Channel Misrouted (Persist Strips brainstorm Source)

  • Trigger — The discovery pipeline incorrectly assigns the source label "brainstorm" (the legacy synthetic channel) to candidates that are actually from the sanctioned "seed_query" path, or a bug causes the persist function to treat the incoming candidates as synthetic. The persist function uses _SYNTHETIC_SOURCES = frozenset({"brainstorm"}) to identify synthetic rows.
  • Guard — The persist code checks the source attribute and refuses to write any candidate whose source is in _SYNTHETIC_SOURCES. The guard is described as: “the write boundary itself must refuse LLM-invented (synthetic) rows — never rely solely on the upstream discover strip.” (Exact: the persist function strips synthetic rows based on the _SYNTHETIC_SOURCES set.)
  • PostureFail-closed: zero candidates are persisted from that run, even if they are legitimate. The run returns no inserted IDs, and the _persist_seed_candidates report zero new rows.
  • Operator signal — The operator sees inserted_ids empty and existing_ids potentially empty, with no error. The only clue is that the seed_query path’s output is missing. No explicit log warning is present.
  • Recovery — The operator must inspect the source label assignment in the upstream nodes (expand_seed, brainstorm) and correct the mapping. A re-run with proper source tagging ("seed_query") is required.
Interview — could you explain it?

Q — In the planning stage, how does the system avoid repeatedly exploring the same micro-vertical niches across successive runs?

A — The plan_targets node reads and advances a per-vertical sub_niche cursor stored in a database table. Each direction (e.g., “AI voice agents for dental scheduling”) has a list of sub_niches. On every run, _read_and_advance_cursor increments the index and wraps around, so the next tick uses a different framing. This rotates the brainstorm’s framing and prevents re‑mining the same well-known names.

Follow-up — Why use an explicit cursor instead of random sampling?
Weak answer misses — The cursor logic is in plan_targets and relies on _ensure_cursor_table; a shallow answer might overlook that the rotation is deterministic and stateful, not random.


Q — The discovery graph uses two different LLM‑based sources: brainstorm and seed_query. How are they treated differently regarding data persistence and why?

Aseed_query is a sanctioned, persist‑eligible channel (ex‑company_discovery) that goes through expand_seed → brainstorm → dedupe → pre_score → persist. In contrast, the legacy brainstorm channel is classified as synthetic (_SYNTHETIC_SOURCES = {"brainstorm"}) and is hard‑stripped in discover and persist so its invented companies never reach the database. The justification, stated in the source, is that “Discovery decisions must rest on companies sourced from real signals — never on the LLM brainstorm channel, which invents candidate companies.”

Follow-up — How does the system ensure the seed_query path does not accidentally use the banned brainstorm source?
Weak answer misses — The distinction is enforced by _resolve_sources and the separate SEED_SOURCE constant; a shallow answer might conflate the two paths and miss the hard‑stripping in discover.


Q — What concurrency controls prevent one discovery channel from throttling or starving the others when they run in parallel?

A — A single HostLimiter gates fan‑out with a global cap (DEFAULT_CONCURRENCY = 6) and per‑host caps defined in DEFAULT_PER_HOST (e.g., commoncrawl: 6, llm: 4). This stops, for example, a slow Common Crawl endpoint from causing retry storms or starving the launchfeed and LLM channels. The caps are overridable per run via input parameters.

Follow-up — Why is there a separate per‑host cap for the LLM even though brainstorm is banned?
Weak answer misses — The per‑host caps apply to all non‑synthetic channels (including the seed_query path that also calls an LLM); a shallow answer might overlook that the LLM per‑host cap (llm: 4) is still active for the approved brainstorm step.


Q — After the system has a list of candidate companies, how does it pre‑score them, and how are keyword weights calibrated per vertical?

A — The pre_score step uses keyword weights from the MicroVertical’s pre_score_keywords tuple. V43 introduced per‑keyword calibrated boost weights stored as (keyword, boost_weight) pairs in the micro‑vertical definition. The pre_score function sums matched weights and caps at 1.0. If the per‑keyword weights are empty, it falls back to the IndustryProfile.score_tiers path.

Follow-up — Is the confidence score from the LLM brainstorm used directly in the pre‑score, or is it replaced?
Weak answer misses — The source shows that each candidate record carries a confidence field from the brainstorm, but the pre‑score logic is separate (keyword‑weight based); a shallow answer might think the LLM’s confidence is the final score.

06. Concurrent Source Waves

The discovery engine reaches out to many sources at once. It pulls live postings from the public interfaces of several major tracking systems. Ashby, Greenhouse, and Workable are three of them. It also queries a web archive for cached company pages, and it checks startup launch feeds. When a seed query is set, it adds a model brainstorm too. Those brainstormed entries are stripped later, so the final decisions stay grounded in real data.

A safety gate keeps that fan-out under control. It uses one global limit on total requests in flight, plus a smaller limit for each separate host. So a slow or throttled board cannot eat the whole budget and starve the others. A worker has to claim a global slot and then a host slot before it may call out. The payoff is speed. The whole phase finishes in about the time of the single slowest source, since the faster ones complete while you wait. The trade-off is added complexity. Two layers of limits are harder to tune than one.

The HostLimiter pairs a global cap with per-host semaphores so a throttled board consumes only its own tokens and cannot slow down other channels.

python
DEFAULT_PER_HOST: dict[str, int] = {
    "ashby": 3, "greenhouse": 3, "lever": 3, "workable": 2, "rippling": 2,
    "smartrecruiters": 2, "teamtailor": 2, "commoncrawl": 6, "launchfeed": 2,
}

class HostLimiter:
    def __init__(self, global_cap: int, per_host: dict[str, int]):
        self._global = asyncio.Semaphore(max(1, global_cap))
        self._hosts = {h: asyncio.Semaphore(max(1, n)) for h, n in per_host.items()}

    @asynccontextmanager
    async def slot(self, host: str):
        hsem = self._hosts.get(host)
        async with self._global:
            if hsem is None:
                yield
            else:
                async with hsem:
                    yield


async def _scrape_board(tgt: dict[str, Any]) -> dict[str, Any]:
    vendor = tgt["vendor"]
    async with limiter.slot(vendor):
        jobs = await fetcher(client, tgt["slug"])
    # … classify AI roles and return …
ELI5 — the plain-language version

Think of this like a busy restaurant kitchen where several chefs cook different dishes at once, but a single expediter limits how many orders each chef can handle at the same time so that one slow cook doesn’t clog the whole kitchen. This subsystem is for efficiently gathering company leads from many different data sources without overwhelming any one provider.

In practice, the expediter uses two layers of caps: a global limit (DEFAULT_CONCURRENCY = 6) on total orders in flight, plus per‑chef limits (DEFAULT_PER_HOST) like six for Common Crawl and only four for the AI model. The function _resolve_sources decides which chefs to activate — for example, the launch‑feed chef and the Common Crawl chef are real, but any dishes from the legacy brainstorm channel (in _SYNTHETIC_SOURCES) are thrown out at the end. Only the seed‑query brainstorm is kept because it starts from a real seed query instead of inventing names out of thin air.

The trickiest part is why the AI brainstorm is allowed for seed‑query orders but banned otherwise. The seed‑query source (SEED_SOURCE) goes through expand_seed first, which grounds the brainstorm in a real user query like “AI consultancies in Europe,” so it suggests actual companies. The legacy brainstorm channel lacks that anchor and is always stripped by discover — that’s the edge case a beginner would miss. Without this safety gate, one slow host would freeze all discovery, and without the synthetic strip, fake companies would ruin the lead database — a concrete failure a beginner would feel when every prospect turns out to be imaginary.

Data flow — one request, in order
  1. discover(state) — Entry point for the concurrent source waves.

    • reads / writes: Reads state.get("directions"), state.get("sources"), state.get("concurrency"), state.get("per_host"), state.get("cc_warc"), state.get("cc_max"), state.get("exclude_domains"). Writes nothing yet.
    • branch: None initially — all paths proceed to resolve directions.
  2. micro_verticals.resolve_directions(state.get("directions")) — Resolves the list of active micro-vertical directions from the state.

    • reads / writes: Reads state["directions"]; returns directions list.
    • branch: If state["directions"] is empty, the resolved list may be empty — subsequent loops over directions become no-ops.
  3. _resolve_sources(state) — Determines the set of real discovery channels (ATS, launchfeed, Common Crawl, etc.) stripping any synthetic sources.

    • reads / writes: Reads state.get("sources"); returns sources set.
    • branch: If synthetic sources are present, logs a warning but continues; the happy path proceeds with only real sources.
  4. Read concurrency and per‑host limitsint(state.get("concurrency") or DEFAULT_CONCURRENCY) and per_host = {**DEFAULT_PER_HOST, **(state.get("per_host") or {})}.

    • reads / writes: Reads state["concurrency"], state["per_host"]; writes local concurrency and per_host dicts.
    • branch: Falls back to DEFAULT_CONCURRENCY and DEFAULT_PER_HOST when state keys are missing.
  5. limiter = HostLimiter(concurrency, per_host) and limits = httpx.Limits(max_connections=concurrency, max_keepalive_connections=concurrency) — Instantiates the concurrency manager that issues per‑host tokens (e.g., 2 for Ashby) and a global pool.

    • reads / writes: Reads concurrency, per_host; writes limiter and limits objects.
    • branch: None; always created.
  6. Initialize result accumulatorsats_results = [], brainstorm = [], launch_candidates = [], cc_results = {}.

    • reads / writes: Writes the four named lists/dicts.
    • branch: None.
  7. Enter async with httpx.AsyncClient(...) as client: — Opens an HTTP client with the configured limits and shared headers.

    • reads / writes: Uses limits, User-Agent, Accept headers; writes client.
    • branch: Client creation always succeeds; network errors happen inside per‑request calls.
  8. Wave 1: LLM brainstorm per direction — Iterates over directions, calling _brainstorm_direction(mv, exclude_domains=exclude_domains, sub_niche=...) for each.

    • reads / writes: Reads mv.vertical, mv.keyword_signals, exclude_domains from state, and per‑vertical sub_niche from the optionally pre‑populated _direction_sub_niches dict (set by plan_targets node). Appends results to brainstorm list.
    • branch: Fail‑soft — if the LLM call raises an exception, _brainstorm_direction returns [] and logs a warning; the list is not mutated. On the happy path, each candidate is canonicalized via blocklist.canonicalize_domain and appended.
  9. Wave 2: ATS calls — For each ATS source (e.g., Ashby, Greenhouse, Lever) in sources, the system acquires a token from limiter for that host, then calls the corresponding job posting API.

    • reads / writes: Reads client, limiter, host‑specific tokens; writes discovered companies into ats_results.
    • branch: If a host is throttled or returns an error, that request fails independently; ats_results may be partially filled.
  10. Wave 2: Launch‑feed calls — In parallel, the system calls the product‑launch announcement API (e.g., via launch_candidates), again using limiter for per‑host tokens.

    • reads / writes: Reads client, limiter, launch‑feed configuration; writes launch_candidates.
    • branch: Same fail‑independent pattern as ATS.
  11. Wave 2: Common Crawl scan — Calls the CC‑archive endpoint with parameters cc_warc and cc_max (from state) to scrape historical company signals.

    • reads / writes: Reads cc_warc, cc_max, client, limiter; writes cc_results.
    • branch: If cc_warc is falsy, the scan may be skipped (branch not explicitly shown but implied by configuration).
  12. Collect and return aggregated results — After all waves complete, discover builds a final dict containing the four accumulators.

    • reads / writes: Reads ats_results, brainstorm, launch_candidates, cc_results; returns dict[str, Any] (exact keys not shown in snippet, but typically mirror these names).
    • branch: The return is unconditional; no subsequent steps in this chapter.
  13. Post‑return OTel span emission (optional) — After discover returns, the caller (likely the graph runner) invokes _emit_channel_spans(...) to emit telemetry per (channel, vertical) pair.

    • reads / writes: Reads ats_results, cc_results, launch_candidates and the resolved channels; writes OTel spans if the exporter endpoint is configured.
    • branch: No‑op when OTEL_EXPORTER_OTLP_ENDPOINT is unset or opentelemetry is not importable. This is the terminal step in the concurrent‑waves chapter.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The discovery engine in this chapter fans out across multiple data streams — job-posting APIs (Ashby, Greenhouse, Lever), product-launch announcements, and Common Crawl — in parallel. The ordered mechanism begins with plan_targets, which resolves which sources are active (e.g., _ATS_PATH_SOURCES like "commoncrawl" or "launchfeed") and advances a per‑vertical sub‑niche cursor to rotate brainstorm framing. Next, the discover node launches concurrent waves against each host, governed by two explicit caps: DEFAULT_CONCURRENCY (a global token pool of 6) and DEFAULT_PER_HOST (e.g., "launchfeed": 2, "commoncrawl": 6). The concurrency manager grants tokens per host per wave; if a host is slow or throttled, its token pool becomes empty and blocks further requests to that host while other hosts continue. On failure — an HTTP error or timeout — the wave catches the exception at the node level and surfaces an _error key in the state, aggregated later by _emit_channel_spans which emits per‑(channel, vertical) OTel spans containing count‑only attributes and error strings.

The invariant the design preserves is the write boundary against synthetic rows. This is enforced by _SYNTHETIC_SOURCES = frozenset({"brainstorm"}) — the brainstorm channel is stripped from discovery so it cannot insert LLM‑invented companies into the companies table. Only real‑data sources ("launchfeed", "commoncrawl") and the sanctioned SEED_SOURCE ("seed_query", which is not synthetic) are persisted. This guarantee is reinforced in the discover node’s grounding guard: “the companies table feeds scoring/outreach decisions, so the write boundary itself must refuse LLM‑invented (synthetic) rows — never rely solely on the upstream discover strip.” No discovery decision can flow from fabricated data.

The key trade‑off is limiting per‑host concurrency versus the obvious alternative of unlimited per‑host parallelism. The design rejects a naive fan‑out that would fire requests as fast as the system could issue them, which would trigger API rate‑limiting errors, wasteful retries, and hard‑to‑diagnose backpressure across the entire pipeline. By capping each host independently (e.g., only 2 concurrent requests to "launchfeed") and imposing a global ceiling (DEFAULT_CONCURRENCY = 6), the system avoids overwhelming upstream hosts and keeps the wave structure predictable. The cost of this rejection is slightly lower peak throughput on a single host, but it avoids the much higher cost of throttled or banned API access, which would collapse the entire discovery engine.

A concrete failure mode: the Common Crawl host ("commoncrawl": 6) hits a sustained 429 (Too Many Requests) because the crawl endpoint is under load. The concurrency manager cannot issue new tokens for that host; the wave’s _ats_search or _crawl call raises an exception. The discover node catches it and writes {"_error": "commoncrawl: rate limited"} into the state. An operator monitoring OTel spans generated by _emit_channel_spans would see a commoncrawl channel span with an error attribute containing the string "rate limited" alongside zero candidate counts. Meanwhile, the launchfeed and ats waves continue independently. The operator then knows only one source failed and can investigate the crawl service without affecting the other streams.

Cost & performance — the real knobs

Based on the provided source code, the discovery subsystem spends time waiting on external HTTP requests (to Common Crawl, launch feeds, and LLM APIs) and money on network egress, API calls, and compute for concurrent tasks. The following real performance knobs directly control these costs:

  • DEFAULT_CONCURRENCY
    KnobDEFAULT_CONCURRENCY = 6 (constant) and the concurrency input that overrides it.
    Bounds — Global cap on the total number of concurrent HTTP requests across all hosts (commoncrawl, llm, d1, launchfeed).
    Effect — Increasing it allows more parallelism, reducing wall-clock time per discovery wave but raising peak network and D1 I/O. Decreasing it serialises requests, lowering peak resource usage at the cost of longer runtimes.
    Risk — Too high a value can overwhelm external APIs with connection storms, trigger rate limits, or exhaust the D1 connection pool. Too low a value wastes idle capacity and extends the discovery phase.

  • DEFAULT_PER_HOST
    KnobDEFAULT_PER_HOST = {"commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2} and the per_host input that merges overrides.
    Bounds — Per-host token limit; each host (e.g., launchfeed max 2) can only have that many outstanding requests at once, independent of the global cap.
    Effect — Raising a per-host limit increases throughput to that specific source (e.g., faster launch feed ingestion) but may cause that host to dominate the global token pool. Lowering it protects a slow or throttled host without starving others.
    Risk — Setting a per-host cap too high makes the system vulnerable to a single host’s rate-limit cascade; too low can artificially slow down an otherwise fast source.

  • CC_MAX_DOMAINS
    KnobCC_MAX_DOMAINS = 40 (constant) and the cc_max input override.
    Bounds — Maximum number of Common Crawl (CDX) lookups performed per discovery run.
    Effect — Increasing it captures more candidate domains from the web archive, improving coverage at the cost of more WARC fetches and API calls. Decreasing it limits the CC work, saving bandwidth and compute time.
    Risk — A high value can cause a long-running CC phase without proportional gains if domains are low quality; a low value may miss valuable candidates.

  • HTTP client timeout (20.0s)
    Knobtimeout=20.0 (literal in httpx.AsyncClient constructor).
    Bounds — Maximum wall-clock wait for any single HTTP request (connection, read, or write timeout).
    Effect — A shorter timeout fails fast on slow endpoints, reducing tail latency but causing more retries. A longer timeout keeps the system waiting for a lagging API, increasing total run time if many hosts are slow.
    Risk — Too short: legitimate slower responses are prematurely dropped, leading to missed data. Too long: the concurrency slot is held hostage by a hung server, starving other hosts.

These four knobs are the only numeric performance parameters that appear in the source. There are no retry counts, backoff formulas, batch sizes, model selection constants, or retrieval top‑k values in the provided code; the subsystem relies on the concurrency manager (HostLimiter) and the above caps to trade off latency, throughput, and cost.

Failure modes — what breaks, what catches it

Failure 1: CommonCrawl Host Throttling or Unresponsive

  • Trigger – CommonCrawl’s CDX or page-extraction API returns HTTP 429, 503, or times out while the discover wave is running.
  • Guard – The HostLimiter enforces a per‑host cap ("commoncrawl": 6 in DEFAULT_PER_HOST) that stops new requests to that host once the cap is reached, but there is no retry or exception handler shown for the CommonCrawl fetcher itself; the fan‑out simply waits for pending requests to finish or fail.
  • Posture – fail‑soft: other channels (launchfeed, seed_query) continue unaffected; CommonCrawl results may be partial or absent for that run.
  • Operator signal – Silently fewer or missing CommonCrawl‑sourced candidates in the output; no explicit error log from the CommonCrawl code (the snippet does not display a catch for its fetchers). Wall‑clock time may increase if requests hang.
  • Recovery – No automatic retry. The operator would re‑run discovery (likely with the same seed_query); the next CommonCrawl wave will attempt again. Persistent throttling may require lowering "commoncrawl": 6 in the run input.

Failure 2: LLM Brainstorm (DeepSeek) API Failure

  • Trigger – The brainstorm call in _brainstorm_direction raises an exception (network timeout, API error, or empty response).
  • Guard – The except Exception as exc clause (line marked # noqa: BLE001 — brainstorm is best-effort) catches any runtime error, logs warning, and returns [].
  • Posture – fail‑soft: that particular direction yields no candidates; the rest of the discovery graph proceeds.
  • Operator signal – Log line "brainstorm failed direction=%s: %r" (logged via log.warning). The _direction_sub_niches dict for that vertical will have no contributions.
  • Recovery – Automatic on the next tick: the cursor in plan_targets rotates to a different sub_niche, so the next call to _brainstorm_direction will try a fresh framing. No backoff beyond the tick interval.

Failure 3: Global Concurrency Cap Saturation

  • Trigger – Multiple hosts (CommonCrawl, LLM, launchfeed, plus D1 queries) all have active requests at once, reaching the global limit DEFAULT_CONCURRENCY = 6. Subsequent tasks queue until a slot frees.
  • Guard – The global cap is enforced by HostLimiter; there is no overflow handler or error thrown when the queue is full – the system simply stalls until capacity opens.
  • Posture – fail‑soft: no request is dropped, but wall‑clock time for the asyncio.gather wave increases linearly with queue depth.
  • Operator signal – Elapsed time of the discovery run is noticeably longer than expected. No log or metric explicitly indicates queuing unless custom instrumentation is added (the snippet does not show such).
  • Recovery – The operator can raise the global cap via the concurrency run input, or reduce the number of active sources (e.g., disable CommonCrawl by not supplying --sources commoncrawl).

Failure 4: D1 Database Error During Deduplication

  • Trigger – The d1_all query inside dedupe raises a D1Error (connection failure, timeout, or constraint).
  • Guard – The except D1Error as e: return {"_error": f"dedupe: {e}"} clause catches the exception and replaces filtered with an error state.
  • Posture – fail‑hard: the discovery run is effectively aborted because state.get("_error") becomes truthy, and downstream nodes (pre_score, persist) short‑circuit via if state.get("_error"): return {}.
  • Operator signal – The run output will contain _error: "dedupe: ..." in the state; no companies are persisted.
  • Recovery – Manual. The operator must check D1 health and re‑run discovery. There is no automatic retry in the shown code.

Failure 5: Cursor Table Read/Advance Failure

  • Trigger – The _read_and_advance_cursor call inside plan_targets fails (e.g., D1 error or missing table) when trying to rotate the sub_niche for a vertical.
  • GuardNo exception handler is shown around _read_and_advance_cursor or _ensure_cursor_table in the snippet. If an exception occurs, it will propagate uncaught and abort the node.
  • Posture – fail‑hard: the plan_targets node raises, and the entire graph run fails (unless the caller catches it, which is not demonstrated).
  • Operator signal – An unhandled exception traceback in the worker logs; no state output from that node.
  • Recovery – Manual. The operator would need to investigate the cursor table in D1, fix the issue (e.g., re‑create the table), and re‑run. The function _ensure_cursor_table is called first, so a missing table would be created if that works, but if the D1 error is persistent, no automatic recovery exists.
Interview — could you explain it?

1. Warm-up

Q – How does the discovery engine manage concurrent requests when fanning out to multiple data streams like Common Crawl, launch feeds, and LLM brainstorm?

A – It uses a HostLimiter that enforces both a global token pool (DEFAULT_CONCURRENCY = 6) and per‑host caps (DEFAULT_PER_HOST dict, e.g., "commoncrawl": 6, "llm": 4, "launchfeed": 2). All channels run under a single asyncio.gather per wave, so wall‑clock time is bounded by the slowest host, not the sum of individual requests.

Follow-up – What stops a slow or throttled host from blocking the entire wave?
One‑line answer – Per‑host caps (defined in DEFAULT_PER_HOST) ensure that host‑level throttling does not starve other hosts or cause retry storms.

Weak answer misses – The concrete default values (DEFAULT_PER_HOST dictionary) and the HostLimiter class itself are the key details a shallow answer omits.


2. Moderate (design question)

Q – Why implement per‑host caps alongside a global limit instead of using a single global concurrency limit for all outbound requests?

A – Per‑host caps stop any single throttling host from causing retry storms or starving the other hosts, as explicitly stated in the concurrency model documentation. The global pool (DEFAULT_CONCURRENCY) caps total concurrency, while per‑host limits (e.g., "launchfeed": 2) allow different hosts to run in parallel without one hogging all tokens.

Follow-up – How are these per‑host caps made configurable per run?
One‑line answer – They are overridable via the per_host input dictionary, with defaults sourced from DEFAULT_PER_HOST.

Weak answer misses – The exact keys in DEFAULT_PER_HOST ("commoncrawl", "llm", "d1", "launchfeed") and the fact that the global cap is a separate constant (DEFAULT_CONCURRENCY) are essential details.


3. Hard

Q – The seed_query source also uses an LLM brainstorm step (in _brainstorm_direction). Does that path respect the same concurrency limits as the legacy brainstorm channel?

A – Yes, the LLM calls in _brainstorm_direction use the same shared HostLimiter with the per‑host cap for "llm" (default 4) and the global pool. The seed_query path runs inside the same asyncio.gather wave when it is active, but it is not considered synthetic; the source explicitly says seed_query is a “sanctioned, persist‑eligible” channel, unlike the legacy brainstorm channel which is hard‑stripped in discover/persist.

Follow-up – How does the system prevent the synthetic brainstorm channel from being re‑enabled per run?
One‑line answer – The _SYNTHETIC_SOURCES frozenset contains "brainstorm", and discover/persist hard‑strip that channel so it cannot be re‑enabled.

Weak answer misses – The distinction between seed_query (persist‑eligible) and brainstorm (synthetic, stripped) and the role of _SYNTHETIC_SOURCES are critical.


4. Hard (design question)

Q – Why does the system group all channels into a single asyncio.gather per wave instead of launching each source as an independent background task with its own retry logic?

A – The concurrency model notes that using asyncio.gather makes wall‑clock time “≈ slowest host, not the sum,” while also simplifying error handling: if a channel fails, it returns early (e.g., _brainstorm_direction catches exceptions and returns []) and the wave continues. This avoids the complexity of coordinating independent tasks and ensures the whole discovery tick finishes promptly.

Follow-up – What happens when a node like expand_seed encounters an error inside a wave?
One‑line answer – The function returns a dict with _error key, which later nodes check and skip processing (e.g., if state.get("_error"): return {}).

Weak answer misses – The explicit fail‑soft pattern in _brainstorm_direction and the _error key propagation through the graph state are the real implementation details.


5. Hard (sub‑niche rotation and concurrency interaction)

Q – The plan_targets node rotates sub‑niches per vertical using a cursor. How does this cursor logic interact with the concurrent wave of discovery operations?

Aplan_targets reads and advances a per‑vertical cursor via _read_and_advance_cursor, storing results in _direction_sub_niches. That dict is later consumed by _brainstorm_direction (or the launch‑feed path) to steer the brainstorming framing. The cursor rotation is independent of concurrency—it happens before the wave starts—so it does not affect host‑level limits or the asyncio.gather schedule.

Follow-up – What data structure backs the cursor, and how is it advanced atomically?
One‑line answer – A dedicated table (ensured by _ensure_cursor_table) stores the current index; each tick reads the index, computes idx % len(niches), and increments the stored value atomically.

Weak answer misses – The existence of _ensure_cursor_table and the modulo‑based rotation (not a simple increment) are the specific design points that a shallow answer ignores.

07. Micro-Verticals

Discovery hunts inside five applied artificial intelligence verticals. Each vertical uses positive keyword signals to find a good fit. In accounting automation, for example, a phrase like accounts payable is a strong signal and carries a weight of zero point six. A broader term like invoice processing carries zero point five. The stronger the signal, the higher the weight, so the best matches rise to the top. Negative signals then push obvious mismatches back down.

Every vertical also keeps a list of domains to ignore. The accounting list skips giant platforms and well-known ledger products, because they are far too large to be hiring targets. A different vertical chases voice agents instead. Its signals include words like voice, phone, and telephony. It aims at niches such as dental scheduling. The trade-off is that hand-tuned keywords go stale as the market shifts, and someone has to maintain them. But pairing weighted signals with an ignore list keeps the search tight on genuine applied companies, not infrastructure vendors.

The micro-vertical for plaintiff‑side personal‑injury AI uses keyword signals to surface demand‑letter and medical‑chronology startups while filtering insurance‑defense, billing, and legal‑research platforms.

python
"legal-pi-demand": MicroVertical(
    vertical="legal-pi-demand",
    label="Legal — personal-injury demand letters & case-doc AI",
    keyword_signals=(
        "demand letter", "personal injury", "medical chronology",
        "medical records", "plaintiff", "settlement", "case intake",
        "mass tort", "demand package", "litigation",
    ),
    negative_signals=(
        "insurance defense", "claims adjuster", "claims management",
        "debt collection", "medical billing", "revenue cycle management",
        "law firm staffing", "legal research platform", "legal research",
        "contract lifecycle management", "corporate legal", "e-discovery vendor",
        "commercial litigation", "corporate transactions", "real estate law",
        "contract management", "legal analytics", "compliance management",
    ),
    # … ats_targets, score_weights, etc.
)
ELI5 — the plain-language version

Imagine you are a restaurant critic searching for hidden-gem eateries. You have a mental checklist: strong signals like “affordable” (weighing 0.6) count more than weaker ones like “local ingredients” (0.5). You also have negative flags—“chain restaurant” or “fast food”—that push those places down, and a list of famous chains you skip entirely. That is exactly how this subsystem hunts for startups inside five applied‑AI verticals like accounting automation. It uses positive keyword signals with different weights to score companies: for the accounting vertical, “accounts payable” carries a weight of 0.60, while “invoice processing” gets 0.50, so the best matches rise to the top. Negative signals then push obvious mismatches back down, and a list of excluded domains (like quickbooks.intuit.com or blackline.com) keeps out giant incumbent platforms.

Going deeper, the system doesn’t just run the same search every time. It keeps a rotating cursor in a table called discovery_cursors that advances through the vertical’s sub‑niche list—for example, one day it focuses on “AI month‑end close automation,” the next on “AI accounts‑payable invoice capture.” Each sub‑niche has its own set of weighted keywords, so the scoring adapts: when exploring AP automation, the signal “accounts payable” gets boosted even higher to 0.60, while “month‑end close” stays lower. The brainstorm loop in persist_vertical_companies.py asks an LLM to generate companies fitting that exact sub‑niche, then deduplicates against the database and inserts only net‑new rows.

The trickiest part is why this rotation prevents stagnation. Without it, the system would keep returning the same familiar names—like FloQast or BlackLine—because those match the broad “accounting AI” signal. By cycling through sub‑niches, the discovery forces itself to explore fresh pockets: vendor statement reconciliation, three‑way PO matching, or flux analysis. This is why sub_niche_idx wraps around mod the count, ensuring that over time every angle gets a turn. Finally, consider what would go wrong without this subsystem: you’d constantly surface only the largest, most obvious startups (the “chains” you already know), miss the obscure innovators, and the pipeline would fill with stale, irrelevant leads.

Data flow — one request, in order
  1. plan_targets – Reads state["directions"] (e.g., ["voice-operations"]), resolves the MicroVertical object via micro_verticals.resolve_directions(), then for each direction reads the per-vertical sub_niche cursor from a D1 table (via _read_and_advance_cursor) and advances it by one.
    reads / writes – Reads state["directions"]; returns {"_direction_sub_niches": {"voice-operations": "AI voice agents for dental & medical front-desk / patient scheduling"}}.
    branch – If state.get("sub_niche") is already set, the cursor logic is skipped. Happy path: cursor exists and advances to the next framing.

  2. _brainstorm_direction – Invoked by the graph with the MicroVertical for voice operations and the sub_niche string from the previous step.
    reads / writes – Reads mv.vertical, mv.keyword_signals, mv.seed_query; writes a list of candidate dicts.
    branch – No early return here; the function always attempts the LLM call.

  3. brainstorm – Called inside _brainstorm_direction with a dict containing "seed_query": "micro-vertical:voice-operations", "vertical": "voice-operations", "keywords": list(mv.keyword_signals), "geography": "remote", "exclude_domains", and "sub_niche".
    reads / writes – Reads the LLM configuration (API key etc.) from environment; returns {"candidates": [...]} on success.
    branch – If the LLM is unavailable (e.g., no key) or raises an exception, _brainstorm_direction catches it and returns an empty list (fail-soft). Happy path: returns a candidate list.

  4. Loop over candidates – In _brainstorm_direction, each candidate dict from brainstorm is processed.
    reads / writes – Reads c.get("domain"); writes a canonicalized domain via blocklist.canonicalize_domain.
    branch – If the canonicalized domain is empty or lacks a dot, the candidate is skipped. Happy path: domain is valid and passes.

  5. Result dict construction – For a valid candidate, a new dict is appended to the result list with keys name, domain, why_in_vertical, confidence, evidence, and vertical (set to mv.vertical).
    reads / writes – Reads original candidate fields; writes the formatted dict into the result list.
    branch – No conditional; always appends if domain passed the previous check.

  6. Return to graph – The result list is returned from _brainstorm_direction and collected by the graph node (likely in an asyncio.gather over all directions).
    reads / writes – The list is stored in an intermediate state key (e.g., candidates_by_direction).
    branch – No branch; the list may be empty if brainstorm failed.

  7. dedupe – The consolidated candidates from all channels (here only voice-ops brainstorm) are deduplicated by domain.
    reads / writes – Reads the list of candidate dicts; writes a deduplicated list.
    branch – If the list is empty, dedupe returns an empty list.

  8. pre_score – For each candidate, the function reads mv.pre_score_keywords, mv.score_weights, and mv.negative_signals. It matches the candidate’s why_in_vertical text against the positive keywords, summing the associated weights (capped at 1.0). If any negative signal matches, the candidate’s score is set to 0 (or the candidate is removed).
    reads / writes – Reads candidate text and the MicroVertical configuration; writes a score field into the candidate dict (or filters out failed candidates).
    branch – If score is 0 (due to negative signal or no positive match), the candidate is excluded from further processing. Happy path: score > 0.

  9. persist – The scored candidates are written to the companies/opportunities table with the vertical tag "voice-operations".
    reads / writes – Reads the final candidate list; writes rows to the database (returns inserted_ids).
    branch – If the source is the brainstorm channel (legacy), some callers may hard-strip persistence; for the micro-vertical seed-query path (tagged with "seed_query"), persistence is sanctioned. Happy path: rows are inserted.

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The discovery system for Micro-Verticals operates in a strictly ordered pipeline: first, plan_targets reads and advances a per-vertical cursor from the D1 discovery_cursors table via _read_and_advance_cursor, selecting the next sub_niche from the MicroVertical.sub_niches tuple. That cursor is then injected into _direction_sub_niches on the state, and for each direction the orchestrator calls _brainstorm_direction, passing the chosen sub_niche along with the immutable keyword_signals and negative_signals from the MicroVertical dataclass. On failure — if the LLM brainstorm raises an exception or returns no candidates — _brainstorm_direction logs a warning and returns an empty list; the system does not halt but simply produces zero candidates for that direction in that tick. This fail-soft design mirrors the “best-effort” annotation in the source, ensuring that a transient LLM outage does not block the entire discovery wave.

The invariant that the design preserves is cursor-based idempotent rotation across the sub-niches of each vertical. The discovery_cursors table stores (vertical, sub_niche_idx, updated_at) with an upsert in _read_and_advance_cursor that atomically advances the index modulo the sub-niche count, wrapping at the end. This guarantees that multiple concurrent or retried runs never skip or duplicate a sub-niche window: the advance is a single-idempotent write, and the cursor always yields a deterministic index for that vertical at that tick. The vertical’s keyword_signals and negative_signals act as a second invariant: they are pure config, frozen in the MicroVertical dataclass, so the recognition of relevant companies (e.g., phone-agent builders for a voice-ops vertical) never depends on runtime state or branching logic.

The key trade-off is captured by the design principle “Direction is data, not code.” Instead of writing one worker per micro-vertical (the obvious alternative), the system uses a single parameterised graph that consumes the frozen MicroVertical rows as configuration. This rejects the per-vertical branching logic that would have required a separate worker for each niche (e.g., voice_ops_discovery.py, legal_pi_discovery.py), avoiding the cost of duplicated orchestration code, divergent bug fixes, and the maintenance burden of keeping N pipelines in sync as the scoring or ingestion logic evolves. The cost of this rejection is that all verticals share the same brainstorm and pre-score machinery—but because the differences are purely data (seed query, keyword signals, weights), the shared graph is both simpler and more auditable.

A concrete failure mode occurs when the LLM endpoint for the brainstorm step is unreachable or returns an error. In that case, _brainstorm_direction catches the exception, logs a warning such as brainstorm failed direction=voice-ops: <exception repr>, and returns an empty list. An operator monitoring the system would see zero candidates emitted for that vertical in the tick’s results, along with that log line. The _emit_channel_spans function would also produce a span for that (channel, vertical) pair with zero counts, which, if OTEL is configured, surfaces in metrics dashboards as a sudden drop in discovery yield for that direction—without any crash or state corruption.

Cost & performance — the real knobs

DEFAULT_CONCURRENCY

  • Knob: The constant DEFAULT_CONCURRENCY is defined in discovery_graph.py with a default value of 6.
  • Bounds: Global ceiling on the number of simultaneous async operations (HTTP requests, fetches) across all channels.
  • Effect: Raising it increases the number of in‑flight requests, reducing wall‑clock time for a batch of discovery runs at the expense of higher network/D1 concurrency pressure and memory usage. Lowering it serializes more work, increasing latency but reducing resource contention.
  • Risk: Too high a value can overwhelm upstream APIs (Common Crawl CDX, launch‑feed endpoints) or the D1 hub, causing throttling errors and retry storms. Too low a value leaves bandwidth underutilized, prolonging the discovery cycle and delaying lead generation.

DEFAULT_PER_HOST

  • Knob: The dictionary DEFAULT_PER_HOST in discovery_graph.py supplies per‑host caps: {"commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2}. These can be overridden per‑run via the per_host input.
  • Bounds: Maximum concurrent requests to each specific host (Common Crawl, LLM API, D1, launch‑feed). This is a sub‑limit beneath DEFAULT_CONCURRENCY.
  • Effect: Increasing a per‑host cap (e.g. commoncrawl to 10) lets that channel fetch more domains in parallel, reducing its latency share of the overall wave. Decreasing the cap protects a slow or rate‑limited host from backpressure that would cause retries and degrade the other channels’ throughput.
  • Risk: Setting a per‑host cap too high against a rate‑limited endpoint (e.g. LLM API) results in retry storms and wasted cost; too low starves that channel, making the wave wait on its serialized fetches.

CC_MAX_DOMAINS

  • Knob: The constant CC_MAX_DOMAINS is defined as 40 in discovery_graph.py. It caps the number of Common‑Crawl lookups performed per discovery run.
  • Bounds: Maximum number of candidate domains submitted to the Common Crawl CDX index and cached page extraction pipeline.
  • Effect: Raising this value increases the breadth of organizations discovered per run (more domains → more potential leads) but raises the total time and D1 fee cost of the Common Crawl wave. Lowering it reduces expense and latency but may miss relevant companies.
  • Risk: Too high may hit the CDX query rate limit or overflow the available memory/HTTPS connection pool, causing incomplete results; too low makes the Common Crawl channel ineffective, forcing reliance on more expensive LLM‑based discovery.

_LAUNCH_LOOKBACK_DAYS

  • Knob: The constant _LAUNCH_LOOKBACK_DAYS in discovery_graph.py defaults to 90. It governs how far back the launch‑feed channel (YC + ProductHunt) scans for recently‑launched companies.
  • Bounds: Time window (in days) limiting the age of launch‑feed entries considered as candidates.
  • Effect: A larger lookback includes older launches (more candidates), increasing the channel’s yield but also the number of candidate companies to score and deduplicate, driving up runtime and D1 read/write costs. A smaller lookback reduces noise and cost but may miss relevant companies that launched just outside the window.
  • Risk: Setting it too high dilutes relevance (old launches may be outdated or dead) and adds spurious candidates; too low misses many real, still‑active companies, narrowing the lead funnel.

_LAUNCH_SIGNAL_DECAY_DAYS

  • Knob: The constant _LAUNCH_SIGNAL_DECAY_DAYS in discovery_graph.py defaults to 180. It controls how long the product_launch signal remains weighted in pre‑scoring before it decays.
  • Bounds: Half‑life of the launch‑freshness signal used in the scoring algorithm.
  • Effect: A longer decay keeps the launch signal strong for older companies, possibly boosting their priority over competitors with no launch freshness – this increases the number of high‑score candidates flowing to outreach, but also adds compute cycles for scoring and more data to persist. A shorter decay deprioritizes older launches, focusing cost on newer opportunities.
  • Risk: Too long may over‑promote stale companies that are no longer active (wasted outreach cost); too short may ignore recently‑launched firms that haven’t yet built a web presence, missing early lead opportunities.
Failure modes — what breaks, what catches it

D1 Persist Write Failure

  • Trigger — A transient or persistent D1 database error (timeout, constraint violation, or connection drop) during the INSERT of scored seed-query candidates.
  • Guard — The except D1Error as e clause in the persist node; the node returns {"_error": str(e), "agent_timings": {…}} instead of raising.
  • PostureFail-soft. The node completes with an error payload, allowing the calling graph (e.g. pipeline_graph.run_discover) to decide whether to retry or abort. The run is not aborted outright but the inserted IDs are lost.
  • Operator signal — The _error field in the node output (e.g. "D1_ERROR: UNIQUE constraint failed") and the absence of inserted_ids. No log line is shown in the source, but the error string propagates up.
  • Recovery — No automatic retry is shown in the snippet. The operator must re-run the discovery job or manually retry the persist step. If the error is transient, a retry may succeed.

Misconfigured MicroVertical Keyword Signals

  • Trigger — A new direction (e.g. “voice-operations”) is added to micro_verticals.py with empty keyword_signals, empty negative_signals, or an incorrect seed_query such that the pre‑score or brainstorming step fails to match the intended companies (or matches irrelevant ones).
  • GuardNone. The source states “Direction is data, not code.” and MicroVertical is a plain frozen dataclass with no runtime validation of its fields. The plan_targets and pre_score functions consume these fields directly.
  • PostureFail-soft. The pipeline continues; companies may be scored low, brainstorm outputs may be garbage, or blocklisted vendors (e.g. twilio.com, vonage.com) may slip through because negative_signals is missing.
  • Operator signal — Companies in the “voice-operations” vertical with unexpectedly low pre‑scores, or the presence of pure plumbing vendors like Twilio in the output. No explicit error is raised.
  • Recovery — Manual inspection of the companies table and correction of the MicroVertical configuration, followed by a re‑run of the discovery job for that vertical.

Cursor Out‑of‑Bounds for Sub‑niche Rotation

  • Trigger — The D1 cursor value for a direction is corrupted or concurrently advanced beyond the length of the sub_niches tuple (e.g. cursor = 5 for a 4‑element tuple).
  • GuardNone visible. In plan_targets, await _read_and_advance_cursor(direction, len(niches)) returns an index that is directly used as niches[idx] without any bounds check or modulo operation in the shown code.
  • PostureFail-hard. An IndexError propagates uncaught from plan_targets, aborting the entire discovery wave for that run.
  • Operator signal — A traceback containing IndexError: tuple index out of range and the plan_targets node returning an error to the graph executor.
  • Recovery — Manual operation to reset the cursor in the D1 cursor table to a valid value (e.g. 0) and re‑run. No automatic backoff is provided.

Blocklist Bypass on the Brainstorm Path

  • Trigger — A run uses the legacy brainstorm channel (LLM‑synthetic, micro‑vertical framing) instead of the sanctioned seed-query channel. The code comment warns that this path “bypasses the blocklist that pipeline_graph.run_discover applies on the explicit path.”
  • GuardNone in the persist‑side code. The snippet shows a comment (“Drop blocklisted domains before INSERT…”) but no actual guard; the blocklist application is assumed to happen only in pipeline_graph.run_discover.
  • PostureFail-soft (or “fail‑open”). Blocklisted domains (e.g. twilio.com, ringcentral.com) are inserted into the companies table, polluting the dataset for scoring and outreach.
  • Operator signal — Companies with known blocklisted domains appearing in the “voice‑operations” vertical or other micro‑verticals, visible in database queries.
  • Recovery — Manual cleanup of inserted rows and addition of a blocklist check inside _persist_seed_candidates (or a later validation step). The operator must also avoid using the brainstorm channel until the guard is added.

OTEL Span Emission Import Failure

  • Trigger — The environment variable OTEL_EXPORTER_OTLP_ENDPOINT is set, but the opentelemetry package is missing or broken in the runtime (e.g. Pyodide worker).
  • Guard — The except ImportError: clause inside _emit_channel_spans. The function returns immediately without raising.
  • PostureFail-soft. The entire function is a no‑op; discovery continues normally, but no observability spans are emitted for per‑channel × vertical results.
  • Operator signal — Absence of expected OTel spans in the configured exporter; no error or warning is logged because the exception is silently caught.
  • Recovery — Install or restore the opentelemetry package, or unset OTEL_EXPORTER_OTLP_ENDPOINT to suppress the attempted import. No retry or backoff is implemented.
Interview — could you explain it?

Q — How does the system represent a single micro-vertical and why is the dataclass frozen?

A – Each micro-vertical is a frozen MicroVertical dataclass with fields like vertical, keyword_signals, negative_signals, and pre_score_keywords. Freezing makes the rows immutable, guaranteeing the taxonomy is pure config consumed by discovery_graph.py without risk of accidental mutation — a design choice documented in the module header as "direction is data, not code."

Follow-up – What prevents you from just using a plain dict or JSON?
A – The frozen dataclass gives compile-time type safety and a single source of truth; the module header explicitly calls out that it imports nothing from the agentic_sales package, so it stays "cheap to import and safe to read from any runtime" (CPython, Pyodide worker, JSON generator).

Weak answer misses – The key detail is the MicroVertical class itself is in micro_verticals.py and marked as @dataclass(frozen=True), not a generic dict.


Q – How do keyword_signals and negative_signals actually control which companies are considered relevant for a niche?

A – The _brainstorm_direction node passes list(mv.keyword_signals) into the DeepSeek prompt, telling the LLM which signals mark a company as in-vertical. Simultaneously, the pipeline’s scoring uses negative_signals to exclude companies that match those terms — for example, the legal-pi-demand vertical has "insurance defense" and "contract lifecycle management" as negative signals to weed out false positives.

Follow-up – So the negative signals are just used in the brainstorm prompt?
A – No, they are a separate field in MicroVertical; the negative_signals tuple is consumed by the pre_score logic (not just the prompt) to ensure companies matching them are excluded from the vertical, as documented in the dataclass field comment "excludes a company from this micro-vertical."

Weak answer misses – The exact field name negative_signals and its role in both prompt and scoring, plus the per-vertical granularity (not global blocklist).


Q – Why this design of a per-vertical negative signal list instead of a single global blocklist?

A – Each micro-vertical targets a tight industry niche, and a global list would block companies that are perfectly valid in another vertical. For example, "contract lifecycle management" is a negative signal for legal-pi-demand (plaintiff PI only) but could be a core signal for a different vertical like legal-immigration. The MicroVertical dataclass keeps per-vertical negative_signals so the same string can be positive in one niche and negative in another — a decision that avoids a monolithic, hard-to-maintain blocklist.

Follow-up – How does the system avoid re-suggesting already-known companies?
A_brainstorm_direction accepts an exclude_domains parameter (populated from existing candidates), and the LLM prompt instructs it to skip those domains; additionally the dedupe node runs after all channels.

Weak answer misses – The crucial point that negative signals are per-vertical, not global, and that many negative signals are actually positive keywords for other verticals in the MICRO_VERTICALS dict.


Q – Describe how sub-niche rotation works and why it prevents the LLM from re-mining the same names.

A – The plan_targets node reads each direction’s sub_niches tuple and advances a per-vertical cursor via _read_and_advance_cursor. That cursor index picks the next sub-niche string, which is passed as sub_niche to _brainstorm_direction. The module header explains that the discovery loop "windows through these one per tick … so DeepSeek explores a fresh applied pocket each time instead of re-mining the same well-known names."

Follow-up – What happens if a vertical has an empty sub_niches tuple?
A – The docstring says "Empty → the brainstorm uses only the seed_query/keyword_signals framing" — plan_targets skips cursor logic entirely, and _brainstorm_direction receives sub_niche=None.

Weak answer misses – The exact function _read_and_advance_cursor (not just a random index) and that the cursor is stored in an external table (implied by _ensure_cursor_table).


Q – What is the role of pre_score_keywords and how does the scoring fallback chain work when per-vertical weights are defined?

A – The pre_score_keywords tuple marks boost terms used by the pre_score function in discovery_graph.py to give extra weight to companies matching those terms. Additionally, each MicroVertical can define a per-niche sub_niche_score_weights list of (sub_niche, weight) pairs; the method get_sub_niche_weights() linear-scans that list. If it returns None, the pre_score logic falls back to the score_weights tuple defined on the MicroVertical itself — and if that also is empty, it falls back to the IndustryProfile.score_tiers path (documented in the field comment for score_weights).

Follow-up – Why a linear scan instead of a dict lookup?
A – The docstring says "Linear scan over :attr:sub_niche_score_weights; fine for the small registry size" — the taxonomy is small (eight verticals shown in the README), so a dict adds unnecessary complexity when an O(n) scan is negligible.

Weak answer misses – The three-level fallback chain: sub_niche_score_weightsscore_weightsIndustryProfile.score_tiers, and the explicit mention of linear scan being intentional for a small registry.

08. Sub-Niche Rotation

Each vertical owns a list of very narrow sub-niches, and a stored cursor cycles through them. A small table holds one row per vertical, and that row keeps a simple counter. When a new run begins, the system reads the current counter and grabs the matching sub-niche, such as voice agents for dental front desks. Then it nudges the counter forward by one, wrapping back to the start once it runs off the end.

The rotation is deliberate. Each run fishes a different pocket of the market instead of the same broad pond every time. One day it leans on dental scheduling, the next on restaurant ordering. Over enough runs, every sub-niche takes its turn, so no corner stays ignored. The counter lives in the database, which means it survives restarts and overlapping runs, and it advances exactly once per logical run. The trade-off is patience. Any single run sees only a thin slice of the vertical, so broad coverage builds up slowly across many runs rather than all at once.

The discovery_cursors table stores a rotating index per vertical to sequentially cycle through sub-niches.

python
_CURSOR_DDL = """
CREATE TABLE IF NOT EXISTS discovery_cursors (
    vertical   TEXT PRIMARY KEY,
    sub_niche_idx INTEGER NOT NULL DEFAULT 0,
    updated_at TEXT NOT NULL
)
"""

async def _ensure_cursor_table() -> None:
    """Idempotent DDL — creates discovery_cursors if it does not exist yet."""
    try:
        await d1_run(_CURSOR_DDL, [])
    except D1Error as exc:
        log.warning("discovery_cursors DDL failed (continuing): %s", exc)
ELI5 — the plain-language version

Think of a chef who runs a weekly rotating specials menu: Monday is Thai curries, Tuesday is Italian pasta, Wednesday is Mexican tacos—each day explores a different cuisine instead of serving the same dish every time. This subsystem does exactly that for discovering new companies—it systematically cycles through narrow sub-categories (called sub-niches) so that each run explores a fresh pocket of the market rather than the same broad pond.

Here’s how the rotation actually works. Each market vertical (for example, “AI for accounting”) owns a list of very specific sub-niches like “AI month-end close automation” or “AI expense management.” A small database table named discovery_cursors stores one row per vertical, keeping a simple counter called sub_niche_idx. When a new run begins, the function _read_and_advance_cursor() reads that counter, uses it to pick the matching sub-niche (e.g., "voice agents for dental front desks"), then nudges the counter forward by one. When the counter runs off the end of the list, it wraps back to the start—so every sub-niche gets its turn over time.

The trickiest part is that the list can be re-ordered before the cursor even sees it. A feature called steering (via reorder_sub_niches_by_priority) re-ranks sub-niches based on how productive or cheap they were in past runs—using a priority dictionary that reflects actual yield and cost. Sub-niches with low priority can be dropped entirely if they fall below a configurable floor. The cursor then advances over this re-ordered list, so the system naturally favors sub-niches that have delivered good results without ever inventing new ones. Without this rotation, every run would brainstorm the same sub-niche, wasting time re-mining the same well-known companies and missing entire segments of the market—like a chef who only ever serves Monday’s special.

Data flow — one request, in order
  1. _read_and_advance_cursor(vertical, sub_niche_count) is called by the discovery loop for the current vertical, e.g., "voice-ops", to retrieve the next sub‑niche index.

    • reads / writes — reads the discovery_cursors table via d1_one on the vertical column; writes to the same table via d1_run with the updated sub_niche_idx and updated_at timestamp.
    • branch — if the SELECT returns no row, the current index defaults to 0; otherwise it is taken modulo sub_niche_count. The happy path always computes and returns a valid zero‑based index.
  2. d1_one("SELECT sub_niche_idx FROM discovery_cursors WHERE vertical = ?", [vertical]) executes the read.

    • reads / writes — reads the sub_niche_idx value from the discovery_cursors table; no writes.
    • branch — on D1Error (line except D1Error as exc), the function catches the exception and returns 0 as a fallback, then continues to advance. The happy path proceeds with the row.
  3. current = int((row or {}).get("sub_niche_idx") or 0) % sub_niche_count computes the index for this tick.

    • reads / writes — consumes the row dict (or empty), the sub_niche_idx field, and sub_niche_count; returns current.
    • branch — if row is falsy, current becomes 0 (no prior cursor). This is the happy fallback for first tick.
  4. next_idx = (current + 1) % sub_niche_count computes the next index to store.

    • reads / writes — reads current and sub_niche_count; no external state.
    • branch — always executed; no conditional.
  5. d1_run("INSERT INTO discovery_cursors ... ON CONFLICT ... UPDATE SET ...", [vertical, next_idx, now]) persists the advanced index.

    • reads / writes — writes next_idx and updated_at to the discovery_cursors table for the given vertical. Reads exist via conflict detection.
    • branch — on D1Error, the function logs a warning and returns 0 as fallback (safe degrade). Happy path completes the upsert.
  6. Return current from _read_and_advance_cursor to the caller.

    • reads / writes — returns the integer index; no further side effects.
  7. The caller selects sub_niche = mv.sub_niches[current] from the MicroVertical’s sub_niches tuple.

    • reads / writes — reads the sub_niches attribute of the MicroVertical dataclass; writes nothing.
    • branch — if sub_niches is empty (len=0), the _read_and_advance_cursor call would have used sub_niche_count=0 and returned 0, so the tuple is empty and the caller presumably skips sub‑niche framing. Happy path: index is within range.
  8. Calls _brainstorm_direction(mv, sub_niche=sub_niche) with the specific sub‑niche string.

    • reads / writes — consumes the MicroVertical object and the sub‑niche string; will produce a candidate list later.
    • branch — none at this call site.
  9. Inside _brainstorm_direction, constructs a payload dict with keys seed_query, vertical, keywords, geography, exclude_domains, sub_niche.

    • reads / writes — reads mv.vertical, mv.keyword_signals, exclude_domains parameter; writes the dict to be passed to brainstorm.
    • branch — none; always builds the payload.
  10. Calls brainstorm(payload) to invoke the DeepSeek LLM.

    • reads / writes — reads the payload dict; writes nothing directly; the LLM call is external.
    • branch — if brainstorm raises an exception (e.g., missing API key), the except block catches it and returns []. Happy path continues with the response.
  11. Iterates over out.get("candidates") or [] — each candidate is a dict.

    • reads / writes — reads the candidates list from the brainstorm output; writes nothing yet.
    • branch — if the list is empty or out lacks the key, the loop body never executes and the function returns []. Happy path loops over each candidate.
  12. For each candidate, domain = blocklist.canonicalize_domain(str(c.get("domain") or "")) normalises the domain.

    • reads / writes — reads candidate domain field; writes domain variable.
    • branch — if domain is empty or lacks a dot ("." not in dom), the candidate is skipped via continue. Happy path proceeds to build a result dict.
  13. Appends a result dict with keys name, domain, why, why_in_vertical, confidence, evidence, vertical (set to mv.vertical).

    • reads / writes — reads candidate fields name, why_in_vertical or why, confidence, evidence; writes to the accumulating result list.
    • branch — no internal branch; always appends for valid domains.
  14. Returns result (list of candidate dicts) to the caller.

    • reads / writes — returns the list; if no valid candidates survive, returns []. Terminal step of the sub‑niche rotation flow.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The Sub-Niche Rotation subsystem ensures that the discovery platform systematically explores every corner of a market rather than repeatedly mining the same well-known names. The ordered mechanism begins in the plan_targets node of discovery_graph.py. When the _ATS_PATH_SOURCES are active, the node resolves each direction via micro_verticals.resolve_directions and then, for each direction that has a non-empty sub_niches tuple, it calls _read_and_advance_cursor to retrieve the current index into that vertical’s list. The cursor table (discovery_cursors) is guaranteed to exist because _ensure_cursor_table runs first, issuing d1_run with _CURSOR_DDL. The _read_and_advance_cursor function atomically reads the stored sub_niche_idx, returns it, then increments it modulo the length of the sub_niches list. The selected sub-niche is stored in _direction_sub_niches and later consumed by _brainstorm_direction to rotate the framing of the DeepSeek brainstorm query. On failure — for example, a D1Error from _ensure_cursor_table or _read_and_advance_cursor — the code logs a warning but does not crash the entire discovery run; the plan_targets node returns an empty _direction_sub_niches dict, and the brainstorm reverts to using only the seed_query and keyword_signals framing, degrading gracefully.

The core invariant the design preserves is cursor persistence and fair round‑robin per vertical. Each vertical owns its own integer index in discovery_cursors, which is advanced exactly once per tick (the discovery run). This guarantees that over enough runs each sub-niche is visited in strict rotation, and no two consecutive ticks select the same sub-niche for the same vertical. The cursor is never reset, so even after a crash or restart the rotation picks up where it left off. There is no exactly‑once guarantee for the advancement—a partial failure after reading but before writing could cause a double‑visit or skip—but the design tolerates this because re‑visiting a sub‑niche is harmless, simply yielding fewer new candidates.

The key trade‑off is deterministic rotation versus random or static selection. An obvious alternative would be to run the same single seed search for every tick, relying on the LLM’s randomness to uncover new companies. That alternative is rejected because it wastes compute on re‑mining the same high‑visibility names and leaves unexplored pockets of the vertical, violating the “cover every corner” goal. Another alternative would be to run all sub‑niches simultaneously in one tick, which would maximize coverage per run but would massively increase API costs (multiplying the count of DeepSeek calls per tick) and risk rate‑limiting. By rotating one sub‑niche per tick, the system caps each tick’s cost to a single brainstorm per direction while still converging to full coverage over time. The cost avoided is discovery stagnation—the system would otherwise produce diminishing returns as it kept returning the same established companies, making the seed‑query channel ineffective for finding emerging niches.

A concrete failure mode is a persistent D1 unavailability that causes _ensure_cursor_table or _read_and_advance_cursor to raise D1Error repeatedly. An operator would see the log line "discovery_cursors DDL failed (continuing)" or a similar warning around plan_targets, and the _direction_sub_niches dict would remain empty. As a result, the _brainstorm_direction path for that vertical would not receive a sub_niche argument, falling back to the generic seed‑query framing. This means the brainstorm explores the same familiar space each tick, gradually producing fewer unique candidates. The operator could confirm by checking the discovery_cursors table (if accessible) for missing rows or by observing that the opportunity count per vertical stops growing after a few runs, even though the system logs no hard error.

Cost & performance — the real knobs

The subsystem spends time and money primarily in parallel network I/O and database operations during discovery and persistence. The sub‑niche rotation itself is cheap (a single cursor read/upsert), but the breadth of each run is governed by how many sub‑niches are processed and how aggressively the engine fans out. The following real performance knobs from the source code directly control latency, throughput, and cost.

  • DEFAULT_CONCURRENCY — the global cap (DEFAULT_CONCURRENCY = 6) that limits the total number of concurrent HTTP requests across all hosts. It bounds parallelism at the top level. Raising it increases wall‑clock throughput (faster discovery) but risks overwhelming the run‑time environment (Cloudflare Workers) and raising “429” rejection rates. Lowering it serializes the discovery wave, reducing throughput and increasing total run time.

  • DEFAULT_PER_HOST — a per‑host ceiling ("commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2) that stops a single throttled source from starving others. It limits request volume to each specific service. Turning up the per‑host cap for commoncrawl or launchfeed lets those sources throw more parallel requests, accelerating their completion but increasing the chance of HTTP errors or rate‑limit bans. Turning it down protects stability at the cost of slower ingest.

  • CC_MAX_DOMAINS — the constant CC_MAX_DOMAINS = 40 that caps how many domain‑level Common‑Crawl lookups can happen in a single run. This bounds the total number of CDX queries and page extractions. Higher values return more candidates per run, increasing the chance of finding relevant companies but raising D1 write volume and processing time. Lower values limit cost but also limit coverage.

  • timeout — the HTTP session timeout (timeout=20.0) passed to httpx.AsyncClient. It limits how long the client waits for a single request before giving up. A longer timeout improves reliability on slow hosts but delays the entire wave if one endpoint hangs. A shorter timeout fails fast, reducing latency at the cost of more retries or missing data from slower sources.

  • _LAUNCH_LOOKBACK_DAYS — the constant _LAUNCH_LOOKBACK_DAYS = 90 that controls how far into the past the launch‑feed channel scans for recent company launches. This bounds the candidate window size. Widening it (e.g., to 180) brings in more candidates per run, increasing the pool for scoring but also increasing API calls and database writes. Narrowing it reduces noise and cost but may miss valuable older launches.

  • _LAUNCH_SIGNAL_DECAY_DAYS — the constant _LAUNCH_SIGNAL_DECAY_DAYS = 180 that defines after how many days a product‑launch signal is considered stale and eligible for decay. This bounds the useful lifetime of a launch signal in the scoring pipeline. Turning it down forces more frequent re‑scraping or re‑checking of launch data, raising cost. Turning it up lets older signals persist longer, reducing operational cost but potentially using stale data for outreach decisions.

These knobs are overridable per‑run via the concurrency, per_host, cc_max, and sources inputs in the discovery state, allowing operators to trade off speed, coverage, and cloud‑spend on the fly.

Failure modes — what breaks, what catches it

Failure 1: D1Error during cursor read/advance

  • Trigger — A transient database error (e.g., D1 connection drop) when _read_and_advance_cursor calls d1_one or d1_run to fetch or update the discovery_cursors row.
  • Guard — The except D1Error as exc block inside _read_and_advance_cursor catches the exception, logs a warning, and falls back to returning 0.
  • PostureFail-soft — the function degrades by returning index 0, which selects the first sub-niche in the tuple (or zero if no sub-niches exist). The rest of the discovery wave continues without aborting.
  • Operator signal — The exact log line: log.warning("cursor read/advance failed vertical=%s: %s", vertical, exc). No error is raised to the caller, so the operator would see only a warning in the logs.
  • Recovery — No retry is implemented. The next tick will attempt the cursor read again; if the DB recovers, the rotation resumes from index 0 (slightly desynchronized but within acceptable bounds per the comment “one extra step is acceptable”).

Failure 2: Empty sub-niches tuple (misconfiguration)

  • Trigger — A MicroVertical definition (e.g., voice-ops) has an empty sub_niches tuple (()). plan_targets calls mv.sub_niches and finds it falsy.
  • Guard — The explicit check if not niches: continue inside plan_targets skips rotation for that direction entirely. No call to _read_and_advance_cursor occurs.
  • PostureFail-soft — the direction receives no sub-niche rotation; the brainstorm uses only the seed_query and keyword_signals framing. This is a configuration-driven degradation, not a crash.
  • Operator signal — No explicit warning is logged; the operator would notice the absence of any _direction_sub_niches entry for the direction. If monitoring expects rotation, they must compare configured sub-niches (empty) vs. actual state.
  • Recovery — Manual fix: correct the sub_niches tuple in the MicroVertical definition. The next tick automatically picks up the new list.

Failure 3: Concurrent cursor advance (race condition)

  • Trigger — Two discovery waves (or concurrent workers) for the same vertical execute _read_and_advance_cursor nearly simultaneously. Both read the same current index, then both upsert next_idx = (current + 1) % count. The second upsert overwrites the first, causing one sub-niche to be skipped.
  • Guard — The code uses ON CONFLICT (vertical) DO UPDATE SET but does not use a conditional update or compare-and-swap. The comment explicitly notes: “a second call in the same process will advance one extra step in the rotation — one extra step is acceptable for a rotation.”
  • PostureFail-soft — the rotation continues but one sub-niche is skipped. The comment accepts this as tolerable drift.
  • Operator signal — No direct signal. An operator monitoring which sub-niches are used per tick would see one niche never selected and another selected twice (except if carefully logging each cursor read — no such log exists).
  • Recovery — No automatic recovery; the next tick naturally advances to the next usable index. Over many ticks the drift averages out. To prevent, a pessimistic lock on the vertical row would be needed, but is not implemented.

Failure 4: _ensure_cursor_table creation failure

  • Trigger — On first tick for a vertical, plan_targets calls await _ensure_cursor_table() (not shown in snippet, but referenced). This function likely creates the discovery_cursors table if it does not exist. A D1 error (e.g., permissions, schema conflict) raises an unhandled exception.
  • Guard — No exception handler wraps _ensure_cursor_table() in the provided source. The call is inside plan_targets without a try/except. It is also executed before the per-vertical loop, so a failure stops the entire node.
  • PostureFail-hard — the exception propagates out of plan_targets, aborting the discovery wave for that run. The upstream caller (likely the graph runner) would see a crash.
  • Operator signal — The unhandled exception appears as a traceback in the worker logs. No specific warning or error field is set on the state (since the node never returns).
  • Recovery — Manual intervention required: fix the D1 issue (e.g., grant schema permissions, drop conflicting table). The next tick will retry the full plan_targets from scratch.

Failure 5: sub_niche_count ≤ 0 (edge case bypassing the if not niches guard)

  • Trigger — If some future code path calls _read_and_advance_cursor directly with sub_niche_count = 0 (e.g., a misconfigured vertical that has a non-empty sub-niche tuple but a bug causes count to be zero). The guard if sub_niche_count <= 0: return 0 at the top of the function activates.
  • Guard — Explicit early return: if sub_niche_count <= 0: return 0.
  • PostureFail-soft — returns 0 and does not advance any cursor. The caller (presumably plan_targets but currently guarded) would use sub-niche index 0, which might be meaningless if no sub-niches exist. If the caller is not expecting this, it could silently use an invalid index.
  • Operator signal — No log entry is generated by this guard. The operator would see no cursor advance in the discovery_cursors table and possibly stale or repeating sub-niche selections.
  • Recovery — No automatic recovery; the next tick will call the function again with the same count. Only code review or a runtime metric (e.g., cursor index never changing) would alert an operator. The fix is to correct the caller or the sub-niche configuration.
Interview — could you explain it?

Q – Warm-up
How does the discovery engine ensure it doesn’t just re‑mine the same well‑known companies every time it runs a vertical discovery?

A
Each MicroVertical row carries a sub_niches tuple (e.g., 16 niches for voice ops).
The node plan_targets reads the current cursor index for that vertical from a database table, picks the niche at that index, then advances the cursor (wrapping around). This per‑tick rotation forces the LLM brainstorm to explore a fresh pocket of the vertical each run.

Follow-up
What happens if a vertical has no sub‑niches?
Grounded answer: plan_targets skips cursor logic entirely for that direction (the sub_niche of None means no rotation).

Weak answer misses
The fact that the cursor is stored in a table (_ensure_cursor_table, _read_and_advance_cursor) so state persists across runs, not just in‑memory.


Q – Medium
Where exactly is the per‑vertical sub‑niche list defined, and how is it consumed by the brainstorm step?

A
It is defined in micro_verticals.py as the sub_niches field on each MicroVertical dataclass (e.g., voice-ops has 16 entries).
In discovery_graph.py, the node plan_targets reads the cursor, resolves the niche, and stores it in _direction_sub_niches.
Later, _brainstorm_direction passes that niche via the sub_niche parameter into the brainstorm call, which the LLM uses to frame its search.

Follow-up
How does the LLM prompt change when a sub‑niche is supplied?
Grounded answer: The sub_niche parameter is passed as a separate key in the brainstorm input dict; the prompt construction in brainstorm uses it to replace or augment the generic seed_query framing.

Weak answer misses
The rotation mechanism is not a simple increment—the cursor is read, used, and then advanced atomically via _read_and_advance_cursor, which handles the wrap‑around.


Q – Hard (design question)
Why was the cursor stored in a database table instead of using an in‑memory counter or a config‑based sequential list that resets on every deployment?

A
A database cursor (managed by _ensure_cursor_table and _read_and_advance_cursor) survives restarts, deploys, and runs across different workers.
If the index were in‑memory, every process restart or horizontal scale‑out would reset the pointer, causing the same niche to be targeted repeatedly until the cursor happened to advance.
The table gives sticky, durable rotation that lets the system converge on covering all sub‑niches over time.

Follow-up
What happens if two runs fire concurrently for the same vertical?
Grounded answer: The database read‑and‑advance is not shown as transactional, so a race could let both runs see the same index; this would be a correctness bug, but the context does not specify a locking mechanism.

Weak answer misses
The node plan_targets explicitly short‑circuits the cursor logic when the run is a seed‑query‑only path (not an ATS path), meaning rotation only applies to CommonCrawl/Launchfeed multi‑source waves.


Q – Hard
Trace the exact flow: from the MicroVertical definition of sub_niches through to how the brainstorm framing changes for one specific direction. Name every function or node involved.

A

  1. MicroVertical.sub_niches (in micro_verticals.py) holds the tuple, e.g., for voice-ops it has 16 entries.
  2. plan_targets (in discovery_graph.py) calls _ensure_cursor_table(), then for each direction calls _read_and_advance_cursor(direction, len(niches)) to get the index, and stores direction_sub_niches[direction] = niches[idx].
  3. The _direction_sub_niches dict is returned and merged into state.
  4. Later, the node that runs brainstorm (likely the multi‑source wave) reads state["_direction_sub_niches"] and passes the value as sub_niche to _brainstorm_direction.
  5. Inside _brainstorm_direction, the sub_niche string is injected into the brainstorm input dict under the key "sub_niche".
  6. The brainstorm function (imported) constructs a DeepSeek prompt that includes that niche, altering the model’s search framing.

Follow-up
How does the system guarantee it eventually cycles through all sub‑niches if a vertical has, say, 16 niches?
Grounded answer: _read_and_advance_cursor advances the index by 1 each time and wraps when it reaches len(niches), so after 16 runs it will have hit every niche exactly once (assuming no concurrency collisions).

Weak answer misses
The condition if not niches: continue in plan_targets—directions with an empty sub_niches tuple never touch the cursor, so the rotation logic is entirely opt‑in per vertical.

09. Scoring And Dedup

To rank candidates, the system gives each one a pre-score from weighted keywords. Every keyword carries a number that reflects how telling that word really is for the niche. A sharp phrase like accounts payable might weigh zero point six, while a vague word like ledger weighs only a fraction of that. The weights add up, and a minimum threshold keeps just the strongest matches. That stops weak names from soaking up later outreach.

Duplicate removal then works in two layers. First, the system checks an ignore list capped at a thousand domains. It holds large incumbents and job boards that should never be targets, and any domain on it is dropped at once. Second, before a new company is saved, it is compared against every company already stored. If a match turns up, that entry is skipped, and only genuinely new rows are written. The trade-off is that the ignore list is bounded, so a very old excluded domain can eventually age out and reappear. In practice the cap saves far more cost than that rare slip ever risks.

Scoring candidates with per-keyword weights and deduplicating against existing domains.

python
def pre_score(state):
    active_weights = ...  # resolved from sub_niche > vertical > default
    scored = []
    for c in state.get("filtered") or []:
        why_lower = (c.get("why_in_vertical") or "").lower()
        s = sum(w for kw, w in active_weights if kw in why_lower) if active_weights else 0.0
        if s < 0.2:                       # floor threshold
            continue
        scored.append({**c, "pre_score": min(s, 1.0)})
    scored.sort(key=lambda x: x["pre_score"], reverse=True)
    return {"scored": scored}

async def dedupe(state):
    candidates = state.get("candidates") or []
    domains = [c["domain"] for c in candidates]
    existing = set()
    # … batch-check domains against the `companies` table …
    for sub in chunked(domains, MAX_PARAMS):
        rows = await d1_all(
            "SELECT canonical_domain FROM companies WHERE canonical_domain IN ...", ...)
        existing.update(r["canonical_domain"] for r in rows)
    kept = [c for c in candidates if c["domain"] not in existing]
    return {"filtered": kept, "skipped_existing": len(candidates) - len(kept)}
ELI5 — the plain-language version

Imagine a talent scout scoring contestants at a competition, but instead of generic points, each skill has a specific weight—expertise in "accounts payable" might be worth 0.6 points, while a vague term like "ledger" is only a fraction of that. This subsystem exists to rank companies by how closely they fit a niche, then weed out duplicates so no two scouts pitch the same star.

The scoring begins by scanning each company for keyword signals. Every keyword in the pre_score_keywords tuple—like "AI month-end close automation"—carries a calibrated boost_weight. The system sums these weights as it finds matches, then caps the total at 1.0. A minimum threshold filters out weak contenders, so only strong leads survive to later outreach. Next, duplicate removal works in two layers. First, the system consults an ignore list capped at a thousand domains—a blacklist of giant incumbents and job boards that should never be re-suggested. Second, pattern matching on the domain catches exact repeats.

The trickiest edge is the weight cap: no matter how many high-value keywords match, the pre_score function stops at 1.0. This prevents a single company from unrealistically dominating the ranking just because it happens to touch every possible term. At the same time, negative_signals can instantly disqualify a candidate, even if its score is high—for example, excluding firms that mention "recruiting" when the niche is pure product companies. Without this subsystem, weak matches would waste sales effort, and the same company could be contacted twice, making the system look amateurish.

Data flow — one request, in order
  1. expand_seed — Extracts B2B lead‑gen facets (vertical, geography, size_band, keywords) from the seed query using an LLM call.

    • reads / writes — Consumes state.seed_query, state.vertical, state.keywords, and the focus profile’s default_seed_query. Writes seed_query (always), plus vertical, geography, size_band, keywords when the LLM succeeds.
    • branch — If SEED_SOURCE is not in _resolve_sources(state), returns {} immediately (no‑op for multi‑source runs). If state.get("vertical") and state.get("keywords") are already present, returns only {"seed_query": ...} without calling the LLM. On LLM failure, returns {"_error": ...}. Happy path: LLM returns JSON facets, which are written to state.
  2. make_deepseek_pro (called inside expand_seed) — Instantiates the DeepSeek LLM with a temperature of 0.2 for deterministic output.

    • reads / writes — No direct state access; creates an LLM object used in the next step.
    • branch — None (always succeeds or raises an exception caught in expand_seed).
  3. ainvoke_json_with_telemetry (called inside expand_seed) — Sends the system and user prompt to the LLM and returns a parsed JSON dictionary with facets.

    • reads / writes — Reads the system prompt (hardcoded extraction instructions) and user prompt containing seed_query. Returns JSON payload (e.g. {"vertical": ..., "keywords": ...}).
    • branch — If the LLM call raises an exception, it is caught by except Exception in expand_seed, which writes _error. Happy path: returns the parsed dict.
  4. brainstorm node (graph node; internally calls _brainstorm_direction per micro‑vertical) — Generates candidate companies for the given vertical by invoking the LLM with seed query, keywords, and optional sub‑niche framing.

    • reads / writes — Consumed: state.vertical, state.keywords, state.seed_query (set to "micro-vertical:{mv.vertical}"), state.exclude_domains, state.sub_niche (from _direction_sub_niches). Returns a dict with key candidates (list of candidate dicts) under the caller’s context.
    • branch — If the LLM is unavailable or the brainstorm function raises an exception, _brainstorm_direction returns [] and logs a warning. Happy path: returns non‑empty candidate list.
  5. _brainstorm_direction (helper function, called once per micro‑vertical direction) — Orchestrates the LLM brainstorming for a single direction, filtering out invalid domains.

    • reads / writes — Reads mv.vertical, mv.keyword_signals, exclude_domains, sub_niche. Returns a list of candidate dicts with name, domain, why_in_vertical, confidence, evidence, vertical.
    • branch — Iterates over each candidate in out.get("candidates"); skips entries where the domain is missing or invalid (no dot). Fans out over all micro‑verticals selected for this run.
  6. dedupe node (graph node, called after brainstorm) — Removes duplicate candidate entries based on the website domain, keeping only the first occurrence of each unique domain.

    • reads / writes — Consumes the list of candidates (likely from state key candidates or an intermediate accumulator). Returns a deduplicated list (or mutates state to replace the candidate list).
    • branch — If the candidate list is empty, the node is a no‑op. Happy path: deduplication proceeds, eliminating redundant domains.
  7. pre_score node (graph node, called after dedupe) — Applies keyword scoring using per‑vertical pre_score_keywords (each a (keyword, weight) pair) and the candidate’s description; drops candidates whose total weighted score falls below a configurable floor.

    • reads / writes — Reads the deduplicated candidate list and the current MicroVertical’s pre_score_keywords (from micro_verticals module). Writes back the scored and filtered candidate list (with scores attached).
    • branch — For each candidate, scans its description, multiplies matched keyword weights, sums them (capped at 1.0), and drops candidates with a total below the floor. No early return on empty list; it simply returns an empty list.
  8. persist node (implied terminal node) — Inserts the final scored candidates into the database using ON CONFLICT DO NOTHING to avoid duplicates.

    • reads / writes — Consumes the scored list from state (set by pre_score), plus state.vertical, state.geography, and profile tags. Writes nothing to state; inserts records to an external database.
    • branch — If scored is empty, returns immediately with zero inserted. Uses ON CONFLICT DO NOTHING so only new domain+vertical combinations are inserted; existing ones are skipped (counted via changes).
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The scoring and deduplication subsystem begins with the _brainstorm_direction function, which calls the brainstorm LLM to generate candidate companies for each micro-vertical. Each candidate comes with a numeric confidence field between 0 and 1. These candidates then enter the keyword‑scoring stage: the system evaluates each candidate against the micro_verticals.MicroVertical dataclass fields keyword_signals, negative_signals, and pre_score_keywords. The pre_score_keywords are per‑keyword calibrated weight pairs—the pre_score function (referenced in the MicroVertical docstring) sums matched weights and caps the total at 1.0. Any candidate whose summed weight falls below an implicit floor is dropped. Surviving candidates are then deduplicated by their canonical domain (blocklist.canonicalize_domain), and the final set is persisted via _persist_seed_candidates. If that D1 write fails, a D1Error surfaces as {"_error": str(e), …}. The ordered flow is therefore: brainstorm → keyword scoring → domain dedup → persist with error handling.

The design preserves the "grounding guard" invariant named in discovery_graph.py: “the write boundary itself must refuse LLM‑invented (synthetic) rows—never rely solely on the upstream discover strip.” This means the D1 table enforce a write‑time check that only companies from real‑data sources (launchfeed, CommonCrawl) or properly tagged brainstorm outputs are accepted; brainstorm is the sole synthetic channel, and its output is explicitly marked with a source:claude‑brainstorm tag. The guarantee prevents hallucinated companies from ever entering the scoring/outreach pipeline, even if the LLM produces plausible‑looking results.

The key trade‑off is choosing one parameterised graph over several separate workers per vertical. As stated in micro_verticals.py: “that is exactly why one parameterised graph beats several separate workers.” The obvious rejected alternative is a dedicated graph or worker for each of the eight micro‑verticals. That alternative would require duplicated logic, separate deployment slots, and per‑vertical maintenance, raising operational complexity. By centralising the logic in discovery_graph.py and driving behaviour solely from the frozen MicroVertical table (seed query, keyword signals, weights, sub‑niches), the system avoids the cost of branching code; any new vertical is a pure data addition with zero code changes.

A concrete failure mode occurs when the LLM is unavailable during brainstorming: _brainstorm_direction catches a generic Exception and logs "brainstorm failed direction=%s: %r" at WARNING level, then returns an empty list. The operator would see that log line in the runtime logs, paired with the vertical’s mv.vertical string (e.g. "legal‑pi‑demand"). Because no candidates are produced, the _persist_seed_candidates call receives an empty list and the node proceeds without error, but the direction_sub_niches rotation still advances—so subsequent ticks may skip that vertical’s brainstorm path entirely until the LLM recovers. The absence of error in the D1 write path masks the issue unless someone monitors the log level.

Cost & performance — the real knobs

The subsystem spends time on LLM brainstorm API calls (generating candidate companies), keyword scanning of descriptions against pre‑built term lists, and deduplication by domain normalization and a seen_domains set. Money flows primarily to DeepSeek API tokens for brainstorm calls and to Common Crawl / launch‑feed network requests. The following performance knobs, drawn directly from the source, control these costs.

  • concurrency (default 6, via DEFAULT_CONCURRENCY)

    • Bounds — Global cap on simultaneous httpx connections across all hosts.
    • Effect — Higher values reduce wall‑clock time by running more fetches in parallel, but inflate API‑call concurrency and D1 write pressure.
    • Risk — Too high: rate‑limit errors, D1 contention, or memory exhaustion in the async client pool. Too low: under‑utilized parallelism, longer runs.
  • per_host (default dictionary: {"commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2})

    • Bounds — Per‑host ceiling inside the global concurrency cap; prevents one slow host from starving others.
    • Effect — Raising a host’s limit lets more requests to that endpoint overlap, trading throughput for that source against fairness to other hosts.
    • Risk — Over‑allow llm: spikes in DeepSeek API latency or cost; under‑allow d1: D1 batch writes serialise, increasing persist time.
  • cc_max (default 40, via CC_MAX_DOMAINS)

    • Bounds — Maximum number of domains queried in the Common Crawl CDX index per run.
    • Effect — Raising it pulls more candidate domains from CC, increasing discovery breadth but also CDX request volume and subsequent page‑extraction I/O.
    • Risk — Too high: runs stall waiting on CC API rate limits; too low: misses real companies that only appear in CC.
  • _LAUNCH_LOOKBACK_DAYS (constant 90, in discovery_graph.py)

    • Bounds — How far back the launch‑feed channel (YC / ProductHunt) searches for recently launched companies.
    • Effect — Extending the window yields more candidates per run, raising API calls to launch‑feed sources and storage persistence.
    • Risk – Too wide: re‑ingests stale launches, inflating dedup work and opportunity‑record age. Too narrow: misses valuable new companies.
  • pre_score_keywords (tuples of keyword strings in each MicroVertical; no global default, defined per‑direction)

    • Bounds — The set of weighted terms that drive relevance scoring and candidate rejection.
    • Effect – Adding keywords or adjusting their weights (via the V43 per‑keyword boost mechanism) tightens or loosens the filter, directly controlling how many candidates survive to persist and how much LLM‑token cost is wasted on irrelevant brainstorm suggestions.
    • Risk – Over‑specialised keywords drop too many candidates, starving the pipeline; generic keywords let noise through, wasting storage and downstream scoring.
Failure modes — what breaks, what catches it

Dedupe D1Error

  • Trigger — The dedupe function queries Cloudflare D1 for existing domains using d1_all with in_clause. A D1Error is raised if D1 is unreachable, times out, or returns an error.
  • Guard — The except D1Error as e: clause in dedupe catches it and returns {"_error": f"dedupe: {e}"}.
  • Posture — Fail‑soft: the _error is stored in the state dict; subsequent agents (e.g., pre_score) check state.get("_error") and exit early, effectively aborting that branch. The overall run continues but with an error code.
  • Operator signal — The returned _error field contains "dedupe: " followed by the exception text. No log line is shown in the source.
  • Recovery — No automatic retry or backoff. The operator must fix the D1 connectivity issue and re‑run the pipeline.

pre_score KeyError on micro_verticals

  • Trigger — In pre_score, the line mv = micro_verticals.get(vertical_tag) if vertical_tag else None is wrapped in a try: ... except KeyError: mv = None. Although dict.get does not raise KeyError, the guard exists for a possible direct key lookup elsewhere. A KeyError would occur if vertical_tag is missing from the dictionary in a different code path.
  • Guard — The except KeyError: clause sets mv = None.
  • Posture — Fail‑soft: mv becomes None, causing fallback to per‑vertical score_weights (V23) or IndustryProfile.score_tiers. No data loss; scoring uses less specific weights.
  • Operator signal — Silent; no _error or log is produced.
  • Recovery — Automatic: the fallback weight set is used. No manual action required, but scoring accuracy may degrade.

sub_niche_score_weights match failure

  • Trigger — The MicroVertical method linear‑scans sub_niche_score_weights (for key, weights in self.sub_niche_score_weights: if key == sub_niche: return weights) and finds no matching key.
  • Guard — The method returns None explicitly; the caller pre_score expects None and then uses per‑vertical score_weights.
  • Posture — Fail‑soft: scoring falls back to coarser vertical‑level weights. No error is raised.
  • Operator signal — Silent; no log or metric.
  • Recovery — Automatic: pre_score continues with the fallback weights.

Dedupe empty candidates

  • Trigger — The dedupe function receives state.get("candidates") as an empty list (no candidates to deduplicate).
  • Guard — The early return: if not candidates: return {"filtered": [], "skipped_existing": 0}.
  • Posture — Fail‑soft: no work is done, an empty filtered list is passed downstream. The pipeline continues to pre_score, which will also handle an empty list gracefully.
  • Operator signal — No error; but the downstream persist function may later log "tick net_new=0 — zero new companies persisted this tick (yield collapse?)" if no other scoring contributed new companies.
  • Recovery — Automatic: no action needed, but the upstream candidate generation step may need investigation.
Interview — could you explain it?

Q — How is the raw confidence score from the brainstorming LLM preserved and used in the scoring pipeline?

A — The _brainstorm_direction function (in discovery_graph.py) extracts the "confidence" field from each candidate returned by the LLM and includes it in the result dictionary. This value is a float between 0 and 1 and is passed along to downstream nodes like pre_score, which can combine it with keyword‑based weights.

Follow-up — What happens when a candidate lacks a "confidence" field?

A — The function only copies the field if it exists; otherwise the key is omitted from the result dict. The pre_score node would then see None and likely treat it as a missing score (the exact behavior is not detailed, but the node is expected to handle a missing confidence gracefully by falling back to keyword scoring alone).

Weak answer misses — The function also filters out candidates that have no valid domain (via blocklist.canonicalize_domain), so a candidate missing a domain is dropped before scoring even begins.


Q — How does the pre_score node apply keyword‑based scoring to each candidate, and where are the term weights configured?

A — The pre_score node (referenced in the discovery graph’s pipeline as the step after dedupe) uses per‑vertical keyword weights stored in the MicroVertical.score_weights attribute (defined in micro_verticals.py). Each weight is a (keyword, boost_weight) pair; the node scans the candidate’s description for matches, sums the matched weights, and caps the total at 1.0. If the total falls below an implicit floor, the candidate is dropped.

Follow-up — Are there separate weights for different sub‑niches within a vertical?

A — Yes, the MicroVertical dataclass also has a sub_niche_score_weights attribute (a tuple of (sub_niche_name, weights) pairs) that allows per‑niche calibration. The pre_score node uses these when the current sub‑niche matches one of the keys; otherwise it falls back to the vertical’s score_weights.

Weak answer misses — The context explicitly states that the total is capped at 1.0 and that the fallback path uses the IndustryProfile.score_tiers when score_weights is empty, not a static threshold.


Q — What mechanism prevents the same company from appearing as a candidate multiple times, both within a single tick and across successive ticks?

A — The dedupe node (part of the discovery graph) removes duplicates by normalizing domains via blocklist.canonicalize_domain (called inside _brainstorm_direction). Additionally, the _brainstorm_direction function accepts an exclude_domains parameter that tells the LLM not to re‑suggest companies already in the pipeline, preventing repeat candidates across ticks.

Follow-up — Are duplicates checked only on the canonical domain, or also on the company name?

A — The code in _brainstorm_direction canonicalizes only the "domain" field; the company name is not used for deduplication. The exclude_domains list likewise contains domain strings, not names.

Weak answer misses — The dedupe node’s internal logic is not shown, but the context shows that domain canonicalization is the only explicit dedup mechanism; the graph also has a dedupe node that likely enforces the same rule.


Q (design question) — Why use a separate pre_score node with static keyword weights instead of having the brainstorming LLM incorporate those keywords into its own confidence output?

A — Keeping scoring as a separate, deterministic step (the pre_score node) allows the weights to be tuned without re‑running the expensive LLM, and ensures consistency across all candidates regardless of LLM variability. The MicroVertical.score_weights attribute is pure configuration, so the scoring logic is transparent, testable, and cheap (lightweight local computation with no external calls).

Follow-up — Doesn’t adding a separate scoring step increase the pipeline’s latency?

A — No, because the graph runs all channels concurrently under a single asyncio.gather per wave (as documented in the concurrency model of discovery_graph.py). The wall‑clock time is dominated by the slowest host, and the pre_score node is a fast in‑memory operation with no I/O.

Weak answer misses — The concurrency model explicitly states “wall‑clock ≈ slowest host, not the sum,” and the pre_score node is not mentioned as having any external dependencies.


Q — How does the sub‑niche rotation (implemented in plan_targets) affect the keyword scoring in pre_score?

A — The plan_targets node advances a per‑vertical cursor (via _read_and_advance_cursor) to select one sub‑niche per tick. When pre_score runs, it checks if the current sub‑niche has an entry in MicroVertical.sub_niche_score_weights; if so, it uses those weights instead of the vertical’s general score_weights. This allows the scoring to prioritize terms relevant to the current framing without needing to re‑extract from scratch.

Follow-up — What happens if a candidate matches keywords from a different sub‑niche than the one currently selected?

A — The scoring only uses the weights for the selected sub‑niche (or the vertical’s default if no per‑niche weights exist). The candidate is not scored against multiple sub‑niches simultaneously, so its total score depends only on the current tick’s framing.

Weak answer misses — The context documents sub_niche_score_weights as a per‑key calibration, and the plan_targets code shows that the cursor rotates independently for each direction, not globally.

10. From Hiring Signal To Lead

Discovery finds active companies by reading public job boards on tracking systems like Ashby and Greenhouse. It checks those boards against five preset hiring directions. Each direction owns its own keywords and a set of target career pages. That focus means only genuine artificial intelligence roles are picked up, and off-topic postings are ignored.

The system looks only at jobs that are live right now. An open role is a strong sign of immediate intent to hire, and most postings stay open for just a short while. So the data carries a built-in freshness window that tracks real demand. The trade-off is coverage. A company that hires through referrals, or that just closed a role, can slip past unseen.

After the scrape, new candidates pass through dedup. Then they earn a preliminary score. The valid ones are saved, each with a note of which board they came from, so we always know the original evidence. From there, discovery hands them to enrichment. That step fills in details like company size and contacts. The pipeline stays anchored to fresh hiring signals.

Persisting seed candidates with domain deduplication, blocklist filtering, and evidence capture.

python
async def _persist_seed_candidates(state: DiscoveryState) -> tuple[list[int], list[int], int]:
    scored = state.get("scored") or []
    if not scored:
        return [], [], 0
    # … blocklist check …
    blocked = {b.domain for b in await asyncio.to_thread(blocklist.list_all)}
    scored = [c for c in scored if c["domain"] not in blocked]

    inserted_ids, existing_ids = [], []
    now = datetime.now(timezone.utc).isoformat()
    for c in scored:
        tags_list = ["discovery-candidate"]  # plus vertical, language, market …
        # Insert with full provenance — reason + evidence
        meta = await d1_run(
            """INSERT INTO companies (key, name, canonical_domain, website,
            tags, score, score_reasons, created_at, updated_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT (key) DO NOTHING""",
            [_slugify(c["domain"]), c["name"], c["domain"], f"https://{c['domain']}",
             json.dumps(tags_list), c["pre_score"],
             json.dumps([c["why"], f"evidence:{c.get('evidence','')}",
                         f"confidence:{float(c.get('confidence') or 0):.2f}",
                         "source:seed_query"]),
             now, now])
        if int(meta.get("changes") or 0) >= 1:
            inserted_ids.append(int(meta["last_row_id"]))
        else:
            existing_ids.append(…)
    return inserted_ids, existing_ids, len(blocked_skipped)
ELI5 — the plain-language version

Think of this subsystem as a tuned radio that only picks up stations broadcasting the exact kind of help a company needs — artificial intelligence roles. Its purpose is to spot companies that are actively hiring AI talent right now, using live job postings as fresh signals.

It works by scanning public job boards like Ashby and Greenhouse, but it doesn’t read every listing. Instead, it checks each posting against a preset list of “directions” — each direction is a bundle of keywords and a list of career page URLs stored as a MicroVertical row. For example, a direction for “accounting-bookkeeping” carries keyword_signals like “close automation” and “AP”, plus negative_signals to block off-topic terms like “music streaming”. If a job matches, the company gets tagged with the direction’s name. The system also stores sub_niches — rotating sub-categories — to keep exploration fresh.

The trickiest part is the sub-niche rotation: a cursor in the discovery_cursors table advances each time the system runs. Instead of always looking for the same type of role, it cycles through different sub-niches (e.g., one run focuses on “AI voice agents for dental front-desk”, the next on “AI phone agents for restaurants”). This prevents the same well-known companies from being re-mined and uncovers niche pockets of AI hiring.

Without this subsystem, you’d drown in irrelevant job postings or only ever see the same big-name companies, missing the smaller, fast-growing startups that actually need AI engineers.

Data flow — one request, in order
  1. micro_verticals module loads MicroVertical dataclass for "voice-agent-ai"

    • reads/writes – reads keyword_signals, negative_signals, pre_score_keywords, score_weights from the frozen dataclass instance.
    • branch – no branching; the dataclass is static config.
  2. Discovery graph calls _brainstorm_direction(mv, exclude_domains=global_blocklist, sub_niche=None)

    • reads/writes – consumes mv.vertical, mv.keyword_signals; writes nothing yet.
    • branch – if mv.sub_niches is non-empty, the caller may window through them; here sub_niche=None means no rotation.
    • fan-out – this function is called once per direction (and per tick, one per sub_niche).
  3. Inside _brainstorm_direction, payload dict is constructed

    • reads/writes – reads mv.vertical, list(mv.keyword_signals), the hardcoded "remote" geography; writes a dict with keys "seed_query", "vertical", "keywords", "geography", "exclude_domains", "sub_niche".
    • branch – none.
  4. External brainstorm function invoked with the payload dict

    • reads/writes – reads the payload; writes the response (a dict with key "candidates").
    • branch – if the LLM call fails (exception or no key), _brainstorm_direction catches the exception and returns an empty list (happy‑path continues).
  5. brainstorm returns candidate dicts

    • reads/writes – each candidate dict contains "name", "domain", "why_in_vertical", "evidence", "confidence".
    • branch – if candidates key is missing or the list is empty, _brainstorm_direction returns [].
  6. _brainstorm_direction iterates over candidate list

    • reads/writes – for each candidate, reads domain and calls blocklist.canonicalize_domain(domain); writes normalized domain.
    • branch – if the normalized domain is empty or contains no dot, the candidate is skipped (invalid domain).
    • fan-out – loops over all candidates from the LLM response.
  7. Valid candidate dicts are appended to result list

    • reads/writes – reads name, domain (canonicalized), why_in_vertical (or why), confidence, evidence; writes a new dict with keys "name", "domain", "why", "why_in_vertical", "confidence", "evidence", "vertical" (set to mv.vertical).
    • branch – none (all valid candidates are added).
  8. Discovery graph passes the result list to dedupe node

    • reads/writes – reads the candidate list and a global set of already‑known domains (e.g., from exclude_domains or previous ticks); writes a deduped list (duplicate domains removed).
    • branch – if the candidate list is empty after deduplication, downstream scoring is skipped.
  9. pre_score node reads the deduped list

    • reads/writes – for each candidate, reads the vertical field to fetch the corresponding MicroVertical’s pre_score_keywords and score_weights; writes a score field (sum of matched weights, capped at 1.0).
    • branch – if the score_weights tuple is empty, falls back to the IndustryProfile.score_tiers path (not shown).
    • fan-out – loops over each deduped candidate to compute its score.
  10. pre_score returns the scored candidate list

    • reads/writes – writes the scored list into the graph state under key "scored".
    • branch – terminal step (the scored list is the output for downstream persistence or further use).
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The subsystem begins with plan_targets, which reads the _direction_sub_niches from a per-vertical cursor persisted in the D1 discovery_cursors table, rotating through each MicroVertical.sub_niches to ensure fresh framings each tick. This cursor advance is idempotent under concurrent runs via _read_and_advance_cursor(). Next, _brainstorm_direction is called per vertical, using DeepSeek with the vertical’s seed_query, keyword_signals, and the current sub_niche. It receives an exclude_domains list of already‑known companies to prevent re‑suggestions. On failure (API error, missing key, any exception), it catches all exceptions and returns an empty list, logging a warning like “brainstorm failed direction=…” but continuing the run. The results are then deduplicated by canonicalized domain and passed to _persist_seed_candidates, which writes into the companies table after the grounding guard rejects any synthetic row.

The design preserves a grounding guard—the invariant that the companies table must never contain LLM‑invented (synthetic) rows. The write boundary enforced by _persist_seed_candidates refuses any candidate that does not originate from real‑data sources; only launchfeed and commoncrawl are considered real, while the brainstorm channel is explicitly in _SYNTHETIC_SOURCES and hard‑stripped upstream. This guarantee is named directly in the source: “the write boundary itself must refuse LLM-invented (synthetic) rows — never rely solely on the upstream discover strip.”

The key trade‑off is using an LLM brainstorming step at all despite the invariant. The obvious alternative would be to skip synthetic channels entirely and rely purely on real‑data feeds (launchfeed, commoncrawl) for all candidate discovery. However, that would miss many emerging applied‑AI companies that are too new or niche to appear in those feeds. The micro‑vertical discovery instead accepts the risk of LLM‑hallucinated companies but pays the cost of a strict grounding guard—the brainstorm channel is allowed only as a seed‑query path (SEED_SOURCE), not as a direct persist path. This rejects the approach of letting the LLM output flow directly into the scoring/outreach pipeline, avoiding the noise of fabricated leads that would waste downstream effort.

A concrete failure mode is a DeepSeek API timeout or rate limit during _brainstorm_direction. The operator sees no crash—the exception is caught, and the function returns [] for that vertical. The only signal appears in the logs as a warning: "brainstorm failed direction=fintech-pi-demand: RateLimitError('429 too many requests')". The caller in the discovery graph receives zero candidates for that vertical, the vertical’s sub‑niche cursor is still advanced (because plan_targets already ran), and the run continues with the remaining directions. The operator must inspect log aggregations to notice the missing vertical’s candidates.

Cost & performance — the real knobs

The discovery subsystem spends time on concurrent HTTP requests across multiple sources (Common Crawl, LLM brainstorm, launchfeed, D1 database) and money on the LLM call volume and the number of CDX lookups. Below are the real performance knobs extracted from the source constants and overridable parameters.

DEFAULT_CONCURRENCY (overridable via concurrency input, default 6)

  • Bounds — Global parallelism cap for all async HTTP requests made by the shared httpx.AsyncClient.
  • Effect — Raising it increases throughput and reduces wall-clock time when many hosts respond quickly, but also raises instantaneous token/bandwidth consumption. Lowering serializes requests, increasing latency.
  • Risk — Too high: retry storms from rate-limited hosts, memory pressure, or timeout clusters. Too low: stalls on slow sources while other hosts sit idle.

DEFAULT_PER_HOST (overridable via per_host input, default {"commoncrawl": 6, "llm": 4, "d1": 6, "launchfeed": 2})

  • Bounds — Per‑source concurrency ceiling, preventing any single host (e.g., Common Crawl’s CDX API, the LLM endpoint, or D1) from monopolising the global cap.
  • Effect — Tuning per-host values balances load: raising the LLM cap allows more simultaneous brainstorming calls but increases inference cost; raising the D1 cap speeds up database persistence but may hit write limits.
  • Risk — Too high: a throttling host resets requests and triggers retries, wasting time and money. Too low: fast sources become artificially slow, elongating the overall run.

CC_MAX_DOMAINS (overridable via cc_max input, default 40)

  • Bounds — Maximum number of domain lookups per Common Crawl run. Controls the volume of CDX queries and subsequent WARC/page extraction.
  • Effect — Increasing it casts a wider net for real‑data candidates, raising both discovery coverage and the time spent on CDX fetching and fact/contact persistence. Decreasing it reduces latency and cost but may miss relevant companies.
  • Risk — Too high: run‑time blows up or exceeds external API rate limits; cost of WARC storage and D1 writes grows. Too low: insufficient candidates for scoring/outreach.

_LAUNCH_LOOKBACK_DAYS (constant 90)

  • Bounds — The lookback window for surfacing recent product launches from the launchfeed (YC + ProductHunt) channel.
  • Effect — Extending the window yields more candidates but includes older, potentially stale signals, increasing the volume of companies to deduplicate and persist. Shortening it reduces candidate count and speeds up the launchfeed wave.
  • Risk — Too high: outdated leads waste scoring and outreach effort; the signal‑decay boundary (_LAUNCH_SIGNAL_DECAY_DAYS, 180) caps how far decay is applied, so old entries may still be persisted. Too low: miss genuinely new companies that launched just outside the window.

HTTP client timeout (literal 20.0 seconds in httpx.AsyncClient(timeout=20.0))

  • Bounds — Per‑request timeout for every HTTP call made to any source (CDX, LLM API, launchfeed endpoints, etc.).
  • Effect — A shorter timeout forces the client to abandon slow responses early, reducing overall wall‑clock time but increasing the failure rate for hosts with high latency. A longer timeout allows more requests to succeed on slow networks, but can hold up the entire gather wave.
  • Risk — Too low: many legitimate sources become “failures,” requiring retries and wasting the cost of partial responses. Too high: a single unresponsive host can delay the whole run by the full timeout, multiplying the wall‑clock for each wave.
Failure modes — what breaks, what catches it

1. LLM API Call Failure

  • Trigger — Any exception raised by the brainstorm function (network timeouts, missing API key, rate limiting, or the LLM service being unreachable).
  • Guard — The except Exception as exc: # noqa: BLE001 — brainstorm is best-effort clause inside _brainstorm_direction.
  • Posture — fail-soft: the function returns an empty list[].
  • Operator signal — A log warning: "brainstorm failed direction=%s: %r" where %s is mv.vertical and %r is the exception.
  • Recovery — No retry; the caller receives zero candidates from that direction, and the run continues, possibly yielding no companies for that micro-vertical.

2. LLM Response with Invalid Domain

  • Trigger — The brainstorm returns a candidate whose domain field is missing, empty, or lacks a dot (e.g., "example").
  • Guard — The if not dom or "." not in dom: continue line after dom = blocklist.canonicalize_domain(str(c.get("domain") or "")).
  • Posture — fail-soft: the candidate is silently skipped.
  • Operator signal — No log, no metric; the invalid candidate simply disappears from the result.
  • Recovery — None; the remaining valid candidates are returned without any retry or alert.

3. Dedupe D1 Database Error

  • Trigger — A D1Error raised during the d1_all query that checks existing domains in the dedupe function.
  • Guard — The except D1Error as e: return {"_error": f"dedupe: {e}"} handler.
  • Posture — fail-hard: the dedupe function writes an _error key into the state, which causes subsequent graph nodes (e.g., pre_score, persist) to check state.get("_error") and return an empty dict, aborting the run.
  • Operator signal — The run output will contain {"_error": "dedupe: ..."}; no separate log line is shown in the source for this exception.
  • Recovery — No automatic retry; the operator must manually resolve the database issue and re-run the discovery.

4. Dedupe Empty Candidate List

  • Trigger — The candidates list passed to dedupe is empty (e.g., because the previous _brainstorm_direction returned no valid companies).
  • Guard — The early return if not candidates: return {"filtered": [], "skipped_existing": 0}.
  • Posture — fail-soft: the function returns an empty filtered list with zero skipped.
  • Operator signal — No log; the state’s filtered key is an empty list, and the run proceeds with no candidates for scoring or persistence.
  • Recovery — None; the pipeline continues and will produce no inserted IDs for that direction. No alert is generated.
Interview — could you explain it?

Q — How does the system define a single discovery direction, and what signals does it use to recognise relevant companies?
A — Each direction is a MicroVertical frozen dataclass. It carries keyword_signals (e.g. “voice agent” or “phone”) that mark a company as relevant, and negative_signals (e.g. telephony infrastructure terms) that exclude companies like Twilio or RingCentral.
Follow-up — Where does the system store these directions?
A — They are pure config in micro_verticals.py, a dependency-free module consumed by discovery_graph.py.
Weak answer misses — The exact negative_signals field – a shallow answer may mention only inclusion keywords.


Q — How does the brainstorming step avoid re‑suggesting companies already in the system?
A — The _brainstorm_direction function accepts an exclude_domains list that tells the LLM which domains to leave out. This list is built from known companies before the brainstorm call.
Follow-up — What happens if DeepSeek is unavailable during brainstorming?
A — The function catches all exceptions, logs a warning, and returns an empty list—fail‑soft.
Weak answer misses — The try/except with log.warning and return [], not a hard crash.


Q — Why does the system rotate brainstorm framings with a sub‑niche cursor instead of using a static seed query?
A — A single seed query tends to re‑mine the same well‑known names. The plan_targets node reads and advances a per‑direction cursor via _read_and_advance_cursor, selecting a fresh sub_niche from the MicroVertical.sub_niches tuple each tick so DeepSeek explores different pockets of the vertical.
Follow-up — How does the cursor state survive across multiple runs?
A — It uses _ensure_cursor_table to guarantee a persistent D1 table, then reads and increments the cursor atomically.
Weak answer misses — The identifiers _read_and_advance_cursor and _ensure_cursor_table; a naive answer might say “it uses a counter in memory.”


Q — The code distinguishes two brainstorm‑like paths: seed_query and brainstorm. Why is seed_query sanctioned while brainstorm is forbidden?
A — brainstorm is a legacy synthetic source that invents candidate companies; it is in the _SYNTHETIC_SOURCES frozenset and hard‑stripped in discover/persist. The seed_query path goes through expand_seed (LLM facet parse) and then a brainstorm that produces real candidates, making it persist‑eligible.
Follow-up — How does the system enforce that brainstorm cannot be re‑enabled at runtime?
A — The discover function checks active sources against _SYNTHETIC_SOURCES and removes any synthetic channels before waves begin.
Weak answer misses — The exact constants _SYNTHETIC_SOURCES and SEED_SOURCE and the distinction between the two brainstorm nodes.


Q — Each brainstorm candidate includes a reason and concrete evidence. How does the code capture those fields?
A — The _brainstorm_direction function iterates over candidates and maps the LLM‑returned why_in_vertical (or why) into the output dict under why_in_vertical, and evidence under evidence. The domain is canonicalized with blocklist.canonicalize_domain.
Follow-up — What is the “why” fallback?
A — If why_in_vertical is missing, the code uses the why field from the brainstorm response as a fallback (c.get("why_in_vertical") or c.get("why")).
Weak answer misses — The explicit mapping in the for c in cands loop and the fallback logic.

11. Why Fewer Is Better

The whole design rests on one idea. A few real, checkable candidates beat a flood of uncertain ones. Every step is built to fail open, so a single miss never takes down the run. Take the boards. We poll dozens of them, each aimed at a specific applied niche. A board name is marked verified once we confirm it against the live feed. If a name turns out wrong, the client just returns an empty list and the run keeps going. There is no error and no halt, only a graceful skip, so we can try unproven guesses with no risk.

The ignore list is capped at a thousand domains. That cap saves input cost while still excluding the whole known set, so retired domains are not suggested again. When a seed query is missing, the system never brainstorms at all, and it spends nothing on that model call. The trade-off is honest. Failing open means a broken source can go quiet without shouting for help. We accept that, because a steady stream of verifiable companies is worth more than chasing perfection.

Failed requests return partial results with an error field, allowing the pipeline to continue with whatever data was gathered.

python
async def _scrape_board(tgt: dict[str, Any]) -> dict[str, Any]:
    vendor, slug, vertical = tgt["vendor"], tgt["slug"], tgt["vertical"]
    base = {**tgt, "total_jobs": 0, "ai_count": 0, "remote_ai_count": 0,
            "ai_jobs": [], "matched_titles": [], "error": None}
    fetcher = _FETCHERS.get(vendor)
    if fetcher is None:
        base["error"] = f"unknown vendor: {vendor}"
        return base
    async with limiter.slot(vendor):
        try:
            jobs = await fetcher(client, slug)
        except Exception as exc:
            log.warning("scrape failed %s/%s: %r", vendor, slug, exc)
            base["error"] = str(exc)
            return base
    # classify jobs …
    ai_jobs, matched, remote_ai = [], [], 0
    for j in jobs:
        cls = classify_job(j)
        if cls.is_ai_role:
            ai_jobs.append({**j, "_is_remote": cls.is_remote})
            matched.append(_clean(j.get("title")))
            if cls.is_remote:
                remote_ai += 1
    base.update(total_jobs=len(jobs), ai_count=len(ai_jobs),
                remote_ai_count=remote_ai, ai_jobs=ai_jobs, matched_titles=matched)
    return base
ELI5 — the plain-language version

Imagine a chef who knows that tasting just a few key ingredients—a pinch of salt, a drop of broth—tells them everything about the soup’s quality, rather than slurping every single drop in the pot. That’s this subsystem: it finds a small number of real, checkable candidate companies instead of drowning in a flood of guesses, because a few verified leads are far more valuable than a hundred uncertain ones. Every part is built to fail open, so a single missing ingredient never ruins the whole dish.

The chef works from recipe cards called MicroVertical rows, each with a seed_query and keyword_signals that define which companies match. Instead of tasting every possible spice, the _brainstorm_direction function polls a few targeted boards—each aimed at a specific applied niche—using a DeepSeek model. A board name is only marked verified after we confirm it against the live feed. If a name turns out wrong, the client just returns an empty list and the run keeps going: no error, no halt, only a graceful skip. The chef also reorders their recipe cards each week using reorder_sub_niches_by_priority, which looks at past yield from a discovery_yield_history table to push the most productive sub-niches to the front.

The trickiest edge is the cap on the exclusion list: exclude_domains is trimmed to the first 1,000 domains. That bound keeps input tokens cheap and prevents old, already-checked companies from being re-suggested by the model. If a sub-niche’s yield history is missing (say a D1 error), _steering_priority returns an empty dict and the system falls back to pure rotation—never a crash. Without this subsystem, you’d be stuck tasting every drop: a single wrong lead could halt the entire run, and the cost of processing thousands of low‑quality candidates would ruin the budget.

Data flow — one request, in order
  1. plan_targets(state)

    • reads / writes — reads state["directions"]; resolves directions via micro_verticals.resolve_directions; writes state["_direction_sub_niches"] (a dict mapping direction to sub‑niche string).
    • branch — if the resolved sources do not include the ATS path (_ATS_PATH_SOURCES), it short‑circuits and returns {"_direction_sub_niches": {}} (the happy‑path for a seed‑query‑only run).
  2. expand_seed(state)

    • reads / writes — reads state["_error"], state["seed_query"], state["vertical"], state["keywords"]; after LLM extraction it writes state["seed_query"] (or backfills vertical/geography/keywords).
    • branch — if _error already exists, or the active source isn’t SEED_SOURCE, or both vertical and keywords are already present, it returns {} (no‑op). On LLM exception it writes {"_error": "expand_seed: …"} and returns — the run does not abort.
  3. brainstorm (node)

    • reads / writes — reads state["seed_query"], state["vertical"], state["keywords"], state["exclude_domains"], state["_direction_sub_niches"]; writes state["brainstorm"] (list of candidate dicts from _brainstorm_direction).
    • branch — this node gates on the presence of seed_query. It fans out by calling _brainstorm_direction once per vertical direction (for a single seed‑query that is one call). Inside _brainstorm_direction, any exception causes the function to return [] (fail‑soft); the node then merges the empty list without blocking.
  4. _brainstorm_direction(mv, exclude_domains, sub_niche)

    • reads / writes — reads mv.vertical, mv.keyword_signals, mv.seed_query, the exclude_domains parameter, the sub_niche parameter; writes nothing to state directly (returns a list of candidate dicts).
    • branch — if the brainstorm() call raises any exception, it returns [] (the happy path continues with whatever candidates were already found). Also filters out domains that fail blocklist.canonicalize_domain or lack a dot.
  5. dedupe (node – named in graph comment but no source code shown)

    • reads / writes — presumably reads state["brainstorm"] and removes duplicates; writes updated candidate list back to state.
    • branch — none; the step runs unconditionally if candidates exist.
  6. pre_score (node – named in graph comment)

    • reads / writes — reads state["brainstorm"] (or the deduped list) and the pre_score_keywords from the relevant MicroVertical; writes state["scored"] with scores (capped at 1.0 per the MicroVertical doc).
    • branch — if no candidates remain, returns early with empty scored list.
  7. persist_seed_candidates (node that calls _persist_seed_candidates(state))

    • reads / writes — reads state["scored"] (list of scored candidates); also reads state["vertical"], state["geography"], and _profile(state).tags for the row; writes to the companies table (inserts/updates via ON CONFLICT DO NOTHING).
    • branch — if scored is empty, returns ([], [], 0) immediately (no work). Inside _persist_seed_candidates, blocklisted domains are dropped before the INSERT.
  8. _persist_seed_candidates(state) (internal function)

    • reads / writes — reads state["scored"], state["vertical"], state["geography"], profile_tags; writes inserted_ids, existing_ids, blocked_skipped counts.
    • branch — if a D1Error is raised, the caller catches it and returns {"_error": str(e), "agent_timings": …}. The run does not halt; the error is recorded in the state.
  9. (Terminal) Return of final state

    • The graph returns a dict that includes any _error key (if one was set in earlier steps) plus agent_timings and the discovery results.
    • branch — if any node set _error, subsequent nodes may still run (they check state.get("_error") only in expand_seed for short‑circuit, but other nodes do not stop). The run always completes, carrying the error forward.

Control loops and fan‑out:

  • The brainstorm node loops over each resolved direction (one in a typical seed‑query) and calls _brainstorm_direction per direction.
  • _persist_seed_candidates iterates over scored candidates to drop blocklisted domains and insert them one by one (or batch).
  • The entire pipeline is designed to fail open: every exception in expand_seed, _brainstorm_direction, and the persist step produces a partial result or an _error marker rather than aborting the run.
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The discovery system’s ordered mechanism begins with plan_targets, which resolves directions from a requested list and, for each micro-vertical, reads and advances the sub-niche cursor via _read_and_advance_cursor. This cursor rotation (V04) ensures each tick of the discovery loop explores a fresh sub-niche from the sub_niches tuple of the MicroVertical definition, preventing the LLM from re-mining the same well-known names. Next, the _brainstorm_direction function is invoked per direction, calling the DeepSeek model with the seed_query (set to "micro-vertical:{mv.vertical}"), the direction’s keyword_signals, and the currently selected sub_niche as framing. This step is explicitly best-effort: if the LLM is unavailable (no API key, network error), the function catches the exception, logs a warning like "brainstorm failed direction=legal-pi-demand: ...", and returns an empty list — the run does not stop. Finally, in the persist phase, _persist_seed_candidates is guarded by a grounding guard: the write boundary itself must refuse LLM-invented (synthetic) rows; brainstorm is the only synthetic channel, while launchfeed and commoncrawl are real-data sources that pass through untouched. If any brainstorm candidate is synthetic, it is stripped before persistence.

The invariant that the design preserves is precisely this grounding guard — no LLM-invented (synthetic) company rows ever reach the companies table that feeds scoring and outreach decisions. The system explicitly states: “never rely solely on the upstream discover strip”. Only candidates from real-data sources (launchfeed, commoncrawl) pass through freely; brainstorm candidates must be verifiable (the guard ensures they are not persisted without real-world evidence). This guarantees that every company in the database corresponds to a known domain or canonical record, preventing phantom leads from polluting downstream decisions.

The key trade-off is precision over recall: the system aggressively filters noise using keyword_signals and negative_signals per micro-vertical, and prefers a few real, verifiable companies over a long, uncertain list. The obvious alternative it rejects is a bulk-acceptance pipeline that ingests every LLM candidate regardless of verifiability. That alternative would risk high false-positive rates (hallucinated companies), which wastes outreach effort, damages sender reputation, and confuses score-based rankings. The cost avoided is the operational overhead of cleaning hallucinated rows from the database and the loss of trust in the “why” and “confidence” fields that the system records from brainstorm — instead, the grounding guard ensures only high-confidence, domain-verified candidates advance.

A concrete failure mode is a DeepSeek API timeout during _brainstorm_direction. The function catches the exception (Exception as exc), logs a warning (log.warning("brainstorm failed direction=legal-pi-demand: ...")), and returns an empty list. The operator would see that log line in real-time (or in aggregated error dashboards) for the specific vertical, and would notice that the brainstorm candidate count for that direction is zero in the discovery run’s output, while other directions complete normally. The pipeline continues to persist seed candidates from launchfeed and commoncrawl, and no error propagates to the caller — the run finishes with partial results and an informative log signal, but never a hard failure.

Cost & performance — the real knobs

The discovery subsystem spends time primarily on network I/O (Common Crawl CDX queries, launch-feed API calls, and LLM brainstorm requests) and on database writes to D1. Money flows to LLM tokens (DeepSeek calls in the seed-query path) and to D1 write operations. The system follows a fewer is better philosophy: it prefers a small set of real, verifiable companies over a long synthetic list, and each step fails open (a failed request returns a partial result, a lookup error lets all candidates pass). Tight keyword signals and explicit exclusion lists filter noise early. Performance and cost are directly controlled by the following real knobs from the source:

  • concurrency

    • Knob — input parameter concurrency; default DEFAULT_CONCURRENCY = 6
    • Bounds — maximum number of simultaneous HTTP or LLM requests (global cap)
    • Effect — raising it reduces wall-clock time (parallelism) but increases burst rate on APIs and D1, raising peak cost and risk of rate-limit hits. Lowering it spreads load and reduces peak spend but increases latency.
    • Risk — too high may trigger throttling from Common Crawl, LLM providers, or launch-feed APIs; too low starves throughput and makes runs unacceptably slow.
  • per_host

    • Knob — input parameter per_host; default DEFAULT_PER_HOST = {"commoncrawl":6, "llm":4, "d1":6, "launchfeed":2}
    • Bounds — per‑source caps on concurrent requests (< global concurrency). Each key is an independent throttle (e.g., "llm":4 limits DeepSeek calls).
    • Effect — dropping a per‑host cap (e.g., "llm":1) cuts LLM token cost and avoids rate‑limit errors, but serialises that channel, slowing the run. Raising it accelerates that source but risks cost spikes and host‑side backoff.
    • Risk — too high on a single host can cause that host to fail (rate limit or timeout) while other hosts remain idle; too low can make that channel a bottleneck.
  • cc_max

    • Knob — input parameter cc_max; default CC_MAX_DOMAINS = 40
    • Bounds — maximum number of Common‑Crawl CDX lookups performed per run, capping the volume of crawl data fetched.
    • Effect — increasing it pulls more candidate domains from Common Crawl, raising data quantity and time spent on CDX queries and subsequent D1 writes; lowering it restricts crawl scope, saving time and D1 write cost.
    • Risk — too high may overwhelm the D1 write budget or timeout the run; too low may miss valuable real signals, reducing lead quality.
  • _LAUNCH_LOOKBACK_DAYS

    • Knob — constant _LAUNCH_LOOKBACK_DAYS = 90 (days)
    • Bounds — how far back the launch-feed channel searches for recently launched companies.
    • Effect — a larger window retrieves more candidates, increasing API calls, I/O time, and downstream D1 writes; a smaller window reduces data volume and cost but may exclude viable leads.
    • Risk — too large may pull stale or irrelevant launches (signal decay), wasting D1 writes; too small may miss important new entrants.
  • _LAUNCH_SIGNAL_DECAY_DAYS

    • Knob — constant _LAUNCH_SIGNAL_DECAY_DAYS = 180 (days)
    • Bounds — determines how long a product_launch signal is considered relevant before it decays.
    • Effect — lengthening it retains older signals, increasing the number of candidates that survive scoring and persist, thus raising D1 cost and processing time. Shortening it prunes old signals, saving cost but possibly discarding valid leads.
    • Risk — too long may keep low‑quality or outdated candidates, inflating the database; too short may drop valuable leads before they convert.

These knobs together let operators trade latency, throughput, and dollar cost directly by adjusting parallelism, request volume, and data scope.

Failure modes — what breaks, what catches it

D1Error During Seed Candidate Persistence

  • Trigger — A database write failure (e.g., D1 constraint violation, timeout, or connection drop) occurs inside _persist_seed_candidates, which is called at the start of the persist flow.
  • Guard — The outermost try / except D1Error as e block in the persist graph’s Failures section catches it and immediately returns {"_error": str(e), "agent_timings": {"persist": ...}}.
  • Posturefail-soft. The entire persist step is aborted, but the run does not halt; the caller sees the _error key and can continue with other work (e.g., logging the failure and moving to the next tick).
  • Operator signal — The returned state dict contains an _error field with the exact D1Error message (e.g., "UNIQUE constraint failed: companies.domain"). No new inserted_ids are returned; the run proceeds with empty seed results.
  • Recovery — No automatic retry is implemented. The operator must inspect the error message, fix the underlying D1 issue (schema, capacity, or data conflict), and re-run the discovery cycle. The agent_timings provide a timestamp for the failed attempt.

D1Error During Dedupe

  • Trigger — A D1 query failure (e.g., row lock, transient network error) when executing "SELECT DISTINCT canonical_domain FROM companies WHERE canonical_domain IN ..." inside dedupe.
  • Guard — The try / except D1Error as e block in dedupe catches it and returns {"_error": f"dedupe: {e}"} (note the explicit "dedupe:" prefix in the error message).
  • Posturefail-soft. The function returns an error marker, and the calling discover or persist logic will see the _error key in the state and skip deduplication for this batch. Candidates are neither filtered nor passed through—the call effectively yields no filtered list.
  • Operator signal — The state dict from dedupe contains "_error": "dedupe: D1Error(...)". The upstream code may log this; the run continues with an empty filtered list, so no new companies are inserted for that tick.
  • Recovery — No retry. The operator must resolve the D1 read failure (e.g., database overload, connection pool exhaustion) and re-run discovery. The agent_timings field is still populated with the failed attempt’s duration.

Micro-Vertical KeyError in Pre-Score

  • Trigger — The vertical tag from the state is missing or not found in the micro_verticals dict (e.g., an unrecognized vertical string like "flying_cars").
  • Guard — The try / except KeyError block around micro_verticals.get(vertical_tag) in pre_score catches the lookup failure and sets mv = None.
  • Posturefail-soft. The scoring function falls back to the next precedence level: first per-sub_niche weights (if present), then IndustryProfile.score_tiers defaults. No error is raised, and scoring continues with a degraded weight set.
  • Operator signal — No explicit error is surfaced; the operator would only observe slightly lower score confidence if the missing vertical prevented tailored weighting. The mv variable being None is silent—no log line for the fallback is shown in the source.
  • Recovery — Automatic fallback to the default tier-based scoring. No manual step is required unless the operator wants to add the missing vertical to micro_verticals.

LLM Brainstorm Channel Attempting Persistence

  • Trigger — A misconfiguration or legacy route tries to persist companies from the brainstorm channel (the synthetic LLM-invented channel). This is explicitly forbidden by design.
  • Guard — The _SYNTHETIC_SOURCES = frozenset({"brainstorm"}) constant and the hard-strip logic in discover and persist refuse any candidate that originates from "brainstorm". The code comments confirm it “cannot be re-enabled per-run” and is “hard-stripped.”
  • Posturefail-closed. No write is allowed for synthetic brainstorm data, even if the input state includes such candidates. The persist functions will skip or reject them entirely.
  • Operator signal — The candidate list from brainstorm is silently removed before any database operation; the inserted count will not include any brainstorm-sourced companies. There is no explicit error log shown in the source for this strip—the operator must inspect the state’s source tags to verify.
  • Recovery — No recovery needed; this is a safety guard that prevents data pollution. If the operator wants brainstorm domains, they must use the sanctioned seed_query channel instead (which uses LLM but with real seed inputs and still undergoes deduplication and blocklist checks).

Host-Limiter Saturation / Starvation (Concurrency Deadlock)

  • Trigger — All slots in the HostLimiter are occupied by slow requests (e.g., a long-running Common Crawl CDX query holds all 6 CC slots for minutes). The global cap DEFAULT_CONCURRENCY = 6 and per-host caps (e.g., "commoncrawl": 6) together mean the fan-out can fully block if one host’s tasks never complete.
  • Guard — No explicit exception handler for limiter starvation is shown in the source. The limiter.slot() call is a concurrency throttle that blocks the coroutine until a slot is available; it does not raise on timeout. The system relies on the per-host caps (DEFAULT_PER_HOST) to prevent one host from hogging all slots, but there is no timeout or circuit-breaker for a single slow host.
  • Posturefail-soft by design (the run does not crash), but effectively fail-hard in practice if the starved host never completes, because the asyncio.gather per wave will wait indefinitely. The entire wave halts until slots free up.
  • Operator signal — The run hangs on that wave; no new logs or metrics are emitted. The agent timings will show a large elapsed time for that wave. No error is thrown.
  • Recovery — No automatic retry or timeout. The operator must manually cancel the run, increase global concurrency limits, reduce per-host caps, or add timeouts to the host fetchers (not shown in the source). The concurrency and per_host inputs can be overridden per run to mitigate future occurrences.
Interview — could you explain it?

Q — The system claims it prefers a few real, verifiable leads over a long list of uncertain ones. What concrete mechanism prevents LLM-invented synthetic companies from ever reaching the scoring or outreach pipeline?
A — The grounding guard at the persist boundary explicitly refuses LLM-invented synthetic rows: the comment states that brainstorm is the only synthetic channel and is hard-stripped so it cannot be re-enabled per-run. Real sources (launchfeed, commoncrawl) pass through untouched, and the default source set is DEFAULT_SOURCES = ("launchfeed",). This is documented in the discovery_graph.py file around the DEFAULT_SOURCES and the grounding guard comment.
Follow-up — What exactly is the difference between the brainstorm in _SYNTHETIC_SOURCES and the seed_query channel, which also uses an LLM?
Follow-up answerseed_query is explicitly labeled as “NOT synthetic” and is the sanctioned, persist-eligible company‑discovery path, distinct from the legacy micro‑vertical brainstorm channel.
Weak answer misses — The detail that seed_query is exempt from the synthetic ban, and that _SYNTHETIC_SOURCES = frozenset({"brainstorm"}) is a hard‑coded constant that cannot be overridden per run.


Q — The system uses a “fail‑open” philosophy: if something breaks, the run continues with what it already has. Show me one specific function that implements fail‑open and explain why it returns an empty result instead of raising an exception.
A — The _brainstorm_direction function wraps its LLM call in a blanket except and returns an empty list ([]) on any exception, logging a warning. Its docstring says: “Fail‑soft: returns [] if the LLM is unavailable …” This ensures one direction’s brainstorm failure does not halt the whole discovery run.
Follow-up — Is this the only fail‑open point in the graph? What about a failure in expand_seed?
Follow-up answerexpand_seed sets {"_error": str(e)} and returns early, but the graph continues processing other nodes because the state is not fatally aborted.
Weak answer misses — The explicit except Exception as exc: # noqa: BLE001 — brainstorm is best-effort pattern and the fact that the function returns [] instead of None, which keeps downstream consumers safe.


Q — The system rotates sub_niche for each direction every tick instead of using a fixed framing. Why was this design chosen over simply repeating the same seed query?
A — The plan_targets node reads and advances a per‑vertical cursor, picking the next sub‑niche from MicroVertical.sub_niches. The comment explains this “rotates the brainstorm / launch‑feed framing per tick so DeepSeek explores a fresh applied pocket each time instead of re‑mining the same well‑known names.”
Follow-up — What happens if a vertical has no sub‑niches defined?
Follow-up answer — The cursor logic is skipped (if not niches: continue), and the brainstorm uses only the seed_query/keyword_signals framing.
Weak answer misses — The cursor is stored and advanced by _read_and_advance_cursor, not a random choice, and the mechanism is gated on state.get("sub_niche") so callers can pin a specific niche.


Q — The system filters noise using keyword signals, but how does it handle the risk of false positives—companies that match keyword signals but are actually irrelevant?
A — The MicroVertical dataclass includes both keyword_signals (to include) and negative_signals (to exclude). The pre_score function uses weighted keyword matches, and the negative_signals tuple explicitly removes companies that match disallowed terms. The comment in micro_verticals.py says negative signals “excludes a company from this micro‑vertical”.
Follow-up — What about companies that match no keywords at all? Do they still survive to scoring?
Follow-up answer — The discover node strips domains that fail to match at least one keyword_signals from a vertical; unmatched companies are dropped before scoring.
Weak answer misses — The existence of negative_signals as a separate field, and that the system uses a combined match‑and‑reject approach, not just a positive keyword threshold.

Glossary — the domain terms, grounded in the code

15terms, each defined from this subsystem’s real source.

HostLimiter

HostLimiter is a concurrency gating class that uses a global asyncio.Semaphore plus per-host semaphores, and its async with slot(host) acquires the global cap and optionally the host's cap to prevent a single slow or throttling host from monopolizing the global budget or starving other hosts, fitting in the discovery and persist flows to control fan-out across multiple sources.

Memory hook HostLimiter gives each host its own turnstile so one slowpoke can’t hog the whole queue.

From discovery_graph.py

expand_seed

expand_seed is a graph node that asynchronously extracts B2B lead-gen facets (vertical, geography, size_band, keywords) from a seed query via a DeepSeek LLM, and only executes when the seed-query source is active, serving as the first step after START in the seed-query path before plan_targets.

Memory hook expand_seed: a seed query sprouts into B2B facets via DeepSeek LLM, only on seed-query path.

From discovery_graph.py

brainstorm

brainstorm is the only synthetic channel in this subsystem, where an LLM (DeepSeek) generates candidate companies for a micro-vertical direction; its output is stored in the `brainstorm` state key but is hard-stripped at the persist write boundary so that only real-data sources (launchfeed, commoncrawl) are used for scoring and outreach decisions.

Memory hook Brainstorm is the only synthetic channel; its LLM output is stripped at persist for real scoring.

From discovery_graph.py

dedupe

In this subsystem, dedupe is a function that filters a list of candidate domains by querying the companies table for existing canonical_domain values via a SELECT DISTINCT IN clause, returning only the subset of candidates whose domain is not already stored and providing a count of skipped_existing duplicates.

Memory hook Dedupe queries existing company domains to filter out duplicates, counting those skipped.

From discovery_graph.py

pre_score

In this subsystem, `pre_score` is a numeric value computed for each candidate by summing calibrated weights from either sub‑niche or vertical `score_weights` for every keyword found in the candidate’s `why_in_vertical` or `why` text, then capping the sum at 1.0; candidates with a `pre_score` below 0.2 are filtered out, and the remaining scored candidates are sorted descending by this value before being persisted as seed candidates.

Memory hook pre_score weighs keywords in a candidate's 'why' text, caps at 1.0, filters below 0.2, then sorts descending.

From micro_verticals.py

_resolve_sources

_resolve_sources is a function that takes a DiscoveryState and returns a set of strings representing the effective discovery channels for a run, following a defined precedence: it uses explicit sources if provided, otherwise a seed_query with no directions defaults to the seed-query channel, otherwise DEFAULT_SOURCES, and it always strips the synthetic brainstorm channel; it serves as the single source of truth shared by plan_targets, discover, _emit_channel_spans, and finalize to prevent drift.

Memory hook _resolve_sources picks real discovery channels and strips synthetic brainstorm, preventing drift across nodes.

From discovery_graph.py

_SEED_SOURCE

_SEED_SOURCE is the identifier string (likely "seed_query") that `_resolve_sources` returns when the seed-query discovery channel is active, and it is checked inside `expand_seed` to skip facet extraction for multi-source runs that do not use that path.

Memory hook _SEED_SOURCE is the gatekeeper string that lets seed-query into expand_seed and locks out multi-source runs.

From discovery_graph.py

_SYNTHETIC_SOURCES

_SYNTHETIC_SOURCES is a frozenset containing the string "brainstorm" that marks LLM-invented (synthetic) discovery channels, and during the discover function any source passed in the state that matches this set triggers a warning and is dropped so that only real data channels are used for decisions.

Memory hook _SYNTHETIC_SOURCES is the filter that bans "brainstorm" because it's made-up, not real data.

From discovery_graph.py

_ATS_PATH_SOURCES

_ATS_PATH_SOURCES is a set of source identifiers used in the plan_targets node to check whether the current run involves ATS sources; when the intersection of resolved sources with this set is empty, the function short-circuits and returns an empty dictionary, skipping the rotation cursor logic for seed-query-only runs.

Memory hook _ATS_PATH_SOURCES is the bouncer for ATS sources; if none are present, plan_targets takes the early exit.

From discovery_graph.py

DEFAULT_CONCURRENCY

DEFAULT_CONCURRENCY is the default global concurrency cap (set to 6) used by HostLimiter to gate the total number of concurrent async operations, overridable per run via the ``concurrency`` input.

Memory hook DEFAULT_CONCURRENCY is the default six-lane highway for async ops, overridable per run.

From discovery_graph.py

DEFAULT_PER_HOST

DEFAULT_PER_HOST is a dictionary that sets default per-host concurrency caps (e.g., "commoncrawl": 6) for limiting parallel requests, and is used by HostLimiter to create per-host semaphores, with the option to override per-run via the `per_host` input.

Memory hook DEFAULT_PER_HOST is the traffic cop for each host, giving commoncrawl 6 lanes and llm 4 lanes.

From discovery_graph.py

DEFAULT_SOURCES

DEFAULT_SOURCES is a tuple containing the string "launchfeed" that serves as the default set of discovery channels for directions-driven or CLI runs, selected only from real signals and excluding the synthetic brainstorm channel.

Memory hook DEFAULT_SOURCES launches you into real feed channels only, skipping brainstorm's synthetic bait.

From discovery_graph.py

_fetch_yc_launches

_fetch_yc_launches is an asynchronous function that fetches recent Y Combinator launch stories from the HN Algolia public search API, using a slug for vertical keyword-match tagging, and returns a list of company-candidate dicts with launch_source and launch_date provenance fields, with no LLM involvement and fail-soft error handling.

Memory hook _fetch_yc_launches retrieves YC launch stories from HN Algolia like a retriever dog, using slug keywords for vertical tagging.

From discovery_graph.py

_fetch_ph_launches

_fetch_ph_launches is an async function that retrieves the ProductHunt RSS feed, parses items to extract company candidates for a given vertical slug using deterministic keyword matching, and returns dicts with launch_source "producthunt" and launch_date provenance, serving as one of the vendor-specific fetchers in the launch-feed subsystem.

Memory hook Like a hound, _fetch_ph_launches hunts ProductHunt's RSS feed for keyword-matched launches.

From discovery_graph.py

_LAUNCH_LOOKBACK_DAYS

_LAUNCH_LOOKBACK_DAYS is a constant set to 90 days that defines the lookback window for filtering recent launch-feed entries from both Y Combinator and Product Hunt, used to compute a cutoff timestamp and exclude older items.

Memory hook _LAUNCH_LOOKBACK_DAYS is the 90-day rearview mirror that discards launches older than the cutoff timestamp.

From discovery_graph.py