Outreach Engine — Guide

📄 12 chapters · read at your own pace

01. Why Outreach Is A Graph

Cold outreach is not a single bulk send. It is a directed graph of small, gated steps. Each step owns exactly one concern: who to contact, whether we may, what to say, and when to follow up. The most important rule is that the graph drafts copy but never sends. Sending is a separate decision that the caller owns every time.

The same drafting engine is reused in three distinct ways. First, an autonomous pipeline sends without a human in the loop. Second, a campaign pauses for human approval. Third, a preview shows only a draft and sends nothing. Because the graph never assumes how it is invoked, the caller decides approval and sending.

The system keeps a registry that lists each graph by a short name. It pairs the name with the module that builds it. The outreach graph, the compose graph, the reply graph, and the durable campaign engine are all separate entries. The outreach graph returns a subject line, a text-only body, and an HTML body. It also returns bookkeeping. That bookkeeping is a skip reason, an engagement signal, and the time of the next touch. The graph never calls send.
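
A minimal sketch of such a registry, assuming a plain dictionary keyed by short name. The keys and builder functions below are hypothetical stand-ins: the text names the four graphs but not their registry keys or modules.

```python
from typing import Callable, Dict


# Hypothetical builders. In a real system each would live in its own module
# and return a compiled graph; here they return a label so the sketch runs.
def build_outreach_graph() -> str:
    return "outreach graph"


def build_compose_graph() -> str:
    return "compose graph"


def build_reply_graph() -> str:
    return "reply graph"


def build_campaign_engine() -> str:
    return "durable campaign engine"


# Short name -> builder. Adding a flow means adding one entry.
GRAPH_REGISTRY: Dict[str, Callable[[], str]] = {
    "outreach": build_outreach_graph,
    "compose": build_compose_graph,
    "reply": build_reply_graph,
    "campaign": build_campaign_engine,
}


def resolve_graph(name: str) -> str:
    """Look up a builder by short name and build the graph."""
    try:
        builder = GRAPH_REGISTRY[name]
    except KeyError:
        raise KeyError(f"unknown graph: {name!r}") from None
    return builder()
```

Keeping the registry a flat name-to-builder map is one way to honor "keep the registry simple": no entry knows about any other.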

Consider the three-way trade-off. A single big send function has the fewest moving parts, but it has one failure domain and no place to insert a human or a safety gate. Separate microservices per flow give clean isolation, but they make you pay a platform tax (deployment, tracing, and state plumbing) once per flow. A shared graph runtime fronted by a registry gives gated, traceable steps with additive growth. The cost is one routing layer and the discipline to keep the registry simple.

Failure modes matter. One is a missing contact row in the database. That leaves the personalization with nothing to stand on. Another is a stale engagement signal. An open recorded late will bias the next-touch gap the wrong way. The suppression gate fails closed: if the check cannot complete, the contact is treated as suppressed. This avoids risking a wrong send.
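
The fail-closed behavior can be sketched as follows. This is an illustration under assumptions, not the system's gate: `is_suppressed` is a hypothetical lookup against the do-not-contact list, and the two skip-reason strings are invented for the example.

```python
from typing import Callable, Optional


def suppression_gate(
    email: str,
    is_suppressed: Callable[[str], bool],
) -> Optional[str]:
    """Return a skip reason, or None when drafting may continue.

    `is_suppressed` is a hypothetical lookup against the suppression list.
    """
    try:
        suppressed = is_suppressed(email)
    except Exception:
        # Fail closed: a check that cannot complete counts as suppressed.
        return "suppression_check_failed"
    return "suppressed" if suppressed else None
```

The point of the shape is that an outage of the list can only produce a skipped contact, never a draft for someone who opted out.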

The choice was deliberate. The team rejected a single monolithic function because it offered no seam for a human gate. They also rejected per-flow microservices because the platform tax multiplied with each new flow. The shared graph runtime with a registry was chosen for its additive growth and traceability.

The transferable rule: use this shape when safety, grounding, and observability each need their own testable seam, and when the same copy engine must be reused under different approval policies. Do not use it when the number of verticals outgrows static configuration or when you cannot measure the faithfulness judge’s accuracy.

<!-- mem:begin -->

Generate it: The most important rule is that the graph drafts copy but never _____. (cue: never _____; answer: sends)

Generate it: Because the graph never assumes how it is invoked, the ______ decides approval and sending. (cue: the ______; answer: caller)

Ask yourself: Why does the graph draft copy but never send it itself?

Answer: Sending is a separate decision the caller owns every time, so the same engine can serve an autonomous pipeline, a human-approved campaign, or a draft-only preview without assuming how it was invoked.

Recall check (try before reading the answer):

  1. What three distinct ways is the one drafting engine reused? Answer: An autonomous pipeline (no human), a campaign that pauses for human approval, and a preview that shows only a draft.

  2. Besides subject and bodies, what bookkeeping does the outreach graph return? Answer: A skip reason, an engagement signal, and the time of the next touch.

  3. Why was the per-flow microservices option rejected? Answer: It makes you pay a platform tax — deployment, tracing, and state plumbing — once per flow.

<!-- mem:end -->

The outreach graph is a directed graph of small, gated steps that drafts copy but never sends, with the caller owning the send decision.

python
"""Email outreach graph.

Flow:
    lookup_contact
      → suppression_gate       (E22: central do-not-contact suppression list check — fail-closed)
      → check_stop_conditions  (skip if recipient already replied / bounced / unsubscribed)
      → decide_cadence         (V81: engagement-aware next-touch scheduling)
      → select_template        (free-form: always returns no template)
      → select_sequence        (V84: emit structured {sequence_id, touches} plan for the vertical)
      → extract_hook
      → draft_step             (V38: per-vertical multi-step copy; falls back to draft node for
                                step=0 when no vertical is set — backward-compatible)
      → draft                  (free-form cold email referencing the hook — step=0 fallback)
      → format_html

Produces {subject, text, html, contact_id, skip_reason}. When ``skip_reason``
is set the graph short-circuits before any LLM/IO work and returns an empty
draft for the resolver layer to handle.
"""

import logging

from langgraph.graph import END, START, StateGraph


log = logging.getLogger(__name__)

ELI5 — the plain-language version

Think of this outreach system like a restaurant kitchen where each station has one job—someone preps vegetables, another grills, another plates—but no station is allowed to serve the food. Serving is the waiter’s job. That’s the core idea: cold outreach is not one big “send” button; it’s a directed graph of small, gated steps, each owning exactly one concern—who to contact, whether we may, what to say, and when to follow up. The most important rule is that the graph drafts copy but never sends. Sending is a separate decision the caller owns every time.

Concretely, the same drafting engine is reused in three distinct ways. An autonomous pipeline sends without a human in the loop—like a self-serve salad bar. A human-approved campaign pauses for a person to check each draft before it goes out—like a tasting menu. And a one-shot preview just shows a draft without sending anything—like looking at a recipe. Because the graph never assumes how it’s invoked, it stays safe and flexible.

Without this separation, a single big send function would have one failure domain: a bug or a misstep could blast unsolicited emails, burn reputation, or skip compliance checks. There’d be nowhere to insert a human gate or a safety check. A beginner would feel that chaos—a missed bounce, an accidental repeat send, or a fabricated claim that erodes trust. The graph keeps each risk isolated and inspectable.

Data flow — one request, in order
  1. registry lookup — Resolves the graph identity by its short name from the registry, pairing it with the module that builds the outreach graph.
    reads registry record for outreach_graph; writes graph builder instance.
    branch: No early return; happy path returns the graph builder.

  2. look up the contact — Reads the contact from the database once and loads their role, seniority, department, and profile into the working state.
    reads contact database row; writes role, seniority, department, profile into working state.
    branch: Missing contact row leaves personalization with nothing to stand on (failure); happy path proceeds with snapshot.

  3. suppression gate — Checks a central do-not-contact list using a one-way fingerprint of the email address plus the domain; fails closed if the check cannot be completed.
    reads fingerprint (email+domain hash) from contact; writes audit record of the decision.
    branch: Contact on list → end run with skip reason (early return); not suppressed → continue.

  4. stop conditions — Examines the contact’s current thread state and ends the run with a machine‑readable reason if any stop condition holds (replied, bounced, unsubscribed, unverified).
    reads thread_state from contact; writes reason (one of “replied”, “bounced”, “unsubscribed”, “unverified”).
    branch: Any condition true → end run with reason; none → continue to next step.

  5. plan the sequence — Looks up the sequence definition from the vertical‑level VERTICAL_SEQUENCE_DEFS map, with sub‑niche‑level overrides if one exists.
    reads vertical and sub_niche from contact snapshot; writes sequence_def (touch_angles, steps, cadence_days, fallback_step).
    branch: Missing sub_niche or no match → fall back to vertical‑level definition; happy path picks sub‑niche variant if present.

  6. extract the hook — Reads the supplied post text (a recent public post or job description) and picks exactly one concrete hook to ground the opener.
    reads post_text from request; writes hook (single grounded fact).
    branch: Empty post_text → failure mode (opener has nothing real); non‑empty → happy path.

  7. drafting step — Looks up the directive for the current step index from the sequence definition and writes copy that fits that step’s role (opener, value, or soft close).
    reads step_index, steps directives from sequence_def, optional opportunity link; writes draft (body text).
    branch: Step index past end of sequence → uses fallback_step (generic drafting); within range → per‑step directive used.

  8. faithfulness gate — Uses a judge model to audit the draft against the assembled evidence (the hook and contact profile), removing any sentence whose claim is not supported.
    reads draft, evidence (hook + contact snapshot); writes filtered_draft, score (0–1).
    branch: Over‑aggressive judge may strip a true but tersely worded claim; empty evidence set → gate has nothing to compare.

  9. return_output — Compiles the final result: subject line, plain‑text body, HTML body, skip reason (if any), engagement signal, and next touch time.
    reads filtered_draft, skip_reason, thread_state; writes subject, plain_body, html_body, engagement_signal, next_touch_time.
    branch: No early return; always produces the output struct. The caller owns the send decision.
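
Step 4 above can be sketched as a small pure function. The four reason strings come from the list; the shape of `thread_state` (a dictionary of boolean flags) and the priority order are assumptions made for the example.

```python
from typing import Optional

# Reasons named in step 4, checked in an assumed priority order.
STOP_REASONS = ("replied", "bounced", "unsubscribed", "unverified")


def check_stop_conditions(thread_state: dict) -> Optional[str]:
    """Return the first stop reason whose flag is set, else None."""
    for reason in STOP_REASONS:
        if thread_state.get(reason):
            return reason
    return None
```

A caller would treat any non-None result as the machine-readable skip reason and end the run before hook extraction or drafting.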

System design — mechanism, invariant, trade-off

The subsystem is a directed graph of small, gated steps, each owning exactly one concern. The ordered mechanism begins with the lookup step, which reads the contact from the database once and loads role, seniority, department, and profile into working state. Next, the suppression gate runs early: it checks a central do-not-contact list keyed on a one‑way fingerprint of the email address plus domain, and fails closed—any incomplete check treats the contact as suppressed. The stop conditions step then examines the contact’s current thread state, ending the run with a distinct machine‑readable reason if the contact has already replied, bounced, unsubscribed, or has an unverified address. Only after these guards pass does the vertical sequence selector perform a deterministic lookup for the contact’s vertical (and an optional narrower niche), returning a structured three‑touch plan. The hook step then reads the supplied post text and extracts exactly one concrete grounded fact. The drafting step uses a directive lookup per step in the sequence, with a generic fallback, and writes the body. Finally, the faithfulness gate uses a judge model to audit every personalized sentence against the assembled evidence, removing any unsupported claim. The entire graph returns a subject line, plain‑text body, HTML body, and bookkeeping (skip reason, engagement signal, next‑touch time)—but never calls send. That sending decision is always owned by the caller.

The invariant the design preserves is stated explicitly: “the graph drafts copy but never sends.” This rule is the single most important structural guarantee. It means the graph is stateless with respect to transmission and can be invoked identically by three distinct flows: an autonomous pipeline that sends without a human, a human‑approved campaign that pauses for sign‑off, and a one‑shot preview that shows only a draft. Because the graph never assumes how it is invoked, the caller decides approval and sending every time, and no accidental send can escape from the drafting steps. The design also ensures that all personalized claims are grounded in evidence, enforced by the faithfulness gate that produces a score between zero and one, posted as feedback for observability.
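
The faithfulness gate described here can be sketched as a sentence filter. Everything specific is an assumption: `is_supported` stands in for the judge model, sentences are split with a simple regular expression, and the score is taken to be the fraction of sentences that survive. The text says only that the score lies between zero and one.

```python
import re
from typing import Callable, List, Tuple


def faithfulness_gate(
    draft: str,
    evidence: str,
    is_supported: Callable[[str, str], bool],
) -> Tuple[str, float]:
    """Drop sentences the judge rejects; return (filtered draft, score)."""
    sentences: List[str] = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()
    ]
    if not sentences:
        # Assumed convention: nothing to audit scores zero.
        return "", 0.0
    kept = [s for s in sentences if is_supported(s, evidence)]
    # Assumed scoring: share of sentences that survived the audit.
    return " ".join(kept), len(kept) / len(sentences)
```

With a real judge the callable would wrap one model call per draft or per sentence; the filter around it stays the same.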

The key trade‑off behind this shape rejects two obvious alternatives. A single big send function has the fewest moving parts but creates “one failure domain and nowhere to insert a human or a safety gate”—a monolithic routine cannot pause for approval or run a suppression check without coupling it into the same code path. Separate microservices per flow give clean isolation but “make you pay the platform tax — deployment, tracing, state plumbing — once per flow.” The chosen design uses a shared graph runtime fronted by a registry, yielding “gated, traceable steps with additive growth, at the cost of one routing layer and the discipline to keep the registry simple.” This cost is accepted because it prevents the monolithic send’s inflexibility and avoids the per‑service overhead of independent microservices.

A concrete failure mode in this subsystem is “a step index past the end of the sequence.” This occurs when the drafting step looks up a directive at an index that does not exist in the sequence definitions—for example, after a sequence selector returned a plan with three touches but the engine tries to compose a nonexistent fourth touch. The signal an operator would actually see is a skip reason logged against that contact in the outreach graph’s bookkeeping fields, specifically the skip reason that short‑circuits the contact’s processing. The trace log would show that the graph stopped early for that thread, with no draft returned, and the counter “email.compose.vertical_hook_rate” would not increment because no hook was ever extracted.

Cost & performance — the real knobs

cadence_days default

  • Knob — cadence_days: [0, 4, 7] in VERTICAL_SEQUENCE_DEFS
  • Bounds — Controls the minimum days between successive touches in a sequence.
  • Effect — Larger values stretch the campaign timeline, increasing latency before each follow‑up; smaller values compress the schedule, raising request throughput and the potential for faster iteration.
  • Risk — Too short risks appearing aggressive or violating sender‑reputation limits; too long lets leads go cold or the campaign stall.
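
A short sketch of how the default could translate into a schedule. It assumes each entry is a day offset from the first touch; if the values are gaps between touches instead, a running sum would replace the direct addition. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List

CADENCE_DAYS = [0, 4, 7]  # default quoted in the knob above


def touch_times(start: datetime, cadence_days: List[int]) -> List[datetime]:
    """Assumption: each entry is a day offset from the first touch."""
    return [start + timedelta(days=d) for d in cadence_days]


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
schedule = touch_times(start, CADENCE_DAYS)
```

Under the offset reading, the three touches land on days 0, 4, and 7, so the second gap (3 days) is shorter than the first (4 days).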

fallback_step

  • Knob — fallback_step: 2 in VERTICAL_SEQUENCE_DEFS (integer)
  • Bounds — Defines which step directive is used when the current touch index exceeds the sequence length (e.g., after step 2 of a 3‑step sequence).
  • Effect — A higher fallback gives a more static “last resort” copy; a lower one may reuse an earlier directive. This trades off adaptation (model cost) for predictability (no extra model call to handle overrun).
  • Risk — Mis‑set it and a step‑past‑end produces copy that is either too generic or repeats an earlier angle, confusing the recipient.
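
The overrun handling can be sketched as a bounds check, assuming fallback_step is itself an index into the same steps list. The function name and the directive labels are illustrative, not taken from the codebase.

```python
from typing import List


def directive_for(steps: List[str], step_index: int, fallback_step: int) -> str:
    """Return the directive for step_index, or the fallback when out of range."""
    if 0 <= step_index < len(steps):
        return steps[step_index]
    # Overrun: reuse the configured fallback directive instead of failing.
    return steps[fallback_step]


steps = ["opener", "value", "soft close"]
```

Note that a fallback_step outside the list would still raise IndexError in this sketch, which is the mis-set case the Risk bullet warns about.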

Number of touches (sequence length)

  • Knob — Implicit length of the steps list in each vertical sequence definition (default 3)
  • Bounds — Determines how many discrete emails are drafted per campaign, directly driving LLM call count per thread.
  • Effect — More touches increase total drafting cost proportionally and extend the campaign timeline; fewer touches reduce dollar spend and total latency but may convert fewer leads.
  • Risk — Too many touches wastes budget and risks inbox fatigue; too few may not nurture the contact long enough for a reply.

Faithfulness judge model

  • Knob — No env var; the choice of which LLM serves as the judge in the faithfulness gate (described as “a judge that compares each claim to the evidence” — “at the cost of one extra model call”)
  • Bounds — Adds one model inference per drafted email, gating the final output on that judge’s score.
  • Effect — A cheaper/faster judge reduces per‑email dollar cost and latency but may miss unsupported claims; a more expensive/thorough judge raises cost and latency but improves safety.
  • Risk — A too‑strict judge strips true claims (degrades personalization); a too‑lenient judge lets fabricated claims through (erodes trust and compliance).

Failure modes — what breaks, what catches it

Missing Contact Row

  • Trigger — The contact lookup step runs but the contact row is not found in the database, so no recipient_name, recipient_role, or profile attributes are loaded into state.
  • Guard — No explicit guard is shown in the source. The lookup step simply describes reading the contact once; a missing row is identified as a failure mode but no error handler, retry, or fallback is mentioned.
  • Posture — Fail‑soft: the source says the missing row “leaves the personalization with nothing to stand on,” implying the run continues with empty personalization fields, degrading the output.
  • Operator signal — The source does not specify a log line or metric; the operator would observe that personalization fields are empty in the final draft, or that the contact attributes used later are blank.
  • Recovery — No automated recovery is described. The operator must manually verify that the contact exists in the database and, if necessary, re‑run or add the contact before the next attempt.

Empty Post Text

  • Trigger — The hook‑extraction step (hook in the code) receives an empty or whitespace‑only post_text field, so there is no concrete fact to ground the opener.
  • Guard — No guard is shown. The source states: “The failure mode is an empty post text, which leaves the opener with nothing real to stand on.” The code later uses post_raw = (state.get("post_text", "") or "")[:1000] and then post_safe = wrap_untrusted(post_raw) if post_raw.strip() else "", but this only wraps an empty string; it does not stop the run or replace the missing hook.
  • Posture — Fail‑soft: the opener is drafted with no grounded fact, producing a generic or unfounded first sentence.
  • Operator signal — The operator would see that the opener lacks any specific personalization, or that the hook value is "none" (as the code sets hook_safe = "none" when hook_raw.strip() is false).
  • Recovery — The run continues; the only recovery is for the caller to provide a non‑empty post_text on a subsequent attempt. No automatic retry or fallback is implemented.

Step Index Past End of Sequence

  • Trigger — The drafting step receives a sequence_step index that exceeds the length of the sequence plan deterministically returned by the sequence selector (e.g., a three‑step sequence is defined but step index 4 is requested).
  • Guard — No explicit guard is shown. The source mentions the failure mode but does not describe an exception handler or validation that catches an out‑of‑range step.
  • Posture — Likely fail‑hard: the directive lookup get_step_directive(company_vertical, sequence_step, sub_niche) would probably raise an error or return None; the code then falls back to await draft(state), but if the step does not exist in the sequence, the draft may produce irrelevant copy or error out. The source gives no specific behavior.
  • Operator signal — The operator would observe a missing directive or a generic draft where a step‑specific piece was expected. If an exception occurs, an unhandled error trace would appear.
  • Recovery — No automated retry is described. The operator must correct the sequence definition or reset the campaign to a valid step index before re‑running.

Over‑Aggressive Faithfulness Judge

  • Trigger — The faithfulness_gate judge model audits each claim against the evidence and strips any sentence it deems unsupported. A true but tersely worded claim (e.g., “You spoke at X” when the evidence says “Keynote at X”) is incorrectly removed.
  • Guard — The only guard is that the gate produces a score between zero and one and “posts it as feedback,” allowing prompt and model versions to be ranked. No retry or fallback is described for the gate itself; the judge’s decision is final for that run.
  • Posture — Fail‑soft: the draft is edited to remove the false‑positive claim, continuing with a less personalized or less accurate email.
  • Operator signal — The operator sees the gate’s feedback score (e.g., a low faithfulness score) and observes that a claim known to be true was removed from the final draft.
  • Recovery — No automated recovery. The operator must adjust the judge model’s prompt or sensitivity, or manually re‑insert the claim and resend.

Suppression Gate Address Normalization Failure

  • Trigger — The suppression gate keys on a one‑way fingerprint of the email address plus domain. If the address was not normalized (e.g., different casing or sub‑addressing) before fingerprinting, the fingerprint will not match the suppression record, and a suppressed contact is treated as unsuppressed.
  • Guard — No guard is shown. The source explicitly flags this as a failure mode: “The failure mode is an address that was not normalized before fingerprinting, which could let a suppressed contact slip through.” The gate fails closed when the check cannot be completed, but not for a mismatch caused by normalization.
  • Posture — Fail‑soft (dangerous): the contact passes the gate and proceeds to drafting and eventually sending, violating the opt‑out.
  • Operator signal — The operator would notice that a suppressed contact received an email, or an audit record of the suppression gate decision would show a miss (the source says it writes an audit record of the decision).
  • Recovery — No automated recovery. The operator must normalize the address and re‑fingerprint the suppression entry, then manually suppress the contact again.
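
One way to close this gap is to normalize before fingerprinting, on both the write path and the check path. The sketch below is an assumption-laden illustration: it uses SHA-256, fingerprints the full address only, and strips a "+tag" sub-address, which not every mail provider treats as an alias. It does not validate the address.

```python
import hashlib


def normalize_email(address: str) -> str:
    """Lowercase, trim, and drop any +tag sub-address from the local part."""
    local, _, domain = address.strip().lower().rpartition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"


def fingerprint(address: str) -> str:
    """One-way SHA-256 fingerprint of the normalized address."""
    return hashlib.sha256(normalize_email(address).encode("utf-8")).hexdigest()
```

Because both sides call the same `normalize_email`, differences in casing or sub-addressing no longer produce different fingerprints.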

Thread Left Waiting Forever (Timer Stops)

  • Trigger — The campaign engine’s external timer that drains threads whose wake time has passed stops or fails, leaving a thread in a waiting status indefinitely.
  • Guard — No guard is shown. The source notes: “The failure mode is a thread left waiting forever if the timer stops.” There is no mention of a watchdog, alert, or retry mechanism for the timer itself.
  • Posture — Fail‑soft (silent): the thread remains pending, no further touches are scheduled, and no error is raised because the system simply pauses.
  • Operator signal — The operator would see that the thread has a “waiting” status and a past wake time, with no subsequent send. The source does not specify a specific log line; the signal is the silent absence of progress.
  • Recovery — Manual intervention required: restart the timer service or manually resume the thread from its checkpointed state in the database.
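
Since the text describes no watchdog, the following is only a sketch of what one could look like: a periodic check that lists waiting threads whose wake time is well past. The field names (`thread_name`, `status`, `wake_at`) and the one-hour grace period are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def overdue_threads(
    threads: Iterable[dict],
    now: datetime,
    grace: timedelta = timedelta(hours=1),
) -> List[str]:
    """Names of waiting threads whose wake time passed more than `grace` ago."""
    return [
        t["thread_name"]
        for t in threads
        if t["status"] == "waiting" and t["wake_at"] + grace < now
    ]
```

A non-empty result is the signal that the timer has stalled; the threads themselves can then be resumed from their checkpoints.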

Interview — could you explain it?

Q — "The system defines cold outreach as a directed graph of gated steps. Can you name the specific nodes or functions that enforce the rule 'draft but never send'?"

  • A — The drafting logic lives inside the outreach engine that invokes build_outreach_evidence and the VERTICAL_SEQUENCE_DEFS lookup, but there is no node that calls an SMTP library. The graph produces a pending_draft state, and the campaign engine holds it until an external caller explicitly decides to send. The reply graph also never sends; it only classifies the inbound message and adds a suppression entry for unsubscribe.
  • Follow-up — "Where does the caller actually trigger the send?"
    Answer — The caller owns the send decision; the graph only returns a draft or classification label.
  • Weak answer misses — A shallow answer would omit that the campaign engine pauses for human approval and that the reply graph’s routing is decided in code, not by the model.

Q — "Why build a separate faithfulness_check node instead of trusting the drafting model to stay grounded or using a simple keyword check?"

  • A — A keyword check is deterministic but blind to meaning, and trusting the model is cheapest but can ship a single confident fabrication. The faithfulness_check node uses a judge model to audit each claim against the assembled evidence (faithfulness_evidence block built by build_outreach_evidence) and removes unsupported sentences before finalization. This catches semantic fabrication at the cost of one extra model call. The failure mode is an over-aggressive judge that strips a true but tersely worded claim.
  • Follow-up — "How does the evidence block differ from a compose-style context_summary?"
    Answer — Outreach evidence has no context_summary; it concatenates hook, source post, memory, and contact facts, wrapped in wrap_untrusted with the label EVIDENCE.
  • Weak answer misses — A shallow answer would fail to mention that build_outreach_evidence is called per step and that the judge posts a score between zero and one as feedback for ranking model versions.

Q — "The same drafting engine is reused in three modes: autonomous pipeline, campaign with human approval, and preview-only. Why design it so the graph never knows which mode it’s in?"

  • A — The graph never assumes how it’s called because it only returns a pending draft (or a classification label). This clean separation means the drafting logic, the faithfulness_check node, and the evidence assembly (build_outreach_evidence) are identical in all three uses. The autonomous pipeline sends without a human, the campaign pauses for approval, and the preview shows only the draft; the graph doesn’t need to branch on the mode, keeping the safety rules in one copy.
  • Follow-up — "How does the campaign survive restarts if the graph has no state about the mode?"
    Answer — The campaign engine runs a durable thread per campaign and contact, checkpointed in the database with a stable thread name, so the graph itself is stateless and restarts pick up from the pending draft.
  • Weak answer misses — A shallow answer would overlook the durable thread and database checkpointing mechanism that makes reuse possible without mode awareness.

Q — "Why does the sub-niche sequence lookup use a nested map that falls back to a vertical-level definition, rather than forcing every caller to provide the exact sequence for every sub-niche?"

  • A — The nested map {vertical: {sub_niche: seq_def}} is additive: a missing vertical, a missing sub_niche, or sub_niche None all fall back to the vertical-level VERTICAL_SEQUENCE_DEFS entry. This avoids brittle hard-coding: a new vertical works immediately with the generic sequence, and only the calibrated sub-niches (those with per-sub-niche score weights) get tailored copy. The failure mode is a niche tag that no longer matches any definition after the taxonomy changes.
  • Follow-up — "What happens if the sub_niche tag resolves correctly but the step index is out of range?"
    Answer — The failure mode is a step index past the end of the sequence; the fallback_step (e.g., 2) is used to avoid a crash.
  • Weak answer misses — A shallow answer would omit the exact keys (micro_verticals.py sub_niches tuple) that must match, and the fact that fallback_step exists specifically for that off-by-one failure mode.
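
The fallback lookup discussed in this question can be sketched with two dictionaries. Only the map shapes come from the text; the sample verticals, the name `SUB_NICHE_SEQUENCE_DEFS`, and the helper `lookup_sequence_def` are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical sample data; only the nesting mirrors the description.
VERTICAL_SEQUENCE_DEFS: Dict[str, dict] = {
    "fintech": {"steps": ["opener", "value", "soft close"], "fallback_step": 2},
}
SUB_NICHE_SEQUENCE_DEFS: Dict[str, Dict[str, dict]] = {
    "fintech": {
        "payments": {"steps": ["opener", "case study", "soft close"], "fallback_step": 2},
    },
}


def lookup_sequence_def(vertical: str, sub_niche: Optional[str] = None) -> Optional[dict]:
    """Prefer a sub-niche override; otherwise use the vertical-level entry."""
    override = SUB_NICHE_SEQUENCE_DEFS.get(vertical, {}).get(sub_niche)
    if override is not None:
        return override
    return VERTICAL_SEQUENCE_DEFS.get(vertical)
```

A renamed or retired niche tag simply misses the override map and lands on the vertical-level definition, which is why the taxonomy-drift failure is silent rather than loud.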

02. Looking Up The Contact

The first step in every outreach run reads the contact from the database exactly once. It loads their role, their seniority, their department, and their full profile into a single shared snapshot. Every later step reads from that one copy. No step ever re-queries the database. No two steps disagree about the contact's identity.

Three options exist for how to get contact data into the system. Option one: let the caller pass the attributes in. This makes the caller the source of truth, but the caller can drift from the database over time. Option two: re-read the contact in every step. This means many database reads, and the row can change mid-run, so different steps see different versions of the same person. Option three: one read at the top. This buys one consistent view for the whole run, and it is the design the team chose.

The rationale for this choice is simple. A single read at the top avoids drift between steps. It keeps the database load low. And it guarantees that every downstream step works from the same facts, so personalization stays coherent across the entire email.
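
The one-read pattern can be illustrated with an immutable snapshot that every later step receives. This is a sketch under assumptions: `ContactSnapshot`, the two step functions, and `run` are invented for the example, and `fetch_contact` stands in for the single database read.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ContactSnapshot:
    """Read once, then shared; frozen so no step can alter it mid-run."""
    role: str
    seniority: str
    department: str
    profile: str


def pick_angle(snap: ContactSnapshot) -> str:
    return "strategic" if snap.seniority in {"director", "vp"} else "hands-on"


def pick_team_line(snap: ContactSnapshot) -> str:
    return f"the {snap.department} team"


def run(fetch_contact: Callable[[], Optional[dict]]) -> Optional[dict]:
    row = fetch_contact()  # the only read in the whole run
    if row is None:
        return None
    snap = ContactSnapshot(**row)
    # Both steps read the same snapshot, so they cannot disagree.
    return {"angle": pick_angle(snap), "team_line": pick_team_line(snap)}
```

Freezing the dataclass turns "no step re-queries or rewrites the contact" from a convention into something the runtime enforces.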

One failure mode to watch is a missing contact row. When the lookup returns nothing, personalization has nothing to stand on. The system cannot ground the opener, cannot pick the right angle, and may silently produce a generic message that wastes the touch. The detection signal is a null result from the lookup step. The blast radius is limited to this single run; other contacts and other outreach threads are unaffected.

Another failure mode is a stale seniority value. If the contact was promoted yesterday but the database still shows their old title, the system picks the wrong message angle. A junior sales pitch to a director sounds tone-deaf and erodes trust. The detection signal is an engagement drop in the cohort that received misaligned messages. The blast radius propagates to the contact's lifetime relationship with the sender, but it stays isolated to the individual contact and does not spread across other contacts.

For operational reality, the team measures the lookup latency at the ninety-ninth percentile. A slow database read here delays every downstream step, because the snapshot cannot be built until this first read completes. The deployment shape is a stateless worker. It scales by the number of concurrent outreach runs. Its cold start cost is around five megabytes of module load before the first lookup completes.

An operator debugging a missing contact can inspect the snapshot after the lookup step. If the snapshot is empty, the problem is in the lookup itself, not in any later personalization logic. This narrows the search to the database connection, the contact identifier, or a race in how the graph received its input.

The design considered letting the caller pass attributes in. That option was rejected because the caller runs in a different context and may have a stale version of the contact. The constraint that ruled it out was the need for consistency. Every step must agree on the contact's role and seniority. Only reading from the source of truth gives that guarantee.

Use this one-read approach when your graph has multiple downstream steps that each depend on contact attributes. It fits when consistency between those steps matters more than the cost of one database call. Do not use this approach when your contact data changes frequently during a run and each step genuinely needs the latest value. Avoid it too when the read is so expensive that caching the result elsewhere would be cheaper.

<!-- mem:begin -->

Generate it: It loads role, seniority, department, and profile into a single shared ________ that every later step reads from. (cue: shared ________; answer: snapshot)

Generate it: The constraint that ruled out letting the caller pass attributes in was the need for _____________. (cue: need for _____________; answer: consistency)

Ask yourself: Why read the contact once at the top instead of re-reading it in every step?

Answer: A single read avoids drift between steps and keeps database load low, guaranteeing every downstream step works from the same facts so personalization stays coherent — re-reading lets the row change mid-run, so different steps would see different versions of the same person.

Recall check (try before reading the answer):

  1. What happens to personalization when the contact lookup returns a missing row? Answer: Personalization has nothing to stand on, so the system may silently produce a generic message that wastes the touch.

  2. How does a stale seniority value harm an outreach run? Answer: The system picks the wrong message angle — a junior pitch to a director sounds tone-deaf and erodes trust.

  3. Where does an operator look first to confirm the lookup itself failed? Answer: At the snapshot after the lookup step; if it is empty, the problem is in the lookup, not later personalization logic.

<!-- mem:end -->

The lookup step reads the contact from the database once and stores the snapshot for all downstream steps.

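```python
async def lookup_contact(state: EmailOutreachState) -> dict:
    # Normalize the address so it matches the lower(email) comparison in SQL.
    email = (state.get("recipient_email") or "").strip().lower()
    if not email:
        return {"contact_id": None}
    try:
        # Single read: every later step uses this row instead of re-querying.
        rec = await d1_one(
            """
            SELECT id, position, seniority, department, profile
            FROM contacts
            WHERE lower(email) = ?
            LIMIT 1
            """,
            [email],
        )
        if not rec:
            # Missing row: no snapshot is written.
            return {"contact_id": None}
        return {
            "contact_id": int(rec["id"]),
            "_contact_row": rec,
        }
    except Exception:
        # A failed query is reported the same way as a missing row.
        return {"contact_id": None}
```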
ELI5 — the plain-language version

Think of it like a photographer taking a single group photo before a long hike. Everyone poses once, and that one picture is passed around later so no one argues about who was there or what they were wearing. That is exactly what this first step does: it reads the contact’s role, seniority, department, and full profile from the database exactly once, then pins that snapshot into the working state for every later step to use. No step ever re‑queries the database, so two steps can never disagree about the contact’s identity. Without this single read, the system would risk a contact’s title changing mid‑run or a step accidentally using stale data from a separate query. A beginner would feel that confusion as a follow‑up email that accidentally calls a VP a manager, or a sequence that personalizes to a job they left last week—small mistakes that break trust and waste the whole outreach effort.
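To make the "read once, reuse everywhere" idea concrete, here is a minimal sketch of how a later step might read from the pinned snapshot instead of querying again. The helper name `contact_attr` and the plain-dict state are assumptions for illustration; only the `_contact_row` key and the column names come from the lookup code above.

```python
from typing import Any, Mapping, Optional


def contact_attr(state: Mapping[str, Any], key: str) -> Optional[Any]:
    """Return one attribute from the snapshot pinned by the lookup step.

    Hypothetical helper: it never touches the database, so every caller
    sees the same values for the whole run.
    """
    row = state.get("_contact_row") or {}
    return row.get(key)


# Example state as the lookup step might leave it (values are made up).
state = {
    "contact_id": 42,
    "_contact_row": {"id": 42, "position": "VP Engineering", "seniority": "vp"},
}

seniority = contact_attr(state, "seniority")

# When the lookup found nothing, there is no snapshot and the helper returns None.
missing = contact_attr({"contact_id": None}, "seniority")
```

Returning None for an absent snapshot keeps the missing-row case visible to the caller rather than hiding it behind a default value.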

Data flow — one request, in order
  1. lookup_contact node — reads the contact row from the database using d1_one, then loads role, seniority, department, and profile into the working state.

    • reads / writes: reads recipient_email from state and the contact row from the database; writes contact_id and the _contact_row snapshot (position, seniority, department, profile) into EmailOutreachState.
    • branch: if the contact row is missing or the query fails, the node returns contact_id = None and writes no _contact_row. The excerpt above does not show this node setting skip_reason, so any missing-contact short-circuit to END is decided outside the code shown (happy path: contact found, continues).
  2. Suppression gate (function check_suppressed from infra.suppression) — computes a one‑way fingerprint of the contact’s email + domain, checks the central do‑not‑contact list via the same module, and writes an audit record through audit_suppressed.

    • reads / writes: reads the email fingerprint; writes skip_reason (if suppressed) and an audit log entry.
    • branch: if the check returns suppressed, the graph ends immediately with skip_reason set (happy path: not suppressed, continues). Fails closed — an incomplete check treats the contact as suppressed.
  3. Stop‑conditions node (registered as check_stop_conditions in the graph wiring; its body is not shown) – examines contact.current_thread_state for any of replied, bounced, unsubscribed, or never_verified.

    • reads / writes: reads thread_state from the database (or from the contact snapshot); writes skip_reason with a distinct machine‑readable reason.
    • branch: if any stop condition holds, the graph terminates with the reason (happy path: none hold, continues). This is separate from suppression — permanent vs. conversation‑state check.
  4. select_sequence_node — deterministic LangGraph node that calls select_sequence(vertical, sub_niche) to look up the tailored sequence definition.

    • reads / writes: reads vertical (and optionally sub_niche) from state; writes selected_sequence (a dict with sequence_id and touches: [{step, angle}]).
    • branch: if the vertical is unknown or missing, select_sequence returns None, but the node still writes it (graceful fallback). Happy path returns a full sequence plan.
  5. build_sequence_touches — pure function that converts touch_angles from the sequence definition into the [{step, angle}] list required downstream.

    • reads / writes: consumes selected_sequence.touch_angles from state; writes the structured touch list into state (probably as part of selected_sequence or a separate key).
    • branch: no conditional — always produces the list (failure mode if touch_angles malformed, but not a branch).
  6. Adaptive cadence node — reads the engagement signal (e.g., open event) and the number of days since the last send, then proposes a next‑touch gap. The code clamps the proposal to a safe range.

    • reads / writes: reads engagement_signal and last_send_date from state; writes next_touch_time and cadence_reason.
    • branch: if engagement signal is stale, the gap may be biased but still constrained (no hard branch). First touch uses the default gap.
  7. extract_hook node — reads the supplied post_text (a recent public post or job description) and picks exactly one concrete hook string.

    • reads / writes: reads post_text from state; writes hook (the extracted fact).
    • branch: if post_text is empty, the hook cannot be grounded — the opener will have nothing real to stand on (failure mode). Happy path writes a non‑empty hook.
  8. Drafting step node — looks up the per‑step directive from selected_sequence.touches[step_index].angle and writes the email body for this touch. If the step index is past the end of the sequence, it falls back to a generic directive.

    • reads / writes: reads selected_sequence, hook, step_index; writes draft_body (intermediate).
    • branch: if step_index is out of bounds, the fallback generic directive is used (happy path: directive found per vertical/niche). If an opportunity is linked, the generic step switches to a job‑application framing.
  9. faithfulness_check node — a judge model audits each sentence of the draft against the assembled evidence. Any unsupported sentence is removed. The node also calls post_faithfulness_feedback to record a score (0–1) for ranking prompt versions.

    • reads / writes: reads draft_body and evidence (from state); writes cleaned draft_body and a faithfulness_score to the feedback path.
    • branch: an over‑aggressive judge may strip true but terse claims (failure mode). Happy path passes all sentences.
  10. Compose/refine node — refines the draft: strips machine‑sounding phrases, tightens the subject. Then runs the same faithfulness gate again (reuses faithfulness_check). Finally writes the output fields: subject, text, html, contact_id, and skip_reason (if any).

    • reads / writes: reads the cleaned draft; writes final output fields into EmailOutreachState.
    • branch: if the refine pass over‑trims, the signature may be dropped (failure mode). The faithfulness gate runs again as a safety net. Terminal step — graph reaches END.

Control flow: the campaign engine iterates over the touches list and runs build_sequence_touches, extract_hook, and the drafting step once per touch. Within a single request, only one touch is processed. The faithfulness gate and refine node are reused from the compose graph.
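The adaptive cadence step above proposes a gap and then clamps it. A minimal sketch of that clamp follows. The bounds, the default gap, and the halve-or-double rule are all assumptions chosen for illustration; the guide only states that the proposal is constrained to a safe range and that the first touch uses the default gap.

```python
from typing import Optional

# Hypothetical bounds and default; the guide does not give the real values.
MIN_GAP_DAYS = 2
MAX_GAP_DAYS = 14
DEFAULT_GAP_DAYS = 4


def clamp_gap(proposed_days: float) -> int:
    """Force a proposed gap into the safe range."""
    return int(max(MIN_GAP_DAYS, min(MAX_GAP_DAYS, round(proposed_days))))


def next_touch_gap(opened: Optional[bool], days_since_last_send: Optional[int]) -> int:
    """Propose a gap in days, then clamp it.

    Assumed rule: an open halves the default gap, no open doubles it.
    A first touch has no prior send (None) and uses the default gap.
    """
    if days_since_last_send is None:
        return DEFAULT_GAP_DAYS
    proposed = DEFAULT_GAP_DAYS / 2 if opened else DEFAULT_GAP_DAYS * 2
    return clamp_gap(proposed)
```

Because the clamp runs last, a stale or wrong engagement signal can bias the gap but cannot push it outside the configured range.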

System design — mechanism, invariant, trade-off

In the Looking Up the Contact subsystem, the ordered mechanism begins with a single, eager database read: the first step in every outreach run loads the contact’s role, seniority, department, and profile into the working state exactly once. This snapshot (stored as _contact_row in the state object) is then used by every subsequent step; no step re-queries the database. On failure, the lookup code returns contact_id as None and writes no _contact_row, whether the row is missing or the query raised. Personalization then has nothing to stand on, and the drafting and faithfulness stages must handle the absent snapshot or abort. The design gives the run a single, fixed view of the contact: no two steps see different values, and the row cannot change under a step mid-execution.

The invariant the design preserves is a consistent view for the whole run. By reading once and freezing that data into the working state, every downstream step (personalization, hook selection, drafting, faithfulness gate) sees exactly the same role, seniority, department, and profile. No two steps can disagree on who the contact is, and no re-query can introduce a different version of the row. It is a form of snapshot isolation across the orchestration graph: the contact's identity stays fixed within a single thread execution.

The key trade-off is between consistency and freshness. The design explicitly rejects two obvious alternatives. The first, letting the caller pass the attributes in, makes the caller the source of truth, which allows the data to “drift” from the database over time and silently introduce inconsistencies. The second, re‑reading the contact in each step, risks the row “changing mid‑run,” which could cause different steps to act on contradictory facts (e.g., two steps seeing different seniority values and choosing mismatched message angles). The chosen mechanism buys a single consistent view at the cost of a one-time database fetch and the possibility that the snapshot is slightly stale if the contact is updated during the run. Rejecting the caller-passed alternative avoids the complexity of caller-side drift detection; rejecting per-step re‑reads avoids race-condition bugs and the added latency of redundant queries.

A concrete failure mode is a missing contact row, where the initial database read returns no record for the recipient's email. The signal an operator would see is in the state after the lookup step: contact_id is None, the _contact_row key is absent, and the output carries no personalization. The system cannot produce a credible hook or any personalized sentence, because the build_outreach_evidence function (which assembles the faithfulness_evidence block) has no role, department, or profile to include. A second, subtler failure mode is a stale seniority that quietly picks the wrong message angle. The operator sees no error, only a generated email whose tone or framing does not match the contact’s current job level. That degradation is detectable only by manual auditing or by monitoring the email.compose.vertical_hook_rate counter against expected patterns.

Cost & performance — the real knobs

The subsystem described in the "Looking Up The Contact" chapter spends time and money primarily on a single database read at the start of each outreach run. That read retrieves the contact’s role, seniority, department, and full profile into a snapshot that is reused by every later step. Because no subsequent step re‑queries, the cost is fixed per run: one database request instead of many. However, the source does not name explicit performance knobs (such as MAX_RETRIES, BATCH_SIZE, CACHE_TTL, or concurrency limits) for this specific step. The only identifiers that appear in the surrounding source files and that could function as performance‑influencing controls are configuration constants from the vertical‑sequence definitions and the cadence logic, which affect model call count and timing but are not concurrency or per‑host limits.

Here are three real identifiers from the source that indirectly affect how the subsystem spends time and money, followed by an explanation of why the requested knobs are absent for the contact‑lookup part.

CADENCE_DAYS (from email_outreach_graph.py)

  • Knob – CADENCE_DAYS: [0, 4, 7] (a list of integers representing days between touches).
  • Bounds – Controls the gap between successive touches in a campaign. Not a concurrency limit, but it determines how many threads are waiting simultaneously and how often the timer drains them.
  • Effect – Turning it down (shortening gaps) increases the frequency of model calls (drafting, faithfulness judging) and database lookups per unit time, raising both latency per thread and dollar cost because more sends happen in a given window. Turning it up does the opposite.
  • Risk – If set too low, the system may schedule sends faster than human approvers can review or than the database can handle, while too high a gap risks losing engagement.

fallback_step (from email_outreach_graph.py)

  • Knob – fallback_step: 2 (integer index).
  • Bounds – Defines which step’s directive is used when no per‑step directive exists for the current touch (e.g., a step index past the end of the sequence). That dictates which LLM prompt is sent.
  • Effect – Changing the fallback step alters the prompt complexity, which influences token count per model call and thus both latency and dollar cost. A shorter fallback prompt is cheaper; a longer one costs more.
  • Risk – If set to a step that expects information not available (e.g., a closer when no opener was drafted), the model may produce irrelevant or contradictory copy, increasing rejection rates and wasted call time.
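The fallback described above can be sketched as a small lookup. The function name `directive_for` and the contents of `GENERIC_DIRECTIVES` are hypothetical; only the `touches: [{step, angle}]` shape and the value `fallback_step = 2` come from the guide.

```python
from typing import Any, Mapping, Optional

# Hypothetical generic directives keyed by step index; the real prompts are not shown.
GENERIC_DIRECTIVES = {0: "intro", 1: "value follow-up", 2: "short check-in"}
fallback_step = 2  # value quoted in the guide


def directive_for(selected_sequence: Optional[Mapping[str, Any]], step_index: int) -> str:
    """Pick the per-step angle, or the generic fallback when none exists."""
    touches = (selected_sequence or {}).get("touches") or []
    if 0 <= step_index < len(touches):
        return touches[step_index]["angle"]
    # Step index past the end, or no sequence selected: use the fallback step.
    return GENERIC_DIRECTIVES[fallback_step]


sequence = {
    "sequence_id": "example",
    "touches": [{"step": 0, "angle": "opener"}, {"step": 1, "angle": "proof"}],
}
in_range = directive_for(sequence, 1)
past_end = directive_for(sequence, 5)
```

The bounds check is what makes the fallback a deliberate branch rather than an IndexError caught somewhere downstream.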

judge model (described in the faithfulness gate)

  • Knob – Not given an explicit constant name in the source, but referred to as “a judge that compares each claim to the evidence” at “the cost of one extra model call.”
  • Bounds – The judge model is a separate LLM call per sentence/claim, adding token cost and latency. The source contrasts it with cheaper alternatives (no check, keyword check).
  • Effect – Choosing a slower, larger judge model raises latency and dollar cost per email but may improve faithfulness. A faster, smaller judge reduces cost but may strip true claims (failure mode).
  • Risk – Over‑aggressive judge removes correct sentences; too‑lenient judge lets fabrications through, eroding trust and compliance.
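A rough sketch of the gate's shape may help when weighing judge cost. Here the judge is a plain callable so a cheap stand-in can replace a model call. The function names, the sentence splitter, and the score formula (fraction of sentences kept) are assumptions; the guide only says unsupported sentences are removed and a 0–1 score is recorded.

```python
import re
from typing import Callable, List, Tuple

Judge = Callable[[str, str], bool]  # (sentence, evidence) -> is it supported?


def faithfulness_filter(draft: str, evidence: str, judge: Judge) -> Tuple[str, float]:
    """Drop sentences the judge rejects; return the cleaned draft and a score."""
    sentences: List[str] = [s for s in re.split(r"(?<=[.!?])\s+", draft.strip()) if s]
    if not sentences:
        return "", 1.0  # assumed convention for "nothing to check"
    kept = [s for s in sentences if judge(s, evidence)]
    return " ".join(kept), len(kept) / len(sentences)


def keyword_judge(sentence: str, evidence: str) -> bool:
    """Cheap stand-in judge: supported if any longer word appears in the evidence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return any(len(w) > 4 and w in evidence.lower() for w in words)


cleaned, score = faithfulness_filter(
    "You lead platform engineering. You won an award last year.",
    "role: platform engineering lead",
    keyword_judge,
)
```

Swapping `keyword_judge` for a model-backed callable changes cost and strictness without touching the filtering logic, which is the trade-off the knob describes.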

No explicit knobs for the contact‑lookup step itself.
The source describes only the design choice to read once and reuse the snapshot. It does not define concurrency limits (e.g., MAX_PARALLEL_READS), per‑host limits, retry counts, backoff parameters, batch sizes, caches, or retrieval top‑k for the database read. The failure modes are a missing contact row or a stale seniority, not a misconfigured knob. Therefore, the real performance levers for this subsystem lie in the constants that govern model calls (CADENCE_DAYS, fallback_step, judge‑model choice) and indirectly control how often the contact‑lookup step is invoked. The lookup itself is a single, unconfigurable database fetch per run; its cost is fixed and minimal relative to the subsequent model‑generation and evaluation steps.

Failure modes — what breaks, what catches it

Missing contact row

  • Trigger – The database query for the recipient's email returns zero rows.
  • Guard – The lookup code checks for an empty result and returns contact_id as None without a _contact_row. No guard beyond that is shown in the source.
  • Posture – Described as fail-hard: the run cannot proceed because there is no data for personalization; the source says “leaves the personalization with nothing to stand on.” The lookup itself only returns contact_id as None, so whatever ends the run sits outside the excerpt.
  • Operator signal – The source does not specify a log line. The reliable signal is the state after the lookup step: contact_id is None and _contact_row is absent.
  • Recovery – Not defined in source. A manual step (e.g., confirming the recipient's email or reimporting the contact) is required; no retry or fallback is described.

Stale seniority

  • Trigger — The contact row exists, but the stored seniority field has not been updated after a real-world promotion or role change.
  • Guard — No guard shown in the source. The design choice to read the contact only once makes this staleness invisible until the wrong message angle is delivered.
  • Posture — Fail-soft: the run continues, but the message angle mismatches the contact’s actual seniority.
  • Operator signal — Silent absence; no error is raised. The downstream effect (poor engagement or mismatched tone) would be visible only in campaign analytics or a compliance review.
  • Recovery — Not defined in source. The stale data persists until someone manually refreshes the contact record or a separate data pipeline updates the field.

Note: The source text for this chapter explicitly identifies only these two failure modes. No other distinct failures are described within the “Looking up the contact” subsection; therefore no further entries are listed.

Interview — could you explain it?

Q1:
What is the first step in every outreach run and how does it handle the contact’s data?

A:
The first step reads the contact from the database exactly once, loading their role, seniority, department, and full profile into the working state — specifically stored as _contact_row in EmailOutreachState. Every later step reads from that snapshot, so no re-queries happen and no two steps disagree about the contact’s identity.

Follow-up:
What happens if the contact row is missing?
A: That is a documented failure mode — a missing contact row leaves personalization with nothing to stand on, as stated in the “Looking up the contact” section.

Weak answer misses:
The exact storage location (_contact_row in EmailOutreachState) and the explicit failure mode of a missing row.


Q2:
Why does the system read the contact once at the top instead of letting the caller pass attributes in or re-reading the contact in every step?

A:
The source explicitly compares three options: letting the caller pass attributes makes the caller the source of truth, which can drift from the database; re-reading in each step issues many reads and risks the row changing mid-run. One read at the top buys a consistent view for the whole run, as described in the “Looking up the contact” chapter.

Follow-up:
What specific failure mode does the one-read approach prevent compared to re-reading each step?
A: It prevents two steps disagreeing about who the contact is due to a row change mid-run.

Weak answer misses:
The additional failure mode of “a stale seniority that quietly picks the wrong message angle” — a risk that remains even with the one-read approach.


Q3:
How does the contact lookup step supply evidence to the faithfulness gate downstream?

A:
The evidence assembly function build_outreach_evidence reads the contact from the state’s _contact_row and appends role and department facts into the faithfulness_evidence block. This ensures the judge audits claims against the same snapshot the drafter used. For cold first-touch emails with no enrichment, the function returns an empty dict, and the judge posts a no_claims-tagged 1.0 rather than a misleading perfect grounding score.

Follow-up:
What would happen if _contact_row held a stale seniority but the email was personalized based on that stale data?
A: The judge would audit the claim against the stale fact in the evidence, so the claim would appear true even though the real-world contact has changed — a risk the source warns about.

Weak answer misses:
The exact code path — state.get("_contact_row") inside build_outreach_evidence — and the no_claims tag for empty evidence.


Q4:
The failure modes listed are “a missing contact row” and “a stale seniority that quietly picks the wrong message angle.” How does the system distinguish between those two at runtime?

A:
It does not. The source presents both as failure modes to watch, but no runtime detection is described for stale seniority. A missing row will likely produce an empty _contact_row, causing the evidence assembly to lack personalization facts; a stale seniority will still produce valid-looking evidence, making the failure silent.

Follow-up:
Could adding a re-read on every step solve the staleness problem?
A: It would introduce a different failure mode — two steps could disagree on the contact’s identity if the row changes mid-run, trading one category of risk for another.

Weak answer misses:
The explicit trade-off in the source between “one read” consistency and “stale data” risk, and the fact that the system consciously accepts the staleness risk.

03. The Suppression Gate

A model call is expensive and slow. You cannot waste one drafting an email to someone who already asked to stop. That is why the suppression gate runs first, before any writing happens.

The gate checks a central list of contacts who must not be contacted. If the contact is on that list, the run ends immediately. No draft, no model call, no wasted compute. The check runs right after the contact lookup and before any drafting node, so the system never spends model resources on a blocked contact.

The gate also fails closed. If the system cannot reach the list or complete the check, it treats the contact as suppressed. Better to block a send by mistake than to let a blocked contact slip through.
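Fail-closed behavior is easy to get backwards, so here is a minimal sketch. The wrapper name and the `check` callable are hypothetical stand-ins; the real signature of check_suppressed is not shown in this guide.

```python
from typing import Callable


def is_suppressed_fail_closed(check: Callable[[str], bool], fingerprint: str) -> bool:
    """Return True when the contact must be skipped.

    Any error while checking counts as suppressed, so an unreachable
    list can never turn into a send.
    """
    try:
        return bool(check(fingerprint))
    except Exception:
        return True


def list_is_down(_fingerprint: str) -> bool:
    # Simulates the central list being unreachable.
    raise ConnectionError("suppression list unreachable")


def not_listed(_fingerprint: str) -> bool:
    return False


blocked_on_error = is_suppressed_fail_closed(list_is_down, "abc123")
allowed = is_suppressed_fail_closed(not_listed, "abc123")
```

The only path that returns False is a check that completed and said "not listed"; every other outcome blocks.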

The gate keys its check on a fingerprint of the email address and the domain. That fingerprint only works one way. You cannot reverse it to recover the address. The gate also writes an audit record of every decision. Later, the team can prove whether a contact was checked and which decision the gate made.

A permanent choice to stop receiving messages lives in this central list. It is a standing rule that applies to every campaign. It never expires. That is different from the temporary thread state checked by the next step. A bounce or a reply is about this specific conversation, not a global ban.

Three options exist for where to put this check. Option one is to check only at send time. That saves a pre-send lookup but wastes every draft written to a blocked contact. It also trusts the sender to remember who is blocked. Option two is a local block list per campaign. That fragments the truth. One campaign blocks a contact while another sends to them freely. Option three is one central check, early in the graph, that fails closed. That gives one authority and one audit trail. It is the option the system uses.

The failure mode to watch is an address that was not normalized before fingerprinting. If the same address enters the list as all lowercase but the graph has mixed case, the fingerprint differs. The gate does not recognize the match. A suppressed contact slips through. The fix is to normalize every address to the same form before creating the fingerprint. Normalize first, then hash, then check.
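The "normalize first, then hash, then check" order can be sketched as follows. SHA-256, the `|` separator, and the function names are assumptions for illustration; the guide says only that the fingerprint is one-way and covers the address and the domain.

```python
import hashlib


def normalize_email(address: str) -> str:
    """Trim whitespace and lowercase so equal addresses compare equal."""
    return address.strip().lower()


def fingerprint(address: str) -> str:
    """One-way fingerprint of the normalized address plus its domain."""
    normalized = normalize_email(address)
    domain = normalized.rpartition("@")[2]
    return hashlib.sha256(f"{normalized}|{domain}".encode("utf-8")).hexdigest()


# The list stores fingerprints, never raw addresses.
suppressed = {fingerprint("ada@example.com")}

# A mixed-case, padded variant still matches because normalization runs first.
hit = fingerprint("  Ada@Example.COM ") in suppressed
```

Keeping normalization inside `fingerprint` means the list writer and the gate cannot drift apart, since both go through the same function.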

Use this central early gate when you need one authoritative rule that every campaign must respect. Do not use it when your address data is messy and normalization is unreliable. In that case, the gate gives false confidence.

<!-- mem:begin -->

Generate it: If the system cannot reach the list or complete the check, it treats the contact as ___________ — the gate fails closed. (cue: treats as ___________; answer: suppressed)

Generate it: The gate keys its check on a one-way ___________ of the email address and domain that cannot be reversed to recover the address. (cue: one-way ___________; answer: fingerprint)

Ask yourself: Why does the suppression gate run first, before any drafting happens?

Answer: A model call is expensive and slow, so checking suppression up front avoids wasting any draft or compute on a contact who already asked to stop.

Recall check (try before reading the answer):

  1. What distinguishes a suppression entry from a stop condition checked by the next step? Answer: Suppression is a permanent, never-expiring opt-out across every campaign; a stop condition (bounce or reply) is about one specific conversation.

  2. Why is the local block-list-per-campaign option rejected? Answer: It fragments the truth — one campaign blocks a contact while another sends to them freely.

  3. How can a suppressed contact slip through the gate, and what is the fix? Answer: An address not normalized before fingerprinting (e.g., mixed case vs. lowercase) produces a different fingerprint; fix it by normalizing first, then hashing, then checking.

Looking back: In "Why Outreach Is A Graph," what single thing does the drafting engine never do? Answer: It drafts copy but never sends — sending is a separate decision the caller owns.

<!-- mem:end -->

The suppression gate runs first, before any drafting, to check a central do-not-contact list and short-circuit the graph immediately if the contact is blocked.

```python
# the suppression list is always consulted first.  The same conditional
# edge (_route_after_stop_check) is reused: if suppression_gate sets
# skip_reason, the edge routes to END without any further IO.
builder.add_node("suppression_gate", suppression_gate)
# ...
builder.add_edge("lookup_contact", "suppression_gate")
# If suppression_gate blocked the send, short-circuit to END immediately;
# otherwise fall through to the existing stop-conditions check.
builder.add_conditional_edges(
    "suppression_gate",
    _route_after_stop_check,
    {"skip": END, "continue": "check_stop_conditions"},
)
```
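The body of _route_after_stop_check is not shown. Based on the comments in the wiring, a router of this kind could look like the sketch below. The name `route_on_skip_reason` is a hypothetical stand-in, and the body is an assumption; only the "skip" and "continue" labels come from the edge mapping.

```python
from typing import Any, Mapping


def route_on_skip_reason(state: Mapping[str, Any]) -> str:
    """Return "skip" when an earlier node set skip_reason, else "continue"."""
    return "skip" if state.get("skip_reason") else "continue"


blocked = route_on_skip_reason({"skip_reason": "suppressed"})
clear = route_on_skip_reason({"contact_id": 42})
```

Because the router reads only skip_reason, the same function can sit behind both the suppression gate and the stop-conditions check.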
ELI5 — the plain-language version

Imagine a bouncer at a club door who checks a "banned list" before anyone even pays the cover. That's the suppression gate. Before the system spends even a penny on drafting an email, it first checks a central do-not-contact list. If the contact is on that list, the run stops immediately—no model call, no wasted compute. It even fails closed: if the list can't be reached, the system treats the person as banned, because the risk of accidentally contacting someone who opted out is worse than a false stop. Every check is recorded with a one‑way fingerprint of the email address so the decision can be audited later.

Without this gate, the engine would regularly spend expensive AI calls drafting emails to people who already said "never contact me again." That wastes money, burns sender reputation, and could create compliance violations. Worst of all, a single slip—like a sloppily normalized address—could let a suppressed contact through, undoing the whole trust the system is built on. The gate is the first, cheap safety net that makes every later step possible.

Data flow — one request, in order
  1. Graph START node — LangGraph begins executing the outreach graph with EmailOutreachState as the input.

    • reads / writes: Receives the caller's input (including recipient_email) as the initial state; writes nothing itself.
    • branch: None. START hands control to the contact lookup.
  2. Contact lookup (node using d1_one) — Fetches the contact row from the database once and loads the fields into working state.

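    • reads / writes: Reads recipient_email from state; writes contact_id and the _contact_row snapshot (position, seniority, department, profile).
    • branch: Happy path – row found; failure – a missing row or failed query yields contact_id = None and no _contact_row. The wiring above adds a plain edge from lookup_contact to suppression_gate, so a missing-contact exit (skip_reason = "missing_contact") is decided outside this excerpt.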
  3. Suppression gate (node that calls check_suppressed) — Checks the central do-not-contact list using a one‑way fingerprint of the contact’s email address and domain.

    • reads / writes: Reads contact.email and contact.domain from state; writes suppression_result (boolean) into state; also writes an audit record via audit_suppressed.
    • branch: Happy path – contact not suppressed; early return – if check_suppressed returns True, the node sets skip_reason = "suppressed" and immediately transitions to END (no further nodes execute). If the check cannot be completed (e.g., network failure), check_suppressed treats the contact as suppressed (fail‑closed) and the same early return occurs.
  4. audit_suppressed (called inside suppression gate) — Writes an audit record of the suppression decision to a persistent store for later proof.

    • reads / writes: Reads suppression_result, contact.email, contact.domain; writes audit record (side effect, no state mutation).
    • branch: Always called after check_suppressed; no branching – it is a post‑decision logging step.
  5. Stop conditions gate (node using internal checks) — Checks the contact’s live thread state: reply, bounce, unsubscribe, or unverified address.

    • reads / writes: Reads contact.email and thread state (from database, not explicitly named); writes skip_reason if a stop condition is met.
    • branch: Happy path – no stop condition; early return – if any condition holds, sets skip_reason = "replied" (or bounce, unsubscribe, unverified) and ends the graph.
  6. select_sequence_node (LangGraph wrapper) — Deterministically looks up the sequence plan for the contact’s vertical and niche, returning {sequence_id, touches: [{step, angle}]}.

    • reads / writes: Reads vertical and niche from state; writes selected_sequence to state.
    • branch: Happy path – valid vertical found; fallback – returns None for unknown vertical, causing the drafting node to use a generic sequence.
  7. build_sequence_touches (helper called by select_sequence_node) — Converts the raw touch_angles from the lookup into the structured [{step, angle}] list required by the spec.

    • reads / writes: Reads touch_angles from the sequence definition; writes the structured list into selected_sequence.touches.
    • branch: No branching – always transforms if data is present.
  8. Drafting step (node using make_llm and wrap_untrusted) — Invokes the LLM to draft the body of the current touch, grounded on the hook extracted earlier.

    • reads / writes: Reads selected_sequence, hook, step_index, contact, vertical; writes draft (subject, text, html) to state. PII sanitized via wrap_untrusted.
    • branch: Happy path – LLM call succeeds; failure – make_llm raises LlmDisabledError if the kill switch is engaged, halting the graph.
  9. faithfulness_check (judge model) — A judge model audits each sentence of the draft against the assembled evidence, removing unsupported claims.

    • reads / writes: Reads draft.text and evidence from state; writes a cleaned draft.text and a faithfulness_score.
    • branch: Happy path – all claims supported; over‑aggressive judge – may strip a true but terse claim; no early exit.
  10. post_faithfulness_feedback (logging call) — Posts the faithfulness score as feedback for ranking prompt versions and model versions.

    • reads / writes: Reads faithfulness_score; writes feedback record (side effect).
    • branch: No branching – always runs after faithfulness_check.

Terminal step: The graph ends after post_faithfulness_feedback, returning the draft fields, or after an early exit via skip_reason. It never sends: on the happy path, the draft goes back to the caller, who owns approval and sending. On a suppression‑gate early exit, the terminal step is the END node reached immediately after step 4.

System design — mechanism, invariant, trade-off

The suppression gate is the first operational step after the contact lookup, executing before any drafting or model invocation. Its ordered mechanism is straightforward: the gate consults a central do-not-contact list keyed by a one-way fingerprint of the email address plus domain. If the contact’s fingerprint matches an entry, the run terminates immediately with no model call. If the check itself fails—for example, the list is unreachable or the fingerprint computation errors—the gate fails closed, meaning the contact is treated as suppressed and the run ends as if a match had been found. An audit record of the decision—whether matched, not matched, or closed—is written to ensure the choice can be proven later.

The invariant the suppression gate preserves is the enforcement of a permanent opt-out: no draft is ever produced for a suppressed contact, and the system never wastes a model call on such a contact. This is explicitly distinct from stop conditions (replies, bounces, unsubscribes, unverified addresses), which are checked in a separate subsequent step. The design guarantees that the expensive drafting step is only invoked after suppression is confirmed clear, and that the system is biased toward safety when its own infrastructure is unreliable.

The key trade-off is between compute efficiency and the cost of normalization errors. The design rejects the alternative of checking suppression later—for instance, just before sending—because that would waste a full model call on a blocked contact. It also rejects failing open (treating an inconclusive check as not suppressed), which risks shipping a draft to someone who opted out. By running the gate first and failing closed, the system avoids the wasted latency and compute of drafting for suppressed contacts and the compliance risk of a wrong send. The cost it accepts instead is the need for rigorous, deterministic email address normalization before fingerprinting; a mismatch there can let a suppressed address slip through.

One concrete failure mode is an email address that was not normalized before fingerprinting—for example, a leading or trailing space, or a sub-addressing variation like user+tag@domain.com instead of user@domain.com. The operator would see a signal in the audit trail: the suppressed contact’s fingerprint in the do-not-contact list does not match the fingerprint computed from the contact record, so the gate returns a “not found” and allows the draft to proceed. The downstream logs would then show an email sent to a contact who previously opted out, with the audit record revealing that the suppression check returned a negative match despite an intended match. The operator would need to trace the fingerprint normalization logic to discover the inconsistency.
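
To make the normalization failure concrete, here is a hedged sketch of a deterministic normalizer applied before hashing. The function names and the specific rules (trim whitespace, lowercase, drop a plus tag) are assumptions for illustration; whether plus tags should be stripped, and whether the local part may be lowercased, are policy choices the source does not specify.

```python
import hashlib


def normalize_email(raw: str, strip_plus_tag: bool = True) -> str:
    """Reduce an address to one canonical form before fingerprinting."""
    address = raw.strip().lower()
    local, sep, domain = address.rpartition("@")
    if strip_plus_tag:
        # Assumed policy: user+tag@domain.com maps to user@domain.com.
        local = local.split("+", 1)[0]
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {raw!r}")
    return f"{local}@{domain}"


def fingerprint_email(raw: str) -> str:
    # SHA-256 is a stand-in for the system's real one-way fingerprint.
    return hashlib.sha256(normalize_email(raw).encode("utf-8")).hexdigest()
```

The same normalizer must run both when an opt-out is written to the list and when the gate computes a fingerprint at run time; applying it on only one side reproduces the mismatch described above.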

Cost & performance — the real knobs

Where the subsystem spends time and money
The subsystem’s dominant cost is the number of large‑language‑model calls it makes. Every drafted email requires at least one call to the drafting model and, if the faithfulness gate is enabled, a second call to a judge model that checks each claim against evidence. A third model call happens when a reply is classified as interested, objection, or unsubscribe. Time is consumed by these model invocations, by the database read to fetch a contact’s snapshot once, and by the suppression‑gate lookup (a fast keyed check, but still a network round‑trip). The timer‑driven campaign engine processes threads one at a time, so concurrency is a bottleneck: only a single thread is drained per wake cycle, which serializes the entire send pipeline and can stretch end‑to‑end latency when many threads are due.
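
The call counts in the paragraph above reduce to simple arithmetic. The helper names below are hypothetical, and the result is a lower bound, since the text says each draft needs "at least" one drafting call.

```python
def model_calls_per_email(judge_enabled: bool, reply_classified: bool = False) -> int:
    """Lower bound on model calls for one email thread."""
    calls = 1  # the drafting call
    if judge_enabled:
        calls += 1  # the faithfulness judge
    if reply_classified:
        calls += 1  # classifying an inbound reply
    return calls


def campaign_model_calls(emails: int, judge_enabled: bool, replies: int = 0) -> int:
    """Lower bound for a batch: drafts, optional judge calls, reply classifications."""
    return emails * model_calls_per_email(judge_enabled) + replies
```

For example, 100 emails with the judge enabled and 10 classified replies come to at least 210 model calls.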

Real performance knobs from the source

  • Evidence snippet maximum length

    • Knob — max 300 chars (documented in the VERTICAL_HOOK_TEMPLATES comment as the limit for the {evidence} placeholder). Default is 300 characters.
    • Bounds — Caps the number of characters of enrichment evidence that are inserted into the draft prompt.
    • Effect — Increasing the limit raises token cost per draft (more input tokens) and may improve personalisation; decreasing it lowers cost but risks omitting key facts.
    • Risk — If set too high, prompt length grows, increasing latency and per‑call cost; if too low, the model may have insufficient evidence to ground the opener, triggering more faithfulness‑gate removals.
  • Faithfulness gate enabled

    • Knob — Whether to run the judge model after drafting (the source contrasts “trusting the drafting model to stay grounded” [no judge] with “a judge that compares each claim to the evidence” [one extra model call]). No explicit variable name is given, but the design describes it as a choice.
    • Bounds — Adds exactly one extra model call per outgoing email.
    • Effect — Enabling it doubles the number of model calls per email (one draft call plus one judge call) but removes fabricated claims; disabling it saves money and time but risks sending a “single confident fabrication.”
    • Risk — When enabled, an over‑aggressive judge may strip true but tersely worded claims (a reported failure mode). When disabled, any unsupported claim ships, eroding trust and potentially causing compliance problems.
  • Timer drain concurrency

    • Knob — The campaign engine drains threads “one at a time” (explicit in the text: “an external timer drains threads … one at a time, and resumes each”). Default concurrency = 1.
    • Bounds — Limits how many campaign threads are processed simultaneously when their wake times arrive.
    • Effect — Increasing concurrency (e.g., to 5) would allow multiple drafts, approvals, and sends to happen in parallel, reducing overall campaign latency but raising database and model‑call load. The value cannot go below the current default of 1, which already serialises all work.
    • Risk — If concurrency is too high, the shared database and model API may be overwhelmed, causing timeouts or rate‑limit errors; too low and campaigns take longer to complete, and a single slow thread blocks all others.
  • Hook extraction top‑k

    • Knob — The hook step “picks exactly one concrete hook” (top‑1 retrieval from the supplied post text). No variable name is given, but the parameter is the number of hooks extracted.
    • Bounds — Controls how many facts are passed to the draft node for personalisation. Currently set to exactly 1.
    • Effect – Setting it to 0 would leave the opener ungrounded; setting it to 2 or more would feed multiple facts into the draft prompt, increasing token usage and potentially diluting the focus.
    • Risk – At 1, if the post text is empty (a reported failure mode) the opener has nothing real to stand on. At higher values, the model may try to weave in multiple claims, raising the chance of contradiction or hallucination that the faithfulness gate must catch.
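
The timer drain concurrency knob above could be sketched with a semaphore. This is an illustration under stated assumptions, not the campaign engine's code: `drain_due_threads`, the `resume` callable, and the `concurrency` parameter are hypothetical names, and the source only documents the one-at-a-time behaviour (equivalent to `concurrency=1` here).

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def drain_due_threads(
    thread_ids: Iterable[str],
    resume: Callable[[str], Awaitable[str]],
    concurrency: int = 1,
) -> List[str]:
    """Resume every due thread, never running more than `concurrency` at once."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    limit = asyncio.Semaphore(concurrency)

    async def run_one(thread_id: str) -> str:
        async with limit:  # waits here while `concurrency` threads are in flight
            return await resume(thread_id)

    # gather returns results in the same order as the input thread ids.
    return list(await asyncio.gather(*(run_one(t) for t in thread_ids)))
```

With `concurrency=1` a single slow thread still blocks the rest; raising the value trades that latency for more simultaneous load on the database and model API.
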
Failure modes — what breaks, what catches it

Address Not Normalized Before Fingerprinting

  • Trigger — The contact's email address contains varied casing, leading/trailing whitespace, or subaddressing (e.g., plus notation) that is not normalized before being hashed into the one-way fingerprint.
  • Guard — No guard is shown in the source. The fingerprint step is the intended check, but the absence of a preceding normalization step means the guard can be silently bypassed.
  • Posture — fail-soft (the run continues uninterrupted, and a draft may be sent to a contact who is on the do-not-contact list; the system degrades by failing to detect the match).
  • Operator signal — Silent mismatch: the audit record shows a "not matched" decision for a contact who is in fact on the list, and the contact proceeds through the entire outreach pipeline. A later compliance audit or an inbound complaint reveals the missed suppression.
  • Recovery — No automatic recovery. The operator must add a normalization step before the fingerprint hash and then replay the contact's suppression check. No retry is attempted within the same run.

Stale Suppression List (Delayed Opt-Out Propagation)

  • Trigger — A contact opts out, but the central do-not-contact list has not been updated to reflect that opt-out before the suppression gate runs.
  • Guard — No guard within the gate itself; it is a data-currency problem. The gate only queries whatever the list currently contains.
  • Posture — fail-soft (the run continues, sending an email to a contact who has opted out; the system fails to respect the opt-out).
  • Operator signal — The audit record shows no suppression for this contact, but later the contact complains or an opt‑out event timestamp is found to be before the send time. A compliance metric detects a violation.
  • Recovery — No automatic recovery. The operator must update the suppression list and ensure campaign reconciliation processes recheck contact status before the next touch.

Suppression List Unreachable (Network or Service Outage)

  • Trigger — The central do-not-contact list cannot be queried (network partition, database outage, or rate limiting).
  • Guard — The gate's fail-closed behavior: "if the check cannot be completed, the contact is treated as suppressed rather than risk a wrong send".
  • Posture — fail-closed (the run ends immediately; the contact is incorrectly suppressed, but no email is sent).
  • Operator signal — An audit record is written, presumably with a reason such as "suppression_check_failed". Operators see an elevated count of suppression decisions that lack a corresponding opt-out event.
  • Recovery — No automatic retry within the same run; the run is aborted. The operator must restore list availability and then re‑initiate the campaign for that contact. The graph does not re‑attempt the check.

Fingerprint Collision (Hash Collision Across Different Addresses)

  • Trigger — Two distinct email addresses (or address variants) produce the same one-way fingerprint due to a hash collision, causing a non‑suppressed contact to be falsely matched to a suppressed record. A collision cannot hide a suppressed contact, because that contact's own fingerprint still matches its list entry.
  • Guard — None identified in the source. The fingerprint function is used without collision detection; the design assumes uniqueness.
  • Posture — over-suppression (the contact is incorrectly suppressed and the run is aborted; no email is sent). The system continues but silently records a wrong match.
  • Operator signal — The audit record shows a suppression for a contact with no opt‑out history; a manual audit might reveal the collision.
  • Recovery — None automatic. Requires operator investigation, potentially migrating to a stronger hash or adding domain‑level verification.

Audit Record Write Failure

  • Trigger — After the suppression decision is made, the attempt to write the audit record to persistent storage fails (disk full, database write error).
  • Guard — No explicit guard is shown in the source for audit write failures. The gate does not require the audit write to succeed before ending the run.
  • Posture — fail-soft (the run ends anyway, but the decision is unrecorded, making it impossible to prove later).
  • Operator signal — Silent absence: the audit trail for this suppression decision is missing. Operators may detect a gap when cross‑referencing suppression counts against campaign logs.
  • Recovery — No retry; the audit write is not retried. Manual step: operator must reconstruct the decision from graph run logs or re‑run the check with logging enabled.
Interview — could you explain it?

1. Warm-up: Basic mechanism

  • Q — Where in the graph does the suppression gate live, and what is its single most important property for cost control?
  • A — The suppression gate is the first gate in the graph, right after the contact lookup and before any drafting work. It checks the contact against a central do‑not‑contact list and ends the run immediately if the contact is suppressed, so no model call is ever wasted on a blocked address.
  • Follow-up — How do you know the check itself will not consume significant resources?
  • A — The check uses a one‑way fingerprint of the email address plus domain, so it is a cheap, deterministic keyed lookup: one round‑trip to the list and no model call.
  • Weak answer misses — The “one‑way fingerprint” detail; a shallow answer might say “it just checks a list” without naming the hashing mechanism.

2. Design trade‑off: Fail‑closed vs. fail‑open

  • Q — Why is this gate designed to fail closed rather than fail open?
  • A — The source explicitly says the gate “fails closed: if the check cannot be completed, the contact is treated as suppressed rather than risk a wrong send.” This prioritizes compliance over availability—it is safer to block a run than to send to a potentially suppressed contact.
  • Follow-up — What downstream effect does fail‑closed have when the database is unreachable?
  • A — The run ends immediately with a suppression decision recorded, preventing any draft from being generated, even for legitimate contacts, until the database is restored.
  • Weak answer misses — The phrase “risk a wrong send” is the core rationale; a shallow answer might say “it’s safer” without tracing to the specific fail‑closed rule in the source.

3. Distinguishing suppression from other stops

  • Q — The system has both a suppression gate and stop conditions. Why are they separate, and how does each treat its decision?
  • A — The suppression gate handles permanent opt‑outs and runs first. The stop conditions node (the “second guard”) checks temporary state like “already replied, bounced, unsubscribed, or email never verified.” The source is explicit: “suppression is a permanent opt‑out, while a stop condition is about this contact’s live conversation.”
  • Follow-up — What happens if a contact unsubscribes mid‑run?
  • A — The unsubscribe path in the reply graph adds a suppression entry, so a subsequent run would be caught by the suppression gate. The stop‑conditions node would also catch the unsubscribe label on the next check.
  • Weak answer misses — The exact distinction between “permanent opt‑out” and “temporary state” is easily overlooked; a shallow answer might conflate suppression with bounce handling.

4. Hard: Auditability and the weak point

  • Q — The gate writes an audit record of the decision. Why is that necessary, and what is the single failure mode that could let a suppressed contact slip through despite the check?
  • A — The audit record makes the decision provable later, which is important for compliance—if someone later asks “why did you email this person,” the audit trail shows the suppression check passed. The documented failure mode is “an address that was not normalized before fingerprinting,” which could cause a mismatch between the fingerprint stored in the blocklist and the fingerprint computed at runtime.
  • Follow-up — How would you test for that failure mode in a deployment pipeline?
  • A — Inject test contacts with non‑normalized addresses (e.g., leading whitespace, mixed case) into the blocklist and verify that the fingerprint calculation and lookup still match after normalization.
  • Weak answer misses — The phrase “not normalized before fingerprinting” names the exact failure mode; a shallow answer might say “the hash could collide” instead of naming the normalization step that prevents the mismatch.

5. Hard: Why this design instead of a simpler alternative?

  • Q — A simpler design would be to check the suppression list only at send time, right before the email is dispatched. Why check it at the very start of the graph, before any drafting?
  • A — Checking at the start avoids any model call or expensive enrichment for a suppressed contact, saving cost and latency. The source states “it runs early, so the system never spends a model call drafting to someone who already opted out.” The later stop‑conditions node checks live thread state (replied, bounced, etc.) that can change between the start and the send attempt. A suppression entry, once recorded, is permanent, so the design treats the single early check as sufficient.
  • Follow-up — What if the suppression entry were added after the gate passed but before the email is dispatched?
  • A — That is a race condition; the design accepts it because the suppression gate runs at the start of the graph and a subsequent add would be caught on the next run of the campaign thread. For real‑time prevention the send gate would need its own check, but the current architecture prioritizes the cost saving of an early check and relies on the batch-oriented campaign retry.
  • Weak answer misses — The contrast between “permanent” and “temporary” state; a shallow answer might not acknowledge the race condition trade‑off or the iterative nature of campaign threads.

04. Stop Conditions

The second guard looks at the contact's current thread state and ends the run with a specific reason when any stop condition holds. The contact may have already replied, or their address bounced, or they unsubscribed, or their email was never verified.

Each reason stays distinct and machine-readable. A bounce gets its own identifier, and an unverified address gets a different one, so downstream consumers tell them apart easily.

This guard lives separately from the suppression gate. Suppression marks a permanent opt-out, while this stop condition checks the live conversation. Permanence and liveness are different concerns.

You have a few design options here. One boolean eligible flag is simplest but loses the why. Re-deriving eligibility inside the drafting step couples policy to copy. A dedicated step that returns a typed reason keeps one place that owns the policy, and that is the choice made here.
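
A dedicated step that returns a typed reason might look like the sketch below. The `StopReason` enum and `ThreadState` dataclass are hypothetical; the strings "replied" and "email_unverified" appear in the source code excerpt, while "bounced" and "unsubscribed" are assumed identifiers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StopReason(str, Enum):
    REPLIED = "replied"
    BOUNCED = "bounced"            # assumed identifier
    UNSUBSCRIBED = "unsubscribed"  # assumed identifier
    EMAIL_UNVERIFIED = "email_unverified"


@dataclass(frozen=True)
class ThreadState:
    replied: bool = False
    bounced: bool = False
    unsubscribed: bool = False
    email_verified: bool = True


def stop_reason(state: ThreadState) -> Optional[StopReason]:
    """Return the first stop condition that holds, or None to continue."""
    if not state.email_verified:
        return StopReason.EMAIL_UNVERIFIED
    if state.replied:
        return StopReason.REPLIED
    if state.bounced:
        return StopReason.BOUNCED
    if state.unsubscribed:
        return StopReason.UNSUBSCRIBED
    return None
```

A boolean flag would collapse all four branches into `False`; the enum keeps the why, and every consumer compares against the same small vocabulary.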

The failure mode is a missing verification check. If the system sends to an unverified address, sender reputation takes a hit. The blast radius stops at that one send, but the damage to deliverability can spread to the whole domain. The detection signal is a spike in hard bounces from new contacts. The root cause trace points to a verification check that never ran.

The reasons are plain text strings, and they get written to an audit record alongside the run outcome. You can later count how many runs ended on a bounce versus an unsubscribe. That separation is useful for tuning timing and content.
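
Counting outcomes from the audit trail is a short aggregation. The sketch assumes each audit row is a mapping with a `skip_reason` key, as in the guard's return value; the row shape and the function name are illustrative, not taken from the source.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def tally_stop_reasons(audit_rows: Iterable[Mapping[str, Optional[str]]]) -> Counter:
    """Count runs per stop reason, ignoring runs that were not stopped."""
    return Counter(
        row["skip_reason"] for row in audit_rows if row.get("skip_reason")
    )
```

Given rows with reasons `"bounced"`, `"bounced"`, `None`, and `"unsubscribed"`, the tally reports two bounces and one unsubscribe, which is the comparison the paragraph above describes.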

That covers the stop conditions guard. Use this pattern when eligibility checks depend on contact state that changes over a single conversation. Do not use it when eligibility is static for the contact's lifetime; in that case, fold the check into the permanent suppression gate instead. The line between temporary and permanent is clear: if the condition can clear without action from your team, treat it as a stop condition.

<!-- mem:begin -->

Generate it: Each stop reason stays distinct and machine-________ so downstream consumers tell them apart. (cue: machine-________; answer: readable)

Generate it: The chosen design uses a dedicated step that returns a typed ______, keeping one place that owns the policy. (cue: typed ______; answer: reason)

Ask yourself: Why keep stop conditions in their own step instead of folding them into the suppression gate?

Answer: Stop conditions check the live conversation and can clear without action from your team, while suppression marks a permanent opt-out — permanence and liveness are different concerns, so the line is whether the condition can clear on its own.

Recall check (try before reading the answer):

  1. Name the four stop conditions that can end a run. Answer: The contact already replied, their address bounced, they unsubscribed, or their email was never verified.

  2. What is the consequence of a missing verification check? Answer: Sending to an unverified address hits sender reputation; the bad send is isolated but the deliverability damage can spread to the whole domain.

  3. What does writing the reason strings to an audit record let the team do later? Answer: Count how many runs ended on a bounce versus an unsubscribe, which helps tune timing and content.

<!-- mem:end -->

The stop conditions guard checks for replies, bounces, unsubscribes, and unverified emails, returning distinct machine-readable reasons.

python
async def check_stop_conditions(state: EmailOutreachState) -> dict:
    # Without a contact id there is nothing to check, so the run continues.
    contact_id = state.get("contact_id")
    if not contact_id:
        return {"skip_reason": None}
    # Verification check. A failed query leaves vrow as None, so a lookup
    # error does not block the run (this check fails open).
    try:
        vrow = await d1_one(
            "SELECT email_verified, outreach_eligible FROM contacts WHERE id = ? LIMIT 1",
            [int(contact_id)],
        )
    except Exception:
        vrow = None
    if vrow is not None and (vrow.get("email_verified") == 0 or vrow.get("outreach_eligible") == 0):
        return {"skip_reason": "email_unverified"}
    # Thread state: inspect the ten most recent outbound emails.
    try:
        rows = await d1_all(
            """
            SELECT status, followup_status, reply_received, reply_classification
            FROM emails
            WHERE contact_id = ? AND direction = 'outbound'
            ORDER BY created_at DESC
            LIMIT 10
            """,
            [int(contact_id)],
        )
    except Exception:
        # A failed thread-state query also lets the run continue.
        return {"skip_reason": None}
    if not rows:
        return {"skip_reason": None}
    for row in rows:
        reply_received = row.get("reply_received")
        reply_class_n = _norm(row.get("reply_classification"))
        # Any recorded reply ends the run with its own reason.
        if reply_received:
            return {"skip_reason": "replied"}
        # Otherwise map a skip-worthy classification to its reason string.
        if reply_class_n in _SKIP_REPLY_CLASSIFICATION:
            return {"skip_reason": _SKIP_REPLY_CLASSIFICATION[reply_class_n]}
        # … status and followup_status checks for bounced/unsubscribed/stopped
    return {"skip_reason": None}
ELI5 — the plain-language version

Think of a doorman who doesn’t just check a permanent blacklist—he also glances at today’s sign-in sheet. If a guest already replied to the last invitation, their address bounced, they unsubscribed, or their email was never verified, the doorman stops them with a clear, specific reason: “Bounced” gets one colored tag, “unverified” gets another. Every downstream system can read those tags at a glance, so a bounce is never confused with an opt‑out. This guard lives separate from the permanent suppression list because a temporary stop—like a bounced address that might be fixed later—is different from a lifetime block. Without it, the system would keep sending to a contact who already replied, wasting their attention, or to a broken address, burning sender reputation. The doorman would let everyone through just because they aren’t on the blacklist, ignoring whether the conversation is already over or the door is unreachable.

Data flow — one request, in order
  1. look up the contact
    Reads contact from database; loads role, seniority, department, and profile into working state.
    reads / writes — consumes contact_id; writes role, seniority, department, profile to state.
    branch — missing contact row → empty personalization (failure); stale seniority → wrong message angle; happy path: all fields populated, continues.

  2. suppression gate
    Checks a central do-not-contact list using a one‑way fingerprint of the email address plus domain. Writes an audit record of the decision.
    reads / writes — reads email and domain from contact record; writes an audit record (e.g. suppression_audit).
    branch — contact on list → end run (terminal); check cannot complete → treated as suppressed (fail‑closed); happy path: not suppressed, proceeds.

  3. stop conditions
    Inspects the contact’s current thread state and ends the run with a specific machine‑readable reason if any condition holds: already replied, address bounced, unsubscribed, email never verified.
    reads / writes — reads thread‑state fields (e.g. replied, bounced, unsubscribed, email_verified); writes a terminal reason like "bounced" or "unverified".
    branch — any condition true → end run (terminal) with that reason; happy path: none true, continues.

  4. sequence selector
    Deterministically looks up a multi‑touch sequence for the contact’s vertical and, if a niche tag is present, a tighter variant. Returns a structured plan of touches.
    reads / writes — reads company_vertical, sub_niche; writes a sequence plan (list of touches and their roles).
    branch — no niche tag → broader sequence; happy path: plan returned.

  5. extracting the hook
    Reads the supplied post text and picks exactly one concrete hook to ground the opener.
    reads / writes — reads post_text; writes hook (the chosen concrete fact).
    branch — empty post_text → leaves no real hook (failure mode); happy path: hook extracted.

  6. drafting the step
    Looks up the per‑step directive for the current touch. If application_mode is true, delegates to the job‑application branch. Otherwise calls get_step_directive(company_vertical, sequence_step, sub_niche). If no directive exists, falls back to the generic draft node.
    reads / writes — reads company_vertical, sequence_step, sub_niche, application_mode, hook, post_text, tone, recipient_name, recipient_role, contact_id, and optional memory from recall; writes draft text.
    branch — application_mode=true → job‑application path; no directive → generic draft; step index past sequence end → failure; happy path: directive found, draft produced.

  7. faithfulness gate
    Uses a judge model to audit the draft against the assembled evidence; removes any sentence whose claim is not supported. Produces a score between 0 and 1.
    reads / writes — reads draft text and evidence set; writes filtered draft and a faithfulness_score.
    branch — over‑aggressive judge strips a true claim (failure mode); happy path: unsupported sentences removed, draft is grounded.

  8. send step
    The only step that actually sends the email; records the send. It belongs to the caller rather than to the drafting graph, which never sends. (In a campaign thread, this runs only after human approval; in an autonomous pipeline, it runs directly after the gate.)
    reads / writes — reads the final filtered draft; writes a send record (e.g. sent_at, message_id).
    branch — none (terminal step for the request).
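
The ordered flow above, with its terminal branches, can be sketched as a loop over named steps that stops as soon as one of them sets a skip reason. The runner, the `stopped_at` key, and the step signature are assumptions for illustration; the real engine is a graph runtime, and only the `skip_reason` key comes from the source.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Step = Callable[[State], State]


def run_steps(initial: State, steps: List[Tuple[str, Step]]) -> State:
    """Run steps in order; end the run at the first step that sets skip_reason."""
    state = dict(initial)  # work on a copy so the caller's state is untouched
    for name, step in steps:
        state.update(step(state))
        if state.get("skip_reason"):
            state["stopped_at"] = name  # hypothetical bookkeeping field
            break
    return state
```

Placing the suppression gate and stop conditions ahead of the drafting step in `steps` is what guarantees that a terminal branch never reaches a model call.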

System design — mechanism, invariant, trade-off

In the stop-conditions guard, the system operates as a second gate that fires immediately after the suppression gate and before any drafting work begins. At this point the contact's current thread state has already been loaded once from the database. The guard evaluates four discrete predicates (already replied, address bounced, unsubscribed, and email never verified), and on any true predicate it terminates the run with a machine-readable reason that is unique per condition. The run does not proceed to drafting, so no model call is wasted on a contact whose live conversation is already finished. If the verification predicate is missing from the execution path, the guard does not block the run and the system proceeds to draft, which risks sending to an unverified address.

The invariant the design preserves is a distinct, machine‑readable reason per stop condition. The source names this property explicitly: “Each reason stays distinct and machine‑readable. A bounce gets its own identifier. An unverified address gets a different one.” This guarantee ensures that downstream consumers—whether a campaign dashboard, a reporting pipeline, or a human operator—can always tell why a contact was halted, and they share a single vocabulary. The invariant is maintained by giving each condition its own reason rather than “folding eligibility into one boolean flag,” which would lose the reason and force every consumer to guess the cause.

The key trade‑off is between reason granularity and simplicity of the guard. The design rejects the obvious alternative of a single boolean “eligible / not eligible” flag, a choice that would be simpler to implement and faster to evaluate. That rejection buys an explicit, auditable trail of why a contact was stopped. The cost avoided is the downstream confusion and policy drift that would follow from an opaque boolean—a bounce and an unverified address would look identical, and every consumer would have to re‑implement its own logic to disambiguate. The design pays the extra complexity of maintaining four separate checks in order to “keep the policy in one place and give every consumer the same vocabulary,” as the source notes.

A concrete failure mode is a missing verification check. If the "email was never verified" predicate is not executed, for example because of a logic gap or a regression in the guard's code, then a contact whose address was never verified will pass the stop-conditions gate and move to drafting. The operator signal would be a sent email that later generates a bounce or a complaint from an address that was never confirmed, visible in the send-log with a missing “verified” flag. The source warns that this failure “would send to an unverified address and burn sender reputation.” The operator, monitoring the campaign dashboard, would see an unusually high bounce rate for a particular sequence, traceable to contacts that lack a verification timestamp in the contact record.

Cost & performance — the real knobs

The subsystem spends time and money primarily on LLM inference calls: one to draft the email body, and one to the faithfulness judge that audits every personalized claim. The suppression gate and stop‑condition checks are deterministic, low‑cost database lookups. The vertical‑sequence lookup and hook extraction are also deterministic. The only explicit performance knob in the source is LLM_KILL_SWITCH.

  • Knob — LLM_KILL_SWITCH (environment variable; default not shown, assumed 0).
  • Bounds — When set to 1, it short‑circuits the entire LLM path, returning the unmodified body and setting faithfulness_score=1.0. This trades off all LLM‑related latency and dollar cost for zero personalization.
  • Effect — Turning it up (to 1) eliminates both the drafting model call and the judge model call, reducing latency to near‑zero and cost to zero for the LLM portion. Turning it down (back to 0) restores the normal LLM pipeline, increasing latency by at least two model calls and adding per‑token cost.
  • Risk — Too high: all emails become generic, harming conversion. Too low: when the drafting model is unreliable, the system still pays for a judge call that might strip claims (but that risk is inherent).
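
A minimal sketch of the kill switch, assuming it is read as the string "1" from the environment. Only the variable name LLM_KILL_SWITCH, the unmodified body, and `faithfulness_score=1.0` come from the text above; the function name, the `draft_and_judge` callable, and the return shape are hypothetical.

```python
import os
from typing import Callable, Dict, Mapping


def personalize(
    body: str,
    draft_and_judge: Callable[[str], Dict[str, object]],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, object]:
    """Skip both model calls when the kill switch is on."""
    if env.get("LLM_KILL_SWITCH", "0") == "1":
        # Short-circuit: return the body untouched with a perfect score.
        return {"body": body, "faithfulness_score": 1.0}
    return draft_and_judge(body)
```

Passing `env` explicitly keeps the function testable without mutating the process environment.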

No other environment variables, constants, or parameters that directly control concurrency, per‑host limits, retry counts and backoff, batch sizes, caches, or retrieval top‑k are present in the source text. The source instead describes architectural choices (e.g., “the model proposes, the bounds constrain” for cadence, “a fixed table” vs. “model‑chosen gap”) and failure modes, but none provide an exact identifier with a default value. Therefore, only the single knob listed above can be cited as a real, named performance control from the provided documentation.

Failure modes — what breaks, what catches it

Missing verification check

  • Trigger — The stop condition for "email was never verified" is not evaluated because the check itself is absent from the execution path.
  • Guard — None shown. The design expects a verification check to exist, but no exception handler, retry, or fallback is provided in the source.
  • Posture — Fail-soft. The run continues without halting, allowing an email to be drafted and sent to an unverified address.
  • Operator signal — Silent; no error or log is produced because the check never runs. Downstream systems see a send that should have been blocked.
  • Recovery — No automatic recovery. The missing check must be identified and added manually by reviewing the thread execution trace.

Stale thread state hiding a reply

  • Trigger — The contact's current thread state is not updated after a reply arrives, so the stop condition "already replied" is not triggered.
  • Guard — None shown. The second guard reads the thread state but the source does not specify a staleness check, retry, or fallback.
  • Posture — Fail-soft. The run proceeds and sends another touch to a recipient who has already replied.
  • Operator signal — No immediate signal; later analysis of engagement metrics shows a reply that was missed.
  • Recovery — No automatic recovery. The thread state refresh frequency must be adjusted manually.

Bounce misclassified as unsubscribed

  • Trigger — The stop condition logic incorrectly maps a bounce event to the "unsubscribed" reason code instead of using the distinct bounce identifier.
  • Guard — None shown. The source ensures "the reasons are kept distinct and machine-readable" but provides no guard against misclassification.
  • Posture — Fail-soft. The run halts with the wrong reason, but downstream consumers may treat the contact as permanently opted out instead of enabling re‑engagement after the bounce is resolved.
  • Operator signal — A machine-readable reason string like "unsubscribed" is emitted instead of "bounced". The anomaly is visible in the audit record.
  • Recovery — Manual correction of the mapping logic; no automatic retry or fallback.

Unsubscribe record not synced into thread state

  • Trigger — A contact unsubscribes via a different channel, but the unsubscribe event is not propagated to the thread state that the second guard reads.
  • Guard — None shown. The suppression gate handles permanent opt‑outs separately, but the stop condition for "unsubscribed" relies on the thread state being current.
  • Posture — Fail-soft. The guard does not see the unsubscribe, so the run continues and drafts an email to a contact who has opted out.
  • Operator signal — No stop is logged for this run. If a suppression record exists, a later run is caught at the suppression gate; otherwise the email is delivered against the recipient's wishes.
  • Recovery — No automatic recovery. The sync mechanism between unsubscribe sources and thread state must be fixed manually.
Interview — could you explain it?

Q1 (warm-up): What stop conditions does the second guard check, and how does it signal the reason for termination?

A: The second guard reads the contact’s current thread state and terminates the run if the contact already replied, their address bounced, they unsubscribed, or their email was never verified. Each reason is kept distinct and machine-readable (e.g., a bounce gets a different identifier than an unverified address), so downstream consumers can tell them apart. This mechanism is explicitly called the stop-conditions guard in the source, and it is implemented as a separate step after the suppression gate.

Follow-up: How do downstream consumers actually differentiate between a bounce and an unsubscribe?

A: The system emits a distinct, machine-readable reason string per condition, so a bounce identifier is never confused with an unsubscribe identifier.

Weak answer misses: A shallow answer omits that the reasons are distinct and machine-readable, and so fails to name the property downstream consumers rely on: they match on the identifier, not on parsed free text.
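
The distinct-reason idea can be sketched as follows. This is a minimal illustration, not the guard's actual code: the `StopReason` values, the `check_stop_conditions` name, and the thread-state keys are all assumptions, since the guide only says that each reason gets its own machine-readable identifier.

```python
from enum import Enum
from typing import Optional


class StopReason(str, Enum):
    # Hypothetical identifiers; the guide only says each reason is distinct.
    REPLIED = "replied"
    BOUNCED = "bounced"
    UNSUBSCRIBED = "unsubscribed"
    UNVERIFIED = "unverified"


def check_stop_conditions(thread_state: dict) -> Optional[StopReason]:
    """Return the first matching stop reason, or None if the run may continue."""
    if thread_state.get("replied"):
        return StopReason.REPLIED
    if thread_state.get("bounced"):
        return StopReason.BOUNCED
    if thread_state.get("unsubscribed"):
        return StopReason.UNSUBSCRIBED
    # An address that was never verified also ends the run.
    if not thread_state.get("email_verified", False):
        return StopReason.UNVERIFIED
    return None
```

Because a consumer compares against `StopReason.BOUNCED` rather than searching a sentence for the word "bounce", a bounce cannot be mistaken for an unsubscribe.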


Q2 (design alternative – “why this way and not the obvious alternative”): Why are stop conditions separated from the suppression gate rather than merged into a single check, given both can end the run?

A: Suppression is a permanent opt-out checked against a central do-not-contact list using a one-way fingerprint; the stop-conditions guard examines the live conversation state (e.g., current thread state, bounce status, verification flag). Combining them would conflate permanent prohibition with transient conversation status, making it impossible to distinguish a global opt-out from a reply that ended the sequence. The source explicitly states that the stop-conditions guard is “deliberately separate from suppression: suppression is a permanent opt-out, while a stop condition is about this contact’s live conversation.”

Follow-up: What happens if the suppression check itself cannot complete?

A: The suppression gate fails closed – if the check cannot be completed, the contact is treated as suppressed to avoid a risky send.

Weak answer misses: A shallow answer leaves out the fail‑closed design of the suppression gate and the fact that suppression uses a one‑way fingerprint, whereas stop conditions rely on thread state – two fundamentally different data sources.


Q3: How does the stop‑conditions guard avoid race conditions or stale data when the contact is already being used by other steps?

A: The contact is read from the database exactly once by the “looking up the contact” step at the start of the run, and that snapshot (role, seniority, department, profile, and implicitly thread state) is reused by every later step. The stop‑conditions guard reads its thread state from that same snapshot, so no two steps can disagree about the contact’s current status during a single run. The source calls this the “one read at the top” rule, providing “one consistent view for the whole run.”

Follow-up: What is the failure mode of that snapshot approach?

A: A missing contact row leaves personalization with nothing to stand on; a stale seniority could silently pick the wrong message angle – but the stop‑conditions guard would still see the thread state as it was at the start of the run.

Weak answer misses: A shallow answer fails to mention that the thread state is part of the same one‑time snapshot, and that the guard’s check is synchronized by the single read, not by a fresh query.
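
One way to picture the "one read at the top" rule is an immutable snapshot that every later step receives. The class, field names, and default thread state below are assumptions for illustration; the guide does not show how the snapshot is stored.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class ContactSnapshot:
    """Read once at the start of the run, then shared unchanged by every step."""
    contact_id: int
    role: Optional[str]
    seniority: Optional[str]
    department: Optional[str]
    thread_state: str  # assumed to travel with the snapshot


def snapshot_from_row(row: dict) -> ContactSnapshot:
    # Hypothetical column names; "active" is an illustrative default.
    return ContactSnapshot(
        contact_id=int(row["id"]),
        role=row.get("role"),
        seniority=row.get("seniority"),
        department=row.get("department"),
        thread_state=row.get("thread_state", "active"),
    )
```

Freezing the dataclass means a later step cannot quietly overwrite `thread_state`; an attempt raises `FrozenInstanceError`, which keeps the single consistent view intact for the whole run.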


Q4 (hard): If an address is not normalized before finger‑printing in the suppression gate, can that cause a stop‑conditions false negative?

A: Yes – the source warns that a non‑normalized address before finger‑printing “could let a suppressed contact slip through.” However, the stop‑conditions guard checks the live conversation independently; it does not rely on the suppression fingerprint. An unnormalized address would still pass the stop‑conditions guard if the thread state is unremarkable, but the suppression gate would fail to block a permanently opted‑out contact. That is a suppression + normalization failure, not a stop‑conditions failure.

Follow-up: Is there a similar normalization risk in the stop‑conditions guard itself?

A: The source does not describe normalization in the stop‑conditions guard; the risk is limited to the suppression gate’s fingerprint step.

Weak answer misses: A shallow answer might claim that stop conditions also fingerprint the address, but the source only describes a fingerprint for suppression and relies on thread state for stop conditions – the two use different keys.
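
The normalization risk is easy to demonstrate. The sketch below assumes SHA-256 and a trim-plus-lowercase normalization; the guide names neither the hash nor the normalization rules, so treat both as placeholders.

```python
import hashlib


def normalize_address(address: str) -> str:
    # Assumed normalization: trim whitespace and lowercase.
    return address.strip().lower()


def fingerprint(address: str) -> str:
    """One-way fingerprint of the normalized address."""
    normalized = normalize_address(address)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def raw_fingerprint(address: str) -> str:
    """The unsafe variant: hashes the address exactly as typed."""
    return hashlib.sha256(address.encode("utf-8")).hexdigest()
```

`fingerprint(" Ada@Example.com ")` equals `fingerprint("ada@example.com")`, while the two `raw_fingerprint` values differ. That difference is how a suppressed contact can slip through when normalization is skipped.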


Q5 (hard): What ensures that the distinct stop‑condition reasons (bounce vs. unsubscribe) are recorded for audit after the run ends?

A: The source mentions that the suppression gate “writes an audit record of the decision so the choice can be proven later.” By analogy, the stop‑conditions guard emits a machine‑readable reason string that can be logged or fed into the same auditing mechanism. The exact audit function is not named in the provided excerpt, but the design principle of auditable termination is shared: the guard returns a distinct identifier, and the framework records it in the run metadata.

Follow-up: How does the system know whether to retry the contact later (e.g., after a transient bounce) versus permanently suppress them?

A: A bounce reason from the stop‑conditions guard is treated as a transient stop (live conversation state), whereas the suppression gate’s record is a permanent opt‑out – the distinct reason identifiers enable downstream logic to make that retry vs. block decision.

Weak answer misses: A shallow answer omits that the stop‑conditions guard does not write to the suppression list; it only terminates the current run, and the distinction between permanent (suppression) and transient (stop) is enforced by the separation of the two guards.

05. Adaptive Cadence

The cadence step reads one engagement signal — whether the contact opened the last message or not — and the number of days since that send. It returns two things: the next touch date and a short reason for the choice. The system asks a language model to propose a gap in days. Then the code clamps that proposal to a safe range. No message ever fires on the same day. No silence stretches past six months. The model proposes, the bounds constrain.

Three cases cover every contact. A first touch uses a default gap, typically set by the campaign designer. For a contact who opened the last message, the step shrinks the gap toward half that default. For a quiet contact — no open, no click — the step stretches the gap, but the upper bound stops it from blowing out. The reason string makes the decision transparent: "opened, shortening" or "quiet, stretching".
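
The clamp and the three cases can be sketched as below. This is a deterministic stand-in, not the engine's code: the real step asks a model for the proposal, and the bound values (1 and 180 days), the "opened" and "quiet" labels, the 1.5 stretch factor, and the first-touch reason string are assumptions chosen to match the prose.

```python
import math

CADENCE_MIN_DAYS = 1    # assumed floor: never the same day
CADENCE_MAX_DAYS = 180  # assumed ceiling: about six months


def clamp_gap(proposed_days: int) -> int:
    """Force any proposal into the safe range."""
    return max(CADENCE_MIN_DAYS, min(CADENCE_MAX_DAYS, proposed_days))


def baseline_gap(signal: str, default_gap: int) -> tuple[int, str]:
    """Deterministic stand-in for the model's proposal, one branch per case."""
    if signal == "opened":
        proposed, reason = default_gap // 2, "opened, shortening"
    elif signal == "quiet":
        proposed, reason = math.ceil(default_gap * 1.5), "quiet, stretching"
    else:  # first touch: keep the designer's default
        proposed, reason = default_gap, "first touch, default gap"
    return clamp_gap(proposed), reason
```

Whatever produces the proposal, the last call is always `clamp_gap`, so a proposal of 0 becomes 1 day and a proposal of 200 becomes 180.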

The design rejects two alternatives. A fixed table would be predictable and cheap, but it would stay blind to how this contact actually behaves. An unbounded model-chosen gap would adapt freely, but a misfire — a model that picks zero days or two hundred days — would damage sender reputation. The compromise: let the model propose, then enforce hard floor and ceiling in code. The constraint that rules out the unbounded approach is blast radius. One bad proposal could queue a flood of emails, or let a warm lead go cold. That risk is unacceptable for a system that sends to thousands of contacts.

The failure mode here is a stale engagement signal. Imagine a contact opened an email, but the open event arrives at the cadence step two hours late. The step reads "no open" from the database and stretches the gap, even though the contact just engaged. An operator spots this in the logs: the stated reason says "quiet" but the actual open happened recently. The blast radius is one contact — the gap is wrong for that person, not for others — because the cadence decision runs per contact, per thread. To detect it earlier, watch one metric. Count the cadence decisions where the stated reason is quiet but the time since the last send sits in the bottom ten percent of the range. That pattern signals a late-arriving signal.
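
The detection metric can be sketched as a simple count. The record shape is hypothetical, and "bottom ten percent of the range" is read here as the lowest tenth of the span between the smallest and largest observed days-since-send values; other readings, such as a percentile, are equally plausible.

```python
def count_suspect_quiet(decisions: list[dict]) -> int:
    """Count 'quiet' decisions made unusually soon after the last send."""
    days = [d["days_since_last_send"] for d in decisions]
    if not days:
        return 0
    low, high = min(days), max(days)
    # Lowest tenth of the observed range (an assumed reading of the metric).
    threshold = low + 0.10 * (high - low)
    return sum(
        1
        for d in decisions
        if d["reason"].startswith("quiet") and d["days_since_last_send"] <= threshold
    )
```

A rising count suggests that opens are arriving after the cadence step has already read the signal.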

Operationally the cadence step runs as a short-lived node inside the graph runtime. It calls one language model for the proposal — roughly three hundred milliseconds at the ninety-ninth percentile — then the clamp finishes in microseconds. The node itself is stateless; it reads from the database snapshot that the contact lookup step already loaded. Scaling is per graph instance, not per contact, so the cost of each call is the model latency. Cold starts add maybe five megabytes for the model warm-up, but the clamp code is negligible.

Here is the transferable rule. Use this pattern when you have a reliable engagement signal but the model's raw output cannot be trusted to stay within safe limits. Do not use it when the engagement signal arrives in real time and the risk of a bad model guess is acceptable. A low-volume trial run is one example, where a single misstep costs nothing.

<!-- mem:begin -->

Generate it: The model proposes a gap in days, then the code _______ that proposal to a safe range. (cue: code _______; answer: clamps)

Generate it: The constraint that rules out an unbounded model-chosen gap is blast _______ — one bad proposal could queue a flood of emails. (cue: blast _______; answer: radius)

Ask yourself: Why let the model propose the gap at all instead of just reading a fixed table?

Answer: A fixed table is cheap but blind to how this contact actually behaves; letting the model propose adapts to engagement, while clamping its output in code prevents a misfire (zero days or two hundred days) from damaging sender reputation.

Recall check (try before reading the answer):

  1. How does the gap differ for a contact who opened versus a quiet contact? Answer: For an opener the step shrinks the gap toward half the default; for a quiet contact it stretches the gap up to the upper bound.

  2. What two hard limits does the clamp enforce on every proposal? Answer: No message fires the same day (a floor), and no silence stretches past six months (a ceiling).

  3. What metric detects a stale engagement signal early? Answer: Count cadence decisions whose stated reason is "quiet" yet the time since the last send sits in the bottom ten percent of the range.

Looking back: In "The Suppression Gate," what does the gate do when its check cannot complete? Answer: It fails closed — it treats the contact as suppressed rather than risk a wrong send.

<!-- mem:end -->

The cadence step reads engagement signals, asks the LLM to propose a gap, then clamps it to safe bounds.

python
from datetime import datetime, timedelta, timezone

# EmailOutreachState, get_sequence_def, d1_all, CADENCE_MIN_DAYS and the three
# underscore-prefixed helpers are defined elsewhere in the module.

async def decide_cadence(state: EmailOutreachState) -> dict:
    contact_id = state.get("contact_id")
    sequence_step = state.get("sequence_step") or 0
    seq_def = get_sequence_def(state.get("company_vertical"), state.get("sub_niche"))
    cadence_days = seq_def["cadence_days"] if seq_def else [0, 4, 7]
    # Default gap: the entry for the next step, else the last entry, else 7 days.
    next_step_idx = sequence_step + 1
    if next_step_idx < len(cadence_days):
        default_gap = cadence_days[next_step_idx]
    else:
        default_gap = cadence_days[-1] if cadence_days else 7
    default_gap = max(CADENCE_MIN_DAYS, default_gap)
    if not contact_id:
        # Without a contact id there is no send history to read, so no next touch is scheduled.
        return {"engagement_signal": "first_touch", "next_touch_at": None, "cadence_reason": "no_contact_record", "cadence_confidence": 1.0, "cadence_source": "adaptive_cadence_v81", "cadence_evidence": []}
    # Read the five most recent outbound emails for this contact.
    rows = await d1_all("SELECT sent_at, opened_at, reply_received, sequence_type, status FROM emails WHERE contact_id = ? AND direction = 'outbound' ORDER BY created_at DESC LIMIT 5", [int(contact_id)])
    engagement_signal = _resolve_engagement_signal(rows)
    days_since = _days_since_last_send(rows)
    # Ask for a gap decision; the result carries days_gap, reason, and confidence.
    decision = await _cadence_decision(engagement_signal=engagement_signal, days_since_last_send=days_since, company_vertical=state.get("company_vertical"), sequence_step=sequence_step, default_gap=default_gap)
    days_gap = decision["days_gap"]
    # Next touch is "now plus the gap", formatted as a UTC ISO-8601 timestamp.
    next_touch_at = (datetime.now(timezone.utc) + timedelta(days=days_gap)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"engagement_signal": engagement_signal, "next_touch_at": next_touch_at, "cadence_reason": decision["reason"], "cadence_confidence": decision["confidence"], "cadence_source": "adaptive_cadence_v81", "cadence_evidence": [f"engagement:{engagement_signal}", f"days_since_last_send:{days_since}", f"default_gap:{default_gap}", f"resolved_gap:{days_gap}"]}
ELI5 — the plain-language version

Imagine a thoughtful friend who, instead of texting you on a fixed schedule, adjusts how long they wait based on whether you read their last message. If you read it, they text again sooner; if you went quiet, they give you more space. But they never text again the same day, and they never let more than six months pass without checking in. That is how this cadence step works. It reads one simple signal, whether the contact opened the last email, along with how many days have passed since that send. The system first asks a language model to suggest a number of days to wait before the next touch. Then, before using that number, the code clamps it to a safe range: no same-day blasts, no silences longer than six months. For a first touch, it uses a default gap set by the campaign designer. If the contact opened the last message, the gap shrinks toward half the default. If they have gone quiet, it stretches, but always within the upper bound. The model proposes; the bounds constrain. The one weak spot is a stale engagement signal: if an open is recorded late, the step still sees a quiet contact and waits too long. The bounds limit how far off the gap can be, but they cannot correct the signal itself.

Data flow — one request, in order
  1. scheduling step – Invoked by the campaign engine after a send or approval, it calls the cadence step to compute the next wake time.

    • reads / writes: Consumes the thread’s last_send_timestamp and engagement_signal from the database; writes no state yet.
    • branch: Happy path calls the cadence step; failure path (e.g., thread missing) raises an error.
  2. cadence step – Reads the fresh engagement signal and days since the last send to decide the pause before the next touch.

    • reads / writes: Reads engagement_signal (a label such as "first_touch" that records whether the contact opened the last message) and days_since_last_send (integer) from the contact’s thread state; returns next_touch_time and a human-readable reason to the scheduling step.
    • branch: If last_send_timestamp is None (first touch), the step sets proposed_gap to the default_gap from the sequence definition and skips the LLM call.
  3. LLM call (llm_propose_gap) – For a non‑first touch, the cadence step sends the engagement signal and days since last send to a language model, which returns a proposed gap in days.

    • reads / writes: Inputs the engagement_signal and days_since_last_send; output is a numeric proposed_gap.
    • branch: If the LLM returns a malformed or absent value, the step falls back to the default_gap (safe fallback).
  4. Clamping to safe range – The code clamps proposed_gap to never schedule a same‑day send (minimum 1 day) and never exceed six‑month silence (maximum 180 days).

    • reads / writes: Reads the raw proposed_gap; writes the clamped proposed_gap back into local state.
    • branch: No early return; always clamps.
  5. Case-specific bounds (opener) – If the engagement signal shows that the contact opened the last message, the upper bound is further tightened to at most half the default_gap.

    • reads / writes: Reads engagement_signal and default_gap; mutates the clamped proposed_gap to min(clamped, default_gap / 2).
    • branch: Only executed when the signal indicates an open; otherwise skipped.
  6. Case-specific bounds (quiet) – If the engagement signal shows no open, the lower bound is raised to a value longer than default_gap, while still respecting the global upper bound (180 days).

    • reads / writes: Reads engagement_signal; mutates proposed_gap to max(clamped, default_gap * 1.5) (inferred from “longer gap”) and then clamps again to the safe range.
    • branch: Only executed when the signal indicates no open; otherwise skipped.
  7. Compute next_touch_time – The cadence step adds the final clamped gap to the current system time to produce the absolute wake datetime.

    • reads / writes: Reads the clamped proposed_gap and current_time; writes next_touch_time (datetime) into local state.
    • branch: None.
  8. Generate human-readable reason – The step creates a short string explaining the choice, such as "opened, shortening" or "quiet, stretching".

    • reads / writes: Reads the branch decisions; writes a reason string.
    • branch: None.
  9. Return to scheduling – The cadence step returns the pair (next_touch_time, reason) to the calling scheduling step.

    • reads / writes: No further state changes; values flow upward.
    • branch: None.
  10. scheduling step writes wake time – Using the returned next_touch_time, the scheduling step writes waiting_status and wake_time to the contact’s thread record in the database.

    • reads / writes: Reads the cadence‑returned values; writes thread.status = "waiting" and thread.wake_time = next_touch_time.
    • branch: If the write fails, the thread may remain in an inconsistent state; the happy path commits.
  11. Pause thread – The scheduling step signals the campaign engine to pause the thread; control returns to the external timer that manages wake times.

    • reads / writes: Updates thread state to “paused” (implicitly).
    • branch: None.
  12. External timer resumes at wake – The timer drains threads whose wake_time has passed, one at a time, and resumes the campaign graph for the next cycle (back to step 1 for the next touch).

    • reads / writes: Reads wake_time from database; writes a resume signal.
    • branch: If the timer stops, the thread is left indefinitely (failure mode noted in source).
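
Step 12 can be sketched as a filter over thread records. The `status` and `wake_time` field names follow step 10; the function name and the in-memory list are assumptions, since the guide does not show how the timer queries the database.

```python
from datetime import datetime, timezone


def due_threads(threads: list[dict], now: datetime) -> list[dict]:
    """Waiting threads whose wake_time has passed, oldest first."""
    due = [
        t for t in threads
        if t["status"] == "waiting" and t["wake_time"] <= now
    ]
    # Oldest wake time first, so the timer can resume threads one at a time.
    return sorted(due, key=lambda t: t["wake_time"])
```

A thread whose `wake_time` is still in the future is left alone, and a thread that is not in the "waiting" status is never resumed by the timer.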
System design — mechanism, invariant, trade-off

The adaptive cadence subsystem follows a precise ordered mechanism. It begins by reading the engagement signal — specifically whether the contact opened the last message — and the number of days since that send. The cadence step then calls a language model to propose a gap in days. The code immediately clamps that proposal to a safe range that prevents a same-day blast on the low side and a six‑month silence on the high side. Three cases are covered: a first touch uses the default gap set by the campaign designer; a contact who opened the last message gets a shorter gap, toward half the default; a contact who has gone quiet gets a longer gap, still inside the upper bound. The step returns two outputs: a single next‑touch date and a short human‑readable reason for the choice. This flow ensures that every cadence decision is adaptive yet bounded.

The design preserves the invariant that timing can adapt without ever scheduling a same-day blast or a six-month silence, and the source states it in those words. The invariant is enforced mechanically: the model proposes, the code clamps the proposal to the safe range, and so even an extreme suggestion yields a schedule inside the acceptable window. Because the clamp is a hard boundary on the output of the cadence step, the timing stays both responsive and safe.

The key trade‑off is between a fixed table and an unbounded model‑chosen gap. A fixed table would be predictable but blind to engagement, missing the opportunity to tighten or loosen spacing based on real behavior. An unbounded model‑chosen gap would adapt freely but could misfire badly, scheduling a follow‑up too soon or too late. The design rejects both extremes in favor of a hybrid: the model proposes, the bounds constrain. This rejection avoids the cost of a fixed table (poor conversion because it ignores engagement) and the cost of an unbounded model (unpredictable and potentially harmful timing). Instead, the system gains adaptation within a safe envelope.

One concrete failure mode occurs when the engagement signal is stale, for example an open recorded late. The cadence step runs before the open is recorded, reads "no open," treats the contact as quiet, and stretches the gap. An operator would see an engaged contact receiving a follow-up later than appropriate: the stated reason says quiet, yet the open happened shortly after the send. The source explicitly identifies this failure: “a stale engagement signal — an open recorded late — biasing the gap the wrong way.”

Cost & performance — the real knobs

Adaptive Cadence – Time and Money Spent & Performance Knobs

The subsystem spends time and money in three measurable places:

  1. A database query (d1_all) to fetch prior engagement signals for the contact.
  2. One large-language model call (currently DeepSeek) that proposes a gap in days.
  3. A cheap arithmetic clamp that binds the model’s proposal to fixed constants.

The following five knobs directly control latency, throughput, and dollar cost. No other knobs are visible in the source extracts.


CADENCE_MIN_DAYS

  • Knob — constant CADENCE_MIN_DAYS; default value not shown in source (inferred from the text "no same-day blast": likely 1).
  • Bounds — prevents the model from scheduling a touch earlier than this many days after the last send.
  • Effect — turning it up (e.g., 3 days) delays high-frequency touches, reducing model-call volume for a given campaign; turning it down allows faster re-engagement but can increase per-contact LLM calls.
  • Risk — set too low, the clamp becomes a no-op and a same-day blast can fire, burning sender reputation and wasting model cost on useless touches. Too high (e.g., 30 days) kills the adaptive value for warm contacts.

CADENCE_MAX_DAYS

  • Knob — constant CADENCE_MAX_DAYS; default not shown in source (inferred from "no six-month silence": likely 180).
  • Bounds — maximum gap the model can propose, stops cold streaks from stretching indefinitely.
  • Effect — raising it lets the system wait longer (fewer model calls per contact over time but slower cadence); lowering it forces more frequent touches (more model calls per contact).
  • Risk — if set too low, warm contacts get hammered too often, wasting model cost and annoying recipients. Too high, a truly disengaged contact sits un-approached for months, which is a lost opportunity but not a cost risk.

cadence_days array (per‑vertical sequence definition)

  • Knob — seq_def["cadence_days"] in the sequence lookup; default fallback is [0, 4, 7].
  • Bounds — provides the default gaps for each step (first touch, second, third) when no prior engagement exists. Each value becomes the default_gap passed to the LLM prompt.
  • Effect — shorter values (e.g., [0,2,4]) increase the frequency of model proposals overall, raising monthly LLM spend; longer values (e.g., [0,7,14]) slow the cadence and lower cost. The last element also serves as the catch‑all default after the sequence ends.
  • Risk — if the array is too short and a long sequence is run, the fallback to the last element can create unreasonably tight or loose spacing. If it is omitted entirely, the hardcoded [0,4,7] may mismatch the vertical’s expected rhythm.

DeepSeek model (the LLM used for gap proposal)

  • Knob — the model name passed to the runtime (currently DeepSeek as per the docstring); no configurable parameter for temperature or top‑p is exposed in the source.
  • Bounds — determines the token‑processing latency, per‑call cost, and the quality of the proposed gap.
  • Effect — switching to a smaller/cheaper model reduces both latency (faster response) and dollar cost per proposal but may produce less accurate gap suggestions or parse failures (the code handles parse failures by falling back to default_gap). Moving to a larger model increases cost and latency but yields more nuanced adaptive timing.
  • Risk — a model that is too cheap may frequently return malformed output, triggering the parse_failure fallback and losing the adaptive benefit. An expensive model per call can dominate the budget even when the default gap would have been fine.

Engagement‑signal query (D1 read)

  • Knob — the SQL SELECT sent_at, opened_at, reply_received, sequence_type, status FROM emails WHERE contact_id = ? AND direction = 'outbound' ORDER BY created_at DESC LIMIT 5; the only tuneable parameter is the index design on contact_id (not shown in code).
  • Bounds — reads at most five rows per contact per cadence decision.
  • Effect — caching this query reduces database latency. The read is already small: it selects five named columns and stops at LIMIT 5, so there is nothing to paginate.
  • Risk — a missing index makes this step a bottleneck, adding latency to every cadence call. The docstring says the function fails open to a default gap on D1 errors, so a slow query increases latency without breaking the function.

These five knobs are the only real performance controls visible in the source. No concurrency limits, per-host limits, retry counts or batch sizes appear in the provided extracts. The most impactful cost driver is the DeepSeek model choice, because every cadence decision burns one LLM inference. The bound constants (CADENCE_MIN_DAYS, CADENCE_MAX_DAYS) and the sequence-definition array shape how often those inferences are triggered.

Failure modes — what breaks, what catches it

Stale Engagement Signal

  • Trigger — The contact opened the last message, but the engagement_signal is recorded late (e.g., after the cadence decision). The system sees an old or missing open and biases the gap incorrectly.
  • Guard — None. The code derives engagement_signal from the most recent email rows and passes it to the LLM without staleness detection. The source explicitly names this failure mode: “a stale engagement signal — an open recorded late — biasing the gap the wrong way.”
  • Posture — fail-soft (the function still runs and returns a gap, but the gap is wrong for the actual engagement).
  • Operator signal — No error log. The cadence_reason field in the output will mention “no open” or “first_touch” despite a late open. The operator would see unexpected long gaps for engaged contacts.
  • Recovery — Manual audit of sequence logs; no automatic retry or fallback. The stale signal must be corrected upstream.

LLM Response Parse Failure

  • Trigger — The LLM returns a response that is not a dict, or the dict lacks a days_gap key. This can happen with JSON formatting errors, truncated output, or an unexpected schema.
  • Guard — if isinstance(result, dict): handles the expected shape; otherwise the function falls back: raw_gap = default_gap; reason = "parse_failure"; confidence = 0.5.
  • Posture — fail-soft (the function returns the default cadence instead of the LLM-suggested one).
  • Operator signal — The reason field in the output will be "parse_failure" and confidence will be 0.5. No exception is raised; a log line containing “parse_failure” would be emitted if logging is added (the source does not show one, but the fallback value is the signal).
  • Recovery — The default gap is used immediately; no retry. The next touch uses the campaign’s default_gap.
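
The fallback described above can be sketched as a small resolver. The function name and the returned dictionary shape are assumptions; the "parse_failure" reason and the 0.5 confidence come from the guard described in this entry.

```python
def resolve_proposal(result: object, default_gap: int) -> dict:
    """Use the model's dict when it has a days_gap; otherwise fall back."""
    if isinstance(result, dict) and "days_gap" in result:
        return {
            "days_gap": result["days_gap"],
            "reason": result.get("reason", "model_proposal"),
            "confidence": result.get("confidence", 0.5),
        }
    # Not a dict, or no days_gap key: keep the campaign's default gap.
    return {"days_gap": default_gap, "reason": "parse_failure", "confidence": 0.5}
```

Because the fallback reason is a fixed string, counting "parse_failure" decisions gives a direct measure of how often the adaptive path is being lost.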

D1 Database Error

  • Trigger — A D1 query (read or write) inside decide_cadence fails, e.g., network timeout or unavailable database. The docstring states: “Fail-open on D1 errors — falls back to default cadence.”
  • Guard — The function catches the D1 error and returns default_gap with reason = "d1_error" (the exact field is not shown but implied by “falls back to default cadence”).
  • Posture — fail-soft (the run continues with a safe default gap).
  • Operator signal — A D1 error log (not shown in snippet, but standard in Workers) and the cadence_confidence would be low (e.g., 0.5) or a distinct cadence_reason like "d1_error".
  • Recovery — Default gap used; the operator can inspect the D1 error logs and retry manually.

Missing Engagement Signal

  • Trigger — The engagement_signal field in state is None or an empty string. This might happen if the initial contact state was never set, or a previous step failed to write it.
  • Guard — None. The code passes engagement_signal: {engagement_signal} to the LLM without validating its presence.
  • Posture — fail-soft (the LLM receives "engagement_signal: None" and may return an arbitrary gap, but the clamping bound prevents extreme values).
  • Operator signal — No error; the cadence_reason might contain phrases like “first_touch” even for a non-first touch. The operator would see unusual cadence gaps.
  • Recovery — Manual inspection of the contact state; no automatic fallback. The missing signal must be supplied upstream.

Unparseable raw_gap from LLM

  • Trigger — The LLM returns a days_gap value that is not convertible to an integer via int(raw_gap). For example, a string like "ten", a float written as "5.5" (which Python’s int() rejects), or a list.
  • Guard — None. The code executes int(raw_gap) without a try-except block. This raises a ValueError for an unparseable string and a TypeError for a list.
  • Posture — fail-hard (the exception propagates up and halts the current run).
  • Operator signal — A ValueError or TypeError traceback in the logs, referencing int(raw_gap) in decide_cadence.
  • Recovery — No automatic recovery; the run fails. The operator must correct the LLM prompt or add a type guard. The failure could be caught by an outer error handler if one exists (the source does not show one).
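
A type guard of the kind suggested in the recovery note might look like this. The helper name is hypothetical, and truncating "5.5" to 5 is a design choice made for the sketch, not something the source specifies.

```python
def coerce_gap(raw_gap: object, default_gap: int) -> int:
    """Convert the model's days_gap to an int, or fall back to the default."""
    if isinstance(raw_gap, bool):
        # bool is a subclass of int in Python; treat it as malformed output.
        return default_gap
    try:
        # float() first, so that a string such as "5.5" is accepted.
        return int(float(raw_gap))
    except (TypeError, ValueError, OverflowError):
        # Covers "ten", lists, None, "nan", and "inf".
        return default_gap
```

With this guard in place the failure becomes fail-soft: a malformed value falls back to the default gap instead of halting the run.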
Interview — could you explain it?

Q – What inputs does the cadence step consume, and what does it output?

A – The cadence step reads an engagement signal (whether the contact opened the last message) and the number of days since that send. It returns two things: a single next-touch date and a short human-readable reason for the chosen gap.

Follow-up – How is the engagement signal obtained, and what happens if it arrives after the cadence decision?
A – The signal is recorded by a separate monitoring process; if it is recorded late (a stale engagement signal), the calculated gap can be biased in the wrong direction, which is the documented failure mode.

Weak answer misses – The answer omits that the step returns both a date and a reason, not just a date.


Q – The design uses a model-proposed gap that is then clamped by code. Why this way and not a completely fixed table of gaps?

A – A fixed table is predictable but blind to engagement, so it cannot adapt to an opened or quiet contact. The hybrid approach lets the model adapt freely within safe bounds: the code clamps the proposal to a safe range, so timing can adapt without ever scheduling a same-day blast or a six‑month silence.

Follow-up – What mechanism prevents the model from proposing a zero-day gap?
A – The clamping code enforces a lower bound, so no message ever fires on the same day.

Weak answer misses – The answer fails to name that the “model proposes, the bounds constrain” pattern is the explicit design rationale, and that both lower and upper bounds are enforced.


Q – Describe the three distinct cases the cadence handles and how the gap is set for each.

A – A first touch uses the default gap set by the campaign designer. A contact who opened the last message gets a shorter gap, toward half the default. A contact who has gone quiet gets a longer gap, but still inside an upper bound.

Follow-up – What is the specific numeric target for the shorter gap when the recipient opened?
A – The context says “toward half the default”; no exact constant is given, but the direction is clear.

Weak answer misses – The answer does not mention that the “default gap” is typically set by the campaign designer, nor that the quiet case still respects an upper bound.


Q – What is the single documented failure mode of the adaptive cadence, and how does it affect the system?

A – The failure mode is a stale engagement signal — an open recorded late — which biases the gap the wrong way. For example, a contact who already opened might receive a longer gap because the open was not yet recorded, wasting momentum.

Follow-up – Is there any mechanism to recover from a stale signal once it is eventually recorded?
A – The context does not describe a recovery mechanism; the failure mode is simply noted as a risk.

Weak answer misses – The answer leaves out that the stale signal is specifically about an “open recorded late,” not any other kind of stale data.

06. Vertical Sequences

A generic sequence converts poorly, so each vertical gets its own planned arc. The arc has three stops. The opener starts with a specific problem the buyer faces, the second stop delivers concrete value for that problem, and the third stop is a gentle close.

The sequence selector handles the plan, and it looks up the vertical first. If a narrower niche tag exists on the contact profile, the selector picks an even tighter variant of the arc. The returned structure lists every touch and its role, and this all happens before any copy is written. A human approver can see the full arc of touches, so they sign off on the complete plan, not just a single draft.

This design is a choice among three options. One option is to let the model invent a sequence each time. That is flexible, but it is unrepeatable. A compliance audit cannot reproduce the exact plan that ran last week, and the traces are hard to compare.

Another option is one sequence written by hand for everyone. That is consistent, but it ignores how differently buyers read. A director in engineering does not read the same way as a director in legal. A buyer who sends demand letters reads differently than a buyer who runs voice operations, and one fixed story cannot serve both.

So we chose the third option, a deterministic lookup. The vertical maps to a fixed plan, and the niche tag maps to a tighter plan under it. The lookup is repeatable across runs and inspectable in traces, and it is specific to the buyer's world.

An on-call engineer can search the trace for the lookup step. The log shows the vertical key, the niche key, and the plan that was selected. If the plan looks wrong, the taxonomy config is the first place to check. The system never drafts copy before the plan is confirmed.

One failure mode to watch is a niche tag that no longer matches. This happens when the taxonomy is renamed but the contact profiles are not refreshed. The operator sees the lookup miss in the trace, and the system falls back to the generic vertical plan. The blast radius is the set of contacts with that orphaned tag, and a job that refreshes the tags can fix the profiles.

Reach for a deterministic sequence lookup when your contacts split into clear groups that read differently. Do not use it when every contact needs a completely unique path. A taxonomy cannot capture infinite variety, but for a known set of buyer roles, a deterministic lookup gives you consistency you can audit and repeat.

<!-- mem:begin -->

Generate it: If a narrower niche ___ exists on the contact profile, the selector picks an even tighter variant of the arc. (cue: niche ___; answer: tag)

Generate it: The vertical maps to a fixed plan, and the lookup is repeatable across runs and ___________ in traces. (cue: ___________ in traces; answer: inspectable)

Ask yourself: Why pick a deterministic lookup over letting the model invent a sequence each time?

Answer: A model-invented sequence is unrepeatable — a compliance audit cannot reproduce the exact plan that ran last week — whereas a deterministic lookup is repeatable across runs and inspectable in traces.

Recall check (try before reading the answer):

  1. What are the three stops of a vertical's planned arc? Answer: An opener on a specific problem the buyer faces, a second stop delivering concrete value, and a third stop that is a gentle close.

  2. Why is one hand-written sequence for everyone rejected? Answer: It ignores how differently buyers read — a director in engineering does not read like a director in legal, and one fixed story cannot serve both.

  3. What happens when a niche tag no longer matches the taxonomy? Answer: The lookup misses and the system falls back to the generic vertical plan; a job that refreshes the tags fixes the affected profiles.

<!-- mem:end -->

The sequence selector looks up a vertical's fixed plan, with an optional tighter sub-niche variant, and returns the full arc of touches before any copy is written.

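```python
from typing import Any

# VERTICAL_SEQUENCE_DEFS and SUB_NICHE_SEQUENCE_DEFS are static config dicts
# defined elsewhere in the project.


def select_sequence(
    vertical: str | None,
    sub_niche: str | None = None,
) -> dict[str, Any] | None:
    """Return the planned arc for a contact, or None when no definition exists."""
    seq = get_sequence_def(vertical, sub_niche)
    if seq is None:
        return None
    return {
        "sequence_id": seq["sequence_id"],
        "touches": build_sequence_touches(seq),
    }


def build_sequence_touches(seq_def: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn the definition's touch_angles into an ordered [{step, angle}] list."""
    return [
        {"step": i, "angle": angle}
        for i, angle in enumerate(seq_def.get("touch_angles", []))
    ]


def get_sequence_def(
    vertical: str | None,
    sub_niche: str | None = None,
) -> dict[str, Any] | None:
    """Prefer the sub-niche definition; fall back to the vertical-level one."""
    if not vertical:
        return None
    if sub_niche:
        sub_map = SUB_NICHE_SEQUENCE_DEFS.get(vertical)
        if sub_map and sub_niche in sub_map:
            return sub_map[sub_niche]
    # Unknown sub-niche (or none given): use the vertical plan, None if unknown.
    return VERTICAL_SEQUENCE_DEFS.get(vertical)
```

ELI5 — the plain-language version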

Think of a chef planning a tasting menu for different guests. A one-size-fits-all three-course meal might satisfy nobody, so the chef designs a dedicated arc for each type of diner—a problem-led appetizer that names the diner’s actual frustration, a value-driven main course that solves it, and a soft-dessert close that invites a follow-up. The sequence selector works like a chef’s order pad: it first looks up the diner’s main category (vertical), and if the contact profile carries a narrower niche tag (like “vegan” or “gluten-free”), the selector picks an even tighter variant of that arc. Crucially, the entire three-stop plan is written down before any cooking begins—a human approver can see the full menu of touches and sign off before a single sentence is drafted.

Without this custom planning, you'd either serve the same generic three emails to everyone (and watch a legal buyer and a voice-operations buyer both bounce), or let the model improvise a new sequence each time, which is unrepeatable and impossible to audit. The concrete failure a beginner would feel: a niche tag that no longer matches any definition after the taxonomy changes. The chef opens the order pad, sees “vegan” for a diner, but the pantry has no vegan entry. The chef falls back to the standard menu for that diner's category, so the diner still eats but loses the tighter, tailored arc.

Data flow — one request, in order
  1. select_sequence_node (LangGraph node)
    Entry point of the subsystem. Reads vertical and sub_niche from the current EmailOutreachState snapshot (loaded earlier by the contact lookup step).
    reads / writes: reads vertical (string), sub_niche (string or None); writes selected_sequence (dict or None) to state.
    branch: None yet – it unconditionally proceeds to the underlying lookup function.

  2. select_sequence(vertical, sub_niche) (pure lookup function)
    Called by the node with the two keys. It delegates the lookup itself to get_sequence_def(vertical, sub_niche).
    reads / writes: reads nothing directly; no state writes.
    branch: if get_sequence_def returns None, select_sequence returns None as well (graceful fallback); the happy path proceeds only when a definition was found.

  3. Vertical and sub-niche resolution inside get_sequence_def
    If vertical is empty or None, the function returns None immediately. Otherwise, if sub_niche is set, it looks for that key in the nested map SUB_NICHE_SEQUENCE_DEFS[vertical] (a nested dict {vertical: {sub_niche: seq_def}}).
    reads / writes: reads SUB_NICHE_SEQUENCE_DEFS and VERTICAL_SEQUENCE_DEFS (global dicts); no state writes.
    branch: if sub_niche is present and matches a key, use that sub-level definition (tighter variant). Else fall back to the vertical-level definition stored under VERTICAL_SEQUENCE_DEFS[vertical], which is None for an unknown vertical.

  4. Return sequence definition from get_sequence_def
    The function returns a dictionary containing sequence_id, touch_angles (list of strings), steps (list of LLM directives), cadence_days (list of ints), and fallback_step (int).
    reads / writes: none – returns a dict.
    branch: if it returned None (vertical missing or unknown), select_sequence returns None, the node writes None to selected_sequence, and the graph short-circuits later. Happy path receives a full definition.

  5. Back in select_sequence: check the result
    select_sequence inspects the returned definition. If it is None, it returns None; the node then writes selected_sequence = None to state and returns (the graph later triggers a skip_reason because no sequence is available).
    reads / writes: none inside select_sequence; the node writes selected_sequence to state (either None or a dict).
    branch: None ends the subsystem here for that request (failure path); happy path continues to step 6.

  6. build_sequence_touches(seq_def) called by select_sequence
    select_sequence passes the whole definition to this helper, which reads touch_angles from it. The helper converts the list of angle strings (e.g., ["problem-opener", "value-step", "soft-close"]) into the structured [{step, angle}] list required by the spec, numbering steps from 0.
    reads / writes: reads touch_angles from the definition; produces a list of dicts. No state writes yet.
    branch: no branch here – the function always produces one entry per angle, and an empty list if touch_angles is missing.

  7. Assemble final selected_sequence structure
    select_sequence builds the final value: a dict with keys sequence_id (from the definition) and touches (the list from build_sequence_touches). The node receives it and stores it in the state field.
    reads / writes: the node writes selected_sequence (now the structured plan) to EmailOutreachState.
    branch: no branch – this always happens on the happy path.

  8. Node returns control to the graph runtime
    After writing the state, select_sequence_node finishes. The LangGraph runtime transitions to the next node (typically the approval layer that inspects the plan before any draft is written).
    reads / writes: none – the node returns.
    branch: the graph may later fan out to multiple touches (one per step), but within the subsystem control is linear – no loop or fan‑out here.

Control note: The subsystem does not loop or fan out internally. The only forks are the get_sequence_def branch between sub-niche and vertical-level definitions and the early exit when the vertical is missing or unknown. The graph runtime itself may iterate over contacts or campaign threads, but that is outside the subsystem's scope.
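
The node at the top of this flow can be sketched as a plain function that returns a partial state update, which is the usual LangGraph node convention. This is a sketch under assumptions: the state class lists only the three fields named above, and select_sequence here is a hard-coded stand-in for the lookup shown earlier, not the real function.

```python
from typing import Any, Optional, TypedDict


class EmailOutreachState(TypedDict, total=False):
    # Only the fields this node touches; the real state carries more.
    vertical: Optional[str]
    sub_niche: Optional[str]
    selected_sequence: Optional[dict[str, Any]]


def select_sequence(vertical: Optional[str], sub_niche: Optional[str] = None) -> Optional[dict[str, Any]]:
    """Stand-in for the real lookup, with one hard-coded plan for illustration."""
    if vertical != "finance":
        return None
    return {"sequence_id": "finance-example", "touches": [{"step": 0, "angle": "opener"}]}


def select_sequence_node(state: EmailOutreachState) -> dict[str, Any]:
    """Read the two lookup keys and return the state update for selected_sequence."""
    plan = select_sequence(state.get("vertical"), state.get("sub_niche"))
    # None is written explicitly so a later step can set a skip reason.
    return {"selected_sequence": plan}
```

The node has no branch of its own: it always returns one key, and the value is either the structured plan or None.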

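Diagram — the real call graph

select_sequence_node → select_sequence → get_sequence_def (SUB_NICHE_SEQUENCE_DEFS first, then VERTICAL_SEQUENCE_DEFS); select_sequence → build_sequence_touches
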
System design — mechanism, invariant, trade-off

The vertical-sequences subsystem begins with a deterministic lookup: the sequence selector reads the contact's vertical and, if a narrower niche tag is present on the profile, picks an even tighter variant of the arc. That lookup returns a structured plan listing every touch and its role (opener, value, or soft close) before any copy is written. A human approver can thus inspect the full multi-touch arc and sign off before the drafting step invokes the outreach engine. When a niche tag no longer matches any definition after the taxonomy changes, the niche lookup misses and the selector falls back to the vertical-level plan, so the contact loses only the tighter variant. Only when the vertical itself is missing or unknown does the selector return no plan, and the operator sees an empty sequence plan for that contact.

The invariant the design preserves is repeatable, inspectable determinism. The source states that a “deterministic lookup is repeatable, inspectable, and still specific,” and the guarantee is that every run for a given vertical and niche, against the same configuration, yields exactly the same planned arc, traceable before any model call. This invariant ensures that two engineers or auditors can replay the same inputs and see the same sequence structure, and that the plan exists independently of the draft so it can be reviewed and logged.

The key trade-off rejects two obvious alternatives. Letting the model invent a sequence each time would be flexible but “unrepeatable and impossible to audit.” A single hand-written sequence for everyone would ignore “how differently a legal demand-letter buyer and a voice-operations buyer read a cold email.” The chosen design pays the cost of a static configuration table that can outgrow itself as verticals multiply, but it avoids the cost of an unrepeatable, unauditable pipeline and the cost of generic, poorly converting copy. The trade-off is concrete specificity at the price of config maintenance.

One concrete failure mode, named exactly from the source, is “a niche tag that no longer matches any definition after the taxonomy changes.” The lookup does not raise in this case: it falls back to the vertical-level plan, so the run continues with a less specific arc. An operator would see the niche miss in the trace for the sequence-selector step, next to a selected plan that carries the vertical-level sequence rather than the niche one. A missing structured plan appears only when the vertical itself is unknown, and it is observable in the span attributes that log the vertical slug and the absence of a returned touch list.

Cost & performance — the real knobs

The vertical-sequence subsystem exposes the following performance knobs.

  • Knob: fallback_step (default 2 in the finance VERTICAL_SEQUENCE_DEFS entry).
    Bounds: Limits which touch directive is used when a step index exceeds the sequence length.
    Effect: With a three-step arc (indices 0 to 2), the default of 2 makes every over-indexed step reuse the soft-close directive; lowering it to 1 reuses the value directive, and 0 repeats the opener. Raising it to 3 or more would point past the end of the list. The knob changes which directive late touches receive, not how many model calls a touch makes.
    Risk: Set too low, touches past the end of the arc repeat the opener or the value pitch and the sequence reads as repetitive; set past the last index, the fallback itself is out of range.

  • Knob: cadence_days array (default [0, 4, 7] for the finance vertical).
    Bounds: Controls the minimum wait between touches (in days). The first gap is always zero; the second and third gaps set the spacing of the sequence.
    Effect: Shortening the gaps (e.g., [0, 2, 4]) compresses the campaign, potentially increasing delivery throughput but risking lower engagement and higher suppression rates. Lengthening them (e.g., [0, 7, 14]) reduces send rate per contact, lowering immediate throughput but improving compliance with throttling rules.
    Risk: Too tight may appear spammy and trigger bounces; too loose may cause the contact to forget the context, reducing conversion and extending campaign duration.

  • Knob: Evidence snippet length (the {evidence} placeholder is “max 300 chars”).
    Bounds: Limits the number of characters fed into the hook directive.
    Effect: A longer snippet (up to 300) gives the drafting LLM more grounded detail, improving personalization but increasing prompt tokens and thus dollar cost per draft. A shorter snippet saves tokens but may omit the signal needed for a credible opener, causing the faithfulness gate to strike more sentences.
    Risk: Exceeding 300 chars is cut off by the replacement logic, losing the bespoke signal; below a minimum it may degrade to a generic hook, reducing personalization.

  • Knob: Size of VERTICAL_SEQUENCE_DEFS (the number of vertical‑level sequence definitions).
    Bounds: Determines how many distinct outreach arcs are available; each entry is a static dict in the source file.
    Effect: Adding more verticals does not slow the lookup (a dict lookup is O(1) on average) but uses more memory and expands the configuration surface. Fewer verticals means more contacts get no sequence definition and drop to the generic draft prompt, reducing personalization but keeping the registry small.
    Risk: Too many verticals without corresponding sub-niche definitions makes the configuration hard to maintain; a vertical tag that no longer matches a key fails silently (the lookup returns None). Too few forces every contact into an ill-fitting arc.

  • Knob: Sub-niche resolution fallback (implicit in the “additive” rule: a missing, unmatched, or None sub-niche falls back to the vertical-level definition).
    Bounds: Acts as a binary enable/disable for niche-specific sequences. When a sub-niche tag exists and a matching key is present in the nested map, a tighter sequence is used; otherwise the broader vertical arc is chosen.
    Effect: Enforcing strict sub-niche matching (no fallback) would leave every contact without an exact sub-niche match with no plan at all. The current fallback trades specificity for robustness. Turning the fallback off would guarantee that any plan returned is the narrowest arc, but contacts whose niche tag changed after a taxonomy update would get no plan.
    Risk: If the fallback were removed, any niche tag that no longer matches a key would silently produce no touches; with the fallback, a stale niche still uses the vertical arc, possibly losing personalization depth.
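
The fallback_step knob can be illustrated with a sketch of a clamped directive lookup. This is not the project's get_step_directive: the function name, the sequence id, and the directive strings are placeholders. Only the cadence [0, 4, 7] and the fallback step 2 echo the finance defaults quoted above.

```python
from typing import Any, Optional

# Hypothetical definition with placeholder directive strings.
FINANCE_SEQ: dict[str, Any] = {
    "sequence_id": "finance-example",
    "touch_angles": ["problem-opener", "value-step", "soft-close"],
    "steps": ["<opener directive>", "<value directive>", "<soft-close directive>"],
    "cadence_days": [0, 4, 7],
    "fallback_step": 2,
}


def pick_directive(seq_def: Optional[dict[str, Any]], step: int) -> Optional[str]:
    """Return the directive for `step`, reusing fallback_step when step is past the end."""
    if not seq_def or step < 0:
        return None
    steps = seq_def.get("steps", [])
    if not steps:
        return None
    if step >= len(steps):
        last = len(steps) - 1
        # Clamp the fallback too, so a bad config value cannot index out of range.
        step = max(0, min(seq_def.get("fallback_step", last), last))
    return steps[step]


print(pick_directive(FINANCE_SEQ, 1))  # -> <value directive>
print(pick_directive(FINANCE_SEQ, 5))  # -> <soft-close directive>
```

Clamping the fallback value itself is a defensive choice in this sketch; whether the real lookup does the same is not stated in the guide.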

Failure modes — what breaks, what catches it

Niche Tag No Longer Matches Any Definition After Taxonomy Changes

  • Trigger – A contact profile carries a narrow niche tag (e.g., legal-pi-demand) that was valid when the taxonomy was last built, but the taxonomy has since been updated and that tag no longer maps to any tighter sequence variant.
  • Guard – The deterministic lookup first tries to find a tighter sequence for the niche tag. When no definition matches, get_sequence_def falls back to the vertical-level definition in VERTICAL_SEQUENCE_DEFS; it does not raise.
  • Posture – Fail-soft: the contact degrades to the vertical's default arc and loses only the tighter variant.
  • Operator signal – No distinct error log or metric is named. The span attribute email.compose.vertical logs the vertical slug, and the lookup step in the trace shows the niche key next to the vertical-level plan that was selected. The operator may also notice a lower open rate on that contact's touches compared with other contacts in the same niche.
  • Recovery – Update the contact's niche tag to a current definition (a tag-refresh job can do this for every affected profile), or add the missing variant to the sequence definitions. The lookup itself does not retry.

Empty Post Text Leaves the Opener With Nothing Real to Stand On

  • Trigger – The hook step is invoked with a post text argument that is empty or None (e.g., when the contact has no recent public post and no job description was supplied).
  • Guard – The hook step “reads the supplied post text and picks exactly one concrete hook”; when the text is empty or only whitespace, extract_hook skips the model call and returns an empty hook. The failure mode is explicitly stated: “an empty post text, which leaves the opener with nothing real to stand on.”
  • Posture – Fail-soft: the opener drafts without a concrete hook, producing a generic greeting instead of a grounded personalization.
  • Operator signal – The extraction log records whether the post text had content, so an empty input is visible there. An operator might also observe that the drafted opener lacks the typical specificity of a grounded hook; no counter or error is explicitly mentioned.
  • Recovery – No automated retry. Supply post text for the contact, or fall back to another personalization signal such as the contact's role or industry. The system continues with a generic opener for that touch.

Step Index Past the End of the Sequence

  • Trigger – The drafting step is called with a step index that exceeds the number of touches defined in the sequence for that vertical. For example, a three‑step arc is used but an index of 3 (zero‑based) is passed, or the sequence definition was truncated.
  • Guard – The drafting code performs a “directive lookup for the current step from the sequence definitions”; when the index is past the end, get_step_directive clamps it to the sequence's fallback_step, so a directive is still returned. The failure mode is explicitly “a step index past the end of the sequence.”
  • Posture – Fail-soft: the touch is drafted with the fallback directive (the soft close when fallback_step is 2). A negative index returns None, and the draft node uses its generic prompt.
  • Operator signal – No error is raised, and the source does not name a specific log field or metric. The visible symptom is late touches that all share the same directive.
  • Recovery – Manual: correct the sequence definition so it has the expected number of steps, or fix the step-index logic that produced the out-of-range value.

Missing Vertical Slug in the Hook Templates Falls Back to a Generic Intro

  • Trigger – The contact’s vertical (e.g., a new vertical not yet added to configuration) is not present as a key in the VERTICAL_HOOK_TEMPLATES dictionary.
  • Guard – The source states: “A missing / unknown vertical silently falls back to the generic intro.” This fallback is inherent in the lookup logic – no explicit try/except or conditional is named, but the behavior is documented.
  • Posture – Fail‑soft: the system uses the generic opener template instead of a vertical‑specific problem hook. The rest of the sequence proceeds normally.
  • Operator signal – The counter email.compose.vertical_hook_rate is incremented only when a hook is found; for a missing vertical it is not incremented. The span attribute email.compose.vertical logs the unknown slug, but no error is raised. An operator might notice the hook rate dropping for that vertical.
  • Recovery – Manual: add a new entry to VERTICAL_HOOK_TEMPLATES for the missing vertical slug. No automated retry or alternative fallback is triggered.

Interview — could you explain it?

Q1
How does the system decide which three-step sequence to use for a given contact, and what ensures a specific vertical gets its own planned arc rather than falling back to a generic default?

A
The function get_sequence_def takes a vertical and an optional sub_niche. When a sub-niche is provided, it first checks SUB_NICHE_SEQUENCE_DEFS[vertical][sub_niche]; if that exists, the tailored nested sequence is returned. Otherwise, it falls back to VERTICAL_SEQUENCE_DEFS[vertical] which contains the vertical-level arcs. A missing vertical returns None, and the caller then uses a generic draft prompt—so the system is deterministic, inspectable, and still specific to the vertical or sub-niche.

Follow-up
What happens when no sequence definition exists for the given vertical at all?
Answer
get_sequence_def returns None, and get_step_directive also returns None, causing the draft node to use a fallback generic prompt instead of a vertical-specific directive.

Weak answer misses
The actual fallback behavior: get_sequence_def returns None only when vertical is falsy or not in the dict—it does not default to a universal sequence; instead the draft node handles None directives via its own generic path.


Q2
The design uses a deterministic lookup (vertical → sub-niche → sequence definitions) instead of letting the model pick a sequence per touch. Why this way and not the obvious alternative of having the LLM itself decide the arc?

A
The source explains that a deterministic lookup is “repeatable, inspectable, and still specific”, whereas an LLM-invented sequence would be “unrepeatable and impossible to audit”. The other rejected alternative, a single hand-written sequence for everyone, would ignore differences between buyer personas (e.g., legal demand-letter vs. voice-operations); a deterministic lookup gives a fixed, auditable per-segment arc instead. The failure mode is a niche tag that no longer matches any definition after taxonomy changes, but that is a known, bounded risk.

Follow-up
How does this deterministic approach handle multiple touches within one sequence without repeating the same message?
Answer
Each sequence definition contains three distinct step directives (opener, value, soft close) indexed by touch number, so touches 0 to 2 never share a directive. If the step index exceeds the defined steps, get_step_directive clamps to the sequence's configurable fallback_step, so later touches reuse that one directive rather than failing.

Weak answer misses
The key detail that the lookup is not a single global map but a two-level map (VERTICAL_SEQUENCE_DEFS then SUB_NICHE_SEQUENCE_DEFS), and that the fallback_step is a per-sequence field, not a hardcoded constant.


Q3
How does the system use a sub-niche tag to select a tighter variant of the arc, and what exactly changes between the vertical-level and sub-niche-level sequences?

A
In get_sequence_def, when sub_niche is provided, it looks inside SUB_NICHE_SEQUENCE_DEFS[vertical] for that exact key (string must match micro_verticals.py’s sub_niche tuple). If found, the entire sequence definition—including three separate step directives, touch angles, cadence days, and a fallback step—is replaced. For example, in the construction vertical, the sub-niche “AI bid management and bid-leveling for general contractors and owners” has its own tailored steps that reference bid comparison and scope‑gap detection, whereas the vertical‑level definition might have a more generic opener.

Follow-up
What happens if the sub_niche string from the contact profile does not exist in SUB_NICHE_SEQUENCE_DEFS for that vertical?
Answer
get_sequence_def falls back to the vertical‑level definition from VERTICAL_SEQUENCE_DEFS—it does not error, and every caller retains the same behavior as before the sub‑niche lookup was introduced.

Weak answer misses
The requirement that sub_niche keys must match the exact strings in micro_verticals.py; a mismatch silently falls back, it does not raise an exception.
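
Because a mismatched key falls back silently, an offline audit is one way to surface it. The sketch below is hypothetical: the function name, the profile shape, and the sample definitions are assumptions for illustration, not part of the engine.

```python
from typing import Any


def find_orphaned_sub_niches(
    profiles: list[dict[str, Any]],
    sub_niche_defs: dict[str, dict[str, Any]],
) -> list[dict[str, Any]]:
    """Return profiles whose sub_niche has no definition under their vertical."""
    orphans = []
    for profile in profiles:
        sub_niche = profile.get("sub_niche")
        if not sub_niche:
            continue  # no tag at all: the vertical-level plan is the intended one
        known = sub_niche_defs.get(profile.get("vertical") or "", {})
        if sub_niche not in known:  # exact string match, as the lookup requires
            orphans.append(profile)
    return orphans


defs = {"construction": {"bid-management": {"sequence_id": "construction-bids"}}}
profiles = [
    {"id": 1, "vertical": "construction", "sub_niche": "bid-management"},
    {"id": 2, "vertical": "construction", "sub_niche": "Bid Management"},  # case differs
    {"id": 3, "vertical": "construction", "sub_niche": None},
]
print([p["id"] for p in find_orphaned_sub_niches(profiles, defs)])  # -> [2]
```

The result is the list a tag-refresh job would need to fix.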


Q4
The three steps are explicitly labelled as opener, value, and soft close. How does the system ensure that the LLM respects that role for each touch, especially when the step directive is generic?

A
The function get_step_directive returns the per-step directive string from the sequence definition (e.g., “Write a short initial cold email … Lead with the pain…”) for steps 0, 1, and 2. If the step index exceeds the number of defined directives, it clamps to fallback_step (e.g., 2 in most sequences). The directive is then injected into the draft node’s system prompt, giving the model a concrete instruction for that touch’s role. When no directive exists (vertical is None or step index negative), get_step_directive returns None, and the draft node uses its generic prompt—which still flips into a job‑application framing if an opportunity is linked.

Follow-up
What is the fallback behavior if the directive is None but the step index is within the defined steps?
Answer
That case cannot occur because get_sequence_def must return a sequence def for the directive to be looked up; if get_sequence_def returns a valid dict, the steps list always contains three entries (per the source examples), so a valid directive is always returned for index 0–2.

Weak answer misses
The existence of a separate get_sequence_def call within get_step_directive, and the fact that fallback_step is per‑sequence (e.g., 2 for most, but configurable), not a global constant.


Q5
The opener step is supposed to start with a specific problem the buyer faces. How does the system guarantee that the model uses a grounded problem statement rather than inventing a generic compliment?

A
The hook extraction step (described in the “Extracting The Hook” section of outreach.input.md) reads the supplied post text and picks exactly one concrete hook using only that text. It then hands only that fact forward, not the full context, which limits the drafting model's room to hallucinate details. Additionally, the vertical-specific hook templates in email_compose_graph.py (e.g., VERTICAL_HOOK_TEMPLATES["legal-pi-demand"]) are injected into the system prompt as an “OPENING HOOK DIRECTIVE” that instructs the LLM to “Open with the specific problem this company addresses” and to cite the evidence. The faithfulness gate (DeepEval) later enforces that every personalized claim is supported.

Follow-up
What happens if the supplied post text is empty?
Answer
The hook extraction step has nothing to ground the opener on; the source explicitly lists this as a failure mode: “empty post text … leaves the opener with nothing real to stand on.”

Weak answer misses
The precise mechanism: a separate extraction step (not integrated into the draft prompt) selects a single hook before drafting begins, and the hook template is a static config string that only fills a {evidence} placeholder from the enrichment snippet.

07. Extracting The Hook

A cold email opens on one specific fact about the recipient, and without that fact, the message reads as generic. People delete generic email before finishing the first sentence.

The system has a hook step, and it reads the supplied post text. That text might be a recent public post or a job description. The step picks exactly one concrete hook, and that hook grounds the opener. The step may only use the supplied text, which keeps the personalization grounded and not invented.

You have three ways to write an opener. You can hard code a template, which is safe but obviously generic. You can let the drafting model read the whole context and invent the opener, which is fluent but prone to creating a detail the recipient never shared. Or you can extract one grounded fact before drafting, then hand only that fact forward, which keeps the opener honest and easy to inspect.

The extraction approach was a deliberate choice, and the team weighed two alternatives against it. A fixed template is reliable, and it always produces something, but it cannot adjust to the person on the other side of the email. Letting the model write the opener from full context creates text that sounds natural. Yet the model might invent a detail the recipient never shared, and that would break trust. Extraction solves both problems at once. It gives the model one real fact to start with. The fact comes from the supplied text, so it is grounded. The model still writes the opener in its own words, so it sounds natural. An inspector can also see which fact was chosen and confirm it is real, and this transparency matters for compliance.

The first failure mode is a post text with no content. When the supplied text is empty, the hook step has nothing to extract, and the opener has no real fact to stand on. The system should detect this case and raise a clear signal so the drafter knows. The second failure mode is a post text that is too long or too broad. The model might pick a minor detail that does not capture the main point of the post, and the opener would then feel off target. The solution is to keep the extraction focused on a single concrete fact. The step also logs which fact was chosen, so an operator can inspect the log and adjust if needed.

The system logs each extraction. It records whether the post text had content, how many candidate facts the step considered, and which one it picked. This gives the team a way to debug. When a cold email sounds generic, they can check the log and see whether the step had any material to work with. They can also see which fact it chose and decide whether that fact was a good fit.

Use extraction when the supplied post text has at least one concrete fact and you need the opener to be both grounded and inspectable. Do not use extraction when the post text is empty, and in that case, fall back to a different personalization signal, like the contact's role or industry.

<!-- mem:begin -->

Generate it: The hook step may only use the _________ text, which keeps the personalization grounded and not invented. (cue: _________ text; answer: supplied)

Generate it: The step picks exactly one concrete hook, and that hook ________ the opener. (cue: ________ the opener; answer: grounds)

Ask yourself: Why extract one grounded fact before drafting instead of letting the model write the opener from full context?

Answer: Reading full context produces fluent text but the model may invent a detail the recipient never shared, breaking trust; extraction hands the model one real fact from the supplied text, so the opener stays grounded and an inspector can confirm the chosen fact is real.

Recall check (try before reading the answer):

  1. What is the first failure mode of the hook step, and how should the system respond? Answer: A post text with no content leaves nothing to extract; the system should detect the empty case and raise a clear signal so the drafter knows.

  2. What goes wrong when the post text is too long or too broad? Answer: The model may pick a minor detail that misses the post's main point, so the opener feels off target; keep extraction focused on a single concrete fact.

  3. What does the system log about each extraction? Answer: Whether the post text had content, how many candidate facts it considered, and which fact it picked.

Looking back: In "Adaptive Cadence," what does the code do after the model proposes a gap in days? Answer: It clamps the proposal to a safe range so no message fires the same day and no silence stretches past six months.

<!-- mem:end -->

The hook step extracts one concrete fact from the supplied post text to ground the cold email opener.

python
async def extract_hook(state: EmailOutreachState) -> dict:
    llm = make_llm()
    post = (state.get("post_text") or "")[:4000]
    if not post.strip():
        return {"hook": ""}
    result = await ainvoke_json(
        llm,
        [
            {
                "role": "system",
                "content": (
                    "Pick one specific, concrete hook from the provided LinkedIn post — a claim, "
                    "metric, or opinion to reference in a cold email. Return JSON `{\"hook\": \"...\"}` "
                    "with a single sentence, no quotes."
                ),
            },
            {"role": "user", "content": post},
        ],
    )
    # Anything other than a dict from the model call is treated as "no hook".
    return {"hook": result.get("hook", "") if isinstance(result, dict) else ""}

ELI5 — the plain-language version

Think of this step like a detective walking into a cluttered room and picking up exactly one piece of real evidence—a fingerprint, a receipt—to build the first question. Everything else stays on the table. That single, verified fact makes the opening feel personal, not like a form letter.

Concretely, the hook step reads a supplied post text—maybe the recipient’s recent LinkedIn article or a job description—and selects just one concrete hook from that text. The rule is strict: the hook can only use the supplied words, nothing invented. That one grounded fact then feeds into the email’s opener. Hard-coding a template would be safe but obviously generic; letting the model invent a detail from the whole context would risk a lie. This step keeps the opener honest and inspectable.

Without it, the opener would have no real fact to stand on. The failure mode is exactly that: if the supplied post text is empty, there’s nothing to extract, and the first sentence falls back to something generic. People delete generic emails before finishing the first sentence. So the system guards against a painful beginner’s experience: writing to someone and being ignored because the opening felt mass-produced.

Data flow — one request, in order
  1. outreach graph – entry point of the request; receives the contact identifier and the supplied post text.

    • reads / writes – consumes contact data and the supplied post text; returns a subject line, plain-text body, HTML body, skip reason, engagement signal, and next touch time.
    • branch – happy path proceeds inside the graph; an early skip reason short‑circuits the entire graph if the contact is suppressed or stop conditions are met.
  2. contact lookup – retrieves the single contact record associated with the request.

    • reads / writes – reads contact fields (name, role, vertical, sequence step); writes the contact into the graph state for downstream steps.
    • branch – happy path continues; if the contact is not found, the graph may skip or return an error (not detailed in source, but implied by “fail closed”).
  3. suppression gate – checks whether the contact is allowed to be contacted based on suppression entries and stop conditions.

    • reads / writes – reads suppression list and stop‑condition flags; writes a skip reason if triggered.
    • branch – happy path: no suppression, no stop; proceeds. Failure path: sets skip reason and exits the graph early, returning the skip reason in the output.
  4. sequence planning – deterministically selects the sequence steps using VERTICAL_SEQUENCE_DEFS based on the contact’s vertical and current sequence_step.

    • reads / writes – reads vertical and sequence_step from contact state; writes the touch_angles and the directive for the current step index.
    • branch – happy path: vertical matches a defined sequence; uses the steps list. Empty/failure path: falls back to a generic draft node (step 0) or the last defined step if the sequence step is beyond the list.
  5. hook step – reads the supplied post text and picks exactly one concrete hook to ground the opener.

    • reads / writes – reads supplied post text (a recent public post or job description); writes the chosen hook string into state.
    • branch – happy path: non-empty post text yields one hook. Failure path: empty or whitespace-only post text makes extract_hook return an empty hook string without calling the model, leaving the opener with no real fact (failure mode: “empty post text”).
  6. drafting step – looks up the directive for the current sequence step from VERTICAL_SEQUENCE_DEFS["steps"] and writes the email body grounded on the chosen hook.

    • reads / writes – reads the directive for the current step, the hook, and contact fields (recipient_name, recipient_role, vertical_context); writes the draft body and subject line.
    • branch – happy path: directive exists for the current step. Failure path: step index past the end of the steps list; falls back to the fallback_step (wraps to last step). Also, if no per‑step directive applies, falls back to a generic drafting step (which can flip into job‑application framing when an opportunity is linked).
  7. faithfulness gate – uses a judge model to audit the draft against the assembled evidence; removes any sentence whose claim is not supported.

    • reads / writes – reads the draft body and the evidence set (assembled earlier); writes a score between zero and one as feedback, and returns a cleaned draft with unsupported sentences removed.
    • branch – happy path: all claims supported – draft unchanged. Failure paths: over‑aggressive judge strips a true but tersely worded claim; evidence set missing a real fact leaves a supported claim removed. The gate posts the score for ranking.
  8. output – the outreach graph assembles the final output fields.

    • reads / writes – reads the final draft, hook, and bookkeeping from state; writes the return tuple: subject line, plain-text body, HTML body, skip reason (or None), engagement signal, and next touch time.
    • branch – terminal step; no further branches. The graph never sends – sending is a separate caller decision.
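
The suppression gate in step 3 fails closed, and that behavior can be sketched in a few lines. This is a minimal sketch under stated assumptions: `suppression_gate`, the injected `is_suppressed` callable, and the two skip-reason strings are hypothetical, not the system's real names.

```python
from typing import Callable, Optional


def suppression_gate(
    contact_id: str,
    is_suppressed: Callable[[str], bool],
) -> Optional[str]:
    """Return a skip reason, or None when drafting may proceed.

    Fails closed: if the check itself raises, the contact is treated
    as suppressed.
    """
    try:
        if is_suppressed(contact_id):
            return "suppressed"
    except Exception:
        # The check could not complete, so do not risk a wrong send.
        return "suppression_check_failed"
    return None
```

A non-None return value is what short-circuits the graph and ends up in the skip reason field of the output.
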
System design — mechanism, invariant, trade-off

The subsystem begins with the hook step, which reads the supplied post text—a recent public post or a job description—and selects exactly one concrete hook. That single fact grounds the opener of the cold email. Crucially, the step may only use the supplied text; no external knowledge is permitted. This ordered mechanism runs before any drafting; the chosen hook is the only personalization passed forward. On failure—specifically when the post text is empty—the hook step yields nothing, and the opener is left with no grounded fact.

The invariant the design preserves is that every personalized claim must be grounded in the supplied text and not invented. The hook step enforces this by extracting a single fact from the recipient’s own words before any drafting occurs. This keeps the opener honest and inspectable: an auditor can verify that the hook came directly from the source. Because the hook comes from a model call, identical input is not guaranteed to produce an identical hook, so repeatability depends on how the model is configured.

The key trade-off rejects the alternative of letting the drafting model freelance the opener from the whole context, which is fluent but prone to inventing a detail the recipient never shared. Hard-coding an opener template is safe but obviously generic. The chosen design extracts one grounded fact before drafting, handing only that fact forward, which avoids the cost of a confident fabrication that would erode trust and create compliance problems. The flexibility of full-context invention is sacrificed for auditability and truthfulness.

One concrete failure mode is an empty post text, which leaves the opener with nothing real to stand on. An operator would observe the system producing an email with a generic or missing opener, and the trace would show extract_hook returning an empty hook string or the draft falling back to a non-grounded template. The signal is a blank hook field in the workflow’s state, which indicates that the supplied post text was absent.

Cost & performance — the real knobs

VERTICAL_HOOK_TEMPLATES — the static dictionary constant that maps vertical slugs to directive strings.

  • Bounds — limits the available hook strategies to the defined verticals; each entry imposes a maximum token length for the instruction injected into the drafting prompt.
  • Effect — adding more verticals or lengthening a directive increases prompt token count and thus cost per draft call; a shorter, terse directive reduces cost but may yield less grounded hooks.
  • Risk — a missing vertical silently falls back to a generic intro, making the opener less personalized and increasing chance of deletion; excessive verbosity raises latency and dollar cost without improving outcome.

max 300 chars (enrichment evidence snippet length) — the explicit bound on the {evidence} placeholder replaced at compose time from recipient_context.

  • Bounds — caps how many characters of enrichment evidence enter the hook directive. A lower limit shrinks the prompt and saves tokens and money; a higher limit provides more signal.
  • Effect — raising this bound increases per-draft token cost but may produce a more relevant hook; lowering it cuts cost but risks losing the specific fact that grounds the opener.
  • Risk — setting it too low leaves the model with insufficient evidence, leading to a generic or fabricated claim; setting it too high inflates costs for no gain if the extra text is irrelevant.
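
The 300-character bound can be sketched as a cap applied just before the placeholder is filled. This is an illustration only: `fill_evidence` is a hypothetical helper, and collapsing whitespace before truncating is an assumption made for the example.

```python
MAX_EVIDENCE_CHARS = 300  # the bound named above


def fill_evidence(template: str, evidence: str, max_chars: int = MAX_EVIDENCE_CHARS) -> str:
    """Replace the {evidence} placeholder with a length-capped snippet."""
    # Collapse runs of whitespace first so the cap is spent on content.
    snippet = " ".join(evidence.split())[:max_chars]
    # str.replace rather than str.format: the evidence is untrusted and
    # may itself contain braces.
    return template.replace("{evidence}", snippet)
```

Raising `max_chars` lets more of the evidence through at a higher token cost per draft; lowering it risks cutting off the one fact that grounds the opener.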

wrap_untrusted — the function (called in gather_context) that fences the enrichment evidence before it reaches the draft node.

  • Bounds — defines the sanitization layer; determines what characters or structures are allowed through to the hook step.
  • Effect — a more permissive fence increases risk of prompt injection but may preserve more of the evidence; a stricter fence may truncate or escape parts of the evidence, reducing the usable signal and degrading hook quality.
  • Risk — too loose allows injection of malicious content into the hook directive, potentially causing the model to generate unsafe copy; too aggressive strips valid evidence, leaving the hook with nothing real to stand on and triggering the empty-post-text failure mode.
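
To make the fencing trade-off concrete, here is one way such a fence could look. The guide does not show the body of wrap_untrusted, so `wrap_untrusted_sketch`, the `<untrusted>` delimiters, and the escaping strategy are all assumptions for illustration.

```python
import html

OPEN_TAG = "<untrusted>"    # hypothetical delimiter
CLOSE_TAG = "</untrusted>"  # hypothetical delimiter


def wrap_untrusted_sketch(text: str) -> str:
    """Fence untrusted text so a prompt can tell data from instructions."""
    if not text or not text.strip():
        return ""
    # Escape &, <, and > so the content cannot contain a closing
    # delimiter and break out of the fence early.
    cleaned = html.escape(text.strip(), quote=False)
    return f"{OPEN_TAG}\n{cleaned}\n{CLOSE_TAG}"
```

Escaping keeps every character of the evidence recoverable, which sits at the permissive end of the trade-off. A stricter fence would drop or truncate suspicious content and lose signal.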

DeepEval — the judge model used by the faithfulness gate to audit each claim in the draft against the assembled evidence.

  • Bounds — adds one extra LLM call per drafted email; the judge model choice (e.g., a smaller cheaper model vs. a frontier model) controls the cost and accuracy of fabrication detection.
  • Effect — using a cheaper judge model reduces per-email cost but may miss subtle fabrications or over-aggressively strip true claims (the over-aggressive judge failure mode); a stronger model increases cost but improves grounding.
  • Risk — mis-setting the judge model (too weak) allows unsupported claims to slip through, eroding trust; too strong (or too aggressive) strips true claims, making the email incomplete or less effective.
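
The removal behavior of the faithfulness gate can be sketched with the judge stubbed out as a plain callable. This simplifies the real gate in several ways, all assumptions: it judges every sentence rather than only personalized claims, it scores by the fraction of sentences kept, and `strip_unsupported` and its `kill_switch` parameter are hypothetical names.

```python
import re
from typing import Callable

UNSUPPORTED = "UNSUPPORTED"


def strip_unsupported(
    body: str,
    judge: Callable[[str], str],
    kill_switch: bool = False,
) -> tuple[str, float]:
    """Drop sentences the judge tags UNSUPPORTED; return (clean_body, score)."""
    if kill_switch:
        # Short-circuit: return the body untouched with a perfect score.
        return body, 1.0
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    if not sentences:
        return body, 1.0
    kept = [s for s in sentences if judge(s) != UNSUPPORTED]
    return " ".join(kept), len(kept) / len(sentences)
```

Swapping in a cheaper or stronger judge changes only the `judge` callable, which is where the cost and accuracy knob described above lives.
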
Failure modes — what breaks, what catches it

Chapter: Extracting the Hook

The hook step is designed to read the supplied post text and pick exactly one concrete, grounded fact. The chapter text names two failure modes for this subsystem: an empty post text, and a post that is too long or too broad. Below are four distinct failures, ordered from most to least likely. Where no guard, signal, or recovery is documented, that is stated plainly rather than invented.


1. Empty supplied post text

  • Trigger – Empty post text (the supplied post text is missing, null, zero-length, or whitespace-only). The source states: “The failure mode is an empty post text, which leaves the opener with nothing real to stand on.”
  • Guard – extract_hook checks for empty or whitespace-only post text and returns {"hook": ""} before calling the model. No exception is raised and no fallback signal is chosen inside this step.
  • Posture – Fail-soft (degrades and continues). The system does not abort; instead the opener “has nothing real to stand on,” meaning the email proceeds with a generic or ungrounded opener.
  • Operator signal – The chapter text says each extraction logs whether the post text had content, but the code excerpt shows no log call, metric, or error field. The reliable signal is an empty hook value in the graph state, alongside the downstream effect (a generic opener).
  • Recovery – No automatic recovery is specified. Manual review of the outgoing email or enrichment of the contact’s post text would be required before re-triggering.

2. Supplied post text is non‑empty but contains no concrete hook

  • Trigger – The post text exists (e.g., a very short or generic post) but lacks a specific, verifiable detail about the recipient that can serve as a hook. The step is designed to pick “exactly one concrete hook”; if none exists, no valid hook can be extracted.
  • Guard – No guard is shown in the source. There is no validation that the extracted hook is non‑empty or that the post text contains at least one concrete claim.
  • Posture – Fail-soft (degrades and continues). The step would likely return a null or placeholder, causing the opener to be grounded on nothing – identical to the empty‑text case.
  • Operator signal – Not documented. The operator would see the same generic‑opener symptom with no indication that the post text was present but unusable.
  • Recovery – None specified. The failure is silent; recovery depends on noticing the poor opener and re‑enriching the contact’s information.

3. Supplied post text contains ambiguous or contradictory hooks, and the step picks the wrong one

  • Trigger – The post text includes multiple distinct concrete details (e.g., two different job changes or projects), and the extraction logic selects a hook that is not the most relevant or is factually outdated. The source mandates “picks exactly one concrete hook” but does not describe disambiguation rules.
  • Guard – No guard is shown. The source does not define any validation of hook relevance or a fallback for ambiguous input.
  • Posture – Fail-soft. The system continues with a grounded (but incorrect) hook, leading to an opener that is personalized to a wrong detail. Trust may be eroded.
  • Operator signal – No documented signal. The operator would only detect the failure if the email is manually inspected and the hook is found to be mismatched.
  • Recovery – Not specified. Manual correction is needed; no automated retry or fallback is described.

4. Software exception during hook extraction (e.g., parsing failure, upstream data error)

  • Trigger – An unhandled exception occurs when reading the supplied post text or applying the extraction logic (e.g., null pointer, format error, timeout). The source does not explicitly list this failure, but it is a realistic runtime condition.
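  • Guard – The only defensive code shown is the isinstance check on the model result, which turns a non-dict response into an empty hook. No exception handler, retry, or timeout is shown for the hook step, and the graph runtime mentioned in other sections is not referenced here.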
  • Posture – Likely fail-hard (aborts the run) if unhandled, because the step would crash without any error recovery in the source. Alternatively, it could be fail‑soft if the calling graph catches the exception – but no such catch is documented.
  • Operator signal – Not specified. A typical runtime would log an unhandled exception with a stack trace, but no specific error identifier is given in the source.
  • Recovery – No automated recovery. The run would stop, and the contact may not receive any email for that sequence step. Manual restart after fixing the data or code is required.
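
Since no guard is documented for the fourth failure, here is one shape a fail-soft guard could take. This is a suggestion sketched under assumptions, not the system's behavior: `extract_hook_safely`, the `hook_error` key, the logger name, and the default timeout are all hypothetical.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("outreach.hook")  # hypothetical logger name


async def extract_hook_safely(
    extract: Callable[[dict], Awaitable[dict]],
    state: dict,
    timeout_s: float = 20.0,
) -> dict:
    """Run a hook extractor; on error or timeout, continue without a hook."""
    try:
        result = await asyncio.wait_for(extract(state), timeout=timeout_s)
    except Exception:
        # Covers timeouts and extractor errors. Task cancellation is not
        # an Exception subclass, so it still propagates.
        logger.exception("hook extraction failed; continuing without a hook")
        return {"hook": "", "hook_error": True}
    hook = result.get("hook", "") if isinstance(result, dict) else ""
    return {"hook": hook, "hook_error": False}
```

The explicit `hook_error` flag gives an operator a way to tell a crashed extraction apart from a post that was simply empty, which the blank hook alone cannot do.
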
Interview — could you explain it?

Q
What is the purpose of the hook step in the email composition system, and what input does it rely on?

A
The hook step reads the supplied post text — a recent public post or a job description — and picks exactly one concrete fact to ground the email opener. This ensures the opening is specific and true about the recipient, not a generic compliment. The step is defined in the “Extracting the hook” description, which states it “may only use the supplied text” to keep the personalization grounded and not invented.

Follow-up
What happens if the post text is empty?
A
Empty post text is identified as a failure mode: it leaves the opener with “nothing real to stand on.” In the code, extract_hook returns an empty hook string in that case.

Weak answer misses
A shallow answer omits that the hook step exclusively uses the supplied post text and does not pull from other context; it also fails to name “Extracting the hook” as the design chapter that codifies this rule.


Q
Why does the system extract one grounded fact before drafting, rather than using a hard-coded opener template or letting the model freely use all context? (Design “why this way and not the obvious alternative”)

A
Hard-coding an opener is safe but obviously generic, while letting the model freelance from the full context is fluent but prone to inventing a detail the recipient never shared. Extracting one grounded fact and handing only that fact forward keeps the opener “honest and easy to inspect.” This trade-off is explicitly analyzed in the “Extracting the hook” section.

Follow-up
How does the system enforce that only the supplied text is used?
A
The hook step passes only the supplied post text (truncated to 4,000 characters) to the model as user content, so the extraction has nothing else to draw on. The faithfulness gate later audits every personalized claim against the evidence and removes unsupported sentences.

Weak answer misses
A weak answer does not cite the exact failure modes (“generic” vs. “invented”) or the phrase “handing only that fact forward” as the core design rationale.


Q
How do VERTICAL_HOOK_TEMPLATES ensure that the hook is grounded in the recipient’s specific evidence?

A
Each vertical hook template includes a placeholder {evidence} that is replaced at compose time with an enrichment evidence snippet (max 300 chars) from recipient_context. The template instructs the LLM to “ground the hook in the company’s evidence” and cite the problem, not credentials. This ties the hook to a concrete, verifiable detail per vertical.

Follow-up
What observability is logged when a vertical hook is applied?
A
The vertical slug is logged as the span attribute email.compose.vertical and the counter email.compose.vertical_hook_rate is incremented when a hook is found.

Weak answer misses
A shallow answer omits the exact max length (300 chars), the span attribute name, and the fact that the template is static config with only {evidence} filled from untrusted data.


Q
What mechanism prevents the LLM from inventing unsupported claims in the email body that are not backed by evidence?

A
The faithfulness gate, implemented in the composition graph, uses a judge LLM that identifies every personalized claim and tags it with a source_field from recipient_context or "UNSUPPORTED". Unsupported claims are suppressed (the sentence is removed) before the email is returned. If LLM_KILL_SWITCH=1 is set, the node short-circuits and returns the unmodified body with faithfulness_score=1.0.

Follow-up
What PII protection is in place for the faithfulness audit?
A
Only source-field IDs and grounded booleans are emitted to OTel/LangSmith; the raw claim text is never output.

Weak answer misses
A weak answer does not mention the sentence removal mechanism, the rollback via LLM_KILL_SWITCH=1, or the PII rule that excludes raw claim text from telemetry.


Q
Why does the system use a deterministic lookup for the email sequence rather than having the model invent a sequence each time? (Design trade-off)

A
The design notes state that an invented sequence would be “flexible but unrepeatable and impossible to audit,” whereas a deterministic lookup is “repeatable, inspectable, and still specific.” A single hand-written sequence for everyone ignores how differently verticals read cold emails, so a lookup per vertical provides specific, auditable sequences.

Follow-up
What is the failure mode of this deterministic approach?
A
The failure mode is a niche tag that no longer matches any definition after the taxonomy changes, leaving a vertical unhandled.

Weak answer misses
A shallow answer omits the exact phrase “repeatable, inspectable, and still specific” and the specific failure mode (“niche tag no longer matches after taxonomy changes”).

08. Drafting The Step

The drafting step writes the body for exactly this touch in the sequence. It writes for this vertical. And it grounds its copy on the hook you already extracted.

This step looks up the directive for the current step from the sequence definitions. The directive tells the model what role this email plays. Is it an opener? Is it a value pitch? Or is it a soft close? The copy fits that role, so a follow-up never just repeats the first email.

When no directive exists for the current step, the system falls back to a generic drafting prompt. That generic prompt also flips into a job application framing when an opportunity is linked. So one engine writes either a sales touch or an application. You do not need two separate systems.

Three design options exist. Option one: write one prompt for every step. That is simple but produces three emails that all say the same thing. Option two: write a separate prompt for each vertical and each step, with no fallback. That is precise but brittle. You need to maintain dozens of prompts, and any missing one crashes the run. Option three, which the system uses: a directive lookup with a generic fallback. It is specific where a directive exists and safe everywhere else.

The failure mode to watch is a step index past the end of the sequence. If the sequence defines three touches but the run asks for touch four, the lookup finds no directive and falls back to the generic prompt. That generic prompt may produce weak copy because it does not know this is a follow-up. It lacks the arc context. The detection signal is a log line showing the step index exceeded the sequence length. The blast radius stays small: one email in one run. It does not corrupt other contacts or other sequences.

The operational reality is that the directive lookup runs as a pure function with no I/O. Its cold start cost is negligible. You can trace it by reading the step index and the directive key in the run logs.
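
The lookup and its fallback can be sketched as two small functions. Everything here is illustrative and assumed: the table contents, `lookup_directive`, `directive_or_fallback`, zero-based step indices, and the log wording are hypothetical, and the real get_step_directive also takes a sub-niche argument that this sketch omits.

```python
import logging
from typing import Optional

logger = logging.getLogger("outreach.draft")  # hypothetical logger name

# Hypothetical sequence definitions; the real table is not shown in this guide.
SEQUENCE_DEFS = {
    "saas": (
        "opener: reference the hook in the first sentence",
        "value pitch: name one concrete outcome",
        "soft close: offer an easy way to decline",
    ),
}
GENERIC_DIRECTIVE = "generic: write a short, polite outreach email"


def lookup_directive(vertical: Optional[str], step: int) -> Optional[str]:
    """Pure lookup: no I/O, returns None when nothing matches."""
    steps = SEQUENCE_DEFS.get(vertical or "", ())
    return steps[step] if 0 <= step < len(steps) else None


def directive_or_fallback(vertical: Optional[str], step: int) -> tuple[str, bool]:
    """Return (directive, used_fallback), logging an out-of-range step."""
    directive = lookup_directive(vertical, step)
    if directive is not None:
        return directive, False
    seq_len = len(SEQUENCE_DEFS.get(vertical or "", ()))
    if seq_len and step >= seq_len:
        logger.warning("step index %d exceeded sequence length %d", step, seq_len)
    return GENERIC_DIRECTIVE, True
```

Keeping the log call in the caller leaves the lookup itself pure, so it stays trivial to test and trace.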

The design rationale here is a deliberate choice. The team considered a single prompt for all steps but rejected it because every email would sound the same. They considered per vertical, per step prompts but rejected them because the maintenance burden grows linearly with every new vertical. The directive lookup with a generic fallback balances specificity against brittleness. It is the right choice when verticals multiply faster than you can write prompts.

Here is the transferable rule. Use a directive lookup with a generic fallback when you have more than three step roles and the same engine serves multiple contexts. Do not use it when each step demands completely different prompt anatomy and you have the team size to maintain every prompt pair.

<!-- mem:begin -->

Generate it: The directive tells the model what ____ this email plays — opener, value pitch, or soft close. (cue: what ____; answer: role)

Generate it: The generic fallback prompt flips into a job application framing when an _____________ is linked. (cue: an _____________; answer: opportunity)

Ask yourself: Why look up a per-step directive instead of using one prompt for every step?

Answer: A single prompt makes the three emails all say the same thing, so a follow-up just repeats the opener; the directive tells the model the role of this touch so each email fits its place in the arc.

Recall check (try before reading the answer):

  1. Why are per-vertical, per-step prompts with no fallback rejected? Answer: They are brittle — you must maintain dozens of prompts, and any missing one crashes the run.

  2. What happens when the step index runs past the end of the sequence? Answer: The lookup finds no directive and falls back to the generic prompt, which lacks the arc context and may produce weak follow-up copy.

  3. What is the detection signal for that step-index failure? Answer: A log line showing the step index exceeded the sequence length.

<!-- mem:end -->

The draft_step node looks up the per-vertical, per-step directive and falls back to the generic draft node when no directive exists.

python
async def draft_step(state: EmailOutreachState) -> dict:
    """V38: per-vertical, per-step copy generation."""
    company_vertical = state.get("company_vertical")
    sequence_step = state.get("sequence_step")
    sub_niche = state.get("sub_niche")

    if state.get("application_mode"):
        return await draft(state)

    directive = get_step_directive(company_vertical, sequence_step, sub_niche)
    if not directive:
        # No vertical or step directive — delegate to the generic draft node.
        return await draft(state)

    llm = make_llm()
    tone = state.get("tone") or "professional and friendly"
    recipient_name = state.get("recipient_name", "") or ""
    recipient_role = state.get("recipient_role", "") or "unknown"
    # … (memory recall, hook fencing, then LLM call with directive)
ELI5 — the plain-language version

Think of a chef preparing a three-course meal. Each course has a specific role—appetizer, main, dessert—and the chef follows a different recipe for each one, using the same base ingredient (the hook) but adjusting the preparation to fit that course. The drafting step works the same way. It looks up a directive from the sequence definitions that tells the model whether this email is an opener, a value pitch, or a soft close. The copy is then written to fit that role, so a follow-up never just repeats the first email. When no directive exists for a particular step, the system falls back to a generic drafting prompt—safe but less tailored. Without this directive lookup, the system would either use a single prompt for every step, producing three identical emails, or require a separate prompt for every possible case, making it brittle. A beginner would feel the failure as a confused recipient who gets the same pitch twice, wasting the chance to build a real conversation.

Data flow — one request, in order
  1. Entry to the drafting node (draft_step in email_outreach_graph.py, which receives EmailOutreachState).

    • reads / writes: reads state["company_vertical"], state["sequence_step"], state["sub_niche"], state["application_mode"]; writes nothing yet.
    • branch: if application_mode is true, the node immediately delegates to draft(state) (job‑application framing) and returns – the happy path proceeds when application_mode is false.
  2. get_step_directive(company_vertical, sequence_step, sub_niche) – a pure lookup that returns the LLM directive string for the current vertical, step index, and sub‑niche, or None if no directive exists.

    • reads / writes: reads the three passed state keys; returns a string or None.
    • branch: if the return is None, the node delegates to draft(state) (generic fallback) – the happy path continues with a non‑None directive.
  3. make_llm() – instantiates the LLM client; can raise LlmDisabledError when the kill switch is engaged.

    • reads / writes: no state access; returns a configured LLM object.
    • branch: kill switch engaged → early exception terminates the graph; happy path obtains the LLM.
  4. Read tone, recipient_name, recipient_role, and contact_id from state – the node extracts tone (default "professional and friendly"), recipient_name (default ""), recipient_role (default "unknown"), and contact_id.

    • reads / writes: reads state["tone"], state["recipient_name"], state["recipient_role"], state["contact_id"]; no writes.
  5. recall(store, query, contact_id, allow_inbound=False) – long‑term memory recall, fail‑open (returns empty string on failure).

    • reads / writes: builds query from recipient_name, recipient_role, and first 200 chars of state["post_text"]; writes memory_context variable (string).
    • branch: if contact_id is missing or query is empty, memory_context remains "".
  6. wrap_untrusted(hook_raw) and wrap_untrusted(post_raw) – fences any inbound or enriched text to prevent prompt injection.

    • reads / writes: reads state["hook"] → hook_raw and state["post_text"] (first 1000 chars) → post_raw; writes hook_safe and post_safe (wrapped strings, or "none"/"" if empty).
  7. Construct the LLM prompt – the node assembles the directive string, memory section, hook_safe, post_safe, tone, and recipient info into a prompt (exact format not shown in the source, but uses the retrieved directive).

    • reads / writes: reads all previously gathered variables; no state writes yet.
  8. Invoke the LLM via ainvoke_json (imported from llm.client) – sends the prompt to the LLM and receives a JSON‑structured draft.

    • reads / writes: consumes the prompt; produces a draft object with expected keys: subject, text, html.
    • branch: LLM call may raise LlmDisabledError; happy path returns the draft.
  9. faithfulness_check (imported from email_compose_graph.py) – inspects every personalized sentence in the draft against the supplied evidence (hook, post_text).

    • reads / writes: reads the draft text and the original evidence; returns a pass/fail verdict.
    • branch: if the verdict is “fail”, the node may rewrite or return the draft with a skip_reason; happy path passes.
  10. post_faithfulness_feedback (imported from email_compose_graph.py) – logs or handles the result of the faithfulness check, possibly triggering a retry.

    • reads / writes: reads the verdict; writes a feedback record (implementation detail not shown).
    • branch: on persistent failure, the draft may be marked as skip_reason; happy path continues with the clean draft.
  11. Write the final draft to state – the node writes subject, text, html, contact_id, and skip_reason (empty or a skip reason) into the outgoing EmailOutreachState.

    • reads / writes: reads the draft and contact_id; writes state["subject"], state["text"], state["html"], state["contact_id"], state["skip_reason"].
    • terminal step: the node returns the updated state. Approval and sending stay with the caller, because the graph never sends.
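
Step 5 above, the fail-open memory recall, can be sketched as follows. This is a simplified illustration: `build_recall_query` and `recall_fail_open` are hypothetical helpers, and the injected `recall_fn` takes only a query and a contact id, unlike the real recall call, which also takes a store and an allow_inbound flag.

```python
from typing import Callable


def build_recall_query(recipient_name: str, recipient_role: str, post_text: str) -> str:
    """Join name, role, and the first 200 characters of the post text."""
    parts = [recipient_name.strip(), recipient_role.strip(), post_text[:200].strip()]
    return " ".join(p for p in parts if p)


def recall_fail_open(
    recall_fn: Callable[[str, str], str],
    query: str,
    contact_id: str,
) -> str:
    """Return recalled memory, or "" when recall cannot run or fails."""
    if not contact_id or not query:
        return ""
    try:
        return recall_fn(query, contact_id) or ""
    except Exception:
        # Fail open: missing memory must not block the draft.
        return ""
```

Failing open is the right posture here because memory only enriches the draft. Contrast that with the suppression gate, which fails closed because a wrong answer there risks a wrong send.
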
System design — mechanism, invariant, trade-off

The drafting step is invoked only after the hook has been extracted and the vertical sequence has been selected deterministically. Its first action is a directive lookup against the sequence definitions, using the current step index and the contact’s vertical to find a per-step instruction that names the touch’s role — opener, value, or soft close. That directive is injected into the drafting prompt so the model writes copy that fits that specific role, ensuring a follow-up never repeats the first email. If no directive exists for the current step, the system falls back to a generic drafting prompt; that same generic prompt also flips into a job-application framing when an opportunity is linked, so the engine can write either a sales touch or an application without a separate code path. On failure — for example, a step index that is past the end of the sequence — the lookup returns nothing, and the draft falls to the generic fallback, which may produce copy misaligned with the intended arc.

The invariant the drafting step preserves is that every touch with a directive is role-specific and grounded on the previously extracted hook. The design achieves this through a directive lookup with a generic fallback: the output is always a plausible email for that step, even when the taxonomy misses a directive. The guarantee concerns the copy’s purpose, not idempotent writes: each touch is distinguishable from the others in the same sequence, so the recipient never receives three emails that “all say the same thing.” The fallback also ensures that a missing directive never crashes the system, and the generic path extends cleanly to job applications, giving one engine two modes without branching prompt logic.

The key trade-off rejects two obvious alternatives. A single prompt for every step would produce three emails that say the same thing; repetitive, non-adaptive copy undermines multi-touch outreach. At the opposite end, a separate prompt per vertical and per step with no fallback would be precise but “brittle”: it means maintaining and testing a large, rigid prompt matrix that breaks every time a vertical or step is added. The chosen approach, a directive lookup with a generic fallback, is “specific where a directive exists and safe everywhere else.” It accepts a less precise draft when a directive is missing and avoids both the repetition of a single prompt and the maintenance burden of a fully enumerated matrix.

A concrete failure mode is a step index past the end of the sequence. An operator would see this as a trace or log entry where the drafting step’s directive lookup returns empty, the fallback is triggered, and the resulting draft is generic rather than role-specific — for instance, a third touch that reads like an opener instead of a soft close. The signal is a missing or misaligned directive in the per-step sequence definitions, observable via the span attributes logged for the drafting step, and the trace would show the fallback path being taken rather than a per-step directive.

Cost & performance — the real knobs

The drafting step generates copy with a single model call per touch, so its cost and latency are governed by the prompt length and the number of calls made. The source reveals four explicit performance knobs that control these factors.

sender_mode

  • Knob — sender_mode parameter; default "job-seek" (from _select_draft_prompt docstring: “Defaults to job-seek”).
  • Bounds — Limits which system prompt is injected: the sales‑collaborator persona (DRAFT_SYSTEM_PROMPT_SALES_COLLAB) or the job‑seeker persona (DRAFT_SYSTEM_PROMPT_JOB_SEEK).
  • Effect — Switching to "sales-collab" uses a longer, domain‑specific prompt that may increase token count per call, raising both latency and dollar cost.
  • Risk — Mis‑setting (e.g., leaving "job-seek" in a sales campaign) produces copy with the wrong tone and intent, breaking the outreach sequence without raising an error.

fallback_step

  • Knob — fallback_step key inside each VERTICAL_SEQUENCE_DEFS entry; set to 2 in the accounting vertical example. When the key is absent, get_step_directive falls back to len(steps) - 1.
  • Bounds — Controls which step’s directive is applied when the current step index exceeds the sequence length (e.g., a step 4 when only 3 directives exist).
  • Effect — Raising it uses a later directive (say a “soft close”) for out‑of‑range touches, possibly producing more concise copy and lowering token cost; lowering it uses an earlier directive (e.g., “opener”) that may be longer.
  • Risk — Set too high, later touches receive an inappropriate directive that misleads the model; too low, the model repeats the opener in every follow‑up, wasting tokens and damaging conversion.
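
The clamping behavior of fallback_step can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's code: the shape of VERTICAL_SEQUENCE_DEFS, the 0-based step index, and the directive strings are hypothetical, and the sub_niche argument is left out. Only the names get_step_directive and fallback_step and the default of len(steps) - 1 come from the text.

```python
from typing import Optional

# Hypothetical sequence definitions; the real VERTICAL_SEQUENCE_DEFS is not shown here.
VERTICAL_SEQUENCE_DEFS = {
    "accounting": {
        "steps": [
            "Opener: lead with the extracted hook.",
            "Value: add one concrete benefit.",
            "Soft close: offer an easy way to decline.",
        ],
        "fallback_step": 2,
    },
}


def get_step_directive(vertical: str, step: int) -> Optional[str]:
    """Return the directive for a 0-based step, clamping out-of-range indexes."""
    seq = VERTICAL_SEQUENCE_DEFS.get(vertical)
    if seq is None or not seq.get("steps"):
        return None  # the caller falls back to the generic drafting prompt
    steps = seq["steps"]
    if not 0 <= step < len(steps):
        # Out-of-range touches reuse the configured fallback directive.
        fallback = seq.get("fallback_step", len(steps) - 1)
        step = max(0, min(fallback, len(steps) - 1))
    return steps[step]
```

With this shape, a touch past the end of the accounting sequence receives the soft-close directive instead of an error, and an unknown vertical returns None so the generic prompt takes over.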

VERTICAL_HOOK_TEMPLATES

  • Knob — VERTICAL_HOOK_TEMPLATES dictionary; each key–value pair is a static config entry (e.g., "legal-pi-demand" → a multi‑sentence instruction). No default: it is author‑defined.
  • Bounds — Limits the length and specificity of the opening‑hook instruction injected into the model’s system prompt. A longer template uses more tokens; a missing vertical skips the hook entirely.
  • Effect — Turning it up (adding verbose, example‑laden instructions) increases per‑call token count, raising latency and cost; turning it down (shorter templates) reduces both but may produce less grounded openers.
  • Risk — Too long a template risks exceeding the model’s context window or diluting the instruction; too short fails to ground the hook, leading to generic output and potential compliance failures caught by the later faithfulness gate.

VERTICAL_SEQUENCE_DEFS (steps list)

  • Knob — The steps list inside each VERTICAL_SEQUENCE_DEFS entry (e.g., three strings for the accounting vertical). No default length, but the source shows exactly three per vertical.
  • Bounds — Determines the number of model calls per contact in a sequence (each step triggers one draft). Each additional step adds latency and cost equal to one full model invocation.
  • Effect — Adding steps (lengthening the list) linearly increases throughput demand and total dollar cost per contact; removing steps reduces cost but may skip necessary touches.
  • Risk — Too many steps exhaust the contact before conversion and balloon cost; too few steps leave the sequence incomplete, failing to achieve the outreach goal.
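
The linear relationship between sequence length and drafting cost can be written down directly. A minimal sketch, assuming one drafting call per step as described above; the function names and the per-call price are hypothetical, and the judge call from the faithfulness gate is not counted.

```python
def draft_calls_per_contact(steps: list) -> int:
    """One drafting call per step in the sequence."""
    return len(steps)


def draft_cost_per_contact(steps: list, cost_per_call: float) -> float:
    """Total drafting cost for one contact, in the same unit as cost_per_call."""
    return draft_calls_per_contact(steps) * cost_per_call
```

Adding a fourth step to a three-step sequence raises both numbers by exactly one call's worth, which is why the steps list is the coarsest cost knob in this section.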
Failure modes — what breaks, what catches it

Step Index Past End of Sequence

  • Trigger — The drafting step receives a step index that exceeds the number of defined steps in the sequence for the given vertical.
  • Guard — No validation rejects the bad index itself. Elsewhere in this chapter, get_step_directive is described as clamping an out-of-range index to seq.get("fallback_step", len(steps) - 1), and when no directive comes back the drafting step uses the generic prompt.
  • Posture — fail‑soft. A draft is still produced, but it may not match the intended role for that touch.
  • Operator signal — No dedicated log line or metric is specified. The trace for the drafting step shows the fallback directive or the generic prompt in place of the expected per-step directive.
  • Recovery — Manual correction of the sequence definition or adjustment of the step index so it stays within the defined range.

Empty Hook from Extraction

  • Trigger — The hook extraction step (preceding drafting) returns an empty or null hook because the post text was empty. The source calls this “an empty post text, which leaves the opener with nothing real to stand on”.
  • Guard — None found in the drafting step itself. The hook step identifies the failure but no explicit guard exists in the drafting subsystem to handle a missing hook.
  • Posture — fail‑soft. The drafting step may proceed without a grounded hook, producing a generic opener that lacks personalization.
  • Operator signal — Silent absence of evidence in the draft. Later the faithfulness gate may flag unsupported claims, but the drafting step provides no immediate signal.
  • Recovery — The operator must ensure the post text is non‑empty before the hook extraction runs; alternatively, the drafting step could be modified to abort when the hook is empty.

Missing Per‑Step Directive

  • Trigger — The lookup for the current step’s directive in the sequence definitions returns no match (e.g., the step index has no associated directive).
  • Guard — The generic drafting prompt (described as “a directive lookup with a generic fallback” and “generic drafting step” in the source). This fallback is used when no per‑step directive applies.
  • Posture — fail‑soft. The system falls back to a generic prompt, producing less tailored copy but not aborting.
  • Operator signal — Not explicitly logged. The operator might notice the draft lacks step‑specific instructions (e.g., a follow‑up repeats the opener), but no distinct metric or log line is prescribed.
  • Recovery — Automatic fallback applied. The operator can add a directive for the step to improve specificity.

Vertical Sequence Not Defined

  • Trigger — The sequence selector “deterministically looks up a sequence for the contact’s vertical” but no sequence definition exists for that vertical slug.
  • Guard — No default sequence is described for an unknown vertical. The lookup described for get_step_directive returns None when the vertical is missing, which sends the draft down the generic prompt path.
  • Posture — fail‑soft for drafting: a generic draft is produced without a per-vertical arc. Whether other steps that depend on the sequence plan abort is not described.
  • Operator signal — Not explicitly described; the trace would likely show the generic fallback taken for every touch of that vertical.
  • Recovery — Manual addition of a sequence definition for that vertical in the configuration.

Directive Malformed or Unparseable

  • Trigger — A directive exists for the current step but contains instructions that are malformed, contradictory, or unparseable by the drafting model.
  • Guard — None found in the source. The system has no validation or error handling for the content of directives.
  • Posture — fail‑soft. The model may produce incoherent copy or ignore the malformed directive, degrading output quality.
  • Operator signal — The resulting draft may appear nonsense or off‑topic. The faithfulness gate later may flag unsupported claims, but no direct signal arises from the drafting step.
  • Recovery — Manual correction of the directive text by the operator.
Interview — could you explain it?

Q1 (warm-up)
How does the drafting node know what to write for a specific touch in the sequence?

A
It calls get_step_directive(vertical, step, sub_niche), which returns a per-step copy directive string from the sequence definitions. The directive tells the model the email’s role—opener, value, or soft close—so the body fits that touch’s purpose. If the vertical or step is missing, the function returns None, triggering a fallback generic prompt.

Follow-up
What happens when the step index is larger than the number of defined steps?
A
get_step_directive clamps the index to the value of seq.get("fallback_step", len(steps) - 1) before returning the directive.

Weak answer misses
The fallback step is not hard‑coded at 2; it reads from the fallback_step key per sequence.


Q2 (design: why this way, not the obvious alternative)
Why does the system use a directive lookup with a generic fallback instead of a separate hard‑coded prompt for every step in every vertical?

A
A separate prompt per vertical and step would be precise but brittle and hard to maintain. The directive lookup (get_step_directive) returns None when no directive exists, and the drafting node falls back to a generic prompt. This follows the design rule: “specific where a directive exists and safe everywhere else.”

Follow-up
How does the generic fallback handle job‑application vs. sales mode?
A
_select_draft_prompt(sender_mode) returns DRAFT_SYSTEM_PROMPT_SALES_COLLAB when sender_mode == "sales-collab", otherwise DRAFT_SYSTEM_PROMPT_JOB_SEEK.

Weak answer misses
The exact constants DRAFT_SYSTEM_PROMPT_SALES_COLLAB and DRAFT_SYSTEM_PROMPT_JOB_SEEK are what switch the persona, not a separate code path.


Q3 (grounding)
How does the drafting step ensure the email body is anchored on the extracted hook and evidence?

A
The hook step picks one concrete fact from the supplied post text. That fact is handed forward, and the drafting step’s directive includes a {evidence} placeholder. At compose time, that placeholder is replaced with the enrichment evidence snippet from recipient_context (max 300 chars). The hook templates in VERTICAL_HOOK_TEMPLATES instruct the LLM to open with the problem, grounded on that evidence.

Follow-up
What prevents the model from inventing details beyond the evidence?
A
The faithfulness gate (DeepEval) enforces that every personalized sentence must be supported by evidence. The hook extraction is also limited to only the recipient’s supplied text.

Weak answer misses
The naming of the faithfulness gate and the {evidence} placeholder—not just “some prompt” but a specific substitution from recipient_context.
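
The {evidence} substitution can be sketched as below. The 300-character cap and the placeholder name come from the text; the function name fill_evidence and the sample directive are hypothetical. The sketch uses str.replace instead of str.format so that braces inside recipient text cannot break the substitution, which is a design choice of this example and not a claim about the real code.

```python
MAX_EVIDENCE_CHARS = 300  # cap stated in the text


def fill_evidence(directive: str, recipient_context: str) -> str:
    """Replace the {evidence} slot with a trimmed, length-capped snippet."""
    snippet = recipient_context.strip()[:MAX_EVIDENCE_CHARS]
    return directive.replace("{evidence}", snippet)
```

A directive without the placeholder passes through unchanged, so the same helper is safe to call on every step.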


Q4 (deterministic vs. generative sequence)
Why not let the model invent the sequence of touches each time for flexibility, instead of a deterministic lookup?

A
A model‑invented sequence would be flexible but unrepeatable and impossible to audit; a single hand‑written sequence ignores how different recipients read cold emails. The deterministic lookup via get_sequence_def is repeatable, inspectable, and still specific per vertical and sub‑niche. The failure mode is a niche tag that no longer matches any definition after the taxonomy changes.

Follow-up
How does sub‑niche resolution work in V91?
A
get_sequence_def first checks SUB_NICHE_SEQUENCE_DEFS for a tailored sequence; if not found, it falls back to VERTICAL_SEQUENCE_DEFS.

Weak answer misses
The two dictionaries (SUB_NICHE_SEQUENCE_DEFS and VERTICAL_SEQUENCE_DEFS) and their precedence order.
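
The precedence order can be sketched in a few lines. The two dictionary names and the function name come from the text; the key shapes, the sample entries, and returning None when nothing matches are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical entries; the real definitions are not shown in this guide.
SUB_NICHE_SEQUENCE_DEFS = {
    "legal-pi-demand": {"steps": ["PI opener", "PI value", "PI soft close"]},
}
VERTICAL_SEQUENCE_DEFS = {
    "legal": {"steps": ["Legal opener", "Legal value", "Legal soft close"]},
}


def get_sequence_def(vertical: str, sub_niche: Optional[str] = None) -> Optional[dict]:
    """Prefer a sub-niche sequence; otherwise use the vertical-level one."""
    if sub_niche and sub_niche in SUB_NICHE_SEQUENCE_DEFS:
        return SUB_NICHE_SEQUENCE_DEFS[sub_niche]
    return VERTICAL_SEQUENCE_DEFS.get(vertical)
```

A niche tag that no longer matches any key falls through to the vertical entry here, which is how a taxonomy change can quietly downgrade a tailored sequence to a general one.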


Q5 (hard: design trade‑off)
The system extracts a single hook from the post text and uses only that fact for the opener. Why not let the drafting model use the full context for more fluent personalization?

A
Using the full context would be fluent but prone to inventing a detail the recipient never shared, which erodes trust and creates compliance risk. Extracting one grounded fact before drafting, via the hook step, keeps the opener honest and easy to inspect. The failure mode is an empty post text, which leaves the opener with nothing real to stand on.

Follow-up
What security measure prevents untrusted evidence from reaching the prompt?
A
The {evidence} slot is filled from recipient_context, which is already wrap_untrusted‑fenced in gather_context before it reaches the draft node.

Weak answer misses
The specific wrap_untrusted fencing—not just “input validation”—and that it happens in gather_context, not in the draft node itself.

09. The Faithfulness Gate

A personalized claim that is not true destroys trust and can cause a compliance problem, so every personalized sentence must have evidence behind it. The faithfulness gate uses a judge model to audit the draft. The judge checks each sentence against the assembled evidence, and it removes any sentence whose claim is not supported.

The gate produces a score between zero and one, and it posts that score as feedback. That feedback helps you rank prompt versions and model versions by how grounded each one stays. A higher score means more supported statements.

Three paths exist for grounding. Option one is to trust the drafting model to stay grounded, which is the cheapest path but ships a single confident fabrication. Option two is to run a keyword check, which is deterministic but blind to meaning. Option three is to use a judge that compares each claim to the evidence, which catches semantic fabrication but costs one extra model call.
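
To see why a keyword check is deterministic but blind to meaning, consider a minimal sketch. The function keyword_supported, its word-overlap rule, and the sample sentences are hypothetical illustrations and not part of the engine.

```python
import re


def keyword_supported(claim: str, evidence: str, min_overlap: int = 3) -> bool:
    """Naive check: a claim passes if it shares enough words with the evidence."""
    def tokens(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokens(claim) & tokens(evidence)) >= min_overlap


evidence = "We migrated our billing system to Postgres last quarter."
# The recipient never gave a talk, yet the claim reuses enough evidence words to pass.
fabricated = "I loved your talk on migrating the billing system to Postgres."
unrelated = "Congrats on opening the Berlin office."
```

Here keyword_supported(fabricated, evidence) is True because the sentences share "billing", "system", "to", and "postgres", while the invented talk goes unnoticed. A judge that compares the claim to the evidence is meant to catch exactly this case.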

The gate has two failure modes. First, a judge that is too strict may strip a true but short claim. The sentence was factual, but the judge did not find matching evidence exactly. Second, the evidence set might omit a real fact. The drafter used an accurate fact, but it was never loaded into the evidence, so the gate removes a sentence that should stay.

You can measure the judge itself. By tracking the gate scores over time, you see when the judge becomes too strict or too lenient, and then you can adjust or replace the judge model.
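
One way to watch for that drift is to compare a recent window of scores against a baseline window. This is a minimal sketch: the function name judge_drift, the tolerance, and the labels are hypothetical, and a shift in the mean can also come from a change in the drafter or the evidence, so the result is a prompt to investigate, not a verdict.

```python
from statistics import mean


def judge_drift(baseline: list, recent: list, tolerance: float = 0.1) -> str:
    """Compare mean gate scores between two windows of runs."""
    if not baseline or not recent:
        return "insufficient-data"
    delta = mean(recent) - mean(baseline)
    if delta < -tolerance:
        return "possibly-stricter"  # more claims are being stripped than before
    if delta > tolerance:
        return "possibly-more-lenient"  # fewer claims are being stripped than before
    return "stable"
```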

This gate sits after the drafting step, and it does not replace human review. It catches confident fabrications before the email enters the next stage. Use this gate when your system makes personalized claims based on user data. Do not use it when you have no evidence to compare against, or when the cost of one extra model call is too high for your latency budget.

<!-- mem:begin -->

Generate it: The judge checks each sentence against the assembled evidence and ________ any sentence whose claim is not supported. (cue: ________ any sentence; answer: removes)

Generate it: The gate produces a _____ between zero and one and posts it as feedback to rank prompt and model versions. (cue: a _____ between zero and one; answer: score)

Ask yourself: Why use a judge model rather than trusting the drafter or running a keyword check?

Answer: Trusting the drafter ships a single confident fabrication and a keyword check is blind to meaning; a judge compares each claim to the evidence, catching semantic fabrication at the cost of one extra model call.

Recall check (try before reading the answer):

  1. What are the gate's two failure modes? Answer: A too-strict judge strips a true but short claim, and an incomplete evidence set causes the gate to remove an accurate sentence whose fact was never loaded.

  2. How can you tell whether the judge itself is drifting too strict or too lenient? Answer: Track the gate scores over time; the trend reveals when to adjust or replace the judge model.

  3. Does the faithfulness gate replace human review? Answer: No — it catches confident fabrications before the email enters the next stage, but it sits alongside, not instead of, human review.

Looking back: In "Extracting The Hook," why extract one grounded fact before drafting? Answer: It gives the model one real fact from the supplied text, so the opener stays grounded and an inspector can confirm the fact is real.

<!-- mem:end -->

The faithfulness gate audits each personalized claim against evidence, removing unsupported sentences and scoring the draft.

python
async def faithfulness_check(state: EmailComposeState) -> dict:
    body: str = state.get("body") or state.get("draft") or state.get("text") or ""
    evidence_override: str = (state.get("faithfulness_evidence") or "").strip()
    raw_context: str = (state.get("recipient_context") or "").strip()
    context_summary: str = (state.get("context_summary") or "").strip()

    evidence_text = (
        evidence_override
        or context_summary
        or wrap_untrusted(raw_context, label="EVIDENCE")
    )

    if not body:
        return {"faithfulness_score": 1.0, "graph_meta": {"faithfulness_skipped": True}}

    if not (evidence_override or context_summary or raw_context):
        return {"faithfulness_score": 1.0, "graph_meta": {"faithfulness_skipped": True}}

    llm = make_llm(provider="deepseek", tier="standard", temperature=0.0)
    user_message = (
        f"EVIDENCE TEXT:\n{evidence_text}\n\n"
        f"EMAIL BODY TO AUDIT:\n{body}\n\n"
        "Return a JSON array of personalized-claim bindings as specified."
    )
    result, tel = await ainvoke_json_with_telemetry(
        llm,
        [
            {"role": "system", "content": FAITHFULNESS_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        provider="deepseek",
    )
    # … parse bindings, compute score, suppress unsupported claims …
    # (the elided code defines bindings, suppressed_count, and faithfulness_score)
    final_body = body
    unsupported_sentences = {b.claim_sentence for b in bindings if not b.grounded}
    if unsupported_sentences:
        for sentence in unsupported_sentences:
            final_body = final_body.replace(sentence, "")
        final_body = re.sub(r"\n{3,}", "\n\n", final_body).strip()

    return {
        "body": final_body,
        "suppressed_claim_count": suppressed_count,
        "faithfulness_score": faithfulness_score,
    }
ELI5 — the plain-language version

Think of the Faithfulness Gate like a meticulous librarian who checks every claim in a book report against the actual books on the shelf. Before the email goes out, a judge model—a specialized AI—reads each personalized sentence and asks: “Is this fact backed by the recipient’s actual LinkedIn post, job title, or public article?” For each claim, the judge tags the exact evidence phrase or marks it “UNSUPPORTED.” Any unsupported sentence is removed from the email entirely, and the system computes a faithfulness score: the number of grounded claims divided by total claims. This score gets posted as feedback so teams can compare which prompt or model versions stay most truthful. Without this gate, the drafting model might invent a plausible-sounding detail—like “I loved your talk on quantum scalability” when the recipient never gave that talk. That single fabricated claim would break trust instantly, kill the deal, and could even create a compliance headache. The gate keeps every statement honest, making the email feel credible and safe. The failure mode? An over-eager judge that strips a true but terse claim, or evidence that’s missing a real fact—both would weaken the personalization unnecessarily.

Data flow — one request, in order
  1. faithfulness gate node (in email_compose_graph.py, V40 implementation) is called after the drafting step. It receives the current body and the recipient_context from the working state.

    • reads / writes: Inputs from state body (email draft), recipient_context (evidence). Outputs to state: mutated body (filtered), faithfulness_score.
    • branch: First checks the environment variable LLM_KILL_SWITCH. If set to "1", the node short-circuits: returns the unmodified body unchanged and sets faithfulness_score = 1.0. The happy path proceeds to step 2.
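  2. Construct judge LLM prompt using the constant FAITHFULNESS_SYSTEM_PROMPT. The user message carries the current body as EMAIL BODY TO AUDIT and the evidence as EVIDENCE TEXT. In the excerpt above, the evidence is the first non-empty value among faithfulness_evidence, context_summary, and recipient_context (the last one wrapped with wrap_untrusted).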

    • reads / writes: Reads body and recipient_context to fill the prompt. No state mutation yet.
    • branch: No branch; always construct unless short-circuited.
  3. Call the judge LLM with the constructed prompt. The judge returns a JSON array of objects, each containing keys claim_sentence, source_field, and grounded.

    • reads / writes: No state access; uses external LLM call.
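    • branch: If the LLM call fails or returns malformed JSON, the behavior is not shown in the excerpt. Returning the unmodified body with faithfulness_score = 1.0 would be fail‑open, not fail‑closed, because an unaudited draft would pass; the failure-mode analysis in this chapter finds no guard and treats the case as likely fail‑hard. Happy path receives a valid array.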
  4. Parse and iterate over the JSON array. For each object, extract claim_sentence, source_field, and grounded.

    • reads / writes: Reads the LLM output. No state mutation yet.
    • branch: If the array is empty (no personalized claims found), treat faithfulness_score = 1.0 because no claims to ground. Happy path has at least one claim.
  5. Identify unsupported claims. For each claim where grounded is false or source_field equals "UNSUPPORTED", mark that sentence for removal.

    • reads / writes: Scans parsed claims. No state mutation.
    • branch: No branch; all claims are evaluated.
  6. Remove unsupported sentences from body. The sentences identified in step 5 are deleted or replaced with an empty string. The remaining body contains only grounded personalized sentences plus generic sentences that were never flagged as claims.

    • reads / writes: Mutates body in working state. Also calculates total_claims and grounded_count.
    • branch: If all claims are unsupported, every personalized sentence is removed and only the generic sentences remain; the body becomes empty only when it contained nothing else. If all are grounded, body unchanged.
  7. Compute faithfulness_score as grounded_count / total_claims. The score is a float in [0, 1].

    • reads / writes: Uses counts from step 6. Writes faithfulness_score to state.
    • branch: If total_claims is 0, score is set to 1.0 (implicit default).
  8. Emit telemetry to OpenTelemetry/LangSmith: only the source_field IDs and grounded booleans are published, never the raw claim_sentence text (PII protection).

    • reads / writes: Reads the parsed claims (filtered for PII-safe fields). No state mutation.
    • branch: No branch; telemetry is always emitted if the LLM call succeeded.
  9. Return the updated state containing the filtered body and the faithfulness_score.

    • reads / writes: Final state returned to the graph runner. No further mutation.
    • branch: End of node; no loop or fan-out. Control returns to the compose graph, which then continues to the next node (e.g., refine or send).
  10. Note on reuse: The same faithfulness gate node is shared across the compose graph, outreach graph, and reply graph, as described in the “Composing and replying” section. The logic is identical; only the source of body and recipient_context changes per graph.

    • reads / writes: No new reads/writes; this is a structural note about fan-out across graphs, not a runtime step.
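
Steps 4 through 7 above can be sketched as one pure function. This is a sketch under assumptions, not the elided source: Binding stands in for _FaithfulnessBinding, apply_bindings is a hypothetical name, and the input is assumed to be the already-parsed JSON array. The score formula and the replace-then-collapse cleanup follow the text and the excerpt.

```python
import re
from dataclasses import dataclass


@dataclass
class Binding:  # stand-in for _FaithfulnessBinding
    claim_sentence: str
    source_field: str
    grounded: bool


def apply_bindings(body: str, raw: list) -> dict:
    """Strip unsupported sentences and compute grounded_count / total_claims."""
    bindings = [
        Binding(
            claim_sentence=str(item.get("claim_sentence", "")),
            source_field=str(item.get("source_field", "UNSUPPORTED")),
            grounded=item.get("grounded") is True,  # only a real JSON true counts
        )
        for item in raw
    ]
    unsupported = [
        b for b in bindings if not b.grounded or b.source_field == "UNSUPPORTED"
    ]
    total = len(bindings)
    score = (total - len(unsupported)) / total if total else 1.0

    final_body = body
    for b in unsupported:
        if b.claim_sentence:  # never call replace with an empty needle
            final_body = final_body.replace(b.claim_sentence, "")
    # Collapse the blank-line runs left behind by removed sentences.
    final_body = re.sub(r"\n{3,}", "\n\n", final_body).strip()

    return {
        "body": final_body,
        "suppressed_claim_count": len(unsupported),
        "faithfulness_score": score,
    }
```

Because removal relies on exact string matching, a judge that paraphrases a sentence instead of quoting it would leave the sentence in place while still lowering the score.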
Diagram — the real call graph
System design — mechanism, invariant, trade-off

The Faithfulness Gate operates as the final audit layer in the outreach pipeline, executing after the drafting step has produced a body for a given touch. The mechanism runs in order: the drafting node produces the email body, then the faithfulness judge, a LangGraph node that receives the body and the assembled evidence from recipient_context, iterates over each personalized claim. It applies the FAITHFULNESS_SYSTEM_PROMPT, which instructs a judge LLM to output a strict JSON array of _FaithfulnessBinding objects, each containing a claim_sentence string, a source_field string, and a grounded boolean. Any sentence tagged "UNSUPPORTED" (that is, grounded = false) is stripped from the final body. The gate then exposes two observability attributes, _FAITH_SCORE_KEY = "email.compose.faithfulness_score" and _FAITH_SUPPRESSED_KEY = "email.compose.suppressed_claims", and posts the score as feedback for ranking prompt and model versions. When the judge is halted via LLM_KILL_SWITCH=1, the node short-circuits, returns the unmodified body, and sets faithfulness_score = 1.0, deliberately bypassing the audit.

The design preserves a single invariant: “every personalized sentence must be supported by evidence.” This is enforced by the judge’s output: if the source_field is "UNSUPPORTED", the sentence is removed, guaranteeing that no claim without a concrete evidence trace survives to the final email. The gate produces a faithfulness score (grounded_count / total_claims ∈ [0,1]) that operators can monitor; any score below 1.0 indicates that at least one unsupported claim was stripped, and the suppressed count is recorded. The invariant is thus a write-boundary guarantee on the body: after the gate, every remaining personalized claim has a verifiable evidence anchor.

The key trade-off the gate makes is cost vs. safety: it rejects the obvious cheaper alternatives in favor of a semantic judge. “Trusting the drafting model to stay grounded is cheapest but ships a single confident fabrication”; a “keyword check is deterministic but blind to meaning.” The chosen design accepts the “cost of one extra model call” per touch in order to “catch[] semantic fabrication.” This rejection avoids the compliance and trust failure that a single fabricated claim would cause—a high-stakes outcome in B2B outreach where an invented detail about a recipient can trigger legal liability. The extra LLM call is the explicit price paid for a guarantee that the drafting model’s output is audited for truthfulness by a separate, dedicated judge.

A concrete failure mode an operator would actually see is the “over-aggressive judge that strips a true but tersely worded claim.” In that scenario, the operator observes a _FAITH_SUPPRESSED_KEY counter greater than zero and a _FAITH_SCORE_KEY below 1.0 for a touch where the evidence set did contain the fact but the judge marked it unsupported, perhaps because the claim was paraphrased or the evidence snippet was worded slightly differently. The operator would see an OTel span attribute like email.compose.suppressed_claims: 1 and a faithfulness score of, say, 0.75, alongside a final email body missing a statement they know to be true. That signal prompts tuning of the FAITHFULNESS_SYSTEM_PROMPT or a review of how evidence is assembled, so the true claim is not lost silently.

Cost & performance — the real knobs

The faithfulness gate in this subsystem spends time and money primarily on the extra model call needed to audit each personalized claim against the evidence. That judge LLM call adds latency and incurs inference cost per email. The gate also posts the resulting faithfulness score as feedback, which costs storage and processing overhead for observability pipelines. A budget‑conscious alternative—the kill‑switch mode—bypasses the judge entirely, saving both time and money at the expense of grounding.

Two real performance knobs appear in the source, each with an exact identifier:

LLM_KILL_SWITCH — This environment variable (default likely 0) is the most direct cost‑control knob. Bounds: it trades off faithfulness for latency and dollar cost; when set to 1, the entire LLM path is halted, the email body is returned unmodified, and faithfulness_score is forced to 1.0. Effect: turning it on eliminates the judge model call, cutting total latency per email by the time of one LLM inference and reducing per‑email cost by the price of that inference. Risk: the switch is binary, so the risk lies in leaving it on. Left at 1, it removes all claim‑level verification, which can allow fabricated personalization to reach the recipient, eroding trust and creating compliance exposure. Left at 0, it retains full grounding but always pays the extra model‑call cost.

FAITHFULNESS_SYSTEM_PROMPT — This Python constant is the judge’s instruction template. Bounds: its length and specificity control how many tokens are consumed on every audit call. Effect: a longer, more prescriptive prompt (like the one shown) improves accuracy of claim classification but increases per‑call token count and thus dollar cost; a shorter prompt saves tokens but may lead to missed or misclassified claims. Risk: too strict a prompt can flag legitimate, tersely‑worded claims as “UNSUPPORTED” and strip them, reducing personalization quality. Too lenient a prompt may let unsupported fabrications slip through, undermining the whole faithfulness goal.

Only these two identifiers are explicitly named in the source for this subsystem. Additional performance levers—such as concurrency limits on judge‑model calls, the choice of which LLM model serves as judge, or caching of judge responses—are absent from the provided context; tuning them would require extending the source with new configuration variables.
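
The kill-switch short-circuit can be sketched as a check that runs before any judge call. The variable name, the "1" value, the unmodified body, and the forced score of 1.0 come from the text; the function names and the graph_meta marker on this path are assumptions (the excerpt shows that marker only for the empty-body and empty-evidence skips).

```python
import os
from typing import Mapping, Optional


def kill_switch_active(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when LLM_KILL_SWITCH is exactly "1"."""
    env = os.environ if env is None else env
    return env.get("LLM_KILL_SWITCH", "0") == "1"


def short_circuit(body: str, env: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Return the bypass result when the switch is on, else None."""
    if kill_switch_active(env):
        return {
            "body": body,  # unmodified: no audit took place
            "faithfulness_score": 1.0,
            "graph_meta": {"faithfulness_skipped": True},  # assumed marker
        }
    return None  # the caller proceeds to the judge call
```

Marking the bypass explicitly matters because a forced 1.0 is otherwise indistinguishable from a fully grounded draft.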

Failure modes — what breaks, what catches it

Judge Model False Negative: Unsupported Claim Retained

  • Trigger — The judge LLM (invoked by FAITHFULNESS_SYSTEM_PROMPT) fails to detect a personalized claim that has no basis in the evidence, and labels it "grounded": true.
  • Guard — None. The faithfulness gate is the only auditor; no secondary check or retry is present in the source.
  • Posture — Fail-soft. The erroneous claim remains in body, and the email is sent as-is. The system continues with no stop or write refusal.
  • Operator signal — faithfulness_score may be misleadingly high (e.g., 1.0) even though the email contains a fabrication. No error log or metric is emitted; the false claim is visible only by manual audit of the evidence_bindings list (where grounded: true for an unsupported claim).
  • Recovery — Manual review via LangSmith UI; the operator can use update_feedback to correct the score and, if needed, edit the body. No automatic retry or fallback exists.

Judge Model False Positive: True Claim Suppressed

  • Trigger — The judge LLM marks a genuine personalized claim as "UNSUPPORTED" (e.g., because the evidence set omitted the fact, or the judge is overly strict). The sentence is removed from body, and suppressed_claim_count incremented.
  • Guard — None. No override or confidence threshold is shown; the judge’s decision is final.
  • Posture — Fail-soft. The email loses a true personalization but still sends. The system degrades in effectiveness rather than aborting.
  • Operator signal — faithfulness_score drops (e.g., below 1.0), suppressed_claim_count > 0. No error is logged; the operator would see the removal in the final body compared to the draft.
  • Recovery — Manual inspection of evidence_bindings to identify the suppressed sentence; possible remediation by enriching the evidence set or adjusting the judge prompt. No automatic recovery.

Judge LLM Call Failure (Timeout, Network Error, or Malformed Output)

  • Trigger — The underlying LLM API for the judge call (using FAITHFULNESS_SYSTEM_PROMPT) returns an HTTP error, times out, or outputs a string that cannot be parsed as the expected JSON array of claims.
  • Guard — None shown in the provided source. There is no try/except, retry loop, or fallback for the judge call itself. The only LLM kill switch (LLM_KILL_SWITCH) is operator‐set and does not handle transient errors.
  • Posture — Fail-hard (likely). Without a guard, the exception would propagate up the graph, aborting the email generation for that contact. The system does not continue with a modified body.
  • Operator signal — A Python exception traceback (e.g., json.JSONDecodeError, HTTPError) recorded in LangSmith’s run logs. No specific identifier from the source is used.
  • Recovery — Not specified; the operator must retry the run manually or restart the campaign. No automatic backoff or retry is indicated.

Evidence Set Omission

  • Trigger — The assembled recipient_context (evidence) is missing a concrete detail that the recipient actually posted or shared, so a true personalized claim is correctly flagged as unsupported and removed.
  • Guard — None. The faithfulness gate validates against whatever evidence is supplied; it cannot invent missing context.
  • Posture — Fail-soft. The email loses the grounded detail but continues. The system does not halt.
  • Operator signal — suppressed_claim_count > 0 and faithfulness_score < 1.0, combined with no errors; the operator might notice the absence of a known detail in the final email.
  • Recovery — Manual fix of the evidence gathering step (not shown in the source) or enrichment of the recipient_context before re‑triggering the compose graph.

Feedback Posting Failure

  • Trigger — The call to post_faithfulness_feedback (which wraps record_outcome_feedback) fails to reach LangSmith due to a network outage or API key misconfiguration.
  • Guard — The function is designed never to raise: it silently skips when the score is absent or was produced by the LLM_KILL_SWITCH short-circuit, and it uses the fail‑safe record_outcome_feedback wrapper, which swallows exceptions.
  • Posture — Fail-soft. The email is sent normally, but the faithfulness_score is not recorded as feedback. No system operation is blocked.
  • Operator signal — The absence of a feedback entry in LangSmith for that run; no error is logged (the function never raises). The operator may notice missing grounding metrics.
  • Recovery — Manually post feedback via LangSmith’s update_feedback API using the run id stored in emails.langsmith_run_id. No automatic retry is attempted.
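
The "never raises" behavior described above can be sketched as a small wrapper. This is an illustration only, not the source's post_faithfulness_feedback: the name post_feedback_safely, the kill_switched flag, and the post callable are assumptions introduced for the example.

```python
from typing import Callable, Optional


def post_feedback_safely(
    score: Optional[float],
    kill_switched: bool,
    post: Callable[[float], None],
) -> bool:
    """Return True only if feedback was posted; never raise."""
    if score is None or kill_switched:
        return False  # nothing trustworthy to record, so skip silently
    try:
        post(score)
    except Exception:
        return False  # swallowed: telemetry must not crash the caller
    return True


def failing_post(score: float) -> None:
    """Hypothetical poster that simulates a network outage."""
    raise ConnectionError("feedback endpoint unreachable")


posted = post_feedback_safely(0.75, False, failing_post)   # outage is swallowed
skipped = post_feedback_safely(None, False, failing_post)  # absent score is skipped
```

Because the wrapper only reports a boolean, a caller that wants to notice missing feedback has to check the return value itself; nothing is logged, which matches the operator signal described above.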

Kill Switch Engaged (LLM_KILL_SWITCH=1)

  • Trigger — An operator sets the environment variable LLM_KILL_SWITCH=1, causing the faithfulness gate node to short‑circuit.
  • Guard — The condition check at the top of the node: if LLM_KILL_SWITCH=1, the node returns body unmodified and sets faithfulness_score=1.0. This is the exact guard.
  • Posture — Fail-soft. The email is sent with no grounding audit; the score is set trivially to 1.0, and no claims are suppressed.
  • Operator signal — faithfulness_score is exactly 1.0 on every email, and suppressed_claim_count is 0. evidence_bindings may be empty. No error is raised.
  • Recovery — Unset LLM_KILL_SWITCH (or set it to 0) and re‑run the compose graph so the judge model audits the body again.
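
As a minimal sketch of the guard above, assuming the node receives and returns a plain dict: the function name faithfulness_gate_node, the run_judge stand-in, and the empty evidence_bindings list are illustrative assumptions, not code from the source.

```python
import os


def run_judge(body: str) -> dict:
    """Hypothetical stand-in for the judge-model audit; not implemented here."""
    raise NotImplementedError("the judge call is outside this sketch")


def faithfulness_gate_node(state: dict, env=os.environ) -> dict:
    """Short-circuit when LLM_KILL_SWITCH=1, otherwise defer to the judge."""
    if env.get("LLM_KILL_SWITCH") == "1":
        return {
            "body": state["body"],        # returned unmodified and un-audited
            "faithfulness_score": 1.0,    # trivial score, not a judged one
            "suppressed_claim_count": 0,
            "evidence_bindings": [],      # assumption: empty when short-circuited
        }
    return run_judge(state["body"])


# With the switch engaged, the body passes through untouched.
result = faithfulness_gate_node({"body": "Hi Ada."}, env={"LLM_KILL_SWITCH": "1"})
```

Passing env explicitly keeps the example testable without mutating the real process environment.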

Interview — could you explain it?

Q — Warm-up: What does the faithfulness gate do and what exact output does it produce for each email draft?
A — The faithfulness gate invokes a judge model via the FAITHFULNESS_SYSTEM_PROMPT to audit every personalized claim in the body against the assembled evidence. It returns the filtered body, a faithfulness_score computed as grounded_count / total_claims, a suppressed_claim_count, and evidence_bindings (serialized list of _FaithfulnessBinding objects).
Follow-up — How does the gate handle an email that contains no personalized claims at all?
Answer — The judge is instructed to return [] if there are no personalized claims; the post_faithfulness_feedback function then tags that email with value="no_claims" so trivially‑perfect empties stay distinguishable from an audited 1.0.
Weak answer misses — The exact scoring formula (grounded_count / total_claims) and the special "no_claims" tagging in post_faithfulness_feedback.
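
The scoring formula and the "no_claims" tag from this answer can be sketched as follows. The Binding dataclass is a simplified stand-in for _FaithfulnessBinding, and returning 1.0 for a zero-claim email is an assumption made for the example; the source only states that such emails are tagged "no_claims" in feedback.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Binding:
    """Simplified stand-in for _FaithfulnessBinding."""
    claim_sentence: str
    source_field: str
    grounded: bool


def faithfulness_score(bindings: list[Binding]) -> float:
    """grounded_count / total_claims; assumed 1.0 when there are no claims."""
    if not bindings:
        return 1.0
    grounded_count = sum(1 for b in bindings if b.grounded)
    return grounded_count / len(bindings)


def feedback_value(bindings: list[Binding]) -> Union[float, str]:
    """Tag zero-claim emails so they differ from an audited 1.0."""
    return "no_claims" if not bindings else faithfulness_score(bindings)


claims = [
    Binding("You shipped v3 last week.", "post_text", True),
    Binding("You moved to Oslo.", "UNSUPPORTED", False),
]
score = faithfulness_score(claims)  # one grounded claim out of two
```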


Q — Design question: Why use a judge LLM for faithfulness instead of a simple keyword check or just trusting the drafting model?
A — The outreach.input.md explicitly compares three paths: trusting the drafting model is cheapest but ships a single confident fabrication; a keyword check is deterministic but blind to meaning. The judge model (the mechanism that runs the FAITHFULNESS_SYSTEM_PROMPT) catches semantic fabrication at the cost of one extra model call, which is justified because an untrue personalized claim destroys trust and creates a compliance problem.
Follow-up — What is the specific failure mode of the judge‑based approach?
Answer — An over‑aggressive judge that strips a true but tersely worded claim, or an evidence set that omitted a real fact (both stated in outreach.input.md under “The faithfulness gate”).
Weak answer misses — Naming all three alternatives explicitly and quoting the exact failure‑mode descriptions from the source.


Q — How does the subsystem ensure that PII or raw claim text never leaks to observability (OTel / LangSmith)?
A — The code comments for _FAITH_SCORE_KEY and _FAITH_SUPPRESSED_KEY state that only source‑field IDs and grounded booleans are emitted; the serialized_bindings list strips the claim_sentence field, keeping only source_field and grounded. The _FaithfulnessBinding model’s claim_sentence is used only during the audit and is never exported.
Follow-up — What happens if the LLM_KILL_SWITCH environment variable is set?
Answer — The node short‑circuits, returns the unmodified body, sets faithfulness_score=1.0, and post_faithfulness_feedback skips posting feedback to avoid polluting the signal with an un‑audited score.
Weak answer misses — The exact attribute key names (_FAITH_SCORE_KEY, _FAITH_SUPPRESSED_KEY) and the kill‑switch skip logic in post_faithfulness_feedback.
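
The stripping step described in this answer might look like the sketch below. It assumes bindings are plain dicts; the function name serialize_bindings is hypothetical, while the field names come from the text above.

```python
def serialize_bindings(bindings: list[dict]) -> list[dict]:
    """Keep only source_field and grounded; drop the raw claim text."""
    return [
        {"source_field": b["source_field"], "grounded": b["grounded"]}
        for b in bindings
    ]


raw_bindings = [
    {"claim_sentence": "You shipped v3 last week.", "source_field": "post_text", "grounded": True},
    {"claim_sentence": "You moved to Oslo.", "source_field": "UNSUPPORTED", "grounded": False},
]
exported = serialize_bindings(raw_bindings)  # safe to attach to a trace
```

Building a new dict per binding, rather than deleting a key in place, means the audit-time objects keep their claim_sentence while the exported copies never contain it.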


Q — Harder: The judge prompt demands “strict JSON only — no markdown, no prose.” How is the judge’s output parsed and validated, and what structure does it require?
A — The FAITHFULNESS_SYSTEM_PROMPT demands a strict JSON array where each element has claim_sentence, source_field, and grounded. The code parses this into a list of _FaithfulnessBinding Pydantic models (with min_length=1 on both string fields), so any malformed or extra fields are rejected. Only personalized claims are included; generic sentences, greeting, CTA, and signature are explicitly excluded.
Follow-up — What is the consequence for a sentence that the judge marks as "UNSUPPORTED"?
Answer — That sentence is removed from the body (suppressed), the suppressed_claim_count is incremented, and the faithfulness_score decreases because unsupported claims are not counted as grounded.
Weak answer misses — The exact field names (claim_sentence, source_field, grounded) and the fact that source_field must be "UNSUPPORTED" for unsupported claims.
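
To make the parse-then-suppress flow concrete, here is a sketch that uses only the standard library in place of the Pydantic model the source mentions. The function names parse_judge_output and suppress_unsupported are hypothetical, and the whitespace cleanup is a simplification for the example.

```python
import json

REQUIRED_FIELDS = {"claim_sentence", "source_field", "grounded"}


def parse_judge_output(raw: str) -> list[dict]:
    """Parse the judge's strict-JSON array and validate each element."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of claims")
    for item in data:
        if not isinstance(item, dict) or set(item) != REQUIRED_FIELDS:
            raise ValueError(f"unexpected claim shape: {item!r}")
        for key in ("claim_sentence", "source_field"):
            # Mirrors min_length=1 on both string fields.
            if not isinstance(item[key], str) or not item[key]:
                raise ValueError(f"{key} must be a non-empty string")
        if not isinstance(item["grounded"], bool):
            raise ValueError("grounded must be a boolean")
    return data


def suppress_unsupported(body: str, claims: list[dict]) -> tuple[str, int]:
    """Remove each unsupported claim sentence and count the removals."""
    suppressed = 0
    for claim in claims:
        unsupported = claim["source_field"] == "UNSUPPORTED" or not claim["grounded"]
        if unsupported and claim["claim_sentence"] in body:
            body = body.replace(claim["claim_sentence"], "", 1)
            suppressed += 1
    # Simplification: collapse all whitespace left behind by a removal.
    return " ".join(body.split()), suppressed


raw = (
    '[{"claim_sentence": "You moved to Oslo.", "source_field": "UNSUPPORTED", "grounded": false},'
    ' {"claim_sentence": "You shipped v3.", "source_field": "post_text", "grounded": true}]'
)
draft = "Hi Ada. You shipped v3. You moved to Oslo. Worth a chat?"
clean_body, suppressed_claim_count = suppress_unsupported(draft, parse_judge_output(raw))
```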


Q — Hardest: The post_faithfulness_feedback function attaches feedback to a specific run. Why does it attach to the root run and not the node run, and what edge cases does it handle?
A — It attaches to the ROOT run via run_tree.trace_id so the feedback lands on the same run id stamped to emails.langsmith_run_id and captured by the campaign graph. It never raises (telemetry must not crash callers) and skips when the score is absent or was produced by the LLM_KILL_SWITCH / empty‑body short‑circuit. Zero‑claim emails are posted with value="no_claims" to keep them distinguishable from an audited 1.0.
Follow-up — Why is it important to distinguish zero‑claim emails from audited 1.0 scores?
Answer — So that reviewers can correctly rank prompt versions: a trivially‑perfect empty email should not inflate the average grounding of a model.
Weak answer misses — The exact trace‑id method (run_tree.trace_id) and the explicit "no_claims" tagging logic (both from the code comments in post_faithfulness_feedback).

10. Composing And Replying

Outreach is not the only flow. The system also writes batch emails and opportunity emails, and it answers inbound replies. All of them share the same grounding and safety discipline.

The compose graph drafts your message, then refines it. It strips phrases that sound like a machine. It tightens the subject line. Then the same faithfulness gate runs, the one you already know from outreach.

The reply graph first classifies the inbound message. It decides if the recipient is interested, raises an objection, or wants to unsubscribe. The classifier returns only a label. The routing is decided in code. So the model never directly chooses to halt or to send. That is a deliberate design choice. The unsubscribe path adds a suppression entry to the do-not-contact list.
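
A minimal sketch of label-to-handler routing, under the assumption that handlers are plain functions. The handler names and the strings they return are hypothetical placeholders; only the three labels come from the text above. The model supplies the label and nothing else, so the action is always chosen by the lookup.

```python
from typing import Callable


def handle_interested(message: str) -> str:
    return "draft_reply"  # placeholder action name


def handle_objection(message: str) -> str:
    return "draft_reply"  # placeholder action name


def handle_unsubscribe(message: str) -> str:
    return "add_suppression_entry"  # placeholder action name


# The fixed map: adding a label to the classifier requires adding a route here.
ROUTES: dict[str, Callable[[str], str]] = {
    "interested": handle_interested,
    "objection": handle_objection,
    "unsubscribe": handle_unsubscribe,
}


def route_reply(label: str, message: str) -> str:
    """Pick the handler in code; the model only supplied the label."""
    handler = ROUTES.get(label)
    if handler is None:
        # Surface the gap loudly instead of silently dropping the reply.
        raise KeyError(f"no handler for label: {label}")
    return handler(message)


action = route_reply("unsubscribe", "Please remove me from your list.")
```

Raising on an unknown label is one design choice among several; it turns the "new label with no route" failure into a visible error rather than a stranded message.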

Now you have three options for building these graphs. One option is a single graph that branches internally. That keeps everything dense but couples the flows together. A second option is separate graphs that each reimplement the gates. That keeps each flow clean, but you duplicate the safety logic. The third option is separate graphs that reuse shared steps. That gives clean separation with one copy of the safety rules. This is the design the system uses.

Think about the failure modes. A new sentiment label with no route will cause a message to fall through the cracks. A refine pass that removes too much can drop the signature from the email. Both failures are detectable: the first shows as an unhandled label in your traces, the second as a missing signature block in the output.

Now that you understand the options, here is the rule to remember. Use separate graphs with shared steps when you need clean separation and one copy of each safety gate. Do not use this shape when your classification labels grow faster than your routing code can follow. Avoid it too when the judge model is not reliable enough to be shared across all flows. The key is that the safety rules stay in one place, and each graph owns only its own flow logic.

<!-- mem:begin -->

Generate it: The reply graph's classifier returns only a _____, and the routing is decided in code, so the model never chooses to halt or send. (cue: only a _____; answer: label)

Generate it: The unsubscribe path adds a suppression entry to the do-not-_______ list. (cue: do-not-_______; answer: contact)

Ask yourself: Why have the classifier return only a label and decide routing in code?

Answer: So the model never directly chooses to halt or send — keeping the safety-critical routing decision in code rather than delegating it to the model.

Recall check (try before reading the answer):

  1. What two refinements does the compose graph apply before the faithfulness gate runs? Answer: It strips phrases that sound like a machine and tightens the subject line.

  2. Why are separate graphs that reuse shared steps chosen over separate graphs that each reimplement the gates? Answer: Reusing shared steps gives clean separation with one copy of the safety rules, avoiding the duplicated safety logic of reimplementation.

  3. How do the two failure modes here show up? Answer: A new sentiment label with no route surfaces as an unhandled label in traces; an over-aggressive refine pass surfaces as a missing signature block in the output.

Looking back: In "The Faithfulness Gate," what does the judge do to an unsupported sentence? Answer: It removes any sentence whose claim is not supported by the assembled evidence.

<!-- mem:end -->

The compose and reply graphs share the same faithfulness gate and evidence assembly, with the reply graph classifying inbound messages before routing by a fixed code map.

```python
# The compose graph drafts the message, then refines it by stripping
# machine-sounding phrases and tightening the subject, before the same
# faithfulness gate runs.
# The reply graph first classifies the inbound message as interested, an objection,
# or an unsubscribe, then routes by a fixed map from label to handler.
# The classifier returns only a label; the routing is decided in code, so the
# model never directly chooses to halt or to send.
# The unsubscribe path adds a suppression entry.
# Keeping these as separate graphs that reuse shared steps (the same
# faithfulness gate, the same evidence assembly) gives clean separation with
# one copy of the safety rules.
```

ELI5 — the plain-language version

Think of a restaurant that handles both new menu items and customer complaints with the same quality inspector. That’s the sharing in this system: outreach isn’t the only flow; batch emails, opportunity emails, and inbound replies all go through the same grounding and safety checks. The compose graph works like a chef: it drafts the message, then refines it by stripping phrases that sound robotic and tightening the subject line. After that, the same faithfulness gate from outreach runs—it audits every personalized claim against the evidence and removes any unsupported sentence. The reply graph acts like a host who reads the incoming note and classifies it as interested, an objection, or an unsubscribe. The host only returns a label; the routing is decided in code, so the model never directly chooses to halt or send. On the unsubscribe path, the system adds a suppression entry. Without this shared discipline, a new sentiment label—say “angry”—would have no route, leaving a reply stranded. Or a refine pass might over-trim and drop the signature, making the email look incomplete or rude—frustrations a beginner would immediately feel.

Data flow — one request, in order
  1. compose_graph entry point – Receives the request (recipient, vertical, optional post text, optional opportunity link).

    • reads / writes: reads contact, vertical, post_text, opportunity from request; returns nothing yet.
    • branch: if vertical is missing or unknown, the graph may short‑circuit with an error; happy path proceeds.
  2. suppression_gate – Checks the contact against the global suppression list and stop‑condition rules.

    • reads / writes: reads suppression_list and stop_conditions from the database; writes a skip_reason field if the contact is suppressed.
    • branch: if contact is suppressed → return early with skip_reason; happy path continues.
  3. plan_sequence – For a compose (single‑shot) request, returns a trivial one‑step sequence plan.

    • reads / writes: reads vertical to look up sequence definitions (falling back to generic); writes sequence_plan with a single step (step 0).
    • branch: if the vertical has no sequence definition, uses the generic fallback; always a single step.
  4. extract_hook – Reads the supplied post_text and picks exactly one concrete fact to ground the opener.

    • reads / writes: reads post_text; writes hook (the extracted fact).
    • branch: if post_text is empty → hook is null, opener will have no grounded fact (failure mode). Happy path sets a non‑null hook.
  5. draft_touch – Calls the LLM with the per‑step directive (or the generic fallback prompt) and the hook, generating the email body.

    • reads / writes: reads sequence_plan[0].directive (the LLM directive), hook, vertical; writes draft_body and draft_subject.
    • branch: if opportunity is linked → uses the job‑application framing prompt (DRAFT_SYSTEM_PROMPT_JOB_SEEK); otherwise uses the standard outreach prompt. Also, if no per‑step directive exists, falls back to the generic drafting step.
  6. refine_pass – Strips machine‑sounding phrases (exact list in AI_MARKERS) from the body and tightens the subject line.

    • reads / writes: reads draft_body, draft_subject, and the AI_MARKERS tuple; writes refined_body and refined_subject.
    • branch: no conditional – always runs; failure mode is an over‑aggressive refine that drops the signature or other needed content.
  7. faithfulness_gate – Audits every personalized claim in the refined body against the assembled evidence (the hook and any other provided facts).

    • reads / writes: reads refined_body and evidence (from hook and contact fields); writes a faithfulness_score and a list of sentences to remove; writes clean_body.
    • branch: if a sentence’s claim is unsupported, it is removed; if the judge is over‑aggressive, a true claim may be dropped.
  8. Return result – Assembles the final output: subject line, clean body, HTML body (derived from plain text), and bookkeeping fields (skip_reason, engagement_signal, next_touch_time).

    • reads / writes: reads clean_body, refined_subject; writes the final response object.
    • branch: if any prior step set skip_reason, that is returned; otherwise the composed email is returned.

The control flow is linear (no loops or fan‑outs) for a compose request; the only fan‑out happens in the campaign engine (over multiple touches) and in the reply graph (over sentiment labels).
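
The linear flow with an early return on skip_reason can be sketched like this. The step names suppression_gate and extract_hook and the state keys follow the list above, but the step bodies are placeholders (for instance, the hook here is simply the whole post text), and run_pipeline is a hypothetical helper.

```python
from typing import Callable

State = dict
Step = Callable[[State], State]


def suppression_gate(state: State) -> State:
    """Set skip_reason when the contact is on the suppression list."""
    if state["contact"] in state.get("suppression_list", set()):
        return {**state, "skip_reason": "suppressed"}
    return state


def extract_hook(state: State) -> State:
    """Placeholder hook step: null hook when post_text is empty."""
    post_text = state.get("post_text", "").strip()
    return {**state, "hook": post_text or None}


def run_pipeline(state: State, steps: list[Step]) -> State:
    """Run steps in order and stop as soon as one sets skip_reason."""
    for step in steps:
        state = step(state)
        if state.get("skip_reason"):
            break
    return state


steps = [suppression_gate, extract_hook]
skipped = run_pipeline(
    {"contact": "a@example.com", "suppression_list": {"a@example.com"}, "post_text": "We shipped v3"},
    steps,
)
empty_hook = run_pipeline({"contact": "b@example.com", "post_text": ""}, steps)
```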

System design — mechanism, invariant, trade-off

The subsystem for composing and replying is governed by two distinct graphs that share a common grounding discipline. In the compose graph, the pipeline begins by drafting the message, then runs a refine pass that strips machine-sounding phrases and tightens the subject line. Only after this refinement does the faithfulness gate execute: a judge model audits every personalized claim against the assembled evidence and removes any sentence whose claim is unsupported. In the reply graph, the flow starts with a classifier that categorizes the inbound message as interested, objection, or unsubscribe. The classifier returns only a label; routing is decided in code via a fixed map from label to handler. The unsubscribe path adds a suppression entry, while other labels are routed through the same faithfulness gate that the compose graph uses.

The central invariant the design preserves is that the model never directly chooses to halt or to send. In the reply graph, the classifier outputs only a label; the routing decision is made in code, ensuring that the model cannot accidentally terminate a conversation or trigger a send. For the compose graph, the faithfulness gate enforces a stronger invariant: every personalized sentence in the email body must be backed by a concrete field from recipient_context. The gate produces a score (faithfulness_score) between zero and one, and unsupported claims are suppressed before the email is finalized. Separating classification from action, and claim from evidence, provides an auditable, deterministic safety layer.

The key trade-off is between fidelity and cost. Rejecting the cheap alternative of trusting the drafting model to stay grounded (which ships confident fabrications) and the deterministic but semantically blind approach of a keyword check, the design adopts a judge model that compares each claim to the evidence. This catches semantic fabrication at the cost of one extra model call. A second trade-off appears in the refine pass: it runs before the faithfulness gate, smoothing the draft but risking an over‑aggressive trim. The alternative of letting the refinement run after gating would waste the judge call on a draft that will be rewritten. The chosen order keeps the gate as the final check, so the audit reflects the final content.

A concrete failure mode for the reply graph is a new sentiment label with no route. An operator would see a KeyError or log entry stating “no handler for label: X” in the routing step, followed by a stalled thread that never responds. For the compose graph, the failure mode is an over‑aggressive judge that strips a true but tersely worded claim. The operator would observe a faithfulness_score below 1.0 in the trace feedback, and the final email would lack a sentence the evidence could support – a silent omission that might cause a follow‑up email to repeat the same fact. The refine pass that over‑trims and drops the signature produces a separate, visible signal: the draft returned from the compose graph lacks the closing line, which a human approver would notice immediately.

Cost & performance — the real knobs

Only one identifier in the source explicitly acts as a performance knob controlling time and cost in the composing-and-replying subsystem. All other knobs (retry counts, batch sizes, concurrency limits, per-host limits, caches, retrieval top‑k) are absent from the source material for this subsystem. That single knob is described below.

  • LLM_KILL_SWITCH
    • Knob — LLM_KILL_SWITCH (environment variable; default value is not stated, but implied to be 0/off).
    • Bounds — When set to 1, the LLM path is halted: the node short‑circuits, returns the unmodified body, and sets faithfulness_score = 1.0. When set to 0 (or unset), the normal LLM drafting and faithfulness‑judging path runs.
    • Effect — Setting it to 1 eliminates the LLM model calls in that flow, reducing both latency and dollar cost to nearly zero for the affected node, but the email becomes generic and loses all personalization, and the resulting faithfulness_score of 1.0 reflects no audit. Setting it to 0 (or leaving it unset) lets the LLM draft and the judge audit, increasing cost and time in proportion to the number of model invocations.
    • Risk — Left at 1 when personalization is needed, emails go out untailored and may hurt conversion. Left at 0 when personalization is not required (e.g., a high‑volume batch), it wastes money and adds unnecessary latency.

No other performance knobs are named in the source excerpts for the composing, replying, or grounding subsystem. All flows share the same faithfulness gate (which adds one extra model call), but no parameter—such as a model‑choice variable, a retry count, a batch size, a cache TTL, or a concurrency limit—is identified in the provided text.

Failure modes — what breaks, what catches it

Over-aggressive faithfulness judge strips a true but tersely worded claim

  • Trigger — A personalized sentence in the draft is factually correct and grounded in the evidence, but the judge LLM evaluates it as unsupported because the wording is terse (e.g., “You shipped v3 last week” when the evidence says “v3 released”).
  • Guard — None. The judge model (invoked via FAITHFULNESS_SYSTEM_PROMPT) performs the audit, but no further exception handler or fallback checks for over-aggressive suppression.
  • Posture — Fail-soft. The sentence is removed, the email continues with reduced personalization, and suppressed_claim_count is incremented.
  • Operator signal — A depressed faithfulness_score (less than 1.0) and a positive suppressed_claim_count observable in the OTel/LangSmith span where post_faithfulness_feedback posts the score. The final email body is shorter than expected.
  • Recovery — Manual review of the draft and the evidence bindings. The operator can adjust the judge prompt or re-classify the sentence. No automated retry exists.

Refine pass over-trims and drops the signature

  • Trigger — The “refine” step (post-draft, pre-faithfulness) strips machine-sounding phrases but also removes the sender’s signature block, either because it matches a heuristic or because the tightening logic is too aggressive.
  • Guard — None. The refine step is described but no named function, exception handler, or validation is provided for signature preservation.
  • Posture — Fail-soft. The email continues through the faithfulness gate and is finalized without a signature.
  • Operator signal — The delivered email lacks a signature line. The campaign logs may show no error, so detection relies on recipient feedback or manual inspection of pending drafts.
  • Recovery — Manual resend of the email with signature restored, or adjustment of the refine logic in code. No automatic retry.
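
One possible shape for the refine step, together with a signature check that could catch the over-trim described above. Both are sketches under stated assumptions: the real AI_MARKERS tuple is not shown in the source, so the marker strings here are invented examples, and signature_preserved is a hypothetical check that the source does not describe.

```python
import re

# Hypothetical marker list; the real AI_MARKERS contents are not shown in the source.
AI_MARKERS = ("I hope this email finds you well.", "As an AI language model,")


def refine_body(body: str, markers: tuple[str, ...] = AI_MARKERS) -> str:
    """Strip machine-sounding phrases while keeping line breaks intact."""
    for marker in markers:
        body = body.replace(marker, "")
    # Collapse runs of spaces left behind by a removed phrase.
    return re.sub(r"[ \t]{2,}", " ", body).strip()


def signature_preserved(refined: str, signature: str) -> bool:
    """Hypothetical post-refine check: the body must still end with the signature."""
    return refined.rstrip().endswith(signature.strip())


draft = "I hope this email finds you well. Saw your v3 launch.\n\nBest,\nSam"
refined = refine_body(draft)
ok = signature_preserved(refined, "Best,\nSam")
```

Running a check like this between the refine pass and the faithfulness gate would turn a silent omission into an explicit signal, at the cost of needing to know the expected signature up front.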

Classifier receives a new sentiment label with no route

  • Trigger — The reply graph’s classifier returns a label (e.g., “question”) that is not in the fixed routing map from label to handler (interested, objection, unsubscribe).
  • Guard — None. The source states “the routing is decided in code, so the model never directly chooses to halt or to send” and explicitly calls out “The failure mode is a new sentiment label with no route.” No fallback or default route is described.
  • Posture — Fail-closed. Because no handler exists, the reply cannot be processed; the inbound message is not answered and no suppression entry is added.
  • Operator signal — A missing email in the reply thread, or a logged error (e.g., “unknown label” or “no route”). The classifier’s output label is visible in traces.
  • Recovery — Manual intervention to add the missing route to the code’s fixed map, then reprocess the inbound message.

Evidence set omitted a real fact, causing a personalized claim to be stripped

  • Trigger — The gathered evidence (from recipient_context) does not contain a fact that actually exists (e.g., a recent job change), so a draft sentence referencing that fact appears unsupported to the faithfulness gate.
  • Guard — The faithfulness gate (the judge model) identifies the claim as unsupported and suppresses it. This is not a guard against the failure but the mechanism that manifests it.
  • Posture — Fail-soft. The email loses personalization but still sends. The gate’s suppression is logged via suppressed_claim_count.
  • Operator signal — A faithfulness_score below 1.0 and a positive suppressed_claim_count. The final email may lack a relevant detail.
  • Recovery — Fix the evidence assembly pipeline to include the missing fact. No automated retry; the campaign continues without that personalization.

Hook extraction receives empty post text

  • Trigger — The compose graph (or any outreach flow) calls the hook step with an empty post_text field. The hook step “reads the supplied post text” and, finding nothing, cannot pick a concrete opener.
  • Guard — None. The source notes “The failure mode is an empty post text, which leaves the opener with nothing real to stand on.” No fallback hook or validation check is mentioned.
  • Posture — Fail-soft. The subsequently drafted email opens with a generic statement because no grounded hook was available.
  • Operator signal — The opener is noticeably generic. No explicit error is raised; the trace may show a missing or null hook field.
  • Recovery — Ensure the post text is supplied before invoking the compose graph. Manual rewrite of the email if already sent.

Interview — could you explain it?


Q — How does the reply graph handle an inbound message, and what prevents the model from accidentally sending or halting?

A — The reply graph first runs a classifier that labels the inbound message as interested, objection, or unsubscribe. The classifier returns only a label; the actual routing is decided in code by a fixed map from label to handler, so the model never directly chooses to halt or to send.

Follow-up — What happens if a new sentiment label is introduced?

A — That’s a documented failure mode: a new sentiment label with no route breaks the system until the map is updated.

Weak answer misses — The classifier returns only a label; the routing is always deterministic in code, not model-driven.


Q — Why implement the reply graph with a separate classifier and code‑based routing instead of letting the LLM decide the action directly—isn’t that more flexible?

A — The system deliberately keeps the classifier separate so the model never directly chooses to halt or to send. The routing is hard‑coded via a fixed map from label to handler, which prevents hallucinated actions and ensures that the same faithfulness gate and evidence assembly (reused from outreach) still apply to the drafted reply.

Follow-up — Doesn’t that make the system brittle when new reply intents appear?

A — Yes; the failure mode is a new sentiment label with no route, and the map must be updated manually.

Weak answer misses — The model never chooses to halt or send; routing is in code, not a model output.


Q — The compose graph has a refine pass that strips machine‑sounding phrases and tightens the subject. Why not just ask the LLM to produce polished copy in one shot?

A — The compose graph first drafts the body, then refines it by stripping machine‑sounding phrases and tightening the subject. This separates creative generation from editing. Because the refinement runs before the faithfulness gate, the gate audits the refined text, so anything the editing step changes is still checked against the evidence before the email is finalized.

Follow-up — What is the failure mode of the refine pass?

A — An over‑aggressive refine pass that over‑trims and drops the signature.

Weak answer misses — The ordering: the refine pass runs before the faithfulness gate, so the audit reflects the final content; the refine pass itself only strips phrases and tightens the subject.


Q — Why are compose and reply implemented as separate graphs that share steps like the faithfulness gate, rather than one monolithic graph?

A — Separating the compose graph and reply graph while reusing the same faithfulness gate and evidence assembly gives clean separation of concerns with one copy of the safety rules. This avoids duplicating grounding logic across flows and allows each graph to evolve independently while maintaining identical safety discipline.

Follow-up — What specific safety mechanism is reused?

A — The faithfulness gate, which uses a judge model to audit each claim against evidence and remove unsupported sentences.

Weak answer misses — Both graphs reuse the same faithfulness gate and evidence assembly; there is a single copy of the safety rules.


Q — A warm‑up: what happens when an inbound reply is classified as unsubscribe?

A — The reply graph routes to the unsubscribe handler, which adds a suppression entry. The classifier returns only the label unsubscribe, and the code‑based map determines that specific action.

Follow-up — Could the model handle the unsubscribe itself instead?

A — No; the routing is decided in code, so the model never directly chooses to halt or send, preventing accidental unsuppression or mis‑routing.

Weak answer misses — The unsubscribe path adds a suppression entry; the classifier never outputs an action, only a label.

11. Durable Campaign Threads

A multi-touch sequence unfolds over days or weeks, every touch needs human approval, and the process must survive restarts. The campaign engine runs one durable thread per campaign and contact, identified by a stable thread name and checkpointed in the database. The compose step drafts the touch by invoking the outreach engine and holds it as a pending draft. The approval pause stops the thread and waits for a human decision: approve, edit, reject, or skip. The send step is the only step that actually sends, and it records the send. The scheduling step writes a waiting status and a wake time, then pauses again. An external timer drains threads whose wake time has passed, one at a time, and resumes each.
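
The timer's drain step can be sketched as a loop over due threads. This is an in-memory illustration only: the CampaignThread dataclass, the status strings, the thread-name format, and the resume callable are assumptions, and the real engine checkpoints this state in the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass
class CampaignThread:
    thread_id: str      # stable name, assumed here to be "campaign:contact"
    status: str
    wake_at: datetime


def drain_due_threads(
    threads: list[CampaignThread],
    now: datetime,
    resume: Callable[[CampaignThread], None],
) -> list[str]:
    """Resume waiting threads whose wake time has passed, one at a time."""
    resumed = []
    for thread in sorted(threads, key=lambda t: t.wake_at):
        if thread.status == "waiting" and thread.wake_at <= now:
            resume(thread)
            resumed.append(thread.thread_id)
    return resumed


def mark_resumed(thread: CampaignThread) -> None:
    """Hypothetical resume action: flip the status so the thread runs again."""
    thread.status = "running"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
threads = [
    CampaignThread("campaign-1:contact-7", "waiting", now - timedelta(hours=1)),
    CampaignThread("campaign-1:contact-8", "waiting", now + timedelta(days=2)),
    CampaignThread("campaign-1:contact-9", "awaiting_approval", now - timedelta(days=1)),
]
resumed_ids = drain_due_threads(threads, now, mark_resumed)
```

Threads paused for approval are left alone even when their timestamps are old, since only a human decision, not the timer, should move them forward.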

Three design options exist for this flow. A process that simply sleeps between touches is simplest to write. But it loses all state on a restart, and it cannot pause for approval cleanly. A queue of scheduled jobs with no shared thread state is durable across restarts. However, the thread's history is scattered across job logs, so inspecting the full arc is hard. A checkpointed graph is the third option. It pauses for approval and for cadence, then a timer resumes it. The result is durable. It is inspectable. It is human-gated. That is the shape the engine uses.

Now consider what can go wrong. A thread left waiting forever is one failure mode. This happens if the external timer stops entirely. No thread wakes, and approvals pile up. The detection signal is a rising count of threads in a waiting state past their expected wake times. An operator would query the thread table for status equal to waiting with a wake time older than five minutes. The blast radius is isolated to the campaign and contacts served by that timer instance. Other timers for other campaigns keep running.
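
The operator query described above might look like the following sketch, using an in-memory SQLite database. The table name campaign_threads, its columns, and the ISO-8601 text timestamps are assumptions for illustration; the source names only the waiting status, the wake time, and the five-minute threshold.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def find_stuck_threads(conn, now, grace=timedelta(minutes=5)):
    """Return ids of waiting threads whose wake time is older than the grace period."""
    cutoff = (now - grace).isoformat()
    rows = conn.execute(
        "SELECT thread_id FROM campaign_threads "
        "WHERE status = 'waiting' AND wake_at < ? "
        "ORDER BY wake_at",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaign_threads (thread_id TEXT, status TEXT, wake_at TEXT)")
conn.executemany(
    "INSERT INTO campaign_threads VALUES (?, ?, ?)",
    [
        ("t1", "waiting", (now - timedelta(minutes=10)).isoformat()),  # overdue
        ("t2", "waiting", (now - timedelta(minutes=1)).isoformat()),   # within grace
        ("t3", "sent", (now - timedelta(hours=1)).isoformat()),        # not waiting
    ],
)
stuck = find_stuck_threads(conn, now)
```

A steadily growing result from a query like this is the rising-count signal that the external timer has stopped.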

A second failure mode is an approval edit that is not synced into the sent body. The human edits the draft during the approval pause. The edit updates the thread state locally. But if a concurrent save or a race condition loses that patch, the send step fires the original unedited draft. The detection signal is a mismatch between the approval audit log and the sent body log. An on-call engineer would check the send history table against the approval snapshot for that thread. The blast radius is one email per occurrence, but it erodes sender reputation and trust.
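
One way to run that comparison is sketched below. It assumes the approval snapshot stores a digest of the approved body and the send history stores the body that went out. Both data shapes and both function names are hypothetical.

```python
import hashlib


def body_digest(body: str) -> str:
    """Stable fingerprint of an email body, suitable for an audit log."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def find_unsynced_edits(approved_digests: dict, sent_bodies: dict) -> list:
    """Return thread names whose sent body does not match the approved digest."""
    mismatched = []
    for name, approved in approved_digests.items():
        sent = sent_bodies.get(name)
        if sent is not None and body_digest(sent) != approved:
            mismatched.append(name)
    return sorted(mismatched)


# Thread t1 was edited during approval, but the original draft went out.
approved_digests = {
    "t1": body_digest("Hi Ana, edited by the approver"),
    "t2": body_digest("Hi Bo, unchanged draft"),
}
sent_bodies = {
    "t1": "Hi Ana, original draft",
    "t2": "Hi Bo, unchanged draft",
}
unsynced = find_unsynced_edits(approved_digests, sent_bodies)
```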

The team chose the checkpointed graph after rejecting two alternatives. A process that sleeps between touches was rejected because the graph must survive a platform restart, and that design loses state. A queue of scheduled jobs was rejected because the approval gate needs the full thread context. That context is the draft, the sequence plan, and the contact snapshot, all accessible in one pause. A job queue cannot hold that context cheaply. The checkpointed graph stores it in the database, so the pause is free and the resume is atomic.

Here is the transferable rule. Use the checkpointed graph design when your flow needs a human pause that must survive crashes. It also fits when each unit of work carries a moderate context you cannot afford to rebuild. Do not use it when the number of concurrent threads exceeds your database write capacity. Avoid it too when the pause is very short, under one second, because the checkpoint write overhead then dominates the work.

<!-- mem:begin -->

Generate it: The approval pause stops the thread and waits for a human decision: approve, edit, reject, or ____. (cue: or ____; answer: skip)

Generate it: An external _____ drains threads whose wake time has passed, one at a time, and resumes each. (cue: external _____; answer: timer)

Ask yourself: Why choose a checkpointed graph over a process that simply sleeps between touches?

Answer: A sleeping process loses all state on a restart and cannot pause cleanly for approval, while a checkpointed graph survives a platform restart and stores the full pause context in the database so the resume is atomic.

Recall check (try before reading the answer):

  1. What is the detection signal that the external timer has stopped? Answer: A rising count of threads in a waiting state past their expected wake times — querying for status "waiting" with a wake time older than five minutes.

  2. How can an approval edit fail to reach the sent body? Answer: A concurrent save or race condition loses the patch, so the send step fires the original unedited draft.

  3. Why does a job queue of scheduled jobs fail the approval requirement? Answer: The approval gate needs the full thread context — draft, sequence plan, and contact snapshot — which a job queue cannot hold cheaply, whereas the checkpointed graph stores it in the database.

Looking back: In "The Faithfulness Gate," how can you tell the judge model is drifting too strict or too lenient? Answer: By tracking the gate scores over time and watching the rate at which it strips true claims.

<!-- mem:end -->

The campaign engine runs one durable thread per campaign and contact, checkpointed in the database, with approval pauses and timer-driven resumption.

python

async def campaign_thread(state: CampaignState) -> dict:
    """One durable thread per campaign+contact, checkpointed in DB.

    The helpers (outreach_engine, checkpoint_state, send_email, record_send,
    calculate_next_touch) are assumed to be defined elsewhere.
    """
    # Compose step: draft via outreach engine, hold as pending draft
    state["pending_draft"] = await outreach_engine(state)

    # Approval pause: persist the thread and wait for a human decision.
    # A real checkpointed graph suspends here; this sketch assumes the
    # decision is already in state["approval"] when the thread resumes.
    state["status"] = "awaiting_approval"
    await checkpoint_state(state)  # persists to DB

    decision = state.get("approval")
    if decision in ("reject", "skip"):
        state["status"] = "skipped"
        await checkpoint_state(state)
        return state
    if decision not in ("approved", "edit"):
        return state  # no decision yet: stay paused, never send

    if decision == "edit":
        # Send the human-edited draft, not the original
        state["pending_draft"] = state["edited_draft"]

    # Send step: only step that actually sends, and it records the send
    await send_email(state["pending_draft"])
    state["status"] = "sent"
    await record_send(state)

    # Scheduling step: write waiting status + wake time, pause again.
    # The external timer resumes the thread once the wake time has passed.
    state["status"] = "waiting"
    state["wake_time"] = calculate_next_touch(state)
    await checkpoint_state(state)
    return state
ELI5 — the plain-language version

Imagine a careful assistant who drafts letters but never mails them without a manager’s nod. The assistant keeps a notebook—one page per contact—and after writing each draft, pauses, puts the draft on the manager’s desk, and waits for a sticky note: “Approve,” “Edit it,” “Reject,” or “Skip.” Once approved, the assistant sends the letter, jots down that it was sent, and schedules the next draft exactly when it should go, writing the wake‑up time in the notebook. If the electricity cuts out, the assistant just reopens the notebook and picks up exactly where it left off—because every page is checkpointed in a database under a stable thread name. That notebook is the durable thread, one per campaign and per contact. Without this system, a real outage would lose all progress, or a manager’s edit to a draft might never make it into the final mailed copy. Worst case: a thread is scheduled to wake, but if the external timer that drains threads stops, that contact’s sequence sits forgotten forever, waiting for a wake‑up call that never comes.

Data flow — one request, in order
  1. external timer drain — The external timer drains threads whose wake time has passed, one at a time, and resumes each.

    • reads / writes — reads the database for threads with wake_time ≤ now; writes a resumed thread state.
    • branch — if no thread is due, the drain does nothing; happy path picks one thread and resumes it.
  2. compose step — Drafts the touch by invoking the outreach engine and holds it as a pending draft.

    • reads / writes — reads the stable thread_name and the checkpointed thread state; writes a pending_draft into the thread state.
    • branch — no branch within this step; the outreach engine sub‑steps handle failures.
  3. look up the contact — Reads the contact from the database once and loads their role, seniority, department, and profile into the working state.

    • reads / writes — reads contact row; writes role, seniority, department, profile into the working state.
    • branch — missing contact row leaves personalization with nothing; happy path continues.
  4. suppression gate — Checks a central do-not‑contact list using a one‑way fingerprint of the email address plus the domain; fails closed.

    • reads / writes — reads the do‑not‑contact list via fingerprint; writes an audit record of the decision.
    • branch — if the contact is suppressed, the run ends with a skip reason; happy path continues.
  5. stop conditions — Checks the contact’s current thread state for replied, bounced, unsubscribed, or never_verified.

    • reads / writes — reads the contact’s thread state from the database; writes a distinct machine‑readable reason if a condition holds.
    • branch — if any condition holds, the run ends with that reason; happy path continues.
  6. plan the sequence — Deterministically looks up a sequence definition from VERTICAL_SEQUENCE_DEFS or the nested sub‑niche map, based on the vertical and sub‑niche.

    • reads / writes — reads vertical, sub_niche from working state; writes the sequence_id, touch_angles, steps, cadence_days, fallback_step.
    • branch — if the sub‑niche key is missing or None, falls back to the vertical‑level definition.
  7. extract the hook — Reads the supplied post text and picks exactly one concrete hook to ground the opener.

    • reads / writes — reads post_text from working state; writes a single hook fact.
    • branch — if post text is empty, the opener has nothing real to stand on (failure mode); happy path continues.
  8. draft the step — Looks up the directive for the current step from the sequence definitions and writes copy that fits that step’s role (opener, value, soft close).

    • reads / writes — reads the current step_index and the steps directives; writes the drafted copy.
    • branch — if the step index is past the end of the sequence (failure mode), falls back to the fallback_step; happy path writes per‑step copy.
  9. faithfulness gate — Uses a judge model to audit the draft against the assembled evidence; removes any unsupported sentence and produces a score between 0 and 1.

    • reads / writes — reads the draft and the evidence set; writes the filtered draft and a faithfulness_score as feedback.
    • branch — over‑aggressive judge may strip true claims (failure mode); happy path keeps only supported sentences.
  10. approval pause — Stops the thread and waits for a human decision: approve, edit, reject, or skip.

    • reads / writes — reads the pending draft; writes an approval_status field.
    • branch — if rejected or skip, the thread ends with a reason; if edit, the human modifies the draft; happy path continues on approve.
  11. send step — The only step that actually sends the email; it records the send.

    • reads / writes — reads the approved draft; writes a sent record (including timestamp and message‑id) to the database.
    • branch — no branch described; any send failure would be an error but is not detailed in the source.
  12. scheduling step — Writes a waiting status and a wake time for the next touch, then pauses the thread again.

    • reads / writes — reads the cadence_days from the sequence definition; writes status = "waiting" and wake_time into the thread state.
    • branch — if the timer stops, the thread is left waiting forever (failure mode); happy path pauses until the next timer drain.
  13. external timer drain (next invocation) — The external timer drains threads whose wake time has passed and resumes them, looping back to step 2 for the next touch in the campaign.

    • reads / writes — same as step 1.
    • branch — if the sequence has no further touches, the thread ends (no more scheduling).
System design — mechanism, invariant, trade-off

The Durable Campaign Threads subsystem operates as a checkpointed graph that begins with a compose step invoking the outreach engine to draft a touch, which is then held as a pending draft. Execution pauses at the approval pause, where the thread waits for a human decision — approve, edit, reject, or skip. Only after approval does the send step execute, and it is the sole step that actually transmits the email and records the send. Next, the scheduling step writes a waiting status and a wake time into the database, then the thread pauses again. An external timer drains threads whose wake time has passed, resuming each one so the next touch can be composed. On failure — for example, if the timer stops — a thread is left waiting forever with no forward progress.

The core invariant is that the campaign engine maintains a durable thread per campaign and contact, identified by a stable thread name and checkpointed in the database. This ensures the entire multi-touch sequence survives restarts and remains inspectable. The guarantee is that no send occurs without explicit human approval, and that the send step is the only path to transmit — the approval pause and the send step are strictly ordered, establishing a hard write boundary between drafting and delivery.

This design rejects the obvious alternative of a single long‑running process that sleeps between touches. That approach loses all state on a restart and cannot pause cleanly for human approval. The checkpointed graph with pauses and timer is chosen instead because it is durable, inspectable, and human‑gated. The cost avoided is the inability to recover from crashes mid‑sequence, and the risk of sending without oversight — exactly the failure that the approval pause and checkpointing prevent.

A concrete failure mode is a thread left waiting forever if the timer stops. An operator would observe that no pending drafts advance to the send step, that threads remain in waiting status indefinitely, and that the normal cadence of scheduled wakes never fires. The signal is a cluster of stalled campaign threads with stale wake times, with no new sends recorded, and no timer activity in system logs.

Cost & performance — the real knobs

cadence_days

  • Knob — VERTICAL_SEQUENCE_DEFS.cadence_days, default [0, 4, 7]
  • Bounds — Controls the number of days between touches in a sequence; each entry corresponds to a step’s delay after the previous send.
  • Effect — Shorter intervals increase the frequency of model calls (compose, draft, faithfulness gate) and sends per unit time, raising dollar cost and thread throughput. Longer intervals reduce cost but may allow engagement to decay.
  • Risk — Too short a gap can trigger the adaptive‑cadence clamp (preventing same‑day blasts) but still wastes model calls on unreceptive contacts; too long a gap may lose the contact’s attention entirely.
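
As a rough illustration of how the scheduling step could turn `cadence_days` into a wake time, the sketch below reads each entry as the delay after the previous send, following the Bounds line above. The function name and its signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

CADENCE_DAYS = [0, 4, 7]  # default quoted in this guide


def next_wake_time(
    last_send: datetime, next_step: int, cadence_days: list = CADENCE_DAYS
) -> Optional[datetime]:
    """Wake time for the touch at index next_step, or None when the sequence is over."""
    if next_step >= len(cadence_days):
        return None  # no further touches: the thread ends instead of scheduling
    return last_send + timedelta(days=cadence_days[next_step])


first_send = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
wake_for_second_touch = next_wake_time(first_send, 1)
# Assume the second touch is sent exactly at its wake time.
wake_for_third_touch = next_wake_time(wake_for_second_touch, 2)
after_last_touch = next_wake_time(wake_for_third_touch, 3)
```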

fallback_step

  • Knob — VERTICAL_SEQUENCE_DEFS.fallback_step, default 2
  • Bounds — Defines which step directive to use when the current step index exceeds the defined sequence length (failure mode: “step index past the end of the sequence”).
  • Effect — Prevents a crash by re‑using a known directive; lowering the value (e.g., to 1) makes the fallback reuse an earlier step’s directive, raising the chance of a generic message; raising it may point to a nonexistent step if the sequence is short.
  • Risk — A mis‑aligned fallback can produce copy that does not match the intended touch role (e.g., a soft‑close when an opener is needed), wasting the model call and confusing the contact.
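
A small sketch of the fallback lookup, under the assumption that `steps` is a list of directive strings. The function name is hypothetical, and the extra clamp on `fallback_step` is an addition of this sketch to cover the "nonexistent step" risk, not something the guide describes.

```python
def resolve_directive(steps: list, step_index: int, fallback_step: int = 2) -> str:
    """Return the directive for step_index, or the fallback directive when it is out of range."""
    if not steps:
        raise ValueError("sequence definition has no steps")
    if 0 <= step_index < len(steps):
        return steps[step_index]
    # Clamp the fallback as well, so a short sequence never raises IndexError.
    safe_index = min(max(fallback_step, 0), len(steps) - 1)
    return steps[safe_index]


steps = ["opener", "value", "soft close"]
in_range = resolve_directive(steps, 1)
past_end = resolve_directive(steps, 5)
short_sequence = resolve_directive(["opener", "value"], 5)
```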

timer drain concurrency

  • Knob — Not named explicitly in source; the external timer “drains threads whose wake time has passed, one at a time” – a de‑facto concurrency of 1.
  • Bounds — Limits the number of threads processed simultaneously by the timer; only one thread is resumed per timer tick.
  • Effect — Raising concurrency would allow multiple threads to be drafted and sent in parallel, reducing the wall‑clock time for a batch of campaigns but increasing instantaneous load on the outreach engine and database.
  • Risk — Too high a concurrency could overload the model endpoints or cause database contention on checkpoint writes; too low a concurrency leaves threads waiting longer, slowing the overall campaign cadence.
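
If the drain were ever made concurrent, a semaphore is one common way to bound it. The sketch below is not the engine's timer. `drain_due_threads` and `fake_resume` are hypothetical, and the default of 1 mirrors the one-at-a-time behaviour described above.

```python
import asyncio


async def drain_due_threads(due: list, resume, concurrency: int = 1) -> list:
    """Resume every due thread with at most `concurrency` running at once."""
    gate = asyncio.Semaphore(concurrency)

    async def run_one(name):
        async with gate:
            return await resume(name)

    return await asyncio.gather(*(run_one(name) for name in due))


async def demo(concurrency: int) -> int:
    """Return the peak number of threads that were in flight together."""
    in_flight = 0
    peak = 0

    async def fake_resume(name):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for compose and checkpoint work
        in_flight -= 1
        return name

    await drain_due_threads(["t1", "t2", "t3", "t4"], fake_resume, concurrency)
    return peak


peak_serial = asyncio.run(demo(1))
peak_parallel = asyncio.run(demo(3))
```

Raising `concurrency` shortens the wall-clock time of a batch, and the peak in-flight count is the number to watch against model endpoint and database limits.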

model version for drafting and faithfulness

  • Knob — Not given a specific identifier; the source refers to “the drafting model” and “the judge model” used by the faithfulness gate.
  • Bounds — The model identity (e.g., cost‑per‑token, speed, accuracy) directly trades off latency, throughput, and dollar cost per execution.
  • Effect — Switching to a cheaper, faster model reduces per‑call cost and may speed up the compose step, but risks lower‑quality drafts or a stricter judge that strips true claims. Moving to a more expensive model improves accuracy and grounding but increases cost and latency per touch.
  • Risk — An overly aggressive judge (cheap model) can remove valid content, eroding personalization; an under‑powered drafting model can produce unfaithful claims that the judge then rejects, causing wasted model calls.
Failure modes — what breaks, what catches it

Step Index Past End of Sequence

  • Trigger – The drafting step in the outreach engine receives a step index that exceeds the length of the vertical sequence (taxonomy changed or sequence truncated).
  • Guard – The system falls back to the generic drafting step when no per-step directive applies.
  • Posture – Fail-soft: the touch is still produced but as a generic email that ignores the intended step role, degrading personalisation.
  • Operator signal – The campaign trace shows a touch that lacks the expected opener/value/soft‑close framing; the generic fallback is observable in the draft text.
  • Recovery – The human approver sees the generic email during the approval pause and can reject or edit it; no automatic retry occurs.

Empty Post Text

  • Trigger – The hook extraction step receives no post text (e.g., a missing or blank field in the contact profile).
  • Guard – None shown in the source. The system has no fallback for an empty post text; the opener is left with “nothing real to stand on.”
  • Posture – Fail-hard: the compose step cannot produce a grounded opener, so the draft is empty or hallucinates. The outreach engine likely aborts this touch.
  • Operator signal – An error or missing‑hook log entry; the campaign thread remains stuck in a pending draft state without progress.
  • Recovery – Manual step: the operator must supply valid post text (e.g., a recent public post or job description) and restart the touch.

Missing Verification Check

  • Trigger – The stop‑conditions step fails to run the email‑verification check for the contact’s address.
  • Guard – No guard shown. The verification check is described as a distinct condition that “must” happen, but no code‑level exception handler or fallback is mentioned.
  • Posture – Fail-soft (dangerous): the system proceeds to draft and send to an unverified address, risking a bounce that damages sender reputation.
  • Operator signal – A bounce‑related stop condition (bounce reason) appears later, or the sender’s reputation metrics degrade silently. No immediate log of the missing check.
  • Recovery – Manual: the operator must inspect the contact’s verification status via the database and suppress the address if needed.

Over‑Aggressive Faithfulness Judge

  • Trigger – The faithfulness gate (judge model) evaluates a true but tersely worded claim as unsupported, stripping it from the draft.
  • Guard – The judge model itself (faithfulness_gate / post_faithfulness_feedback) is the guard, but its over‑aggressiveness is a failure mode.
  • Posture – Fail-soft: the email is finalised with the claim removed, losing a valid personalisation detail.
  • Operator signal – The suppressed_claim_count metric increases and the faithfulness_score is recorded (e.g., via record_outcome_feedback). The draft text is missing a claim that a human reviewer would consider true.
  • Recovery – Manual: the human approver can edit the draft during the approval pause to re‑insert the claim, or the operator can lower the judge’s threshold (not shown in source).

Missing Contact Row

  • Trigger – The first step (looking up the contact) attempts to read a contact from the database and finds no row for the given thread’s contact ID.
  • Guard – No guard shown. The source states “The failure mode to watch is a missing contact row” with no fallback or error handler.
  • Posture – Fail-hard: the campaign thread cannot proceed past the lookup step; the run aborts immediately.
  • Operator signal – A database query returning zero rows, logged as a contact not found error. The thread remains uninitialised.
  • Recovery – Manual: the operator must verify the contact ID and re‑seed the thread with a valid contact row.
Interview — could you explain it?

Interview Q&A on Durable Campaign Threads

Q (warm-up): The system needs to survive restarts across multi-day sequences. What mechanism guarantees a thread can be resumed after a crash?

A: The campaign engine runs one durable thread per campaign and contact, identified by a stable thread name and checkpointed in the database. Every state change (draft created, approved, sent, scheduled) is persisted, so after a restart the engine reads the checkpoint and resumes from the exact step where it left off.

Follow-up: What specific data is stored in the checkpoint to ensure no step is skipped?
Grounded answer: The checkpoint stores the current step index, the pending draft state, and the next wake time written by the scheduling step, so the engine knows exactly which touch to compose or send next.

Weak answer misses: A shallow answer might mention “persistence” without naming the stable thread name as the unique identifier that prevents two threads for the same contact from colliding after recovery.


Q (design): Why does the compose step create a pending draft instead of sending the email immediately, given that drafting is already done?

A: Because every touch needs human approval. The compose step invokes the outreach engine to draft the email, then holds it as a pending draft rather than sending directly. The approval pause then stops the thread and waits for a human decision (approve, edit, reject, or skip); only the send step is allowed to actually dispatch the email. This ensures no model output reaches a recipient without explicit human sign-off.

Follow-up: Couldn’t the approval be merged into the compose step to reduce latency?
Grounded answer: No — the approval pause is a separate state that blocks the thread until a human acts; merging it would eliminate the only point where edits or rejections are possible before sending.

Weak answer misses: A shallow answer would omit that the send step is “the only step that actually sends,” missing the architectural guarantee that drafts are never automatically transmitted.


Q (intermediate): After a human approves a touch, how does the system decide when to schedule the next one, and what prevents it from sending on the same day or after a long silence?

A: The scheduling step writes a waiting status and a wake time based on the configured cadence (e.g., cadence_days: [0, 4, 7]) and any engagement signal. The engine then pauses until the wake time elapses, and only then resumes to compose the next touch. The actual send is further gated by the approval pause, so scheduling and sending are fully decoupled.

Follow-up: Can the engagement signal override the cadence days? If so, what bounds exist?
Grounded answer: Yes — the adaptive cadence mechanism (described elsewhere) lets the model propose a gap, but code clamps it to a safe range; the core thread still writes a wake time via the scheduling step, respecting the clamped value.

Weak answer misses: A shallow answer might say “it just waits for the next day” without mentioning that the scheduling step explicitly writes both a waiting status and a wake time as separate fields in the database.
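
The clamp mentioned in the follow-up can be pictured with a few lines of code. The bounds of 1 and 14 days are placeholders chosen for this sketch, since the guide does not state the safe range. The lower bound of one day is what rules out a same-day send.

```python
def clamp_gap_days(proposed: float, min_days: int = 1, max_days: int = 14) -> int:
    """Force a model-proposed gap into a fixed range chosen in code (hypothetical bounds)."""
    return max(min_days, min(max_days, int(proposed)))


same_day = clamp_gap_days(0)      # too eager: raised to the minimum
long_silence = clamp_gap_days(30)  # too slow: lowered to the maximum
reasonable = clamp_gap_days(5)     # inside the range: kept as proposed
```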


Q (hard): Let’s say two approvals arrive at nearly the same time for the same thread — one approve and one skip. How does the system avoid sending an already‑skipped email?

A: The approval pause is a single‑state gate; once a human decision is recorded (e.g., skip), the thread transitions to the next step and the pending draft is discarded. The send step checks the current state before dispatching; if the thread has moved past the approve state, the send is blocked. Because all state transitions are serialized through the checkpointed database record, concurrent decisions are ordered by the write transaction.

Follow-up: What prevents a stale approval callback from acting on a thread that has already advanced?
Grounded answer: The thread’s stable thread name is used as a lock key; the handler reads the current checkpoint and only applies the decision if the step index still matches the expected approval step.

Weak answer misses: A shallow answer would ignore the role of the stable thread name in providing an atomic identifier for concurrency control, focusing only on “database transactions” without the precise mechanism.
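
The stale-callback guard can be sketched with a single conditional UPDATE. This replaces the lock-then-check sequence described above with one atomic statement, which is a choice made for this sketch. The `threads` table, its columns, and `apply_decision` are assumptions.

```python
import sqlite3


def apply_decision(
    db: sqlite3.Connection, thread_name: str, expected_step: int, decision: str
) -> bool:
    """Record a decision only if the thread is still paused at expected_step."""
    cursor = db.execute(
        "UPDATE threads SET status = 'decided', decision = ? "
        "WHERE thread_name = ? AND status = 'awaiting_approval' AND step_index = ?",
        (decision, thread_name, expected_step),
    )
    db.commit()
    return cursor.rowcount == 1  # zero rows means the thread has already moved on


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE threads ("
    "thread_name TEXT PRIMARY KEY, status TEXT, step_index INTEGER, decision TEXT)"
)
db.execute("INSERT INTO threads VALUES ('t1', 'awaiting_approval', 0, NULL)")

first_applied = apply_decision(db, "t1", 0, "skip")      # arrives first and wins
second_applied = apply_decision(db, "t1", 0, "approve")  # stale: the thread already advanced
final_decision = db.execute(
    "SELECT decision FROM threads WHERE thread_name = 't1'"
).fetchone()[0]
```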

12. Putting It Together

You now have every piece of the outreach engine. The question is when to build this shape. The idea that runs through everything is simple: outreach is a graph of small gated steps. The same engine drafts copy but never sends. It serves an autonomous pipeline, a human-approved campaign, and a one-shot preview. Because the graph never assumes how it is invoked, the caller always controls sending.

Reach for this shape when safety, grounding, and observability each need their own testable seam. Reach for it when the same copy engine must be reused under different approval policies. A single big send function has few moving parts, but one failure domain. Separate microservices per flow give clean isolation, but you pay the platform tax once per flow. The shared graph runtime with a registry gives gated, traceable steps with additive growth. It costs one routing layer and the discipline to keep the registry simple.

Here is the spine. Look up the contact once. Gate on suppression early and fail closed. Gate on stop conditions early, with a distinct reason for each. Plan the sequence deterministically. Ground the opener on one real hook from the supplied text. Draft per step, looking up the directive for that touch. Then let the faithfulness gate remove any unsupported claim. Only ever send from a step the caller explicitly drives.

The registry is where each graph lives. The outreach graph, the compose graph, the reply graph, and the durable campaign engine are all separate entries.
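
A registry of this kind can be as small as a dictionary. The sketch below maps each short name to a builder function rather than to a module, which is a simplification. The short names, the builder names, and the string return values are placeholders, not the engine's real entries.

```python
from typing import Callable, Dict

GRAPH_REGISTRY: Dict[str, Callable[[], object]] = {}


def register(name: str):
    """Decorator that files a graph builder under a short name."""
    def decorator(builder):
        if name in GRAPH_REGISTRY:
            raise ValueError(f"graph {name!r} is already registered")
        GRAPH_REGISTRY[name] = builder
        return builder
    return decorator


@register("outreach")
def build_outreach_graph():
    return "outreach-graph"  # placeholder for a compiled graph object


@register("compose")
def build_compose_graph():
    return "compose-graph"


@register("reply")
def build_reply_graph():
    return "reply-graph"


@register("campaign")
def build_campaign_engine():
    return "campaign-engine"


def get_graph(name: str):
    """Build the graph registered under name, with a readable error for unknown names."""
    try:
        builder = GRAPH_REGISTRY[name]
    except KeyError:
        raise KeyError(f"unknown graph {name!r}; known: {sorted(GRAPH_REGISTRY)}") from None
    return builder()


reply_graph = get_graph("reply")
```

Adding a fifth graph is one more decorated function, which is the additive growth the trade-off discussion refers to.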

Now the standing tensions. First, a deterministic route or sequence table can outgrow static configuration as verticals multiply. You must decide when to move from a configuration file to a database lookup. Second, the faithfulness judge can be too strict or too lax. It must itself be measured. You watch its score and the rate at which it strips true claims. You also watch for claims it misses.

Use this shape when you need to reuse a draft engine across approval policies. Do not use it when every flow has unique requirements, or when a single monolithic function with few moving parts is enough. Do not use it when you have no need for human gates or separate safety checks.

<!-- mem:begin -->

Generate it: Only ever send from a step the ______ explicitly drives. (cue: the ______ explicitly drives; answer: caller)

Generate it: Ground the opener on one real ____ from the supplied text. (cue: one real ____; answer: hook)

Ask yourself: When should you reach for this shared-graph-plus-registry shape, and when not?

Answer: Reach for it when safety, grounding, and observability each need their own testable seam and the same copy engine must serve different approval policies; avoid it when every flow has unique requirements, a single monolithic function suffices, or you need no human gates or separate safety checks.

Recall check (try before reading the answer):

  1. Recite the spine of the outreach run in order. Answer: Look up the contact once; gate on suppression early and fail closed; gate on stop conditions with a distinct reason; plan the sequence deterministically; ground the opener on one real hook; draft per step via the directive; let the faithfulness gate strip unsupported claims; send only from a step the caller drives.

  2. What is the first standing tension the chapter names? Answer: A deterministic route or sequence table can outgrow static configuration as verticals multiply, so you must decide when to move from a config file to a database lookup.

  3. Which four graphs live as separate entries in the registry? Answer: The outreach graph, the compose graph, the reply graph, and the durable campaign engine.

<!-- mem:end -->

The outreach graph is a shared draft-not-send engine reused across autonomous, human-approved, and preview pipelines, with the caller always controlling sending.

python

# The outreach graph returns a subject line, a text-only body, an HTML body,
# and some bookkeeping: a skip reason when it short-circuits, an engagement
# signal, and the time of the next touch. It never calls send.
# The same drafting engine is reused three ways: by an autonomous pipeline
# that sends without a human, by a human-approved campaign that pauses for
# sign-off, and by a one-shot preview that just shows a draft.
# Because the graph never assumes how it is invoked, the caller decides
# approval and sending every time.
ELI5 — the plain-language version

Think of the outreach engine like a professional kitchen’s prep station: the chef (graph) can chop, season, and plate a dish (draft an email) but never hands it to the customer—the server (caller) decides when to deliver. The station works the same whether the order is for dine-in, takeout, or a tasting preview. In concrete terms, the engine runs a series of small, gated steps—for example, a “faithfulness gate” that uses a judge model to check each personalized claim against the assembled evidence and removes any sentence that isn’t supported, before the draft ever leaves the station. This way the same prep workflow serves an autonomous pipeline, a human-approved campaign, or a one-shot preview, while the caller always controls sending. Without this shape, you’d have a single big send function that can’t insert a safety gate or reuse the same copy logic under different approval policies—so a fabricated claim would slip straight to the recipient, or you’d need separate kitchens for every order, multiplying complexity and risk.

Data flow — one request, in order
  1. Looking up the contact – Reads the contact from the database once and loads their role, seniority, department, and profile into the working state.
    reads / writes: reads contact row from DB; writes role, seniority, department, profile into working state.
    branch: If the contact row is missing, personalization has nothing to stand on (failure path). Happy path proceeds to the next gate.

  2. The suppression gate – Checks a central do-not-contact list using a one-way fingerprint of the email address plus domain; ends the run if the contact is on it.
    reads / writes: reads fingerprint derived from email; checks DNC list; writes an audit record of the decision.
    branch: If the contact is suppressed, the run ends early (failure). If the check cannot be completed, the contact is treated as suppressed (fail closed). Happy path proceeds to stop conditions.

  3. Stop conditions – Examines the contact’s current thread state and ends the run with a specific machine‑readable reason when any condition holds: replied, bounced, unsubscribed, or email never verified.
    reads / writes: reads thread_state; writes distinct stop_reason (e.g., “bounced”).
    branch: If any stop condition is true, the run ends with that reason (failure). Happy path proceeds to sequence planning.

  4. Plan the sequence (deterministic lookup) – Looks up the outreach sequence definition from VERTICAL_SEQUENCE_DEFS, optionally overridden by a sub‑niche entry from micro_verticals.py; returns the seq_def containing steps (3 LLM directives), cadence_days, fallback_step.
    reads / writes: reads vertical and optional sub_niche from state; reads VERTICAL_SEQUENCE_DEFS dict; writes seq_def into state.
    branch: If sub‑niche is missing or None, falls back to the vertical‑level definition. If the vertical itself is missing, no fallback is defined (undefined behavior, but assumed not reached). Happy path proceeds to hook extraction.

  5. Extracting the hook – Reads the supplied post text (a recent post or job description) and picks exactly one concrete hook to ground the opener, using only that text.
    reads / writes: reads post_text from supplied input; writes hook (a single concrete fact).
    branch: If post_text is empty, the opener has nothing real to stand on (failure path). Happy path proceeds to drafting.

  6. Drafting the step (touch 1) – Writes the body for the first touch in the sequence using the directive from seq_def.steps[0], the extracted hook, and the contact’s attributes.
    reads / writes: reads directive from seq_def.steps[0], hook, role, etc.; writes draft (subject, plain_text_body, html_body) for this touch.
    branch: If the step index is beyond the sequence length, falls back to the generic fallback_step. Happy path proceeds to faithfulness gate.

  7. The faithfulness gate (touch 1) – Audits the draft against assembled evidence using a judge model; removes any sentence whose claim is not supported; produces a faithfulness_score between 0 and 1, posted as feedback.
    reads / writes: reads draft and evidence; returns filtered_draft (cleaned) and faithfulness_score.
    branch: An over‑aggressive judge may strip true but terse claims; evidence that omitted a real fact may cause false removal. Happy path proceeds to the next iteration.

  8. Drafting the step (touch 2) – Writes the body for the second touch (follow‑up) using the directive from seq_def.steps[1].
    reads / writes: reads directive from seq_def.steps[1], hook, state; writes draft for second touch.
    branch: Same fallback as step 6. Happy path proceeds.

  9. The faithfulness gate (touch 2) – Audits the second touch draft (same node reused).
    reads / writes: same as step 7.

  10. Drafting the step (touch 3) – Writes the body for the third touch (soft close) using the directive from seq_def.steps[2].
    reads / writes: reads directive from seq_def.steps[2], etc.; writes draft for third touch.
    branch: Same fallback.

  11. The faithfulness gate (touch 3) – Audits the third touch draft (same node reused).
    reads / writes: same as step 7.

  12. Assemble final output (terminal step) – Packages the results: subject line, plain‑text body, HTML body, a skip_reason (if short‑circuited), an engagement_signal, and the time_of_next_touch (from cadence_days).
    reads / writes: reads all three filtered drafts, skip_reason, engagement_signal, next_touch_time from state; returns these as the graph’s output.
    branch: No further branches; the caller (autonomous pipeline, human‑approved campaign, or preview) decides whether to send. The graph never sends.

Control loops over touches 0..2 (steps 6–11), with each iteration comprising a drafting step and a faithfulness gate. The loop fans out over the three sequence steps, each with its own directive and cadence.
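
The loop over touches can be sketched as follows. This is a minimal illustration, not the engine's code: run_touches, the two callables, and the stub functions are hypothetical stand-ins for the drafting node and the faithfulness gate, and the stubs avoid any model call.

```python
from typing import Callable

def run_touches(
    seq_def: dict,
    hook: str,
    draft_touch: Callable[[str, str], str],
    faithfulness_gate: Callable[[str], tuple[str, float]],
) -> list[dict]:
    """Draft and audit one touch per directive in seq_def["steps"]."""
    results = []
    for idx, directive in enumerate(seq_def["steps"]):
        draft = draft_touch(directive, hook)         # drafting step
        filtered, score = faithfulness_gate(draft)   # same gate reused on every touch
        results.append({
            "touch": idx,
            "day": seq_def["cadence_days"][idx],
            "draft": filtered,
            "faithfulness_score": score,
        })
    return results

# Stub callables so the sketch runs without an LLM (assumptions, not real nodes).
def _stub_draft(directive: str, hook: str) -> str:
    return f"[{directive}] {hook}"

def _stub_gate(draft: str) -> tuple[str, float]:
    return draft, 1.0

example_seq = {"steps": ["opener", "value", "soft close"], "cadence_days": [0, 4, 7]}
touches = run_touches(example_seq, "hiring two AP clerks", _stub_draft, _stub_gate)
```

Each iteration pairs one directive with its cadence entry, so the result holds the full three-touch arc.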

Diagram — the real call graph
System design — mechanism, invariant, trade-off

The outreach engine is implemented as a directed acyclic graph of gated steps that runs in a fixed order. Execution begins with a contact lookup that loads _contact_row from the database once, establishing a single consistent snapshot of role, seniority, department, and profile for the entire run. Next, the suppression gate checks a do-not-contact list keyed on a one-way fingerprint of the email address and domain; if the check fails or the contact is suppressed, the run ends immediately. After suppression, stop conditions examine the contact’s thread state for bounce, unsubscribe, unverified address, or existing reply, and terminate with a distinct machine-readable reason per condition. Only then does adaptive cadence compute a next-touch delay, clamped to a safe range, using the engagement signal and days since last send. The sequence selector performs a deterministic lookup against the contact’s vertical and optional niche tag to produce a structured three-step arc. Within each touch, the hook step extracts exactly one grounded fact from the supplied post text, the drafting step looks up the per-step directive and writes copy fitted to that step’s role (opener, value, or soft close), and finally the faithfulness gate runs faithfulness_check with build_outreach_evidence to audit every personalized sentence against assembled evidence, stripping any unsupported claim before the email leaves the graph.
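
The fixed gate order and its short-circuit behaviour can be sketched as below. The names run_gates and Gate, the state keys, and the reason strings "suppressed" and "replied" are assumptions made for illustration; only the ordering and the fail-closed rule come from the description above.

```python
from typing import Callable, Optional

Gate = Callable[[dict], Optional[str]]  # returns a skip reason, or None to continue

def run_gates(state: dict, gates: list[Gate]) -> dict:
    """Run gates in order; the first skip reason ends the run."""
    for gate in gates:
        reason = gate(state)
        if reason is not None:
            return {**state, "skip_reason": reason}
    return {**state, "skip_reason": None}

def suppression_gate(state: dict) -> Optional[str]:
    # Fail closed: a check that could not complete (None) counts as suppressed.
    suppressed = state.get("suppressed")
    return "suppressed" if suppressed is None or suppressed else None

def stop_conditions(state: dict) -> Optional[str]:
    # One distinct machine-readable reason per condition.
    thread = state.get("thread_state")
    stops = {"replied", "bounced", "unsubscribed", "email_unverified"}
    return thread if thread in stops else None

GATES = [suppression_gate, stop_conditions]
```

Because suppression runs first, a contact whose check never completed is skipped before its thread state is even read.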

The invariant the design preserves is that the graph never decides to send. The caller, whether an autonomous pipeline, a human-approved campaign, or a one-shot preview, always controls the final send decision. The graph produces a draft and a pending state, and the draft goes no further until the caller approves it. A second invariant, enforced by the faithfulness gate, is that every personalized claim in the output is supported by evidence assembled in the faithfulness_evidence block, which is built from the hook, source post, memory context, and contact role. The judge model from DeepEval produces a score between zero and one, and any sentence whose claim is unsupported is removed. As a result, the output can be audited for groundedness, and unsupported claims, which can become a compliance problem, are stripped before the email leaves the graph.
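
A minimal sketch of the second invariant follows. It is not the real faithfulness_check: the sentence splitter, the toy_judge stand-in, and the choice to score as the fraction of sentences kept are all assumptions, and this version audits every sentence rather than only the personalized ones. A real judge would be a model call.

```python
import re
from typing import Callable

def faithfulness_filter(
    body: str,
    evidence: str,
    is_supported: Callable[[str, str], bool],
) -> tuple[str, float]:
    """Drop unsupported sentences; return (filtered_body, score in [0, 1])."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    if not sentences:
        return "", 1.0  # nothing to audit (assumed convention)
    kept = [s for s in sentences if is_supported(s, evidence)]
    return " ".join(kept), len(kept) / len(sentences)

def toy_judge(sentence: str, evidence: str) -> bool:
    # Toy rule: supported if the sentence contains one ";"-separated evidence fact.
    facts = [f.strip().lower() for f in evidence.split(";") if f.strip()]
    return any(f in sentence.lower() for f in facts)

body = "Saw you are hiring two AP clerks. Congrats on your Series B."
filtered, score = faithfulness_filter(body, "hiring two AP clerks; controller", toy_judge)
```

Here the second sentence has no matching evidence, so it is removed and the score drops to 0.5.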

The key trade-off is between flexibility and auditability in sequence selection. The design chooses a deterministic lookup from VERTICAL_HOOK_TEMPLATES and a hard-coded vertical-to-sequence map, rejecting the alternative of letting the model invent the sequence each time. That alternative would be more flexible and adapt freely to each recipient, but it would be unrepeatable and impossible to audit—each run would generate a different arc with no traceable provenance. The deterministic lookup buys repeatability and inspectability: every run for a given vertical and niche produces the same sequence structure, so a human approver and the traces can see the whole multi-touch arc before any copy is written. The cost the rejection avoids is the inability to prove, after a compliance incident, exactly what sequence was followed for a given contact. A similar trade-off appears in the hook extraction: extracting one grounded fact from the supplied text before drafting, rather than letting the model freelance the opener from full context, avoids the cost of invented details that never appeared in the recipient’s post.
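
The deterministic lookup can be sketched as a pair of dictionary reads. The table contents here are made up (including the "ap_automation" sub-niche and the directive text), and returning None for an unknown vertical is an assumption; the resolution order, sub-niche first and vertical second, follows this guide.

```python
from typing import Optional

# Illustrative tables only; real entries live in the engine's definition files.
VERTICAL_SEQUENCE_DEFS = {
    "accounting_ai": {
        "steps": ["opener", "value", "soft close"],
        "cadence_days": [0, 4, 7],
        "fallback_step": 2,
    },
}
SUB_NICHE_SEQUENCE_DEFS = {
    "accounting_ai": {
        "ap_automation": {
            "steps": ["AP opener", "AP value", "AP soft close"],
            "cadence_days": [0, 4, 7],
            "fallback_step": 2,
        },
    },
}

def get_sequence_def(vertical: Optional[str], sub_niche: Optional[str] = None) -> Optional[dict]:
    """Sub-niche entry wins if present; otherwise the vertical-level entry."""
    if sub_niche:
        hit = SUB_NICHE_SEQUENCE_DEFS.get(vertical, {}).get(sub_niche)
        if hit is not None:
            return hit
    return VERTICAL_SEQUENCE_DEFS.get(vertical)
```

No model call is involved, so the same vertical and niche always resolve to the same sequence definition.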

One concrete failure mode is a missing contact row during the initial lookup. The mechanism reads _contact_row from the database once; if the row is absent, the personalization step receives an empty role and department, and the build_outreach_evidence assembly produces a faithfulness_evidence block with no contact facts. The drafting step then has no grounded evidence for any personalized claim, and the faithfulness gate will likely strip all such claims, producing an email that is generic or empty. The signal an operator would actually see is a span attribute logged with a missing-contact tag, and the counter email.compose.vertical_hook_rate would not increment because no vertical hook was selected. Additionally, the run would end with no draft in the pending state, leaving an orphaned thread with no email text.

Cost & performance — the real knobs

The subsystem spends time and money primarily on LLM calls: drafting each touch, extracting the hook, and running the faithfulness judge that audits every personalized sentence. One read of the contact at the top buys consistency but adds a single database query. The deterministic sequence lookup is a cheap dict access. The main cost drivers are the number of model calls per run and the choice between a keyword check (free) and a judge model (one extra call per sentence). The source provides no explicit env vars for concurrency, retries, batch sizes, or caches, but it does expose several named constants and configuration maps that act as real performance knobs.

  • CADENCE_DAYS (list [0, 4, 7] in VERTICAL_SEQUENCE_DEFS)

    • Knob — cadence_days under each vertical’s sequence definition; default values shown as [0, 4, 7] for accounting_ai.
    • Bounds — Controls the minimum gap between touches (first touch at day 0, second at day 4, third at day 7). Also subject to code-level clamping (adaptive cadence clamps to a safe range, but the static table is the fallback).
    • Effect — Shorter gaps increase send frequency, raising throughput and total model cost per campaign over a fixed period. Longer gaps reduce the number of touches per time unit, lowering dollar cost and reducing risk of over-messaging.
    • Risk — If set too short, follow-ups can annoy recipients: same-day blasts are clamped, but list values changed to [0, 1, 2] without clamping would still send on consecutive days. If set too long, the sequence may lose engagement momentum.
  • FALLBACK_STEP (integer 2 in VERTICAL_SEQUENCE_DEFS)

    • Knob — fallback_step in each sequence definition; default 2.
    • Bounds — Defines which step index to use when the current step index is past the end of the sequence. It prevents a runtime error by providing a valid step.
    • Effect — Affects which directive is used for emails that exceed the planned sequence length. A higher fallback step means the extra touches use a later-stage directive (e.g., soft close) rather than the first step. This can change the content cost (tokens per touch) and slightly alter the number of model calls if the fallback triggers often.
    • Risk — If set too low (e.g., 0), extra touches repeat the opener, which may be inappropriate and waste model calls. If set too high (past the sequence), it could cause another index error.
  • SUB_NICHE_SCORE_WEIGHTS (defined in micro_verticals.py)

    • Knob — sub_niche_score_weights (a nested map in micro_verticals.py); no default value shown, but it is part of the per-sub-niche calibrated weights.
    • Bounds — Determines which sub-niche sequence is selected based on the contact’s vertical and niche tag. It controls routing to narrower, more tailored copy.
    • Effect — Properly calibrated weights send better‑targeted emails, improving response rates and reducing wasted model calls on ineffective drafts. A stale or missing weight may cause fallback to the generic vertical sequence, which may use different directives and potentially different token costs.
    • Risk — If weights are mis-set (e.g., all zero), the sub-niche lookup fails and falls back; the fallback sequence may be less efficient. If a niche tag no longer matches after taxonomy changes, the lookup finds nothing and the run uses the vertical-level copy.
  • Faithfulness gate judge model (not named in source, but described as “a judge that compares each claim to the evidence” costing “one extra model call”)

    • Knob — The choice of judge model (or binary switch between keyword check and judge model). No identifier is given in the source, but the cost is explicit.
    • Bounds — Each sentence is audited; one extra model invocation per sentence.
    • Effect — Turning it on (judge) adds a full model call per sentence, increasing latency and dollar cost proportionally to sentence count. Turning it off (keyword check or none) saves money but risks shipping fabricated claims.
    • Risk — An over‑aggressive judge may strip true claims (increasing rejection rate and forcing re‑drafts). A missing judge allows unsupported claims to be sent, eroding trust.
  • Contact lookup read (single read at top of graph)

    • Knob — Not a tunable identifier, but a design constant: “one read at the top buys one consistent view.”
    • Bounds — Limits the number of database queries to one per run, no matter how many steps.
    • Effect — Eliminates repeated reads, reducing latency and database load. Increasing to more reads would increase I/O and risk of stale data.
    • Risk — A missing contact row (the failure mode) leaves personalization with nothing, so the whole run fails or produces generic copy.

These are the only explicit numeric or configurable identifiers in the source that directly influence time and money. The source does not provide env vars for concurrency, per‑host limits, retry counts, batch sizes, caches, or retrieval top‑k; those would be additional knobs in a production deployment not described here.
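
The clamping that bounds the cadence knob can be sketched as below. The constant names CADENCE_MIN_DAYS and CADENCE_MAX_DAYS appear in the glossary, but the values 2 and 14 are assumed, and next_touch_gap is a hypothetical helper. Clamping the static fallback as well as the model's proposal is also an assumption.

```python
from typing import Optional

CADENCE_MIN_DAYS = 2    # assumed value; the source names the constant, not its value
CADENCE_MAX_DAYS = 14   # assumed value

def clamp_gap(days: int) -> int:
    """Force a gap in days into the safe range."""
    return max(CADENCE_MIN_DAYS, min(CADENCE_MAX_DAYS, days))

def next_touch_gap(proposed_days: Optional[int], static_days: int) -> int:
    """Use the proposed gap if there is one, else the static table value."""
    days = static_days if proposed_days is None else proposed_days
    return clamp_gap(days)
```

With these assumed bounds, a same-day proposal of 0 becomes 2 and a 30-day proposal becomes 14, so a bad proposal can cost at most the width of the range.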

Failure modes — what breaks, what catches it

Missing contact row

  • Trigger — the contact database lookup in the first spine step returns no row for the given contact ID.
  • Guard — No guard is shown in the source. The spine reads the contact once and does not validate that a row was found; the personalization steps later receive None attributes.
  • Posture — Fail‑soft: the run continues but produces a draft with empty or default personalization, which may still be sent (the caller controls sending). No early abort is triggered.
  • Operator signal — The log would show the personalization steps receiving missing role, seniority, department, and profile, but no explicit error or warning around the absent contact row is specified.
  • Recovery — No automatic recovery exists. The draft would be human‑edited during approval pause, or the operator must manually fix the contact data and rerun.
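
One possible guard, sketched under assumptions: the lookup sets a skip_reason when the row is absent so the run ends before drafting. The function name lookup_contact_guarded and the reason string "contact_missing" are hypothetical; the source shows no such guard.

```python
from typing import Optional

def lookup_contact_guarded(contact_row: Optional[dict]) -> dict:
    """Build the initial state; mark the run as skipped when no row was found."""
    if not contact_row:
        # Hypothetical reason string; it would short-circuit the graph like any other skip.
        return {"_contact_row": None, "skip_reason": "contact_missing"}
    return {"_contact_row": contact_row, "skip_reason": None}
```

This would turn the fail-soft posture into an early, visible abort, at the cost of producing no draft at all for contacts with missing data.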

Suppression gate bypass due to unnormalized address

  • Trigger — The suppression gate computes a one‑way fingerprint on the email address before checking the do‑not‑contact list, but the address was not normalized – for example, User@Example.com vs user@example.com – causing a mismatch that lets a suppressed contact through.
  • Guard — No guard is shown in the source. The source explicitly notes this as a failure mode and only states that the gate “fails closed” when the check cannot be completed, not when the check silently misses due to normalization differences.
  • Posture — Fail‑open: the contact is not recognized as suppressed and the run proceeds, potentially sending to someone who opted out, violating compliance.
  • Operator signal — No immediate signal; the audit record written by the gate would show a false “not suppressed” decision, but that record is not inspected automatically.
  • Recovery — No automatic recovery. The operator must detect the compliance issue externally, manually add a suppression entry, and investigate why the fingerprint mismatched. The normalization step must be fixed in code.
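
A minimal sketch of the normalization fix, assuming the fingerprint is a SHA-256 digest of the address as the glossary describes. The helper names are hypothetical, and lowercasing the whole address (local part included) is a common convention rather than something the source specifies.

```python
import hashlib

def normalize_email(address: str) -> str:
    """Trim whitespace and lowercase so case variants hash identically."""
    return address.strip().lower()

def email_fingerprint(address: str) -> str:
    """SHA-256 hex digest of the normalized address."""
    return hashlib.sha256(normalize_email(address).encode("utf-8")).hexdigest()

def email_domain(address: str) -> str:
    """Domain part of the normalized address."""
    return normalize_email(address).rsplit("@", 1)[-1]
```

The same normalization has to run both when an entry is added to the suppression list and when it is checked, otherwise the mismatch only moves to the other side.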

Step index past end of sequence

  • Trigger — The current step number exceeds the number of steps defined in the sequence for the contact’s vertical and niche. For example, a sequence with only 3 steps but the campaign engine requests step 4.
  • Guard — get_step_directive clamps to the fallback_step when step_idx >= len(steps). In the code: if step_idx >= len(steps): step_idx = seq.get("fallback_step", len(steps) - 1).
  • Posture — Fail‑soft: the drafting step uses the directive of the fallback step (or the last step if fallback_step is not set), producing copy that fits a different role in the sequence rather than the intended one. The run continues.
  • Operator signal — No explicit error is raised; the operator would observe a draft that repeats an earlier step’s message or mismatches the expected step role (e.g., a “soft close” appearing earlier than intended).
  • Recovery — The clamp provides automatic fallback. The operator may manually edit the draft during approval, or adjust the sequence definitions to match the expected number of touches.
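
The clamp quoted in the guard can be placed in a small runnable function. The first branch mirrors the quoted line; the extra bound on the result is an addition not shown in the source, included because a fallback_step set past the end would otherwise raise an index error. The sketch assumes steps is non-empty.

```python
def get_step_directive(seq: dict, step_idx: int) -> str:
    """Return the directive for step_idx, clamping out-of-range requests."""
    steps = seq["steps"]
    if step_idx >= len(steps):
        step_idx = seq.get("fallback_step", len(steps) - 1)
    # Extra bound (not in the quoted source): keep a mis-set fallback_step in range.
    step_idx = max(0, min(step_idx, len(steps) - 1))
    return steps[step_idx]

seq = {"steps": ["opener", "value", "soft close"], "fallback_step": 2}
```

A request for a fourth touch against this three-step sequence resolves to the soft close.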

Faithfulness gate over-aggressive

  • Trigger — The judge model removes a sentence from the draft that is actually true but expressed tersely or using synonyms that the judge does not consider supported by the evidence.
  • Guard — No guard is shown in the source. The gate itself is the last line of defense; it posts a score for measurement but has no runtime validation or override to prevent stripping true claims.
  • Posture — Fail‑soft: the draft is truncated by losing one or more true statements. The email may become generic or lose a persuasive claim, but the run continues and the shortened draft is sent (if approved).
  • Operator signal — The operator can see the removed sentences if the draft is compared to the gate’s score feedback, but no log or metric automatically highlights over‑aggression. The score alone does not distinguish over‑aggression from correct removal.
  • Recovery — No automatic recovery. The operator must manually restore the removed claim during the approval edit step, or a future version of the judge model must be tuned based on the measured scores to reduce strictness.

Stale engagement signal biasing cadence

  • Trigger — The cadence_days decision reads an engagement signal (e.g., a recorded email open) that was recorded hours or days after the event, so it appears to be a recent open when it is actually old, or vice versa. The model proposes a gap based on that stale signal, and the code clamps within bounds but does not validate freshness.
  • Guard — No guard is shown in the source. The clamp (safe range) limits the proposal to a min‑max window, but a stale signal can still cause the gap to be shortened or lengthened incorrectly inside those bounds.
  • Posture — Fail‑soft: the timing adapts but may send the next touch too soon (risk of annoyance) or too late (risk of losing interest). The run continues.
  • Operator signal — No explicit error; the operator might observe an unusually short (e.g., 2‑day) gap after a contact who actually opened a week ago, or a long gap after a recent open, visible only by cross‑checking the engagement log with the schedule.
  • Recovery — No automatic recovery. The operator can override the scheduled time during the approval pause, or the engagement signal pipeline must be fixed to provide time‑stamped events with a staleness check.
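
A staleness check of the kind the recovery note calls for might look like this. Everything here is an assumption: the six-hour threshold, the two timestamp fields, and the policy of downgrading a late-recorded signal to no_response. Only the signal names come from the glossary.

```python
from datetime import datetime, timedelta

MAX_RECORDING_LAG = timedelta(hours=6)  # assumed threshold

def is_signal_fresh(event_at: datetime, recorded_at: datetime,
                    max_lag: timedelta = MAX_RECORDING_LAG) -> bool:
    """True when the signal was recorded soon enough after the event to trust."""
    lag = recorded_at - event_at
    return timedelta(0) <= lag <= max_lag

def effective_signal(signal: str, event_at: datetime, recorded_at: datetime) -> str:
    """Pass a fresh signal through; treat a late-recorded one as no_response."""
    return signal if is_signal_fresh(event_at, recorded_at) else "no_response"
```

The check only works if the pipeline stores both when the event happened and when it was recorded, which is the time-stamping the recovery step asks for.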

Niche tag no longer matching any definition after taxonomy changes

  • Trigger — A contact’s sub‑niche field contains a value that no longer exists in SUB_NICHE_SEQUENCE_DEFS because the taxonomy was revised (e.g., a sub‑niche was renamed or removed).
  • Guard — get_sequence_def returns None when sub_niche is not found in the sub‑map, and then get_step_directive returns None, which triggers the generic fallback drafting step. The sequence selector also falls back to the vertical‑level VERTICAL_SEQUENCE_DEFS.
  • Posture — Fail‑soft: the system uses the broader vertical sequence instead of the tailored niche sequence. The personalization is less specific but still grounded and functional. The run continues.
  • Operator signal — No log or metric explicitly raises the tag mismatch. The operator would notice that contacts with an orphaned niche tag receive generic copy instead of the expected tailored variant, only by comparing the draft to the expected niche arc.
  • Recovery — Automatic fallback to the vertical sequence. The operator must update the taxonomy and either map the old tag to a new one or reassign contacts. No retry or backoff is triggered.
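
Since nothing raises the mismatch automatically, an offline audit is one way to surface it. The sketch below is hypothetical: find_orphaned_niche_tags is not part of the source, and it assumes contacts are dicts with vertical and sub_niche keys and that the definitions use the nested {vertical: {sub_niche: seq_def}} shape.

```python
def find_orphaned_niche_tags(contacts: list[dict], sub_niche_defs: dict) -> list[dict]:
    """Return contacts whose sub_niche has no entry under their vertical."""
    orphans = []
    for contact in contacts:
        tag = contact.get("sub_niche")
        known = sub_niche_defs.get(contact.get("vertical"), {})
        if tag and tag not in known:
            orphans.append(contact)
    return orphans
```

Running it after each taxonomy change lists the contacts that would silently fall back to vertical-level copy.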

Interview — could you explain it?

Q – How does the outreach engine ensure that the same copy generation logic can be reused across different sending policies?
A – The outreach graph drafts copy but never sends; sending is a separate decision controlled by the caller. The system keeps a registry that lists each graph by a short name and pairs it with the module that builds it, and the same drafting engine is reused by an autonomous pipeline, a human‑approved campaign, and a one‑shot preview, because the graph never assumes how it is invoked.
Follow-up – What bookkeeping does the outreach graph return to support those different callers?
It returns a skip reason, an engagement signal, and the time of the next touch, leaving the send/no‑send decision entirely to the caller.
Weak answer misses – The exact bookkeeping fields: skip reason, engagement signal, next touch time.


Q – Why did you choose a shared graph runtime fronted by a registry over a single big send function? (design question)
A – A single send function has the fewest moving parts but one failure domain and nowhere to insert a human or a safety gate. The shared graph runtime with a registry gives gated, traceable steps with additive growth, at the cost of one routing layer and the discipline to keep the registry simple. This allows each step to own one concern and lets the team test, gate, and trace each step independently.
Follow-up – How does the graph runtime handle the trade‑off between isolation and platform overhead compared to separate microservices?
Separate microservices give clean isolation but require deployment, tracing, and state plumbing per flow; the shared runtime avoids that overhead while still keeping steps separated.
Weak answer misses – The explicit mention of the “platform tax” (deployment, tracing, state plumbing) that separate microservices incur, and the “one failure domain” of a single send function.


Q – How does the system select the correct sequence of outreach touches for a given contact?
A – The function get_sequence_def looks up a sequence definition by vertical and optionally by sub‑niche. If a sub‑niche is provided and a matching entry exists in SUB_NICHE_SEQUENCE_DEFS, that tailored sequence wins; otherwise it falls back to VERTICAL_SEQUENCE_DEFS. The per‑step directives are then retrieved by get_step_directive, which clamps out‑of‑range steps to the fallback_step.
Follow-up – What happens if a sub‑niche tag no longer matches any definition after a taxonomy change?
That is a known failure mode: the niche tag no longer matches any definition, causing fallback to the vertical‑level sequence, which may be less specific.
Weak answer misses – The exact nesting structure: SUB_NICHE_SEQUENCE_DEFS is a nested map {vertical: {sub_niche: seq_def}} and the resolution logic (additive, with sub‑niche wins if present, else vertical).


Q – How does the system prevent the drafting model from fabricating personalized claims?
A – The faithfulness gate uses a judge model to audit each sentence in the draft against the assembled evidence. Any sentence whose claim is not supported is removed before the email is finalized. The gate produces a score between zero and one that is posted as feedback, allowing prompt and model versions to be ranked by how grounded their output is.
Follow-up – What are the failure modes of the faithfulness gate?
An over‑aggressive judge that strips a true but tersely worded claim, or an evidence set that omitted a real fact.
Weak answer misses – The judge model compares each claim to the evidence (semantic fabrication detection), and the gate’s output is a score used as feedback for ranking versions.


Q – The system describes three approaches to staying grounded: trusting the drafting model, a keyword check, and a judge model. Under what circumstances would you choose the judge model over the cheaper alternatives?
A – Trusting the drafting model is cheapest but ships a single confident fabrication; a keyword check is deterministic but blind to meaning. The judge model catches semantic fabrication by comparing each claim to the evidence, at the cost of one extra model call. The system uses the judge model because personalized claims that are not true erode trust and can be a compliance problem, so semantic accuracy is worth the extra call.
Follow-up – How does the judge model’s output get used beyond the immediate gate?
The score is posted as feedback to rank prompt versions and model versions by how grounded their output is.
Weak answer misses – The exact trade‑offs: cheapest vs. blind to meaning vs. semantic detection, and the compliance/trust motivation that justifies the extra cost.

Glossary — the domain terms, grounded in the code

16 terms, each defined from this subsystem’s real source.

suppression_gate

suppression_gate is a node that runs after lookup_contact and before check_stop_conditions; it checks the suppression_list table by SHA-256 email hash and domain, and on a hit writes an audit row and sets skip_reason to short-circuit the graph to END, blocking the send.

Memory hook: The suppression_gate is a bouncer that checks your email’s hash and domain, then boots you out before any AI draft.

From email_outreach_graph.py

check_stop_conditions

check_stop_conditions is a node that inspects the contact’s current thread state (using status values like bounced, complained, followup_status stopped, and reply classification) and terminates the run with a specific machine-readable reason when a stop condition applies, serving as a separate guard from the permanent suppression gate.

Memory hook: check_stop_conditions stops runs for bounced, complained, or unsubscribed contacts, acting as a live conversation guard.

From email_outreach_graph.py

decide_cadence

decide_cadence is a graph node that calls the _cadence_decision function, which uses a DeepSeek LLM to determine the optimal days_gap (clamped between CADENCE_MIN_DAYS and CADENCE_MAX_DAYS) based on engagement signals like opened, no_response, or first_touch, and runs after check_stop_conditions and before select_template in the email outreach flow.

Memory hook: decide_cadence uses DeepSeek to set days_gap (clamped) from opened, no_response, or first_touch.

From email_outreach_graph.py

select_sequence

select_sequence is a pure lookup function that returns a structured sequence plan containing a sequence_id and a list of touches (each with step and angle) for a given vertical and optional sub_niche, or None for unknown/missing verticals (graceful fallback); in the graph flow, the select_sequence_node wraps it and runs after template selection but before hook extraction so that the full sequence plan is available in state before any LLM call.

Memory hook: select_sequence maps a vertical to its step-and-angle blueprint in the graph's pre-draft phase.

From email_outreach_graph.py

draft_step

draft_step is a graph node that generates per-step email copy by applying a vertical-specific step directive from VERTICAL_SEQUENCE_DEFS, falling back to the generic draft node logic when company_vertical is absent or sequence_step is None or 0, and it runs after extract_hook and before build_outreach_evidence in the outreach flow.

Memory hook: draft_step steps after extract_hook to write each sequence touch's email, using vertical directives or generic fallback.

From email_outreach_graph.py

VERTICAL_SEQUENCE_DEFS

VERTICAL_SEQUENCE_DEFS is the dictionary that maps vertical slugs to their default sequence definitions, serving as the fallback when no sub-niche-specific entry exists in SUB_NICHE_SEQUENCE_DEFS, and is referenced by get_sequence_def and get_step_directive to retrieve the vertical-level sequence plan or step directive.

Memory hook: VERTICAL_SEQUENCE_DEFS is the generic fallback plan that catches every vertical without a sub-niche override.

From email_outreach_graph.py

SUB_NICHE_SEQUENCE_DEFS

SUB_NICHE_SEQUENCE_DEFS is a nested dictionary mapping vertical names to sub-niche names to sequence definitions, used to provide tailored outreach copy for specific sub-niches, with fallback to the vertical-level VERTICAL_SEQUENCE_DEFS entry when a sub-niche is missing or None.

Memory hook: SUB_NICHE_SEQUENCE_DEFS holds tailored sequences for sub-niches, falling back to VERTICAL_SEQUENCE_DEFS when absent.

From email_outreach_graph.py

sequence_id

sequence_id is the unique identifier for a vertical-level or sub-niche sequence definition returned by select_sequence; it is logged alongside the touch count for observability and stored in selected_sequence so the approval layer can inspect the plan before any draft is generated.

Memory hook: sequence_id tags every sequence plan for observability logging and approval-layer inspection before drafting.

From email_outreach_graph.py

touch_angles

touch_angles is a list of angle descriptions for each step in a sequence definition, converted by build_sequence_touches into the structured [{step, angle}] list that forms part of the sequence plan returned by select_sequence.

Memory hook: touch_angles is the persuasive hook for each sequence step, converted to structured {step, angle} by build_sequence_touches.

From email_outreach_graph.py

build_sequence_touches

build_sequence_touches is a helper function that converts a sequence definition's touch_angles list into the structured [{step, angle}] list required by the spec, and it is called within select_sequence to build the touches portion of the returned sequence plan.

Memory hook: build_sequence_touches forges raw touch_angles into structured step-angle pairs for the sequence plan.

From email_outreach_graph.py
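
The conversion can be illustrated in a few lines. This is a sketch of the described shape only: whether the real helper numbers steps from 0 or from 1 is not stated in the source, and 0-based numbering is assumed here.

```python
def build_sequence_touches(touch_angles: list[str]) -> list[dict]:
    """Turn a flat list of angles into [{step, angle}] records (0-based, assumed)."""
    return [{"step": i, "angle": angle} for i, angle in enumerate(touch_angles)]
```

For example, ["pain point", "proof"] becomes two records with steps 0 and 1, each carrying its angle.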

cadence_days

In the sequence definitions (such as those stored in VERTICAL_SEQUENCE_DEFS or SUB_NICHE_SEQUENCE_DEFS), cadence_days is a list of integers that specifies the number of days between each sequential step, with the same length as the steps array.

Memory hook: cadence_days is the silent gap between each step's beat in your sequence's rhythm.

From email_outreach_graph.py

fallback_step

fallback_step is an integer field in each vertical or sub-niche sequence definition that specifies the step index to use when the requested step index exceeds the length of the steps list, effectively clamping out-of-range steps to a designated fallback step (typically the last step).

Memory hook: When your step index overshoots, fallback_step is the emergency brake that stops at the last defined step.

From email_outreach_graph.py

wrap_untrusted

wrap_untrusted is a function that fences attacker-influenceable text with a label and logs a structured warning when it detects an injection marker, neutralising the text to prevent it from steering the draft.

Memory hook: wrap_untrusted fences attacker-influenceable text with a label and sounds an alarm when it detects injection markers.

From email_compose_graph.py
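
A rough sketch of the fencing idea follows. The fence syntax, the marker patterns, the logger name, and the has_injection_marker helper are all assumptions; the source says only that the text is fenced with a label and that a structured warning is logged on an injection marker.

```python
import logging
import re

logger = logging.getLogger("outreach.untrusted")  # assumed logger name

# Assumed marker patterns; the source does not list the real ones.
_INJECTION_MARKERS = re.compile(
    r"ignore (all )?previous instructions|system prompt", re.IGNORECASE
)

def has_injection_marker(text: str) -> bool:
    return _INJECTION_MARKERS.search(text) is not None

def wrap_untrusted(text: str, label: str) -> str:
    """Fence untrusted text under a label and warn if it looks like an injection."""
    if has_injection_marker(text):
        logger.warning("possible prompt injection", extra={"source_label": label})
    # Defuse any closing fence inside the text so it cannot break out of the block.
    safe = text.replace("</untrusted", "[/untrusted")
    return f"<untrusted:{label}>\n{safe}\n</untrusted:{label}>"
```

The drafting prompt can then instruct the model to treat anything inside the fence as data, never as instructions.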

LLM_KILL_SWITCH

LLM_KILL_SWITCH is an environment variable that when set to truthy values like "1" or "true" short-circuits the faithfulness_check node by returning a faithfulness_score of 1.0 and the unmodified body with no LLM call; it also causes the make_llm() function to raise LlmDisabledError for every LLM path in the system.

Memory hook: LLM_KILL_SWITCH blocks every LLM call and makes faithfulness_check hand back the unmodified body with a perfect 1.0 score.

From email_compose_graph.py
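
The two behaviours described for the switch can be sketched together. The exact set of truthy strings, the llm_disabled helper, and the return shape of faithfulness_check are assumptions; the judge call itself is left out.

```python
import os

class LlmDisabledError(RuntimeError):
    """Raised when an LLM is requested while the kill switch is on."""

def llm_disabled() -> bool:
    # Assumed truthy set; the source gives "1" and "true" as examples.
    return os.environ.get("LLM_KILL_SWITCH", "").strip().lower() in {"1", "true"}

def make_llm():
    if llm_disabled():
        raise LlmDisabledError("LLM_KILL_SWITCH is set")
    return object()  # placeholder for a real client

def faithfulness_check(body: str) -> dict:
    if llm_disabled():
        # Short-circuit: no LLM call, body untouched, perfect score.
        return {"body": body, "faithfulness_score": 1.0}
    raise NotImplementedError("judge call omitted from this sketch")
```

Note the consequence: with the switch on, a score of 1.0 means the gate did not run, not that the draft was verified.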

skip_reason

skip_reason is a field in the EmailOutreachState dictionary that, when set to a non‑None string (e.g., "email_unverified", "bounced", "unsubscribed", "stopped"), causes the conditional edge function _route_after_stop_check to return "skip" and short‑circuit the graph to END; it is set by lookup_contact, check_stop_conditions, and suppression_gate when they determine the recipient should not be contacted.

Memory hook: skip_reason is the “skip” flag that, when set, short‑circuits the graph straight to END.

From email_outreach_graph.py

plan-approval gate

The plan-approval gate is an interrupt-based gate in pipeline_graph.outreach_queue that enforces a human decision (approve, edit, reject, or skip) before the outreach graph runs; this gate is already enforced upstream of the graph described here, which is composed after approval and never sends.

Memory hook: The plan-approval gate is an interrupt that halts for a human to approve, edit, reject, or skip before the graph runs.

From email_outreach_graph.py