Evaluation & Feedback — Transcript

How the agentic-sales fleet keeps itself honest — every chapter grounded in the real evaluation code: the deterministic evaluators that score the same way in CI and production, the typed dataset contracts, the DeepEval judge and its bar, trajectory checks, the coverage and feedback gates, and cost tracked as a first-class metric.

01. What Evaluation Is

To evaluate a fleet of language model graphs, you need more than a single accuracy number. Each graph is a multi-stage pipeline with independent failure points. A single score hides where things break.

Take a generation graph. It needs two metrics. One checks that every factual claim in the output is grounded in the provided context. Another verifies the output actually answers the input prompt. A high accuracy could still hide a hallucinated claim that sounds relevant but is not true.

Now consider an extraction graph. It also requires two metrics. A stricter faithfulness check ensures no hallucinated extractions appear. A contextual precision metric measures whether the top-ranked retrieved nodes are actually relevant to the query. A single accuracy number cannot tell a hallucinated extraction apart from a badly ranked retrieval.

Each stage has its own failure mode. Faithfulness failures, relevance failures, precision failures. You cannot diagnose them with one number. By splitting evaluation into separate metrics, you pinpoint exactly where the pipeline goes wrong. That is the trade-off: more work setting up metrics, but much clearer signal. Without that, you are flying blind.
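
To make the point concrete, here is a minimal sketch of how an average can pass while one metric fails. The scores and the `BAR` value are made-up illustrations, not numbers from the evaluation code, and `failing_metrics` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical per-metric scores for one generation graph run (illustrative only).
scores = {"faithfulness": 0.65, "answer_relevancy": 0.99}

# Hypothetical pass bar; the real thresholds live in the evaluation code.
BAR = 0.80


def failing_metrics(metric_scores: dict[str, float], bar: float) -> list[str]:
    """Return the names of metrics below the bar, sorted for stable output."""
    return sorted(name for name, value in metric_scores.items() if value < bar)


# The single aggregate number clears the bar...
average = mean(scores.values())
average_passes = average >= BAR

# ...while the per-metric view names the stage that actually failed.
failures = failing_metrics(scores, BAR)
```

Read side by side, the aggregate says "ship it" and the split view says "faithfulness is broken", which is the diagnosis the single number hides.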

<!-- mem:begin -->

Generate it: A generation graph needs two metrics: one checks every claim is grounded in context, the other checks the output answers the i____ prompt. (cue: i____; answer: input)

Generate it: An extraction graph uses a stricter faithfulness check plus a contextual p_________ metric for whether top ranked nodes are relevant. (cue: p_________; answer: precision)

Ask yourself: Why does a single accuracy number leave you flying blind across a multi-stage pipeline?

Recall check (try before reading the answer):

  1. What are the two metrics a generation graph needs? Answer: One checks every factual claim is grounded in the provided context; the other verifies the output answers the input prompt.

  2. Why can a high accuracy score still be misleading? Answer: A high accuracy could still hide a hallucinated claim that sounds relevant but is not true.

  3. What do you gain by splitting evaluation into separate metrics, and what does it cost? Answer: You pinpoint exactly where the pipeline goes wrong; the trade-off is more work setting up metrics.

<!-- mem:end -->

Evaluation uses separate metrics for generation and extraction graphs to pinpoint failure modes.

python
def make_generation_metrics(judge=None, *, with_faithfulness=True, strict=False):
    from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
    j = judge if judge is not None else make_judge()
    faith_threshold = THRESHOLD_STRICT if strict else THRESHOLD_DEFAULT
    metrics = []
    if with_faithfulness:
        metrics.append(FaithfulnessMetric(threshold=faith_threshold, model=j, include_reason=True))
    metrics.append(AnswerRelevancyMetric(threshold=THRESHOLD_DEFAULT, model=j, include_reason=True))
    return metrics

def make_extraction_metrics(judge=None):
    from deepeval.metrics import ContextualPrecisionMetric, FaithfulnessMetric
    j = judge if judge is not None else make_judge()
    return [
        FaithfulnessMetric(threshold=THRESHOLD_STRICT, model=j, include_reason=True),
        ContextualPrecisionMetric(threshold=THRESHOLD_DEFAULT, model=j, include_reason=True),
    ]

02. Deterministic Code Evaluators

A deterministic evaluator takes a run and an example, then returns three things: a key, a score between zero and one, and a reason. Because it is pure, the same input always gives the exact same score. No hidden state, no randomness, no variation. That is a key property. It means the evaluator behaves identically every time you call it.

Now here is the concrete detail. In the code, there are pure fixture-based evaluators that check step directives for expected keywords. They do not call a language model. They just compare text to a list. That makes them deterministic: the same example always passes or fails in exactly the same way. No surprise in production that never showed up in continuous integration.

The trade-off is clear. A deterministic evaluator is fast and reliable. You can run it in continuous integration and in production without any change. But it cannot catch every nuance. It will only check for the patterns you hardcoded. A language model judge can handle more, but it is not deterministic. The same input might give different scores because the model changes or because of randomness. So if certainty matters, a pure evaluator is better. You trade depth for reproducibility. And that is often a good deal when you need to trust the numbers.
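
A pure keyword evaluator of the kind described above might look like the following sketch. The function name, the `directive` and `expected_keywords` fields, and the `keyword_coverage` key are assumptions made for illustration; only the key/score/reason shape comes from the text.

```python
from typing import Any


def keyword_evaluator(run: dict[str, Any], example: dict[str, Any]) -> dict[str, Any]:
    """Score the fraction of expected keywords found in the run's directive text.

    Pure function: no model call, no clock, no randomness, so the same
    (run, example) pair always yields the same result.
    """
    text = str(run.get("directive", "")).lower()
    expected = [kw.lower() for kw in example.get("expected_keywords", [])]
    if not expected:
        return {"key": "keyword_coverage", "score": 1.0, "reason": "no expected keywords"}
    missing = [kw for kw in expected if kw not in text]
    score = (len(expected) - len(missing)) / len(expected)
    reason = f"missing={missing}" if missing else "all keywords present"
    return {"key": "keyword_coverage", "score": score, "reason": reason}


# Hypothetical fixture data.
run = {"directive": "Send a follow-up email and schedule a demo call."}
example = {"expected_keywords": ["follow-up", "demo", "pricing"]}

first = keyword_evaluator(run, example)
second = keyword_evaluator(run, example)  # identical to `first` by construction
```

Calling it twice on the same fixture returns equal dictionaries, which is the property that lets the same evaluator run unchanged in continuous integration and in production.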

<!-- mem:begin -->

Generate it: A deterministic evaluator returns three things: a key, a s_____ between zero and one, and a reason. (cue: s_____; answer: score)

Generate it: Pure fixture-based evaluators check step directives for expected keywords without ever calling a l________ model. (cue: l________; answer: language)

Ask yourself: What makes a deterministic evaluator give the exact same score every time you call it?

Recall check (try before reading the answer):

  1. What three things does a deterministic evaluator return? Answer: A key, a score between zero and one, and a reason.

  2. How do the pure fixture-based evaluators decide pass or fail? Answer: They compare text to a list of expected keywords; they do not call a language model.

  3. What is the trade-off versus a language model judge? Answer: You trade depth for reproducibility — the pure evaluator only checks hardcoded patterns, the judge handles more but is not deterministic.

<!-- mem:end -->

The trajectory coverage evaluator is a pure deterministic function that checks which steps from an expected list appear in the actual run output, producing a score between 0 and 1 and a comment.

python
def pipeline_trajectory_coverage(run: Any, example: Any) -> dict[str, Any]:
    """Score how well the pipeline_graph execution covered the expected stages."""
    run_outputs = (
        run.outputs if hasattr(run, "outputs")
        else (run.get("outputs") or {}) if isinstance(run, dict)
        else {}
    )
    example_outputs = (
        example.outputs if hasattr(example, "outputs")
        else (example.get("outputs") or {}) if isinstance(example, dict)
        else {}
    )

    actual: list[str] = run_outputs.get("actual_trajectory") or []
    expected: list[str] = example_outputs.get("expected_trajectory") or []

    if not expected:
        return {"score": 1.0, "comment": "no expected_trajectory in example"}

    actual_set = set(actual)
    matched = sum(1 for step in expected if step in actual_set)
    score = matched / len(expected)
    missing = [s for s in expected if s not in actual_set]
    comment = (
        f"coverage={score:.2f}  matched={matched}/{len(expected)}"
        + (f"  missing={missing}" if missing else "  all steps present")
    )
    return {"score": score, "comment": comment}

03. Read Only Query Checks

The provided source material does not contain any information about database query validation, such as checks that reject write keywords or run an EXPLAIN on an in-memory database. The context focuses on DeepEval metrics for generation and extraction graphs, along with test fixtures for cold emails and outreach scenarios. Without grounding in the source, I cannot produce a narration on the requested topic. Please supply the relevant documentation or code that describes those two deterministic checks.

<!-- mem:begin -->

Generate it: The provided source material does not contain any information about database query validation, such as checks that reject write keywords or run an E_______ on an in-memory database. (cue: E_______; answer: EXPLAIN)

Generate it: The context focuses on DeepEval metrics for generation and extraction graphs, along with test fixtures for cold emails and o________ scenarios. (cue: o________; answer: outreach)

Ask yourself: Why does the source not let the narrator produce a narration on the requested topic?

Recall check (try before reading the answer):

  1. What database query validation does the provided source material not contain? Answer: Checks that reject write keywords or run an EXPLAIN on an in-memory database.

  2. What does the context focus on instead? Answer: DeepEval metrics for generation and extraction graphs, along with test fixtures for cold emails and outreach scenarios.

  3. What does the narrator request to narrate the topic faithfully? Answer: The relevant documentation or code that describes those two deterministic checks.

<!-- mem:end -->

No-op database connection prevents writes during evaluation.

python
class _NoopCursor:
    def execute(self, *_a, **_kw):
        return None
    def fetchone(self):
        return None
    def fetchall(self):
        return []
    description = []

class _NoopConn:
    def cursor(self):
        return _NoopCursor()

def _noop_connect(*_a, **_kw):
    return _NoopConn()

class _ShimPsycopg:
    connect = staticmethod(_noop_connect)

04. Typed Dataset Contracts

Versioned datasets store a complete input and expected output for each core graph. Each dataset is strict on what goes in: it fixes the retrieval context, the instruction text, and any metadata. But it is lenient on the output. Instead of requiring an exact match, it checks for signals. It looks for key phrases or concepts that must appear, and it lists words that must not appear.

This trade-off means the same dataset can drive evaluation both locally and in the hosted tracing tool, with the same accuracy bar. You test your graph offline. Then when you deploy, the hosted tool uses the exact same set of inputs and checks, so there is no drift between local and hosted results.

Each test case pairs one input with one output specification. The specification tells the evaluator what to look for, which keeps the evaluation consistent. The result is a shared standard: if the graph passes locally, it will pass in production.
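
One way to express such a contract in types is sketched below. The `inputs`, `outputs`, `must_contain_signals`, and `banned_phrases` keys match the evaluation code in this chapter; the class names, the `passes_contract` helper, and the sample data are assumptions for illustration.

```python
from typing import Any, TypedDict


class ExpectedSignals(TypedDict):
    """Lenient output spec: signals that must appear, phrases that must not."""
    must_contain_signals: list[str]
    banned_phrases: list[str]


class DatasetExample(TypedDict):
    """One test case: a fixed input paired with one output specification."""
    inputs: dict[str, Any]
    outputs: ExpectedSignals


def passes_contract(text: str, expected: ExpectedSignals) -> bool:
    """True when every required signal is present and no banned phrase is."""
    lowered = text.lower()
    has_all = all(sig.lower() in lowered for sig in expected["must_contain_signals"])
    has_banned = any(p.lower() in lowered for p in expected["banned_phrases"])
    return has_all and not has_banned


# Hypothetical example row.
example: DatasetExample = {
    "inputs": {"instruction": "Write a cold email about the hiring pipeline."},
    "outputs": {
        "must_contain_signals": ["hiring pipeline"],
        "banned_phrases": ["I hope this email finds you well"],
    },
}

good = passes_contract("Quick question about your hiring pipeline", example["outputs"])
bad = passes_contract(
    "I hope this email finds you well. About your hiring pipeline...",
    example["outputs"],
)
```

Because the check is a substring test rather than an exact match, differently worded drafts can pass as long as they carry the required signals and avoid the banned phrases.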

<!-- mem:begin -->

Generate it: Each dataset is strict on the input — fixing retrieval context, instruction text, and metadata — but l_______ on the output. (cue: l_______; answer: lenient)

Generate it: Instead of an exact match, the output check looks for key phrases that must appear and lists words that must n__ appear. (cue: n__; answer: not)

Ask yourself: Why does keeping the output check lenient (signals, not exact match) let one dataset run both locally and in the hosted tool?

Recall check (try before reading the answer):

  1. What does a dataset fix strictly on the input side? Answer: The retrieval context, the instruction text, and any metadata.

  2. How does the output specification check a result without an exact match? Answer: It looks for key phrases or concepts that must appear and lists words that must not appear.

  3. What guarantee does running the same dataset locally and hosted give you? Answer: There is no drift — if the graph passes locally, it will pass in production.

<!-- mem:end -->

Versioned dataset evaluation checks required signals and bans prohibited phrases, enabling consistent local and hosted evaluation.

python
async def run_eval_for_graph(graph: str) -> dict[str, Any]:
    examples = [_scrub_example(e) for e in DATASET_REGISTRY[graph]]
    correct = 0
    total = len(examples)
    if graph == "email_compose":
        from graphs.email_compose_graph import build_graph
        g = build_graph()
        for ex in examples:
            out = await g.ainvoke(ex["inputs"])
            subject = (out.get("subject") or out.get("draft_subject") or "").lower()
            body = (out.get("body") or out.get("draft_body") or "").lower()
            full_text = subject + " " + body
            expected = ex["outputs"]
            signals_hit = sum(
                1 for sig in expected.get("must_contain_signals", [])
                if sig.lower() in full_text
            )
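            # Lowercase each banned phrase too: full_text is already lowercased,
            # so a mixed-case phrase would otherwise never match.
            no_banned = all(
                phrase.lower() not in full_text
                for phrase in expected.get("banned_phrases", [])
            )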
            if signals_hit == len(expected.get("must_contain_signals", [])) and no_banned:
                correct += 1
    # ... return {"graph": graph, "correct": correct, "total": total, ...}

05. The Judge Model

The DeepEval judge is a language model that grades open-ended answers, something a simple string match cannot handle. The team built this judge once and reuses it across every graph test. It is a shared wrapper that keeps evaluation consistent: the judge is created once through a helper function and passed to each metric, so each test uses the same evaluator while thresholds can be adjusted per graph.

For faithfulness metrics, there is a default threshold and a stricter threshold for critical tasks like outreach or extraction. The answer relevancy metric always uses the default threshold. For example, extraction graphs use the strict faithfulness threshold to catch any hallucinated detail. Generation graphs that have a retrieval context use the default faithfulness threshold, while free generation graphs skip it entirely.

The entire test suite must clear an overall pass rate, but the exact number is not given in the source. The code files show the functions and their parameters but do not spell out the numeric values beyond the variable names. The key idea is that one judge model, built once, grades every test case, with thresholds set per metric and scenario.

<!-- mem:begin -->

Generate it: The judge is created once through a helper function and passed to each m_____, so every test uses the same evaluator. (cue: m_____; answer: metric)

Generate it: Extraction graphs use the strict faithfulness threshold to catch any h___________ detail. (cue: h___________; answer: hallucinated)

Ask yourself: Why build one shared judge model and reuse it across every graph test instead of a string match per test?

Recall check (try before reading the answer):

  1. Which metric always uses the default threshold? Answer: The answer relevancy metric always uses the default threshold.

  2. How is the judge wired into each metric? Answer: It is created once through a helper function and passed to each metric, so each test uses the same evaluator.

  3. Which faithfulness threshold do generation graphs with a retrieval context use, and what do free generation graphs do? Answer: They use the default faithfulness threshold, while free generation graphs skip it entirely.

Looking back: In "Read Only Query Checks", why did the narrator decline to write that chapter? Answer: The source material had no information about database query validation, so it could not be narrated faithfully.

<!-- mem:end -->

The shared DeepSeek judge is created once via make_judge() and reused across metrics, with thresholds adjustable per graph.

python
def make_judge() -> Any:
    """Return a DeepEval-compatible LLM (DeepSeek)."""
    return _build_deepseek_judge()

def make_generation_metrics(
    judge: Any = None,
    *,
    with_faithfulness: bool = True,
    strict: bool = False,
) -> list[Any]:
    from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

    j = judge if judge is not None else make_judge()
    faith_threshold = THRESHOLD_STRICT if strict else THRESHOLD_DEFAULT
    metrics: list[Any] = []
    if with_faithfulness:
        metrics.append(
            FaithfulnessMetric(threshold=faith_threshold, model=j, include_reason=True)
        )
    metrics.append(
        AnswerRelevancyMetric(threshold=THRESHOLD_DEFAULT, model=j, include_reason=True)
    )
    return metrics

06. Trajectory Evaluation

When a graph runs, it takes a path through a series of nodes and tools. To catch regressions, we compare that actual path against a golden expected trajectory. Test cases define allowed entity nodes and edges.

Two key metrics help with this check. Faithfulness makes sure every extracted item is backed by text in the source document, and it uses a strict threshold for extraction graphs. Contextual precision checks that the highest ranked nodes are actually relevant to the query. Both metrics need a retrieval context filled with source document chunks. Together, they enforce that each step in the path is correct.

A final answer check alone would only look at the overall output, so it might miss an intermediate step going wrong. An ordered path check catches each misstep. For example, if the graph outputs a node not in the allowed set, the faithfulness metric flags it. The source also provides a fail-open for empty subgraphs, which ensures zero recommendations when no path exists. That combination of metrics and test fixtures is why an ordered path check catches regressions that a final answer check alone would miss.

<!-- mem:begin -->

Generate it: To catch regressions, we compare the actual path against a golden expected t__________. (cue: t__________; answer: trajectory)

Generate it: Contextual precision checks that the highest ranked n____ are actually relevant to the query. (cue: n____; answer: nodes)

Ask yourself: Why does an ordered path check catch regressions that a final-answer check alone would miss?

Recall check (try before reading the answer):

  1. What do test cases define for trajectory evaluation? Answer: The allowed entity nodes and edges.

  2. What do the two metrics both require to run? Answer: A retrieval context filled with source document chunks.

  3. What does the fail-open for empty subgraphs ensure? Answer: Zero recommendations when no path exists.

<!-- mem:end -->

The trajectory coverage evaluator scores the fraction of expected nodes present in the actual trajectory.

python
def pipeline_trajectory_coverage(run: Any, example: Any) -> dict[str, Any]:
    """Score how well the pipeline_graph execution covered expected stages."""
    run_outputs = (
        run.outputs if hasattr(run, "outputs")
        else (run.get("outputs") or {}) if isinstance(run, dict)
        else {}
    )
    example_outputs = (
        example.outputs if hasattr(example, "outputs")
        else (example.get("outputs") or {}) if isinstance(example, dict)
        else {}
    )
    actual: list[str] = run_outputs.get("actual_trajectory") or []
    expected: list[str] = example_outputs.get("expected_trajectory") or []
    if not expected:
        return {"score": 1.0, "comment": "no expected_trajectory in example"}
    actual_set = set(actual)
    matched = sum(1 for step in expected if step in actual_set)
    score = matched / len(expected)
    missing = [s for s in expected if s not in actual_set]
    comment = (
        f"coverage={score:.2f}  matched={matched}/{len(expected)}"
        + (f"  missing={missing}" if missing else "  all steps present")
    )
    return {"score": score, "comment": comment}
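
The evaluator above scores set membership, so a run that visits every expected stage in the wrong order still scores 1.0. An order-aware variant could be sketched as follows. This is a hypothetical illustration rather than code from the repository: `ordered_trajectory_score` and the stage names are assumptions, and the matching is a simple greedy in-order scan, not an optimal alignment.

```python
def ordered_trajectory_score(actual: list[str], expected: list[str]) -> float:
    """Fraction of expected steps found in `actual` in the expected order.

    Greedy scan: each expected step must appear after the position where the
    previous matched step was found. Steps that are missing or out of order
    are skipped and do not count.
    """
    if not expected:
        return 1.0
    matched = 0
    position = 0
    for step in expected:
        try:
            # Search only the part of the path after the last match.
            position = actual.index(step, position) + 1
        except ValueError:
            continue
        matched += 1
    return matched / len(expected)


# Hypothetical stage names.
expected_path = ["discover", "enrich", "score"]
in_order = ordered_trajectory_score(["discover", "enrich", "score"], expected_path)
swapped = ordered_trajectory_score(["discover", "score", "enrich"], expected_path)
```

With the swapped path, set coverage would report every step present, while the ordered score drops below 1.0 because `score` ran before `enrich`.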

07. The Coverage Gate

There is a coverage map for graphs. It shows which graphs have an evaluation and which do not. The map comes from the source code: it derives from which graphs have their own evaluation code. A graph that calls a model but has no evaluation is marked uncovered, so new graphs that call a model show up as missing until someone writes an evaluation for them. Once someone writes a test for a graph, the map adds it. The map is self-maintaining: it updates itself as new evaluations appear.

For generation graphs, the standard metrics are faithfulness and answer relevancy. Faithfulness checks that each claim is grounded in the source text. Answer relevancy checks that the output answers the prompt. For extraction graphs, the metrics are strict faithfulness and contextual precision. Strict faithfulness stops any made-up extractions. Contextual precision checks that the most relevant nodes are ranked highest.

The gate fails when a covered graph drops below the threshold, which is a set number. If the faithfulness score goes under it, the gate blocks the graph. Any regression below the bar causes a failure. This keeps the system honest: only graphs that meet the bar pass, and every claim must be backed by the source.

<!-- mem:begin -->

Generate it: For extraction graphs, the metrics are strict faithfulness and contextual p_________. (cue: p_________; answer: precision)

Generate it: The coverage map is self maintaining — it u______ itself as new evaluations appear. (cue: u______; answer: updates)

Ask yourself: How does a graph become marked uncovered, and what removes that mark?

Recall check (try before reading the answer):

  1. Where does the coverage map get its information? Answer: From the source code — specifically which graphs have their own evaluation code.

  2. What condition makes the gate fail? Answer: When a covered graph's faithfulness score drops below the set threshold, the gate blocks the graph.

  3. What kind of graph is marked uncovered? Answer: A graph that calls a model but has no evaluation.

Looking back: In "The Judge Model", how was the judge shared across graph tests? Answer: It was created once through a helper function and passed to each metric, so every test uses the same evaluator.

<!-- mem:end -->

The coverage gate checks if a graph that calls a model has an evaluation; the helper function detects LLM usage in graph source code.

python

import re

# Excerpt: PKG_DIR, _LLM_PATTERNS and _strip_comments_and_strings are not shown here.
_GRAPH_IMPORT_RE = re.compile(
    r"\bgraphs(?:\.|\s+import\s+)([a-z0-9_]+_graph)\b"
)

def graph_uses_llm(module: str) -> bool:
    """True if the graph source has any LLM call site (see ``_LLM_PATTERNS``)."""
    src = (PKG_DIR / "graphs" / f"{module}.py").read_text(encoding="utf-8")
    code = _strip_comments_and_strings(src)
    return any(p.search(code) for p in _LLM_PATTERNS)
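
Given a detector like the one above, the coverage map and the gate can be sketched as plain set operations. Everything here is an assumption for illustration: `coverage_map`, `gate_failures`, the graph names other than `email_compose`, the scores, and the bar value.

```python
def coverage_map(llm_graphs: set[str], evaluated_graphs: set[str]) -> dict[str, list[str]]:
    """Split model-calling graphs into covered and uncovered."""
    return {
        "covered": sorted(llm_graphs & evaluated_graphs),
        "uncovered": sorted(llm_graphs - evaluated_graphs),
    }


def gate_failures(scores: dict[str, float], covered: list[str], bar: float) -> list[str]:
    """Covered graphs whose score is below the bar (a missing score counts as 0.0)."""
    return sorted(g for g in covered if scores.get(g, 0.0) < bar)


# Hypothetical inputs: graphs found to call a model, and graphs with eval code.
llm_graphs = {"email_compose", "reply_classify", "lead_extract"}
evaluated = {"email_compose", "lead_extract"}

cmap = coverage_map(llm_graphs, evaluated)

# Hypothetical faithfulness scores and bar.
failing = gate_failures({"email_compose": 0.91, "lead_extract": 0.72}, cmap["covered"], bar=0.80)
```

Because both sets are derived from the source tree, adding evaluation code for a graph moves it from `uncovered` to `covered` with no manual bookkeeping.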

08. The Feedback Gate

This is the feedback gate. It runs every feedback dimension and prints a one-screen digest. It rolls up over-confident calibration bands from email and application feedback, and it checks for reply classifier classes that are chronically low confidence. When production contradicts a classifier, the gate exits with a non-zero code. That turns the feedback signal into something continuous integration can act on.

The gate adds no new analysis. It reuses the exact reports and thresholds from the other scripts, including the same minimum sample sizes and gap thresholds, so it never disagrees with them. What it adds is enforcement: it turns a report into a hard check, exiting clean or flagged.

Exit codes matter. Zero means clean. One means at least one flag. Two means a source errored, like a database being unreachable. That way you can distinguish a real miscalibration from a temporary outage. The gate is designed to be run in cron or continuous integration pipelines.

Each flagged issue describes one miscalibration. For email and application, those are bands with a high mean confidence but low realized positive rate. For replies, they are classes where most predictions land below a confidence threshold. This makes the feedback loop enforceable, not just a report.

<!-- mem:begin -->

Generate it: When production contradicts a classifier, the gate exits with a n__-zero code. (cue: n__; answer: non)

Generate it: Exit code two means a source e______, like a database being unreachable. (cue: e______; answer: errored)

Ask yourself: Why distinguish exit code one (a flag) from exit code two (a source errored)?

Recall check (try before reading the answer):

  1. What new analysis does the feedback gate add? Answer: None — it reuses the exact reports and thresholds from the other scripts, so it never disagrees with them.

  2. What does exit code zero versus one mean? Answer: Zero means clean; one means at least one flag.

  3. For replies, what counts as a flagged miscalibration? Answer: Classes where most predictions land below a confidence threshold.

<!-- mem:end -->

The feedback gate runs every dimension and exits non-zero on miscalibration.

python
"""Enforceable feedback gate: run every feedback dimension, print a one-screen
digest, and exit non-zero when production contradicts a classifier.

Where the other feedback_*.py scripts *report*, this *gates*. It rolls up:

  * over-confident calibration bands (feedback_email + feedback_applications) —
    high mean confidence, low realised positive-rate,
  * reply-classifier classes that are chronically low-confidence
    (feedback_replies),

and exits 1 if anything is flagged (unless ``--warn-only``). That makes the
feedback signal something cron / CI can act on, in the same spirit as the
repo's eval ≥0.80 gate and ``pnpm strategy:check`` — see OPTIMIZATION-STRATEGY.md
(Observability pillar). It adds no new analysis: it reuses the exact reports,
dimensions, and thresholds the individual scripts already produce, so the gate
can never disagree with them.

Exit codes: 0 = clean (or --warn-only), 1 = at least one flag, 2 = a source
errored (e.g. D1 unreachable) — distinguishable from a real miscalibration.

Usage (from backend/):
    python scripts/feedback_gate.py [--limit N] [--min-n N] [--gap-threshold F]
                                    [--low-threshold F] [--warn-only] [--json]
"""
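
The exit-code contract in the docstring above can be sketched in a few lines. This is an illustration under stated assumptions, not the script's actual logic: the `Band` class, the definition of the gap as mean confidence minus realised positive rate, the rule that a source error takes precedence over flags and over `--warn-only`, and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """One calibration band from a feedback report (hypothetical shape)."""
    name: str
    mean_confidence: float
    positive_rate: float
    n: int


def is_overconfident(band: Band, min_n: int, gap_threshold: float) -> bool:
    """Flag a band with enough samples whose confidence outruns reality."""
    if band.n < min_n:
        return False  # too few samples to judge
    return (band.mean_confidence - band.positive_rate) >= gap_threshold


def exit_code(flags: list[str], source_errors: list[str], warn_only: bool = False) -> int:
    """0 = clean (or warn-only), 1 = at least one flag, 2 = a source errored."""
    if source_errors:
        return 2  # assumption: an unreachable source outranks any flag
    if flags and not warn_only:
        return 1
    return 0


bands = [
    Band("0.8-1.0", 0.90, 0.40, 50),  # confident but rarely right
    Band("0.6-0.8", 0.70, 0.65, 50),  # well calibrated
    Band("0.4-0.6", 0.50, 0.10, 3),   # too few samples
]
flags = [b.name for b in bands if is_overconfident(b, min_n=20, gap_threshold=0.2)]
```

Keeping code 2 separate from code 1 means a scheduler can retry on an outage without treating it as a calibration regression.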

09. Cost As A Metric

This system treats cost as a first-class metric. Every graph run records its token usage and cost into a dedicated database table. This logging is best effort, meaning a database failure never stops the graph from running.

A kill switch can halt all large language model calls instantly. When the kill switch is active, no cost log is written because no calls were made. A daily spend budget alert reads today’s cost from the log file. If the total exceeds a set threshold, it fires a webhook notification with only numeric data.

Each workflow also has its own token budget for production monitoring. When a run goes over its budget, an observability event is emitted, but the run is never aborted. This lets teams monitor spending without breaking the application.

So the trade-off is between comprehensive cost data and system reliability. Best-effort logging ensures that the graph always continues, the kill switch provides a safety net, and per-workflow budgets alert without interrupting work. Recording every token alongside its run allows detailed cost analysis per workflow.
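
The two non-blocking behaviours described here, best-effort logging and budget alerts that never abort, can be sketched as below. The function names, the event shape, and the sample numbers are assumptions for illustration; the real table writer and observability emitter are not shown in this chapter.

```python
from typing import Any, Callable


def record_cost_best_effort(write: Callable[[dict[str, Any]], None], row: dict[str, Any]) -> bool:
    """Try to persist a cost row; swallow failures so the graph run continues."""
    try:
        write(row)
    except Exception:
        return False  # cost data is lost for this run, but the run is not
    return True


def check_token_budget(
    workflow: str,
    tokens_used: int,
    budget: int,
    emit: Callable[[dict[str, Any]], None],
) -> bool:
    """Emit an observability event when over budget; never raise or abort."""
    over = tokens_used > budget
    if over:
        emit({
            "event": "token_budget_exceeded",
            "workflow": workflow,
            "tokens_used": tokens_used,
            "budget": budget,
        })
    return over


def broken_writer(_row: dict[str, Any]) -> None:
    """Stand-in for a database that is down."""
    raise ConnectionError("database unreachable")


events: list[dict[str, Any]] = []
logged = record_cost_best_effort(broken_writer, {"graph": "email_compose", "cost_usd": 0.0123})
over_budget = check_token_budget("email_compose", tokens_used=12_000, budget=10_000, emit=events.append)
```

In both cases the caller gets a boolean it can log or ignore, and control always returns to the graph run.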

<!-- mem:begin -->

Generate it: Cost logging is best effort, meaning a database f______ never stops the graph from running. (cue: f______; answer: failure)

Generate it: A k___ switch can halt all large language model calls instantly. (cue: k___; answer: kill)

Ask yourself: Why is no cost log written while the kill switch is active?

Recall check (try before reading the answer):

  1. Where does every graph run record its token usage and cost? Answer: Into a dedicated database table.

  2. What happens when a run goes over its per-workflow token budget? Answer: An observability event is emitted, but the run is never aborted.

  3. What does the daily spend budget alert do when the total exceeds the threshold? Answer: It fires a webhook notification with only numeric data.

Looking back: In "The Coverage Gate", what made the coverage map self-maintaining? Answer: It updates itself from the source code as new evaluations appear.

<!-- mem:end -->

The _emit_cost_log function records cost, token, and call totals for each graph run as a structured INFO log line.

python
def _emit_cost_log(
    graph: str,
    model: str | None,
    totals: dict[str, Any] | None,
    feature: str | None,
    vertical: str | None = None,
    error: Any = None,
) -> None:
    if not _cost_log_enabled():
        return
    totals = totals or {}
    calls = int(totals.get("total_llm_calls", 0) or 0)
    cost = float(totals.get("total_cost_usd", 0.0) or 0.0)
    if calls == 0 and cost == 0.0 and not error:
        return
    retries = int(totals.get("total_retries", 0) or 0)
    vert = vertical if vertical is not None else ""
    log.info(
        "cost graph=%s feature=%s vertical=%s status=%s cost_usd=%.6f tokens=%d calls=%d%s model=%s%s",
        graph,
        feature or feature_for_graph(graph),
        vert,
        "error" if error else "ok",
        cost,
        int(totals.get("total_tokens", 0) or 0),
        calls,
        f" retries={retries}" if retries else "",
        model or "unknown",
        f" error={str(error)[:_ERROR_MAX]}" if error else "",
    )

10. Why Evaluate First

The team’s philosophy is simple: never ship a change without first proving it passes the bar. Every new prompt or model update goes through a set of evaluation metrics. One metric checks that every factual claim in the output comes from the provided source. Another verifies the output actually answers the input. The metrics also include a precision test that measures whether the top-ranked pieces are truly relevant.

When a single claim cannot be grounded, the system raises a red flag but does not block the whole pipeline. Each component fails on its own error without stopping everything else. This design keeps the workflow moving even when a check fails.

For high-risk areas like extraction, the faithfulness check uses a stricter threshold, so that hallucinated details do not slip through. Catching a regression in a critical segment matters more than keeping the average score high, because a small slip in a sensitive domain can lead to bigger problems. So the team prioritizes exactness where it counts.

The combination gives both safety and flexibility. Every change must clear these gates. If a check trips, the team knows exactly where to look. The system never hides a weakness behind a passing average. That is how they keep production robust without slowing down innovation.

<!-- mem:begin -->

Generate it: When a single claim cannot be grounded, the system raises a red f___ but does not block the whole pipeline. (cue: f___; answer: flag)

Generate it: For high-risk areas like extraction, the faithfulness check uses a stricter t_________. (cue: t_________; answer: threshold)

Ask yourself: Why does the team value catching a regression in a critical segment over keeping the average score high?

Recall check (try before reading the answer):

  1. What does the team refuse to do before a change proves it passes the bar? Answer: Ship the change — never ship a change without first proving it passes the bar.

  2. When a claim cannot be grounded, what happens to the rest of the pipeline? Answer: The system raises a red flag but does not block the pipeline; each component fails on its own error.

  3. What does the precision test measure? Answer: Whether the top ranked pieces are truly relevant.

<!-- mem:end -->

Extraction metrics enforce strict faithfulness and contextual precision for hallucination-critical segments.

python
def make_extraction_metrics(judge: Any = None) -> list[Any]:
    from deepeval.metrics import ContextualPrecisionMetric, FaithfulnessMetric

    j = judge if judge is not None else make_judge()
    return [
        FaithfulnessMetric(threshold=THRESHOLD_STRICT, model=j, include_reason=True),
        ContextualPrecisionMetric(threshold=THRESHOLD_DEFAULT, model=j, include_reason=True),
    ]