01. What Evaluation Is
To evaluate a fleet of language model graphs, you need more than a single accuracy number. Each graph is a multi-stage pipeline with independent failure points. A single score hides where things break.
Take a generation graph. It needs two metrics. One checks that every factual claim in the output is grounded in the provided context. Another verifies the output actually answers the input prompt. A high accuracy score could still hide a hallucinated claim that sounds relevant but is not true.
Now consider an extraction graph. It also requires two metrics. A stricter faithfulness check ensures no hallucinated extractions appear. A contextual precision metric measures whether the top-ranked retrieved nodes are actually relevant to the query. A single accuracy number would not reveal whether a failure came from a hallucinated extraction or from irrelevant nodes ranked at the top.
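To make "top-ranked nodes are actually relevant" concrete, here is a minimal sketch of a rank-weighted precision score in the style of average precision. It is an illustration only: the text does not state the exact formula the contextual precision metric uses, the function name is hypothetical, and in practice the relevance labels would come from a judge rather than being hard-coded.

```python
def rank_weighted_precision(relevance):
    """Average-precision-style score for a ranked list of retrieved nodes.

    `relevance` is a list of booleans ordered from rank 1 downward,
    where True means the node was judged relevant to the query.
    (Hypothetical helper, not part of any library.)
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision measured at this rank
    return total / hits if hits else 0.0


# Same two relevant nodes, different ranking order.
relevant_first = rank_weighted_precision([True, True, False])
relevant_last = rank_weighted_precision([False, True, True])
```

Both lists contain the same number of relevant nodes, yet the second scores lower because an irrelevant node sits at rank 1. That ordering sensitivity is what a plain accuracy count would miss.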
Each stage has its own failure mode: faithfulness failures, relevance failures, precision failures. You cannot diagnose them with one number. By splitting evaluation into separate metrics, you pinpoint exactly where the pipeline goes wrong. That is the trade-off: more work setting up metrics, but much clearer signal. Without that, you are flying blind.
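The point about one number hiding the failing stage can be sketched with a few made-up scores. Everything here is hypothetical: the score values, the 0.70 thresholds, and the helper name are chosen only for illustration and do not come from a real run.

```python
from statistics import mean

# Hypothetical per-metric scores for one pipeline run (illustrative values).
scores = {
    "faithfulness": 0.40,
    "answer_relevancy": 0.95,
    "contextual_precision": 0.90,
}

# Hypothetical pass thresholds, one per metric.
thresholds = {
    "faithfulness": 0.70,
    "answer_relevancy": 0.70,
    "contextual_precision": 0.70,
}


def failing_metrics(scores, thresholds):
    """Return the names of metrics whose score falls below their threshold."""
    return sorted(name for name, score in scores.items() if score < thresholds[name])


# The averaged score lands around 0.75 and would pass a 0.70 bar on its own.
overall = mean(scores.values())

# The per-metric view shows that faithfulness is the stage that broke.
failed = failing_metrics(scores, thresholds)
```

The averaged score clears the bar while the per-metric check names the failing stage, which is the clearer signal the extra setup work buys.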
<!-- mem:begin -->Generate it: A generation graph needs two metrics: one checks every claim is grounded in context, the other checks the output answers the i____ prompt. (cue: i____; answer: input)
Generate it: An extraction graph uses a stricter faithfulness check plus a contextual p________ metric for whether top-ranked nodes are relevant. (cue: p________; answer: precision)
Ask yourself: Why does a single accuracy number leave you flying blind across a multi-stage pipeline?
<!-- mem:end -->Recall check (try before reading the answer):
What are the two metrics a generation graph needs? Answer: One checks every factual claim is grounded in the provided context; the other verifies the output answers the input prompt.
Why can a high accuracy score still be misleading? Answer: A high accuracy score can still hide a hallucinated claim that sounds relevant but is not true.
What do you gain by splitting evaluation into separate metrics, and what does it cost? Answer: You pinpoint exactly where the pipeline goes wrong; the trade-off is more work setting up metrics.
Evaluation uses separate metrics for generation and extraction graphs to pinpoint failure modes.
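def make_generation_metrics(judge=None, *, with_faithfulness=True, strict=False):
    # Imported inside the function, so deepeval is only loaded when metrics are built.
    from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

    # make_judge, THRESHOLD_STRICT and THRESHOLD_DEFAULT are expected to be defined elsewhere (not shown here).
    j = judge if judge is not None else make_judge()
    faith_threshold = THRESHOLD_STRICT if strict else THRESHOLD_DEFAULT
    metrics = []
    if with_faithfulness:
        # Grounding check: every factual claim must be supported by the provided context.
        metrics.append(FaithfulnessMetric(threshold=faith_threshold, model=j, include_reason=True))
    # Relevance check: the output must actually answer the input prompt.
    metrics.append(AnswerRelevancyMetric(threshold=THRESHOLD_DEFAULT, model=j, include_reason=True))
    return metrics


def make_extraction_metrics(judge=None):
    from deepeval.metrics import ContextualPrecisionMetric, FaithfulnessMetric

    j = judge if judge is not None else make_judge()
    return [
        # Stricter faithfulness threshold: no hallucinated extractions.
        FaithfulnessMetric(threshold=THRESHOLD_STRICT, model=j, include_reason=True),
        # Precision check: top-ranked retrieved nodes should be relevant to the query.
        ContextualPrecisionMetric(threshold=THRESHOLD_DEFAULT, model=j, include_reason=True),
    ]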
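A possible way to consume the lists these factories return is sketched below. It assumes each metric object exposes `measure(test_case)`, a `score`, a `reason`, and `is_successful()`, which matches the DeepEval metric interface as far as I know. `run_metrics` and `StubMetric` are hypothetical names introduced here; the stub only exists so the sketch runs without a judge model.

```python
def run_metrics(metrics, test_case):
    """Measure each metric on one test case and collect per-metric results.

    Hypothetical helper: relies only on measure(), score, reason and
    is_successful() being available on each metric object.
    """
    results = []
    for metric in metrics:
        metric.measure(test_case)
        results.append({
            "metric": type(metric).__name__,
            "score": metric.score,
            "passed": metric.is_successful(),
            "reason": getattr(metric, "reason", None),
        })
    return results


class StubMetric:
    """Hypothetical stand-in for a real metric; returns a fixed score."""

    def __init__(self, fixed_score, threshold):
        self.fixed_score = fixed_score
        self.threshold = threshold
        self.score = None
        self.reason = None

    def measure(self, test_case):
        self.score = self.fixed_score
        self.reason = f"stub score for {test_case!r}"
        return self.score

    def is_successful(self):
        return self.score is not None and self.score >= self.threshold


# With real metrics, test_case would be a DeepEval test case object instead of a string.
report = run_metrics([StubMetric(0.4, 0.7), StubMetric(0.9, 0.7)], test_case="example case")
failed = [row for row in report if not row["passed"]]
```

Keeping one result row per metric preserves the separation the section argues for: a failing row points at a specific check instead of being folded into a single number.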