Offline evaluation tells you whether a system was good against a fixed dataset on the day you ran it. Production tells you whether it is good right now, against real traffic, real users, and a real model endpoint that can change underneath you. These are different questions. A prompt that scored 0.82 on your golden set can quietly degrade to 0.6 in production because users ask things your dataset never anticipated, a provider silently ships a new model snapshot, or a retrieval index drifts. Online evaluation is the discipline of measuring quality continuously on live traffic, detecting regressions before users churn, and feeding what you learn back into the offline suite. This article covers the architecture of a production eval pipeline, sampling and cost control, reference-free metrics, drift and regression detection, online experimentation, and how to close the loop, grounded in the LangGraph backend this roadmap ships with.
A mature team runs offline evals in CI: a curated golden set, an LLM judge, a pass-rate gate (this is exactly what the /eval-fundamentals and /benchmark-design articles describe, and what the roadmap's own backend/tests/deepeval/ suite implements). That catches regressions you can anticipate. It cannot catch:
- Silent model drift: gpt-4o, claude-sonnet, and deepseek-chat are moving targets. Providers re-quantize, re-tune, and re-route without changing the model string. A prompt tuned against last month's behavior is an untested prompt today.
- Query distributions your golden set never anticipated, arriving from real users.
- Adversarial inputs (/adversarial-prompting) that show up in production long before they show up in your test fixtures.

The mental model: offline eval gates deploys; online eval gates trust. You need both. Online eval is what turns "we shipped it" into "we know it still works."
The defining constraint of online eval is that scoring must not be on the critical path of the user request. Users wait for answers, not for a judge model to deliberate. The standard architecture decouples serving from scoring:
```
User → App → LLM/Graph → Response ─────────────────────────▶ User
                             │
                             └─▶ emit trace event (async, fire-and-forget)
                                    │
                                    ▼
                             sampling decision
                                    │
                            (sampled subset)
                                    ▼
                     async scoring workers ──▶ metrics store
                             │                      │
                             ▼                      ▼
                       feedback join        dashboards / alerts
```
Every production inference emits a structured trace (the same trace the /observability article describes: prompt, context, output, model version, latency, cost, request metadata). A sampler decides which traces get scored. Sampled traces go to an out-of-band worker pool that runs metrics (heuristic checks inline, LLM-judge checks in a queue) and writes results to a metrics store keyed by trace ID, model version, and prompt version. User feedback, which arrives seconds to hours later, is joined back on the trace ID.
A minimal trace schema sufficient for online eval:
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any


@dataclass
class EvalTrace:
    trace_id: str
    ts: datetime
    graph: str                     # "chat", "course_review", ...
    prompt_version: str            # git sha or content hash of the prompt
    model: str                     # "deepseek-chat"
    model_fingerprint: str | None  # provider system_fingerprint if exposed
    inputs: dict[str, Any]         # user query, retrieved context refs
    output: str
    latency_ms: int
    cost_usd: float
    metadata: dict[str, Any] = field(default_factory=dict)
    # filled in later, out of band:
    scores: dict[str, float] = field(default_factory=dict)
    feedback: dict[str, Any] = field(default_factory=dict)
```
The prompt_version and model_fingerprint fields are not optional bookkeeping: they are the join keys that let you attribute a quality drop to a specific prompt edit or a silent provider change. Without them, online eval produces a number that trends but cannot be acted on.
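One way to derive the prompt_version join key, assuming a simple content-hash scheme (the helper name is mine):

```python
import hashlib


def prompt_version(prompt_text: str) -> str:
    """Stable 12-char content hash: identical prompt text maps to the same
    version id, and any edit produces a new one."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]
```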
Scoring every request with an LLM judge would roughly double your inference bill and add a second model's worth of failure surface. The /rag-evaluation article quotes a realistic figure of ~$30 per 1,000 LLM-judged items; at 100k requests/day that is $3,000/day of judging to shadow your serving traffic. You sample. The question is how, because uniform sampling wastes budget on the easy middle of the distribution.
Uniform random sampling is the baseline. Sample rate r; every request scored with probability r. Unbiased, trivially correct for population-level metrics, and the right default for the headline "is overall quality stable" question. Pick r so the daily judged volume fits the budget, and check it gives you enough samples per segment to detect the effect size you care about (see the power calculation below).
Stratified sampling allocates budget across segments (language, feature, customer tier, query length bucket) so a low-volume but high-stakes segment (enterprise, a new feature) isn't drowned out by the high-volume head. Score each stratum at a rate inversely related to its volume, then weight when aggregating to recover an unbiased population estimate.
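The reweighting step can be sketched directly; assuming per-stratum traffic volumes and judged means are available, a traffic-weighted sum recovers the population estimate:

```python
def population_estimate(strata: dict[str, tuple[int, float]]) -> float:
    """strata: name -> (traffic_volume, mean judged score for that stratum).
    Weight each stratum by its share of *traffic*, not by how many items
    happened to be judged, recovering an unbiased population estimate."""
    total = sum(volume for volume, _ in strata.values())
    return sum(volume / total * score for volume, score in strata.values())
```

With a 90/10 traffic split scored at different rates, the estimate follows traffic shares (0.9 · 0.90 + 0.1 · 0.60 = 0.87), not judged-item counts.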
Targeted / importance sampling spends budget where failures are likely. Score at an elevated rate when cheap signals predict trouble:
```python
import random


def sample_rate_for(trace: EvalTrace) -> float:
    """Heuristic, cheap, runs inline. Returns P(score this trace)."""
    base = 0.02  # 2% uniform floor
    if trace.feedback.get("thumb") == "down":
        return 1.0   # always score complaints
    if trace.metadata.get("regenerated"):
        return 0.5   # user asked again; suspect
    if trace.metadata.get("guardrail_flag"):
        return 1.0
    if trace.latency_ms > 20_000:
        return 0.3   # timeouts correlate with junk
    if trace.metadata.get("prompt_version_is_new"):
        return 0.25  # watch fresh deploys closely
    return base


def should_score(trace: EvalTrace) -> bool:
    return random.random() < sample_rate_for(trace)
```
Targeted sampling introduces bias by construction (you are over-representing suspected failures), so keep the uniform stratum as a separate, unbiased population estimate and report targeted-stratum metrics separately. Mixing them produces a quality number that looks alarming for no reason. A useful framing: the uniform stratum answers "how are we doing," the targeted stratum answers "what is going wrong."
Sample size sanity check. To detect a quality drop of size d (in proportion terms) with the usual 80% power at 5% significance you need roughly n ≈ 16 · p(1 − p) / d² scored items per comparison window. To catch a 5-point drop from a 0.85 baseline (p = 0.85, d = 0.05) that's ~816 judged items per window. If your sample rate and traffic don't produce that many in the window you want to alert on, you cannot detect that regression no matter how good the judge is; widen the window or raise the rate.
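The rule of thumb is worth encoding next to the sampler config; a one-line sketch (the helper name is mine):

```python
def required_n(p: float, d: float) -> int:
    """Scored items needed per window to detect a drop of size d from
    baseline pass rate p, at ~80% power and 5% significance:
    n ~= 16 * p * (1 - p) / d**2."""
    return int(round(16 * p * (1 - p) / d ** 2))
```

required_n(0.85, 0.05) gives 816, matching the worked example above.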
Offline evals lean on golden answers. Production has no ground truth; nobody wrote the reference answer for a query that arrived 40ms ago. Online metrics must be reference-free. The workhorses:
Cheap deterministic checks (run inline, on 100% of traffic; they're nearly free):
- Structured-output validity: the course_review graph emits a fixed score shape; a sudden rise in parse-fallback rate is a louder, faster signal than any judge.
- Safety and policy flags (/guardrails-filtering). Cheap classifiers, run on everything.

LLM-as-judge checks (run on the sampled subset):
Reference-free judging asks the judge to assess properties intrinsic to the (input, output) pair rather than agreement with a reference; the /llm-as-judge article covers calibration and bias in depth.
The roadmap already has the judge infrastructure for this. backend/tests/deepeval/conftest.py wraps make_llm() in a DeepEvalBaseLLM so the judge honors the same LLM_BASE_URL / DEEPSEEK_API_KEY as the graphs. The online scorer is the same judge, invoked from a worker instead of a pytest fixture:
```python
# online_scorer.py - runs in the async worker pool, NOT in the request path
import logging

from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from knowledge_agent.llm import make_llm

log = logging.getLogger(__name__)

_judge = make_llm(temperature=0.0)  # reuse the graphs' LLM factory


async def score_trace(trace: EvalTrace) -> dict[str, float]:
    tc = LLMTestCase(
        input=trace.inputs["query"],
        actual_output=trace.output,
        retrieval_context=trace.inputs.get("context_chunks", []),
    )
    scores: dict[str, float] = {}
    for metric in (FaithfulnessMetric(model=_judge),
                   AnswerRelevancyMetric(model=_judge)):
        try:
            metric.measure(tc)
            scores[metric.__class__.__name__] = metric.score
        except Exception as e:  # judge JSON flake
            scores[metric.__class__.__name__] = float("nan")
            log.warning("judge failed trace=%s: %r", trace.trace_id, e)
    return scores
```
This is the key architectural insight for this codebase: online eval is not a new system, it is the offline harness invoked from a sampler instead of CI. Same judge, same metrics, same run_metric-style defensive error handling; different trigger and a metrics store instead of an assertion.
Tiered judging for cost. Don't use your most expensive judge on every sampled item. Run a cheap small-model judge on the full sample; escalate only the bottom-quartile and disagreement cases to a stronger judge. This is the same tiering logic the /cost-optimization article applies to serving, applied to evaluation. Two-tier judging typically cuts judge cost 60โ80% while preserving detection of real regressions, because the cheap judge is reliable at separating "clearly fine" from "needs a closer look."
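A sketch of the escalation logic, with cheap_judge and strong_judge as hypothetical async scoring callables returning a 0-1 score (the threshold is an assumption to tune):

```python
async def tiered_score(test_case, cheap_judge, strong_judge,
                       escalate_below: float = 0.7) -> dict:
    """Everything in the sample gets the cheap tier; only low or borderline
    scores pay for the strong judge."""
    score = await cheap_judge(test_case)
    if score >= escalate_below:
        return {"score": score, "tier": 1}
    return {"score": await strong_judge(test_case), "tier": 2}
```

Disagreement-based escalation (cheap judge run twice, escalate on a split verdict) follows the same shape.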
Production has something offline never will: real users reacting to real outputs. Feedback is the closest thing to ground truth you get at scale, but it is sparse, biased, and noisy. Treat it as a signal to be calibrated, not a label to be trusted blindly.
Explicit feedback: thumbs, star ratings, "report" buttons. High precision, terrible recall: well under 1% of users click anything, and those who do skew negative and non-representative. Use explicit negatives as a high-priority sampling trigger (always score a thumbs-down) and as seeds for failure analysis, not as an unbiased quality estimate. A rising thumbs-down rate is meaningful; the absolute rate is not.
Implicit feedback is far denser and often more honest:

- Regeneration and retries: a user immediately asking the same thing again is a vote against the first answer (the sampler above already treats it as suspect).
- The article_generate and memorize_generate graphs have a natural implicit signal: was the generated lesson/card kept and reviewed, or discarded?

Join feedback to traces on trace_id and treat the labeled subset as a calibration set for your automated judge: periodically check that judge scores actually predict user-observed outcomes. If "high faithfulness" outputs get regenerated as often as low ones, your judge is measuring the wrong thing; calibrate it (see /llm-as-judge) or replace the metric. An online judge that has never been checked against a behavioral signal is decoration.
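One concrete calibration check, assuming judged traces have been joined to a regeneration signal (function name and row schema are illustrative):

```python
from statistics import mean


def judge_behavior_gap(rows: list[tuple[float, bool]]) -> float:
    """rows: (judge_score, user_regenerated). A well-calibrated judge scores
    kept answers visibly higher than regenerated ones; a gap near zero means
    the metric is not measuring anything users notice."""
    kept = [s for s, regenerated in rows if not regenerated]
    regen = [s for s, regenerated in rows if regenerated]
    if not kept or not regen:
        return float("nan")
    return mean(kept) - mean(regen)
```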
A raw quality time series is not an alert. Production quality is noisy: it wanders with traffic mix, time of day, and judge variance. The job is separating signal (a real regression) from noise (normal wobble). Three layers, cheapest first.
1. Versioned change-point detection. The highest-value, lowest-effort check. Because every trace carries prompt_version and model_fingerprint, you can compute quality per version and compare deploys directly instead of staring at a continuous line:
```python
from statistics import mean

from scipy import stats  # available in the backend toolchain


def regression_check(baseline: list[float], candidate: list[float],
                     min_n: int = 200, max_drop: float = 0.03):
    """Compare a new prompt/model version's scored window to the prior
    version's. Welch's t-test: unequal variance, unequal n."""
    if len(candidate) < min_n:
        return {"status": "insufficient_data", "n": len(candidate)}
    drop = mean(baseline) - mean(candidate)
    t, p = stats.ttest_ind(baseline, candidate, equal_var=False)
    regressed = drop > max_drop and p < 0.05
    return {
        "status": "regression" if regressed else "ok",
        "baseline_mean": round(mean(baseline), 4),
        "candidate_mean": round(mean(candidate), 4),
        "drop": round(drop, 4),
        "p_value": round(p, 4),
    }
```
A silent provider model swap shows up here as a regression with no corresponding prompt_version change. That signature ("quality fell, we didn't deploy anything") is precisely the failure offline CI structurally cannot catch, and the single strongest argument for running online eval at all.
2. Sequential monitoring (CUSUM). For continuous monitoring without a deploy boundary, a CUSUM chart accumulates small deviations and fires when the running sum crosses a threshold. It catches slow degradation that a single-window t-test misses and reacts far faster than waiting for a window mean to visibly move:
```python
def cusum(scores: list[float], target: float, slack: float = 0.005,
          threshold: float = 0.05) -> int | None:
    """Return the index where a sustained downward shift is detected."""
    s_lo = 0.0
    for i, x in enumerate(scores):
        s_lo = min(0.0, s_lo + (x - target) + slack)
        if s_lo < -threshold:
            return i
    return None
```
3. Distribution drift on inputs. Quality can be stable while inputs shift, the prelude to a future quality drop. Embed incoming queries (the roadmap already has an embedding path), and track population stability (PSI) or the distance between this week's query-embedding centroid and a reference window. Rising input drift with flat quality is a leading indicator: your golden set is going stale and the next model change will hurt more than the metrics currently show.
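A self-contained PSI sketch over any scalar projection of the inputs (the bin count and the 1e-6 floor are conventional choices, not requirements):

```python
from math import log


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference window and the current
    one, over a 1-D feature (query length, distance to an embedding centroid,
    ...). Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 investigate."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    step = (hi - lo) / bins or 1.0  # degenerate case: all values identical

    def shares(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / step), bins - 1)] += 1
        # floor each share so empty bins don't send the log to infinity
        return [max(c / len(xs), 1e-6) for c in counts]

    r, c = shares(reference), shares(current)
    return sum((ci - ri) * log(ci / ri) for ri, ci in zip(r, c))
```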
Alerting discipline. Alert on sustained, statistically significant drops attributable to a version or a sustained shift โ never on a single bad window. Route every quality alert with its diagnostic payload (version diff, sample traces, segment breakdown) the way /observability describes for operational alerts. A quality alert with no example traces attached is unactionable and will be ignored within a week.
Offline eval predicts whether a change is better. Online experimentation proves it on real traffic. The progression, in increasing order of exposure:
Shadow evaluation. Run the candidate prompt/model on a copy of live traffic without showing results to users. Score both old and new with the online judge and compare. Zero user risk; the cleanest way to validate a model migration (e.g., a Claude or DeepSeek version bump; see /eval-frameworks-comparison) before anyone sees it. Cost is the catch: you pay double inference for shadowed traffic, so shadow a sample, not everything.
Canary. Route a small fraction (1-5%) of real traffic to the candidate. Watch online quality and guardrail metrics on the canary slice. Auto-rollback on a significant drop. This is the production-deploy analogue of the CI gate in /ci-cd-ai: same gate logic, live traffic instead of a fixture set.
A/B testing with guardrail metrics. A controlled split with enough traffic for statistical power. The trap unique to LLM A/Bs: a change that improves your headline judge score while quietly degrading latency, cost, or refusal rate. Always pair the primary quality metric with guardrail metrics (p95 latency, cost/request, refusal rate, guardrail-hit rate) and require the candidate not to regress any of them. Optimizing a single eval number is how you ship a model that judges love and users leave.
Interleaving. For ranking-style outputs (the roadmap's fetch_courses graph ranks courses by relevance), interleave results from both variants in one response and attribute clicks. Interleaving needs roughly an order of magnitude less traffic than a split A/B to reach significance, because each user sees both variants and within-user comparison removes between-user variance.
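A sketch of the standard team-draft variant (function and parameter names are mine): the side with fewer picks so far (coin flip on ties) drafts its highest-ranked not-yet-placed item, and clicks are later credited to whichever side drafted the clicked item.

```python
import random


def team_draft_interleave(ranking_a: list[str], ranking_b: list[str],
                          seed: int = 0):
    """Merge two rankings team-draft style; returns (merged, credit) where
    credit maps each item to the variant that drafted it."""
    rng = random.Random(seed)
    merged: list[str] = []
    credit: dict[str, str] = {}
    count = {"A": 0, "B": 0}
    ia = ib = 0
    while ia < len(ranking_a) or ib < len(ranking_b):
        a_drafts = (count["A"] < count["B"]
                    or (count["A"] == count["B"] and rng.random() < 0.5))
        if (a_drafts and ia < len(ranking_a)) or ib >= len(ranking_b):
            while ia < len(ranking_a) and ranking_a[ia] in credit:
                ia += 1  # skip items the other side already placed
            if ia < len(ranking_a):
                credit[ranking_a[ia]] = "A"
                count["A"] += 1
                merged.append(ranking_a[ia])
                ia += 1
        else:
            while ib < len(ranking_b) and ranking_b[ib] in credit:
                ib += 1
            if ib < len(ranking_b):
                credit[ranking_b[ib]] = "B"
                count["B"] += 1
                merged.append(ranking_b[ib])
                ib += 1
    return merged, credit
```

Attributing each click via credit[item] and comparing per-variant click counts gives the within-user comparison that makes interleaving so sample-efficient.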
Sequential testing. Don't peek at a fixed-horizon A/B test and stop when it crosses significance; repeated peeking inflates the false-positive rate dramatically. Use an always-valid sequential test (mSPRT or a group-sequential design) if you want to monitor continuously and stop early. This is the single most common statistical error in LLM experimentation; the /eval-fundamentals article covers the underlying multiple-comparisons problem.
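The inflation is easy to demonstrate with an A/A simulation. Assuming a plain z-test at |z| > 1.96 and a peek every 50 samples (all parameters here are illustrative), the stop-at-first-significance policy fires far more often than the nominal 5% even though both arms are identical:

```python
import random
from math import sqrt
from statistics import mean, stdev


def peeking_false_positive_rate(n_sims: int = 200, n_max: int = 400,
                                peek_every: int = 50, seed: int = 7) -> float:
    """A/A simulation: both arms draw from the same distribution, so every
    'significant' result is a false positive."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        a: list[float] = []
        b: list[float] = []
        for i in range(1, n_max + 1):
            a.append(rng.gauss(0.0, 1.0))
            b.append(rng.gauss(0.0, 1.0))
            if i % peek_every == 0:
                se = sqrt(stdev(a) ** 2 / i + stdev(b) ** 2 / i)
                if abs(mean(a) - mean(b)) / se > 1.96:
                    false_positives += 1  # stopped early on noise
                    break
    return false_positives / n_sims
```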
Online eval that only produces dashboards is wasted spend. The point is to feed production reality back into the parts of the system that improve:
- Promote confirmed production failures into backend/tests/deepeval/golden/<graph>.json. This is how a static golden set stays representative: it grows from the long tail production discovers, not from what you imagined. The /deepeval-synthesizer article covers generating variations around each confirmed seed so one production failure becomes a small robust test cluster instead of a single brittle case.
- Feed judged and user-labeled traces back into dataset curation (/dataset-curation).

The loop, stated as an invariant: a failure observed in production should make it impossible for the same failure to ship silently again. If a production regression can recur without a test going red, the loop isn't closed yet.
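A sketch of the promotion step, assuming golden files are JSON arrays of cases (the case schema here is illustrative, not the suite's actual one):

```python
import json
from pathlib import Path


def promote_to_golden(trace: dict, golden_dir: Path) -> None:
    """Append a confirmed production failure to the graph's golden file
    (layout assumed to mirror golden/<graph>.json). Idempotent on trace_id
    so the same failure isn't promoted twice."""
    path = golden_dir / f"{trace['graph']}.json"
    cases = json.loads(path.read_text()) if path.exists() else []
    if any(c.get("trace_id") == trace["trace_id"] for c in cases):
        return
    cases.append({
        "input": trace["inputs"]["query"],
        "trace_id": trace["trace_id"],  # provenance: link back to the failure
        "source": "production",
    })
    path.write_text(json.dumps(cases, indent=2))
```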
Pulling the pieces together for the roadmap's LangGraph backend, the smallest viable online-eval system is four additions, each reusing infrastructure that already exists:
1. Trace emission. In the /runs/wait handler in backend/app.py, after graph.ainvoke(...) returns, emit an EvalTrace to a queue (a Postgres table on the existing Neon instance, or a Cloudflare Queue; the checkpoint store is already there). Fire-and-forget; never block the response. The applications/[id]/prep route already uses a fire-and-forget pattern to copy.
2. Sampling. The sample_rate_for heuristic above, computed inline from data already in the response and request metadata. Negative feedback and structured-output parse failures get rate 1.0.
3. Scoring workers. An out-of-band pool reusing make_llm() and the exact deepeval metrics from backend/tests/deepeval/, writing scores to an eval_scores table keyed by trace_id, graph, prompt_version, model. Reuse the suite's run_metric retry/nan-on-flake handling verbatim; production judge flake is the same DeepSeek JSON-parse flake the offline suite already hardened against.
4. A regression watcher. A scheduled job running regression_check per (graph, prompt_version) window and the CUSUM monitor, alerting through the existing observability path. Start it read-only (alert only) for two weeks to learn the noise floor before letting it gate or roll back anything.

Prompt versioning falls out for free: hash each graph's prompt module (course_review_prompts.py, the article_generate prompt constants) at build time and stamp the hash into every trace. A prompt edit now produces a new prompt_version, regression_check compares the new version's window against the old automatically, and a quality drop with an unchanged hash is, by elimination, an upstream model or data change: the one class of failure the offline suite in /ci-cd-ai cannot see.
The whole system is the offline suite re-aimed: the same make_llm() judge, same deepeval metrics, same defensive error handling, invoked from a sampler and a scheduled watcher instead of pytest, writing to a metrics store instead of asserting.

This article covers evaluation in production. The following companion articles go deeper on adjacent topics: