A passing demo is not a benchmark. To know whether your agent got better or worse between two commits, you need a fixed suite of tasks, deterministic checkers, and trajectory-level metrics that survive the stochasticity of an LLM. This lesson is the runnable companion to agent evaluation: it walks through agents_lab/benchmark/, a small offline benchmark harness that scores an agent over a task suite, aggregates success rate / steps / tool-error rate, and gates regressions against a committed baseline so quality becomes a number CI can enforce. For the conceptual grounding behind every metric here, read agent evaluation and eval fundamentals first.
Mental Model
Measure the trajectory and gate regressions, don't trust vibes. A right answer after ten tool errors is not a good run, and a green demo on Tuesday tells you nothing about whether Wednesday's prompt tweak silently broke arithmetic. So the benchmark records the whole trajectory โ did it succeed, how many steps did it burn, how often did a tool error out โ then collapses a suite of those runs into three numbers and compares them to a number you committed last week. If success drops past a tolerance, the gate fails the build. That is the entire loop: a frozen suite turns "feels better" into success rate; a committed baseline turns success rate into a pass/fail signal.
Real agent benchmarks are built this way. GSM8K is multi-step grade-school math with a numeric answer; HotpotQA is multi-hop question answering where the answer is a span you can substring-match; AgentBench and the newer tool-use evals (TRAJECT-Bench, tau-bench) score whether the agent selected the right tool, with valid arguments, and used the result โ not just the final string. Our suites are deliberately small, offline JSON inspired by that exact shape: a prompt, a kind, and an expected value with a deterministic checker.
Why trajectory metrics beat final-answer-only
The naive benchmark scores one thing: did the final string contain the right answer? That number lies in two directions. An agent can stumble to 291 after the calculator tool erred four times and it brute-forced its way back โ final-answer-only calls that a win, but it is slow, expensive, and one prompt change away from never recovering. Conversely, an agent can have a flawless three-step trajectory and fail only because the checker is too strict. You cannot tell these apart from the answer alone.
So the harness scores three numbers per suite, reusing the trajectory metrics from agents_lab/eval_harness.py:
- success rate โ fraction of tasks where the checker passed. The headline, but never the whole story.
- avg steps โ mean number of agent steps to terminate. Rising steps at flat success means the agent is getting less efficient โ a regression that final-answer-only is blind to.
- tool-error rate โ tool errors divided by total steps. This is the "ten tool errors before a right answer" detector: it spikes when tool schemas drift or the agent misuses a tool, well before success rate visibly drops.
EvalReport.render() prints all three plus a per-task table, so a failing run tells you which task and how it failed, not just that the number moved. The ReAct trajectory is adapted into an EvalRun by run_from_react_state, which counts any message whose content starts with error: as a tool error โ the same trace the ReAct lab produces.
Task suites as bundled offline JSON
Benchmarks must be reproducible, which means no network at eval time. Each suite is a committed JSON file under agents_lab/benchmark/tasks/; load_builtin("arithmetic") reads it into TaskSpec objects. No Hugging Face download, no API for the dataset itself, so the suite is byte-identical on every machine and every CI run.
A task is a prompt plus a kind that selects a deterministic checker:
@dataclass
class TaskSpec:
id: str
prompt: str
kind: str # "numeric" | "contains" | "regex"
expected: object # str | list[str]
def check(self, answer: str) -> bool:
ans = (answer or "").strip()
if self.kind == "numeric":
# commas stripped so "1,025" still matches "1025"
return str(self.expected) in ans.replace(",", "")
if self.kind == "contains":
opts = self.expected if isinstance(self.expected, list) else [self.expected]
return any(str(o).lower() in ans.lower() for o in opts)
if self.kind == "regex":
return re.search(str(self.expected), ans) is not None
raise ValueError(f"unknown task kind: {self.kind}")
The three kinds map onto the three flavors of real agent benchmark:
- numeric โ GSM8K-style math. The expected number must appear in the answer (commas ignored), so
"The answer is 1,025."matches"1025". Tolerant of surrounding prose, strict on the value. - contains โ HotpotQA-style span answers. Any expected substring matches, case-insensitive, so
expected: ["Paris"]passes on"The capital of France is Paris.". A list lets you accept synonyms or aliases. - regex โ for structured outputs (a date, an ID, a JSON field) where you want a pattern, not a literal.
A bundled arithmetic suite (six numeric tasks like "What is 17 * 23 minus 100?" โ 291) and a tool_use suite (calculator and lookup tasks mixing numeric and contains) ship in the repo. Adding a task is one JSON line; adding a checker is one branch in check. Keep checkers deterministic โ an LLM-judge checker reintroduces the variance you built the suite to control (see the judge-bias discussion in agent evaluation).
The regression gate
Metrics on their own are a dashboard, not a guardrail. The gate is what makes them CI-enforceable: it compares this run's success rate against a baseline you committed and fails if it dropped beyond a tolerance.
def regression_gate(report, baseline, *, tolerance: float = 0.05):
"""True if success_rate is within `tolerance` of (or above) the baseline."""
base = float(baseline.get("success_rate", 0.0))
ok = report.success_rate >= base - tolerance
verb = "OK" if ok else "REGRESSION"
return ok, f"[{verb}] success {report.success_rate:.0%} vs baseline {base:.0%} (tol {tolerance:.0%})"
The baseline is a tiny JSON snapshot โ {"success_rate": 0.6667, "avg_steps": 2.0, "tool_error_rate": 0.15, "n": 6} โ written by save_baseline and committed under agents_lab/benchmark/baselines/. The tolerance (default 5 points) absorbs the stochasticity inherent to an LLM agent: you do not want a flaky one-task flip to red-bar an otherwise healthy PR, but a real 20-point drop blows past tolerance and fails. This is the single mechanism that turns agent quality into a number a pipeline can block on โ wire it into CI/CD for AI so every PR that touches the prompt, the tools, or the model runs the suite and the gate decides green or red.
Two operational notes. First, regenerate and re-commit the baseline deliberately, the same way you update a golden file or a snapshot test โ never let it auto-bump, or the gate silently ratchets down and protects nothing. Second, the gate keys on success rate, but a wise reviewer also eyeballs avg_steps and tool_error_rate in the printed report: success can hold flat while the agent quietly doubles its step count, and that is a regression worth catching before it becomes a cost or latency incident.
Run it
The whole loop in five lines, using the real API. react_runner() wires up the ReAct agent (it needs DEEPSEEK_API_KEY โ DeepSeek is the only paid API in this lab; the suites and checkers themselves are free and offline):
from agents_lab.benchmark.suite import load_builtin
from agents_lab.benchmark.run import run_suite, regression_gate, react_runner
# Run the ReAct agent over the bundled arithmetic suite.
report = run_suite(react_runner(), load_builtin("arithmetic"))
print(report.render())
# task pass steps tool_err
# arith_1 โ 2 0
# arith_2 โ 2 0
# arith_3 โ 1 0
# arith_4 โ 2 0
# arith_5 โ 3 1
# arith_6 โ 2 0
#
# success 83% avg_steps 2.0 tool_error_rate 8%
# Gate against a committed baseline (here inline; in CI, load the JSON).
ok, msg = regression_gate(report, {"success_rate": 0.66})
print(msg) # [OK] success 83% vs baseline 66% (tol 5%)
run_suite builds an EvalTask per TaskSpec (binding each task's check as the scorer) and hands them to evaluate from the shared eval harness, so the benchmark and the standalone harness compute metrics identically โ one definition of "success rate," not two.
You can swap in any runner. A Callable[[str], EvalRun] is the whole contract, so a plain LLM call, a different agent pattern, or a stub for offline tests all plug in unchanged:
from agents_lab.eval_harness import EvalRun
# A deterministic stub runner โ no API, useful for testing the harness itself.
def stub_runner(prompt: str) -> EvalRun:
answer = "291" if "17 * 23" in prompt else ""
return EvalRun(answer=answer, steps=1, tool_errors=0)
report = run_suite(stub_runner, load_builtin("arithmetic"))
print(report.success_rate) # 0.1667 โ only the one task it hard-codes
From the CLI
The module is its own entrypoint. Pointed at a suite, it runs the agent, prints the metrics table, and gates against the committed baseline for that suite:
# Runs the arithmetic suite, prints metrics, gates vs baselines/arithmetic.json.
# Exits 0 on OK, 1 on REGRESSION โ so CI fails the job on a real drop.
uv run python -m agents_lab.benchmark.run arithmetic
# Snapshot the current run as the new committed baseline.
uv run python -m agents_lab.benchmark.run arithmetic --save-baseline \
agents_lab/benchmark/baselines/arithmetic.json
If DEEPSEEK_API_KEY is unset the CLI prints a notice and exits cleanly rather than erroring โ the offline parts (suites, checkers, metric aggregation, the gate logic) are fully testable without the paid API; only the live ReAct runner needs it. Pass --baseline path to gate against a different snapshot, or omit it to use the default baselines/<suite>.json.
Where this fits
This benchmark is the enforcement layer for everything in agent evaluation: that lesson explains why trajectory metrics, failure taxonomies, and cost-aware reliability matter; this one gives you a small, real harness that computes a slice of them and blocks a regression. Start with the ReAct lab to build the agent the suite runs, lean on eval fundamentals for what a good checker and a good baseline look like, and route the gate through CI/CD for AI so no prompt or model change ships without first clearing the bar you committed.
The discipline scales up unchanged. Swap the six-task offline suite for a real GSM8K slice, a HotpotQA sample, or a tool-use eval; swap the substring checker for a functional-correctness check; tighten the tolerance as the agent stabilizes. The shape โ frozen suite, trajectory metrics, committed baseline, regression gate โ is the same one the production benchmarks use, just with bigger datasets and a budget for the runs.