The benchmarks we use to evaluate language models shape research direction, influence deployment decisions, and define what "good" means for AI systems. Yet benchmark design is fraught with subtle pitfalls: training data contamination can inflate scores silently, popular benchmarks saturate and lose discriminative power, and off-the-shelf evaluations rarely capture domain-specific requirements. This article examines the science and craft of benchmark design, from detecting contamination to building evaluations that remain meaningful over time.
Benchmark contamination occurs when test examples from an evaluation dataset appear in a model's training data. Since modern LLMs are trained on massive web crawls (Common Crawl, refined subsets like C4 and The Pile), and since many benchmarks are publicly available on the internet, contamination is not a theoretical concern but a practical reality.
The consequences are significant. A contaminated model may appear to "reason" through a problem when it is actually recalling a memorized answer. This inflates benchmark scores and can lead to incorrect conclusions about model capabilities. OpenAI's GPT-4 technical report (2023) dedicated an entire section to contamination analysis, finding that some benchmarks showed measurable contamination effects.
Several approaches have been developed to detect and quantify contamination:
N-gram overlap analysis is the simplest approach. For each test example, check whether long n-grams (typically 8 to 13 tokens) appear in the training corpus. The GPT-3 paper (Brown et al., 2020) used 13-gram overlap to flag potentially contaminated examples and reported results both with and without them.
from nltk.util import ngrams  # any equivalent n-gram helper works

def detect_contamination_ngram(test_example: str, training_docs: list[str],
                               n: int = 13) -> bool:
    """Flag a test example if any length-n token window also appears in a training document."""
    test_ngrams = set(ngrams(test_example.split(), n))
    for doc in training_docs:
        doc_ngrams = set(ngrams(doc.split(), n))
        # Any shared n-gram is treated as evidence of overlap
        if test_ngrams & doc_ngrams:
            return True
    return False
Membership inference takes a model-centric approach. The intuition is that a model will assign higher likelihood to memorized text. Shi et al. (2023) proposed Min-K% Prob, which flags examples whose least-likely tokens are still assigned suspiciously high probability; a related diagnostic compares the model's perplexity on benchmark examples against rephrased versions of the same questions.
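A minimal sketch of the rephrasing-comparison diagnostic, assuming a HuggingFace causal LM; the model name, example texts, and the 0.5 threshold below are illustrative rather than calibrated values.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_negative_log_likelihood(text: str) -> float:
    """Mean per-token NLL under the model; memorized text tends to score lower."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

original = "What is the SI unit of electric current? Answer: Ampere."
rephrased = "Which SI unit measures electric current? Answer: Ampere."

# A large gap in favor of the original wording is a memorization signal.
gap = avg_negative_log_likelihood(rephrased) - avg_negative_log_likelihood(original)
if gap > 0.5:  # illustrative threshold, not calibrated
    print("Possible contamination: original phrasing is far more likely than its rephrasing")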
Canary string insertion is a proactive technique. When creating a benchmark, embed unique strings that should never appear naturally. If a model can complete these strings, it has seen the benchmark data. This is more useful for new benchmarks than retroactive analysis.
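A minimal sketch of the idea; `complete_fn` is an assumed callable wrapping whatever model API is in use, and the canary wording is illustrative.

import uuid
from typing import Callable

# Fixed once at benchmark creation time and embedded verbatim in every data file.
CANARY = f"BENCHMARK-CANARY-DO-NOT-TRAIN {uuid.uuid4()}"

def canary_leaked(complete_fn: Callable[[str], str], canary: str = CANARY) -> bool:
    """Prompt a model with the canary's prefix; reproducing the suffix
    implies the benchmark files were present in the training data."""
    midpoint = len(canary) // 2
    prefix, suffix = canary[:midpoint], canary[midpoint:]
    return suffix.strip() in complete_fn(prefix)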
Rephrasing-based detection compares model performance on original benchmark questions versus semantically equivalent rephrasings. A large performance gap suggests memorization rather than genuine capability. Golchin and Surdeanu (2023) formalized this in their work on data contamination in LLMs.
Contamination mitigation operates at multiple levels: dataset curators can run n-gram decontamination of training corpora against known benchmarks, benchmark authors can embed canary strings or withhold private test splits, and evaluators can report contamination analyses alongside headline scores, as the GPT-3 and GPT-4 reports did.
A benchmark saturates when the best models achieve scores so high that the remaining performance gap is within noise. At this point, the benchmark no longer discriminates between models and stops driving meaningful research progress.
The pattern is well-documented. MNIST reached near-perfect accuracy in the early 2010s. ImageNet top-5 accuracy exceeded human performance by 2015. In NLP, benchmarks from the GLUE suite (Wang et al., 2018) saturated quickly, prompting the creation of SuperGLUE (Wang et al., 2019), which itself was substantially solved within a year.
For LLM benchmarks, HellaSwag has seen frontier models exceed 95%, reducing its utility for distinguishing between top-tier systems. Certain MMLU subjects are similarly approaching ceiling. The ARC Easy set is effectively solved.
A benchmark is losing discriminative power when top models cluster within a few points of the ceiling or of each other, when rank orderings flip between evaluation runs or framework versions, and when further score gains no longer correspond to noticeable quality differences in practice.
The research community has developed several strategies:
Harder versions of existing benchmarks: MMLU-Pro (Wang et al., 2024) adds more answer choices and harder questions. ARC-Challenge selects the hardest subset of ARC questions.
New capability dimensions: Rather than making existing tasks harder, test entirely new capabilities. GPQA (Rein et al., 2023) introduced graduate-level science questions where even domain experts need time to answer correctly.
Process evaluation: Instead of just checking the final answer, evaluate the reasoning process. This is harder to saturate because the space of valid reasoning paths is much larger.
A model that scores well on MMLU may fail catastrophically in your specific domain. Generic benchmarks test broad capabilities but cannot capture the nuances of specialized applications: legal reasoning under specific jurisdictions, medical diagnosis given particular patient populations, or code generation in a specific framework.
Domain-specific evaluation requires understanding what success looks like in your context. This means working with domain experts to define the task, construct examples, and validate the evaluation methodology.
Step 1: Define the task taxonomy. Break down the target capability into specific, testable sub-tasks. For a legal AI system, this might include: statute interpretation, case law application, procedural compliance, and argument construction.
Step 2: Construct diverse test cases. Each sub-task needs examples spanning the difficulty range and covering edge cases. Aim for at least 100 examples per sub-task for statistical reliability.
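The 100-example guidance follows from the binomial standard error; a back-of-the-envelope sketch (assuming roughly 70% accuracy purely for illustration) shows how the 95% confidence interval shrinks with sample size.

import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence-interval half-width for an accuracy estimate."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (20, 50, 100, 500):
    print(n, round(ci_halfwidth(0.7, n), 3))
# 20 -> ~0.20, 50 -> ~0.13, 100 -> ~0.09, 500 -> ~0.04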
# Example: Domain-specific eval schema
eval_schema = {
    "task": "medical_diagnosis_differential",
    "subtasks": [
        {
            "name": "symptom_to_differential",
            "description": "Given patient symptoms, generate differential diagnosis",
            "n_examples": 200,
            "difficulty_distribution": {"easy": 0.3, "medium": 0.5, "hard": 0.2},
            "metrics": ["recall_at_5", "ndcg", "critical_miss_rate"],
            "expert_validated": True
        },
        {
            "name": "test_ordering",
            "description": "Given differential, recommend diagnostic tests",
            "n_examples": 150,
            "metrics": ["precision_at_3", "cost_efficiency", "urgency_accuracy"],
            "expert_validated": True
        }
    ]
}
Step 3: Define metrics that match your success criteria. Generic metrics like accuracy may not capture what matters. For a medical system, missing a critical diagnosis is far worse than including an unlikely one. Design metrics that reflect these asymmetries.
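For example, the critical_miss_rate metric named in the schema above could be defined so that missing any must-not-miss diagnosis dominates the score. This is a sketch under assumed data structures; the field names are illustrative.

def critical_miss_rate(predictions: list[list[str]], gold: list[dict]) -> float:
    """Fraction of cases where at least one critical gold diagnosis is
    absent from the model's differential (lower is better)."""
    misses = 0
    for pred, case in zip(predictions, gold):
        # case["is_critical"] is assumed to map each gold diagnosis to a bool
        critical = {d for d in case["diagnoses"] if case["is_critical"][d]}
        if critical - set(pred):
            misses += 1
    return misses / len(gold)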
Step 4: Establish baselines. Measure human expert performance on the same task to calibrate expectations. Also evaluate simpler systems (keyword matching, traditional ML) to ensure LLM-based approaches add genuine value.
Step 5: Validate the evaluation itself. Run the evaluation on models with known capability differences and verify that your evaluation ranks them correctly. This is a meta-evaluation that builds confidence in the benchmark.
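One lightweight form of this meta-evaluation is a rank-correlation check against systems whose relative strength is already established. A sketch follows; the model names and scores are placeholders.

from scipy.stats import spearmanr

# Expected ordering from prior evidence (strongest first) -- placeholder names.
expected_order = ["model_large", "model_medium", "model_small", "baseline_keyword"]
# Scores produced by the new evaluation for the same systems -- placeholder values.
measured = {"model_large": 0.81, "model_medium": 0.74,
            "model_small": 0.62, "baseline_keyword": 0.35}

expected_ranks = list(range(len(expected_order)))  # 0 = best
measured_ranks = [sorted(measured, key=measured.get, reverse=True).index(m)
                  for m in expected_order]
rho, _ = spearmanr(expected_ranks, measured_ranks)
print(f"Spearman rho = {rho:.2f}")  # close to 1.0 means the eval ranks models as expected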
Construct validity: Does your benchmark actually measure what you think it measures? A benchmark testing "reasoning" that can be solved by pattern matching is not testing reasoning.
Demographic and distribution bias: Ensure test cases represent the population the system will serve. A medical eval trained on one demographic may not generalize.
Answer leakage through format: Multiple-choice formats can introduce artifacts that allow models to score above chance without understanding the question. Adversarial filtering (as used in HellaSwag) helps but does not eliminate this.
Temporal validity: Domain knowledge evolves. Medical guidelines change, laws are updated, best practices shift. Build in processes to update the benchmark periodically.
LiveBench (White et al., 2024) addresses contamination and saturation simultaneously by generating evaluation questions from recent data. Questions are derived from newly released papers, recent news events, and fresh datasets, ensuring that no model could have seen the answers during training.
The approach has clear advantages: contamination is structurally impossible for questions based on post-training information, and the benchmark automatically refreshes. However, it introduces new challenges: question quality may vary across releases, and temporal anchoring means models trained at different times face different evaluations, complicating longitudinal comparisons.
Chatbot Arena (Zheng et al., 2023) takes a radically different approach to evaluation: human preference in head-to-head comparisons. Users submit prompts and rate responses from two anonymous models. The resulting pairwise preferences are aggregated into Elo ratings.
The Elo system, borrowed from chess, has attractive properties: ratings update incrementally with each new comparison, and a rating difference translates directly into an expected win probability:
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability for player A against player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating: float, actual: float, expected: float,
               k: float = 32) -> float:
    """actual is 1 for a win, 0.5 for a tie, 0 for a loss."""
    return rating + k * (actual - expected)
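A minimal usage example, simulating one Arena-style vote in which model A beats model B (the starting ratings are arbitrary):

rating_a, rating_b = 1200.0, 1250.0
e_a = expected_score(rating_a, rating_b)                       # A's expected win probability
rating_a = update_elo(rating_a, actual=1.0, expected=e_a)      # A won this comparison
rating_b = update_elo(rating_b, actual=0.0, expected=1 - e_a)  # B lost it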
Chatbot Arena's key insight is that human preference is ultimately what matters for assistant-style models, and pairwise comparison is more reliable than absolute scoring. The platform has collected over 500,000 votes as of early 2025, making it one of the largest-scale human evaluations in AI.
Limitations include a self-selected user base whose preferences may not match production needs, a style-over-substance bias toward longer, well-formatted, confidently worded responses, and the fact that a single Elo number aggregates preference across all prompt types rather than measuring any specific capability.
Organizations increasingly maintain private evaluation suites that are never published. This prevents contamination by construction. The trade-off is reproducibility: external researchers cannot verify claims made on private benchmarks.
A middle ground is the held-out evaluation approach: publish the benchmark format and a development set, but keep the test set private and offer evaluation as a service. This is the model used by Kaggle-style competitions and by leaderboards such as GLUE and SuperGLUE, which keep test labels hidden and score submitted predictions.
Generative rather than extractive: Instead of sourcing questions from existing text, generate novel questions that test the same capabilities. This reduces the chance of verbatim overlap with training data.
Compositional complexity: Create questions that require combining multiple pieces of knowledge in novel ways. Even if individual facts are contaminated, the specific combination may not be.
Parameterized templates: Use templates with variable slots that can generate many instances. This creates a large space of possible questions, reducing the value of memorizing any particular one.
import random

def generate_math_problem(difficulty: str) -> dict:
    """Generate a novel math word problem from a parameterized template."""
    names = ["Alice", "Bob", "Carlos", "Diana", "Elena"]
    items = ["books", "apples", "tickets", "paintings", "coins"]
    name = random.choice(names)
    item = random.choice(items)
    n1 = random.randint(5, 50)
    n2 = random.randint(2, n1)
    if difficulty == "easy":
        question = f"{name} has {n1} {item} and gives away {n2}. How many remain?"
        answer = n1 - n2
    elif difficulty == "medium":
        n3 = random.randint(1, 10)
        question = (f"{name} has {n1} {item}. Each day for {n3} days, "
                    f"{name} gives away {n2} and receives {n2 + 1}. "
                    f"How many {item} does {name} have at the end?")
        answer = n1 + n3  # net gain of one item per day
    else:
        raise ValueError(f"unsupported difficulty: {difficulty!r}")
    return {"question": question, "answer": answer, "difficulty": difficulty}
Regular refresh cycles: Schedule periodic benchmark updates. Publish version numbers and deprecate old versions.
Multi-modal grounding: Where possible, ground questions in images, audio, or other modalities that are harder to contaminate through text-based web crawling.
Even with careful design, contamination can occur after release. Monitor for sudden, unexplained score jumps on new model releases, suspiciously low perplexity on test items relative to rephrased versions, and models that complete canary strings or reproduce test questions verbatim.
When comparing models across benchmarks, simple averaging can be misleading. More sophisticated approaches include:
Elo or Bradley-Terry models: Treat benchmark performance as pairwise comparisons and compute global rankings. This handles benchmarks of different difficulties and scales.
Item Response Theory (IRT): Borrow from psychometrics to model both item difficulty and model ability simultaneously. Polo et al. (2024) applied IRT to LLM evaluation, finding it provides more robust rankings than simple accuracy averaging.
Multi-dimensional scaling: Rather than collapsing performance to a single number, visualize models in a capability space that preserves the structure of their performance profiles.
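Returning to the Bradley-Terry option above, here is a minimal sketch of the standard minorization-maximization fit, assuming only a matrix of pairwise win counts; the example counts are made up.

import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 100) -> np.ndarray:
    """wins[i, j] = number of times model i beat model j.
    Returns strength parameters; higher means stronger."""
    n = wins.shape[0]
    p = np.ones(n)
    total = wins + wins.T  # comparisons played between each pair
    for _ in range(iters):
        for i in range(n):
            denom = sum(total[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()  # fix the scale; only ratios are identified
    return p

# Made-up pairwise results for three models
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
print(bradley_terry(wins))  # the strongest model receives the largest value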
Goodhart's Law -- "when a measure becomes a target, it ceases to be a good measure" -- applies to AI evaluation with uncomfortable precision. As benchmarks determine which models receive attention, funding, and deployment, the incentive to optimize for benchmark scores rather than genuine capability becomes overwhelming.
The dynamic is straightforward. A benchmark is introduced to measure a real capability. Researchers and labs use it as a progress metric. Over time, training procedures, data mixtures, and even architectural choices are tuned specifically to improve scores on that benchmark. The score goes up, but the underlying capability the benchmark was designed to measure may not improve proportionally. The benchmark has been gamed, not through fraud, but through the ordinary incentive structure of competitive research.
MMLU and the data mixture effect. The Massive Multitask Language Understanding benchmark (Hendrycks et al., 2021) quickly became the standard measure of broad knowledge. As a result, training data mixtures were adjusted to increase coverage of the topics MMLU tests. Models that looked impressive on MMLU sometimes showed no corresponding improvement on novel questions in the same domains. The benchmark score improved because the models were trained on more relevant data, not because they achieved deeper understanding. This motivated the creation of MMLU-Pro, but the same optimization cycle will likely recur.
HumanEval and code generation. OpenAI's HumanEval benchmark for code generation contains 164 hand-crafted programming problems. Its popularity made it a de facto standard, and training pipelines were adjusted accordingly. Some open-source models that achieved high HumanEval scores showed much weaker performance on alternative code benchmarks like SWE-bench or real-world programming tasks. The narrow scope of HumanEval -- self-contained function generation with clear test cases -- became a liability when models were optimized specifically for that pattern. This illustrates why evaluation must always be multi-dimensional, as discussed in Article 31: LLM Evaluation Fundamentals.
Chatbot Arena and style optimization. Even Chatbot Arena's Elo ratings are not immune. As labs began tracking their Arena rankings closely, some tuned their models for the stylistic preferences of Arena users: longer responses, more structured formatting, and confident tone. This is a subtler form of gaming because the optimization target is real human preference, but the preference signal of Arena's self-selected user base may diverge from the needs of production deployments. The style-over-substance bias that Chatbot Arena's creators acknowledged as a limitation has, in some cases, become an optimization target.
The benchmark gaming cycle follows a predictable arc: a benchmark is introduced to measure a real capability, it becomes a reporting standard, training pipelines are tuned toward it, scores rise faster than the underlying capability, trust in the number erodes, and a harder successor benchmark is created.
This cycle is not inherently harmful. Each iteration tends to produce benchmarks that are harder, more comprehensive, and more resistant to gaming. The problem arises when stakeholders -- investors, enterprise buyers, media -- treat benchmark numbers as direct measures of capability without understanding the optimization dynamics behind them. The gap between benchmark performance and real-world utility is a recurring theme in deployment failures, and bridging it requires the kind of domain-specific evaluation covered earlier in this article and the dataset curation practices discussed in Article 22: Dataset Curation.
Standard benchmarks like MMLU and HellaSwag primarily test knowledge retrieval and linguistic pattern completion. They do not reliably distinguish between a model that reasons through a problem and one that recognizes a familiar pattern and produces the associated answer. As models become more capable, this distinction matters more. A model that "reasons" by pattern matching will fail on novel problems that require genuine multi-step inference.
A new generation of benchmarks has emerged specifically to test reasoning capabilities that resist pattern-based shortcuts.
FrontierMath (Glazer et al., 2024) consists of original mathematics problems created by professional mathematicians. The problems are designed to be novel -- they do not appear in any textbook, competition archive, or online resource. Each problem requires multiple steps of genuine mathematical reasoning, and the answers are verifiable (typically numerical or symbolic, eliminating subjective grading).
The key design choice is that FrontierMath problems are not drawn from existing mathematical literature, which means contamination through training data is structurally impossible for the initial release. Early results were striking: frontier models that scored well on GSM8K and MATH benchmarks solved fewer than 2% of FrontierMath problems. This gap between performance on established math benchmarks and performance on genuinely novel problems reveals how much of what we measure as "mathematical reasoning" may actually be sophisticated retrieval from training data.
The Abstraction and Reasoning Corpus (ARC), created by Chollet (2019), takes a different approach. ARC presents visual grid-transformation tasks where the model must infer an abstract rule from a few input-output examples and apply it to a new input. The tasks are designed to test fluid intelligence -- the ability to adapt to novel situations -- rather than crystallized knowledge.
ARC-AGI (the benchmark built around ARC tasks) has proven remarkably resistant to scaling. Larger models do not dramatically outperform smaller ones, suggesting that the tasks test something fundamentally different from what standard language modeling optimizes for. The ARC-AGI-2 competition in 2025 continued to show that even frontier models struggle with tasks that most humans find straightforward, though the gap has narrowed with reasoning-focused architectures. This connects to the broader question of what evaluation signals actually tell us about model internals, a topic explored in Article 33: LLM-as-Judge.
GPQA (Rein et al., 2023) targets a specific failure mode: the ability of models to answer questions that require deep domain expertise rather than surface-level knowledge retrieval. Questions are drawn from physics, chemistry, and biology at the graduate level, and they are designed to be "Google-proof" -- answering them requires genuine understanding, not information lookup.
The benchmark's validation process is noteworthy. Domain experts validated each question, and non-expert validators (PhD-level researchers in adjacent fields) were given extensive time and internet access to attempt the questions. The gap between expert and non-expert performance confirms that the questions test deep understanding rather than searchable knowledge. For models, GPQA performance correlates weakly with performance on easier knowledge benchmarks, reinforcing the point that knowledge retrieval and reasoning are distinct capabilities.
Taken together, FrontierMath, ARC-AGI, and GPQA paint a consistent picture: current models are substantially better at tasks that resemble their training distribution than at tasks requiring genuinely novel reasoning. This does not mean models lack all reasoning capability, but it means that standard benchmarks significantly overestimate the generality of that capability. For teams building applications that depend on reliable reasoning -- mathematical proof verification, scientific hypothesis generation, complex planning -- these harder benchmarks provide a more honest assessment of what current models can and cannot do.
A persistent and underappreciated problem in LLM evaluation is that different evaluation frameworks frequently produce different scores for the same model on the same benchmark. A model might score 78% on MMLU when evaluated with one harness and 83% with another. These differences are not small, and they can reverse the ordering of models on a leaderboard.
Understanding why this happens is essential for anyone who relies on benchmark results to make decisions.
Prompt formatting. The most common source of discrepancy is prompt construction. MMLU questions can be presented in dozens of ways: with or without a system prompt, with different numbers of few-shot examples, with varying formatting of the answer choices, with or without chain-of-thought instructions. Each variation can shift scores by several percentage points. The EleutherAI Language Model Evaluation Harness, the HELM framework, and OpenAI's internal evals all use different prompt templates, and none is objectively "correct."
# Same MMLU question, two prompt formats, different results
prompt_v1 = """The following is a multiple choice question about physics.
Question: What is the SI unit of electric current?
A. Volt
B. Ampere
C. Ohm
D. Watt
Answer:"""
prompt_v2 = """Answer the following physics question by selecting A, B, C, or D.
Q: What is the SI unit of electric current?
(A) Volt (B) Ampere (C) Ohm (D) Watt
The answer is"""
These two prompts test the same knowledge, but the model's log-probabilities over answer tokens will differ. Multiply this across thousands of questions and the aggregate effect is substantial.
Sampling parameters. Temperature, top-p, top-k, and the choice between greedy decoding and sampling all affect results. A model evaluated with greedy decoding will produce different (often higher) scores than the same model evaluated with temperature 0.7. Some frameworks default to greedy; others use sampling with specific parameters. This is particularly impactful for reasoning tasks where chain-of-thought quality varies significantly between samples.
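For illustration, the two decoding configurations below (expressed as keyword arguments in the style of the HuggingFace generate API) can both legitimately be reported as evaluating "the same model", yet they will not produce identical scores on generation-heavy tasks.

# Greedy decoding: deterministic; a common default in evaluation harnesses.
greedy_cfg = dict(do_sample=False, max_new_tokens=256)

# Sampled decoding: stochastic; repeated runs yield different chains of thought.
sampled_cfg = dict(do_sample=True, temperature=0.7, top_p=0.9, max_new_tokens=256)

# e.g. output_ids = model.generate(**inputs, **greedy_cfg)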
Answer extraction. After a model generates a response, the framework must extract the model's "answer" to compare against ground truth. Some frameworks look for the letter label (A, B, C, D) in the last token; others parse the full response for the answer string; still others use log-probabilities over answer tokens rather than generated text. These extraction strategies handle edge cases differently -- a model that says "The answer is B, but I initially considered A" might be scored differently depending on which extraction method is used.
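A sketch of how two plausible extraction strategies disagree on exactly the response quoted above; the regexes are illustrative and not taken from any particular harness.

import re

response = "The answer is B, but I initially considered A."

def extract_last_letter(text: str) -> str | None:
    """Take the last standalone A-D mentioned anywhere in the response."""
    letters = re.findall(r"\b([ABCD])\b", text)
    return letters[-1] if letters else None

def extract_after_cue(text: str) -> str | None:
    """Only accept a letter that directly follows an 'answer is' cue."""
    m = re.search(r"answer is\s*\(?([ABCD])\)?", text, flags=re.IGNORECASE)
    return m.group(1) if m else None

print(extract_last_letter(response))  # "A" -> marked wrong if B is the key
print(extract_after_cue(response))    # "B" -> marked right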
Tokenization and context handling. Different frameworks may truncate context differently, handle special tokens differently, or process multi-turn examples differently. For benchmarks with long contexts, these differences compound.
The reproducibility problem has real consequences. When a model's benchmark scores vary by 5+ points depending on evaluation framework, claims like "Model X outperforms Model Y on MMLU" become framework-dependent assertions rather than objective facts. This is especially problematic for enterprise buyers comparing models and for researchers trying to measure incremental progress.
The field has made some progress through standardization efforts. The Open LLM Leaderboard uses a consistent evaluation harness, and papers increasingly report the exact evaluation framework and parameters used. But standardization is far from complete, and new models frequently introduce evaluation complications (different chat templates, different special tokens, different context windows) that existing frameworks must accommodate.
For practitioners, the takeaway is clear: never trust a single benchmark number in isolation. When comparing models, ensure evaluations use identical frameworks, prompts, and sampling parameters. When this is not possible, focus on large performance gaps that are unlikely to be artifacts, and validate with your own domain-specific evaluations. Evaluation methodology is as much a part of the result as the model itself. For multi-modal models, these reproducibility concerns are compounded by additional variables in image preprocessing and visual grounding, as explored in Article 49: Vision-Language Models.
The quality of our evaluations sets the upper bound on the quality of our understanding of model capabilities. Investing in better benchmark design pays dividends across the entire AI development lifecycle.