The Benchmark Industrial Complex

The AI industry measures itself with tests it designs, administers, and interprets. The student is grading their own exam. And billions ride on the score.
By Nadia Byer · March 26, 2026

In April 2025, Meta submitted a non-public variant of Llama 4 to Chatbot Arena — the leaderboard the AI industry treats as its neutral arbiter. The variant was specifically tuned to charm human voters. A 68-page analysis of approximately 2 million comparison records showed that selective variant submission could inflate scores by up to 100 points in simulations. Meta's response: they admitted they "cheated a little bit."1

A little bit. Twenty-seven private model variants tested before the public launch. Only the best one submitted. The score that appeared on the leaderboard — the number that investors cite, journalists repeat, and regulators reference — was the output of a selection process designed to produce the highest possible number.
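
The mechanism is ordinary selection bias, and a toy simulation shows its size. A minimal sketch, not the methodology of the 68-page analysis: draw noisy Elo-style ratings for 27 variants of identical true skill (the noise level here is an assumption) and publish only the maximum.

    import random
    import statistics

    TRUE_SKILL = 1300.0   # every private variant has identical real ability
    SIGMA = 40.0          # assumed rating noise from a finite number of battles

    def observed_rating() -> float:
        """One leaderboard run: true skill plus Elo-scale sampling noise."""
        return TRUE_SKILL + random.gauss(0.0, SIGMA)

    def best_of_n(n_variants: int) -> float:
        """Test n private variants, publish only the best observed score."""
        return max(observed_rating() for _ in range(n_variants))

    random.seed(0)
    honest = [observed_rating() for _ in range(10_000)]
    gamed = [best_of_n(27) for _ in range(10_000)]

    print(f"honest mean:     {statistics.mean(honest):7.1f}")
    print(f"best-of-27 mean: {statistics.mean(gamed):7.1f}")
    # The max of 27 draws sits roughly two standard deviations above the
    # mean, so the published number gains ~80 points of pure selection
    # effect. No variant is actually any better.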

Meta was not alone. The same analysis found that OpenAI, Google, and Amazon all ran private tests and published only their winning results.1 The leaderboard that was supposed to be the honest broker was the casino.

This is the system working as designed.

How Benchmarks Get Made

Every frontier model launch comes with a press release claiming "state of the art" on a curated selection of tests. The tests have names that sound authoritative: MMLU, GPQA, HumanEval, SWE-bench. The numbers go up. The press releases multiply. None of them mention who made the test.

HumanEval — 164 Python programming problems — was created by OpenAI, alongside the Codex model it was designed to evaluate.2 The company being measured built the ruler. MMLU, the de facto standard for general knowledge, began as an academic project, but its harder successor, MMLU-Pro, was built by researchers including some affiliated with industry labs.2 GPQA was written by graduate students trying to make questions "Google-proof" — but the pool of expert question-writers overlaps with the research community at frontier labs.2

No benchmark has transparent independent funding, a chain of custody comparable to a clinical drug trial, or standardized proctoring conditions. The entire evaluation infrastructure is built by the people it evaluates.

Imagine a pharmaceutical company designing its own FDA trial, choosing which patients to include, running the experiment in its own lab, and publishing only the results that show efficacy. That company would face criminal prosecution. In AI, it's called a model card.

The Contamination Treadmill

Even if the benchmarks were well-designed, the scores would be unreliable. The reason is contamination: benchmark test data leaking into training sets.

The numbers are not subtle. StarCoder-7b scored 4.9 times higher on leaked benchmark data than on clean data.3 Retrieval-based audits report over 45% overlap between training corpora and QA benchmark datasets.3 GPT-4 can infer masked MMLU answers in 57% of cases — suggesting deep memorization of test content.3
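
Audits like these often start with something as blunt as n-gram overlap: if a long word sequence from a test item appears verbatim in the training corpus, the item is flagged. A minimal sketch of the idea; the 13-word window is a common choice in the contamination literature, not a parameter taken from the studies cited here.

    def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
        """Lowercased word n-grams, the usual unit in overlap audits."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_contaminated(benchmark_item: str, training_docs: list[str],
                        n: int = 13) -> bool:
        """Flag a test item if any long word run from it appears verbatim
        in a training document. Deliberately crude: it misses paraphrases,
        which is one reason measured overlap is a lower bound."""
        item_grams = ngrams(benchmark_item, n)
        return bool(item_grams) and any(
            item_grams & ngrams(doc, n) for doc in training_docs
        )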

The contamination isn't limited to memorization. A new category emerged in 2025: search-time contamination. When AI agents are evaluated on tasks that involve web retrieval, the retrieval step itself surfaces sources containing the test question and its answer. The model doesn't need to have memorized anything. It reads the answer key during the exam.4

Mitigation attempts exist. AntiLeak-Bench constructs samples with explicitly new knowledge absent from training data.4 Inference-time decontamination tries to clean results during evaluation rather than filtering datasets. Neither is widely adopted. The industry that would need to adopt them is the industry that benefits from not adopting them.
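
For agentic evaluations, even a crude guard on the retrieval channel would help. A minimal sketch, with search_fn as a hypothetical stand-in for whatever search backend the harness uses: drop any retrieved page that quotes a long run of the question verbatim before the model ever sees it.

    def leaks_item(page_text: str, question: str, run_len: int = 8) -> bool:
        """Rough proxy for 'this page is the answer key': does it contain
        a long verbatim run of words from the benchmark question?"""
        words = question.lower().split()
        page = page_text.lower()
        return any(" ".join(words[i:i + run_len]) in page
                   for i in range(len(words) - run_len + 1))

    def filtered_search(question: str, search_fn, k: int = 5) -> list[str]:
        """Wrap the agent's search tool so evaluation-time retrieval
        cannot hand the model its own exam."""
        return [page for page in search_fn(question, k=k)
                if not leaks_item(page, question)]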

The result is saturation. Stanford HAI's 2025 AI Index documented the speed:

GPQA scores surged 48.9 percentage points in a single year. SWE-bench jumped 67.3 points. MMMU rose 18.8 points.5 Benchmarks designed to measure frontier capabilities for years were maxed out in months.

Vanessa Parli of Stanford HAI asked the obvious question: "Are we measuring the right thing? Are those benchmarks compromised?"5

The industry's response to saturation is always the same: build a harder benchmark. But the harder benchmark follows the same lifecycle. Created by insiders. Gamed within months. Saturated. Replaced. The treadmill never produces accountability. It produces new metrics to market.

When models score 90%+ on everything, existing benchmarks can no longer differentiate them. The incentive shifts to new, untested benchmarks, where margins are larger and contamination hasn't been measured yet. The measuring stick keeps moving.
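
Part of the problem is plain arithmetic. On a 164-problem suite like HumanEval, sampling noise alone swamps the margins that press releases tout; a quick binomial confidence interval makes the point, using nothing but the standard normal approximation and the published suite size.

    import math

    def score_ci(accuracy: float, n_items: int,
                 z: float = 1.96) -> tuple[float, float]:
        """95% normal-approximation confidence interval for a benchmark
        score treated as a binomial proportion."""
        half = z * math.sqrt(accuracy * (1 - accuracy) / n_items)
        return accuracy - half, accuracy + half

    lo, hi = score_ci(0.90, 164)   # a reported 90% on HumanEval's 164 items
    print(f"95% CI: [{lo:.1%}, {hi:.1%}]")
    # Roughly [85.4%, 94.6%]: a two-point "state of the art" lead on a
    # suite this small sits comfortably inside the noise band.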

"State of the art" is a marketing claim, not a scientific one.

The Gap Nobody Talks About

If benchmarks measured what companies claim they measure, high scores would predict real-world performance. They do not.

METR's randomized controlled trial — real developers, real repositories, real issues — found that experienced open-source developers were 19% slower with AI tools. Not faster. Slower. The developers expected to be 24% faster. After the trial, they believed they had been 20% faster. The perception gap is as alarming as the performance gap.6

On real open-source software tasks, top models succeed far less often than their benchmark scores would predict. They invent APIs that don't exist. They skip available tools. They loop endlessly. These are the same near-perfect-scoring tools that slowed METR's developers down.7

The inferential gap between "90% on MATH" and "useful in production" is enormous. Benchmarks measure something. That something is not what developers experience. That something is not what enterprises are buying. But that something is what the press release says, and the press release is what moves the valuation.

The Money

Global AI investment hit $202.3 billion in 2025 — half of all venture capital worldwide. Foundation model companies alone captured $80 billion. OpenAI and Anthropic combined absorbed 14% of all global venture investment across all sectors.8

OpenAI's valuation more than tripled in a single year — from $157 billion to $500 billion — and the company is now seeking a valuation of $750–830 billion. Anthropic hit $380 billion after a $30 billion round in February 2026. Seed-stage AI startups command a 42% valuation premium over non-AI startups.9

Every investor presentation cites benchmark performance. Every funding round coincides with model announcements touting benchmark improvements. Even Tiger Global acknowledges that AI valuations are "at times unsupported by company fundamentals."9

The pipeline works like this: a two-point benchmark improvement appears in a press release within hours. The press release moves market perception. Market perception moves valuation. Valuation determines the next funding round. There is no independent verification at any point in this chain.

When billions in valuation ride on a two-point improvement on MMLU-Pro, and the test is unproctored, the data is contaminated, and the company chooses which scores to publish — the question isn't whether the numbers are gamed. It's who benefits from the game continuing.

Who Watches the Watchmen?

The auditing landscape is barren.

The Future of Life Institute published a 2025 AI Safety Index evaluating companies on 35 transparency and safety indicators. But FLI has no access to models, training data, or internal incident logs. They grade the press releases, not the product.10

METR runs real-world evaluations with actual developers on actual codebases. Their methodology is rigorous. They are also a small nonprofit in Berkeley that cannot audit at the scale the industry requires.6

Artificial Analysis conducts independent MMLU-Pro evaluations — one of the few truly independent evaluation organizations. They are the exception that proves the rule.2

The EU's Joint Research Centre reviewed approximately 110 studies over ten years and concluded that AI benchmarks are "fundamentally shaped by cultural, commercial and competitive dynamics that often prioritize state-of-the-art performance at the expense of broader societal concerns." Many benchmarks, the JRC found, are "inadequate and/or useless proxies for what they are meant to evaluate."11

What's missing: consistent, public, forensic third-party auditing with unrestricted model access. No chain of custody for benchmark test data. No proctoring of evaluation conditions. No standardized reporting requirements. The NTIA has called for independent audits.12 Harvard JOLT has published on AI auditing standards.12 Nobody is conducting them.

The Regulatory Bet

This matters beyond marketing because regulators are starting to use these benchmarks.

The EU AI Act incorporates benchmark performance into risk classification for general-purpose AI models with systemic risks. The regulation expects benchmarks to be "of fundamental importance" for assessing high-impact capabilities.13

There is a problem. CEN/CENELEC — the European standards bodies tasked with developing harmonized evaluation standards — couldn't deliver by the August 2025 deadline. The Commission pushed timelines by up to 16 months via the Digital Omnibus in November 2025. The first harmonized standard to enter public enquiry was prEN 18286 — a quality management system, not an evaluation methodology.13

A 2025 paper titled "Can We Trust Benchmarks for EU AI Compliance?" asked the question directly. The answer was no.14

Regulators are building policy on an evaluation infrastructure that the academic community has declared broken. The standards to fix it haven't been written. The timeline to write them keeps slipping. And the models subject to regulation are advancing faster than the regulations can be drafted.

In the United States, there is no federal legislation requiring standardized AI evaluation. The NIST AI Risk Management Framework is referenced but not mandatory. The audit gap is not a European problem. It's a global one.

What Would Actually Work

The "Benchmarking is Broken" paper, published in October 2025, proposed PeerBench: community-governed evaluation with sealed execution, item banking with rolling renewal, and delayed transparency.15 The analogy is standardized human testing. The SAT, GRE, and bar exam evolved over decades to balance security, fairness, and credibility. AI evaluation has none of these properties.

PeerBench has not been implemented anywhere. The paper has been cited. The idea has been praised. Nobody has built it. Building it would require the cooperation of companies whose current competitive advantage depends on the system staying broken.

METR's approach — real developers, real tasks, randomized controlled trials — works but doesn't scale.6 EvalPlus expanded HumanEval's test cases by 80x, catching overfitting that the original benchmark missed.16 These are patches on a structural problem.
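
The EvalPlus pattern is easy to picture: keep a trusted reference solution as an oracle and hammer each submission with generated inputs that the handful of visible tests never covered. A toy sketch; the problem and the bug are invented here, and EvalPlus itself generates inputs with LLM seeding and type-aware mutation rather than this naive random search.

    import random

    def reference_sort(xs: list[int]) -> list[int]:
        """Trusted oracle solution."""
        return sorted(xs)

    def candidate_sort(xs: list[int]) -> list[int]:
        """A submission that overfits the visible tests: one bubble pass,
        which happens to sort very short or nearly-sorted inputs."""
        xs = list(xs)
        for i in range(len(xs) - 1):
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
        return xs

    # Original-style hand-written tests: few, small, easy to overfit.
    assert candidate_sort([2, 1]) == [1, 2]
    assert candidate_sort([1, 2, 3]) == [1, 2, 3]

    # Amplified tests: hundreds of random inputs checked against the oracle.
    random.seed(0)
    for _ in range(500):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
        if candidate_sort(xs) != reference_sort(xs):
            print("amplified tests caught the overfit solution:", xs)
            break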

The structural fix is separation. The entities being evaluated cannot also design the evaluations, fund the evaluators, and choose which results to publish. This is not a radical proposal. It is how every other high-stakes evaluation system in the world operates. Financial auditors are independent of the companies they audit. Drug trials are overseen by the FDA, not graded by the pharmaceutical company. Only in AI does the examinee write the exam, take the exam, grade the exam, and publish the score.

The claim is not the evidence.
The benchmark is not the product.
The score is not the capability.

The Complex

The benchmark industrial complex is not a conspiracy. It's an incentive structure. The companies that build models also fund benchmark research, employ benchmark creators, choose which benchmarks to report, and interpret the results in their own press releases. The entire pipeline from creation to marketing is vertically integrated.

The EU JRC called it what it is: a system shaped by "commercial and competitive dynamics" that prioritizes "state-of-the-art performance at the expense of broader societal concerns."11 The "Benchmarking is Broken" authors called it a "Wild West" where "distinguishing genuine progress from exaggerated claims is exceptionally difficult."15

$202 billion in annual investment. $500 billion valuations. Regulatory frameworks under construction. All built on an evaluation system with no independent auditing, no proctoring, documented contamination, and a proven track record of gaming.

"55% faster." Faster at what? "State of the art." On whose test? "90% on MATH." And 26% on real tasks.

Show us the proctor, the chain of custody, and the independent audit.

The claim is not the evidence.

Disclosure

This article was researched and written with the assistance of Claude, an AI made by Anthropic. Anthropic is named in this piece as one of the companies whose valuations rest on benchmark performance. Their model's benchmark claims are subject to the same critique as every other company discussed here. The irony of an AI tool auditing AI evaluation is not lost on us — it is, in fact, the point. Every claim in this article is sourced. Verify them. Corrections welcome at nadia@sloppish.com.

Sources

  1. LMArena / Chatbot Arena scandal. 68-page analysis of ~2 million comparison records. Simulation showed selective submissions could inflate scores up to 100 points. Meta's Llama 4 "cheated a little bit" admission. The Register | Collinear AI | Simon Willison.
  2. Benchmark origins: HumanEval (OpenAI), MMLU/MMLU-Pro, GPQA, SWE-bench. o-mega | Analytics Vidhya.
  3. Benchmark data contamination survey. StarCoder-7b 4.9x inflation, 45%+ QA overlap, GPT-4 57% masked-answer inference. arXiv: Benchmark Data Contamination of LLMs.
  4. Search-time contamination and AntiLeak-Bench. arXiv: Search-Time Data Contamination | ACL 2025: AntiLeakBench.
  5. Stanford HAI 2025 AI Index. GPQA +48.9 points, SWE-bench +67.3 points. Vanessa Parli quote. Stanford HAI | 10 Charts.
  6. METR randomized controlled trial: experienced developers 19% slower with AI tools, believed 20% faster. METR | InfoQ.
  7. METR research on real-world AI coding performance. METR study | METR research update.
  8. Global AI investment: $202.3B in 2025, 50% of all VC. Foundation models: $80B. OpenAI + Anthropic: 14% of global VC. Crunchbase.
  9. AI valuations: OpenAI $157B→$500B, Anthropic $380B (post-money), seed premium 42%. Tiger Global quote. Finro | Qubit Capital | CNBC.
  10. Future of Life Institute 2025 AI Safety Index: 35 indicators, no model access. FLI | Factually.
  11. "Can We Trust AI Benchmarks?" EU Joint Research Centre meta-review, ~110 studies. arXiv | EU JRC.
  12. NTIA AI Accountability Policy Report; Harvard JOLT on AI auditing standards; Springer Nature on AI audit boards. Harvard JOLT | arXiv: Frontier AI Auditing.
  13. EU AI Act standardization: CEN/CENELEC deadline missed, 16-month delay. EU Digital Strategy | AI Act Timeline | K&L Gates.
  14. "Can We Trust Benchmarks for EU AI Compliance?" (Bench-2-CoP). arXiv.
  15. "Benchmarking is Broken: Don't Let AI be its Own Judge." PeerBench proposal. arXiv.
  16. EvalPlus: 80x expanded test cases for HumanEval, 35x for MBPP. o-mega | ngrok.