The benchmark crisis is coming for all of us

April 10, 2026 · 3 min read · By Ashish Kumar

ML · Agents · Research

We're in the middle of a quiet crisis in machine learning evaluation. Static benchmarks are breaking down. Not all at once - slowly, then suddenly.

The old world worked well enough

For a long time, the recipe was simple:

1. Collect a dataset
2. Hold out a test split
3. Report top-1 accuracy / F1 / BLEU
4. Publish

This worked when models were specialists. ImageNet for vision. GLUE for language. ARC for reasoning. Each benchmark had a clean scope, a clear leaderboard, and meaningful signal.

The era of narrow specialists is over.

Why agents break everything

A general-purpose agent isn't just solving a task - it's exploring a solution space. And it can do things a human benchmark designer never anticipated:

# A benchmark might test: "Can you solve this math problem?"
# An agent might do:
agent.search_web("solution to problem 42 from AMC 2024")
agent.read_math_forum_thread()
agent.extract_answer()
# Result: 100% accuracy, zero understanding

This isn't hypothetical. It's already happening.

The standard counter is to make benchmarks harder. But this is a treadmill: harder benchmarks get saturated faster now, not slower, because frontier models improve rapidly and the community is enormous.

Three failure modes

1. Contamination

Training data for large models now covers huge swaths of the internet, including benchmark data. Even when labs try to avoid contamination, it's nearly impossible to verify.
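One common way to look for contamination is to search for long n-gram overlaps between test items and training text. The sketch below assumes whitespace tokenization and a corpus small enough to hold in memory; the function names, the example strings, and the window size are illustrative choices, not a standard.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_items, training_docs, n=8):
    """Fraction of test items sharing at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    if not test_items:
        return 0.0
    flagged = sum(1 for item in test_items if ngrams(item, n) & train_grams)
    return flagged / len(test_items)


# Made-up example data: the first test item appears verbatim inside a training doc.
training_docs = [
    "forum post: what is the sum of the first ten positive odd integers answer 100",
]
test_items = [
    "What is the sum of the first ten positive odd integers",
    "How many primes are there below fifty",
]
rate = contamination_rate(test_items, training_docs, n=6)
print(f"contaminated fraction: {rate:.2f}")
```

A check like this only catches verbatim overlap. Paraphrased, translated, or reformatted copies of a test item slip through, which is part of why contamination is so hard to rule out.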

"If a model saw 80% of the test set in pretraining, is 90% test accuracy meaningful?" - Not really.

2. Benchmark overfitting at scale

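With enough compute and enough evaluations, you can hill-climb on any benchmark. Every checkpoint, prompt, or hyperparameter that gets kept because it scored higher leaks a little information about the test set into the selection loop. This isn't traditional overfitting, since nobody trains on the test set directly - it's a property of large-scale iterative development.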
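The selection effect is easy to reproduce in a toy simulation. The sketch below assumes every candidate model has the same true accuracy and that test items are independent coin flips; the function name and parameters are made up for illustration.

```python
import random


def best_of_k_accuracy(k, n_test=500, true_accuracy=0.5, seed=0):
    """Simulate k equally skilled models and return the best observed test accuracy.

    Each model answers every test item correctly with probability `true_accuracy`,
    so any gap between the returned value and `true_accuracy` is selection noise.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(k):
        correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        best = max(best, correct / n_test)
    return best


for k in (1, 10, 100, 1000):
    print(f"best of {k:>4} runs: {best_of_k_accuracy(k):.3f}")
```

None of these simulated models is better than any other, yet the reported best score climbs as more of them are evaluated against the same test set. Real development loops are not random search, but the same bias applies whenever the test score decides what gets kept.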

3. Construct validity collapse

The original benchmark tested a carefully scoped construct. As models get better in ways the benchmark designers didn't anticipate, the mapping from "score on benchmark" to "actually good at the thing we care about" breaks down.

What needs to change

Dynamic evaluation

Benchmarks that regenerate themselves. Questions sampled from a distribution, not a fixed test set. Procedurally generated problems with verifiable answers.

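class DynamicBenchmark:
    """Scores a model on freshly sampled problems instead of a fixed test set."""

    def __init__(self, generator, verifier):
        self.generator = generator  # must expose sample() -> problem
        self.verifier = verifier    # must expose check(problem, answer) -> score

    def evaluate(self, model, n_samples=1000):
        if n_samples <= 0:
            raise ValueError("n_samples must be positive")
        scores = []
        for _ in range(n_samples):
            # Generate a fresh problem every time
            problem = self.generator.sample()
            answer = model.solve(problem)
            score = self.verifier.check(problem, answer)
            scores.append(score)
        # Mean score over the sampled problems
        return sum(scores) / len(scores)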
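The generator and verifier above are left abstract. As one possible instantiation, here is a sketch using modular arithmetic, where the verifier recomputes the answer from the problem itself so that no answer key needs to be stored. All class names here are hypothetical, and the stand-in model exists only to exercise the loop.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    a: int
    b: int
    m: int

    @property
    def prompt(self):
        return f"Compute ({self.a} * {self.b}) mod {self.m}."


class ArithmeticGenerator:
    """Hypothetical generator: samples a fresh modular-arithmetic problem per call."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def sample(self):
        return Problem(
            a=self.rng.randint(100, 999),
            b=self.rng.randint(100, 999),
            m=self.rng.randint(2, 97),
        )


class ExactMatchVerifier:
    """Recomputes the ground truth from the problem, so no answer key is stored."""

    def check(self, problem, answer):
        return 1.0 if answer == (problem.a * problem.b) % problem.m else 0.0


class AlwaysZeroModel:
    """Stand-in model that ignores the problem; used only to exercise the loop."""

    def solve(self, problem):
        return 0


def mean_and_stderr(scores):
    """Mean score and its standard error (requires at least two scores)."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(variance / n)


generator, verifier, model = ArithmeticGenerator(seed=0), ExactMatchVerifier(), AlwaysZeroModel()
scores = []
for _ in range(1000):
    problem = generator.sample()
    scores.append(verifier.check(problem, model.solve(problem)))

mean, stderr = mean_and_stderr(scores)
print(f"accuracy: {mean:.3f} +/- {stderr:.3f}")
```

Because every run draws a different sample, scores from a dynamic benchmark carry sampling noise. Reporting a standard error alongside the mean makes it clearer whether two runs actually differ.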

Adversarial benchmarks

Design benchmarks assuming agents will find loopholes. Red-team your evaluation before you publish it.
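One cheap red-team pass is to run "shortcut" baselines that never read the question. If any of them beats chance by a wide margin, the benchmark has a loophole an agent can find too. The sketch below assumes multiple-choice items stored as dictionaries; the field names, the two baselines, and the tolerance are hypothetical choices for illustration.

```python
from collections import Counter


def majority_position_baseline(items):
    """Accuracy of always picking the most frequent answer position."""
    counts = Counter(item["answer_index"] for item in items)
    return counts.most_common(1)[0][1] / len(items)


def longest_option_baseline(items):
    """Accuracy of always picking the longest option, ignoring the question."""
    hits = 0
    for item in items:
        options = item["options"]
        guess = max(range(len(options)), key=lambda i: len(options[i]))
        hits += guess == item["answer_index"]
    return hits / len(items)


def audit(items, tolerance=0.15):
    """Return the shortcut baselines whose accuracy beats chance by more than `tolerance`."""
    chance = sum(1 / len(item["options"]) for item in items) / len(items)
    results = {
        "majority_position": majority_position_baseline(items),
        "longest_option": longest_option_baseline(items),
    }
    return {name: acc for name, acc in results.items() if acc > chance + tolerance}


# Made-up items in which the correct option is always the longest one.
items = [
    {"question": "q1", "options": ["4", "5", "the value is exactly 7", "9"], "answer_index": 2},
    {"question": "q2", "options": ["red", "a deep shade of blue", "green", "pink"], "answer_index": 1},
    {"question": "q3", "options": ["1", "2", "3", "approximately 42 units"], "answer_index": 3},
    {"question": "q4", "options": ["yes", "no", "only when x is positive", "never"], "answer_index": 2},
]
flagged = audit(items)
print(flagged)
```

These two baselines are only a starting point. A fuller red-team pass would also try answering with the question hidden, searching for the items online, and probing the grader for formatting exploits.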

Capability disclosure

Require labs to disclose what training data was used and what benchmark data was visible during development. Verifiable compute audits are a start.

Alignment-first evaluation

Test for what we actually care about: does the model do what we intend, in contexts we didn't anticipate? This is harder to operationalize but it's the right target.

The uncomfortable truth

Most current AI benchmarks measure "can a model produce output that looks like it solved this task to a human rater." That was a reasonable proxy. It's increasingly not.

The researchers who build the next generation of evaluation methods - dynamic, adversarially robust, construct-valid - will have outsized influence on how the field develops. The ones who don't adapt will find their benchmarks obsolete within months.

I don't know exactly what the right answer looks like. But I'm pretty sure "more static multiple choice questions" isn't it.

If you're working on evaluation methodology and want to compare notes, I'm @whoashish115 on Twitter.