OpenAI Just Killed Its Favorite AI Coding Benchmark
“This benchmark is done.” That’s OpenAI’s frontier evaluations team talking — about SWE-bench Verified, the test that was the gold standard for AI coding ability barely a year ago. Models got smart so fast that the ruler broke. And nobody at the frontier labs has a great replacement yet.
What SWE-bench actually measured
SWE-bench was one of the first widely adopted benchmarks that tried to answer a real question: can an AI actually do the job of a software engineer?
Forget LeetCode-style puzzles. SWE-bench pulls real issues from real GitHub repositories (Django, scikit-learn, sympy) and hands the model a ticket. Read the codebase. Find the right files. Write a patch. The patch is then graded by running tests: the ones that failed before the real fix must now pass, and the ones that already passed must keep passing. It’s the closest thing the industry had to a “junior engineer at work” simulation.
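The grading rule can be sketched in a few lines. In the published SWE-bench dataset, each instance lists `FAIL_TO_PASS` tests (failing before the reference fix, passing after it) and `PASS_TO_PASS` tests (passing both before and after). The `Instance` class, the `is_resolved` helper, and the test names below are hypothetical and for illustration only; the real harness also builds an environment, applies the patch, and parses test logs.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Field names mirror the SWE-bench dataset columns; the class itself is hypothetical.
    fail_to_pass: list[str]  # tests the patch is expected to fix
    pass_to_pass: list[str]  # tests the patch must not break


def is_resolved(instance: Instance, test_results: dict[str, bool]) -> bool:
    """Return True only if every required test passed after the patch was applied.

    test_results maps a test name to True (passed) or False (failed).
    A test missing from the results counts as a failure.
    """
    required = instance.fail_to_pass + instance.pass_to_pass
    return all(test_results.get(name, False) for name in required)


# Illustrative instance with made-up test names.
example = Instance(
    fail_to_pass=["test_bug_is_fixed"],
    pass_to_pass=["test_existing_a", "test_existing_b"],
)

fixed_cleanly = {"test_bug_is_fixed": True, "test_existing_a": True, "test_existing_b": True}
fixed_but_broke_something = {"test_bug_is_fixed": True, "test_existing_a": False, "test_existing_b": True}

print(is_resolved(example, fixed_cleanly))              # True
print(is_resolved(example, fixed_but_broke_something))  # False
```

The all-or-nothing rule is the point: a patch that fixes the bug but breaks one existing test scores the same as no patch at all.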
The Verified subset, curated by OpenAI with human reviewers, narrowed it to 500 problems where the issue was actually solvable and the test verdict was unambiguous. That clean signal is why every lab — Anthropic, Google, Meta, the open-source crowd — used the same number to argue who was winning.
A year of brutal score inflation
In early 2024, GPT-4 was scoring in the single digits on the original SWE-bench; the Verified subset did not arrive until August 2024. By mid-2025, frontier models were clearing 70% and 80% on Verified. The latest generation lives in the 90s.
In a February presentation, OpenAI’s Mia Glaese and Olivia Watkins laid out why the team is walking away:
- The remaining unsolved problems increasingly look like flaws in the benchmark itself — ambiguous issue descriptions, brittle hidden tests
- The score no longer separates frontier models from each other in any meaningful way
- Training data contamination is hard to rule out at this point
Translation: when everyone runs sub-10-second 100-meter dashes, the gap between gold and silver is measurement noise, not athletic ability.
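A back-of-the-envelope calculation shows how little a one-point gap means on 500 problems. This is a rough sketch, not the analysis from the presentation: the 90% and 91% scores are illustrative, each problem is treated as an independent pass/fail trial, and the two models are treated as independent samples even though they are scored on the same problems.

```python
import math

N_PROBLEMS = 500  # size of SWE-bench Verified


def standard_error(score: float, n: int = N_PROBLEMS) -> float:
    """Binomial standard error of a pass rate measured on n problems."""
    return math.sqrt(score * (1.0 - score) / n)


def gap_in_standard_errors(score_a: float, score_b: float, n: int = N_PROBLEMS) -> float:
    """Express the gap between two scores in standard errors of their difference.

    Simplifying assumption: the two scores are independent samples.
    """
    se_diff = math.sqrt(standard_error(score_a, n) ** 2 + standard_error(score_b, n) ** 2)
    return (score_a - score_b) / se_diff


se = standard_error(0.90)
print(f"standard error at 90%: {se * 100:.2f} points")       # about 1.34 points
print(f"approx. 95% interval: +/- {1.96 * se * 100:.1f} points")  # about 2.6 points
print(f"91% vs 90% gap: {gap_in_standard_errors(0.91, 0.90):.2f} standard errors")  # about 0.54
```

Under these assumptions, a model at 91% and a model at 90% are about half a standard error apart, far short of the roughly two standard errors usually asked for before calling a difference real.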
What “saturated” really means
This isn’t just one stale exam. It’s a structural problem the entire industry is hitting.
A saturated benchmark means one of two things. Either the model genuinely mastered the skill, or the test was never measuring the skill it claimed to measure. The honest answer is usually both. AI agents really are good at well-scoped GitHub tickets. That doesn’t mean they think like senior engineers.
Practitioners using these tools day-to-day are far less impressed than the leaderboard suggests. The Replit incident that lit up Hacker News this week — where an agent wiped a production database — is the loud version of a quieter pattern: code that compiles, passes tests, and is subtly, expensively wrong. The gap between a 90% benchmark score and “I’d let this thing touch our codebase unsupervised” is widening, not closing.
What replaces it
OpenAI’s next-generation evals are heading in three directions.
Longer, messier tasks. Not “fix this PR” but multi-step engineering projects that span days or weeks — design a feature, build it, debug it, refactor it. End-to-end, not snapshot.
Real workplace simulation. Vague specs. A PM who keeps changing their mind. Existing code conventions you have to respect. Code review feedback you have to incorporate. None of this has a single right answer, which is exactly the point.
Domain depth. Not generic Python, but code that requires actual knowledge of finance, biology, or law. The kind of work where being a great generalist isn’t enough.
The common thread: none of these are easy to grade. Humans have to stay in the loop. Costs go up. The age of the auto-graded leaderboard is quietly giving way to something slower and more expensive — back to expert judgment.
The harder question
If “model X cleared 90% on SWE-bench” is no longer useful information, how do you actually pick a model? The honest answer is the unsexy one: plug it into your own workflow and watch what breaks.
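One lightweight way to do that is to keep a small set of tasks drawn from your own work, each with a check you trust, and record which ones each candidate fails. The sketch below is a minimal illustration under stated assumptions: `Task`, `run_tasks`, and `failures` are hypothetical names, the model is any function from prompt text to output text, and the stub model stands in for a real API call.

```python
from typing import Callable, NamedTuple


class Task(NamedTuple):
    name: str
    prompt: str
    check: Callable[[str], bool]  # your own acceptance test for the output


def run_tasks(model: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run every task through the model and record pass/fail per task."""
    results: dict[str, bool] = {}
    for task in tasks:
        try:
            results[task.name] = bool(task.check(model(task.prompt)))
        except Exception:
            # A crash in the model call or in the check counts as a failure.
            results[task.name] = False
    return results


def failures(results: dict[str, bool]) -> list[str]:
    """Names of the tasks that did not pass, sorted for stable output."""
    return sorted(name for name, ok in results.items() if not ok)


# Stub model for illustration only: it just upper-cases the prompt.
def stub_model(prompt: str) -> str:
    return prompt.upper()


tasks = [
    Task("shout", "hello", lambda out: out == "HELLO"),
    Task("writes_function", "write a function", lambda out: "def " in out),
]

results = run_tasks(stub_model, tasks)
print(failures(results))  # ['writes_function']
```

The list of failed task names is often more useful than the pass rate, because it tells you where a given model breaks in your workflow rather than how it ranks on someone else's.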
The deeper problem is that benchmarks are now being lapped by the systems they’re supposed to measure. When you can’t build a test fast enough to keep up with the thing you’re testing, you’re not really evaluating anymore — you’re guessing. That’s the corner OpenAI just admitted it’s painted into. Everyone else is in the same room; they just haven’t said it out loud yet.