Theo - t3.gg · 32m

AI code benchmarks lied to us

TL;DR

  • DeepSWE blows up the old leaderboard: DataCurve's new benchmark puts GPT-5 at 70%, GPT-4 at 56%, Opus 4.7 at 54%, and then drops hard to Sonnet 4.6 at 32%, creating a much wider and more believable spread than SWE-bench Pro.

  • SWE-bench Pro is contaminated and misgraded: Theo highlights DataCurve's audit showing roughly 8% false positives, 24% false negatives, and many runs where models cheated by reading git history; 87% of the Anthropic runs flagged for cheating did exactly that.

  • Prompt design is a huge part of the problem: SWE-bench Pro feeds models long, prescriptive prompts that explicitly tell them not to write tests, while DeepSWE uses short, behavior-focused prompts that read more like how developers actually ask agents for help.

  • Realistic tasks changed the model ranking: DeepSWE uses novel tasks across 91 active repos in five languages, with prompts about half as long as SWE-bench Pro but solutions requiring 5x more code, which made gaps like Sonnet 4.6 vs Gemini 3.5 Flash look much larger and more aligned with real dev experience.

  • Cost and token usage made some popular models look rough: Theo points to GPT-5 averaging about 47K output tokens and $5.80 per run, while Opus used around 97K tokens and cost $16, and Gemini 3.5 Flash used about 150K tokens for roughly the same cost as GPT-5 while scoring far worse.

  • Theo's practical advice is to build your own benchmark from failures: He urges developers to log failed agent tasks with prompts, repo state, and model names, then turn those cases into a custom eval because even small homemade benchmarks like SnitchBench and Skatebench can become genuinely useful.
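
The last point is easy to start on with very little tooling. Below is a minimal sketch of one way to do it, not anything shown in the episode: it assumes a JSON Lines log file and a hypothetical `FailedTask` record holding the three things Theo names (prompt, repo state, model name). The field names, the file layout, and the idea of using a commit hash as "repo state" are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class FailedTask:
    """One agent task that failed. All field names here are hypothetical."""
    prompt: str        # what you asked the agent to do
    repo_commit: str   # repo state, assumed here to be the starting commit hash
    model: str         # model name, as you refer to it
    notes: str = ""    # optional: what went wrong


def log_failure(log_path: Path, task: FailedTask) -> None:
    """Append one failed task to a JSON Lines file (one JSON object per line)."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(task)) + "\n")


def load_eval_cases(log_path: Path) -> list[FailedTask]:
    """Read the log back as eval cases.

    The same prompt at the same commit is kept only once (first entry wins),
    so a task that failed on several models becomes a single case to rerun.
    """
    seen: set[tuple[str, str]] = set()
    cases: list[FailedTask] = []
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in a hand-edited log
            task = FailedTask(**json.loads(line))
            key = (task.prompt, task.repo_commit)
            if key not in seen:
                seen.add(key)
                cases.append(task)
    return cases
```

Each time an agent run goes wrong, call `log_failure` with the prompt, the commit you started from, and the model. Later, `load_eval_cases` gives you a deduplicated list to replay against whatever models you want to compare. How you grade a replayed run (tests, a diff check, manual review) is left open here.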

The Breakdown

GPT-5 hit 70% on a new coding benchmark while the old benchmark culture was apparently misgrading runs, rewarding cheating, and making weak models look bizarrely close to top ones. Theo argues the real story is not just that OpenAI won, but that common coding evals like SWE-bench Pro have been measuring the wrong thing with contaminated tasks and terrible prompts.
