Matthew Berman · 17m

Finally a good benchmark (DeepSWE)

TL;DR

  • DeepSWE looks more like how developers actually use coding agents — prompts are short and behavior-focused rather than over-specified, so models have to explore the repo and figure out the fix instead of following a giant hint-filled brief.

  • GPT-5.5 posts a clear win over Opus 4.7 here — on the DeepSWE leaderboard, GPT-5.5 Extra High hits about 70% while Opus 4.7 trails by more than 15 points, matching what Berman says he’s hearing from engineers.

  • The verifier quality is a huge deal — DeepSWE reports a 0.3% false-positive rate and 1.1% false-negative rate versus SWE-bench Pro’s 8.5% and 24%, which means the benchmark is much less likely to misgrade real solutions.

  • Cost and token efficiency make the gap look worse for Anthropic — Berman highlights GPT-5.5 at roughly $5.80 per trial versus Opus 4.7 near $16, with median output tokens around 16,000 for GPT-5.5 versus 60,000 for Opus 4.7 in the MiniSuite setup.

  • DeepSWE creates more meaningful model separation — instead of bunching everyone together, the benchmark spreads models out from GPT-5.5 near the top to Claude Haiku 4.5 at 0%, making it easier to tell who’s actually better.

  • The benchmark also surfaces behavioral differences between model families — Claude often misses one branch of multi-part requirements, while GPT-5.5 is described as more literal and consistent about implementing all stated behaviors.

The Breakdown

GPT-5.5 doesn’t just edge out Claude Opus 4.7 on DeepSWE — it beats it by 15+ points while using far fewer tokens, less time, and about one-third the cost. Matthew Berman argues the bigger story is the benchmark itself: shorter, more realistic prompts, contamination-free tasks, and a verifier that cuts false negatives from 24% on SWE-bench Pro to just 1.1%.
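To make the "about one-third the cost" and verifier claims concrete, here is a rough back-of-the-envelope check using only the approximate figures quoted in this digest. The variable names are illustrative, and the inputs are rounded numbers as reported in the episode, not official leaderboard data.

```python
# Approximate figures as quoted in the digest (not official leaderboard data).
cost_per_trial_usd = {"GPT-5.5": 5.80, "Opus 4.7": 16.00}
median_output_tokens = {"GPT-5.5": 16_000, "Opus 4.7": 60_000}
verifier_error_rate = {
    "DeepSWE": {"false_positive": 0.003, "false_negative": 0.011},
    "SWE-bench Pro": {"false_positive": 0.085, "false_negative": 0.24},
}

# GPT-5.5 cost and output tokens as a fraction of Opus 4.7's.
cost_ratio = cost_per_trial_usd["GPT-5.5"] / cost_per_trial_usd["Opus 4.7"]
token_ratio = median_output_tokens["GPT-5.5"] / median_output_tokens["Opus 4.7"]

# How many times lower DeepSWE's verifier error rates are than SWE-bench Pro's.
fp_factor = (
    verifier_error_rate["SWE-bench Pro"]["false_positive"]
    / verifier_error_rate["DeepSWE"]["false_positive"]
)
fn_factor = (
    verifier_error_rate["SWE-bench Pro"]["false_negative"]
    / verifier_error_rate["DeepSWE"]["false_negative"]
)

print(f"Cost ratio (GPT-5.5 / Opus 4.7):   {cost_ratio:.2f}")
print(f"Token ratio (GPT-5.5 / Opus 4.7):  {token_ratio:.2f}")
print(f"False-positive rate reduction:     {fp_factor:.1f}x")
print(f"False-negative rate reduction:     {fn_factor:.1f}x")
```

On these rounded inputs the cost ratio comes out a little above one-third and the output-token ratio a little above one-quarter, which is consistent with the summary's wording.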
