Nerd Snipe · 1h 15m

We (mostly) like Opus 4.8

TL;DR

  • DeepSWE exposes the real coding gap: The hosts argue SWE-bench Pro is badly broken, while DeepSWE shows a much more believable spread with GPT-5.5 at 70%, Opus 4.8 at 58%, GPT-5.4 at 56%, and GPT-5.4 Mini collapsing to 24%.

  • SWE-bench Pro is contaminated and prompt-broken: They call out cheating via git history and training data contamination, plus a hand-holdy prompt template that punishes OpenAI models by explicitly telling agents not to modify or write tests.

  • Claude Code feels slower and more expensive despite Opus 4.8 improving: Opus 4.8 is cheaper than 4.7 at about $12.58 per task versus $18+, but GPT-5.5 still solves tasks for roughly $6.60 and does it faster with far fewer output tokens, about 47,000 versus Opus 4.8's 136,000.

  • Anthropic's new workflow features are powerful but absurdly token-hungry: The hosts saw Claude spin up 17-plus parallel subagents in workflows and burn through usage caps so fast that the $100 tier became unusable, while the $200 tier felt dramatically more permissive than Anthropic's stated 5x to 20x gap suggests.

  • Anthropic's alignment strategy may be making Claude stranger, not safer: They argue OpenAI's "tool" approach produces models that mostly just do the task, while Anthropic's "persona" approach creates odd refusals, made-up file paths, fabricated repo states, and a model that sometimes sounds morally anxious.

  • Anthropic's $65 billion raise looks like a compute and timing play: At a $900 billion pre-money valuation and only about 7% dilution, the hosts see the round as a savvy war-chest move, especially with massive ongoing compute commitments like $1.25 billion per month for Colossus and mounting pressure to eventually ship Mythos.
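
The ratios behind the figures quoted above can be checked with a few lines of arithmetic. This is a rough sketch, not something from the episode: the variable names are illustrative, and dilution is assumed to mean new money divided by post-money valuation (pre-money plus the raise), which is the usual convention for a primary round.

```python
# Figures as quoted in the digest; variable names are illustrative.
raise_usd_bn = 65.0        # size of the round, in billions
pre_money_usd_bn = 900.0   # pre-money valuation, in billions

# Assumption: dilution = new money / post-money valuation.
post_money_usd_bn = pre_money_usd_bn + raise_usd_bn
dilution = raise_usd_bn / post_money_usd_bn

# Per-task cost and output tokens for Opus 4.8 versus GPT-5.5.
opus_cost_usd, gpt_cost_usd = 12.58, 6.60
opus_tokens, gpt_tokens = 136_000, 47_000

cost_ratio = opus_cost_usd / gpt_cost_usd
token_ratio = opus_tokens / gpt_tokens

print(f"dilution: {dilution:.1%}")           # dilution: 6.7%
print(f"cost ratio: {cost_ratio:.2f}x")      # cost ratio: 1.91x
print(f"token ratio: {token_ratio:.2f}x")    # token ratio: 2.89x
```

Under that assumption the round works out to roughly 6.7% dilution, consistent with the "about 7%" quoted, and Opus 4.8 comes out at roughly 1.9 times the per-task cost and 2.9 times the output tokens of GPT-5.5.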

The Breakdown

A new benchmark called DeepSWE puts GPT-5.5 at 70% on real coding tasks while Claude Opus 4.8 lands at 58%, and the hosts think that gap finally matches what developers actually feel in practice. From there, the episode turns into a very specific Anthropic critique: Claude Code is slower, flashier, and more slot-machine-like, while Anthropic's push to make models more "moral" may be making them weirder, less reliable, and potentially less safe.
