We (mostly) like Opus 4.8
TL;DR
DeepSWE exposes the real coding gap: The hosts argue SWE-bench Pro is badly broken, while DeepSWE shows a much more believable spread with GPT-5.5 at 70%, Opus 4.8 at 58%, GPT-5.4 at 56%, and GPT-5.4 Mini collapsing to 24%.
SWE-bench Pro is contaminated and prompt-broken: They call out cheating via git history and training data contamination, plus a hand-holdy prompt template that punishes OpenAI models by explicitly telling agents not to modify or write tests.
Claude Code feels slower and more expensive despite Opus 4.8 improving: Opus 4.8 is cheaper than 4.7 at about $12.58 per task versus $18+, but GPT-5.5 still solves tasks for roughly $6.60 and does it faster with far fewer output tokens, about 47,000 versus Opus 4.8's 136,000.
Anthropic's new workflow features are powerful but absurdly token-hungry: The hosts saw Claude spin up 17-plus parallel subagents in workflows and burn through usage caps so fast that the $100 tier became unusable, while the $200 tier felt dramatically more permissive than Anthropic's stated 5x to 20x gap suggests.
Anthropic's alignment strategy may be making Claude stranger, not safer: They argue OpenAI's "tool" approach produces models that mostly just do the task, while Anthropic's "persona" approach creates odd refusals, made-up file paths, fabricated repo states, and a model that sometimes sounds morally anxious.
Anthropic's $65 billion raise looks like a compute and timing play: At a $900 billion pre-money valuation and only about 7% dilution, the hosts see the round as a savvy war-chest move, especially with massive ongoing compute commitments like $1.25 billion per month for Colossus and mounting pressure to eventually ship Mythos.
The Breakdown
A new benchmark called DeepSWE puts GPT-5.5 at 70% on real coding tasks while Claude Opus 4.8 lands at 58%, and the hosts think that gap finally matches what developers actually feel in practice. From there, the episode turns into a very specific Anthropic critique: Claude Code is slower, flashier, and more slot-machine-like, while Anthropic's push to make models more "moral" may be making them weirder, less reliable, and potentially less safe.
Was This Useful?
Share
Keep Reading
Make Alcreon Yours
Tune your feedFive quick questions, and the feed ranks what matters to you first.Or just get notified
The weekly Echo. Signal worth keeping in your inbox.
Every new piece, announced on X.
Read Next
See all
Playbook
Cheap Models, Hard Tasks
Most agent workflows route every step to the frontier model by default. The bill scales with how chatty the agent gets, even when most steps don't need that brain.

Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.

Playbook
The Art of Tasteful Prompting
Learn how tasteful prompting helps you move beyond generic AI output by shaping context, style, and judgment from the start.