AI News & Strategy Daily | Nate B Jones | 26m

Opus 4.8 Scored 81. Your Workflow Doesn't Care.

TL;DR

  • Opus 4.8 looks more like a checkpoint than Anthropic's big swing: Nate says the May 28 release was timed to support a major funding announcement, while the real anticipated model is Mythos, the much-teased large release Anthropic still has not shipped.

  • Higher reasoning does not reliably help on 4.8: He points to Vending Bench, where Opus 4.8 regressed versus 4.7 and where 4.8 on high beat 4.8 on max, which breaks the usual expectation that more reasoning effort should produce better results.

  • Anthropic's alignment focus may be causing costly overthinking: Nate says reasoning traces from 4.8 max show the model spiraling on constitutional concerns like warmth, tone, and alignment, to the point that it can become less effective on practical tasks.

  • The harness now matters more than the benchmark score: His daily-driver choice is OpenAI 5.5 in Codex not because 4.8 lacks strengths, but because Codex handles long-running tasks, files, browser actions, and multi-hour execution more reliably.

  • A real-world website test exposed the gap: In his side-by-side trial, 5.5 built and deployed two Markdown-domain websites, while 4.8 errored out twice and struggled with compute limits, even before iteration on design quality.

  • The /workflows command is the standout 4.8 innovation: Nate praises Claude Code's new /workflows command for letting the model compose, reveal, and assign multi-agent workflows transparently, and predicts that pattern will spread across AI products this summer.

The Breakdown

Opus 4.8 may be posting top-tier scores like 81, but Nate B Jones argues that does not matter if the model overthinks, behaves inconsistently, and sits inside a weaker harness than OpenAI's 5.5 in Codex. His bigger point is that in 2026 the real contest is no longer only about model IQ but about whether the surrounding product can reliably turn that intelligence into finished work.
