Every · May 29, 2026 · 1h 21m

LIVE VIBE CHECK: Opus 4.8—IT'S A MONSTER

TL;DR

The hosts think Anthropic undernamed it: Dan Shipper, Kieran Klaassen, and Katie Parrott repeatedly say Opus 4.8 feels closer to "Opus 5" or even "5.5" because it got markedly better at coding, writing, design, and general knowledge work all at once.
Extra high reasoning is where the coding jump shows up: On Every's senior engineer benchmark, Opus 4.8 scored 63/100, about 30 points above Opus 4.7 and one point above GPT-5.5, but that performance only really appeared at extra high reasoning.
The model pushes back in a useful way: Kieran calls it the first model that can "punch you in the face if you do something stupid," meaning it questions your frame without becoming combative or sycophantic, which the team saw in coding, writing, and even interpersonal advice.
Writing improved a lot but is not perfect: Katie's new writing benchmark put Opus 4.8 at 79.6 versus 73 for GPT-5.5, with only 13 AI tells across eight tasks versus 25 for Opus 4.7, though it still overuses the classic "not X but Y" construction.
It is unusually strong at mixed-skill work: The team highlights a one-shot PowerPoint deck on compound engineering and several design demos as proof that Opus 4.8 can combine writing, visual taste, coding, and structure in a way that feels more complete than prior models.
Claude's product experience is the weak link: Even while praising the model, Dan says Codex remains his daily driver because the Claude desktop app feels slow and confusing, while Codex is faster, cleaner, and better designed for thread orchestration and browser-based workflows.

The Breakdown

Opus 4.8 beat GPT-5.5 on Every’s senior engineer benchmark, produced their best one-shot deck yet, and left the team saying Anthropic should have just called it Opus 5. The catch is that the model feels ahead of Claude’s own app, with Codex still winning on speed and product design.