Back to Podcast Digest
Theo - t3.gg · 44m

Did Claude really get dumber again?

TL;DR

  • Theo says the regression is real, but not just “the model got dumber” — he argues Claude’s worse coding experience comes from a stack of failures across the harness, API routing, tokenization, long-context defaults, and serving infrastructure, not a single root cause.

  • Claude Code itself is a major culprit — citing Matt Mau’s benchmark, Theo highlights that Opus performs about 15% worse in Claude Code than in Cursor, and says bad harness design, extra tool calls, and polluted system prompts are making Anthropic’s own models look worse than they are.

  • Anthropic’s new defaults may be quietly trading quality for infrastructure convenience — Theo points to the 1 million-token context becoming the default in Claude Code, even though Anthropic previously admitted the 1M-context variant behaved worse, and notes users must manually disable it with an environment variable.

  • The tokenizer change in Opus 4.7 likely made context heavier and messier — Anthropic said the same input could grow to between 1.0x and 1.35x its previous token count, while independent measurements put real coding/docs workloads closer to 1.45x to 1.47x, which Theo says means faster context bloat and more “context rot.”

  • AMD’s AI director documented a measurable post-March quality drop — across 17,000 thinking blocks, 235,000 tool calls, and 6,800 Claude Code sessions, AMD linked redacted thinking, reduced reasoning depth, more permission-seeking, more loops, and worse read-to-edit behavior to a noticeable regression in long engineering tasks.

  • Theo contrasts Anthropic with OpenAI on post-launch stability — after polling X and citing public comments from OpenAI’s Thibault, he argues meaningful degradation after release is something developers repeatedly report with Anthropic, while OpenAI tends to fix product glitches without “fiddling with the models or thinking budgets.”

The Breakdown

The vibe shift: yes, Claude really feels worse

Theo opens by reading a pile of complaints — from Reddit posts to an AMD AI director’s write-up — all pointing to the same thing: Claude feels “dumber and lazier” over time. His key framing is that the experience is real, but he doesn’t buy the simplistic idea that the API is just suddenly returning lower-IQ text for no reason.

A personal breaking point: Claude refusing obvious tasks

He admits he used to push back on regression discourse, until Claude Code started refusing basic help on his own machine — like debugging a broken Dropbox menu bar app. Claude told him that was “outside my area,” while Codex fixed it in minutes, which Theo uses to separate different failure modes: refusals, bad solutions, and the model simply getting lost.

Where regressions can happen: prompt, harness, API, GPU, model

Theo walks through the whole request path: your prompt hits a harness, then an API, then inference hardware, with each tool call creating another API round-trip. His point is that every layer can break quality — safety filters can cause false refusals, harness changes can distort behavior, and serving requests across Nvidia, AWS Trainium, and Google TPUs can introduce inconsistency inside a single Claude Code session.
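The round-trip point compounds: because each tool call re-sends the whole conversation so far, context and cost grow with every hop. A toy model makes this concrete (all numbers here are illustrative assumptions, not measurements from the episode):

```python
# Toy model of the loop Theo describes: every tool call is another API round
# trip, and the full transcript is re-sent each time, so total tokens
# transmitted grow roughly quadratically with the number of tool calls.
def tokens_sent(prompt_tokens: int, per_call_tokens: int, tool_calls: int) -> int:
    """Total tokens transmitted across a session where each round trip
    re-sends the whole conversation accumulated so far."""
    total = 0
    context = prompt_tokens
    for _ in range(tool_calls + 1):  # initial request + one per tool call
        total += context
        context += per_call_tokens   # each tool result lands in context
    return total

print(tokens_sent(2_000, 500, 0))   # 2000: a single request, no tools
print(tokens_sent(2_000, 500, 10))  # 49500: ~2.25x the cost of 11 flat 2k requests
```

This is why a harness that wastes even one or two unnecessary tool calls per edit doesn't just slow things down; it inflates every subsequent request.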

Expectations have risen — but that’s not the whole story

He concedes one softer explanation: users are asking harder things now because Claude previously impressed them. His analogy is that code which looked great when you were junior looks terrible once your standards improve — but he says that only explains part of the frustration, not the sharp, measurable drops people are now reporting.

Theo’s hottest take: Claude Code’s harness is making Claude look dumb

This is the spiciest section. He shows Claude trying to edit package.json, failing because the file hadn’t been “read” first, then wasting multiple tool calls because search didn’t count as read — a harness bug that burns tokens, pollutes context, and racks up cost. He points to Matt Mau’s benchmark, where Opus scores dramatically worse in Claude Code than in Cursor, and says Anthropic’s product engineering is actively degrading model performance.
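The failure mode is easy to reconstruct in miniature. The sketch below is hypothetical (the class and method names are made up, not Claude Code's actual internals); it shows the bug class Theo describes: the edit tool is gated on a prior read, but search results don't satisfy the gate, so the model burns extra round trips:

```python
# Hypothetical sketch of the harness bug Theo demonstrates: edits are gated
# on a prior Read call, but Search (grep) output does not count as "read",
# so the model wastes tool calls (and context) re-reading files it has seen.
class Harness:
    def __init__(self) -> None:
        self.read_files: set[str] = set()
        self.tool_calls = 0

    def search(self, path: str, query: str) -> None:
        self.tool_calls += 1  # returns matching lines, but does NOT mark as read

    def read(self, path: str) -> None:
        self.tool_calls += 1
        self.read_files.add(path)

    def edit(self, path: str, old: str, new: str) -> bool:
        self.tool_calls += 1
        if path not in self.read_files:
            return False  # "file hasn't been read yet" - the rejection Theo hits
        return True

h = Harness()
h.search("package.json", '"scripts"')      # the model already saw the content...
assert not h.edit("package.json", "a", "b")  # ...but the edit is rejected anyway
h.read("package.json")                     # forced extra round trip
assert h.edit("package.json", "a", "b")
print(h.tool_calls)  # 4 calls where 2 would have sufficed
```

Every one of those wasted calls also re-sends the full transcript, which is how a small harness oversight turns into real token cost and polluted context.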

Tokenizer changes and context bloat in Opus 4.7

Theo then zeroes in on Anthropic’s updated tokenizer, which the company says can turn the same input into as much as 1.35x its previous token count. He cites independent tests showing closer to 1.45x–1.47x on docs and CLAUDE.md files, arguing that code-heavy workflows now hit limits faster and drag around more useless context — like trying to debug a file that’s suddenly 50% longer.
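The arithmetic behind that complaint is simple enough to check. A rough sketch (the 200k window is an illustrative assumption; the multipliers are the ones quoted in the episode):

```python
# Back-of-the-envelope: how a tokenizer multiplier shrinks effective context.
STATED_MAX = 1.35   # Anthropic's stated upper bound for the same input
MEASURED = 1.45     # ~1.45x-1.47x on docs/CLAUDE.md, per the tests Theo cites

CONTEXT_WINDOW = 200_000  # illustrative standard (non-1M) window, in tokens

def effective_capacity(window: int, multiplier: float) -> int:
    """Tokens of old-tokenizer content that still fit in the same window."""
    return int(window / multiplier)

print(effective_capacity(CONTEXT_WINDOW, 1.0))       # 200000 before the change
print(effective_capacity(CONTEXT_WINDOW, MEASURED))  # 137931: ~31% less headroom
```

At the measured 1.45x, roughly a third of the window's old capacity is gone before any conversation starts, which is exactly the "hit limits faster" effect Theo is describing.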

The 1M context conspiracy theory

He revisits Anthropic’s own September postmortem, where misrouting requests to a 1 million-token context version hurt quality, then notes that 1M context later became the default for Opus 4.6/4.7 and Sonnet. Theo’s theory — which he explicitly labels conspiratorial — is that Anthropic may be steering traffic toward TPU/Trainium-friendly long-context serving, even if that version is worse; he also shows the hidden opt-out: setting CLAUDE_CODE_DISABLE_1M_CONTEXT=1.
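For anyone who wants the opt-out he shows, it is a single environment variable set before launching Claude Code (the variable name comes straight from the episode; where you export it is up to your shell setup):

```shell
# Opt out of the 1M-token context default in Claude Code,
# e.g. in ~/.bashrc / ~/.zshrc or the shell session that launches it.
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
```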

AMD’s data, redacted thinking, and Theo’s final verdict

The closing stretch leans on AMD’s analysis of 6,800 Claude Code sessions: after thinking redaction ramped up in March, reasoning depth dropped 73%, stop violations surged, permission-seeking appeared, read-to-edit behavior worsened, and API requests exploded 80x while results got worse. Theo lands hard: this isn’t just vibes anymore, it’s measurable degradation, and until Anthropic fixes its infrastructure and harness quality, his blunt recommendation is to use something else.