AskwhoCasts AIJune 11, 20261h 13m

AI #172: The First Fable

TL;DR

Inference budget now changes what benchmark results mean: Zev strongly backs Noam Brown's call for labs to publish performance against token, cost, or time budgets, noting capability can keep rising dramatically as you spend more compute at test time.
Agents Last Exam puts GPT-5.5 in front, but price changes the picture: Dawn Song's new eval has GPT-5.5 leading overall while Claude Fable 5 lands in the same cluster at much higher cost, about $15.70 per task versus $3.80 for GPT-5.5 and $1.33 for Composer 2.5.
'Good enough' models are often a trap: Zev says teams keep trying to route tasks to the cheapest acceptable model, but for most serious work the spending growth is still on top-end American models, and he thinks defaulting to DeepSeek is often more fashion than analysis.
Anthropic's own numbers imply AI development is accelerating fast: He highlights Anthropic's claim that engineers now ship 8x as much code per quarter as they did from 2021 to 2025, calling it a scary graph because even modest true gains can compound toward recursive self-improvement quickly.
A coordinated pause moved from fringe idea to mainstream lab rhetoric: Zev notes Demis Hassabis, Dario Amodei, and Sam Altman have all called for some form of coordinated slowdown, but says the hard part is still unspecified verification, enforcement, and actual pause conditions.
The US government telling CAISI to stop publishing evals is 'a no good very bad move': He argues private eval details can stay confidential, but public reporting on results is essential, especially when frontier models are already being sold into federal systems.

The Breakdown

Claude Fable 5 may have dominated the week, but the sharper point here is that benchmark scores without inference budgets are becoming close to meaningless as models buy huge gains with more test-time compute. Zev Moshawitz also argues the industry is drifting toward a world of recursive self-improvement, secret government evals, and vague pause rhetoric, which means the real work now is building the ability to verify and coordinate a slowdown before it is urgently needed.