AI #172: The First Fable
TL;DR
Inference budget now changes what benchmark results mean: Zev strongly backs Noam Brown's call for labs to publish performance against token, cost, or time budgets, noting capability can keep rising dramatically as you spend more compute at test time.
Agents Last Exam puts GPT-5.5 in front, but price changes the picture: Dawn Song's new eval has GPT-5.5 leading overall while Claude Fable 5 lands in the same cluster at much higher cost, about $15.70 per task versus $3.80 for GPT-5.5 and $1.33 for Composer 2.5.
'Good enough' models are often a trap: Zev says teams keep trying to route tasks to the cheapest acceptable model, but for most serious work the spending growth is still on top-end American models, and he thinks defaulting to DeepSeek is often more fashion than analysis.
Anthropic's own numbers imply AI development is accelerating fast: Zev highlights Anthropic's claim that engineers now ship 8x as much code per quarter as they did from 2021 to 2025, calling it a scary graph because even modest true gains can compound toward recursive self-improvement quickly.
A coordinated pause has moved from fringe idea to mainstream lab rhetoric: Zev notes Demis Hassabis, Dario Amodei, and Sam Altman have all called for some form of coordinated slowdown, but says the hard parts remain unspecified: verification, enforcement, and the actual conditions for a pause.
The US government telling CAISI to stop publishing evals is 'a no good very bad move': He argues private eval details can stay confidential, but public reporting on results is essential, especially when frontier models are already being sold into federal systems.
The Breakdown
Claude Fable 5 may have dominated the week, but the sharper point here is that benchmark scores without inference budgets are becoming close to meaningless as models buy huge gains with more test-time compute. Zev Moshawitz also argues the industry is drifting toward a world of recursive self-improvement, secret government evals, and vague pause rhetoric, which means the real work now is building the ability to verify and coordinate a slowdown before it is urgently needed.
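
As a rough illustration of how price changes the picture, the per-task costs quoted in the TL;DR can be turned into multiples. This is a minimal sketch: the dollar figures are the ones given above, while the function name `cost_ratios` and the choice of baseline are assumptions made for illustration, not part of the Agents Last Exam methodology.

```python
# Per-task costs in USD, as quoted in the TL;DR above.
cost_per_task = {
    "Claude Fable 5": 15.70,
    "GPT-5.5": 3.80,
    "Composer 2.5": 1.33,
}


def cost_ratios(costs, baseline):
    """Return each model's per-task cost as a multiple of the baseline's cost.

    `cost_ratios` is a hypothetical helper written for this example.
    """
    base = costs[baseline]
    return {name: cost / base for name, cost in costs.items()}


# Compare every model against GPT-5.5 and against Composer 2.5.
ratios_vs_gpt = cost_ratios(cost_per_task, "GPT-5.5")
ratios_vs_composer = cost_ratios(cost_per_task, "Composer 2.5")

for name, ratio in ratios_vs_gpt.items():
    print(f"{name}: {ratio:.2f}x the per-task cost of GPT-5.5")
```

On these figures, Claude Fable 5 costs roughly 4.1 times as much per task as GPT-5.5 and roughly 11.8 times as much as Composer 2.5, which is the gap the eval's overall ranking hides.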