Wes Roth · 15m

GPT-5.6 about to DROP

TL;DR

  • Anthropic’s IPO could be the AI boom’s first real stress test: Wes says public filings would expose revenue, inference costs, margins, cloud commitments, and customer concentration, giving skeptics and believers actual numbers instead of hype.

  • Opus 4.8 built a full city-economy benchmark almost end to end: In Claude’s new ultra code mode, the model created a simulation with workers, wages, taxes, welfare, businesses, vehicles, balance sheets, and even iterated on bugs and fairness issues itself.

  • GPT-5.5 still looks stronger than Opus 4.8 on DeepSWE-style coding tests: Wes points out that Claude Opus 4.8 did not beat GPT-5.5 on the DeepSWE benchmark, and says the missing ultra code result is the comparison he really wants to see.

  • ARC-AGI 3 is tiny in score, but big in significance: Opus 4.8 reportedly hit 1.5 percent, state of the art on ARC-AGI 3, while most models sit at 0.5 percent or below, and observers said its reasoning looked more abstract and human-like.

  • Benchmark design is shifting from score-chasing to thinking-style testing: Wes highlights ARC-AGI 3, Vending Bench, and DeepSWE as examples of a newer philosophy that tries to force real reasoning on fresh, contamination-free tasks instead of measuring memorized answers.

  • GPT-5.6 rumors suggest frontier models may ship as rolling updates: References to GPT-5.6 and GPT-5.6 Pro in OpenAI-related backlogs, plus talk of stronger coding agents and a possible 1.5 million-token context window, imply launches may happen every few months rather than yearly.

The Breakdown

Anthropic’s rumored IPO could force the first real financial x-ray of the AI boom, while a separate rumor says OpenAI may answer with GPT-5.6 and a major coding jump. In between, Wes Roth shows Claude Opus 4.8 building a full economic simulation benchmark and explains why its weirdly low but state-of-the-art 1.5 percent on ARC-AGI 3 still matters.
