OpenAI just dropped GPT-5.5... (WOAH)
TL;DR
GPT-5.5 fixes the biggest usability complaint from GPT-5.4: it feels less rigid and more concise — Matthew Berman says the new model has a noticeably better “personality,” giving shorter, more useful coding explanations instead of formal essay-length responses.
OpenAI’s real play is enterprise coding, not just model bragging rights — he frames GPT-5.5 as a direct response to Anthropic’s enterprise momentum, pointing to the coding flywheel: sell to enterprises, collect better coding data, improve the next model, repeat.
The standout technical upgrade is token efficiency, not just raw intelligence — even though GPT-5.5 costs $5 per million input tokens and $30 per million output tokens versus roughly half that for GPT-5.4, Berman argues it often ends up cheaper overall because it gets better answers in fewer tokens.
Benchmarks suggest GPT-5.5 is especially strong at agentic coding and terminal use — on Terminal Bench, he highlights a roughly 7-point jump over GPT-5.4 and says it “completely dominates” Claude Opus 4.7 for CLI-style workflows.
The model’s visual iteration loop in Codex is what really impressed him — Berman says GPT-5.5 can look at what it built, notice UI issues on its own, and keep refining until it matches the target more autonomously than Opus 4.7.
His most memorable real-world test was a production bug fix where GPT-5.5 seemed to “see around corners” — with only a rough description and no production DB access or logs, it identified a fix that Claude Opus 4.6 and 4.7 missed.
The Breakdown
GPT-5.5 arrives, and Berman says it’s the model GPT-5.4 should have been
Matthew Berman opens with a clear verdict: GPT-5.5 is “a very, very good model” after two weeks of testing across his own codebases, fresh projects, ChatGPT Pro, and Codex. His first big takeaway isn’t a benchmark — it’s that OpenAI fixed GPT-5.4’s “difficult and soulless” feel with a better personality and much less rigid tone.
Why OpenAI is suddenly all-in on coding
He says the money is in agentic coding, computer use, knowledge work, and early scientific research — and Anthropic proved the business case by racing to a reported $30 billion annual run rate on the back of an enterprise coding focus. In Berman’s telling, OpenAI “got the message,” and GPT-5.5 is the result of that self-improving flywheel: better coding model, more enterprise adoption, more coding data, then an even better next model.
The sneaky important win: more intelligence per token
Berman keeps coming back to token efficiency as the real story. GPT-5.5 is pricier per token than GPT-5.4, but he says it reaches the same or better result in far fewer tokens, explains changes more cleanly, and avoids the annoying long-winded vibe-coding essays that made him constantly ask for shorter summaries.
Box’s enterprise benchmark gives the business angle some teeth
In the sponsored segment, he walks through Box AI’s internal evals and points to a jump from 67% to 77% accuracy overall for GPT-5.5 versus GPT-5.4. He calls out especially big gains in financial services, healthcare, and public sector, then demos a complex financial-document analysis prompt inside Box that ties an engineering roadmap to enterprise customer performance.
The benchmark story: terminal use is a real strength
Back on OpenAI’s own charts, Berman zeroes in on Terminal Bench as the benchmark that matters most because it reflects real CLI-based agent behavior. He’s notably less impressed by browser and computer control, saying point-and-click agents are “excruciatingly slow” and arguing that everything should expose a CLI or API if you actually want agents to be useful.
Demos are mixed — until the 3D dungeon game
He’s lukewarm on a basic earthquake tracker and unimpressed by a tank game, saying those kinds of front-end demos are already common. But the 3D dungeon hack-and-slash catches his attention: he compares it to Dungeon Keeper, praises the lighting, shadows, animation, and combat logic, and says it’s the kind of demo that actually feels impressive.
The killer anecdote: GPT-5.5 fixed a production issue with almost no context
Berman says he personally told OpenAI that GPT-5.5 can “see around corners.” His example: he described a production website issue without giving the model access to logs, real DB data, or the live environment, and it still identified the right fix — something Claude Opus 4.6 and 4.7 failed to do.
Price goes up, but he thinks real-world cost can still go down
He closes on pricing: GPT-5.5 comes in at $5 per million input tokens and $30 per million output tokens, about double GPT-5.4’s rates. Still, he argues most users won’t pay the full premium in practice because batch pricing, caching, flex pricing, and lower token usage mean the “effective” cost of useful intelligence is often lower — especially if you don’t need the absolute top-end reasoning ceiling.
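To make the effective-cost argument concrete, here is a minimal sketch of the arithmetic. The per-million-token prices come from the video ($5/$30 for GPT-5.5, roughly half that for GPT-5.4); the token counts for the hypothetical task are assumptions purely for illustration, chosen to show how a more concise model can end up cheaper per task despite higher per-token prices.

```python
def task_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Dollar cost for one task; prices are per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Prices from the video: GPT-5.5 at $5/$30, GPT-5.4 at roughly half that.
# Token counts below are hypothetical: same prompt, but GPT-5.5 is assumed
# to answer in far fewer output tokens (Berman's token-efficiency point).
cost_55 = task_cost(20_000, 4_000, 5.00, 30.00)   # concise answer
cost_54 = task_cost(20_000, 12_000, 2.50, 15.00)  # longer, essay-style output

print(f"GPT-5.5 per task: ${cost_55:.2f}")  # $0.22
print(f"GPT-5.4 per task: ${cost_54:.2f}")  # $0.23
```

Under these assumed token counts the pricier model is marginally cheaper per task; batch, cached-input, or flex discounts (mentioned in the video but not modeled here) would widen the gap further.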