AI News & Strategy Daily | Nate B Jones · 32m

GPT-5.5 vs Claude vs Gemini: The Real Difference Nobody's Talking About

TL;DR

  • GPT-5.5 raises the floor, not just the ceiling — Nate’s core claim is that 5.5 feels like a bigger pretrain showing up in everyday use, with OpenAI citing 82% on Terminal Bench, 84% on GPQA-like knowledge-work evals, and Artificial Analysis ranking it top while using fewer tokens than 5.4.

  • The real gap between frontier models shows up on ugly, multi-step work — easy prompts make Claude, Gemini, and GPT look interchangeable, but Nate argues the meaningful differences appear when the brief is underspecified, the files are messy, and the model has to carry a deliverable across tools, formats, and risk constraints.

  • On Nate’s 'Dingo' executive-work benchmark, GPT-5.5 crushed the field — it scored 87.3 vs 67.0 for Opus 4.7 and 49.8 for Gemini 3.1 Pro, producing 23 real artifacts including a 17-slide deck, working spreadsheets, an interactive dashboard, and a legal posture that correctly treated exotic pet ownership as risky and constrained.

  • On messy backend data migration, 5.5 is the best first pass, not the final authority — in the 'Splash Brothers' test it caught planted traps like Mickey Mouse, ASDF ASDF, and a fake $25,000 payment, but it still missed service-code conflicts, created a bad canonical customer record, and left enum normalization sloppy.

  • Claude still has a real edge on blank-canvas visual taste — in the Artemis 2 interactive NASA visualization, both 5.5 and Opus 4.7 got the mission shape right, but Claude produced a more grounded, better-lit, more presentation-ready scene while 5.5 leaned into information density and looked more cartoonish.

  • The winning workflow is routing, especially 5.5 + Codex + Images 2.0 — Nate says ChatGPT alone undersells the model, while inside Codex 5.5 can inspect files, edit code, run tests, drive browsers, and iterate in place; his practical rule is 5.5 for execution, Opus for taste, and validation wherever money, law, ops, or production data are involved.

The Breakdown

Why GPT-5.5 feels like a real frontier jump

Nate opens with a strong claim: GPT-5.5 is the best model in the world right now, but the interesting part isn’t that it beats 5.4 on benchmarks; it’s that it changes what you can reasonably ask a model to do. His phrase is that “the floor moved,” meaning the default fast experience got smarter, not just the max-effort inference mode. He ties that to benchmark numbers like 82% on Terminal Bench and Artificial Analysis putting 5.5 at the top, but says the bigger story is intuitive: it gets the shape of the work sooner and needs less handholding.

Easy tasks are a trap; real work is messy

He takes aim at the fashionable view that the best model matters less now because all frontier models are “good enough.” Nate says that’s only true if your test lives in summarize-a-doc, write-an-email, or build-a-to-do-app territory. The real question now is not “can the model answer this?” but “can the model carry this?” — long context, contradictory files, legal risk, multiple artifacts, and enough persistence that the human is reviewing the hard parts instead of rebuilding everything.

Dingo: the absurd startup test that exposed real judgment

His first private benchmark, “Dingo and Company,” is a fictional Anchorage pet-tech startup selling an automated litter box for dingoes, with a legally sensitive import subsidiary called Northern Canada Imports. The absurd premise is intentional: weak models treat it like a joke launch, while stronger models realize it’s ethically fraught, operationally complex, and needs a narrow, qualified-household go-to-market posture. GPT-5.5 scored 87.3 and delivered all 23 requested artifacts as real files — including a 17-slide deck, spreadsheets with formulas, a working dashboard, and 34 sourced URLs — while the others either drifted on numbers, underproduced, or faked files with the right extensions.
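
To make the “faked files with the right extensions” failure concrete: the cheap defense is to sniff bytes rather than trust file names, since modern Office documents are zip containers and PDFs carry a fixed header. A minimal Python sketch, with hypothetical file names standing in for the Dingo deliverables:

```python
# Byte-sniffing check for "real file vs. right-extension fake".
# The deliverable paths are hypothetical; only the format checks are real.
from pathlib import Path
import zipfile

def looks_real(path: Path) -> bool:
    """Reject missing or empty files, and files whose bytes betray the extension."""
    if not path.exists() or path.stat().st_size == 0:
        return False
    if path.suffix in {".xlsx", ".pptx", ".docx"}:
        # Modern Office files are zip archives; a renamed text file fails this.
        return zipfile.is_zipfile(path)
    if path.suffix == ".pdf":
        return path.read_bytes()[:5] == b"%PDF-"
    return True  # formats we don't sniff pass through

for f in [Path("dingo_deck.pptx"), Path("unit_economics.xlsx")]:  # hypothetical
    print(f, "ok" if looks_real(f) else "fake or missing")
```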

Splash Brothers: where 5.5 gets impressively useful — and still unsafe to trust alone

The second test is a gross small-business data migration for a fictional car wash operation with 465 messy files: CSVs, Excel sheets, corrupted JSON, scanned handwritten receipts, VCFs, fake customers, duplicates, typos, and a planted fake $25,000 payment. Nate says 5.5 is the first model to catch the obvious human-red-flag traps — it rejected Mickey Mouse, Test Customer, ASDF ASDF, and the fake payment, merged all seven planted duplicate pairs, and generated a 7,287-line migration report. But it still missed backend hygiene like service-code conflicts, orphan handling, and enum normalization, which is why his practical advice is clear: use 5.5 for the heavy lift, then wrap it with validators, row-count checks, schema constraints, and human approval.
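
The “wrap it with validators” advice translates directly into code: dumb, deterministic checks that run after the model’s migration pass, with a human gate at the end. A minimal sketch, assuming flat lists of row dicts; the field names, service-code enum, and payment ceiling are all invented for illustration:

```python
# Post-migration validators in the spirit of Nate's advice. Field names,
# the service-code enum, and the payment ceiling are all assumptions.
SUSPECT_NAMES = {"mickey mouse", "test customer", "asdf asdf"}
VALID_SERVICE_CODES = {"WASH", "WAX", "DETAIL"}
MAX_PLAUSIBLE_PAYMENT = 10_000  # a $25,000 car wash payment should never pass silently

def validate_migration(source_rows: list[dict], migrated_rows: list[dict]) -> list[str]:
    issues = []
    # Row-count reconciliation: duplicate merges may shrink the count, never grow it.
    if len(migrated_rows) > len(source_rows):
        issues.append(f"row count grew: {len(source_rows)} -> {len(migrated_rows)}")
    for row in migrated_rows:
        if row["customer_name"].strip().lower() in SUSPECT_NAMES:
            issues.append(f"planted fake survived: {row['customer_name']!r}")
        if row["payment_amount"] > MAX_PLAUSIBLE_PAYMENT:
            issues.append(f"implausible payment: {row['payment_amount']}")
        if row["service_code"] not in VALID_SERVICE_CODES:  # the enum hygiene 5.5 fumbled
            issues.append(f"unnormalized service code: {row['service_code']!r}")
    return issues  # human approval happens only after this list is empty or triaged

print(validate_migration(
    [{"customer_name": "Jane Doe", "payment_amount": 45, "service_code": "WASH"}],
    [{"customer_name": "ASDF ASDF", "payment_amount": 25_000, "service_code": "wash "}],
))
```

The point is the division of labor: the model does the fuzzy heavy lift, the validators stay deterministic, and nothing touches production data without the approval step.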

Artemis 2: strong research, weaker taste

The third test asks the model to research and build an interactive 3D visualization of NASA’s Artemis 2 mission from scratch. Both GPT-5.5 and Opus 4.7 correctly understood that Artemis 2 is a lunar flyby rather than a landing or orbit mission, which already rules out a lot of model confusion. But the split was aesthetic: 5.5 packed in educational labels and clickable information, while Opus delivered better lighting, composition, and visual authority — the kind of thing you’d actually want to show someone.

Why Codex changes the evaluation

Nate says the real product story is not just the model weights but the system around them: Codex, file access, browser control, memory, image generation, and the ability to act where the work actually lives. In his framing, ChatGPT is still the broad consumer surface, but Codex is where serious work happens because 5.5 can inspect a codebase, run commands, hit errors, patch files, rerender documents, and keep going. His line is basically that intelligence and agency multiply each other, and 5.5 inside tools feels “like a monster.”
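
The claim that intelligence and agency multiply is easiest to see as a loop. This is not the Codex API — just a generic sketch where `ask_model` and `apply_patch` are stand-ins for a model call and a file edit, and pytest provides the ground-truth error signal:

```python
# Generic act-observe-iterate loop; everything named here is a stand-in,
# not Codex internals. The only real dependency is running pytest.
import subprocess

def ask_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for any LLM call")

def apply_patch(patch: str) -> None:
    raise NotImplementedError("stand-in for writing the proposed edit to disk")

def agent_loop(task: str, max_steps: int = 5) -> bool:
    context = task
    for _ in range(max_steps):
        patch = ask_model(f"Propose a file edit for:\n{context}")
        apply_patch(patch)
        # Run the tests and feed the real failure back instead of guessing.
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True   # the work is done, not just described
        context = result.stdout + result.stderr
    return False          # out of budget; escalate to a human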

Reliability, routing, and the workflow he’d use now

He also argues availability is part of product quality, pointing to Anthropic status pages showing materially worse uptime and saying some Claude services are hovering around “one nine” (roughly 90% uptime) rather than the 99% or 99.9% of two or three nines. That leads into his routing playbook: 5.5 first for complex multi-step execution, Opus 4.7 for blank-canvas front-end taste, and a reference-image workflow when you want both taste and production strength. He closes by saying the future is not one-model loyalty but routing — and that 5.5 matters because it expands the class of work worth attempting, from executive handoffs to side-gig businesses like a palm-reading app or a custom LEGO generator powered by 5.5, Images 2.0, and Codex.
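
As a closing sketch, the routing playbook fits in a few lines. The model identifiers echo the episode’s naming; the task labels and the high-stakes gate are assumptions about how you’d wire it up:

```python
# Routing sketch: model names follow the episode, the task taxonomy is invented.
def route(task_type: str, high_stakes: bool = False) -> dict:
    """Pick a model per Nate's playbook and flag work that needs sign-off."""
    if task_type in {"blank_canvas_frontend", "visual_design"}:
        model = "claude-opus-4.7"   # taste: lighting, composition, polish
    else:
        model = "gpt-5.5"           # default to the stronger execution floor
    # Money, law, ops, or production data: a human validates regardless of model.
    return {"model": model, "needs_human_approval": high_stakes}

print(route("data_migration", high_stakes=True))
print(route("visual_design"))
```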