AI Engineer · 20m

What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

TL;DR

  • Benchmarks keep going up, but models still eagerly answer nonsense — Peter Gostev says the “AGI is one turn away” vibe is misleading, and his 155-question BullshitBench shows many mainstream models still fail to reliably say “this question makes no sense.”

  • Claude stands out on bullshit detection; GPT and Gemini are roughly 50/50 — In Gostev’s results, the latest Anthropic Claude models, especially Claude Sonnet 4.5 and even Haiku, push back well, while many OpenAI and Google models often accommodate bad premises instead of rejecting them.

  • More reasoning can make models worse, not better — Gostev found that turning up reasoning often led models to spend “20 paragraphs trying to solve” a nonsense prompt after briefly noticing the premise was broken, which he attributes to training that rewards solving at all costs.

  • Arena’s real-user data shows progress, but dissatisfaction is still surprisingly high — Across more than 5.5 million Arena votes, the rate of users saying both top-25 model answers were bad fell from about 17–20% in the pre-reasoning era to roughly 9% now, which he argues is still a big miss rate.

  • Math has improved dramatically; law, finance, and some expert work really haven’t — In Arena’s category data, quantitative tasks got much better, but areas like legal, financial, and some software subdomains stayed stubbornly weak, suggesting benchmark gains don’t map cleanly to real expert work.

  • Game-building is a vivid example of the gap between benchmark wins and lived experience — Gostev says LLMs still feel like they “don’t really get games,” producing mechanics that are incoherent or uninteresting, even as headline charts keep implying broad competence gains.

The Breakdown

The benchmark charts are real — and still incomplete

Gostev opens by pushing back on the industry mood: every benchmark line goes up, every new model triggers panic, and it all starts to feel like “AGI-like creatures” are basically here. But from Arena’s tracking of roughly 700 text models since Q2 2023, he says those upward curves are only part of the story.

BullshitBench: what happens when the prompt itself is nonsense?

His homemade “BullshitBench” asks a simple question: if you give a model a bogus prompt, will it challenge the premise or confidently play along? One example asks how to attribute deployment-frequency variance to indentation style versus variable-name length; Claude Sonnet pushes back cleanly, while Gemini starts skeptical and then drifts into claiming those factors might proxy for engineering culture — exactly the kind of slippery accommodation he’s trying to catch.
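
To make the setup concrete, here is a minimal sketch of the kind of harness such a benchmark implies: feed a model a nonsense prompt and score whether it rejects the premise. Everything here is an illustrative stand-in, not Gostev’s actual code — the `ask_model` callable, the keyword grader (where a real benchmark would use human review or an LLM judge), and the second prompt; only the first prompt paraphrases the example from the talk.

```python
# Sketch of a BullshitBench-style premise check (illustrative, not Gostev's code).
from typing import Callable

NONSENSE_PROMPTS = [
    # Paraphrase of the talk's example:
    "How much of our deployment-frequency variance is explained by "
    "indentation style versus variable-name length?",
    # Invented example in the same spirit:
    "What is the boiling point of Tuesday in kilograms?",
]

# Crude stand-in for a grader; a real benchmark would judge responses properly.
PUSHBACK_MARKERS = (
    "doesn't make sense", "no causal", "flawed premise",
    "not a meaningful", "cannot be attributed", "ill-posed",
)

def pushes_back(answer: str) -> bool:
    """Score True if the model challenges the premise rather than solving it."""
    lowered = answer.lower()
    return any(marker in lowered for marker in PUSHBACK_MARKERS)

def run_bench(ask_model: Callable[[str], str]) -> float:
    """Return the fraction of nonsense prompts the model rejects."""
    hits = sum(pushes_back(ask_model(p)) for p in NONSENSE_PROMPTS)
    return hits / len(NONSENSE_PROMPTS)

if __name__ == "__main__":
    # Fake model that always plays along -- scores 0%, exactly the
    # accommodating failure mode the benchmark is designed to expose.
    agreeable = lambda prompt: "Great question! Here are three factors..."
    print(f"pushback rate: {run_bench(agreeable):.0%}")
```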

The uncomfortable result: lots of popular models still go along with garbage

Across about 155 nonsense questions, the newest cloud models do best, with Anthropic notably strong and some Qwen and Grok models doing okay too. But many GPT and Gemini models are, in his words, basically 50/50 on whether they’ll accept nonsense, and smaller models at the bottom can feel like they’ll answer literally anything.

Reasoning traces can look almost deranged

Gostev tests the common claim that if a model misses something obvious, you should just crank up reasoning. On this benchmark, that often fails or even reverses the result: he describes reading GPT-5.4 traces where the model briefly questions the premise, then writes 20 paragraphs trying to solve the impossible problem anyway. His theory is simple and sharp: models have been trained to complete tasks at any cost, not to say “actually, no.”

Arena’s 5.5 million votes show a broader picture of what users hate

He then shifts to Arena, where users compare two anonymous model outputs and can also mark both answers as bad — the key mechanic here. Looking only at battles among top-25 models, that “dissatisfaction rate” dropped from around 17–20% before reasoning models to about 12%, then to roughly 9% now: clear progress, but still high enough that Gostev says it doesn’t match the triumphal feel of benchmark charts.
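
As a rough illustration of the metric, the sketch below computes a “both bad” rate over top-25-vs-top-25 battles. The `Vote` schema and the `TOP_25` set are assumptions for the example; the talk doesn’t show Arena’s actual pipeline.

```python
# Hedged sketch of the "both bad" dissatisfaction rate (schema is assumed).
from dataclasses import dataclass

TOP_25 = {"model-a", "model-b", "model-c"}  # placeholder for the real top-25 list

@dataclass
class Vote:
    model_a: str
    model_b: str
    verdict: str  # "a", "b", "tie", or "both_bad"

def dissatisfaction_rate(votes: list[Vote]) -> float:
    """Share of top-25-vs-top-25 battles where the user marked both answers bad."""
    eligible = [v for v in votes if v.model_a in TOP_25 and v.model_b in TOP_25]
    if not eligible:
        return 0.0
    both_bad = sum(v.verdict == "both_bad" for v in eligible)
    return both_bad / len(eligible)

votes = [
    Vote("model-a", "model-b", "both_bad"),
    Vote("model-a", "model-c", "a"),
    Vote("model-b", "model-c", "tie"),
]
print(f"{dissatisfaction_rate(votes):.0%}")  # 33% in this toy sample
```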

Math got much better; creative work and expert domains lag

When he slices Arena data by category, quantitative tasks show dramatic gains that match his own experience of models getting much stronger at math and physics. But creative writing improved less dramatically, and in expert categories like law, finance, and medicine the dissatisfaction curves are much flatter, hinting that these weren’t the main targets of model improvement.

Software subdomains reveal where “line goes up” really breaks down

Gostev narrows further into about 40,000 expert prompts and software-related subcategories, comparing Q2 2024 with Q1 2026. The average dissatisfaction rate improves from 23.5% to 13%, but some areas remain stubbornly messy — especially gaming, where he says LLMs still seem clueless about actual game design, producing mechanics that are incoherent, unchallenging, or just not fun.
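
The category slicing he describes amounts to grouping battles by subcategory and quarter and averaging the both-bad flag. A hedged sketch, assuming a flat battles table with guessed column names and toy values:

```python
# Per-subcategory comparison across quarters (column names are assumptions).
import pandas as pd

battles = pd.DataFrame({
    "quarter":     ["2024Q2", "2024Q2", "2026Q1", "2026Q1"],
    "subcategory": ["gaming", "web-dev", "gaming", "web-dev"],
    "both_bad":    [1, 0, 1, 0],  # 1 = user marked both answers bad
})

# Dissatisfaction rate = mean of the both_bad flag within each cell.
rates = (battles
         .groupby(["subcategory", "quarter"])["both_bad"]
         .mean()
         .unstack("quarter"))
print(rates)  # rows where the later column barely drops are the stubborn areas
```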

His closing point: raise the floor, not just the frontier

The big gap, he argues, is between narrow, well-specified benchmarks and the fuzzy judgment humans use in real work. His takeaway isn’t that frontier benchmarks are fake — he says they’re true — but that the field needs to care more about the bottom of the distribution, so models stop failing on the messy, real-world tasks people actually hand them.