
Playbook
Tasteful Skills
“Tasteful Skills” argues that the best agent skills are not documentation or best-practice lists.
ARC-AGI is still the benchmark AI hasn’t crushed: Matthew Berman calls it the only major benchmark that hasn’t been saturated, with humans solving it at 100% while frontier models score under 1% on ARC-AGI 3.
The benchmark is really about generalization, not memorization: in ARC-AGI 1 and 2, you infer rules from a few examples and apply them to a new case, which feels trivial to humans but still trips up top models like GPT, Gemini, and Claude.
ARC-AGI 2 gets expensive fast without getting close to human performance: Berman highlights GPT 5.4 Pro Extra High scoring 72% at roughly $39 per task, with Gemini 3.1 Pro at 69% and Claude Opus 4.6 Medium at 68%, still far from the human 100%.
ARC-AGI cares about efficiency, not just raw capability: unlike benchmarks where you can throw tokens at the problem, the leaderboard tracks cost per task, making it a test of economical reasoning as much as accuracy.
ARC-AGI 3 turns the benchmark into a zero-instruction video game: you’re dropped into an unfamiliar interactive environment with limited moves and no tutorial, and Berman solves one by noticing that a plus-shaped switch changes the goal’s orientation before moving to the exit.
Frontier models basically faceplant on the interactive version: Berman shows GPT 5.4, Gemini 3.1 Pro Preview, Grok 4.2, and Claude Opus 4.6 failing on the showcased task, with the top model reaching just 0.3% at a cost of over $5,000.
Berman opens with a big claim: ARC-AGI is the only major benchmark that AI hasn’t fully saturated, and ARC-AGI 3 makes that gap even starker. His headline stat is simple and brutal: humans get 100%, AI gets less than 1%. That gap is why he calls it the coolest benchmark out there.
He walks through ARC-AGI 1 with a toy example: see a few pink three-square shapes, infer that adding a yellow square completes each 2x2 block, then apply that rule to a new grid. For a person, the answer pops out almost instantly; that’s the point. The benchmark is supposed to feel easy to humans while exposing how shaky AI still is at generalizing from tiny amounts of information.
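The rule from that toy example is easy to write down once a person has spotted it. The sketch below is only an illustration of that one rule, hard-coded by hand: the color codes, the function name `complete_blocks`, and the sample grid are assumptions made for this example, not taken from the benchmark. The hard part ARC-AGI measures, inferring the rule from a few examples, is exactly what this code skips.

```python
# Hypothetical color codes chosen for this sketch; real ARC tasks use their own palette.
EMPTY, PINK, YELLOW = 0, 1, 2


def complete_blocks(grid):
    """Fill the one empty cell of every 2x2 window that holds exactly three pink cells."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so the input grid is left untouched
    for r in range(rows - 1):
        for c in range(cols - 1):
            window = [(r, c), (r, c + 1), (r + 1, c), (r + 1, c + 1)]
            pink = [p for p in window if grid[p[0]][p[1]] == PINK]
            empty = [p for p in window if grid[p[0]][p[1]] == EMPTY]
            if len(pink) == 3 and len(empty) == 1:
                er, ec = empty[0]
                out[er][ec] = YELLOW  # the yellow square that completes the block
    return out


if __name__ == "__main__":
    # Two pink three-square shapes on an otherwise empty grid (made-up test input).
    grid = [
        [1, 1, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 1, 0],
    ]
    for row in complete_blocks(grid):
        print(row)
```

Writing the rule takes a dozen lines; the benchmark asks a model to come up with it unprompted from a handful of before-and-after grids.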
ARC-AGI 2 keeps the same “infer the rule from examples” setup but makes the latent logic much murkier. Berman works through a color-coded shape puzzle where yellow, green, blue, and red map to different internal gap patterns, and you can feel him reverse-engineering the rule live. It’s still solvable by ordinary people, but no longer obvious at a glance.
On ARC-AGI 1, the best models are already around 93-94%, so it’s close to being maxed out. ARC-AGI 2 is where the separation shows: GPT 5.4 Pro Extra High hits 72% at $39 per task, Gemini 3.1 Pro gets 69%, and Claude Opus 4.6 Medium lands at 68%. Berman’s contrast is sharp: on coding or math benchmarks, AI beats elite humans; here, average humans still beat the models.
Berman pauses before ARC-AGI 3 to explain why ARC-AGI feels special. First, it tracks cost per task, so brute-forcing with huge token spend counts against a model instead of serving as a workaround. Second, it measures something closer to everyday flexible reasoning, the kind regular humans use constantly, rather than specialist expertise.
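To make the cost angle concrete, here is a small back-of-the-envelope sketch. It uses only the one cost figure quoted in this recap (roughly $39 per task at 72% for GPT 5.4 Pro Extra High on ARC-AGI 2). The "cost per solved task" metric and the function name are assumptions for illustration, not how the ARC leaderboard scores entries, and the calculation assumes every attempt costs the same whether or not it succeeds.

```python
def cost_per_solved_task(cost_per_task: float, accuracy: float) -> float:
    """Average spend needed per correctly solved task (illustrative metric only)."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in the range (0, 1]")
    return cost_per_task / accuracy


if __name__ == "__main__":
    # Figures as quoted in the recap: about $39 per task at 72% accuracy.
    print(round(cost_per_solved_task(39.0, 0.72), 2))  # prints 54.17
```

Under those assumptions, each correct answer costs noticeably more than the per-task price suggests, which is why tracking cost alongside accuracy changes how a score reads.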
With ARC-AGI 3, the benchmark changes form completely: instead of static examples, you’re dropped into a little game world with arrows, a reset button, a yellow bar, a maze, and no explanation. Berman narrates his own thinking in real time, guessing what the UI means, testing one move, noticing the bar drop, and realizing the plus-shaped object probably changes the goal state. That “let me poke at the environment and form a theory” loop is exactly what the benchmark is testing.
He solves the game by hitting the plus first, which reorients the target, then moving to the exit — something he says would have taken about a minute without the commentary. Watching GPT 5.4 try the same task is almost painful: it takes the first step correctly, keeps returning to the same wrong area, and never thinks to touch the plus. Berman sounds genuinely stunned because to him it feels “so obvious,” a reminder that human intuition is carrying a lot more than we realize.
The results are wild: GPT 5.4, Gemini 3.1 Pro Preview, Grok 4.2, and Claude Opus 4.6 all effectively fail the showcased interactive task, while humans stay at 100%. He says the top model overall scores just 0.3%, and does it at a cost north of $5,000. ARC has released a paper, opened the benchmark for people to try, and put up a $2 million prize for anyone who can saturate it.