20 days of compute vs 7 hours: rethinking what state-of-the-art means — Bertrand Charpentier, Pruna
TL;DR
Public leaderboards disagree more than people admit: Bertrand compares Arena, Design Arena, and Artificial Analysis and shows that image editing rankings shift a lot across them, with models like Hume landing around rank 5 on one board and rank 10 on another.
Aggregated scores hide the model you actually need: On task-specific image editing leaderboards such as object removal, background changes, and text edits, ChatGPT Image is not consistently number one, because different models are better at different jobs.
Manual inspection is doubly biased: His live audience image voting demo shows people prefer different outputs, and some even change their minds across examples, which is why a handful of prompts plus one evaluator is a weak way to pick a model.
Generic automated metrics can be noisy and misleading: He shows CLIP score rankings shifting across datasets with tiny score differences, then contrasts that with task-specific text rendering metrics that produce clearer separation and more stable rankings.
Compute efficiency changes the meaning of state of the art: Generating 26,000 ChatGPT Image samples at 62 seconds each takes about 20 days of sequential compute (26,000 × 62 s is roughly 1.6 million seconds), versus 7 hours and roughly $265 for Pruna's faster model, so quality gains have to be weighed against latency, price, and energy.
The right framing is Pareto optimality, not one winner: Bertrand recommends plotting quality against latency or cost and choosing among the models on the Pareto front, especially with task-specific metrics, because the best practical model is often a smaller performance-tuned one rather than a huge foundation model.
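The Pareto framing in the last point can be illustrated with a small sketch. This is not Bertrand's code: the `Candidate` type, the `pareto_front` helper, and every model name and number below are hypothetical placeholders, chosen only to show how a dominated model drops out when quality (higher is better) is plotted against latency (lower is better).

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float    # task-specific metric, higher is better (hypothetical scale)
    latency_s: float  # seconds per image, lower is better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that no other candidate dominates.

    A candidate is dominated if another one is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o.quality >= c.quality
            and o.latency_s <= c.latency_s
            and (o.quality > c.quality or o.latency_s < c.latency_s)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    # Sort by latency so the front reads from fastest to slowest.
    return sorted(front, key=lambda c: c.latency_s)


# Hypothetical models and scores, for illustration only.
models = [
    Candidate("model_a", quality=0.90, latency_s=62.0),
    Candidate("model_b", quality=0.85, latency_s=1.0),
    Candidate("model_c", quality=0.80, latency_s=5.0),  # slower and worse than model_b
    Candidate("model_d", quality=0.70, latency_s=0.5),
]

front = pareto_front(models)
for c in front:
    print(f"{c.name}: quality={c.quality}, latency={c.latency_s}s")
```

In this toy data, `model_c` is excluded because `model_b` beats it on both axes. The remaining three models are all defensible choices; which one is "best" depends on how much latency the use case can tolerate.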
The Breakdown
One benchmark run for ChatGPT Image can cost 20 days of compute, about $5,000, and as much energy as running 400 marathons, while a faster alternative handles the same volume in 7 hours. Bertrand Charpentier argues that "state of the art" is usually not a single best model, but a Pareto frontier of models balanced across quality, latency, and cost for a specific use case.
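The compute figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers from this summary (26,000 samples, 62 seconds each, 7 hours, $265, $5,000) and assumes images are generated one after another with no parallelism; the variable and function names are illustrative.

```python
# Figures quoted in the talk summary.
NUM_SAMPLES = 26_000
SLOW_SECONDS_PER_IMAGE = 62
SLOW_TOTAL_COST_USD = 5_000
FAST_TOTAL_HOURS = 7
FAST_TOTAL_COST_USD = 265

SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def sequential_runtime_days(num_samples: int, seconds_per_sample: float) -> float:
    """Wall-clock days needed when samples are generated strictly in sequence."""
    return num_samples * seconds_per_sample / SECONDS_PER_DAY


slow_days = sequential_runtime_days(NUM_SAMPLES, SLOW_SECONDS_PER_IMAGE)

# Implied per-image latency of the faster model, derived from its 7-hour total.
fast_seconds_per_image = FAST_TOTAL_HOURS * SECONDS_PER_HOUR / NUM_SAMPLES

# Ratio of total runtimes and per-image costs between the two options.
speedup = (slow_days * 24) / FAST_TOTAL_HOURS
slow_cost_per_image = SLOW_TOTAL_COST_USD / NUM_SAMPLES
fast_cost_per_image = FAST_TOTAL_COST_USD / NUM_SAMPLES

print(f"slow run: {slow_days:.1f} days")
print(f"fast model: {fast_seconds_per_image:.2f} s per image")
print(f"speedup: {speedup:.0f}x")
print(f"cost per image: ${slow_cost_per_image:.3f} vs ${fast_cost_per_image:.3f}")
```

Under the sequential assumption the slow run comes out just under 19 days, which the talk rounds to 20, and the 7-hour figure implies a latency of about one second per image for the faster model.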