AI Engineer · June 1, 2026 · 19m

20 days of compute vs 7 hours: rethinking what state-of-the-art means — Bertrand Charpentier, Pruna

TL;DR

Public leaderboards disagree more than people admit: Bertrand compares Arena, Design Arena, and Artificial Analysis and shows that image editing rankings shift a lot across them, with models like Hume landing around rank 5 on one board and rank 10 on another.
Aggregated scores hide the model you actually need: On task-specific image editing leaderboards such as object removal, background changes, and text edits, ChatGPT Image is not consistently number one, because different models are better at different jobs.
Manual inspection is doubly biased: His live audience image voting demo shows people prefer different outputs, and some even change their minds across examples, which is why a handful of prompts plus one evaluator is a weak way to pick a model.
Generic automated metrics can be noisy and misleading: He shows CLIP score rankings shifting across datasets with tiny score differences, then contrasts that with task-specific text rendering metrics that produce clearer separation and more stable rankings.
Compute efficiency changes the meaning of state of the art: Generating 26,000 ChatGPT Image samples at 62 seconds each takes about 20 days of compute, versus 7 hours and roughly $265 for Pruna's faster model, so quality gains have to be weighed against latency, price, and energy.
The right framing is Pareto optimality, not one winner: Bertrand recommends plotting quality against latency or cost and choosing among the models on the Pareto front, especially with task-specific metrics, because the best practical model is often a smaller performance-tuned one rather than a huge foundation model.
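The compute figures in the TL;DR can be checked with a few lines of arithmetic. The sketch below assumes the 26,000 samples are generated one after another with no parallelism, which the summary does not state explicitly; under that assumption the 62-second model comes to a little under 19 days, consistent with the rounded "about 20 days", and the 7-hour run implies roughly one second per sample. The variable names are illustrative.

```python
# Figures quoted in the summary
SAMPLES = 26_000
SLOW_SECONDS_PER_SAMPLE = 62      # ChatGPT Image, per the talk
FAST_TOTAL_HOURS = 7              # Pruna's faster model, per the talk

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: samples are generated sequentially (no batching or parallel requests)
slow_total_seconds = SAMPLES * SLOW_SECONDS_PER_SAMPLE
slow_total_days = slow_total_seconds / SECONDS_PER_DAY

fast_total_seconds = FAST_TOTAL_HOURS * 60 * 60
fast_seconds_per_sample = fast_total_seconds / SAMPLES

# How many times faster the 7-hour run is in wall-clock terms
speedup = slow_total_seconds / fast_total_seconds

print(f"slow run: {slow_total_days:.1f} days")
print(f"fast run: {fast_seconds_per_sample:.2f} s per sample")
print(f"speedup:  {speedup:.0f}x")
```

On these numbers the gap is about 64x in time, so a modest quality advantage for the slower model has to be worth a great deal to justify it.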

The Breakdown

One benchmark run for ChatGPT Image can take about 20 days of compute, cost about $5,000, and use energy equivalent to 400 marathons, while a faster alternative handles the same volume in 7 hours. Bertrand Charpentier argues that "state of the art" is usually not a single best model, but a Pareto frontier of models that trade off quality, latency, and cost for a specific use case.
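To make the Pareto framing concrete, here is a minimal sketch of picking the Pareto front from a set of candidates scored on quality (higher is better) and latency (lower is better). The model names and numbers are hypothetical placeholders, not results from the talk, and `Candidate`, `dominates`, and `pareto_front` are names invented for this illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float    # task-specific metric, higher is better
    latency_s: float  # seconds per sample, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.latency_s <= b.latency_s
    strictly_better = a.quality > b.quality or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates, sorted by latency."""
    front = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.latency_s)


# Hypothetical scores for illustration only
candidates = [
    Candidate("model_a", quality=0.90, latency_s=62.0),
    Candidate("model_b", quality=0.85, latency_s=1.0),
    Candidate("model_c", quality=0.80, latency_s=5.0),  # slower and worse than model_b
    Candidate("model_d", quality=0.70, latency_s=0.5),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy set, `model_c` drops out because `model_b` beats it on both axes, while the other three stay on the front: each is the best choice for some trade-off between quality and latency. The same function works with cost in place of latency.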