AI Engineer · 19m

20 days of compute vs 7 hours: rethinking what state-of-the-art means — Bertrand Charpentier, Pruna

TL;DR

  • Public leaderboards disagree more than people admit: Bertrand compares Arena, Design Arena, and Artificial Analysis and shows that image editing rankings shift a lot across them, with models like Hume landing around rank 5 on one board and rank 10 on another.

  • Aggregated scores hide the model you actually need: On task-specific image editing leaderboards such as object removal, background changes, and text edits, ChatGPT Image is not consistently number one, because different models are better at different jobs.

  • Manual inspection is doubly biased: His live audience image voting demo shows people prefer different outputs, and some even change their minds across examples, which is why a handful of prompts plus one evaluator is a weak way to pick a model.

  • Generic automated metrics can be noisy and misleading: He shows CLIP score rankings shifting across datasets with tiny score differences, then contrasts that with task-specific text rendering metrics that produce clearer separation and more stable rankings.

  • Compute efficiency changes the meaning of state of the art: Generating 26,000 ChatGPT Image samples at 62 seconds each takes about 20 days of compute, versus 7 hours and roughly $265 for Pruna's faster model, so quality gains have to be weighed against latency, price, and energy.

  • The right framing is Pareto optimality, not one winner: Bertrand recommends plotting quality against latency or cost and choosing among the models on the Pareto front, especially with task-specific metrics, because the best practical model is often a smaller performance-tuned one rather than a huge foundation model.
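The Pareto framing in the last point can be made concrete with a small sketch. The model names and numbers below are hypothetical placeholders, not figures from the talk; the assumption is that each candidate has one task-specific quality score (higher is better) and one latency (lower is better).

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float    # task-specific metric, higher is better (hypothetical values)
    latency_s: float  # seconds per image, lower is better (hypothetical values)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.latency_s <= b.latency_s
    strictly_better = a.quality > b.quality or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative data only; these are not measurements from the talk.
candidates = [
    Candidate("model_a", quality=0.90, latency_s=62.0),  # best quality, slowest
    Candidate("model_b", quality=0.86, latency_s=1.0),   # slightly lower quality, far faster
    Candidate("model_c", quality=0.80, latency_s=4.0),   # worse than model_b on both axes
    Candidate("model_d", quality=0.70, latency_s=0.5),   # fastest, lowest quality
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.name}: quality={c.quality}, latency={c.latency_s}s")
```

In this toy data, model_c drops out because model_b beats it on both axes, while the other three remain defensible choices depending on how much quality a given use case is willing to trade for speed. The same function works with cost in place of latency.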

The Breakdown

One benchmark run for ChatGPT Image can cost 20 days of compute, about $5,000, and enough energy to equal 400 marathons, while a faster alternative does the same volume in 7 hours. Bertrand Charpentier argues that "state of the art" is usually not one best model at all, but a Pareto frontier of models balanced across quality, latency, and cost for a specific use case.
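The headline figures can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes images are generated one after another with no parallelism, using the sample count, per-image latency, and dollar amounts quoted in this digest; the per-image latency of the faster model is derived here, not stated in the source.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

# Figures quoted in the digest.
num_samples = 26_000
slow_seconds_per_sample = 62
slow_cost_usd = 5_000
fast_total_hours = 7
fast_cost_usd = 265

# Sequential generation is an assumption; parallel requests would shorten wall-clock time.
slow_total_seconds = num_samples * slow_seconds_per_sample
slow_total_days = slow_total_seconds / SECONDS_PER_DAY

fast_total_seconds = fast_total_hours * SECONDS_PER_HOUR
fast_seconds_per_sample = fast_total_seconds / num_samples  # derived, not quoted

speedup = slow_total_seconds / fast_total_seconds
slow_cost_per_image = slow_cost_usd / num_samples
fast_cost_per_image = fast_cost_usd / num_samples

print(f"Slow run: {slow_total_days:.1f} days")
print(f"Fast model: {fast_seconds_per_sample:.2f} s per image (implied)")
print(f"Speedup: {speedup:.0f}x")
print(f"Cost per image: ${slow_cost_per_image:.3f} vs ${fast_cost_per_image:.3f}")
```

Under these assumptions the slow run works out to a little under 19 days, consistent with the rounded "20 days" figure, and the faster model comes to roughly one second and about one cent per image.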
