Latent Space | June 1, 2026 | 1h 44m

Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents - Ethan He

TL;DR

Most video progress is really language progress: Ethan He says modern visual intelligence gains come mostly from prompt rewriting, planning, and language-model reasoning, not from the diffusion backbone alone.
Grok Imagine was built fast because iteration speed beat novelty: xAI shipped Grok Imagine 0.9 in about 3 months with a small team, strong infra, and rapid end-to-end cycles, where fixing data and training pipeline bugs often mattered more than inventing new algorithms.
Training frontier video models is already LLM-scale expensive: Ethan estimates costs are comparable to medium-scale language models, with tens of trillions of visual tokens, 20B-class models, and storage alone running to tens of petabytes and costing millions of dollars per month.
World models need three things, not just pretty video: Ethan defines a world model as real-time, interactive, long-horizon video, meaning the system must respond in pixels to mouse, keyboard, or voice input, stay coherent over minutes or hours, and do it fast enough to feel live.
Reference-to-video and video extension are stepping stones to full world models: Instead of brute-forcing massive context windows, xAI built features like full-history video extension and up to seven-image reference conditioning to carry characters, objects, and scenes across longer generations.
Video agents are the next commercial inflection: Ethan predicts that by the end of the year, production-grade video agents that iteratively call video generation, editing tools, and utilities like ffmpeg will be good enough for ads and enterprise workflows, and that this is when spending will really ramp.
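
To make the video-agent idea concrete, here is a minimal Python sketch of the kind of loop described above: plan shots from a brief, generate each one, review it, retry with a rewritten prompt, then hand the approved clips to ffmpeg. This is an illustration under stated assumptions, not xAI's implementation. The names `plan_shots`, `run_agent`, `fake_generate`, `fake_review`, and `concat_plan` are all hypothetical, and the generator and reviewer are stubs standing in for real model calls. The only real tool referenced is ffmpeg's concat demuxer, and the command is built but not executed.

```python
from typing import Callable, List, Tuple

# Signatures for the two pluggable tools (both hypothetical stand-ins).
Generate = Callable[[str, int, int], str]        # (prompt, shot_index, attempt) -> clip path
Review = Callable[[str, str], Tuple[bool, str]]  # (clip, prompt) -> (approved, feedback)


def plan_shots(brief: str) -> List[str]:
    """Stand-in for a language-model planner: one shot prompt per sentence."""
    return [s.strip() for s in brief.split(".") if s.strip()]


def run_agent(brief: str, generate: Generate, review: Review,
              max_attempts: int = 3) -> Tuple[List[str], List[Tuple[str, bool]]]:
    """Generate each planned shot, retrying with reviewer feedback in the prompt."""
    clips: List[str] = []
    log: List[Tuple[str, bool]] = []  # (clip path, approved?) for every attempt
    for shot in plan_shots(brief):
        prompt = shot
        for attempt in range(1, max_attempts + 1):
            clip = generate(prompt, len(clips), attempt)
            approved, feedback = review(clip, prompt)
            log.append((clip, approved))
            if approved:
                break
            # Prompt rewriting: fold the reviewer's feedback into the next attempt.
            prompt = f"{shot}. Fix: {feedback}"
        clips.append(clip)  # keep the last attempt even if it was never approved
    return clips, log


def concat_plan(clips: List[str], output: str = "ad.mp4") -> Tuple[str, List[str]]:
    """Build (but do not run) an ffmpeg concat-demuxer list file and command."""
    list_file = "".join(f"file '{c}'\n" for c in clips)
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
           "-i", "clips.txt", "-c", "copy", output]
    return list_file, cmd


def fake_generate(prompt: str, shot_index: int, attempt: int) -> str:
    """Stub generator: returns a file name instead of rendering video."""
    return f"shot{shot_index:02d}_try{attempt}.mp4"


def fake_review(clip: str, prompt: str) -> Tuple[bool, str]:
    """Stub reviewer: rejects the first attempt so the retry path is exercised."""
    if "Fix:" in prompt:
        return True, ""
    return False, "keep the product in frame"


if __name__ == "__main__":
    brief = "A runner laces up shoes. The runner crosses the finish line."
    clips, log = run_agent(brief, fake_generate, fake_review)
    list_file, cmd = concat_plan(clips)
    print(clips)
    print(" ".join(cmd))
```

In a real system the stubs would be replaced by calls to a video model and a critic, and the contents of `list_file` would be written to `clips.txt` before running the command. The point of the sketch is the shape of the loop: the language-model side (planning, reviewing, rewriting prompts) drives the iteration, while generation and ffmpeg are tools it calls.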

The Breakdown

xAI built Grok Imagine 0.9 from scratch in three months with just a few engineers, but Ethan He’s bigger claim is that the real gains in video generation now come more from language models than from diffusion itself. He argues the next wave is video agents and real-time world models, where AI plans, edits, and generates interfaces and long-form video interactively instead of just spitting out short clips.