AI Engineer · June 4, 2026 · 28m

Text Diffusion — Brendon Dillon, Google DeepMind

TL;DR

Text diffusion trades token-by-token generation for block refinement: Instead of predicting one token at a time like GPT or Gemini, the model starts from random tokens and denoises an entire span over multiple passes, which lets it attend to future tokens and revise earlier text.
The latency win comes from hardware, not magic: Dillon explains that GPUs and TPUs are memory-bound for autoregressive decoding, so generating 256 tokens in 24 denoising passes can mean roughly 10 times fewer memory transfers than emitting 256 tokens one by one.
Gemini Diffusion reportedly matched Gemini 2.0 Flash quality with much better latency: In DeepMind's research preview from last year, the model served around 2,000 real tokens per second and stayed broadly competitive on quality, with some strengths in code.
Bidirectional reasoning lets the model fix itself mid-answer: On a math prompt whose answer is 39, the diffusion model first wrote 60, then 49, then corrected the full solution and updated the opening answer, while GPT-4o and Gemini 2.5 Flash each made mistakes on the same problem at the time.
Adaptive compute is built into the generation process: The model can stop early on easy tasks like outputting the first 100 digits of pi in 4 steps, take 18 steps for FizzBuzz, and spend 31 steps explaining quantum mechanics, with the stopping behavior learned by the model itself.
Low latency enables weird, compelling products: Dillon shows prototypes where HTML, text, comments, and UI states are generated live, including fake Wikipedia and Reddit pages, a fully generated operating system, and a voice-driven coding demo that built and restyled a to-do app in about 15 seconds.
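
The memory-transfer claim in the TL;DR can be checked with back-of-the-envelope arithmetic. The sketch below uses only the two figures quoted above (256 tokens, 24 denoising passes) and assumes, as a simplification, that each forward pass reads the model weights from memory exactly once. It ignores KV caches, batching, and any per-pass cost differences between the two approaches, and the variable names are illustrative rather than taken from the talk.

```python
# Figures taken from the summary above.
BLOCK_TOKENS = 256       # tokens generated per block
DENOISING_PASSES = 24    # refinement passes over that block

# Assumption: every forward pass reads the model weights from memory once.
autoregressive_passes = BLOCK_TOKENS   # one pass per emitted token
diffusion_passes = DENOISING_PASSES    # one pass per denoising step

ratio = autoregressive_passes / diffusion_passes
print(f"weight reads, autoregressive: {autoregressive_passes}")
print(f"weight reads, diffusion:      {diffusion_passes}")
print(f"reduction: about {ratio:.1f}x")
```

Under that assumption the ratio is 256 / 24, or about 10.7, which is consistent with the "roughly 10 times" figure attributed to Dillon.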

The Breakdown

Google DeepMind's Brendon Dillon says text diffusion can hit about 2,000 tokens per second while matching autoregressive model quality, and he shows why the real upside is not just speed but self-correcting reasoning and entirely new low-latency interfaces. His demos range from a math answer that changes from 60 to 49 and finally to the correct 39, to whole Wikipedia pages, Reddit threads, and even an operating system generated live on every click.
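
The refine-and-stop behavior described above can be pictured with a toy loop. This is a minimal sketch, not Gemini Diffusion's algorithm. `toy_denoiser` is a hypothetical stand-in for a learned model that cheats by looking at the target text, and the stopping rule is hand-written, whereas the summary says the real model learns when to stop. The sketch illustrates only the overall shape: start from random tokens, revise any position in the span on each pass, and stop once a pass changes nothing.

```python
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz "

def toy_denoiser(tokens, target, rng, per_pass=4):
    """Hypothetical stand-in for a learned model. It sees the whole span at
    once and repairs up to `per_pass` wrong positions, chosen anywhere in it."""
    wrong = [i for i, (cur, want) in enumerate(zip(tokens, target)) if cur != want]
    refined = list(tokens)
    for i in rng.sample(wrong, min(per_pass, len(wrong))):
        refined[i] = target[i]
    return refined

def generate(target, max_steps=32, seed=0):
    """Refine a span of random tokens until a pass leaves it unchanged."""
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB) for _ in target]  # start from random tokens
    for step in range(1, max_steps + 1):
        refined = toy_denoiser(tokens, target, rng)
        if refined == tokens:  # nothing left to revise: stop early
            return "".join(tokens), step
        tokens = refined
    return "".join(tokens), max_steps

if __name__ == "__main__":
    for prompt in ["pi", "fizzbuzz", "explain quantum mechanics"]:
        text, steps = generate(prompt)
        print(f"{steps:2d} passes -> {text!r}")
```

Shorter spans converge in fewer passes here simply because there is less to fix, a loose analogy to the adaptive step counts reported in the talk.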