AI Engineer · June 16, 2026 · 18m

You Might Not Need 50 Diffusion Steps — Ziv Ilan, Nvidia

TL;DR

You might not need 50 diffusion steps: Ilan argues that step distillation can shrink generation from roughly 20 to 50 denoising steps down to 4, 8, or even 1 while keeping quality high enough for real use.
Distillation is the biggest performance win: Unlike model compression in LLMs, diffusion distillation keeps the same parameter count but trains a student model to reach similar outputs in far fewer steps, which he says can mean 10x to 200x speedups.
Quantization is the easiest first move: Nvidia and Black Forest Labs used dynamic quantization on Flux 2, and Ilan frames pre-quantized Hugging Face checkpoints plus TRT-LLM Visual Gen as the fastest way to cut memory use and improve speed.
Caching works in diffusion, but differently than KV cache: Techniques like TeaCache skip recomputation when adjacent denoising steps barely change, and newer chunk-based methods recompute only the regions that move, for example a speaker walking across the stage while the audience stays still.
Real-time video likely needs multiple tricks stacked together: Ilan stresses these optimizations are incremental, so teams can combine quantization, multi-GPU parallelism, caching, and finally distillation rather than betting on one silver bullet.
You do not need a GB200 to start distilling: In the Q&A, he says distillation can also run on Hopper-class GPUs like H100 and H200, though compute and dataset needs depend heavily on model size (a 2B model versus a 40B video model) and on how niche the target domain is, such as protein generation.
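
The caching idea in the list above can be sketched in a few lines. This is a toy illustration only, not the implementation of any named method or library: `toy_model`, `denoise_with_cache`, the `threshold` parameter, and the update rule are all assumptions made up for the example. The loop reuses the last model output whenever the input has drifted by less than a relative threshold since the last real call.

```python
# Toy sketch of step-skipping in a denoising loop (all names are hypothetical).

def rel_change(a, b):
    """Summed absolute difference between a and b, relative to the size of b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1e-12
    return num / den


def denoise_with_cache(x, steps, model, threshold):
    """Run `steps` updates, reusing the cached model output when x barely moved.

    `model` stands in for the expensive denoiser forward pass.
    Returns the final latent and the number of real model calls.
    """
    cached_input = None
    cached_output = None
    calls = 0
    for _ in range(steps):
        if cached_input is not None and rel_change(x, cached_input) < threshold:
            out = cached_output              # cheap path: reuse cached prediction
        else:
            out = model(x)                   # expensive path: recompute, refresh cache
            cached_input, cached_output = list(x), out
            calls += 1
        x = [xi - oi / steps for xi, oi in zip(x, out)]
    return x, calls


def toy_model(x):
    # Stand-in "denoiser": predicts a small fraction of the current value.
    return [0.1 * xi for xi in x]


if __name__ == "__main__":
    latent = [1.0, -2.0, 0.5]
    _, full_calls = denoise_with_cache(latent, 50, toy_model, threshold=0.0)
    _, cached_calls = denoise_with_cache(latent, 50, toy_model, threshold=0.01)
    print("model calls without cache:", full_calls)
    print("model calls with cache:   ", cached_calls)
```

A threshold of 0.0 disables skipping, so every step calls the model. Raising it trades a small drift in the result for fewer expensive calls, which is the same trade-off the talk describes for real diffusion caches.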

The Breakdown

Nvidia's Ziv Ilan says the real bottleneck in diffusion is not model quality but the 20 to 50 denoising steps, and that cutting those steps to 4, 8, or even 1 through distillation is the clearest path to real-time image and video generation. He lays out a practical stack of quantization, caching, and step distillation, with Nvidia showing near real-time video on a single Blackwell B200.
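
To make the step-distillation idea concrete, here is a deliberately tiny sketch. It is not Nvidia's training recipe: the 50-step `teacher`, the one-parameter `student` weight `w`, and the learning rate are all assumptions chosen so the example runs in pure Python. The teacher reaches its output through many small updates, and the student is trained to land on the same output in a single step.

```python
# Toy step distillation: a one-step student learns to match a 50-step teacher.
import random


def teacher(x, steps=50):
    """Multi-step 'teacher': many small updates that shrink x toward 0."""
    for _ in range(steps):
        x = x - (1.0 / steps) * x
    return x


def train_student(samples, lr=0.1, epochs=200):
    """Fit a one-step student x -> w * x to the teacher outputs with MSE loss."""
    targets = [teacher(x) for x in samples]   # teacher runs once, offline
    w = 1.0
    for _ in range(epochs):
        grad = 0.0
        for x, t in zip(samples, targets):
            grad += 2.0 * (w * x - t) * x     # d/dw of (w*x - t)^2
        w -= lr * grad / len(samples)
    return w


if __name__ == "__main__":
    random.seed(0)
    data = [random.uniform(-1.0, 1.0) for _ in range(64)]
    w = train_student(data)
    x = 0.8
    print("teacher, 50 steps:", teacher(x))
    print("student, 1 step:  ", w * x)
```

In a real system the student is a full diffusion network with the same parameter count as the teacher, and the loss is considerably more involved. The toy only shows the shape of the objective: match the many-step result with far fewer model evaluations.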