AI Engineer · May 8, 2026 · 22 min

FLUX, Open Research, and the Future of Visual AI — Stephen Batifol, Black Forest Labs

TL;DR

Black Forest Labs is positioning itself as a visual AI research lab first, not just a model vendor: Stephen Batifol says BFL’s operating principle is to keep releasing state-of-the-art models, building on Stable Diffusion, Latent Diffusion, Flux 1, Flux Kontext, Flux 2, and now near-real-time systems.
Flux’s early breakout was about usable quality on accessible hardware: Flux 1 stood out because it was open source, ran on a laptop, had unusually strong anatomy, and briefly became the most-liked model on Hugging Face after a shoutout from Hugging Face’s Clem Delangue.
Flux 2 and the new 'klein' model are pushing image editing toward real time: Batifol claims Flux 2 can hit roughly 300 ms for generation and 500 ms for editing, while the newer klein model stays around 0.5 seconds versus roughly 15-20 seconds for Qwen on comparable editing tasks.
The core research bet is replacing external encoders with a self-supervised training method called SelfFlow: instead of aligning a generative model to frozen vision encoders like DINOv2 or DINOv3, BFL trains a student-teacher setup with high-noise and low-noise views so representation learning and generation happen in one system.
BFL argues SelfFlow improves more than image quality and scales across images, video, audio, and even robot actions: Batifol shows internal research comparisons where SelfFlow beats standard flow matching on audio, image, and video metrics, reduces text errors, anatomy glitches, and flicker, and improves a robot can-grasping demo.
The long game is 'visual intelligence' and world models for robotics: the talk ends by connecting faster image generation to bigger ambitions, namely models that understand geometry and interaction well enough to support interactive media, safe driving, manufacturing automation, and physical AI.

Summary

From Stable Diffusion roots to Flux’s breakout moment

Stephen Batifol opens by framing Black Forest Labs as the team behind Stable Diffusion, Latent Diffusion, and Flux, with customers including Microsoft, Adobe, Canva, and Mistral. He says Flux 1, launched in August 2024, was the company’s big arrival: open source, laptop-runnable, and good enough on anatomy that it felt like a real challenger rather than “some company coming out of nowhere.”

Flux Kontext made editing feel practical, not gimmicky

He revisits Flux Kontext as the first open-source model that combined text-to-image and image editing in one system, which now sounds normal but felt like a breakthrough at the time. His examples are playful and concrete: remove a snowflake from a woman’s face, move her to the streets of Freiburg, then make the whole background snowy again, all while keeping the same character intact.

Storyboards, product shots, and why speed mattered

Batifol says partners used Flux Kontext to build storyboard-like sequences, such as the recurring “seagull in a VR headset drinking a beer” scene that evolves shot by shot. The punchline was speed: while early GPT image editing could take 40-50 seconds, Flux Kontext took roughly 7-8 seconds, fast enough to be useful as input for video and animation workflows.

Flux 2 is the step toward “visual intelligence”

In November 2025, BFL released Flux 2, which Batifol calls their best model so far, showing photorealistic people, animals, and product photography that he says are “impossible to basically tell” from real images. He emphasizes that it is not just a generator: Flux 2 can take up to 10 reference images at once, assemble coherent outfits from multiple product shots, and place furniture like a sofa into realistic home scenes for e-commerce use cases.

Why today’s training setup is powerful but awkward

Then he steps back and explains the technical problem: generative models learn to denoise images, but that process alone doesn’t teach physical coherence, such as why a glass shouldn’t pass through a table. The common fix is representation alignment, where the generative model’s internal features are pulled toward those of an external, frozen vision encoder. He calls that a “Frankenstein setup” because it creates scaling ceilings, modality silos, and misaligned objectives; even a stronger encoder like DINOv3 can, counterintuitively, lead to worse downstream generation than DINOv2.
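To make the "external encoder" setup concrete, here is a minimal sketch of what a representation-alignment objective can look like. It is an illustration under stated assumptions, not BFL's or any library's actual training code: the function names, the use of cosine similarity, and the `align_weight` value are all hypothetical, and features are plain Python lists rather than real network activations.

```python
import math


def mse(pred, target):
    """Mean squared error, standing in for the usual denoising loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aligned_training_loss(pred, target, hidden_feat, encoder_feat,
                          align_weight=0.5):
    """Denoising loss plus a penalty for disagreeing with a frozen encoder.

    hidden_feat:  (hypothetical) features from inside the generative model
    encoder_feat: (hypothetical) features from an external frozen encoder
    align_weight: assumed trade-off coefficient, not a published value
    """
    denoise_term = mse(pred, target)
    # 0 when the two feature vectors point the same way, up to 2 when opposed.
    align_term = 1.0 - cosine_similarity(hidden_feat, encoder_feat)
    return denoise_term + align_weight * align_term


# Perfect denoising but features orthogonal to the encoder's: only the
# alignment term contributes.
print(aligned_training_loss([1.0, 2.0], [1.0, 2.0], [1.0, 0.0], [0.0, 1.0]))
```

The second term is where the awkwardness described above comes from: the generative model is graded against a separate network that was trained for a different objective and never updates alongside it.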

SelfFlow: one training recipe for images, video, audio, and more

BFL’s answer is SelfFlow, a paper they released openly about six weeks earlier. The setup uses two noisy views of the same asset: a heavily noised input for the student and a lightly noised one for a teacher model that tracks the student more stably, so the system learns generation and representation jointly without relying on an outside encoder.
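The two-view idea can be sketched in a few lines. This is a toy illustration and not the SelfFlow paper's implementation: it assumes the teacher is an exponential moving average (EMA) of the student, which is one common way for a teacher to track a student stably, and it assumes linear interpolation between data and noise plus a shared noise sample for both views. The noise levels (0.8 and 0.2), the `decay` value, and the one-line "network" are hypothetical placeholders.

```python
def noisy_view(clean, noise, t):
    """Blend data and noise: t=0 returns the clean sample, t=1 pure noise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


def features(weights, view):
    """Toy stand-in for a network's internal representation."""
    return [w * v for w, v in zip(weights, view)]


def self_distill_loss(student_weights, teacher_weights, clean, noise,
                      t_student=0.8, t_teacher=0.2):
    """Match student features (heavy noise) to teacher features (light noise).

    In a real system the teacher output would be a fixed target with no
    gradient flowing through it; here everything is plain arithmetic.
    """
    student_feat = features(student_weights,
                            noisy_view(clean, noise, t_student))
    teacher_feat = features(teacher_weights,
                            noisy_view(clean, noise, t_teacher))
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


def ema_update(teacher_weights, student_weights, decay=0.99):
    """Move the teacher a small step toward the student (assumed EMA rule)."""
    return [decay * tw + (1.0 - decay) * sw
            for tw, sw in zip(teacher_weights, student_weights)]


clean, noise = [1.0, -1.0], [0.3, 0.1]
student, teacher = [0.5, 0.5], [0.4, 0.6]
print(self_distill_loss(student, teacher, clean, noise))
teacher = ema_update(teacher, student)
```

The design point this illustrates is that the target representation comes from the model's own slowly moving copy looking at an easier view, so no separately trained encoder is needed.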

The demos: better text, cleaner anatomy, less flicker, stronger speech

Batifol walks through research outputs where SelfFlow fixes typical AI artifacts: misspelled text becomes readable, odd faces become more anatomically plausible, and video of a person doing push-ups suddenly looks like “perfect form” instead of a glitch reel. He also shows a multimodal example where the baseline garbles “hello from the black forest,” while the SelfFlow version actually says it cleanly and stops where it should.

The bigger destination: real-time editing, world models, and robots

He closes by tying the research to product speed and long-term strategy. BFL’s “klein” model does image generation and editing in around half a second, compared with Qwen at roughly 15-20 seconds, which he says opens the door to mockups, games, and film workflows that respond as fast as you think. The real ambition, though, is world models and physical AI: training systems that understand geometry and interaction well enough to power robots, safe driving, and manufacturing automation. In the Q&A he adds that the model’s memory is effectively stored in its context and token state rather than in a separate long-term world store.
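The latency figures quoted in the talk translate into speedup ranges with simple arithmetic. The sketch below only restates the numbers from this summary (about 0.5 s versus 15-20 s here, and 7-8 s versus 40-50 s for the earlier editing comparison); the helper names are made up for illustration, and the inputs are the speaker's claims rather than independent measurements.

```python
def speedup_range(slow_range_s, fast_range_s):
    """Return (min, max) speedup given (low, high) latencies in seconds."""
    slow_lo, slow_hi = slow_range_s
    fast_lo, fast_hi = fast_range_s
    # Worst case: fastest slow run vs. slowest fast run, and vice versa.
    return (slow_lo / fast_hi, slow_hi / fast_lo)


def images_per_minute(latency_s):
    """How many sequential generations fit into one minute."""
    return 60.0 / latency_s


# Half-second model vs. the 15-20 s figure quoted for Qwen.
print(speedup_range((15.0, 20.0), (0.5, 0.5)))  # (30.0, 40.0)

# Earlier comparison: 40-50 s vs. 7-8 s per edit.
low, high = speedup_range((40.0, 50.0), (7.0, 8.0))
print(round(low, 1), round(high, 1))  # 5.0 7.1

print(images_per_minute(0.5))  # 120.0
```

At roughly 120 sequential edits per minute, iteration is limited by the person rather than the model, which is the practical meaning of the real-time claim.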
