
Black Forest Labs is positioning itself as a visual AI research lab first, not just a model vendor: Stephen Batifol says BFL’s operating principle is to keep releasing state-of-the-art models, building on Stable Diffusion, Latent Diffusion, Flux 1, Flux Kontext, Flux 2, and now near-real-time systems.
Flux’s early breakout was about usable quality on accessible hardware: Flux 1 stood out because it was open source, ran on a laptop, had unusually strong anatomy, and briefly became the most-liked model on Hugging Face after a shoutout from Hugging Face’s Clem Delangue.
Flux 2 and the new 'client' model are pushing image editing toward real time: Batifol claims Flux 2 can hit roughly 300 ms for generation and 500 ms for editing, and that the newer client model handles editing in around 0.5 seconds, versus roughly 15-20 seconds for Qwen on comparable tasks.
The core research bet is replacing external encoders with a self-supervised training method called SelfFlow: instead of aligning a generative model to frozen vision encoders like DINOv2 or DINOv3, BFL trains a student-teacher setup with high-noise and low-noise views so representation learning and generation happen in one system.
BFL argues SelfFlow improves more than image quality and scales across images, video, audio, and even robot actions: Batifol shows internal research comparisons where SelfFlow beats standard flow matching on audio, image, and video metrics, reduces text errors, anatomy glitches, and flicker, and improves a robot can-grasping demo.
The long game is 'visual intelligence' and world models for robotics: the talk ends by connecting faster image generation to bigger ambitions, namely models that understand geometry and interaction well enough to support interactive media, safe driving, manufacturing automation, and physical AI.
Stephen Batifol opens by framing Black Forest Labs as the team behind Stable Diffusion, Latent Diffusion, and Flux, with customers including Microsoft, Adobe, Canva, and Mistral. He says Flux 1, launched in August 2024, was the company’s big arrival: open source, laptop-runnable, and good enough on anatomy that it felt like a real challenger rather than “some company coming out of nowhere.”
He revisits Flux Kontext as the first open-source model that combined text-to-image and image editing in one system, which now sounds normal but felt like a breakthrough at the time. His examples are playful and concrete: remove a snowflake from a woman’s face, move her to the streets of Freiburg, then make the whole background snowy again, all while keeping the same character intact.
Batifol says partners used Flux Kontext to build storyboard-like sequences, such as the recurring “seagull in a VR headset drinking a beer” scene that evolves shot by shot. The punchline was speed: while early GPT image editing could take 40-50 seconds, Flux Kontext took more like 7-8 seconds, which made it practical as input for video and animation workflows.
In November, BFL released Flux 2, which Batifol calls their best model so far, showing photorealistic people, animals, and product photography that he says are “impossible to basically tell” from real images. He emphasizes that it’s not just generation: Flux 2 can take up to 10 reference images at once, assemble coherent outfits from multiple product shots, and place furniture like a sofa into realistic home scenes for e-commerce use cases.
Then he steps back and explains the technical problem: generative models learn to denoise images, but that process alone doesn’t teach physical coherence — like why a glass shouldn’t pass through a table. The common fix is representation alignment with an external encoder, but he calls that a “Frankenstein setup” because it creates scaling ceilings, modality silos, and misaligned objectives; even a better encoder like DINOv3 can weirdly lead to worse downstream generation than DINOv2.
BFL’s answer is SelfFlow, a paper they released openly about six weeks earlier. The setup uses two noisy views of the same asset: a heavily noised input for the student and a lightly noised one for a teacher model that tracks the student more stably, so the system learns generation and representation jointly without relying on an outside encoder.
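The two-view setup described above can be illustrated with a toy sketch. This is not BFL's implementation: the linear noise blend, the exponential-moving-average (EMA) teacher update, the elementwise "features" function, and all names and constants here are assumptions chosen only to show the general shape of a student-teacher scheme with a heavily noised and a lightly noised view.

```python
import random

def noisy_view(x, t, rng):
    """Blend clean data with Gaussian noise (assumed linear blend).
    t=0 returns the clean data, t=1 returns pure noise."""
    return [(1.0 - t) * xi + t * rng.gauss(0.0, 1.0) for xi in x]

def features(weights, x):
    """Toy stand-in for a network's internal representation."""
    return [w * xi for w, xi in zip(weights, x)]

def alignment_loss(student_feats, teacher_feats):
    """Mean squared distance between student and teacher features."""
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

def ema_update(teacher, student, decay=0.99):
    """Assumed teacher rule: drift slowly toward the student weights."""
    return [decay * wt + (1.0 - decay) * ws for wt, ws in zip(teacher, student)]

rng = random.Random(0)
x = [0.5, -1.0, 2.0, 0.0]           # one "asset" (hypothetical data)
student_w = [1.0, 1.0, 1.0, 1.0]    # hypothetical student weights
teacher_w = [0.0, 0.0, 0.0, 0.0]    # hypothetical teacher weights

student_in = noisy_view(x, t=0.9, rng=rng)  # heavily noised view for the student
teacher_in = noisy_view(x, t=0.1, rng=rng)  # lightly noised view for the teacher

# The student is pushed to match what the teacher sees in the cleaner view.
loss = alignment_loss(features(student_w, student_in),
                      features(teacher_w, teacher_in))

# After a student step (omitted here), the teacher follows the student slowly.
teacher_w = ema_update(teacher_w, student_w, decay=0.99)
```

In a real system the alignment term would presumably be trained alongside the usual generation loss, and the student weights would be updated by gradient descent; both are left out here to keep the sketch short.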
Batifol walks through research outputs where SelfFlow fixes typical AI artifacts: misspelled text becomes readable, odd faces become more anatomically plausible, and video of a person doing push-ups suddenly looks like “perfect form” instead of a glitch reel. He also shows a multimodal example where the baseline garbles “hello from the black forest,” while the SelfFlow version actually says it cleanly and stops where it should.
He closes by tying the research to product speed and long-term strategy. BFL’s “client” model does image generation and editing in around half a second, compared with roughly 15-20 seconds for Qwen, which he says opens the door to mockups, games, and film workflows that respond as fast as you think. The real ambition, though, is world models and physical AI: training systems that understand geometry and interaction well enough to power robots, safe driving, and manufacturing automation. In the closing Q&A, he notes that the model’s memory is effectively stored in its context and token state rather than in a separate long-term world store.