AI Engineer · May 18, 2026 · 1h 17m

Let's go Bananas with GenMedia — Guillaume Vernade, Google DeepMind

TL;DR

DeepMind’s real goal is one multimodal “world model,” not a pile of separate media tools — Guillaume Vernade says image, video, audio, sensors, and text are being shipped as separate products like Veo, Nano Banana, and Lyria for release safety, but the underlying vision is a single model that understands and outputs across modalities.
Google is shipping at a breakneck pace, which is great for demos and rough on developers — Vernade says DeepMind releases something every 5 days on average, with gen-media updates roughly monthly, which is why his developer advocate job is equal parts docs, demos, prompt guides, and pushing internal teams to make APIs sane.
The practical pattern in this workshop is: use Gemini to write better prompts for media models — He feeds an entire Gutenberg book into Gemini, uses structured JSON output to generate character and chapter prompts, then passes those into Nano Banana, Veo, Lyria, and TTS to build a full illustrated, scored adaptation.
Prompting plus references beats memory alone for character consistency — In the Wind in the Willows demo, Vernade first relies on chat history to keep characters visually consistent, then shows a better method: ask Gemini which characters appear in a chapter and pass only those saved reference images into image generation.
Veo and Lyria are getting cheap enough to prototype seriously — He cites Veo 3.1 Light at about $0.05 per second, or roughly $0.40 per video, and Lyria clip at around $0.04 for a 30-second song, framing the cheaper models as iteration tools before you upscale.
Some of the most interesting tricks are weird hacks, not headline launches — Vernade shows how to fake multiple speaking characters from only two TTS voices by changing speaking style in parentheses, and says his favorite underused model is Lyria Realtime, which can DJ-style morph music live for game-like adaptive soundtracks.

Summary

A developer advocate inside the Google machine

Guillaume Vernade opens by joking about being one of the few who made it past building security, then sketches his path from video game producer to Stadia to Google DeepMind. His job, he says, is making sure developers have what they need when Google ships models — docs, samples, demos, prompt guides — while also fighting internally for common-sense product decisions, like “just swap the model name and it works.”

The bigger bet: one world model, shipped in pieces

He frames gen media as part of DeepMind’s long-term “world model” vision: a model that can ingest as many modalities as possible and output across them too. The reason users see separate image, video, and music models is mostly product pragmatism; if you ship one giant thing and update it constantly, you risk breaking everything at once.

Gemini’s multimodal growing pains and the release treadmill

Vernade tells a funny-but-real story about Gemini 1.5 sometimes claiming “I’m just an LLM, I can’t deal with images,” because leftovers from earlier training still lingered after multimodal support was restored. He uses that to show how fast this field has moved — a year and a half ago multimodal felt new, and now a non-multimodal model feels basically unusable.

Recent launches: Nano Banana 2, Veo 3.1 Light, and Lyria

He quickly tours the latest media stack: Nano Banana 2 now supports outputs from 520 pixels to 4K and can ground on web-searched images; Veo 3.1 Light just launched at roughly 5 cents per second; Lyria can generate 30-second clips or 3-minute songs. Then he slips in his personal favorite: Lyria Realtime, a live predictive music model that can pivot genres in about two seconds like an AI DJ.
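To put those prices in perspective, here is a rough budget estimator in Python. The rates are the approximate figures quoted in the talk, not official pricing, and the 8-second clip length is an assumption inferred from the "$0.40 per video" figure. The function name and parameters are made up for illustration.

```python
# Approximate rates as quoted in the talk; treat them as ballpark numbers.
VEO_LIGHT_USD_PER_SECOND = 0.05
LYRIA_CLIP_USD = 0.04  # one 30-second clip


def estimate_cost(
    num_chapters: int,
    video_seconds: float = 8.0,   # assumed clip length (0.40 / 0.05)
    takes_per_chapter: int = 1,   # how many video attempts per chapter
) -> dict:
    """Estimate the cost of one video and one music clip per chapter."""
    video = (
        num_chapters * takes_per_chapter * video_seconds * VEO_LIGHT_USD_PER_SECOND
    )
    music = num_chapters * LYRIA_CLIP_USD
    return {
        "video_usd": round(video, 2),
        "music_usd": round(music, 2),
        "total_usd": round(video + music, 2),
    }


if __name__ == "__main__":
    # Three video takes per chapter for a 12-chapter book.
    print(estimate_cost(num_chapters=12, takes_per_chapter=3))
```

Under these assumptions, even several takes per chapter stays in the range of a few dollars, which fits the framing of the cheaper models as iteration tools.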

Turning Wind in the Willows into a multimodal demo

The workshop itself is simple and clever: take a public-domain book from Project Gutenberg, upload it, feed the whole thing into Gemini, and use structured JSON output to generate prompts for characters and chapters. He picks The Wind in the Willows, invents a “colorful building block style,” and shows how Gemini can keep that style coherent across an entire illustrated adaptation.
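A minimal sketch of what the structured-output step might look like on the receiving side, in Python. The JSON shape, field names, and `SAMPLE_REPLY` are assumptions made for illustration, not the schema used in the workshop. The reply is a hand-written stand-in, so the block runs without any API call.

```python
import json
from dataclasses import dataclass

STYLE = "colorful building block style"  # the style named in the talk


@dataclass
class Character:
    name: str
    image_prompt: str


@dataclass
class Chapter:
    number: int
    title: str
    image_prompt: str


def parse_book_plan(raw_json: str) -> tuple[list[Character], list[Chapter]]:
    """Unpack the model's JSON reply into typed records.

    Unexpected or missing fields raise TypeError, which acts as a cheap
    schema check.
    """
    data = json.loads(raw_json)
    characters = [Character(**item) for item in data["characters"]]
    chapters = [Chapter(**item) for item in data["chapters"]]
    return characters, chapters


def styled(prompt: str) -> str:
    """Append the shared visual style so every image prompt matches."""
    return f"{prompt.rstrip('. ')}, in a {STYLE}."


# Hand-written stand-in for a structured reply (hypothetical content).
SAMPLE_REPLY = """
{
  "characters": [
    {"name": "Mole", "image_prompt": "A small, shy mole in a velvet jacket"},
    {"name": "Rat", "image_prompt": "A cheerful water rat rowing a little boat"}
  ],
  "chapters": [
    {"number": 1, "title": "The River Bank",
     "image_prompt": "Mole and Rat picnicking beside a sunny river"}
  ]
}
"""

if __name__ == "__main__":
    characters, chapters = parse_book_plan(SAMPLE_REPLY)
    for character in characters:
        print(character.name, "->", styled(character.image_prompt))
    for chapter in chapters:
        print(chapter.number, chapter.title, "->", styled(chapter.image_prompt))
```

Keeping the style string in one place, rather than asking the model to repeat it in every prompt, is one simple way to keep the look coherent across the book.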

Character consistency, but with fewer hacks

First he uses chat mode so the model remembers earlier character images, then admits that’s not really the scalable way to do it. The better pattern is asking Gemini to list which characters appear in each chapter and passing only those saved reference images into generation — a cleaner setup that gives Nano Banana stronger context than just hoping the chat history does the job.
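The reference-selection step can be sketched in a few lines of Python. This assumes the character names for a chapter have already come back from Gemini as a list of strings and that reference images were saved to disk earlier. The function name, file paths, and lowercase-key convention are all hypothetical.

```python
from pathlib import Path


def select_references(
    chapter_characters: list[str],
    saved: dict[str, Path],
) -> list[Path]:
    """Return saved reference images for the characters in one chapter.

    Keys in `saved` are lowercase character names. Names without a saved
    image are skipped, and duplicates are returned only once.
    """
    seen: set[str] = set()
    references: list[Path] = []
    for name in chapter_characters:
        key = name.strip().lower()
        if key in saved and key not in seen:
            seen.add(key)
            references.append(saved[key])
    return references


if __name__ == "__main__":
    saved_references = {
        "mole": Path("refs/mole.png"),
        "rat": Path("refs/rat.png"),
        "toad": Path("refs/toad.png"),
    }
    # "Otter" has no saved image, so it is silently skipped.
    print(select_references(["Mole", "Rat", "Otter"], saved_references))
```

The returned paths would then be attached to the image-generation request alongside the chapter prompt, instead of relying on chat history to carry the earlier images.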

Veo, Lyria, and the fun of chaining models together

From there he animates chapter images with Veo, using the image as the first frame and having Gemini write a fresh prompt about “what happens in the next few seconds.” The same move works for music: Gemini writes chapter-specific prompts for Lyria, producing pastoral, adventurous, and ominous cues, and he notes that all of Google’s gen-media models are heavily trained on Gemini-written prompts, which is why this handoff works so well.
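The chaining pattern described above can be sketched as follows. The text model and the video model are passed in as plain callables so the example runs without any API; in practice they would wrap real model calls, and the text model would also receive the image itself. The function names and the prompt wording are assumptions for illustration.

```python
from typing import Callable


def motion_request(chapter_summary: str, seconds: int = 8) -> str:
    """Build the instruction sent to the text model (illustrative wording)."""
    return (
        "The attached illustration is the first frame of a short video.\n"
        f"Chapter summary: {chapter_summary}\n"
        f"Describe what happens in the next {seconds} seconds as a video prompt."
    )


def animate_chapter(
    chapter_summary: str,
    first_frame: bytes,
    ask_text_model: Callable[[str], str],
    make_video: Callable[[bytes, str], bytes],
) -> bytes:
    """Ask the text model for a motion prompt, then hand it to the video model."""
    video_prompt = ask_text_model(motion_request(chapter_summary))
    return make_video(first_frame, video_prompt)


if __name__ == "__main__":
    # Fake models, standing in for real calls.
    def fake_text_model(request: str) -> str:
        return "The boat drifts gently downstream as the two friends wave."

    def fake_video_model(frame: bytes, prompt: str) -> bytes:
        return b"VIDEO:" + prompt.encode("utf-8")

    clip = animate_chapter(
        "Mole meets Rat by the river.", b"\x89PNG", fake_text_model, fake_video_model
    )
    print(clip)
```

The same shape fits the music step: swap the request text for one that asks for a chapter-specific mood, and swap the video callable for a music one.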

TTS hacks, Europe pain, and the model he wishes more people used

His TTS trick gets one of the best reactions: with only two voice slots, he has Gemini write dialogue like a play and annotate delivery styles so one shared voice sounds like multiple distinct characters. In Q&A, he bluntly acknowledges Europe’s frustration — preview models are often global-endpoint-only, and that’s his “P0” fight internally — before ending on Lyria Realtime and a “Space DJ” demo that turns prompt-blending into a playful, game-like music system.
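A minimal sketch of the two-voice trick in Python. It assumes the TTS input is a play-style script where each line starts with a speaker label and a parenthesized delivery note. The labels `VoiceA` and `VoiceB`, the casting table, and the dialogue are invented for illustration and are not taken from the workshop or the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Casting:
    voice_slot: str  # which of the two TTS voices reads the line
    style: str       # delivery note placed in parentheses


def to_two_voice_script(
    lines: list[tuple[str, str]],
    cast: dict[str, Casting],
) -> str:
    """Render (character, text) pairs as a script that uses at most two voices."""
    slots = {casting.voice_slot for casting in cast.values()}
    if len(slots) > 2:
        raise ValueError(f"only two voice slots available, got {sorted(slots)}")
    return "\n".join(
        f"{cast[who].voice_slot}: ({cast[who].style}) {text}" for who, text in lines
    )


if __name__ == "__main__":
    cast = {
        "Narrator": Casting("VoiceA", "calm storyteller"),
        "Mole": Casting("VoiceB", "soft, timid"),
        "Toad": Casting("VoiceB", "loud, boastful"),
    }
    script = to_two_voice_script(
        [
            ("Narrator", "The two friends reached the water."),
            ("Mole", "Is this really a boat?"),
            ("Toad", "The finest on the river!"),
        ],
        cast,
    )
    print(script)
```

Mole and Toad share one voice slot here; only the parenthesized style differs, which is the lever the trick relies on.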
