
DeepMind’s real goal is one multimodal “world model,” not a pile of separate media tools: Guillaume Vernade says image, video, and audio generation ship as separate products like Veo, Nano Banana, and Lyria for release safety, but the underlying vision is a single model that understands and outputs across image, video, audio, sensor data, and text.
Google is shipping at a breakneck pace, which is great for demos and rough on developers: Vernade says DeepMind releases something every 5 days on average, with gen-media updates roughly monthly, which is why his developer advocate job is equal parts docs, demos, prompt guides, and pushing internal teams to make APIs sane.
The practical pattern in this workshop is using Gemini to write better prompts for media models: Vernade feeds an entire Project Gutenberg book into Gemini, uses structured JSON output to generate character and chapter prompts, then passes those into Nano Banana, Veo, Lyria, and TTS to build a full illustrated, scored adaptation.
Prompting plus references beats memory alone for character consistency: in the Wind in the Willows demo, Vernade first relies on chat history to keep characters visually consistent, then shows a better method, which is to ask Gemini which characters appear in a chapter and pass only those saved reference images into image generation.
Veo and Lyria are getting cheap enough to prototype seriously: he cites Veo 3.1 Light at about $0.05 per second, or roughly $0.40 per video, and a Lyria clip at around $0.04 for a 30-second song, framing the cheaper models as iteration tools before you upscale.
Some of the most interesting tricks are weird hacks, not headline launches: Vernade shows how to fake multiple speaking characters from only two TTS voices by changing the speaking style in parentheses, and says his favorite underused model is Lyria Realtime, which can morph music live, DJ-style, for game-like adaptive soundtracks.
Guillaume Vernade opens by joking about being one of the few who made it past building security, then sketches his path from video game producer to Stadia to Google DeepMind. His job, he says, is making sure developers have what they need when Google ships models — docs, samples, demos, prompt guides — while also fighting internally for common-sense product decisions, like “just swap the model name and it works.”
He frames gen media as part of DeepMind’s long-term “world model” vision: a model that can ingest as many modalities as possible and output across them too. The reason users see separate image, video, and music models is mostly product pragmatism; if you ship one giant thing and update it constantly, you risk breaking everything at once.
Vernade tells a funny-but-real story about Gemini 1.5 sometimes claiming “I’m just an LLM, I can’t deal with images,” because leftovers from earlier training still lingered after multimodal support was restored. He uses that to show how fast this field has moved — a year and a half ago multimodal felt new, and now a non-multimodal model feels basically unusable.
He quickly tours the latest media stack: Nano Banana 2 now supports outputs from 520 pixels to 4K and can ground on web-searched images; Veo 3.1 Light just launched at roughly 5 cents per second; Lyria can generate 30-second clips or 3-minute songs. Then he slips in his personal favorite: Lyria Realtime, a live predictive music model that can pivot genres in about two seconds like an AI DJ.
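As a rough budgeting sketch, the prices quoted in this piece can be turned into a per-project estimate. The 8-second clip length below is an assumption inferred from the quoted figures ($0.40 per video at $0.05 per second), not something stated in the talk, and the function names are illustrative. Amounts are kept in whole cents to avoid floating-point rounding.

```python
# Prices as quoted in the talk summary; treat them as approximate.
VEO_LIGHT_CENTS_PER_SECOND = 5   # "roughly 5 cents per second"
LYRIA_CLIP_CENTS = 4             # about $0.04 for a 30-second clip

# Assumption: clip length inferred from $0.40 per video / $0.05 per second.
ASSUMED_CLIP_SECONDS = 8


def project_cost_cents(video_clips: int, music_clips: int,
                       clip_seconds: int = ASSUMED_CLIP_SECONDS) -> int:
    """Estimate the total cost in cents for a batch of video and music clips."""
    video = video_clips * clip_seconds * VEO_LIGHT_CENTS_PER_SECOND
    music = music_clips * LYRIA_CLIP_CENTS
    return video + music


def format_dollars(cents: int) -> str:
    """Render a cent amount as a dollar string."""
    return f"${cents / 100:.2f}"


# One video clip and one music cue for each chapter of a 12-chapter book.
estimate = project_cost_cents(video_clips=12, music_clips=12)
print(format_dollars(estimate))
```

At these prices a first pass over a whole book costs a few dollars, which fits the framing of the cheaper models as iteration tools.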
The workshop itself is simple and clever: take a public-domain book from Project Gutenberg, upload it, feed the whole thing into Gemini, and use structured JSON output to generate prompts for characters and chapters. He picks The Wind in the Willows, invents a “colorful building block style,” and shows how Gemini can keep that style coherent across an entire illustrated adaptation.
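The structured-output step might look something like the sketch below. It does not call any model: the schema, its field names, and the sample reply are all hypothetical stand-ins, and the workshop's actual schema is not given in this summary. The point is only that a fixed JSON shape lets the reply be parsed mechanically and the shared style appended to every prompt.

```python
import json
from dataclasses import dataclass

STYLE = "colorful building block style"  # the style named in the workshop


def _prompt_list(label_field: str) -> dict:
    """JSON Schema fragment for a list of {label, image_prompt} objects."""
    return {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                label_field: {"type": "string"},
                "image_prompt": {"type": "string"},
            },
            "required": [label_field, "image_prompt"],
        },
    }


# Hypothetical schema a structured-output request could ask the model to follow.
BOOK_PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "characters": _prompt_list("name"),
        "chapters": _prompt_list("title"),
    },
    "required": ["characters", "chapters"],
}


@dataclass
class ImagePrompt:
    label: str
    prompt: str


def parse_book_plan(raw_json: str, style: str = STYLE) -> dict[str, list[ImagePrompt]]:
    """Parse the model's JSON reply and append the shared style to each prompt."""
    data = json.loads(raw_json)
    plan: dict[str, list[ImagePrompt]] = {}
    for section, label_field in (("characters", "name"), ("chapters", "title")):
        plan[section] = [
            ImagePrompt(entry[label_field], f"{entry['image_prompt']}, in a {style}")
            for entry in data[section]
        ]
    return plan


# Made-up reply standing in for what a model might return.
sample_reply = json.dumps({
    "characters": [{"name": "Mole", "image_prompt": "A shy mole in a velvet jacket"}],
    "chapters": [{"title": "The River Bank", "image_prompt": "Mole and Rat rowing on a river"}],
})
plan = parse_book_plan(sample_reply)
print(plan["chapters"][0].prompt)
```

Appending the style string in one place, rather than asking the model to repeat it in every prompt, is one simple way to keep the look consistent across the book.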
First he uses chat mode so the model remembers earlier character images, then admits that’s not really the scalable way to do it. The better pattern is asking Gemini to list which characters appear in each chapter and passing only those saved reference images into generation — a cleaner setup that gives Nano Banana stronger context than just hoping the chat history does the job.
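The reference-selection pattern can be sketched in a few lines. The file paths, the character list, and the request shape below are hypothetical; in practice the list of characters per chapter would come from asking Gemini, and the selected images would be passed to whatever image-generation call is in use.

```python
# Hypothetical store of saved character reference images (paths are made up).
REFERENCE_IMAGES = {
    "Mole": "refs/mole.png",
    "Rat": "refs/rat.png",
    "Toad": "refs/toad.png",
    "Badger": "refs/badger.png",
}


def select_references(chapter_characters: list[str],
                      references: dict[str, str] = REFERENCE_IMAGES) -> list[str]:
    """Return reference images only for characters that appear in the chapter.

    Names are normalized, unknown characters are skipped, and duplicates
    are dropped while preserving first-seen order.
    """
    selected: list[str] = []
    seen: set[str] = set()
    for name in chapter_characters:
        key = name.strip().title()
        if key in references and key not in seen:
            selected.append(references[key])
            seen.add(key)
    return selected


def build_image_request(chapter_prompt: str, chapter_characters: list[str]) -> dict:
    """Bundle the chapter prompt with just the relevant reference images."""
    return {
        "prompt": chapter_prompt,
        "reference_images": select_references(chapter_characters),
    }


# "Otter" has no saved reference here, so it is skipped.
request = build_image_request(
    "Mole and Rat rowing on a river",
    ["Mole", "rat", "Otter", "Mole"],
)
print(request["reference_images"])
```

Passing two relevant images instead of the whole cast keeps the generation context small and explicit, which is the advantage over relying on chat history.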
From there he animates chapter images with Veo, using the image as the first frame and having Gemini write a fresh prompt about “what happens in the next few seconds.” The same move works for music: Gemini writes chapter-specific prompts for Lyria, producing pastoral, adventurous, and ominous cues, and he notes that all of Google’s gen-media models are heavily trained on Gemini-written prompts, which is why this handoff works so well.
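The two-step handoff described here (a language model writes the prompt, a media model consumes it along with the first frame) can be sketched with the model calls injected as plain functions. Everything below is illustrative: the meta-prompt wording, the function names, and the stub callables are assumptions, and no real API is invoked.

```python
from typing import Callable


def build_meta_prompt(chapter_summary: str) -> str:
    """Instruction asking a language model to write the video prompt."""
    return (
        "The attached image is the first frame of a short video. "
        "Write one prompt describing what happens in the next few seconds. "
        f"Chapter context: {chapter_summary}"
    )


def animate_chapter(
    first_frame: str,
    chapter_summary: str,
    write_prompt: Callable[[str], str],
    generate_video: Callable[[str, str], str],
) -> str:
    """Ask the language model for a video prompt, then hand it to the video model."""
    video_prompt = write_prompt(build_meta_prompt(chapter_summary))
    return generate_video(first_frame, video_prompt)


# Stubs standing in for real model calls, so the flow can be run offline.
def fake_write_prompt(meta_prompt: str) -> str:
    return "The boat drifts forward as ripples spread across the water."


def fake_generate_video(first_frame: str, prompt: str) -> str:
    return f"video({first_frame}): {prompt}"


result = animate_chapter(
    "chapter_01.png",
    "Mole meets Rat and goes boating.",
    fake_write_prompt,
    fake_generate_video,
)
print(result)
```

The same shape would apply to the music step: swap the meta-prompt for one that asks for a mood and instrumentation, and swap the video callable for a music one.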
His TTS trick gets one of the best reactions: with only two voice slots, he has Gemini write dialogue like a play and annotate delivery styles so one shared voice sounds like multiple distinct characters. In Q&A, he bluntly acknowledges Europe’s frustration — preview models are often global-endpoint-only, and that’s his “P0” fight internally — before ending on Lyria Realtime and a “Space DJ” demo that turns prompt-blending into a playful, game-like music system.
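The two-voice trick can be illustrated as a formatting step. The cast, the style descriptions, the sample lines, and the exact script layout below are invented for illustration; the summary only says that dialogue is written like a play with delivery styles in parentheses, so the precise format a given TTS model expects is an assumption.

```python
# Only two voice slots are available.
VOICE_SLOTS = ("Speaker 1", "Speaker 2")

# Hypothetical cast: several characters share each slot and are told apart
# by the delivery style written in parentheses.
CAST = {
    "Narrator": ("Speaker 1", "calm, warm storyteller"),
    "Toad": ("Speaker 1", "loud, boastful, theatrical"),
    "Mole": ("Speaker 2", "timid, soft, slightly breathless"),
    "Rat": ("Speaker 2", "brisk, cheerful, confident"),
}


def to_tts_script(lines: list[tuple[str, str]]) -> str:
    """Turn (character, text) pairs into a play-style script for two voices."""
    out = []
    for character, text in lines:
        slot, style = CAST[character]  # raises KeyError for an unknown character
        out.append(f"{slot}: ({style}) {text}")
    return "\n".join(out)


# Invented sample dialogue, not quoted from the book.
script = to_tts_script([
    ("Narrator", "The river glittered in the sun."),
    ("Mole", "Is that really a river?"),
    ("Rat", "Hop in, I'll row."),
])
print(script)
```

Here Mole and Rat both map to the second voice slot, and only the parenthetical style distinguishes them.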