
Google’s “any-to-any” pitch is real, but it is still stitched across multiple models. Patrick Löber says Gemini can take in text, code, images, audio, video, URLs, and Search, then produce text, images, speech, video, code, and function calls. Today that experience relies on Gemini for understanding plus specialized generators such as “Nano Banana” for images and separate models for speech.
The core build pattern is simple: use Gemini as the reasoning brain, then let it call tools for media creation — his NotebookLM-style demo uses Gemini for cross-modal understanding of PDFs, lectures, and voice memos, then an agentic loop decides when to call image and speech generation tools instead of following a hard-coded workflow.
Gemini’s multimodal understanding is surprisingly practical at long context — Löber notes that 1 minute of audio is about 1,920 tokens, so with Gemini’s 1 million token window you can fit more than 9 hours of audio, while video support is roughly 1 hour, with timestamp controls and direct file, URL, and YouTube ingestion.
Context caching is an underrated cost lever for multimodal apps — when you repeatedly query long uploaded files, Gemini’s built-in caching can cut costs by 90%, which matters if your agent keeps revisiting the same PDFs, videos, or transcripts.
Native generation matters because the models ‘understand the world,’ not just style-transfer prompts — his examples included Nano Banana reconstructing the Golden Gate Bridge from arrows drawn on a map and correcting math homework visually, plus Gemini-based speech models that can produce multilingual voices and accents like British and Bavarian.
Google is also pushing real-time audio agents with Gemini 3.1 Flash Live. Löber describes the model behind the Live API as native audio-in, audio-out rather than a cascaded stack, which enables more natural back-and-forth conversation and visual grounding. He points people to ai.studio/live to try it.
Patrick Löber opens by framing the big vision: “any-to-any” agents that can understand text, code, images, audio, video, URLs, and Search, then generate text, images, speech, video, code, and function calls. But he quickly corrects the oversimplification with a laugh at his own “ugly slide” — this is not one magic multimodal model yet, but a stack of Gemini for understanding plus specialized generation models like Nano Banana and speech models.
The practical goal of the talk is a small NotebookLM-style app that can ingest PDFs, images, lectures, tutorials, and voice memos, then output summaries, podcasts, and infographics. The key twist is that he wants it built as an agent, not a workflow: Gemini acts as the reasoning model and decides what assets to create via tool calls instead of following a hard-coded pipeline.
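The difference between an agent and a workflow can be sketched in a few lines. This is a minimal illustration rather than Löber's code: `run_agent`, `scripted_decide`, and `STUB_TOOLS` are hypothetical names, and the scripted function stands in for Gemini so the loop can run without an API key.

```python
from typing import Callable, Optional

# A tool call is (tool name, keyword arguments); None means the model is finished.
ToolCall = Optional[tuple[str, dict]]


def run_agent(
    decide: Callable[[list], ToolCall],
    tools: dict[str, Callable[..., str]],
    max_steps: int = 5,
) -> list[tuple[str, str]]:
    """Ask the model for the next tool call until it stops or the step budget runs out."""
    produced: list[tuple[str, str]] = []
    for _ in range(max_steps):
        call = decide(produced)  # the model sees every asset created so far
        if call is None:
            break
        name, args = call
        if name not in tools:
            produced.append((name, "error: unknown tool"))
            continue
        produced.append((name, tools[name](**args)))
    return produced


# Stand-ins for the real model and generators (hypothetical, for illustration only).
def scripted_decide(produced: list) -> ToolCall:
    plan = [
        ("generate_image", {"prompt": "self-attention diagram"}),
        ("generate_speech", {"text": "Two-minute recap of the paper"}),
    ]
    return plan[len(produced)] if len(produced) < len(plan) else None


STUB_TOOLS = {
    "generate_image": lambda prompt: f"image<{prompt}>",
    "generate_speech": lambda text: f"audio<{text}>",
}

if __name__ == "__main__":
    for name, asset in run_agent(scripted_decide, STUB_TOOLS):
        print(name, "->", asset)
```

A hard-coded workflow would call both generators in a fixed order for every input. Here the order and the number of calls come from whatever `decide` returns, which is the role the reasoning model plays in the talk. The `max_steps` cap is a safety choice added for this sketch.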
Löber makes this part feel almost boringly easy: get a free API key from AI Studio, install the Google AI SDK, upload files, and call client.models.generate_content with something like Gemini 3 Flash. He emphasizes the cross-modal part — the model can pull together facts across a PDF, an MP3, and a video and synthesize them into a single summary, which is exactly what you want for “Attention Is All You Need”-style study guides.
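The upload-then-ask flow might look roughly like the sketch below. It assumes the `google-genai` Python package, whose client exposes `files.upload` and `models.generate_content`. The model ID, the file names, and the `summarize_sources` helper are placeholders invented for illustration, not values from the talk.

```python
from typing import Any, Sequence

# Assumption: placeholder model ID; use whichever ID AI Studio lists for the Flash model.
MODEL_ID = "gemini-3-flash-preview"

STUDY_GUIDE_PROMPT = (
    "Write a single study guide that combines the key facts from every attached source."
)


def summarize_sources(client: Any, paths: Sequence[str], model: str = MODEL_ID) -> str:
    """Upload each file, then ask for one summary that draws on all of them."""
    uploaded = [client.files.upload(file=path) for path in paths]
    response = client.models.generate_content(
        model=model,
        contents=[*uploaded, STUDY_GUIDE_PROMPT],  # file handles first, then the text prompt
    )
    return response.text


# With the real SDK (pip install google-genai) and an API key in the environment:
#
#   from google import genai
#   client = genai.Client()
#   print(summarize_sources(client, ["paper.pdf", "lecture.mp3", "tutorial.mp4"]))
```

The client is passed in as an argument so the function can be exercised with a fake object in tests. Large video uploads may need a short wait until the service finishes processing them before they can be referenced in a prompt.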
He drops a bunch of useful implementation notes in quick succession: even Flash and Flash Lite are good enough to transcribe audio if you just ask for a transcript. One minute of audio is around 1,920 tokens, so Gemini’s 1 million token context gets you more than 9 hours of audio, while video lands closer to 1 hour; you can also restrict analysis to timestamp ranges like minute 5 to 15. The sneaky important tip is context caching, which he says can save 90% on repeated queries over long files.
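The audio and caching figures above are easy to sanity-check with arithmetic. The sketch below rests on several assumptions. It treats the "1 million" window as exactly 1,048,576 tokens, uses an invented price of $1.00 per million input tokens, and ignores cache storage fees and output tokens. It also assumes the cached file is billed in full once and at a 90% discount on each later query.

```python
# Back-of-the-envelope numbers for long multimodal inputs.
TOKENS_PER_AUDIO_MINUTE = 1_920    # figure quoted in the talk
CONTEXT_WINDOW_TOKENS = 1_048_576  # assumption: 2**20; the talk only says "1 million"
CACHED_DISCOUNT = 0.90             # the "90%" saving quoted in the talk


def audio_hours_that_fit(window_tokens: int, reserved_tokens: int = 0) -> float:
    """Hours of audio that fit after reserving tokens for the prompt and the answer."""
    usable = max(window_tokens - reserved_tokens, 0)
    return usable / TOKENS_PER_AUDIO_MINUTE / 60


def repeated_query_cost(
    file_tokens: int,
    question_tokens: int,
    num_queries: int,
    price_per_million: float,  # hypothetical input price in dollars
) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) for input tokens only."""
    uncached_tokens = num_queries * (file_tokens + question_tokens)
    # Simplified model: the file is billed in full once, then at the
    # discounted rate on every query; questions are always billed in full.
    cached_tokens = file_tokens + num_queries * (
        file_tokens * (1 - CACHED_DISCOUNT) + question_tokens
    )
    per_token = price_per_million / 1_000_000
    return uncached_tokens * per_token, cached_tokens * per_token


if __name__ == "__main__":
    print(f"{audio_hours_that_fit(CONTEXT_WINDOW_TOKENS):.2f} hours of audio")
    without, with_cache = repeated_query_cost(500_000, 200, 20, price_per_million=1.00)
    print(f"without cache: ${without:.2f}, with cache: ${with_cache:.2f}")
```

Under these assumptions the window holds about 9.1 hours of audio, while a round 1,000,000 tokens would hold about 8.7, so the "more than 9 hours" figure depends on the exact window size. Twenty questions over a 500,000-token file cost about $10.00 uncached versus about $1.50 cached. That is a saving of roughly 85% rather than the full 90%, because the first pass and the questions themselves are still billed at the normal rate.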
The second half is the generation phase, where Gemini becomes the “brain” and function calling does the work. He walks through the setup: define functions like generate_image and generate_speech, describe the parameters clearly, then prompt Gemini as a research partner that should decide which concepts need a visual diagram and which sections deserve an audio summary.
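The function definitions described above could be declared along these lines. The function names come from the talk, but the parameter names, descriptions, and system prompt are guesses made for illustration. The `missing_arguments` helper is a hypothetical addition that checks a proposed call before it reaches a generator.

```python
# JSON-schema-style declarations of the kind Gemini function calling accepts.
# Parameter names and descriptions are assumptions, not taken from the talk.
GENERATE_IMAGE = {
    "name": "generate_image",
    "description": "Create a diagram or infographic that explains one concept from the sources.",
    "parameters": {
        "type": "object",
        "properties": {
            "prompt": {"type": "string", "description": "What the image should show."},
        },
        "required": ["prompt"],
    },
}

GENERATE_SPEECH = {
    "name": "generate_speech",
    "description": "Narrate a short audio summary of one section of the sources.",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "The script to read aloud."},
            "voice_style": {"type": "string", "description": "Optional accent or tone."},
        },
        "required": ["text"],
    },
}

SYSTEM_PROMPT = (
    "You are a research partner. Decide which concepts need a visual diagram "
    "and which sections deserve an audio summary, then call the matching tool."
)


def missing_arguments(declaration: dict, args: dict) -> list[str]:
    """List required parameters that a proposed tool call left out."""
    return [name for name in declaration["parameters"]["required"] if name not in args]


# With the google-genai SDK, these declarations would typically be wrapped as
# types.Tool(function_declarations=[GENERATE_IMAGE, GENERATE_SPEECH]) and passed
# through types.GenerateContentConfig(tools=[...]); check the current docs.
```

Clear descriptions matter here because they are the only information the model has when deciding which tool fits a concept. Checking for missing arguments before dispatch keeps a malformed call from reaching the image or speech generator.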
Löber’s case for native generation is that these models inherit real-world understanding from Gemini, which creates more grounded outputs. His favorite examples are fun and specific: Nano Banana 1 turning arrows on a map into a correct Golden Gate Bridge image, and Nano Banana 2 visually correcting math homework because it actually understands the content, not just the prompt surface.
On speech, he plays with the room a bit by demoing multilingual text-to-speech and accent control, including a joking British voice ending at the pub and a Bavarian-accent example that gets a decent reaction. He closes on Gemini 3.1 Flash Live, describing it as a native audio-to-audio architecture rather than a cascaded pipeline, then shows a quick clip where the model responds conversationally and recognizes a person’s appearance in real time.
In the final minute, he widens the aperture again: the same pattern applies beyond education, and Google is adding a multimodal embedding model for unified vector spaces and multimodal search. He also nods to Gemma 4 for local multimodal understanding, reinforcing the broader message that the building blocks for native multimodal agents are now here — and increasingly accessible.