AI EngineerMay 20, 202616m

Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind

TL;DR

Google’s “any-to-any” pitch is real, but still stitched across multiple models — Patrick Löber says Gemini can take in text, code, image, audio, video, URLs, and Search, then produce text, images, speech, video, code, and function calls, but today that experience still relies on Gemini for understanding plus specialized generators like “Nano Banana” for images and separate speech models.
The core build pattern is simple: use Gemini as the reasoning brain, then let it call tools for media creation — his NotebookLM-style demo uses Gemini for cross-modal understanding of PDFs, lectures, and voice memos, then an agentic loop decides when to call image and speech generation tools instead of following a hard-coded workflow.
Gemini’s multimodal understanding is surprisingly practical at long context — Löber notes that 1 minute of audio is about 1,920 tokens, so with Gemini’s 1 million token window you can fit more than 9 hours of audio, while video support is roughly 1 hour, with timestamp controls and direct file, URL, and YouTube ingestion.
Context caching is an underrated cost lever for multimodal apps — when you repeatedly query long uploaded files, Gemini’s built-in caching can cut costs by 90%, which matters if your agent keeps revisiting the same PDFs, videos, or transcripts.
Native generation matters because the models ‘understand the world,’ not just style-transfer prompts — his examples included Nano Banana reconstructing the Golden Gate Bridge from arrows drawn on a map and correcting math homework visually, plus Gemini-based speech models that can produce multilingual voices and accents like British and Bavarian.
Google is also pushing real-time audio agents with Gemini 3.1 Flash Live — Löber describes the Live API as a native audio-in, audio-out model rather than a cascaded stack, enabling more natural back-and-forth conversation and visual grounding, which he points people to try at ai.studio/live.

Summary

From Gemini hype to the actual multimodal stack

Patrick Löber opens by framing the big vision: “any-to-any” agents that can understand text, code, images, audio, video, URLs, and Search, then generate text, images, speech, video, code, and function calls. But he quickly corrects the oversimplification with a laugh at his own “ugly slide” — this is not one magic multimodal model yet, but a stack of Gemini for understanding plus specialized generation models like Nano Banana and speech models.

The NotebookLM clone he wants you to build

The practical goal of the talk is a small NotebookLM-style app that can ingest PDFs, images, lectures, tutorials, and voice memos, then output summaries, podcasts, and infographics. The key twist is that he wants it built as an agent, not a workflow: Gemini acts as the reasoning model and decides what assets to create via tool calls instead of following a hard-coded pipeline.

Multimodal understanding is basically a few SDK calls

Löber makes this part feel almost boringly easy: get a free API key from AI Studio, install the Google AI SDK, upload files, and call client.models.generate_content with something like Gemini 3 Flash. He emphasizes the cross-modal part — the model can pull together facts across a PDF, an MP3, and a video and synthesize them into a single summary, which is exactly what you want for “Attention Is All You Need”-style study guides.

The practical details: transcription, token math, and cost savings

He drops a bunch of useful implementation notes in quick succession: even Flash and Flash Lite are good enough to transcribe audio if you just ask for a transcript. One minute of audio is around 1,920 tokens, so Gemini’s 1 million token context gets you more than 9 hours of audio, while video lands closer to 1 hour; you can also restrict analysis to timestamp ranges like minute 5 to 15. The sneaky important tip is context caching, which he says can save 90% on repeated queries over long files.

The agentic loop: Gemini thinks, tools generate

The second half is the generation phase, where Gemini becomes the “brain” and function calling does the work. He walks through the setup: define functions like generate_image and generate_speech, describe the parameters clearly, then prompt Gemini as a research partner that should decide which concepts need a visual diagram and which sections deserve an audio summary.

Why “native generation” is more than branding

Löber’s case for native generation is that these models inherit real world understanding from Gemini, which creates more grounded outputs. His favorite examples are fun and specific: Nano Banana 1 turning arrows on a map into a correct Golden Gate Bridge image, and Nano Banana 2 visually correcting math homework because it actually understands the content, not just the prompt surface.

The voices, accents, and the push toward live interaction

On speech, he plays with the room a bit by demoing multilingual text-to-speech and accent control, including a joking British voice ending at the pub and a Bavarian-accent example that gets a decent reaction. He closes on Gemini 3.1 Flash Live, describing it as a native audio-to-audio architecture rather than a cascaded pipeline, then shows a quick clip where the model responds conversationally and recognizes a person’s appearance in real time.

A few final breadcrumbs: embeddings, local multimodality, and transferability

In the final minute, he widens the aperture again: the same pattern applies beyond education, and Google is adding a multimodal embedding model for unified vector spaces and multimodal search. He also nods to Gemma 4 for local multimodal understanding, reinforcing the broader message that the building blocks for native multimodal agents are now here — and increasingly accessible.

Was This Useful?

LinkedIn X Email

Keep Reading

Tune your feedFive quick questions, and the feed ranks what matters to you first.

Or just get notified

The weekly Echo. Signal worth keeping in your inbox.

Every new piece, announced on X.

Follow @alcreon on X

Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind

Summary

From Gemini hype to the actual multimodal stack

The NotebookLM clone he wants you to build

Multimodal understanding is basically a few SDK calls

The practical details: transcription, token math, and cost savings

The agentic loop: Gemini thinks, tools generate

Why “native generation” is more than branding

The voices, accents, and the push toward live interaction

A few final breadcrumbs: embeddings, local multimodality, and transferability

Was This Useful?

Or just get notified

Read Next

The Retirement Email Isn't a Warning

The Cheapest Model That Passes

Cheap Models, Hard Tasks

Summary

From Gemini hype to the actual multimodal stack

The NotebookLM clone he wants you to build

Multimodal understanding is basically a few SDK calls

The practical details: transcription, token math, and cost savings

The agentic loop: Gemini thinks, tools generate

Why “native generation” is more than branding

The voices, accents, and the push toward live interaction

A few final breadcrumbs: embeddings, local multimodality, and transferability

Was This Useful?

Make Alcreon Yours

Or just get notified

Read Next

The Retirement Email Isn't a Warning

The Cheapest Model That Passes

Cheap Models, Hard Tasks