From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind
TL;DR
Gemini's pitch is audio understanding, not just transcription: Thor Schaeff says Gemini 3 models are built to capture speech plus emotion, pacing, overlap, accents, and code-switching across languages in a single pass.
A single structured output request produced a rich transcript: In the Echo Script demo, Gemini 3 Flash Preview returned a summary, speaker labels, timestamps, language identification, English translations, and emotion tags from one API call.
Speech generation uses direction instead of giant voice catalogs: Rather than picking from hundreds of preset voices, Gemini starts with roughly 30 base voices and lets developers steer accent, tone, and scene with prompt-based "director's notes."
The live model keeps reasoning inside the audio stack: Gemini 3.1 Flash Live is a full-duplex speech-to-speech multimodal model that takes text, audio, and video over WebSockets and responds in real time, without a separate text pipeline bolted on after ASR.
Vision and speech work together in the live demo: Thor showed the model reading his camera feed and commenting on his Gemini shirt and backwards hat, then switching to German poetry while still keeping the Irish accent he had asked for.
Google is tying real-time conversation to music generation: The closing demo connected Gemini Live to Lyria 3 as a tool, turning a spoken request for a "German techno Schlager about the UK startup scene" into a generated song on the spot.
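To make the single structured-output request from the Echo Script bullet more concrete, here is a minimal sketch of what a response schema and a parser for it might look like. The field names (`speaker`, `timestamp`, `english_translation`, and so on), the `TRANSCRIPT_SCHEMA` dictionary, and the sample response are all illustrative assumptions; the talk summary does not show the demo's actual schema, and no API call is made here.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical JSON-schema-style description of the transcript we would ask
# the model to return. Field names are assumptions, not the demo's schema.
SEGMENT_FIELDS = [
    "speaker", "timestamp", "language", "text", "english_translation", "emotion",
]
TRANSCRIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {name: {"type": "string"} for name in SEGMENT_FIELDS},
                "required": SEGMENT_FIELDS,
            },
        },
    },
    "required": ["summary", "segments"],
}


@dataclass
class Segment:
    speaker: str
    timestamp: str
    language: str
    text: str
    english_translation: str
    emotion: str


@dataclass
class Transcript:
    summary: str
    segments: List[Segment]


def parse_transcript(raw_json: str) -> Transcript:
    """Parse a JSON response and check that every segment has the required fields."""
    data = json.loads(raw_json)
    segments = []
    for index, item in enumerate(data["segments"]):
        missing = [name for name in SEGMENT_FIELDS if name not in item]
        if missing:
            raise ValueError(f"segment {index} is missing fields: {missing}")
        segments.append(Segment(**{name: item[name] for name in SEGMENT_FIELDS}))
    return Transcript(summary=data["summary"], segments=segments)


# Made-up sample response, standing in for what the model might return.
SAMPLE_RESPONSE = json.dumps({
    "summary": "Two speakers greet each other in English and German.",
    "segments": [
        {"speaker": "Speaker 1", "timestamp": "00:00", "language": "en",
         "text": "Good morning, everyone.",
         "english_translation": "Good morning, everyone.", "emotion": "cheerful"},
        {"speaker": "Speaker 2", "timestamp": "00:03", "language": "de",
         "text": "Guten Morgen zusammen.",
         "english_translation": "Good morning, everyone.", "emotion": "calm"},
    ],
})

transcript = parse_transcript(SAMPLE_RESPONSE)
for segment in transcript.segments:
    print(segment.timestamp, segment.speaker, segment.language, segment.emotion)
```

Keeping the schema as plain data means the same dictionary could be passed to a structured-output request and reused for validation on the way back, so the request and the parser cannot drift apart.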
The Breakdown
One Gemini API call pulled out speaker names, timestamps, language switches, translations, and even emotion from a live multilingual demo, then the same stack jumped to Irish-accented speech, live multimodal conversation, and a custom German techno-Schlager song about the UK startup scene.
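The closing demo, where the live model hands a spoken request to a music model as a tool, follows the general function-calling pattern: the application declares a tool, the model emits a call with arguments, and the application dispatches it. The sketch below illustrates that pattern only. The tool name `generate_song`, its `prompt` parameter, the shape of the incoming call dictionary, and the stub handler are hypothetical; the summary does not describe how the demo actually wired the two models together.

```python
from typing import Any, Callable, Dict

# Hypothetical tool declaration in a JSON-schema-style function-calling format.
GENERATE_SONG_TOOL: Dict[str, Any] = {
    "name": "generate_song",
    "description": "Generate a short song from a natural-language prompt.",
    "parameters": {
        "type": "object",
        "properties": {
            "prompt": {
                "type": "string",
                "description": "Genre, language, and topic of the song.",
            },
        },
        "required": ["prompt"],
    },
}


def generate_song(prompt: str) -> Dict[str, Any]:
    """Stub handler: a real application would call a music model here."""
    return {"status": "queued", "prompt": prompt}


# Registry mapping tool names to their declaration and handler.
TOOLS: Dict[str, Dict[str, Any]] = {
    "generate_song": {"declaration": GENERATE_SONG_TOOL, "handler": generate_song},
}


def dispatch_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Route a model-issued tool call (assumed shape: name plus args) to its handler."""
    name = call.get("name")
    args = call.get("args", {})
    tool = TOOLS.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    required = tool["declaration"]["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}
    handler: Callable[..., Dict[str, Any]] = tool["handler"]
    return handler(**args)


# Example call, as the live model might emit it after hearing the spoken request.
result = dispatch_tool_call({
    "name": "generate_song",
    "args": {"prompt": "German techno Schlager about the UK startup scene"},
})
print(result)
```

Returning errors as data rather than raising lets the application send the result back to the model, which can then ask the user for the missing detail instead of breaking the conversation.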