AI Engineer · June 9, 2026 · 19m

From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind

TL;DR

Gemini's pitch is audio understanding, not just transcription: Thor Schaeff says Gemini 3 models are built to capture speech plus emotion, pacing, overlap, accents, and code-switching across languages in a single pass.
A single structured output request produced a rich transcript: In the Echo Script demo, Gemini 3 Flash Preview returned a summary, speaker labels, timestamps, language identification, English translations, and emotion tags from one API call.
Speech generation uses direction instead of giant voice catalogs: Rather than picking from hundreds of preset voices, Gemini starts with roughly 30 base voices and lets developers steer accent, tone, and scene with prompt-based "director's notes."
The live model keeps reasoning inside the audio stack: Gemini 3.1 Flash Live is a full-duplex speech-to-speech multimodal model that takes text, audio, and video over WebSockets and responds in real time, without a separate text pipeline bolted on after ASR.
Vision and speech work together in the live demo: Thor showed the model reading his camera feed, commenting on his Gemini shirt and backwards hat, then switching into German poetry while hilariously keeping the requested Irish accent.
Google is tying real-time conversation to music generation: The closing demo connected Gemini Live to Lyria 3 as a tool, turning a spoken request for a "German techno Schlager about the UK startup scene" into a generated song on the spot.
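
The "director's notes" idea from the speech-generation point can be pictured as a small prompt builder. This is only a sketch under stated assumptions: the function name `build_directed_prompt`, the note labels, and the placeholder voice id `base-voice-01` are hypothetical and are not taken from the talk or from any Gemini SDK. It shows only how one base voice plus free-text direction might be combined into a single request payload.

```python
from typing import Dict


def build_directed_prompt(
    line: str, *, voice: str, accent: str, tone: str, scene: str
) -> Dict[str, str]:
    """Combine one base voice with free-text direction (hypothetical shape)."""
    # The notes steer delivery; the base voice itself stays the same.
    notes = (
        "Director's notes:\n"
        f"- Accent: {accent}\n"
        f"- Tone: {tone}\n"
        f"- Scene: {scene}\n"
    )
    # The text to be spoken follows the notes, clearly separated from them.
    prompt = f"{notes}\nRead the following line:\n{line}"
    return {"voice": voice, "prompt": prompt}


request = build_directed_prompt(
    "Welcome to the show.",
    voice="base-voice-01",  # placeholder id, not a real voice name
    accent="Irish",
    tone="warm and slightly amused",
    scene="a small conference stage",
)
print(request["prompt"])
```

The design point is that variety comes from the direction text rather than from a larger voice catalog, so changing the accent or scene means editing a string, not choosing a different voice.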

The Breakdown

One Gemini API call pulled out speaker names, timestamps, language switches, translations, and even emotion from a live multilingual demo. The same stack then moved on to Irish-accented speech, live multimodal conversation, and a custom German techno-Schlager song about the UK startup scene.
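
To make the single-call transcript more concrete, here is a minimal sketch of what such a structured result could look like and how a client might parse it. The field names (`speaker`, `start`, `end`, `language`, `text`, `emotion`, `translation_en`), the schema, and the sample JSON are all assumptions made for illustration. They are not the demo's actual schema or output, and no Gemini SDK call is shown.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    speaker: str
    start: str        # "MM:SS" offset into the audio
    end: str
    language: str     # language code such as "en" or "de"
    text: str         # what was said, in the original language
    emotion: str
    translation_en: Optional[str] = None  # absent when already English


@dataclass
class Transcript:
    summary: str
    segments: List[Segment]


# Hypothetical JSON Schema a client could send with a structured output request.
TRANSCRIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    name: {"type": "string"}
                    for name in ("speaker", "start", "end", "language",
                                 "text", "emotion", "translation_en")
                },
                "required": ["speaker", "start", "end",
                             "language", "text", "emotion"],
            },
        },
    },
    "required": ["summary", "segments"],
}


def parse_transcript(raw: str) -> Transcript:
    """Turn the model's JSON text into typed objects."""
    data = json.loads(raw)
    return Transcript(
        summary=data["summary"],
        segments=[Segment(**seg) for seg in data["segments"]],
    )


def languages_spoken(transcript: Transcript) -> List[str]:
    """Sorted, de-duplicated language codes, useful for spotting code-switching."""
    return sorted({seg.language for seg in transcript.segments})


# Made-up sample response, for illustration only.
SAMPLE = json.dumps({
    "summary": "Two speakers greet each other in English and German.",
    "segments": [
        {"speaker": "Speaker 1", "start": "00:00", "end": "00:03",
         "language": "en", "text": "Hello, welcome to the demo.",
         "emotion": "neutral"},
        {"speaker": "Speaker 2", "start": "00:03", "end": "00:06",
         "language": "de", "text": "Guten Tag, vielen Dank.",
         "translation_en": "Good day, thank you very much.",
         "emotion": "happy"},
    ],
})

transcript = parse_transcript(SAMPLE)
print(languages_spoken(transcript))
```

Keeping the translation optional reflects the idea that only non-English segments need one. A real schema would depend on what the application does with the transcript.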