AI Engineer · 19m

From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind

TL;DR

  • Gemini's pitch is audio understanding, not just transcription: Thor Schaeff says Gemini 3 models are built to capture speech plus emotion, pacing, overlap, accents, and code-switching across languages in a single pass.

  • A single structured output request produced a rich transcript: In the Echo Script demo, Gemini 3 Flash Preview returned a summary, speaker labels, timestamps, language identification, English translations, and emotion tags from one API call.

  • Speech generation uses direction instead of giant voice catalogs: Rather than picking from hundreds of preset voices, Gemini starts with roughly 30 base voices and lets developers steer accent, tone, and scene with prompt-based "director's notes."

  • The live model keeps reasoning inside the audio stack: Gemini 3.1 Flash Live is a full-duplex speech-to-speech multimodal model that takes text, audio, and video over WebSockets and responds in real time, without a separate text pipeline bolted on after ASR.

  • Vision and speech work together in the live demo: Thor showed the model reading his camera feed, commenting on his Gemini shirt and backwards hat, then switching into German poetry while hilariously keeping the requested Irish accent.

  • Google is tying real-time conversation to music generation: The closing demo connected Gemini Live to Lyria 3 as a tool, turning a spoken request for a "German techno Schlager about the UK startup scene" into a generated song on the spot.
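
The talk summary does not show how the live model hands a spoken request to a music tool, so the following is only a minimal sketch of the general tool-calling pattern, assuming the model emits a JSON message naming a tool and its arguments. The `generate_song` function, the `TOOLS` registry, and the message shape are all hypothetical and are not taken from the Gemini API.

```python
import json
from typing import Callable, Dict


def generate_song(prompt: str) -> dict:
    """Hypothetical stand-in for a call to a music-generation model."""
    return {"status": "queued", "prompt": prompt}


# Hypothetical registry mapping tool names to local functions.
TOOLS: Dict[str, Callable[..., dict]] = {"generate_song": generate_song}


def handle_tool_call(message: str) -> dict:
    """Dispatch one JSON tool-call message to the matching local function."""
    call = json.loads(message)
    name = call["name"]
    if name not in TOOLS:
        # Report unknown tools back instead of raising, so the session can continue.
        return {"name": name, "error": "unknown tool"}
    result = TOOLS[name](**call.get("args", {}))
    return {"name": name, "response": result}
```

In a real session the returned dictionary would be sent back to the model so it can describe or play the result; that transport step is left out here.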

The Breakdown

A single Gemini API call extracted speaker names, timestamps, language switches, translations, and emotion tags from a live multilingual demo. The same stack then moved on to Irish-accented speech, live multimodal conversation, and a custom German techno-Schlager song about the UK startup scene.
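
To make the single-call transcript concrete, here is a minimal sketch of how a client might model and parse such a structured response. The field names (`speaker`, `start`, `language`, `text`, `translation`, `emotion`) are assumptions based on the items listed in the summary, not the schema used in the demo, and no API call is made.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speaker: str
    start: str        # timestamp as text, e.g. "00:12" (assumed format)
    language: str
    text: str
    translation: str  # English translation of `text`
    emotion: str


@dataclass
class Transcript:
    summary: str
    segments: List[Segment]


def parse_transcript(raw: str) -> Transcript:
    """Parse a JSON response string into typed transcript objects."""
    data = json.loads(raw)
    segments = [Segment(**s) for s in data["segments"]]
    return Transcript(summary=data["summary"], segments=segments)


def languages_used(transcript: Transcript) -> List[str]:
    """Return the languages in order of first appearance, without duplicates."""
    seen: List[str] = []
    for seg in transcript.segments:
        if seg.language not in seen:
            seen.append(seg.language)
    return seen
```

Typed segments make code-switching easy to inspect: `languages_used` lists each language once, in the order speakers first used it.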
