AI Engineer | May 9, 2026 | 22m

Why TTS Models Now Look Like LLMs — Samuel Humeau, Mistral

TL;DR

TTS architecture is converging toward LLM-style sequence modeling: Samuel Humeau says most strong modern systems now treat speech generation like token prediction, using autoregressive decoder backbones that generate audio in chunks instead of generating the raw waveform sample by sample.
Latency is the real product constraint for voice agents: in the agent setup Humeau shows, what matters is not finishing the whole waveform quickly but emitting the first audio packet immediately. Mistral’s model produces its first playable audio in 17 milliseconds on a single GPU, excluding network overhead.
Audio has to be brutally compressed before transformers can handle it: Humeau contrasts the text content of human speech, roughly 15 bits per second, with a standard MP3 at around 200 kilobits per second. Mistral’s codec splits audio into 80 ms frames and encodes each as 37 tokens, which comes to about 500 tokens per second (12.5 frames × 37 ≈ 460).
Mistral’s TTS model is open-source, but the voice-cloning encoder is not: the released model can synthesize from provided voices, but the component that would let anyone clone an arbitrary voice from a few seconds of audio remains proprietary because, as he puts it, they didn’t want to hand everyone that capability.
Voice cloning is already good enough to make “vocal identity” a real brand asset: Humeau demos cloning an English speaker named Paul, a French speaker speaking English with a recognizably French accent, and his own voice, and argues that companies will increasingly treat how they sound the way they already treat visual brand identity.

Summary

Why speech matters now: voice is becoming the front end for agents

Humeau opens with the practical reason TTS suddenly matters more: the dominant use case is no longer “listen to this blog post,” it’s giving chat agents a voice. In that stack, speech-to-text feeds the LLM, text-to-speech speaks back, and the whole experience lives or dies on latency.

The demo with “Paul” makes the latency point visceral

He shows Mistral’s newly released open-source TTS model cloning a real person named Paul from a short recording, then uses it inside a small voice agent that answers schedule questions. The important bit is that the first audio packet arrives fast enough to start playback before the full waveform is done, which makes the interaction feel responsive even if generation is still happening under the hood.
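
To make the "first packet" point concrete, here is a minimal timing sketch rather than a description of Mistral's implementation. It assumes audio is produced frame by frame at a fixed cost per frame. The 80 ms frame length comes from the talk; the per-frame generation time is a hypothetical placeholder (17 ms is borrowed from the first-audio figure only for illustration, since the real per-frame cost is not stated), and the function name is made up for this example.

```python
FRAME_MS = 80  # frame duration quoted in the talk


def streaming_timeline(n_frames, gen_ms_per_frame):
    """Return (time_to_first_audio_ms, total_generation_ms, underruns).

    gen_ms_per_frame is a hypothetical, constant generation cost per frame.
    An underrun is a frame that is not ready when playback reaches it.
    """
    # Frame i finishes generating after (i + 1) generation steps.
    ready = [(i + 1) * gen_ms_per_frame for i in range(n_frames)]
    first = ready[0]
    # Playback starts when the first frame is ready; frame i is needed
    # once the i frames before it have been played.
    deadlines = [first + i * FRAME_MS for i in range(n_frames)]
    underruns = sum(1 for r, d in zip(ready, deadlines) if r > d)
    return first, ready[-1], underruns


# 125 frames of 80 ms each is 10 seconds of speech.
first, total, underruns = streaming_timeline(125, 17)
print(f"streaming: playback starts after {first} ms")
print(f"non-streaming: playback starts after {total} ms")
print(f"underruns while streaming: {underruns}")
```

Under these assumptions the listener waits 17 ms instead of 2125 ms, and playback never stalls as long as each frame is generated faster than it takes to play.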

Voice cloning is getting eerily easy — and surprisingly expressive

Humeau plays several examples: Paul’s cloned voice, a French speaker rendered in English while keeping a strong French accent, and his own cloned voice saying “Hi, this is Sam.” He jokes that at “the peak of my ego” he can now discuss complicated problems with himself, but the underlying point is serious: a few seconds of reference audio is enough to make impersonation feel real.

From brand visuals to brand voice

That cloning capability leads to a brief but memorable aside: companies already care about “vocal identity” in ads, and Humeau thinks that will become much more mainstream. His prediction is simple — just as firms define how their website looks, they’ll increasingly define how their voice sounds.

Why TTS models now resemble LLMs

He walks through the architecture shift historically: old systems stitched together recorded units, then came waveform-level generation, then whole-audio generation. The current center of gravity is chunked sequence modeling — encode short spans of audio into token-like representations, then let an LLM-style backbone generate those pieces autoregressively because, as he says, humanity has become extremely good at modeling sequences of tokens.

The hard part: audio is vastly denser than text

Text tokenization is easy; audio tokenization is not. Humeau highlights the gap with a striking comparison: even speaking quickly, he only reaches about 15 bits per second of textual information, versus roughly 200,000 bits per second for a standard-quality MP3, so a codec has to throw away a huge amount while preserving voice and acoustics.
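
The size of that gap is easy to work out from the two figures quoted in the talk. The snippet below is only arithmetic on those numbers; the function and variable names are illustrative.

```python
TEXT_BITS_PER_SEC = 15       # textual content of fast speech, as quoted in the talk
MP3_BITS_PER_SEC = 200_000   # standard-quality MP3, as quoted in the talk


def reduction_factor(source_bps, target_bps):
    """How many times larger the source bit rate is than the target."""
    return source_bps / target_bps


ratio = reduction_factor(MP3_BITS_PER_SEC, TEXT_BITS_PER_SEC)
print(f"The MP3 carries about {ratio:,.0f}x more bits than the words alone")

# The same comparison for a 10-second utterance.
seconds = 10
print(f"text: {TEXT_BITS_PER_SEC * seconds} bits, MP3: {MP3_BITS_PER_SEC * seconds:,} bits")
```

The ratio is more than four orders of magnitude. A speech codec sits somewhere between those two extremes, because it must keep voice and acoustics, which the text alone does not carry.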

Mistral’s codec and where it deviates from the common pattern

In Mistral’s system, audio is cut into 80 ms frames (12.5 per second) and each frame becomes 37 tokens, which works out to 462.5 tokens per second after compression, a figure the talk rounds to about 500. Most labs then use one transformer step per frame plus a smaller model to reconstruct the frame tokens, but Mistral does something different: it generates all 37 tokens for a frame at once with a diffusion-style model, specifically using flow matching.
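
As a rough illustration of what "all 37 at once" means, the toy below samples one frame by integrating a velocity field from noise with Euler steps, which is the general shape of flow-matching inference. Everything specific here is an assumption: treating the frame as 37 continuous values, the closed-form `toy_velocity` standing in for a learned network, the `target` argument standing in for whatever the backbone's conditioning would steer toward, and the step count. It is not Mistral's model.

```python
import random

TOKENS_PER_FRAME = 37  # per-frame size quoted in the talk


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network (an assumption of this sketch).

    Returns the velocity of the straight-line path that reaches `target` at t = 1.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_frame(target, n_steps=8, seed=0):
    """Euler-integrate dx/dt = v(x, t) from Gaussian noise at t=0 toward t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(TOKENS_PER_FRAME)]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        # All 37 values are updated jointly in each step.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


frame = sample_frame([0.5] * TOKENS_PER_FRAME)
print(len(frame), "values produced in 8 joint steps")
```

The point of the sketch is the loop structure: the number of sequential steps is set by the sampler (8 here, a made-up value), not by the 37 values in the frame, whereas a token-by-token decoder would need 37 sequential steps per frame.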

Streaming text is the next frontier, and the architecture is still unsettled

The released model belongs to the “full text first, then audio” camp: it takes a few seconds of speaker audio plus the whole text prompt, then synthesizes speech. For true real-time LLM-to-voice streaming, Humeau says there’s no clear winner yet between interleaving text and audio in one stream or using dual-stream architectures, but the payoff is obvious — you can start speaking as soon as the LLM emits its first tokens instead of waiting for the whole paragraph.
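
The difference between the two layouts can be sketched with plain sequences. This is a schematic, not any lab's actual format: the talk says the design is unsettled, so the fixed "two audio frames per text token" schedule and all function names below are hypothetical.

```python
def full_text_first(text_tokens, n_audio_frames):
    """Whole text prompt first, then every audio frame (the released model's camp)."""
    seq = [("text", t) for t in text_tokens]
    seq += [("audio", i) for i in range(n_audio_frames)]
    return seq


def interleaved(text_tokens, frames_per_token):
    """Single stream alternating text and audio (hypothetical fixed schedule)."""
    seq = []
    frame = 0
    for t in text_tokens:
        seq.append(("text", t))
        for _ in range(frames_per_token):
            seq.append(("audio", frame))
            frame += 1
    return seq


def first_audio_index(seq):
    """Position in the sequence at which the first audio frame appears."""
    return next(i for i, (kind, _) in enumerate(seq) if kind == "audio")


text = ["Hi", ",", "this", "is", "Sam"]
print(first_audio_index(full_text_first(text, 10)))  # audio starts after all text
print(first_audio_index(interleaved(text, 2)))       # audio starts after one text token
```

In the first layout, audio cannot begin until every text token has arrived. In the interleaved one, it can begin right after the first token, which is the property that matters when the text itself is still being produced by an LLM.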
