
TTS architecture is converging toward LLM-style sequence modeling: Samuel Humeau says most strong modern systems now treat speech generation like token prediction, using autoregressive decoder backbones that generate audio in chunks rather than producing the raw waveform sample by sample.
Latency is the real product constraint for voice agents: in the agent setup Humeau shows, what matters is not finishing the whole waveform fast but emitting the first audio packet immediately, and Mistral’s model produces first playable audio in 17 milliseconds on a single GPU, not counting network overhead.
Audio has to be brutally compressed before transformers can handle it: Humeau contrasts the textual content of speech, roughly 15 bits per second, with a standard MP3 at around 200 kilobits per second, then explains that Mistral’s codec reduces speech to about 500 tokens per second by splitting audio into 80 ms frames and encoding each as 37 tokens.
Mistral’s TTS model is open-source, but the voice-cloning encoder is not: the released model can synthesize from provided voices, but the component that would let anyone clone arbitrary voices from a few seconds of audio remains proprietary because, as he puts it, they didn’t want to hand everyone that capability.
Voice cloning is already good enough to make “vocal identity” a real brand asset: Humeau demos cloning an English speaker named Paul, a French speaker speaking English with a recognizably French accent, and even his own voice, arguing that companies will increasingly treat how they sound the way they already treat visual brand identity.
Humeau opens with the practical reason TTS suddenly matters more: the dominant use case is no longer “listen to this blog post,” it’s giving chat agents a voice. In that stack, speech-to-text feeds the LLM, text-to-speech speaks back, and the whole experience lives or dies on latency.
He shows Mistral’s newly released open-source TTS model cloning a real person named Paul from a short recording, then uses it inside a small voice agent that answers schedule questions. The important bit is that the first audio packet arrives fast enough to start playback before the full waveform is done, which makes the interaction feel responsive even if generation is still happening under the hood.
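To make the "first packet" idea concrete, here is a minimal sketch of how time to first audio can be measured separately from total generation time. The `fake_tts_stream` generator is a hypothetical stand-in that sleeps and yields placeholder bytes; it is not Mistral's API, and the chunk count and delay are arbitrary assumptions.

```python
import time
from typing import Iterator, Tuple


def fake_tts_stream(n_chunks: int = 5, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical streaming TTS engine: yields audio chunks one at a time."""
    for _ in range(n_chunks):
        time.sleep(chunk_delay_s)   # pretend to generate one chunk
        yield b"\x00" * 320         # placeholder bytes, not real speech


def measure_stream(stream: Iterator[bytes]) -> Tuple[float, float, int]:
    """Return (seconds until first chunk, total seconds, number of chunks)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _chunk in stream:
        if first is None:
            # Playback could begin here, while later chunks are still being generated.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("stream produced no audio")
    return first, total, count


ttfa, total, n = measure_stream(fake_tts_stream())
print(f"first audio after {ttfa * 1000:.1f} ms, all {n} chunks after {total * 1000:.1f} ms")
```

The number that shapes how responsive the agent feels is `ttfa`, not `total`: the listener starts hearing speech once the first chunk lands.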
Humeau plays several examples: Paul’s cloned voice, a French speaker rendered in English while keeping a strong French accent, and his own cloned voice saying “Hi, this is Sam.” He jokes that at “the peak of my ego” he can now discuss complicated problems with himself, but the underlying point is serious: a few seconds of reference audio is enough to make impersonation feel real.
That cloning capability leads to a brief but memorable aside: companies already care about “vocal identity” in ads, and Humeau thinks that will become much more mainstream. His prediction is simple — just as firms define how their website looks, they’ll increasingly define how their voice sounds.
He walks through the architecture shift historically: old systems stitched together recorded units, then came waveform-level generation, then whole-audio generation. The current center of gravity is chunked sequence modeling — encode short spans of audio into token-like representations, then let an LLM-style backbone generate those pieces autoregressively because, as he says, humanity has become extremely good at modeling sequences of tokens.
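The chunked, autoregressive pattern can be sketched as a toy loop. Everything here is illustrative: `toy_model` is a hypothetical placeholder that derives each frame from the previous one, where a real system would use a transformer backbone, and the three-token frames and vocabulary size of 1024 are assumptions chosen only to keep the output readable.

```python
from typing import Callable, List

Frame = List[int]  # one short span of audio, represented as token ids


def generate_frames(
    next_frame: Callable[[List[Frame]], Frame],
    prompt: List[Frame],
    n_new: int,
) -> List[Frame]:
    """Autoregressive generation: each new frame is conditioned on all earlier frames."""
    history = list(prompt)
    for _ in range(n_new):
        history.append(next_frame(history))
    return history[len(prompt):]


def toy_model(history: List[Frame]) -> Frame:
    """Hypothetical stand-in for the backbone: shifts the last frame's tokens by one."""
    return [(token + 1) % 1024 for token in history[-1]]


frames = generate_frames(toy_model, prompt=[[0, 1, 2]], n_new=3)
print(frames)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```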
Text tokenization is easy; audio tokenization is not. Humeau highlights the gap with a striking comparison: even speaking quickly, he only reaches about 15 bits per second of textual information, versus roughly 200,000 bits per second for a standard-quality MP3, so a codec has to throw away a huge amount while preserving voice and acoustics.
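A quick back-of-the-envelope check of that gap, using only the two approximate figures quoted above:

```python
TEXT_BITS_PER_SECOND = 15        # approximate textual information rate quoted in the talk
MP3_BITS_PER_SECOND = 200_000    # approximate standard-quality MP3 bitrate quoted in the talk

ratio = MP3_BITS_PER_SECOND / TEXT_BITS_PER_SECOND
print(f"The MP3 carries about {ratio:,.0f}x more bits per second than the text it encodes")
```

That is a gap of roughly four orders of magnitude, which is why the codec has so much room to compress.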
In Mistral’s system, audio is cut into 80 ms frames (12.5 per second) and each frame becomes 37 tokens, which works out to roughly 460 tokens per second, in line with the approximately 500 he cites. Most labs then use one transformer step per frame plus a smaller model to reconstruct the frame tokens, but Mistral does something different: it generates all 37 tokens for a frame at once with a diffusion-style model, specifically using flow matching.
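The frame arithmetic is easy to verify. The sketch below uses only the 80 ms frame length and 37 tokens per frame from the talk; the helper `tokens_for` is a hypothetical name, and rounding a partial frame up to a whole frame is an assumption about how the codec would pad.

```python
FRAME_MS = 80            # frame length from the talk
TOKENS_PER_FRAME = 37    # tokens per frame from the talk

frames_per_second = 1000 / FRAME_MS                        # 12.5
tokens_per_second = frames_per_second * TOKENS_PER_FRAME   # 462.5


def tokens_for(duration_ms: int) -> int:
    """Tokens needed for a clip, assuming a partial final frame is padded to a full one."""
    frames = -(-duration_ms // FRAME_MS)  # ceiling division on integers
    return frames * TOKENS_PER_FRAME


print(frames_per_second, tokens_per_second)  # 12.5 462.5
print(tokens_for(10_000))                    # 4625 tokens for a 10-second clip
```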
The released model belongs to the “full text first, then audio” camp: it takes a few seconds of speaker audio plus the whole text prompt, then synthesizes speech. For true real-time LLM-to-voice streaming, Humeau says there’s no clear winner yet between interleaving text and audio in one stream or using dual-stream architectures, but the payoff is obvious — you can start speaking as soon as the LLM emits its first tokens instead of waiting for the whole paragraph.
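The payoff can be put into rough numbers with a simple latency model. The LLM speed and response length below are hypothetical values picked for illustration, and the sketch idealizes streaming by assuming speech can start after a single LLM token and that the 17 ms first-packet figure from the talk applies in both setups.

```python
def time_to_first_audio(
    tokens_needed: int, llm_tokens_per_s: float, tts_first_packet_s: float
) -> float:
    """Seconds until the first audio packet: wait for the LLM tokens, then for the TTS."""
    return tokens_needed / llm_tokens_per_s + tts_first_packet_s


RESPONSE_TOKENS = 100        # hypothetical length of the LLM's reply
LLM_TOKENS_PER_S = 50.0      # hypothetical LLM generation speed
TTS_FIRST_PACKET_S = 0.017   # 17 ms first-audio figure quoted in the talk

# "Full text first": the TTS waits for the whole reply before it starts.
full_text_first = time_to_first_audio(RESPONSE_TOKENS, LLM_TOKENS_PER_S, TTS_FIRST_PACKET_S)

# Streaming (idealized): the TTS starts once the first LLM token is available.
streaming = time_to_first_audio(1, LLM_TOKENS_PER_S, TTS_FIRST_PACKET_S)

print(f"full text first: {full_text_first:.3f} s, streaming: {streaming:.3f} s")
```

Under these assumptions the wait before the agent starts speaking drops from about two seconds to a few tens of milliseconds, and the gap widens as replies get longer.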