This Week in AI · 9m

How to Clone Your Voice With ElevenLabs in 10 Minutes

TL;DR

  • ElevenLabs gives you two very different cloning paths — Oliver shows that Instant Voice Clone can spin up from just 10 seconds of audio, while Professional Voice Clone uses roughly 30 minutes to 2 hours of clean recordings for a much more convincing result.

  • The professional clone is the one that actually made him rethink content production — after using it for weeks, Oliver says it enables “more scripts, more episodes, more output without burning out my voice,” and he used that higher-quality clone in the intro of the video itself.

  • Model choice matters more than just having a clone — Eleven Multilingual v2 sounded closest to Oliver, Eleven Flash v2.5 was faster and cheaper but “definitely more robotic,” and Eleven v3 sounded less like him yet more humanlike because of its expressive controls.

  • Prompting and pacing still require manual finesse — Oliver points out that Multilingual V2 can struggle with pauses, so he literally inserts dashes into the script to force better pacing and says getting the best output takes burning some tokens and trying variations.

  • Eleven v3’s audio tags are the standout feature for expressive speech — he demos tags like whispers, thoughtful, excited, chuckles, clapping, and even accent changes mid-prompt, showing how the model can act out a performance instead of just reading text.

  • The tool is already meaningfully better than it was 6 months ago — Oliver closes by saying the progress over that short span is dramatic, which is why he frames ElevenLabs as one of the most useful AI tools he’s used this year.

The Breakdown

“Was that me or AI?” sets the hook

Oliver opens by revealing that the intro voice may not have been him at all, which is exactly the point of the demo. He frames ElevenLabs as one of the most useful AI tools he’s used this year because it changes the math of content production: more scripts and episodes without wrecking your actual voice.

Instant clone: fast, simple, and clearly good enough to be useful

Inside the ElevenLabs dashboard, he walks through the Instant Voice Clone flow: upload a sample, add labels like language, accent, age, and gender, then jump straight into text-to-speech. His sample comes from old This Week in AI recordings, but he notes that a phone recording exported as an MP3 would work too. The result “definitely” sounds like him, but not enough to pass as a true replica.
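The same generate-speech step can also be driven programmatically. A minimal sketch, assuming ElevenLabs' documented REST endpoint (`POST /v1/text-to-speech/{voice_id}` with an `xi-api-key` header); `YOUR_VOICE_ID` is a placeholder for the ID of your cloned voice, and `build_tts_request` is my own helper, not part of the SDK:

```python
import json
import os
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(voice_id: str, text: str,
                      model_id: str = "eleven_multilingual_v2"):
    """Assemble the URL, headers, and JSON body for a text-to-speech call."""
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    headers = {
        # Read the key from the environment rather than hard-coding it.
        "xi-api-key": os.environ.get("ELEVENLABS_API_KEY", ""),
        "Content-Type": "application/json",
    }
    body = {"text": text, "model_id": model_id}
    return url, headers, body

if __name__ == "__main__":
    url, headers, body = build_tts_request("YOUR_VOICE_ID", "Was that me, or AI?")
    # Only send the request when a real key is configured.
    if headers["xi-api-key"]:
        req = urllib.request.Request(url, data=json.dumps(body).encode(),
                                     headers=headers, method="POST")
        with urllib.request.urlopen(req) as resp:
            with open("intro.mp3", "wb") as f:
                f.write(resp.read())  # the response body is the generated audio
```

Check the endpoint path, header name, and model IDs against the current ElevenLabs API reference before relying on them.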

Professional clone: more work, much better payoff

The Professional Voice Clone asks for much more training data, ideally around 2 hours, though Oliver only fed it about 30 minutes from prior 5-10 minute demos. He jokes that this is easy for someone like Jason, who has multiple podcasts a week, and says he’ll probably build one for him next. The catch is training takes 2 to 6 hours, but when the output comes back, Oliver calls it “honestly worth the wait.”

The pacing problem and the dash hack

Using his professional voice with Eleven Multilingual v2, Oliver shows the clone that powered the video intro and notes he still has almost 200,000 credits left on the Creator plan. The big limitation he calls out is pacing: the model sometimes rushes through lines, so he manually inserts dashes in the prompt to create little pauses. It’s a small but very practical trick, and he’s clear that getting polished output takes slider-tweaking and multiple generations.
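Because the dash trick is just literal text the model reads as a beat, it can be scripted instead of typed by hand. A minimal sketch (the `add_pacing_dashes` helper and its regex are my own illustration, not an ElevenLabs feature):

```python
import re

def add_pacing_dashes(script: str) -> str:
    """Insert a dash after each sentence so the model takes a small pause.

    The dash is plain prompt text, mirroring Oliver's manual hack for
    Multilingual v2 rushing through lines; it is not an official pacing control.
    """
    # After sentence-ending punctuation followed by a space, add " - ".
    return re.sub(r"([.!?])\s+", r"\1 - ", script)

print(add_pacing_dashes("Welcome back. This week we cloned a voice. Let's dig in."))
```

You would run your whole script through this once before pasting it into the text-to-speech box, then still review the output by ear.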

Flash V2.5: cheaper and faster, but obviously synthetic

He then switches to Eleven Flash v2.5, which has fewer controls and is positioned as the faster, cheaper option. After playing a sample, Oliver doesn’t mince words: it sounds more robotic, and nobody who knows his voice would believe it’s really him. Still, he says it makes sense if you’re building something where cost and speed matter more than realism.
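The trade-off Oliver describes reduces to picking a model ID per use case. A small sketch; the IDs follow ElevenLabs' public API naming (`eleven_multilingual_v2`, `eleven_flash_v2_5`, `eleven_v3`) and should be verified against current docs, and the `pick_model` helper is hypothetical:

```python
# Map a production priority onto an ElevenLabs model ID.
MODELS = {
    "realism": "eleven_multilingual_v2",   # closest match to the cloned voice
    "speed": "eleven_flash_v2_5",          # cheaper and lower latency, more robotic
    "expressiveness": "eleven_v3",         # audio tags, less exact identity match
}

def pick_model(priority: str) -> str:
    """Return a model ID for the given priority, defaulting to realism."""
    return MODELS.get(priority, MODELS["realism"])
```

For a narration-heavy podcast you would default to realism; for something like a high-volume notification voice, speed and cost win.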

Eleven v3 turns voice cloning into performance prompting

The most exciting section is Eleven v3, which trades close identity matching for expressive range. Oliver shows how you can embed audio tags like whispers, sighs, sarcasm, chuckles, clapping, sings, and even accent shifts directly into the prompt, then listens as the model tries to act them out. His verdict is nuanced: it doesn’t sound nearly as much like Oliver as Multilingual v2, but it sounds much more human, which makes it better if you want a believable speaker rather than an exact impersonation.
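The audio tags are bracketed markers embedded directly in the prompt text, so a tagged script can be assembled programmatically. A minimal sketch (the `tag` helper is my own; the tag names come from the demo, and v3's supported set should be checked against ElevenLabs' documentation):

```python
def tag(name: str, text: str = "") -> str:
    """Render a v3 audio tag, optionally followed by the line it modifies."""
    marker = f"[{name}]"
    return f"{marker} {text}" if text else marker

# Compose a short performance: a whisper, a standalone chuckle, an excited line.
script = " ".join([
    tag("whispers", "Was that really me?"),
    tag("chuckles"),
    tag("excited", "Because it might not have been!"),
])
print(script)
```

As Oliver's stress tests show, not every tag lands on every generation, so expect to regenerate a few times per line.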

Weird sound effects, failed tags, and the bigger takeaway

He keeps stress-testing the audio tags with explosion, gulps, laughs harder, and applause, and the results are mixed: some tags land, some don’t, and one mystery sound effect gets a laugh. That imperfection is part of his conclusion — the tool is insanely powerful, but you have to experiment to get what you want. Compared with where ElevenLabs was 6 months ago, though, he says the improvement is huge and clearly accelerating.