Microsoft AI Model Can Mimic Your Voice in Just 3 Seconds

Are you afraid of AI yet? Perhaps you should be. We weren’t yet a week into 2023 when Microsoft released a research paper offering a glimpse under the hood of its ‘neural codec language model’ VALL-E – and its contents are enough to provoke horripilation in even the most insouciant AI sceptic.

The tech giant has been tinkering with text-to-speech (TTS) synthesis to hone the in-context learning abilities of VALL-E, which it claims can now “synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.”

In other words, VALL-E needs just a three-second snippet of your speech to convincingly reproduce your voice. Just think: a truanting student could leverage VALL-E to place a fake phone call from his mother to the principal’s office and spend the day in the city, Ferris Bueller style. Or, y’know, put it to even more dangerous use.

60,000 Hours of Training

VALL-E isn’t yet in the wild, though a demo is currently live on GitHub, where you can listen to recordings of human speakers and then hear VALL-E’s synthesized versions of sentences like “We have to reduce the number of plastic bags” and “Nothing is yet confirmed”.

According to the research paper, VALL-E was trained on 60,000 hours of English speech data (courtesy of Meta’s Libri-Light dataset). Despite the impressive results, the authors identified several flaws. These include the observation that some words may be unclear, missed, or duplicated in speech synthesis, the difficulty of covering all accent ranges, and the fact that VALL-E could “carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker.” You don’t say!

Reproducing the emotion, cadence and intonation of a person’s voice, along with the acoustic conditions of the original recording, might seem like a fairly benign skill; but combine it with ever-improving deepfake technology and it’s not hard to see such models being deployed to conduct aggressive propaganda, wage political warfare, or fabricate evidence.

Scammers could also use the tool to make fraudulent phone calls, mimicking the voice of a trusted family member or your own financial advisor.

“Clearly criminals will be desperate to get their hands on sophisticated AI tools such as VALL-E,” says Matt Ridley, co-founder of cybersecurity firm Ascent Cyber. “As successful as many phishing phone campaigns already are, they would undoubtedly be more profitable if the perpetrator could recreate the voice of a loved one.

“It is notable, though, that the researchers suggested building a detection model that can determine whether an audio clip has been synthesized by VALL-E. There’s nothing to stop other parties from building their own detection systems using AI itself, or even blockchain.”

Surprised there isn't more chatter around VALL-E

This new model by @Microsoft can generate speech in any voice after only hearing a 3s sample of that voice 🤯

Demo → https://t.co/GgFO6kWKha pic.twitter.com/JY88vf4lYc
— Steven Tey (@steventey) January 9, 2023

One nifty metaverse use case for VALL-E might be giving a character in a game your voice (or giving the villain the voice of your despised stepbrother). Of course, this won’t apply to all games, since in some you will be using your own voice anyway and communicating as normal while navigating virtual worlds. But web2 games seeking to integrate web3 features might like the idea of making players even more central to the story.

Microsoft isn’t the only company playing in the TTS arena, of course: Amazon, Google and Meta all have a stake in the technology. Little wonder it’s expected to be a $12.5 billion industry by 2031.