Your voice can be cloned with just a 3-second sample. Should you worry?
Summary
- There is a big concern over using cloned voices for mischief, impersonation, etc.
Content creators and voice actors in today's digital age have their work cut out for them with intelligent software mimicking their writings, art, voice, and even their emotions. If OpenAI's DALL-E can generate realistic art and images from plain text prompts, and ChatGPT can write poems, articles, books and even code, here's one more artificial intelligence (AI)-powered tool that can speak and emote like us without us being able to spot the difference in most cases.
Microsoft published a paper early this month about its new text-to-speech AI model, VALL-E, which can simulate a person's voice with just a 3-second recording. Initial results show that VALL-E can also preserve the speaker's emotional tone. The paper describes VALL-E as "a new language model approach for text-to-speech synthesis (TTS) that uses audio codec codes as intermediate representations".
According to the paper's authors, VALL-E was pre-trained on 60,000 hours of English speech data, which the paper claims is "hundreds of times larger than existing systems".
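The "audio codec codes" the paper refers to can be pictured as a waveform turned into a sequence of discrete integers, which a language model can then treat much like text tokens. The sketch below is only an illustration of that idea: it uses classic 8-bit mu-law companding as a stand-in, which is an assumption made for simplicity and not the codec VALL-E itself relies on. The function names are hypothetical.

```python
import math

MU = 255  # 8-bit mu-law: 256 possible discrete codes (illustrative stand-in codec)

def encode(sample):
    """Map a sample in [-1.0, 1.0] to an integer code in [0, 255]."""
    sample = max(-1.0, min(1.0, sample))  # clip out-of-range input
    compressed = math.copysign(
        math.log1p(MU * abs(sample)) / math.log1p(MU), sample
    )
    return int(round((compressed + 1) / 2 * MU))

def decode(code):
    """Map an integer code in [0, 255] back to an approximate sample."""
    compressed = 2 * code / MU - 1
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed
    )

# A short synthetic tone stands in for a recorded voice prompt.
waveform = [0.8 * math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
codes = [encode(s) for s in waveform]        # discrete "tokens"
reconstructed = [decode(c) for c in codes]   # approximate audio again
```

The point of the round trip is that speech becomes a list of integers and can be turned back into sound with only a small loss, which is what lets a text-style language model work on audio.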
But what's new about this technology, you may ask? And with good reason. Text-to-speech (TTS) systems have been around for a while. Free TTS tools include Natural Reader; WordTalk; ReadLoud; Listen (which uses Google's TTS application programming interface (API) to convert short snippets of text into natural-sounding synthetic speech); Free TTS (again from Google); Watson Text to Speech (a tool from IBM which supports a variety of voices in different languages and dialects); and Neosapience (which allows users to write out the emotion they want virtual actors to use when speaking).
That said, TTS tools typically require high-quality studio-recorded annotated audio from different speakers with different styles and emotions for commercial applications. The models also typically need at least 30 minutes of such data.
VALL-E, on the other hand, can replicate the emotions (anger, sleepiness, amusement, or even disgust) and tone of a speaker, even when generating words that the original speaker never said, all from just a 3-second clip used as a prompt. However, unlike ChatGPT, which took the world by storm, VALL-E is currently not open for public use. You can sample some of the output that has been made available on GitHub.
In addition to preserving a speaker's vocal timbre and emotions, the paper underscores that VALL-E can also imitate the "acoustic environment" of the sample audio. In other words, if the sample comes from a telephone call, the synthesized output would reproduce the acoustic and frequency properties of that call. Simply put, it will sound very much like a telephone call.
According to the published paper cited above, "Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity." Zero-shot learning is an AI model's ability to complete a task without having seen any training examples of that specific task or category, such as a speaker's voice it has never encountered. It is an ability that comes naturally to humans, whose brains handle such tasks without much effort.
Demystifying the tech
Machines typically require labelled data (such as cat, dog, camel or mouse) as inputs for learning. However, labelling data can be tedious, impractical and expensive in many cases, such as labelling more than two million animal species. It can also lead to many manual errors. Zero-shot learning does not require labelled examples of every category. Facial recognition in computer vision, for instance, is an example of the related one-shot learning, where a model learns from a single example.
Zero-shot learning is most useful when you do not have any training data that the machine learning algorithm can use to identify or classify the object. As an example, if a child is told that a zebra resembles a horse but has black and white stripes, the child is likely to recognize a zebra when it sees one for the first time. The zero-shot learning process is similar.
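The zebra example can be sketched in a few lines of code. This is a minimal toy, assuming each animal has already been reduced to a handful of yes/no attributes. The attribute names, the class descriptions and the function `zero_shot_classify` are all hypothetical, and real systems learn far richer descriptions than this.

```python
# Hypothetical attributes, in order: horse-shaped, striped, has a trunk.
CLASS_DESCRIPTIONS = {
    "horse":    (1, 0, 0),
    "zebra":    (1, 1, 0),  # no example ever seen, only described in words
    "elephant": (0, 0, 1),
}

def zero_shot_classify(observed, descriptions=CLASS_DESCRIPTIONS):
    """Return the class whose description differs least from what was observed."""
    def mismatches(label):
        return sum(a != b for a, b in zip(observed, descriptions[label]))
    return min(descriptions, key=mismatches)

# An unfamiliar animal that is horse-shaped and striped, with no trunk.
print(zero_shot_classify((1, 1, 0)))  # prints "zebra"
```

The classifier has never been shown a zebra. It picks the right label only because the description "like a horse, but striped" matches what it observes, which is the same shortcut the child takes.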
To be sure, Peregrine is already modelled on large language models such as DALL-E and GPT (Generative Pre-Trained Transformer) and already offers voice cloning, though not through a public API. On its website, it says: "Although our model is powerful enough to clone a voice with an audio recording as small as 30 seconds, we prefer to use studio-quality audio at least an hour in length." You can already sample cloned voices of Elon Musk, John F Kennedy, and even Kevin Hart on its website.
Ethical concerns
By now, it's certain that generative machine learning (ML) models such as ChatGPT and Hugging Face's Bloom (text generation), DALL-E and Stable Diffusion (image generation), and RunwayML (video generation) are bound to give voice actors and content creators, especially in the audiobook, movie, advertising and media sectors, a run for their money. But there's also the bigger concern over cloned voices being used for mischief and impersonation: defrauding bank customers, making hoax calls to airports about a bomb, or impersonating a celebrity for ransom.
To be sure, all websites put up the boilerplate, which effectively says that they "do not tolerate content created for unlawful, hateful, obscene, pornography, or intends to impersonate another person". But that is merely a statement of intent, which is hard to enforce in the real world.
Peregrine, on its part, insists that "None of the voices is ever cloned without the written consent of the individual whose voice is being cloned. In some cases, we have cloned a few celebrity voices, including Joe Rogan, Steve Jobs, and Elon Musk. However, these voices are not available for public use. The reason behind cloning these voices was solely to showcase the capability of our technology and how natural the voices can be. Our voice cloning technology is not available through a public API. This is a manual process wherein we handle each request individually and allocate time and resources to ensure the voice being cloned is not used for any unethical purposes."
The VALL-E paper says: "The experiments in this work were carried out under the assumption that the user of the model is the target speaker and has been approved by the speaker. However, when the model is generalized to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech."
This will also lead to a race to counter AI-generated content with AI-detection software in a bid to prevent its misuse. But that's a topic for another day.
Big tech reaps the dividends
Meanwhile, such AI tools are helping big tech companies reap the dividends. For instance, Microsoft is said to be in talks to invest $10 billion in OpenAI, according to a January 10 article in media outlet Semafor which cited unnamed sources. It added that both Microsoft and OpenAI declined to comment.
The Wall Street Journal reported last week that OpenAI, the company behind ChatGPT, was allowing employees and early investors to sell their shares at a valuation of $29 billion. It was reported earlier that Microsoft, which had invested $1 billion in cash and cloud credits in OpenAI in 2019, was in talks to increase its stake.