OpenAI has recently announced the development of a new voice cloning tool called Voice Engine. This tool uses a 15-second audio sample to generate a clone of a person’s voice, and can create “emotive and realistic voices.” While this technology can be used to improve productivity and aid education, similar tools are often used for malicious purposes, targeting politicians, musicians, businesses and individuals. I (Katie) decided to try out comparable voice cloning software, and to investigate how OpenAI aims to prevent malicious use of Voice Engine.
What is this? This is The Dots, our newsletter about exciting things we find in the world of innovation. We imagine innovation as connecting the dots; putting together a jigsaw. Our puzzle pieces are the pieces of interesting information we absorb in the world, our partners and their products. This newsletter is about these dots we find and connect.
How easy is it to clone someone’s voice currently?
I decided to put current voice cloning technology to the test. Many tools claim to produce voice clones in less than a minute, with varying results.
One example is a website called Speechify, which allows users to clone voices, or use pre-cloned voices such as Snoop Dogg, to read out given text. Using this website, I was able to clone Andrew’s voice from a 30-second audio clip. However, the clone gave him a bit of an American twang.
VALL-E (and its multilingual extension, VALL-E X) is Microsoft’s model, which also claims to generate a realistic voice clone from a short audio clip. While the code has not been released to the general public, there are a few open-source implementations recreated from the details set out in Microsoft’s paper. The clone this produced, however, didn’t sound like Andrew at all.
Finally, ElevenLabs is a website used by professional voice actors to generate a clone of their voice which can then be used for audiobooks and voiceovers, allowing them to generate passive income from the site. They offer the option to create an instant voice clone with a few minutes of audio, or a professional voice clone created using 30 minutes of recordings.
Here is Andrew’s voice clone generated using 10 minutes of audio, which, while still sounding quite American, has picked up the tone Andrew sometimes speaks in:
And here is Andrew’s voice clone generated using 20 minutes of audio, using ElevenLabs professional voice clone:
As you can hear, this sounds almost exactly like Andrew. Thankfully, in order to create these professional voice clones, ElevenLabs requires you to verify your voice by reading out phrases it gives you into your microphone. However, many other voice cloning sites that produce this level of accuracy do not require any voice verification.
As you can hear, the clips created from only a few seconds of audio are actually quite terrible and not realistic at all. It seems a large amount of audio data is needed to replicate someone’s voice convincingly with AI. It will therefore be interesting to see how OpenAI’s Voice Engine compares to ElevenLabs: can it actually produce similar or better voice clones from just 15 seconds of audio, matching the quality of clones built from 20 minutes of recordings?
What can voice cloning be used for?
OpenAI claims this tool has many beneficial use-cases including:
Providing reading assistance
Translating content
Reaching global communities
Supporting people who are non-verbal
Helping patients recover their voice
However, there is a pattern of voice cloning AI being used with malicious intent, so what’s to say Voice Engine won’t be used this way too? Personally, I was disappointed to learn that a viral TikTok audio of a cover of Taylor Swift’s song Style, featuring Harry Styles, was generated by AI. This technology could have huge implications for the music industry, allowing people to take artists’ voices and generate new music with them.
This technology can also be used to influence politics and incite public disorder. For example, Sadiq Khan recently spoke of how deepfake audio of him making inflammatory remarks nearly caused serious disorder in the run-up to Armistice Day. Smear campaigns like this can be used to push political agendas and incite hatred, and attempts to spread such deepfakes widely on social media are likely during the next election.
More recently, voice cloning has been used to scam individuals and businesses, posing as loved ones or company executives in order to extort money from victims.
How could this affect me?
As with everything made for good, someone will find a way to use it for bad. It’s always important to second-guess videos and audio you find on the internet, or unusual calls you receive. If content is not from a verified source, there is a chance a voice clone is in use. While this might seem scary, there are positive uses for this technology too: the potential to grow communities, aid education and rehabilitate those with degenerative speech conditions.
OpenAI is attempting to prevent misuse of Voice Engine by requiring the explicit consent of individuals before their voices are cloned. Users must also be transparent in their use of voice cloning by notifying listeners that the voice they are hearing is synthetic. Watermarks will be embedded in the generated audio, allowing for traceability. Time will tell whether these measures are enough to deter misuse, and with the constant race to develop new AI technology, will future tools be built with the same principles in mind?
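To make the traceability idea concrete, here is a toy sketch of audio watermarking. This is not OpenAI’s actual scheme (which is unpublished); it simply hides a short, made-up identifier in the least-significant bits of 16-bit audio samples, an inaudible change that a detector can later read back. Real watermarks are far more robust, designed to survive compression and re-recording.

```python
# Toy watermark: hide an ID tag in the least-significant bits (LSBs)
# of PCM audio samples, then recover it. Illustrative only.

def embed_watermark(samples, tag):
    """Write each bit of `tag` (bytes) into the LSB of successive samples."""
    bits = [(byte >> i) & 1 for byte in tag for i in range(8)]
    out = list(samples)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit   # changes each sample by at most 1
    return out

def extract_watermark(samples, tag_length):
    """Read `tag_length` bytes back out of the sample LSBs."""
    bits = [s & 1 for s in samples[: tag_length * 8]]
    return bytes(
        sum(bits[b * 8 + i] << i for i in range(8)) for b in range(tag_length)
    )

audio = [1000, -2000, 3000, 250] * 20        # stand-in for real PCM samples
tagged = embed_watermark(audio, b"VE1")      # "VE1" is a hypothetical tag
print(extract_watermark(tagged, 3))          # b'VE1'
```

Because each sample moves by at most one unit out of 65,536, the tag is inaudible, yet anyone with the detector can prove the clip was machine-generated.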
In case you missed it…
These glasses’ lenses contain a small screen that can interact with your environment. Brilliant claims it can translate signs for you in real time and even let you preview different furniture layouts of your room: