Artificial intelligence makes voice cloning easy and ‘the monster is already on the loose’
In a video from a Jan. 25 news report, President Joe Biden talks about tanks. But a manipulated version of the video has racked up hundreds of thousands of views on social media this week, making it appear as if he delivered a speech attacking transgender people.
Digital forensics experts say the video was created using a new generation of artificial intelligence tools that allow anyone to quickly generate audio simulating a person’s voice with just a few clicks. And while the Biden clip may not have fooled most social media users this time, it shows how easy it now is for people to create hateful, disinformation-filled “deepfake” videos that could wreak havoc in the real world.
“Tools like this are basically going to add fuel to the fire,” said Hafiz Malik, a professor of electrical and computer engineering at the University of Michigan who focuses on multimedia forensics. “The monster is already loose.”
The latest generation of these tools arrived last month with the beta launch of ElevenLabs’ speech synthesis platform, which lets users generate realistic audio of anyone’s voice by uploading a few minutes of audio samples and typing in any text for it to say.
The startup says the technology was designed to dub audio in different languages for movies, audiobooks and games while preserving the speaker’s voice and emotions.
Social media users quickly began sharing an AI-generated audio sample of Hillary Clinton reading the same transphobic text featured in the Biden clip, along with fake audio clips of Bill Gates supposedly saying that the COVID-19 vaccine causes AIDS and of actress Emma Watson purportedly reading Hitler’s manifesto “Mein Kampf.”
Shortly thereafter, ElevenLabs tweeted that it was seeing “an increasing number of voice cloning misuse cases” and announced that it was exploring safeguards to curb the abuse. One of the first steps was to make the feature available only to users who provide payment information; initially, anonymous users could access the voice cloning tool for free. The company also claims that, should problems arise, it can trace any generated audio back to its creator.
But even the ability to track creators won’t mitigate the tool’s damage, said Hany Farid, a professor at the University of California, Berkeley who focuses on digital forensics and misinformation.
“The damage is done,” he said.
As an example, Farid said bad actors could move the stock market with fake audio of a top CEO saying profits have fallen. And there’s already a clip on YouTube in which the tool was used to alter a video to make it appear as if Biden were saying the US was going to launch a nuclear attack on Russia.
Free and open-source software with the same capabilities has also appeared online, meaning paywalls on commercial tools aren’t a barrier. Using a free online model, the AP generated audio samples that sounded like actors Daniel Craig and Jennifer Lawrence in just a few minutes.
“The question is where to point the finger and how to put the genie back in the bottle,” said Malik. “We can’t do that.”
When deepfakes first made headlines about five years ago, they were easy to spot because the subject didn’t blink and the audio sounded robotic. That is no longer the case as tools become more sophisticated.
The altered video of Biden making derogatory comments about transgender people, for example, combined the AI-generated audio with an actual clip of the president, taken from a live Jan. 25 CNN broadcast covering the announced US deployment of tanks to Ukraine. Biden’s mouth was manipulated in the video to match the audio. While most Twitter users realized the content wasn’t something Biden was likely to say, they were still shocked at how realistic it seemed. Others seemed to think it was real, or at least didn’t know what to believe.
Hollywood studios have long had the ability to distort reality, but access to this technology has been democratized without considering the implications, Farid said.
“It’s a combination of the very, very powerful AI-based technology, the ease of use, and the fact that the model seems to be: let’s put it out on the internet and see what happens next,” Farid said.
Audio is just one area where AI-generated misinformation poses a threat.
Free online AI image generators like Midjourney and DALL-E can produce photorealistic images of war and natural disasters in the style of legacy media outlets from a simple text prompt. Last month, some US school districts began blocking ChatGPT, which can generate text on demand, such as student assignments.
ElevenLabs did not respond to a request for comment.