VoiceLoop can clone your voice. What does that mean for voice synthesis attacks?
The rapid advancement of voice-augmented AI is just as rapidly increasing our vulnerability footprint.

I can't sing, and I'm ok with that. I have been to karaoke a few times, but aside from knowing the words to a few Beatles songs here and there, I'm pretty terrible. I won't quit my day job.

Thankfully, if I ever wanted to try to cut an album, maybe I don't have to be good. Yes, auto-tuning has always been a thing. But VoiceLoop takes that one step further. To wit, VoiceLoop (per the linked paper) can:

  • Transform Text to Real-World Voices - VoiceLoop can convert text into speech using voices that are sampled from real-world scenarios, including from public speeches on YouTube, even if there's challenging background noise or other audio
  • Handle Inaccurate Transcripts and Multiple Speakers - the system can work with inaccurate automatic transcripts, and can manage audio that contains multiple speakers, distinguishing between different voices even if from low-quality recordings
  • Personalize Voices - VoiceLoop incorporates the identity of the speaker into its process, stored as a learned embedding for known individuals, or fitted for new voices. This ensures that the synthesized speech retains the unique characteristics of the original speaker

Thankfully, maybe now I can create a K-pop album, despite speaking zero words of Korean.

I don't think BTS has to be too worried about my (lack of) singing talent. Attr: Forbes

Malicious potential of voice synthesis

The risk, however, is that as voice technology finds more and better uses in everything from content creation to account validation to interviewing, advanced voice synthesis can also be put to malicious purposes, such as creating deepfakes or impersonating individuals for fraud.

A simple example: a system that requires an authenticated voice for activation. VoiceLoop could be used to impersonate a victim on a fraudulent phone call and trick targets into handing over personal information for identity theft.

Probably slightly less benign than whatever this is. Attr: 3Dgifartist via Tenor.com

The authors of the paper VSMask: Defending Against Voice Synthesis Attack via Real-Time Predictive Perturbation - Yuanda Wang, Hanqing Guo, Guangjing Wang, Bocheng Chen, and Qiben Yan, all of Michigan State University - introduce "VSMask," a real-time protection mechanism against voice synthesis attacks. Unlike existing defense mechanisms, which require substantial computation time and access to the entire speech sample, VSMask is designed to protect live speech streams, such as voice messages or online meetings.

How VSMask and perturbation can protect from fraud, high level

The paper's experiments indicate that VSMask can effectively defend against three popular voice synthesis attacks, ensuring that the synthetic voice can fool neither speaker verification models nor human ears. Those attacks are conducted via:

  1. Voice Assistants - attackers use synthesized voice to bypass speaker verification processes and control voice assistants, such as Siri or Google Assistant
  2. App Authentication - Automatic Speaker Verification (ASV) systems integrated into popular apps can be deceived by high-quality synthetic voices. For instance, apps like WeChat, which require voice verification, are vulnerable to voice synthesis attacks
  3. Fraudulent Phone Calls - attackers can use synthetic voice to impersonate victims in fraud phone calls, making it challenging for human ears to differentiate between genuine and synthetic speech

Nooootttt quite that easy, but a step in the right direction. Attr: felixbeilharz via Tenor.com

VSMask introduces a universal perturbation tailored for any speech input to shield real-time speech in its entirety. Here is what that means.

A "universal perturbation" is a small alteration or noise added to input data (in this case, voice data) that can cause a machine learning model to misclassify or misinterpret the data. The term "universal" implies that this perturbation is not tailored to a specific input but can be applied broadly to various inputs and still cause the desired effect.

In the context of VSMask and voice synthesis attacks, a universal perturbation is introduced to the voice data to prevent voice synthesis models from accurately impersonating the voice, thereby providing a layer of protection against such attacks.
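To make the idea concrete, here is a minimal sketch of what "universal" means in practice: one fixed perturbation pattern is applied to any input, rather than a new one computed per clip. Everything here is illustrative - in VSMask the perturbation is learned against voice synthesis models, while this sketch just uses small random noise as a stand-in, and the sample rate, scale, and `protect` helper are assumptions for the example.

```python
import numpy as np

# Assumed parameters for illustration: 1-second clips at 16 kHz.
SAMPLE_RATE = 16_000
rng = np.random.default_rng(0)

# A "universal" perturbation is one FIXED pattern reused across inputs.
# Stand-in: tiny random noise (the real thing is optimized, not random).
delta = rng.uniform(-0.002, 0.002, size=SAMPLE_RATE)

def protect(waveform: np.ndarray) -> np.ndarray:
    """Add the same universal perturbation to any 1-second clip,
    clipping back to the valid audio range [-1, 1]."""
    return np.clip(waveform + delta, -1.0, 1.0)

# Stand-in for a live speech clip.
speech = rng.uniform(-0.5, 0.5, size=SAMPLE_RATE)
protected = protect(speech)
```

The key property is that `delta` never changes between inputs, which is what makes real-time protection feasible: there is no per-clip optimization to wait for.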

I do, but maybe the machine doesn't quite. Attr: User dcjeffrey via

By implementing a weight-based perturbation constraint, VSMask minimizes audio distortion within the protected speech. This ensures that voice assistants and automatic speaker verification (ASV) systems can differentiate between genuine and synthesized voices, thereby enhancing account security.
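One way to picture a weight-based constraint: instead of one flat bound on the perturbation, each frequency bin gets its own budget, so bins where the ear is more sensitive absorb less of the noise. The weights and budget below are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical per-frequency sensitivity weights: high weight = the ear
# is sensitive there, so the perturbation budget must be tight.
n_bins = 8
weights = np.linspace(1.0, 0.2, n_bins)  # assumed, high -> low sensitivity
base_budget = 0.01                        # assumed overall perturbation scale

def constrain(perturbation_spectrum: np.ndarray) -> np.ndarray:
    """Clip each frequency bin of a perturbation to its weighted budget."""
    budgets = base_budget / weights       # sensitive bins get tighter budgets
    return np.clip(perturbation_spectrum, -budgets, budgets)

raw = np.full(n_bins, 0.05)               # an unconstrained perturbation
constrained = constrain(raw)
```

The effect is that audible distortion stays low where it would be most noticeable, while less perceptible bins can carry more of the protective noise.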

How perturbation and VSMask work practically

It works the following way:

  1. Enrollment - the customer provides voice samples, which are used to create a voice profile, stored in an authentication system
  2. Authentication Process - when a customer calls in, they speak naturally. The system captures the live voice and introduces the perturbation in real time, then compares the perturbed live voice to the stored voice profile (which is kept without perturbation). A genuine voice reacts to the perturbation in a known, expected way. A malicious actor's synthesized voice won't have the exact characteristics of the genuine voice, so when the system introduces the perturbation, the fake voice reacts differently; that difference is what flags it as fake. The perturbation is designed to be subtle, so it doesn't significantly alter the voice's quality. If the live voice data doesn't match the expected pattern, it's flagged as potentially fake or synthesized
  3. User Experience - the customer doesn't need to do anything special. They simply speak as they normally would. The system handles the perturbation and comparison processes in the background.
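The steps above can be sketched as a toy enrollment-and-verification flow. This omits the real-time perturbation step and uses a made-up averaging "embedding" and an assumed similarity threshold purely so the example runs; a real system would use a trained speaker-embedding model.

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(voice: np.ndarray) -> np.ndarray:
    """Toy speaker embedding: mean over fixed-length frames.
    Stand-in for a real speaker-verification model."""
    return voice.reshape(-1, 10).mean(axis=1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Enrollment: store the genuine speaker's voice profile.
genuine_voice = rng.normal(size=1000)
profile = embed(genuine_voice)

# 2. Verification: the same speaker sounds nearly identical to their
#    profile; an impersonating synthetic voice drifts away from it.
live_genuine = genuine_voice + rng.normal(scale=0.05, size=1000)
synthetic = rng.normal(size=1000)

THRESHOLD = 0.8  # assumed decision threshold
accept_genuine = cosine(embed(live_genuine), profile) > THRESHOLD
accept_synthetic = cosine(embed(synthetic), profile) > THRESHOLD
print(accept_genuine, accept_synthetic)
```

The customer-facing step 3 is exactly why this design works: all of the comparison machinery runs in the background, and the caller just talks.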

So, in essence, the perturbation acts as a challenge to the live voice data. The system knows how an authentic voice should sound when perturbed, and uses this knowledge to differentiate between genuine and potentially fraudulent voice samples. The customer's authentic voice, when perturbed, should match the expected pattern the system has on file. A malicious actor's synthesized voice, however, will not. It's basically a voice watermark.

Conclusion

The paper underscores the effectiveness of VSMask in defending against voice synthesis attacks. As voice synthesis technology continues to evolve and becomes more sophisticated, the need for robust defense mechanisms like VSMask becomes paramount.

Cybercrime Magazine estimates that cumulative cybersecurity spending will rise sharply over the next several years, and there is little reason to expect either those protection costs or the cost of the damage itself to stop climbing.

The costs to protect against cybercrime are going up. Attr: Cybercrime Magazine

Because, per the World Economic Forum back in 2019:

The cost of attacks is going up (and this is relatively old data). Attr: World Economic Forum

So as synthetic voice technology advances, malicious actors will continue to find innovative ways to impersonate and deceive, making it imperative for individuals and businesses to prioritize voice security measures. In an era where voice is increasingly used for authentication and communication, staying vigilant and informed about the potential risks of voice synthesis attacks is crucial for safeguarding personal and corporate assets.

#ai #ml #aiml #Cybersecurity #SyntheticVoice #VoiceSecurity #Deepfakes #VoiceAuthentication #Cybercrime #VoiceImpersonation #AIsecurity #VoiceTech #DigitalFraud #VoiceSpoofing #TechSafety #VoiceVerification #DeepfakeDetection #CyberThreats #VoiceHacking #DigitalIdentity #VoicePhishing #TechInnovation #SecureCommunication

More articles by Kristian Vybiral
