In one example, a 15-second snippet of a human voice – someone reading a science lesson for children – is given to the model, which then applies it to five different written lessons. The human never read these lessons, but the output audio sounds exactly as though they had.
Yet the original source recording itself sounds compressed, which makes it hard to judge the clarity of the output. And the reader is giving a slow, deliberate and distinctive read, which is potentially ideal for the model to copy. The same can be said for all five of the given examples, so we don’t know how good the model is at producing a conversational tone, or whether it can apply different tones to its output.
In its blog, OpenAI said the model is being tested by a small number of trusted partners under strictly controlled conditions, and that it hasn’t decided when – or if – it will become available to the public. It said it is providing these details in hopes of starting a conversation about responsible use of the technology.
What could go wrong if this or similar technology was made public?
The first danger you probably think of when learning of this technology is misinformation, and that’s a real concern.
Assuming it works as well as OpenAI says, a bad actor could take just 15 seconds of speech from any person, and create a recording of them saying almost anything. For prominent people, such as celebrities and politicians, you could find all the training input you need with a simple Google search.
Granted, making it sound like the prime minister is saying something controversial and then posting the audio clip to a random social media account is not likely to be the most effective misinformation. However, with a bit of effort, you could embed the false voice clip into a wider interview, or even dub it into a video.
Combined with OpenAI’s video generation model Sora, you could conceivably fake an entire video with dialogue, although right now, Sora output is typically filled with tell-tale errors, and I wouldn’t be surprised if Voice Engine is the same.
Even if the result isn’t perfect, or sounds a bit weird, the technology could still be used to generate effective misinformation.
Much simpler fakes, including obviously photoshopped or altered images, video with its speed modified, and manually doctored audio, have been used before to hurt public perception of politicians. It’s especially dangerous when you consider the willingness of some online channels and influencers to promote and spread content that suits their political purposes, regardless of the content’s origin or any verification.
Another danger many will jump to is scamming. But while crooks will always jump on any technological advantage, I’m not convinced Voice Engine would be a huge boon for them.
Theoretically, scammers could use the new tech to disguise accents, speaking any language naturally to sound like a local, but it’s unclear how they could do it fluidly in a real-time conversation. They could also use a voice clone to read text output from a chatbot, automating scams that trick people into giving up their personal information. But this is already possible: the groundbreaking aspect of Voice Engine is having the bot sound like a specific person.
Could a scammer call you with a bot that sounds like your daughter using Voice Engine? Or one that sounds like your boss? Potentially. But they would need to collect a lot of information first, would be calling from an unfamiliar number, and would risk saying something weird to tip you off. They may be better off sticking with email and text message versions of their scams.
Many of these challenges could be overcome in an eventual consumer version of OpenAI’s Voice Engine. For example, apps could require more than 15 seconds of audio, and could require the speaker to read specific words or phrases to confirm they are a real person and not a recording.
OpenAI could also embed audio watermarks in all generated speech for easy detection, and your smartphone could alert you if someone calls you using it.
OpenAI has also suggested a “no-go voice list” that would mean systems decline to build models of prominent people’s voices.
What legitimate function could it serve?
Amid all the panic, doom and gloom that seems to be our first instinct when talking about AI, it’s worth remembering that this technology does have the potential to do good.
Turning any text into human-like speech has an obvious accessibility benefit, as does instantaneous translation. As it stands, the world’s information largely exists in various buckets, with access determined by a person’s language or ability to read, see or hear. AI could make it all available to everyone.
OpenAI’s Voice Engine has some unique potential benefits. For example, anyone who writes content could train a model of their voice in seconds, then make an audio version of their work available to anyone who prefers to consume it that way. The result could be read emotively in their own voice, rather than by a generic robot voice. Obviously, a recorded version would sound better, but it could take hours longer to produce.
Additionally, the spoken content could be translated into any language but still read with the original author’s voice. This could also be applied to content that was originally spoken – for example, to make TV commentary, public speeches, videos or podcasts available in every language with little additional work.
It would be especially useful for people whose primary language isn’t one of the world’s most widely spoken, as this process could give them access to a huge amount of information and entertainment. In an example given by OpenAI, a community health organisation provides advice on nutrition to breastfeeding mothers, which is translated into the informal Kenyan language Sheng and played aloud.
Last year, Apple unveiled an AI application that lets people train a model to use as a personal text-to-speech voice, and Voice Engine could be used for a similar purpose.
Those who are entirely non-verbal could have someone create a voice model that reflects their culture and regional accent. In another OpenAI example, a person who is losing the ability to speak because of a brain tumour was able to train a voice model using an old recording, so her text-to-speech voice sounds like her younger self.
What’s likely to happen now?
Whether or not the technology is as good as OpenAI says, and whether or not the company releases it to the public, it’s clear that convincing text-to-speech in any human’s voice will eventually be possible, so there are a number of things we need to be thinking about.
Obviously, any security that relies on voice verification should be reconsidered, and we should start being wary of believing a person said something purely because we heard a recording that sounds like them. As with photos, audio recordings and videos of speech should be treated with scepticism – unless you can verify a trustworthy source.
Even though I’m not convinced that AI voices will make an effective tool for scammers pretending to be their victims’ loved ones, the development reinforces the need to practise the precautions we should all be taking already: if someone calls you from an unfamiliar number, don’t agree to give them anything.
It will also be crucial to develop methods that can identify AI-generated audio, as well as images, and track their provenance. This technology, for better or worse, will likely come from the same labs developing the generative capabilities in the first place.