Meta presents Voicebox, an AI capable of reproducing any human voice

Meta has just officially presented Voicebox, an artificial intelligence specialized in voice synthesis. The model can convert text into audio and generate speech in a given voice from a sample only two seconds long.

As you know, the major tech players have joined the AI race. After the launch of ChatGPT at the end of 2022 and Microsoft's $10 billion investment in the startup OpenAI, the web giants rushed to present their own artificial intelligence in turn.

Google distinguished itself with Bard, its conversational AI, while Meta confirmed the development of its AI in April 2023. In recent months, the Menlo Park firm has published a multitude of AI models, starting with LLaMA (Large Language Model Meta AI), an open-source language model.

Recently, the Californian company also unveiled JEPA, a model that aims to reproduce human thought, in particular by analyzing and understanding abstract notions and concepts. In a completely different area, Meta also presented MusicGen, an AI capable of creating music from a basic text description.

Meta unveils Voicebox, the AI capable of imitating the human voice

On June 16, 2023, Meta announced “its new breakthrough in the field of generative AI for speech”. This AI is Voicebox, a state-of-the-art model specialized in voice synthesis. In other words, it can create, edit or restyle audio files.

First, let’s tackle Voicebox’s most interesting (and probably most problematic) feature: in-context text-to-speech synthesis. From an audio extract only two seconds long, Voicebox can generate speech that simulates the voice and phrasing of the person heard in the extract.

In this way, Voicebox can simulate the voice of a relative, a singer or a politician. In the future, Meta says Voicebox and similar generative AI models will be able to give natural voices to voice assistants or NPCs in the metaverse. They could also allow visually impaired people to hear written messages read in the voices of their friends.


Editing of audio files and instant translation

But that’s not all, since Voicebox offers other features:

Audio editing and noise reduction: Voicebox can recreate a portion of speech interrupted by noise, or replace slurred and mispronounced words, without having to re-record the whole speech (a sort of audio counterpart to Google's Magic Eraser)
Multilingual translation: Voicebox currently supports six languages (English, French, Spanish, German, Polish and Portuguese), which allows it to render speech in a language other than that of the original file while preserving its style and nuances

To carry out these various tasks, Meta’s AI was trained on more than 50,000 hours of audio extracts, mostly from audiobooks and royalty-free content. For the moment, Voicebox remains inaccessible to the general public for safety reasons. Unsurprisingly, Meta is worried about its AI being misused, in particular to mimic the voices of real people.

Source: Meta