Voicebox: Meta brings AI to text-to-speech

Meta has not stopped showing off its work in artificial intelligence over the last few months, and Voicebox is just the latest in a growing list of examples. With the metaverse hype deflated (internal hype, within the company, because from the outside expectations were never too high), it seems that the company has decided to focus its efforts on other areas with greater interest and growth potential, something that its accounts and its investors will undoubtedly appreciate.

As I said, Meta has shown a lot of interest in artificial intelligence for quite some time, but it wasn’t until the boom in this technology, driven especially by generative models, that it decided to start publishing papers and interesting project samples, in some cases also allowing the models to be downloaded. I cannot help but relate this to Yann LeCun’s statements at the end of January, in which he said that ChatGPT was not that innovative. A statement that, of course, made us wonder what they were working on.

Since then we have seen the presentation and release of LLaMA (Large Language Model Meta AI) and the SAM image segmentation tool, among others, in addition to the most common approaches to artificial intelligence today, such as the chatbot that will soon arrive on Instagram. Thus, it would be unfair not to recognize that Meta is managing to position itself as a technology company to be reckoned with when we talk about artificial intelligence.

The most recent example of this is Voicebox, an AI model that turns text into speech. Tools of this kind have been around for a long time, but until now most of them have relied on a huge volume of samples, which are used to compose each text-to-speech conversion. This gives reasonable results, but it is common to find odd intonations and similar effects.

Voicebox has been trained on over 50,000 hours of unfiltered audio. As we can read on its website, Meta used recordings and transcripts of public domain audiobooks read in English, French, Spanish, German, Polish and Portuguese. Thanks to this training, the model is capable of generating truly realistic narrations, as well as taking an existing recording with background noise and returning a clean version of it.

Speech synthesis is, indeed, a very active field in the world of artificial intelligence. Voicebox is just the latest example, but recently we’ve also heard about VALL-E, a model created by Microsoft that is capable of imitating voices, with the possibilities and risks this poses, and about Apple’s plans to generate audiobooks from the original texts.