After changing its name to Meta, Facebook’s metaverse vision is gradually taking shape. This time, the company has set its sights on social interaction in the metaverse.
Meta releases speech processing model XLS-R
Meta recently released XLS-R, a new self-supervised model for a variety of speech tasks. XLS-R is reportedly trained on a large volume of public data (ten times the amount used previously) and more than doubles the language coverage of earlier multilingual models. XLS-R currently supports 128 languages.
Meta believes that voice communication is the most natural form of interaction for people. “With the development of voice technology, we have been able to directly interact with our own devices and the future virtual world through dialogue, thus integrating the virtual experience with the real world.”
This echoes Zuckerberg’s earlier statement that “the company’s business will give priority to the metaverse”. Zuckerberg had previously outlined his plan to build a “metaverse”: a digital world layered on top of our own, spanning virtual reality and augmented reality. “We believe that the metaverse will be the successor to the mobile Internet.”
XLS-R, as an indispensable part of social interaction in the metaverse, can help people with different native languages converse without barriers.
Notably, to achieve broad speech understanding across many languages with a single model, Meta fine-tuned XLS-R for tasks such as speech recognition, speech translation, and language identification. XLS-R reportedly achieved good results on the BABEL, CommonVoice, and VoxPopuli speech recognition benchmarks, the CoVoST-2 benchmark for translation from foreign languages into English, and the VoxLingua107 language identification benchmark.
To lower the barrier to access as far as possible, Meta and Hugging Face have jointly released the model itself, which is also fully open through the fairseq GitHub repo.
XLS-R is reportedly built on wav2vec 2.0 and was trained on more than 436,000 hours of public speech recordings, using self-supervised learning to learn speech representations. That is 10 times the training data of XLSR-53, the strongest model released last year. Drawing on speech data sources ranging from meeting recordings to audiobooks, XLS-R expands language coverage to 128 languages, nearly 2.5 times as many as the previous model.
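The "nearly 2.5 times" figure can be checked with a line of arithmetic. The sketch below assumes that the earlier model, XLSR-53, covered 53 languages, as its name suggests; the article itself does not state that number.

```python
# Language counts used for the comparison.
XLS_R_LANGUAGES = 128    # stated in the article
XLSR_53_LANGUAGES = 53   # assumption: inferred from the "-53" in the model name


def coverage_ratio(new_count: int, old_count: int) -> float:
    """Return how many times more languages the new model covers."""
    if old_count <= 0:
        raise ValueError("old_count must be positive")
    return new_count / old_count


ratio = coverage_ratio(XLS_R_LANGUAGES, XLSR_53_LANGUAGES)

if __name__ == "__main__":
    print(f"XLS-R covers about {ratio:.2f}x as many languages as XLSR-53")
```

Under that assumption the ratio comes out a little above 2.4, which is consistent with "nearly 2.5 times".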
The largest XLS-R model contains more than 2 billion parameters and performs much better than smaller models of the same kind. Meta said this shows that more parameters allow the model to represent the many languages in the data set more fully. Meta also found that larger models outperform smaller ones when pre-trained on a single language.
Meta evaluated XLS-R on four major multilingual speech recognition benchmarks and found that it outperformed previous models on most of the 37 languages tested. Specifically, the tests covered 5 languages from BABEL, 10 languages from CommonVoice, 8 languages from MLS, and 14 languages from VoxPopuli.
Word error rate benchmark results on BABEL. XLS-R is a significant improvement over previous models.
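For readers unfamiliar with the metric, word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words; lower is better. The following is a minimal sketch of that standard definition, not Meta's evaluation code. The function name and the example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"  # one substitution, one deletion
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

In the example, two errors against six reference words give a WER of one third. Published benchmarks usually also normalize text (casing, punctuation) before scoring, which this sketch leaves out.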
Meta also evaluated XLS-R on speech translation, in which a speech recording is translated directly into another language. To build a single model that can perform multiple tasks, Meta fine-tuned XLS-R on several translation directions of the CoVoST-2 benchmark, so that it can translate between English and up to 21 languages.
Using XLS-R to encode languages other than English yields significant performance gains, which is a major step forward for multilingual speech representations. According to Meta, XLS-R brings significant improvements for low-resource languages, such as Indonesian-to-English translation, where the BLEU score doubled on average. A higher BLEU score means that the model’s automatic translations overlap more closely with human translations of the same content, which represents a big step forward in the model’s speech translation ability.
Accuracy of automatic speech translation measured by BLEU, where higher values indicate greater accuracy, when XLS-R translates speech recordings from high-resource languages (such as French, German), medium-resource languages (such as Russian, Portuguese), or low-resource languages (such as Tamil, Turkish) into English.
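The "overlap" that BLEU measures can be made concrete with a simplified, single-sentence version of the metric: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty for translations that are too short. This is only an illustrative sketch. Reported BLEU scores are computed over a whole test corpus, usually on a 0 to 100 scale and with standardized tokenization, and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU in the range 0..1 (no smoothing)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not hyp:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each n-gram's count so repeats cannot match more often
        # than they occur in the human reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any empty overlap gives zero
        log_precision_sum += math.log(matched / total)

    # Penalize hypotheses shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    human = "the cat sat on the mat"
    machine = "the cat sat on a mat"
    print(f"BLEU = {simple_bleu(human, machine):.3f}")
```

A translation identical to the human reference scores 1.0, and one sharing no words with it scores 0.0, so a doubled score reflects substantially more shared word sequences.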
Meta believes XLS-R shows that scaling up cross-lingual pre-training can further improve performance on low-resource languages. It not only improves speech recognition accuracy, but also more than doubles the accuracy of speech translation from foreign languages into English.
“XLS-R is an important step towards our goal of understanding speech in many different languages with a single model, and it also represents our greatest effort in using public data to advance multilingual pre-training. We firmly believe this is the right direction to explore: it will allow machine learning applications to better understand all human speech and promote follow-up research, greatly lowering the barrier to using speech technology around the world, especially in underserved communities. We will continue to develop new methods, expand the model’s language understanding through learning with less supervision, gradually extend it to the more than 7,000 languages spoken around the world, and keep updating the algorithm,” Meta said.
Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/how-to-make-friends-in-the-metaverse-meta-releases-a-voice-model-for-cross-language-communication-supporting-barrier-free-dialogues-in-128-languages/