Together with related AI work, RSC will pave the way for the construction of a “Metaverse”.
[Introduction] From “The Matrix” to “Westworld”, countless works of fiction have envisioned super AIs running on supercomputers, controlling virtual worlds and ruling the earth. The AI supercomputer RSC, announced by Meta on January 25, 2022, seems to be heading in that direction.
On January 25, Meta and Nvidia officially unveiled a new supercomputer, the “AI Research SuperCluster” (RSC for short).
Meta’s plan is also very “simple”. First, scale up the models behind computer vision, NLP, speech recognition and other technologies until their parameter counts approach the trillions.
These models can work across hundreds of languages; seamlessly analyze text, images, and video; power new augmented reality tools; and more.
Then build entirely new AI systems on top of them: for example, real-time speech translation for people who speak different languages, so that everyone can collaborate seamlessly on research projects or play AR games together.
Ultimately, RSC will work with related AI efforts to pave the way for the construction of a “Metaverse”.
It is worth mentioning that it took RSC only 18 months to go from an idea on paper to a working machine.
Set to be the largest
Since Facebook officially established its artificial intelligence laboratory in 2013, Meta has made significant progress in AI.
Examples include self-supervised learning, which learns from large quantities of unlabeled samples, and Transformers, which let AI models reason more efficiently.
However, to take full advantage of self-supervised learning and Transformer-based models, whether for vision, speech, language, or identifying key information, ever larger and more complex models need to be trained.
Computer vision needs to process larger, longer videos at higher data sampling rates. Speech recognition needs to work well even in challenging scenarios with heavy background noise, such as parties or concerts. NLP needs to understand more languages, dialects, and accents.
Advances in other areas, including robotics, embodied AI, and multimodal AI, will let systems perform useful real-world tasks.
To this end, Meta built its first generation of AI supercomputing infrastructure in 2017: a cluster of 22,000 NVIDIA V100 Tensor Core GPUs executing 35,000 training jobs per day.
At the beginning of 2020, Meta decided to design a new supercomputer from scratch, one that could train models with more than a trillion parameters on datasets as large as an exabyte, the equivalent of 36,000 years of high-quality video.
In any case, parameter counts in neural networks keep soaring: the natural language model GPT-3, for example, has 175 billion parameters, and running such super-sized neural networks is precisely what a supercomputer is for.
RSC’s compute nodes are built from 760 NVIDIA DGX A100 systems, for a total of 6,080 NVIDIA A100 GPUs connected over an NVIDIA Quantum InfiniBand network, delivering nearly 1,895 petaflops of TF32 performance.
RSC’s storage tier comprises 175PB of Pure Storage FlashArray, 46PB of cache storage in Penguin Computing Altus systems, and 10PB of Pure Storage FlashBlade.
20x performance gain
Early benchmarks show that RSC runs computer vision workflows up to 20 times faster than the first-generation cluster, runs the NVIDIA Collective Communication Library (NCCL) more than 9 times faster, and trains large-scale NLP models 3 times faster.
That means a model with tens of billions of parameters can be trained in three weeks, compared to nine weeks before.
In a second phase in 2022, RSC will grow from 6,080 to 16,000 GPUs, boosting AI training performance by more than 2.5 times and making it the fastest AI supercomputer in the world.
At mixed precision, RSC will reach a staggering 5 exaflops. The storage system will be expanded to the exabyte (1 billion GB) scale, with a target bandwidth of 16TB/s.
In addition, the InfiniBand fabric will support 16,000 ports in a two-layer topology with no oversubscription.
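The headline throughput figures above can be sanity-checked against per-GPU peaks. The A100 numbers used below (312 TFLOPS for TF32 with sparsity, 312 TFLOPS for dense FP16/BF16) come from Nvidia’s published datasheet and are an assumption not stated in this article:

```python
# Sanity-check RSC's headline throughput from per-GPU A100 peaks.
# Per-GPU figures are Nvidia A100 datasheet peaks (an assumption,
# not stated in this article).
A100_TF32_SPARSE_TFLOPS = 312   # TF32 with structured sparsity
A100_FP16_DENSE_TFLOPS = 312    # dense FP16/BF16

phase1_gpus = 760 * 8           # 760 DGX A100 systems x 8 GPUs each
phase2_gpus = 16_000            # planned 2022 expansion

phase1_pflops = phase1_gpus * A100_TF32_SPARSE_TFLOPS / 1_000
phase2_eflops = phase2_gpus * A100_FP16_DENSE_TFLOPS / 1_000_000

print(f"Phase 1: {phase1_gpus} GPUs, ~{phase1_pflops:,.0f} PFLOPS TF32")
print(f"Phase 2: {phase2_gpus} GPUs, ~{phase2_eflops:.1f} EFLOPS mixed precision")
```

The arithmetic lands within rounding of the article’s figures: roughly 1,895 petaflops for the 6,080-GPU phase and roughly 5 exaflops for the 16,000-GPU phase.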
For comparison, the largest system in the latest round of MLPerf neural network training benchmarks was a 4,320-GPU system deployed by Nvidia.
It trained the natural language model BERT in less than a minute. But BERT has only 110 million parameters, compared with the trillions Meta’s RSC is targeting.
The gap in scale is striking.
Advantages of supercomputing
The supercomputer’s gigantic scale is necessary for several reasons, said Kevin Lee, the Meta program manager in charge of RSC.
First, Meta’s core business requires continuously processing massive amounts of information, which sets a high floor on data-processing performance.
Second, the datasets used in AI research also have a minimum useful size, because the more complete and complex a dataset, the better the research results.
And the compute required to train an AI model is far greater than the compute required to run one. That is why your smartphone does not need a data center full of servers just to scan your face for authentication.
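The training-versus-inference gap can be made concrete with a common scaling-law rule of thumb: training an N-parameter Transformer costs roughly 6N FLOPs per token, while one inference forward pass costs roughly 2N per token. The sketch below applies this heuristic to a GPT-3-scale model; the 300-billion-token training-set size is GPT-3’s reported figure, an assumption not taken from this article:

```python
# Rough FLOPs comparison for a GPT-3-scale model, using the common
# scaling-law rule of thumb: training ~6*N FLOPs per token, a single
# inference forward pass ~2*N FLOPs per token.
N = 175e9                  # parameters (GPT-3, per the article)
train_tokens = 300e9       # GPT-3's reported training tokens (assumption)

train_flops = 6 * N * train_tokens   # total training compute
infer_flops_per_token = 2 * N        # one token at inference time

print(f"Training: ~{train_flops:.2e} FLOPs total")
print(f"Inference: ~{infer_flops_per_token:.2e} FLOPs per token")
print(f"Training costs ~{train_flops / infer_flops_per_token:.0e}x one inference token")
```

Under this heuristic, the full training run costs about eleven orders of magnitude more compute than generating a single token, which is exactly why training lives in a supercomputer while inference can run near the user.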
Finally, managing all this infrastructure is a major challenge in itself. Consolidating it at large scale reduces fragmentation, simplifying management and improving efficiency in areas such as operations, energy consumption, and floor space.
Stepping up to the Metaverse
Meta did not forget the company’s recent Metaverse theme when it publicly announced the supercomputer.
In October 2021, Facebook officially announced its name change to Meta.
When the “thumbs up” sign at the Silicon Valley headquarters was taken down, the “new era of the Metaverse” officially began.
Meta CEO Mark Zuckerberg wrote in a Facebook post on Monday: “The experiences we’re building for the Metaverse require enormous compute power (quintillions of operations per second!), and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more.”
Meta has also repeatedly stated in its announcements that one purpose of building the supercomputer is to “help realize the company’s Metaverse vision”, and that it is running AI on supercomputing because “AI-driven applications and products will play an important role in the Metaverse.”
Meta said: “We hope this step-up in computing power will not only help us create more accurate AI models for our existing services, but also enable completely new user experiences, especially in the Metaverse… the underlying technology that powers the Metaverse and drives the broader AI community forward.”
There are also concrete products and scenarios that RSC could advance. Beyond the repeated mentions of “reviewing massive amounts of content” and “translating speech in real time for people speaking hundreds of languages,” the augmented reality devices being developed for Facebook and Instagram could also benefit.
Mentioning his own data2vec model, Mark Zuckerberg suggested that combining high-performance AI with AR will enhance the Metaverse experience: “High-performance AI assistants will eventually be built into AR glasses. For example, when a user is cooking and runs low on a seasoning, or the heat is too high, the AI assistant in the AR glasses can pop up a window or voice prompt in time to help the user complete complex tasks.”
How to build an AI supercomputer
Projects such as RSC are designed and built not only with performance targets in mind, but with the best solutions available today for achieving that performance at the widest possible scale.
Collaborate with external partners
All of this infrastructure must be extremely reliable and durable, as Meta estimates that some experiments run for weeks and require thousands of GPUs. And the whole experience of using RSC must be researcher-friendly, so that research teams can easily explore a wide range of AI models.
Much of this is the result of Meta’s collaboration with long-term partners, who also helped design Meta’s first-generation AI infrastructure in 2017.
Penguin Computing, an SGH company, is Meta’s architecture and management services partner, working with Meta’s operations team on hardware integration to deploy clusters and help build major parts of the control plane.
Pure Storage provides Meta with a powerful, scalable storage solution.
Nvidia provides Meta with AI computing technology featuring cutting-edge systems, GPUs and InfiniBand fabrics, as well as software stack components such as NCCL for clusters.
Coping with changing times during development
But the development of RSC also faced unexpected challenges, most notably the Covid-19 pandemic.
Covid-19 made RSC a fully remote project at first, and it took the team about a year and a half to go from a shared design document to a working cluster.
Covid-19 and industry-wide chip shortages also created supply-chain problems, making everything from chips and optical components to GPUs and even construction materials hard to come by, all of which had to be shipped under new safety protocols.
To build the cluster efficiently, Meta’s team had to design it from scratch, creating many new Meta-specific processes and rethinking earlier precedents along the way.
For example, Meta had to write new rules around its data center design, covering cooling, power, rack layout, cabling, and networking (including a new control interface), among other important considerations.
Meta also had to ensure that all teams within the company, from architecture to hardware to software and AI, worked in lockstep with its partners.
AIRStore, developed for the supercomputer
In addition to the core compute itself, an AI supercomputer needs a robust storage solution, one that can deliver terabytes per second of bandwidth from an exabyte-scale storage system.
To meet the growing bandwidth and capacity demands of AI training, Meta has developed a storage service from the ground up, the Artificial Intelligence Research Store (AIRStore).
To optimize for AI models, AIRStore adds a new data preparation phase that preprocesses the dataset used for training. Once preparation is complete, the prepared dataset can be reused across multiple training runs until it expires.
AIRStore also optimizes data transfer to minimize cross-regional traffic on the backbone network between Meta data centers.
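AIRStore’s internals are not public, so the snippet below is only a hedged sketch of the prepare-once, reuse-until-expiry pattern described above. Every name in it (`prepare_dataset`, `PreparedDataset`, the one-week expiry) is hypothetical, not Meta’s actual API:

```python
import time
from dataclasses import dataclass, field

# Hypothetical sketch of the "prepare once, reuse until expiry" pattern
# the article attributes to AIRStore. All names are illustrative.

@dataclass
class PreparedDataset:
    name: str
    samples: list                         # preprocessed training examples
    prepared_at: float = field(default_factory=time.time)
    ttl_seconds: float = 7 * 24 * 3600    # expire after a week (arbitrary)

    def is_fresh(self) -> bool:
        return time.time() - self.prepared_at < self.ttl_seconds

_cache: dict[str, PreparedDataset] = {}

def prepare_dataset(name: str, raw: list) -> PreparedDataset:
    """Preprocess once; later calls reuse the cached copy until it expires."""
    cached = _cache.get(name)
    if cached is not None and cached.is_fresh():
        return cached
    # stand-in for the real preprocessing (decode, shuffle, shard, ...)
    prepared = PreparedDataset(name, [s.lower() for s in raw])
    _cache[name] = prepared
    return prepared

# Two "training runs" share a single preparation pass.
run1 = prepare_dataset("captions", ["A Dog", "A Cat"])
run2 = prepare_dataset("captions", ["A Dog", "A Cat"])
```

The design choice the article describes is the same as this sketch’s: paying the preprocessing cost once and amortizing it across many training runs, rather than re-reading and re-transforming raw data every time.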
Ensure data security
High-performance computing has been solving problems of scale for decades, but controls for security and privacy are also of paramount importance.
To meet privacy and security requirements, data is encrypted end to end along the entire path from the storage system to the GPUs, and is not decrypted until training time. Before being imported into RSC, data must also pass a privacy review process to ensure it has been properly anonymized.
In addition, RSC is isolated from the Internet: it has no direct inbound or outbound connections, and traffic can flow only from Meta’s production data centers.
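The “encrypted until training” data path can be illustrated with a toy sketch: bytes remain ciphertext through storage and transfer, and are decrypted only inside the training-side loader. The SHA-256 XOR keystream below is a deliberately simple stand-in for a real cipher (Meta does not disclose which it uses; production systems would use something like AES-GCM) and must not be used for actual security:

```python
import hashlib

# Toy illustration of "encrypted until training": the byte stream stays
# ciphertext through storage and transfer, and is only decrypted in the
# training-side loader. The XOR keystream is a placeholder for a real
# cipher such as AES-GCM -- not for real security use.

def keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudorandom bytes from key via hashed counter blocks."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with the keystream; applying it twice round-trips the data."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

key = b"training-cluster-key"
plaintext = b"anonymized training batch"

# Storage and network layers only ever see ciphertext.
stored = xor_cipher(key, plaintext)

# Training-side loader: decrypts just before the batch is consumed.
batch = xor_cipher(key, stored)
```

The point of the pattern is that no intermediate hop (storage array, network fabric, cache) ever holds plaintext; only the final consumer, the training job, holds the key.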
Meta’s RSC is arguably the first system to address performance, reliability, security, and privacy at this scale.
Finally, once fully built out, Meta’s RSC will be the largest customer installation of NVIDIA DGX A100 systems.