Microsoft and Nvidia create 105-layer, 530 billion parameter language model that needs 280 A100 GPUs, but it’s….

MT-NLG is more effective than the previous transformer-based systems trained by the two firms, Microsoft's Turing NLG model and Nvidia's Megatron-LM.

When it comes to neural networks, bigger is typically better, although larger models also need to consume more training data. Compared with its predecessors, MT-NLG is superior at a wider range of natural language tasks, such as auto-completing sentences, question answering, and reading comprehension and reasoning.

AI researchers and engineers must develop new tools and techniques to train ever-growing language models.

MT-NLG was trained on Nvidia's Selene machine-learning supercomputer, a system made up of 560 DGX A100 servers, each running eight A100 80GB GPUs.

In total, 4,480 Nvidia GPUs are connected to one another using NVLink and NVSwitch, and each sustained a training throughput of 113 teraFLOPS. It's costly to train these models, even on top-of-the-line hardware, and it takes software tricks to reduce training times.
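For a sense of scale, here is a minimal back-of-the-envelope sketch of how those figures fit together; the per-GPU number is the sustained training throughput quoted above, not the A100's peak rating.

```python
# Back-of-the-envelope check of the cluster figures quoted above. The
# aggregate number is a simple product and ignores real-world efficiency
# losses such as communication overhead.

dgx_servers = 560           # DGX A100 servers in the Selene supercomputer
gpus_per_server = 8         # A100 80GB GPUs per DGX A100 node
per_gpu_teraflops = 113     # sustained training throughput per GPU

total_gpus = dgx_servers * gpus_per_server                    # 4,480 GPUs
aggregate_petaflops = total_gpus * per_gpu_teraflops / 1_000  # ~506 petaFLOPS

print(f"Total GPUs: {total_gpus}")
print(f"Aggregate training throughput: ~{aggregate_petaflops:.0f} petaFLOPS")
```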

“We can operate them within the regime where they are most effective,” wrote Paresh Kharya, senior director of product management and marketing for accelerated computing at NVIDIA, and Ali Alvi, group program manager for the Microsoft Turing team.

“For example, each replica of the 530 billion parameter model spans 280 NVIDIA A100 GPUs, with 8-way tensor-slicing within a node and 35-way pipeline parallelism across nodes.”
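As a rough sketch of how that layout decomposes, the tensor- and pipeline-parallel degrees below come straight from the quote, while the data-parallel degree is simply inferred by dividing the full 4,480-GPU cluster by the size of one model replica; the flag names mentioned in the final comment follow Megatron-LM's general conventions and are not taken from this specific training run.

```python
# The 3D-parallel layout described in the quote, worked out arithmetically.

tensor_parallel = 8        # 8-way tensor slicing within one DGX node (8 GPUs)
pipeline_parallel = 35     # 35-way pipeline parallelism across nodes

gpus_per_replica = tensor_parallel * pipeline_parallel   # 280 GPUs per replica
total_gpus = 4480
data_parallel = total_gpus // gpus_per_replica           # 16 concurrent replicas

print(f"GPUs per model replica: {gpus_per_replica}")
print(f"Data-parallel replicas: {data_parallel}")

# In Megatron-LM-style launch scripts, such a layout is typically expressed
# with flags like --tensor-model-parallel-size 8 and
# --pipeline-model-parallel-size 35, with data parallelism filling whatever
# GPUs remain.
```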

MT-NLG was trained on The Pile, a large dataset compiled by EleutherAI, a grassroots group of AI researchers and engineers working to open-source large language models.
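For readers who want to inspect the corpus themselves, below is a minimal sketch of streaming documents out of a single Pile shard; the shard filename is a placeholder, and the snippet assumes the dataset's usual zstd-compressed JSON Lines layout with each record's document stored under a "text" key.

```python
# Stream documents out of one zstd-compressed JSON Lines shard of The Pile.
# The filename below is a placeholder; point it at a shard you have downloaded.
import io
import json

import zstandard  # third-party: pip install zstandard


def iter_documents(path):
    """Yield raw document strings from a .jsonl.zst shard."""
    with open(path, "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            yield record["text"]


for i, doc in enumerate(iter_documents("pile_shard_00.jsonl.zst")):
    if i == 3:
        break
    print(doc[:80])  # preview the first 80 characters of a few documents
```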

Because of the sheer amount of text, manually reviewing it is impractical, and no automated method yet exists that can reliably cleanse the dataset of toxic language.

“Microsoft and NVIDIA are dedicated to addressing this issue. We encourage future research to assist in quantifying the bias of the model…”