A few years ago, in 2019, Microsoft announced that it would invest $1 billion in OpenAI. As part of the deal, it agreed to develop a next-generation supercomputer for the company's artificial-intelligence research. The only problem was that Microsoft didn't have any systems like the ones OpenAI needed, and its managers weren't entirely sure they could build something that large on the Azure cloud service without causing it to fail.
OpenAI was training ever-larger artificial-intelligence models, which consumed more data and learned more parameters, so it needed access to very powerful cloud computing services for extended periods of time. According to Bloomberg, this led the Redmond team to develop a supercomputer with the power and components necessary to drive the development and operation of, among other things, ChatGPT.
Among other things, they had to work out how to interconnect several tens of thousands of Nvidia A100 graphics cards, which gave the system the processing power needed to develop the chatbot. To do that, they even had to change how the servers were placed in the racks so they wouldn't suffer power outages.
According to Nidhi Chappel, Head of Azure AI Infrastructure, it was because they were able to build the architecture of a system that could operate reliably at very large scale that they were able to develop ChatGPT. And this is only the beginning: many more new AI models will emerge from the project that led to its development.
For his part, Scott Guthrie, Microsoft's Executive Vice President of Cloud and AI, points out that while ChatGPT is the most prominent use case presented so far for the supercomputer they built, it can be adapted to many others. It is therefore not a one-off custom build, but a system designed to advance artificial-intelligence development at Microsoft and OpenAI in general.
Thus, the executive emphasizes that they did not build a bespoke machine: although it started out that way, it evolved into a general-purpose system, because they developed it in such a way that "if someone wants to train a large language model, they can take advantage of the same improvements. This has helped us a lot to become a better cloud for AI in general."
In addition, Guthrie recalled that the model everyone is working with and testing now was made possible by a supercomputer that is already a couple of years old. In fact, the team is already training on its next-generation supercomputer, which according to the executive is "much larger, and will allow for even more sophistication."
Training a large-scale AI model requires a large number of graphics processing units interconnected in one place, as Microsoft did with the system it developed for the AI supercomputer. But once a model is in use, answering all the questions users ask, a process known as inference, a different setup is required.
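The difference the paragraph describes is that training is one huge synchronized job, while inference serves many small, independent requests; a common serving pattern is to batch incoming requests together for throughput. A minimal single-process sketch of that idea (the model and all names are hypothetical stand-ins, not Microsoft's actual stack):

```python
# Sketch of dynamic request batching, the kind of setup inference uses:
# independent user prompts are queued and served together in small batches,
# instead of one giant synchronized training job.
from collections import deque

def fake_model(batch):
    # Stand-in for a real model: "answers" each prompt independently.
    return [f"answer to: {prompt}" for prompt in batch]

class DynamicBatcher:
    def __init__(self, max_batch_size=4):
        self.queue = deque()
        self.max_batch_size = max_batch_size

    def submit(self, prompt):
        # Requests arrive one at a time from different users.
        self.queue.append(prompt)

    def step(self):
        # Pull up to max_batch_size pending requests and serve them as one batch.
        batch = [self.queue.popleft()
                 for _ in range(min(self.max_batch_size, len(self.queue)))]
        return fake_model(batch) if batch else []

batcher = DynamicBatcher()
for p in ["hi", "what is Azure?", "weather", "bye", "thanks"]:
    batcher.submit(p)

first = batcher.step()   # first 4 requests served together
second = batcher.step()  # remaining request
print(first)
print(second)
```

The batch size here trades latency for throughput: larger batches keep the GPU busier, but each request waits longer for its batch to fill.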
Microsoft also deploys graphics chips for inference. These hundreds of thousands of processors are spread across the company's more than 60 data center regions. It is now adding Nvidia's latest graphics chips, the H100, to its systems to handle AI workloads, along with a new version of Nvidia's InfiniBand networking technology to speed up data sharing.
When OpenAI, or Microsoft itself, trains a large AI model, the work is divided among all the GPUs, and at certain points the units need to communicate with each other to share the work they have done. For the AI supercomputer, the development team ensured that the network equipment managing communication between all the chips could handle such a load, and developed software that makes the best possible use of the GPUs and the network equipment. The result is a tool that allows training models with tens of trillions of parameters.
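The pattern described above, splitting the work across GPUs and periodically communicating to combine results, is essentially data-parallel training with an all-reduce step. A minimal single-process sketch, with plain Python lists standing in for real GPUs and the network (all function names are illustrative, not Microsoft's software):

```python
# Sketch of data-parallel training: each simulated "GPU" computes gradients
# on its own shard of the data, then the gradients are averaged (the
# "communication point" the article mentions) so every worker applies the
# same update. Toy 1-parameter model y = w * x with mean-squared-error loss.

def local_gradient(weight, shard):
    # Gradient of MSE, computed only on this worker's shard of the batch.
    n = len(shard)
    return sum(2 * (weight * x - y) * x for x, y in shard) / n

def all_reduce_mean(values):
    # Stand-in for the network all-reduce: average across all workers.
    return sum(values) / len(values)

def training_step(weight, shards, lr=0.01):
    grads = [local_gradient(weight, s) for s in shards]  # parallel in reality
    g = all_reduce_mean(grads)                           # communication point
    return weight - lr * g

# Toy data with true weight 3.0, split across 4 simulated workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = training_step(w, shards)
print(round(w, 3))  # converges to the true weight, 3.0
```

In a real cluster the averaging step is performed by collective-communication libraries over the interconnect, which is why the article stresses that the network fabric had to be able to carry that load.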
Since all the machines are powered on at the same time, Microsoft also had to think about where they were located and where the power sources were, or it would not have had enough electricity to run its systems. It also had to make sure it could cool all the machines and chips, which it does using evaporation, outside air in colder climates, and high-tech immersion coolers in warmer areas.
Now Microsoft uses the same resources it developed for OpenAI to train and run its own large artificial-intelligence models, among them the new Bing search bot it introduced last month, which is still in the testing phase. It also sells the system to other customers. In addition, the company will continue to work on custom chip and server designs, as well as on optimizing its supply chain, to capture any gains in speed, efficiency, and savings it can.