You have surely been frustrated at some point because your computer was not working properly. On a laptop or desktop, finding the problem and fixing it is usually easy. Now the world’s most powerful computer, which goes by the name of Frontier, has performance issues, and while its operators know what the problem is, they haven’t been able to fix it yet.
Frontier is a supercomputer built to run advanced jobs that require enormous computing power. One of its main features is the ability to deliver more than 1 exaFLOP, that is, more than 10^18 floating-point operations per second. That is many thousands of times more than a home computer can offer.
The largest and most expensive computer in the world crashes all the time
The supercomputer is currently running, but because of its sheer scale it does not run well. Making such a beast of a system work properly is really complicated. We must bear in mind that it has thousands of components and a very complex interconnection system. It is not like a home computer, which is easy to assemble and repair.
To give an idea of the scale, Frontier has 9,472 AMD EPYC 7A53s processors. Each of these processors has 64 cores and runs at a frequency of 2.0 GHz. They are complemented by a total of 37,888 Radeon Instinct MI250X accelerator cards.
Each node is made up of one AMD CPU, four AMD graphics cards with 128 GiB of HBM2e memory each, and 512 GiB of DDR4 RAM. In addition, each node has 4 TB of NVMe storage.
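As a quick sanity check on these figures, the system-wide totals can be worked out from the per-node numbers. The short Python sketch below uses only the values quoted in this article and assumes, as described above, exactly one CPU per node, so the node count equals the CPU count. The function and constant names are purely illustrative.

```python
# Figures quoted in the article.
CPUS = 9_472               # AMD EPYC 7A53s processors
CORES_PER_CPU = 64
GPUS_QUOTED = 37_888       # Radeon Instinct MI250X cards
GPUS_PER_NODE = 4
HBM_PER_GPU_GIB = 128      # HBM2e memory per graphics card
DDR4_PER_NODE_GIB = 512


def frontier_totals():
    """Derive system-wide totals from the per-node figures.

    Assumption: one CPU per node, so the node count equals the CPU count.
    """
    nodes = CPUS
    gpus = nodes * GPUS_PER_NODE
    return {
        "nodes": nodes,
        "cpu_cores": CPUS * CORES_PER_CPU,
        "gpus": gpus,
        "hbm_per_node_gib": GPUS_PER_NODE * HBM_PER_GPU_GIB,
        "hbm_total_gib": gpus * HBM_PER_GPU_GIB,
        "ddr4_total_gib": nodes * DDR4_PER_NODE_GIB,
    }


if __name__ == "__main__":
    for name, value in frontier_totals().items():
        print(f"{name}: {value:,}")
```

Under these assumptions, four cards per node across 9,472 nodes gives exactly the 37,888 accelerators quoted, and each node ends up with as much HBM2e memory on its graphics cards (512 GiB) as it has DDR4 RAM.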
Apparently, the system has a serious operational issue related to the Instinct MI250X. The Slingshot interconnect used in the system is also reportedly causing problems under high loads.
Justin Whitt, Director of the Oak Ridge Leadership Computing Facility Program, explained:
“These are mostly issues of scale coupled with the breadth of applications, so the issues we encountered are mostly related to running very, very large jobs using the entire system… and getting all the hardware to work in concert to do that.”
But that would not be the only problem affecting performance. Whitt indicates that the AMD products are not the problem in themselves, and that their involvement would be a “coincidence”. It is also worth noting that performance issues of this kind are not unusual in systems like this one. When such systems are built, it usually takes time and a series of fixes before everything works properly.
Expensive and difficult-to-assemble equipment
We must bear in mind that this type of system has thousands of connections. Making the whole system work correctly is not easy and requires many adjustments. In addition, many applications are not yet ready for a system of this kind.
One might think that, since it does not work properly, it would be better to build several smaller systems instead. The reality is that it is “easier” to tune the software than to build several smaller systems. Often the data produced by one step is needed for the next step in processing, so the work cannot simply be split across separate machines.
This type of equipment is used for complex scientific studies in fields such as astrophysics or biomedicine. It is also used for climate predictions and many kinds of advanced simulations. These are tasks that a normal computer, or even a set of them, would take years to complete, but that a supercomputer can do in much less time.