The first dual GPU for HPC comes from AMD: 6 nm and 14,000 cores!

GPUs have undergone an incredible evolution since their inception: today they are used not only to generate the impressive frames of our favorite games, but also for general-purpose computing applications where the CPU is not good enough to run certain algorithms.

AMD is present in gaming GPUs through its Radeon products, currently built on the RDNA 2 architecture used in the RX 6000 series, but Lisa Su's company has decided to create a different architecture for high-performance computing.

AMD Instinct MI200 Specifications

Model | AMD Instinct MI250 | AMD Instinct MI250X
Architecture | CDNA 2 | CDNA 2
Manufacturing node | TSMC 6 nm | TSMC 6 nm
Number of GPUs | 2 | 2
Active Compute Units | 208 (104 per GPU) | 220 (110 per GPU)
Peak FP16 throughput (Matrix Core Units) | 362 TFLOPS | 383 TFLOPS
Peak FP32 throughput (SIMD units) | 45.3 TFLOPS | 47.9 TFLOPS
Peak FP64 throughput (SIMD units) | 45.3 TFLOPS | 47.9 TFLOPS
VRAM capacity | 128 GB | 128 GB
VRAM bandwidth | 3.2 TB/s | 3.2 TB/s
Form factor | OAM | OAM

AMD has decided to make a splash in the world of high-performance computing with its Instinct MI200 series of GPUs, which in terms of raw computing power are, according to AMD, the most powerful accelerators made so far. Their technical specifications are listed in the table above.

The AMD Instinct MI200 series was conceived mainly for use in the Frontier supercomputer, hence the form factor of the AMD Instinct MI250 and MI250X is OAM, which is typical of this type of hardware. However, this does not mean that we cannot install an AMD Instinct MI200 in our own machine if we want to use it for scientific development on a HEDT PC or a server: the Instinct MI210, to be released later, is the version in PCI Express format with a conventional graphics card form factor.

The first Dual GPU for HPC

AMD Instinct MI200 Dual GPU

These are the first dual-GPU graphics cards for HPC, or high-performance computing, to appear on the market. They were made possible by third-generation TSMC CoWoS-S technology, which the Taiwanese foundry created so that AMD could realize its AMD Instinct MI200.

As you can see, on top of the interposer we find two GPUs and 8 HBM2E memory stacks, which means a bus of 8192 bits in total. The bandwidth it provides? No less than 3.2 TB/s, roughly twice that of the NVIDIA A100, all thanks to a wider interface and faster memory.
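The bus width and bandwidth figures can be roughly checked with a few lines of Python. This is only a sketch: the 1024-bit interface per HBM2E stack and the 3.2 Gbit/s per-pin data rate are assumptions that the article does not state, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the memory figures quoted above.
# ASSUMPTIONS (not stated in the article): each HBM2E stack exposes a
# 1024-bit interface and each pin transfers 3.2 Gbit/s.
HBM_STACKS = 8            # from the article
BITS_PER_STACK = 1024     # assumed HBM2E interface width
PIN_RATE_GBIT_S = 3.2     # assumed per-pin data rate


def bus_width_bits(stacks: int, bits_per_stack: int) -> int:
    """Total memory bus width across all stacks."""
    return stacks * bits_per_stack


def bandwidth_tb_s(width_bits: int, pin_rate_gbit_s: float) -> float:
    """Peak bandwidth: Gbit/s -> GB/s (divide by 8) -> TB/s (divide by 1000)."""
    return width_bits * pin_rate_gbit_s / 8 / 1000


width = bus_width_bits(HBM_STACKS, BITS_PER_STACK)
print(width)                                              # 8192
print(round(bandwidth_tb_s(width, PIN_RATE_GBIT_S), 2))   # 3.28
```

Under those assumptions the result is about 3.28 TB/s, which is consistent with the 3.2 TB/s quoted in the specifications.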

Elevated Fanout Bridge

Communication between the GPUs and the HBM2E memory is handled by what AMD has dubbed the Elevated Fanout Bridge (EFB), a silicon bridge that is not embedded in the internal circuitry of the interposer but is built on top of it. This means that the AMD Instinct MI200 has three levels instead of two, so it is a more complex GPU to manufacture and that affects the cost. We have to keep in mind the target market for these graphics cards, though, and it is not exactly the home user.

The EFB is a technology similar to Intel's EMIB and connects each GPU with its closest HBM2E memory stacks, as well as the two GPUs with each other. To communicate with the interposer on the level below, it uses copper pillars that sit at the same level of the structure as the EFB.

The CDNA 2 architecture of the AMD Instinct MI200

AMD Instinct MI200 Architecture

The most important thing in any GPU is its architecture. However, CDNA 2 is not what we would call a typical GPU: a series of differences make it useful only for high-performance computing and not for generating graphics. Although it derives from the Vega graphics architecture, it can no longer perform a GPU's main function:

  • Ring 0 of the command processor, which is in charge of handling the display list, is not present in CDNA 2.
  • The fixed-function units used for certain repetitive tasks in graphics have been removed.
  • The display controller, in charge of sending the image to the monitor, has been eliminated, as have the video outputs.

So in the end, CDNA 2 is a machine with an enormous capacity to crunch numbers at high speed and in parallel. To that end, each of the two GPUs of the AMD Instinct MI200 is organized into 4 groups of 32 Compute Units each, so we physically have a total of 128 CUs per GPU, but "only" 104 or 110 are active depending on the model we are talking about.
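As a quick illustration of where the headline figure of roughly 14,000 cores comes from, the sketch below multiplies the active Compute Units by the number of ALUs in each one. It assumes that a "core" means one FP32 ALU lane and that there are 64 of them per Compute Unit, as described in the Compute Unit section of this article; the function names are hypothetical.

```python
# Rough count of the "cores" behind the headline figure.
# ASSUMPTION: one "core" = one FP32 ALU lane, with 64 lanes per Compute Unit
# (4 blocks x SIMD16, as the Compute Unit section of this article describes).
ALUS_PER_CU = 64
GPUS_PER_PACKAGE = 2
ACTIVE_CUS_PER_GPU = {"MI250": 104, "MI250X": 110}  # from the spec table


def active_cus(model: str) -> int:
    """Active Compute Units across both GPUs of one package."""
    return ACTIVE_CUS_PER_GPU[model] * GPUS_PER_PACKAGE


def fp32_alus(model: str) -> int:
    """Total FP32 ALU lanes ("cores") across both GPUs."""
    return active_cus(model) * ALUS_PER_CU


for model in ACTIVE_CUS_PER_GPU:
    print(model, active_cus(model), fp32_alus(model))
# MI250 208 13312
# MI250X 220 14080
```

The MI250X result, 14,080, matches the "14,000 cores" of the headline.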

The Compute Unit of CDNA 2

AMD CDNA2 Compute Unit

Each Compute Unit is made up of 4 blocks, and each block contains the following units:

  • One 32-bit floating point or integer SIMD16 unit, for a total of 64 ALUs per Compute Unit.
  • New compared to CDNA 1, a 64-bit floating point SIMD16 unit. The number of ALUs is the same as for FP32: 64 per Compute Unit.
  • A Matrix Core Unit, which is used for matrix calculations. It is AMD's counterpart to a tensor unit and is important for advanced deep learning algorithms.

The Compute Unit has 4 separate register files, and the scheduler is in charge of feeding the blocks with waves full of execution threads, so each of the four blocks works on a different wave at the same time. Depending on the type of wave, one group of units or another is activated, since they share registers.

The big novelty compared to the first-generation CDNA of the Instinct MI100 is the 16-lane 64-bit floating point SIMD unit, a precision that is necessary for scientific computing. This change makes 64-bit throughput equal to 32-bit throughput and, if we take the two-GPU configuration into account, FP64 performance has roughly quadrupled in the AMD Instinct MI200 compared to its predecessor.
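The FP32 and FP64 figures in the specification table can be approximated from the Compute Unit counts. The sketch below assumes a peak clock of 1.7 GHz, which the article does not mention, and counts each fused multiply-add as two floating point operations. It is a rough estimate, not AMD's official method.

```python
# Rough estimate of peak vector throughput from the Compute Unit layout.
# ASSUMPTIONS: a 1.7 GHz peak engine clock (not given in the article) and
# one fused multiply-add (= 2 floating point operations) per lane per cycle.
CLOCK_GHZ = 1.7       # assumed
FLOPS_PER_FMA = 2     # multiply + add
LANES_PER_CU = 64     # 4 blocks x SIMD16, same for FP32 and FP64


def peak_tflops(total_cus: int,
                lanes: int = LANES_PER_CU,
                clock_ghz: float = CLOCK_GHZ) -> float:
    """CUs x lanes x 2 x GHz gives GFLOPS; divide by 1000 for TFLOPS."""
    return total_cus * lanes * FLOPS_PER_FMA * clock_ghz / 1000


print(round(peak_tflops(208), 1))  # MI250:  45.3
print(round(peak_tflops(220), 1))  # MI250X: 47.9
```

Because the FP32 and FP64 SIMD units have the same number of lanes, the same calculation gives both figures, which is why the two rows of the table are identical.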

Infinity Fabric 3.0 on AMD Instinct MI200

AMD Instinct MI200 Infinity Fabric 3.0

AMD has also improved its Infinity Fabric interconnect in this third generation. Let's not forget that it is used for internal and external communication between the different components of the company's CPUs, GPUs and APUs, making it possible to combine the power of several CPUs and GPUs.

The Infinity Fabric of the previous-generation AMD Instinct forced CPU-GPU communication to go through the PCI Express 4.0 port in a non-coherent way, and it limited the number of GPUs connected to each other to four.

What is new? First of all, the use of dual GPUs allows up to 8 GPUs to communicate with each other. Secondly, for the first time the addressing between CPU and GPU is unified, since it is fully coherent, all thanks to the adoption of the CXL 1.1 standard.
