For more than a decade, PC performance has depended not only on the CPU but also on the GPU. The latter does more than render graphics: it also helps speed up program execution, since hundreds of applications today use the GPU's parallel processing capacity to accelerate their algorithms.
Intel's falling behind gave its direct rivals, and AMD in particular, an advantage. Thanks to its excellent GPU performance, AMD has won several contracts from the United States government for the development of supercomputers, all in the midst of the race to reach an exaFLOP of computing power.
This was the turning point for Intel, which hired Raja Koduri from AMD and assembled a team around him with one goal: creating a scalable graphics architecture that would let it compete against AMD and NVIDIA, from GPUs integrated into CPUs to HPC GPUs, without forgetting graphics cards for gaming. There, the Intel ARC Alchemist cards are the first generation with which Intel intends to take market share from its rivals.
A journey into the architecture of the Intel ARC Alchemist
As if we were climbing higher and higher, we're going to break down the different components that make up Intel's first enthusiast gaming GPU, moving from the specific to the global, so that you can understand how the Intel ARC Alchemist architecture is organized and how it compares with its counterparts from NVIDIA and AMD. It is a gaming GPU that, despite being designed by Intel, will be manufactured on TSMC's N6 process.
We will see the Intel ARC Alchemist architecture in both dedicated graphics cards for desktop PCs and gaming laptops, in various configurations that differ in bandwidth and in the number of Render Slices. The version with 8 Render Slices is the most advanced of them all. It is expected to launch in 2022.
The Xe-Core, the foundation of the Intel ARC Alchemist
The first thing to bear in mind is that the so-called EU Cores have disappeared and been replaced by the Xe-Cores. The two are not the same: each Xe-Core is equivalent to AMD's Compute Unit or NVIDIA's SM, with one change that should be noted. Intel has left the Sampler, or texture unit, and other fixed-function units out of the Xe-Core itself. It has not discarded them, but keeping them separate makes it easier to create GPUs that are not intended for graphics.
Each Xe-Core in the Intel ARC Alchemist comprises 16 Vector Engines. Each of them is a 256-bit SIMD unit and is therefore composed of 8 32-bit floating-point ALUs, for a total of 128 ALUs per Xe-Core. That count is equivalent to that of an SM in the NVIDIA RTX 3000 and twice that of a Compute Unit in AMD RDNA 2.
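The arithmetic behind that figure can be checked with a short sketch. The constants come from the paragraph above; the function names are our own and purely illustrative.

```python
# Figures quoted in the article for one Xe-Core.
VECTOR_ENGINES_PER_XE_CORE = 16
SIMD_WIDTH_BITS = 256
FP32_BITS = 32


def fp32_lanes_per_vector_engine(simd_bits=SIMD_WIDTH_BITS, lane_bits=FP32_BITS):
    """Number of 32-bit ALUs that fit in one SIMD unit of the given width."""
    return simd_bits // lane_bits


def fp32_alus_per_xe_core(engines=VECTOR_ENGINES_PER_XE_CORE):
    """Total 32-bit ALUs in one Xe-Core: engines times lanes per engine."""
    return engines * fp32_lanes_per_vector_engine()


if __name__ == "__main__":
    print(fp32_lanes_per_vector_engine())  # lanes per Vector Engine
    print(fp32_alus_per_xe_core())         # ALUs per Xe-Core
```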
The XMX units are equivalent to NVIDIA's Tensor Cores and are therefore designed to speed up matrix calculations, ideal for algorithms based on convolutional neural networks. In terms of raw computing power, the XMX units have twice that of their equivalents in the NVIDIA RTX 3000. As in the NVIDIA architecture, though, it seems that these units share the registers and the scheduler with the Vector Engines. These units will be key to its XeSS algorithm, which is Intel's weapon against NVIDIA's DLSS.
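To give an idea of the kind of operation matrix units accelerate, here is a minimal sketch of a matrix multiply-accumulate, D = A x B + C, written in plain Python. It only illustrates the math; it says nothing about how XMX is actually implemented, and the function name is our own.

```python
def matmul_accumulate(a, b, c):
    """Return a x b + c for matrices given as lists of rows.

    a is (rows x inner), b is (inner x cols), c is (rows x cols).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = [[1, 1], [1, 1]]
    print(matmul_accumulate(a, b, c))
```

Dedicated hardware performs many of these multiply-adds per clock on small tiles, which is why it suits neural network workloads.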
First-level caches, texture unit, and Ray Tracing
Without leaving the Xe-Core, we can see that both the first-level instruction cache and the data cache sit inside each Xe-Core. This sets it apart from NVIDIA and AMD, whose instruction cache is usually shared by two equivalent units. The first-level cache has also changed compared to previous Intel architectures.
Until its Gen 11 GPUs, Intel had kept the texture cache separate from the data cache, which is not a usual choice. Now it has not only unified them, but local memory also shares the same space as the data cache, so developers can choose how much is allocated to the L1 data cache and how much to local memory. Local memory is not a cache, but a small RAM used to temporarily store certain variables and to let the different units communicate.
The data cache is used by the texture unit, which Intel calls the Sampler, and by the Ray Tracing intersection unit. The latter seems to be more advanced than AMD's, as it is separate from the texture unit and can traverse the BVH tree data structure by itself. That makes it more similar to the NVIDIA RT Core. We do not yet know how it performs, but since the architecture launches in 2022, we expect performance equivalent to that of the NVIDIA RTX 3000 in that regard.
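The idea of one storage block split between L1 data cache and local memory can be sketched as follows. This is only an illustration: the class name is our own and the 192 KiB total is a placeholder, since Intel's actual size and the way the split is configured are not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedL1Config:
    """One block of on-chip storage split between L1 cache and local memory."""

    total_kib: int         # hypothetical total size of the shared block
    local_memory_kib: int  # portion reserved as local memory

    def __post_init__(self):
        if not 0 <= self.local_memory_kib <= self.total_kib:
            raise ValueError("local memory must fit inside the shared block")

    @property
    def l1_cache_kib(self):
        """Whatever is not reserved as local memory remains L1 data cache."""
        return self.total_kib - self.local_memory_kib


if __name__ == "__main__":
    config = SharedL1Config(total_kib=192, local_memory_kib=64)  # placeholder sizes
    print(config.l1_cache_kib)
```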
Many Xe-Cores make one Render Slice
The Render Slice is a set of units that, besides its Xe-Cores, contains three fixed-function units, which Intel names Geometry, Rasterizer and Hi-Z. They are responsible for a series of functions common to all GPUs that are essential to display graphics in real time.
The first one is the rasterization unit, or Rasterizer. It takes care of the common task of projecting the image onto the screen, converting the geometry of the 3D scene, which is composed of vertices, into a two-dimensional Cartesian space composed of pixels or fragments. Like all modern raster units, Intel's uses tile-based rasterization over the GPU's LLC cache.
The second, called Geometry, is the classic tessellation unit that many games use to add geometric density. We do not know whether it is a contemporary Geometry Engine like the ones in AMD and NVIDIA GPUs, but we assume so, since this type of unit is essential for Mesh Shading, and let's not forget that Intel ARC Alchemist supports DirectX 12 Ultimate.
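As a rough illustration of that conversion, the sketch below projects a 3D vertex to pixel coordinates with a simple pinhole camera and then finds the screen tile the pixel falls in. The camera model, the 16-pixel tile size and the function names are all assumptions made for the example, not details of Intel's hardware, and no clipping is performed.

```python
def project_vertex(x, y, z, width, height, focal=1.0):
    """Project a camera-space vertex to integer pixel coordinates.

    Assumes the camera looks down +z and that z > 0 (in front of the camera).
    """
    if z <= 0:
        raise ValueError("vertex must be in front of the camera")
    # Perspective divide: farther points move toward the screen center.
    ndc_x = focal * x / z
    ndc_y = focal * y / z
    # Map the [-1, 1] range to pixels; screen y grows downward.
    px = int((ndc_x + 1) * 0.5 * width)
    py = int((1 - ndc_y) * 0.5 * height)
    return px, py


def tile_of(px, py, tile_size=16):
    """Return the (column, row) of the screen tile containing a pixel."""
    return px // tile_size, py // tile_size


if __name__ == "__main__":
    pixel = project_vertex(1.0, 1.0, 2.0, width=640, height=480)
    print(pixel, tile_of(*pixel))
```

Grouping pixels by tile is what lets a tiled rasterizer work on one small screen region at a time while its data stays in cache.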
The third unit is called Hi-Z. Keep in mind that rasterization generates the Z-Buffer, or depth buffer, which stores the distance of each pixel from the camera. The idea of Hi-Z is that, instead of using a single large buffer like the Z-Buffer, a hierarchy of them is used to speed up access. Many game algorithms, such as traditional shadow maps, use it, and it is also essential for Occlusion Culling, which allows the GPU to discard fragments whose Z value is farther from the camera than what has already been drawn.
Also within the Render Slice is the Pixel Backend, the name Intel has given to the classic units responsible for generating the final image buffer. At the end of the pipeline, once the Pixel Shader has colored each pixel, it sends the result to the Pixel Backend, and from there it goes to the GPU's L2 cache or to memory.
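A minimal sketch of the idea, assuming smaller Z means closer to the camera and a square depth buffer whose side is a power of two: each coarser level keeps the farthest depth of every 2x2 block below it, so a single lookup can reject anything that lies behind a whole region. The function names are our own; real hardware differs in the details.

```python
def build_hiz(depth):
    """Build a depth hierarchy; level 0 is the full-resolution buffer.

    Each coarser texel stores the maximum (farthest) depth of a 2x2 block.
    """
    levels = [depth]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        half = len(prev) // 2
        levels.append([
            [
                max(prev[2 * y][2 * x], prev[2 * y][2 * x + 1],
                    prev[2 * y + 1][2 * x], prev[2 * y + 1][2 * x + 1])
                for x in range(half)
            ]
            for y in range(half)
        ])
    return levels


def is_occluded(levels, level, tx, ty, nearest_z):
    """True if geometry whose nearest depth is nearest_z lies behind
    everything already drawn in the given texel of the given level."""
    return nearest_z > levels[level][ty][tx]


if __name__ == "__main__":
    depth = [
        [0.2, 0.3, 0.9, 0.9],
        [0.2, 0.4, 0.9, 0.9],
        [0.5, 0.5, 1.0, 1.0],
        [0.5, 0.5, 1.0, 1.0],
    ]
    levels = build_hiz(depth)
    print(is_occluded(levels, 1, 0, 0, 0.6))  # region fully covered by closer pixels
    print(is_occluded(levels, 1, 1, 0, 0.6))  # region still has farther pixels
```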
Multiple Render Slices and an L2 cache make one GPU
If we go even higher, we can see the architecture of the Intel ARC Alchemist in all its glory, composed of 8 Render Slices and a huge second-level cache that acts as the system's LLC and is responsible for keeping all the Render Slices that make up the GPU cache-coherent. As in other contemporary GPUs, several units in the Intel Xe HPG read and write data through the L2 cache, so there is nothing mysterious about how it works.
Which units are in contact with the L2 cache? The following:
- The first-level caches on each Xe-Core
- The fixed function units that we have mentioned before: Geometry, Rasterizer and Hi-Z
- The Pixel Backend.
This is almost, but not quite, the entire GPU. Keep in mind that Intel will launch its first Intel ARC Alchemist GPUs in the first quarter of 2022 and still has details to reveal about them. Among them are the command processor configuration and the classic accelerators found in every GPU, such as the display controller, the video codec, the DMA units and any other units that Intel has not revealed yet.
This is all we can say so far about the new Intel architecture.