Where does Intel get the advantage in RAM for its CPUs?

One of the most actively researched areas in recent years is the movement of data within an architecture. At first glance it may look like a problem that was solved long ago, but it has become a crucial factor in increasing CPU performance.

There are two reasons why data movement has become an obsession for architects designing new hardware. The first is energy consumption and the second is latency, the time it takes to complete a memory operation. It is on this second point that the relationship between latency and bandwidth needs to be made very clear.

Is bandwidth the same as latency?

No, they are not the same. Latency is the time, measured in clock cycles, that it takes to resolve a memory request, and every request has to go through a fixed series of steps. The problem is that even if the memory interface is very fast, the memory controller may not be, so the controller or the CPU's MMU can become saturated and end up delaying every memory request.

No matter how fast the interface is, if one memory request is blocked then the rest of the queue is blocked behind it and no data is transmitted. This can happen when a large number of requests to RAM saturate the controller. Worse still, it can leave the CPU waiting a long time for the data needed by the next instruction it has to execute.
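The blocking effect described above can be illustrated with a toy model. This is only a sketch under simple assumptions: requests are served strictly in order, one at a time, and the cycle counts are invented for illustration rather than measured on any real memory controller.

```python
def completion_times(service_cycles):
    """Return the cycle at which each request finishes in a strict FIFO queue."""
    finished = []
    clock = 0
    for cycles in service_cycles:
        # A request cannot start until the one ahead of it has finished.
        clock += cycles
        finished.append(clock)
    return finished


# Four requests that each take 10 cycles (hypothetical figures).
smooth_queue = completion_times([10, 10, 10, 10])
# The same queue, but the second request stalls for 200 cycles.
stalled_queue = completion_times([10, 200, 10, 10])

print(smooth_queue)   # [10, 20, 30, 40]
print(stalled_queue)  # [10, 210, 220, 230]
```

In this model a single slow request pushes back every request queued behind it, even though the later requests are just as cheap as before.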

Bandwidth, on the other hand, is simply the transfer rate. For example, you can have 100 requests at 1 GB/s each or 1 request at 100 GB/s, but keep in mind that the processor's memory controller, which is responsible for managing accesses to memory, will have a much harder time with the first case than with the second.
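A rough back-of-the-envelope model shows why many small requests cost more than one large one at the same bandwidth. It is a sketch, not a description of real hardware: it assumes requests are issued one after another with no overlap, and the 80 ns latency and 50 GB/s bandwidth are made-up values chosen only for illustration.

```python
def transfer_time_ns(num_requests, bytes_per_request, latency_ns, bandwidth_gb_per_s):
    """Estimate the total time when requests are issued strictly one after another."""
    # 1 GB/s is 1 byte per nanosecond, so bytes / (GB/s) gives nanoseconds.
    streaming_ns = bytes_per_request / bandwidth_gb_per_s
    # Every request pays the fixed latency once, plus its own streaming time.
    return num_requests * (latency_ns + streaming_ns)


TOTAL_BYTES = 64 * 1024  # the same 64 KiB is moved in both cases

# 1024 requests of 64 bytes each (hypothetical latency and bandwidth).
many_small = transfer_time_ns(1024, 64, latency_ns=80, bandwidth_gb_per_s=50)
# One request that moves the whole 64 KiB.
one_large = transfer_time_ns(1, TOTAL_BYTES, latency_ns=80, bandwidth_gb_per_s=50)

print(f"many small requests: {many_small:.0f} ns")
print(f"one large request:   {one_large:.0f} ns")
```

Under these assumptions the fixed latency dominates the first case because it is paid 1024 times, while the second case pays it once and spends the rest of its time actually transferring data.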

Data movement units

Take any ISA, it does not matter which one, and look through it. You will find instructions that do not perform an arithmetic-logic operation and are not responsible for a jump or a branch, but instead move data to and from memory.

Many of these instructions are old and have a fixed latency in clock cycles. What if we added a support processor to act as a messenger, one that could resolve those requests to RAM, or to any memory within the same address space, with lower latency? The performance of the processor would increase, because the cores could spend the clock cycles they would normally waste waiting on executing new instructions.

The Intel Data Streaming Accelerator is based on this principle, and it is one of the keys to improving the performance of Intel's processors.
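The "messenger" idea can be sketched as a submit, continue, wait pattern. The example below is only an analogy: a single worker thread stands in for the accelerator, `copy_block` is a hypothetical helper, and Python threads do not reproduce the parallelism or latency of real hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_block(source, destination):
    """Stand-in for the accelerator: copy source into destination."""
    destination[:len(source)] = source
    return len(source)


source = bytearray(range(256)) * 4096   # 1 MiB of sample data
destination = bytearray(len(source))    # zero-filled target buffer

# One worker plays the role of the support processor (an assumption of this sketch).
with ThreadPoolExecutor(max_workers=1) as accelerator:
    pending = accelerator.submit(copy_block, source, destination)  # hand off the copy
    checksum = sum(range(1000))   # the "core" carries on with unrelated work
    copied = pending.result()     # block only at the point where the data is needed

print(copied, destination == source)
```

The point of the pattern is that the requester does not sit idle while the copy is in flight; it only waits when it actually needs the result.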

Intel Data Streaming Accelerator

As its name indicates, it is an accelerator, that is, a unit that performs a specific task, which in this case is moving data in less time than the CPU would. What sets the DSA apart is that it is designed around one of the features that Compute Express Link adds on top of PCI Express 5.0: granting coherent access to RAM to all peripherals connected to the PCI Express port, which means they share the same memory addresses.

Therefore, it is used to carry out the following operations:

It can move data from the CPU to RAM and vice versa.
To access non-coherent memory spaces that use a different memory addressing scheme, it can perform the address translation automatically, so technically we are looking at an updated DMA unit.
It also has access to persistent or non-volatile memory, so it can reach NVMe SSDs, Intel Optane modules, NVDIMMs and so on.
Through NTB (Non-Transparent Bridge) in a server environment, it provides access to RAM or non-volatile memory on another board in the server or data center.
It has built-in functions for applying the above points to virtual machines.

As many of you will have deduced, this type of unit is designed especially for server processors, although it is not a fixed-function unit that works automatically.

Intel DSA Instructions

The Data Streaming Accelerator is not a fixed-function unit, since it does not always apply the same program to the data that enters it. Instead it supports a series of instructions, which makes it what we call a domain-specific processor. Among the operations it can perform are:

Move: copies a block of memory from one address to another, the job that the classic x86 data-movement instructions do, as anyone who has written assembler will know. When the work is handed to one of the processor's Intel Data Streaming Accelerators, it is carried out by that unit and not by the CPU cores.
DIF: verifies the integrity of the information held in memory.
CRC Generation: generates the CRC checksum of the transmitted data.
Fill: fills a section of memory with a specific value repeatedly. It is ideal for erasing part of the memory, since it lets us set all the bits to 0.
Compare: compares two memory blocks and checks whether they are identical.
Delta Record Create: compares two memory blocks and generates a new data stream containing the differences between them.
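To make the meaning of these operations concrete, here is a plain software sketch of what Fill, Compare, CRC Generation and Delta Record Create compute. It does not talk to any accelerator, and all function names are hypothetical. Two details are assumptions made only for illustration: the CRC uses Python's `zlib.crc32`, which may not be the polynomial the hardware uses, and the delta record compares the buffers in 8-byte chunks.

```python
import zlib


def memory_fill(buffer, offset, length, pattern):
    """Fill buffer[offset:offset + length] by repeating a non-empty pattern."""
    repeats = -(-length // len(pattern))  # ceiling division
    buffer[offset:offset + length] = (pattern * repeats)[:length]


def memory_compare(first, second):
    """Return True when both blocks hold identical bytes."""
    return bytes(first) == bytes(second)


def crc_generate(data):
    """Return a CRC-32 checksum of the data (zlib polynomial, an assumption)."""
    return zlib.crc32(bytes(data))


def delta_record_create(original, modified, chunk=8):
    """Return (offset, new_bytes) for every chunk that differs between the blocks."""
    if len(original) != len(modified):
        raise ValueError("both blocks must have the same length")
    return [
        (start, bytes(modified[start:start + chunk]))
        for start in range(0, len(original), chunk)
        if original[start:start + chunk] != modified[start:start + chunk]
    ]


old_block = bytearray(32)                  # 32 zero bytes
new_block = bytearray(32)
memory_fill(new_block, 8, 8, b"\xff")      # overwrite bytes 8..15 with 0xFF

print(memory_compare(old_block, new_block))
print(delta_record_create(old_block, new_block))
print(hex(crc_generate(new_block)))
```

A delta record like this is useful when only a small part of a large block has changed, since the differences can be stored or sent instead of the whole block.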

The Data Streaming Accelerator can also control several storage devices at the same time:

Enable / Disable: connects or disconnects a memory device, either RAM or non-volatile storage.
Abort: aborts all memory requests to RAM or another memory device.
Drain: asks that all pending requests to a memory device be completed.
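One way to picture these control commands is as operations on a queue of pending requests. The class below is a toy model with hypothetical names, not the real device interface. In particular, Drain is modeled here as processing every queued request in one go, which is an assumption about its behavior.

```python
from collections import deque


class ToyDevice:
    """Toy model of a device with a queue of pending memory requests."""

    def __init__(self):
        self.enabled = False
        self.pending = deque()
        self.completed = []

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def submit(self, request):
        # A disabled device accepts no new work.
        if not self.enabled:
            raise RuntimeError("device is disabled")
        self.pending.append(request)

    def abort(self):
        """Discard every pending request and report how many were dropped."""
        dropped = len(self.pending)
        self.pending.clear()
        return dropped

    def drain(self):
        """Process every pending request and report how many were completed."""
        drained = 0
        while self.pending:
            self.completed.append(self.pending.popleft())
            drained += 1
        return drained


device = ToyDevice()
device.enable()
device.submit("copy A")
device.submit("copy B")
drained_count = device.drain()    # both queued requests are completed
device.submit("copy C")
aborted_count = device.abort()    # "copy C" is discarded, never completed
device.disable()

print(drained_count, aborted_count, device.completed)
```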

The full list of instructions is much longer, but this should give you a rough idea of how this new unit that Intel has integrated into its processors works. The benefits are clear and are expected to be further enhanced in Sapphire Rapids.