The acronym SIMD stands for Single Instruction, Multiple Data: a SIMD unit executes the same instruction, whatever it may be, on several pieces of data at the same time. The data are usually packed as a succession of values presented as if they were a vector, which is why SIMD units are also known as vector units.
Because everything related to image and sound works with large amounts of data but with highly repetitive instructions, SIMD units began to be implemented in CPUs, along with extensions to the register and instruction sets. This happened in the late 90s, and since then there has been a clear evolution of this type of unit in CPUs.
SIMD units are also used in GPUs. For example, we have the case of the “Stream Processors” from AMD and the misnamed “CUDA Cores” from NVIDIA, which are nothing more than ALUs grouped into SIMD units. All of them receive the same instruction in unison and execute it, even though each one manipulates different data.
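The idea above can be sketched in plain C. This is a minimal, illustrative model (the names `LANES` and `simd_add` are our own, not any real ISA): one "vector add" has every lane apply the same operation to its own operands, which in real hardware happens simultaneously.

```c
#define LANES 4  /* width of our hypothetical SIMD unit */

/* One "vector add" instruction: each lane (ALU) applies the same
   operation to its own pair of operands. The loop here stands in for
   what the hardware does in a single cycle, in parallel. */
void simd_add(const int a[LANES], const int b[LANES], int out[LANES]) {
    for (int i = 0; i < LANES; i++)
        out[i] = a[i] + b[i];
}
```

Calling `simd_add` on `{1, 2, 3, 4}` and `{10, 20, 30, 40}` fills `out` with `{11, 22, 33, 44}`: one instruction, four results, which is exactly what the term "vector unit" refers to.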
The limitations or bottlenecks of SIMD units
Since SIMD units operate on several pieces of data at the same time, they multiply the computing power of a processor. However, they have a series of associated limitations that keep their real performance below the theoretical ideal. There are three in particular, which we describe below.
Limitations on the size of instructions in SIMD units
The first one comes from chip design: the size of the data that SIMD units work with is fixed, so the instruction set has to grow every time they need to work with data of greater or lesser precision. This overcomplicates the design of new CPUs and GPUs, and it has become a nightmare with the different data formats that have appeared in recent years, especially with artificial intelligence, which has introduced new low-precision formats.
The problem is exacerbated in ISAs with fixed-size instructions, where every instruction uses the same number of bits, as is the case with ARM. If, for example, SIMD instructions need to encode larger operations in the future, fewer bits will remain for the opcode and therefore for the number of distinct instructions. This is forcing certain RISC designs to rely on accelerators or co-processors to handle large SIMD operations.
On the other hand, this is not a disadvantage for x86 processors, where the size of the instructions is not fixed; the price paid instead is a more complicated control unit.
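Why each new data format inflates the instruction set can be sketched with a hypothetical 128-bit register, modeled here as a C union (the type `simd128` and the helper names are illustrative, not any vendor's API). The storage is fixed; only the instruction decides how it is sliced into lanes, so an 8-bit add and a 32-bit add over the same register must be different opcodes.

```c
#include <stdint.h>
#include <stddef.h>

/* A hypothetical 128-bit SIMD register: fixed storage, multiple views. */
typedef union {
    int8_t  i8[16];  /* 16 lanes of 8-bit integers  */
    int32_t i32[4];  /*  4 lanes of 32-bit integers */
    float   f32[4];  /*  4 lanes of 32-bit floats   */
} simd128;

/* Every interpretation needs its own opcodes in the ISA; adding a new
   format (e.g. a low-precision AI type) means adding yet another set. */
size_t lanes_as_i8(void)  { return sizeof(simd128) / sizeof(int8_t);  }
size_t lanes_as_i32(void) { return sizeof(simd128) / sizeof(int32_t); }
```

The register is always 16 bytes; whether that means 16 operands or 4 operands depends entirely on which instruction reads it.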
Jumps and loops affect performance
Another bottleneck or limitation of SIMD units concerns loop or jump instructions, since a particular value among the multitude of operands may cause execution to continue along a different path. Let’s not forget that the speed of every processor is that of its slowest component, and a single ALU may fall into a much longer loop or jump. That is why GPUs, which base their entire execution on SIMD units, handle jump prediction separately, and their performance plummets when one of these instructions is present.
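What happens when lanes want to diverge can be sketched as follows. A SIMD unit cannot jump for some lanes and not others, so a common technique (predication, or masked execution) evaluates both sides of the branch for every lane and then blends the results with a mask. The function below is an illustrative model of this, not real hardware code; both paths always run, which is exactly the cost described above.

```c
#define WIDTH 4  /* lanes of our hypothetical SIMD unit */

/* Per-lane "if (x > 0) out = x*2; else out = -x;" without any jump:
   compute both paths for all lanes, then select per lane with a mask. */
void simd_select(const int x[WIDTH], int out[WIDTH]) {
    int mask[WIDTH], then_r[WIDTH], else_r[WIDTH];
    for (int i = 0; i < WIDTH; i++) mask[i]   = (x[i] > 0); /* compare    */
    for (int i = 0; i < WIDTH; i++) then_r[i] = x[i] * 2;   /* taken path */
    for (int i = 0; i < WIDTH; i++) else_r[i] = -x[i];      /* other path */
    for (int i = 0; i < WIDTH; i++)                         /* blend      */
        out[i] = mask[i] ? then_r[i] : else_r[i];
}
```

Note that even if only one lane takes the `else` path, all lanes still pay for it: the unit executes both branches in full, which is why divergent code is so expensive on SIMD hardware.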
If, due to a loop or jump, one of the SIMD unit’s operations has not been completed, the unit cannot move on to the next instruction. That is why a technique called loop unrolling is used, which converts code inside a loop or jump into a series of sequential instructions that produce the same result. This is done by the compiler when translating source code into machine code or, depending on the architecture, by hand; the counterpart is an increase in the size of the program.
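Loop unrolling can be shown directly in C. Both functions below compute the same sum; the unrolled version does four independent additions per iteration, which reduces the loop-control overhead per element and maps naturally onto a 4-lane SIMD unit. The code is visibly longer, which is the size trade-off mentioned above.

```c
/* Rolled: one add, one compare, and one jump per element. */
int sum_rolled(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unrolled by 4: four independent adds per iteration, one loop check.
   The leftover elements are handled by a short remainder loop. */
int sum_unrolled(const int *v, int n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0, i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    int s = s0 + s1 + s2 + s3;
    for (; i < n; i++)  /* remainder when n is not a multiple of 4 */
        s += v[i];
    return s;
}
```

In practice an optimizing compiler performs this transformation (and the corresponding vectorization) automatically when it can prove the iterations are independent.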
Waste of ALUs in SIMD instructions
An instruction for a SIMD unit does not always make use of all the ALUs it has, which often leaves part of the unit idle with no operand to work on. Ideally, the unused ALUs would be assigned to the next instruction, but this is not possible, which is why it is extremely important to make the most of the unit’s resources in this regard.
Where this is seen most is in GPUs, where the waves of data and instructions that arrive at the shader units often do not fill all the slots, which translates into a significant reduction in performance. It must also be taken into account that SIMD units usually have a power-of-two number of ALUs, so if the amount of data to process is not a multiple of that width, part of the computing capacity of the SIMD units ends up wasted.
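The arithmetic behind that waste is simple and can be sketched as two small helpers (the names and the 8-lane width are our own assumptions for illustration). Processing n elements on an 8-wide unit takes ceil(n / 8) instructions, and the last instruction leaves lanes idle whenever n is not a multiple of 8.

```c
#define SIMD_WIDTH 8  /* ALUs per hypothetical SIMD unit (a power of 2) */

/* Instructions needed to cover n elements: ceil(n / SIMD_WIDTH). */
int instructions_needed(int n) {
    return (n + SIMD_WIDTH - 1) / SIMD_WIDTH;
}

/* ALUs that sit idle during the final instruction. */
int idle_lanes_last_op(int n) {
    int rem = n % SIMD_WIDTH;
    return rem ? SIMD_WIDTH - rem : 0;
}
```

For example, 13 elements need 2 instructions and leave 3 of the 8 ALUs idle on the second one, so more than 18% of the issued capacity does no useful work.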