Why We Need SIMD (The Real Reason)
tl;dr: it can deliver big performance speedups at modest area cost
"Why do we even need SIMD instructions?", a recent blog post by Daniel Lemire, inspired me to write this reflective post with a brief history of CPU architecture and of the inception and evolution of SIMD instructions on the x86 platform.
SIMD stands for Single Instruction, Multiple Data, and the term was coined as part of Flynn’s Taxonomy, which is used to characterize computer architectures. Dating back to 1966, this taxonomy describes the different types of architecture that computer engineers were exploring at the time (including the original Single Instruction, Single Data, or SISD), and it necessarily has been revised and revisited many times since.
Computer engineers had been talking about parallelism long before it became strictly necessary for performance – it was in 1967 that legendary computer engineer Gene Amdahl was prompted to write the paper that inspired the term “Amdahl’s Law,” pointing out that the overall speedup of a program is limited by the serial portions that remain after all available parallelism has been exploited. Amdahl’s skepticism proved correct for decades: long past any foreseeable horizon, computer engineers kept finding exploitable parallelism in the single-threaded fetching, decoding, and execution of instruction streams. Pipelined CPUs can do all of those tasks concurrently, much like the auto workers on Henry Ford’s moving assembly lines: although each step in the process is discrete and must be done in a particular order, there’s no question that all the workers are performing their tasks in parallel and that the factory is producing more cars per unit time.

By the 1990s, CPU architects had figured out how to build “superscalar” hardware that identified opportunities to execute instructions in parallel and did so whenever possible. This form of parallelism is known as “instruction-level parallelism,” or ILP, and studies showed that compilers could enable concurrent execution of at most 4 or 5 instructions at a time. All of this innovation occurred amid an explosion in transistor budgets; for Intel x86 CPUs, an increase of more than 20x across roughly a decade, as the 80286 (introduced 1982) had 134,000 transistors while the Pentium (1993), the first superscalar x86 implementation, had about 3.1 million.
Once CPU designs were fully pipelined and superscalar, CPU designers were forced to cast a wider net as they sought to deliver even more performance, and extending their instruction sets to enable SIMD was a relatively inexpensive way for them to deliver more compute per instruction. The reason is that SIMD reuses a great deal of infrastructure already built into the chip: caches, prefetchers, decoding hardware, the scoreboarding hardware that tracks dependencies between instructions, and so on. On x86, the first SIMD instructions were MMX (“MultiMedia eXtensions”), which aliased the existing register state of the floating point unit (FPU) to hold packed arrays of smaller integers. With MMX, a single instruction could perform eight byte-sized operations at once. And amazingly, because so much of the CPU was hardware that had to be built anyway (truthfully, most of it was SRAM to provision the caches), the incremental manufacturing cost to Intel was modest. Yet this new family of instructions was able to describe a great deal more work per instruction – the key benefit of SIMD. For suitable workloads, it’s about 4x faster to be able to tell the computer “add these 4 numbers to these other 4 numbers, and give me the 4 results” than to use the scalar model (“add these two numbers together”). Intel has since increased the SIMD width available in its chips several times: MMX was 64 bits wide, but SSE, AVX, and AVX512 subsequently increased the SIMD width to 128, 256, and 512 bits, respectively, amplifying SIMD’s core benefit of enabling a single instruction to request more work from the CPU.
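To make the scalar-versus-SIMD contrast concrete, here is a minimal sketch in C – my own example, using SSE intrinsics rather than the original MMX instructions – that performs the “add these 4 numbers” operation both ways. It assumes an SSE-capable x86 toolchain (e.g., gcc -O2 -msse):

```c
#include <immintrin.h>  // x86 SIMD intrinsics
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    // Scalar model: one addition per instruction, four instructions.
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];

    // SIMD model (SSE): ADDPS performs all four additions in one instruction.
    __m128 va = _mm_loadu_ps(a);      // load 4 packed single-precision floats
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   // 4 lane-wise adds at once
    _mm_storeu_ps(c, vc);

    printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  // 11 22 33 44
    return 0;
}
```

Either path computes the same four sums; the SIMD version simply describes all four in a single instruction, which is the whole pitch.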
Of course, unlike many previous architectural innovations, MMX and its successors all require software updates in order to benefit from the new instructions, and that takes much longer: not only do developers have to write and test new code, but the code must propagate through intermediate layers (such as an operating system release cycle) before finding its way into the hands of end customers. This lengthy process stands in stark contrast to previous innovations in x86 history that transparently benefited end customers, like integrating the FPU in the 80486 or implementing superscalar execution in the Pentium.
In some domains, such as cryptography or video encoding and decoding, there is a ready deployment pipeline for instruction set innovations, no matter how esoteric, to deliver their benefits.¹ When Intel was building MMX, they had aspirations to create a similar pipeline for 3D rendering; and if their CPUs had been performance-competitive with dedicated hardware, they might have succeeded. For example, if Intel had been able to build a fast OpenGL implementation that rendered triangles with MMX, then further improvements to the SIMD instruction sets (SSE, AVX, etc.) would have delivered transparent performance improvements to OpenGL applications, and neither the developers nor the end customers would have needed to know what enabled those improvements.
Unfortunately, video turned out to be a somewhat lonely exception in that landscape; at the same time MMX was being brought to market, the main sink for CPU clock cycles – 3D rasterization, driven by games like DOOM, Duke Nukem, and Descent – was being taken over by dedicated hardware. As Direct3D development lead, I had a front-row seat for that show: we shipped an MMX rasterizer in Direct3D, and we also shored up hardware support, with dozens of hardware vendors competing to win the Windows graphics accelerator market. I knew software rasterization was dead for sure the day Intel delivered a Pentium II (the first chip that featured both the Pentium Pro’s superscalar core and MMX instruction support) and it ran half as fast as a lowly S3 ViRGE GX, the least expensive and slowest graphics chip money could buy at the time. The Pentium II machine we’d received was a preview, not yet available in the consumer market, so the comparison favored Intel – and it still didn’t deliver competitive performance. And Direct3D, with its Hardware Abstraction Layer, gave developers and end customers just as seamless a mechanism to benefit from advancing graphics hardware. As a consequence, it was left to individual software developers to identify when their applications included workloads that would benefit from MMX optimization, hindering both the adoption of new SIMD instructions and the development and QA of the new code that used them. (I have written about the software engineering exigencies of SIMD instructions, contrasting them with writing parallel applications in CUDA.)
That was almost 30 years ago. In the intervening time, Intel tried again to will software rasterization into existence in the form of Larrabee; when that failed (and when Larrabee’s derivative, Xeon Phi, failed to win any meaningful developer or customer mindshare), Intel parlayed the instruction set advancements designed for those chips into AVX512, the most functional SIMD instruction set ever. AVX512 is not just twice as wide as its predecessor; it also includes per-lane predication, governed by dedicated mask registers that come with a fairly rich set of instructions to manipulate them. Daniel Lemire has done yeoman’s work identifying workloads (such as string processing) where SIMD optimization can both deliver a compelling speedup and benefit many end customers. As these innovations find their way into mainstream software, and as AVX512 support becomes more pervasive, they promise to deliver the seamless benefits to end customers that Intel envisioned 30 years ago.
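To illustrate those mask registers, here is a minimal sketch of my own (not drawn from Lemire’s work) of an AVX512BW loop in C that counts occurrences of a byte in a string – a toy version of the string-processing workloads mentioned above. It assumes a CPU with AVX512BW and POPCNT and a compiler flag such as gcc -O2 -mavx512bw:

```c
#include <immintrin.h>
#include <stdio.h>
#include <string.h>

// Count occurrences of byte c in s, 64 bytes per iteration.
static size_t count_byte(const char *s, size_t n, char c) {
    const __m512i needle = _mm512_set1_epi8(c);
    size_t count = 0, i = 0;
    for (; i + 64 <= n; i += 64) {
        __m512i chunk = _mm512_loadu_si512(s + i);
        // 64 byte-compares in one instruction; the results land in a
        // k (mask) register rather than a vector register.
        __mmask64 hits = _mm512_cmpeq_epi8_mask(chunk, needle);
        count += (size_t)_mm_popcnt_u64(hits);
    }
    if (i < n) {
        // Per-lane predication: a masked load handles the tail in one
        // step without reading past the end of the buffer.
        __mmask64 tail = ~0ULL >> (64 - (n - i));  // low (n - i) lanes enabled
        __m512i chunk = _mm512_maskz_loadu_epi8(tail, s + i);
        count += (size_t)_mm_popcnt_u64(
            _mm512_cmpeq_epi8_mask(chunk, needle) & tail);
    }
    return count;
}

int main(void) {
    const char *text = "the quick brown fox jumps over the lazy dog; "
                       "the quick brown fox jumps over the lazy dog";
    printf("%zu\n", count_byte(text, strlen(text), 'o'));  // prints 8
    return 0;
}
```

The point to notice is the predication: the comparison produces a 64-bit mask instead of a vector, and the masked load lets the remainder be handled per lane rather than falling back to a scalar cleanup loop – exactly the kind of machinery MMX and SSE lacked.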
¹ I often joke that the specs for video instructions don’t have to be published; they just have to be texted to the ten or so developers in the world who know how to write optimized video codecs.