A key driver of platform adoption is ROI (return on investment): what benefit is derived by developers or customers in exchange for investing in the platform? This axiom is broadly applicable across all platforms, though the benefits that accrue to platform adopters may vary. From x86, to Microsoft Windows, to AWS, developers invest in the platforms in exchange for a return on that investment.
For CUDA, the return always has come in the form of higher performance. As for the investment: application developers must port their code to CUDA, and end customers must procure CUDA-capable hardware, along with software that can leverage the new technology for that higher performance.
It’s important to understand that this ROI calculus is part of the reason CUDA won the battle for clock cycles and developer mindshare. Minimizing the denominator (the investment expected of would-be platform adopters) is one way to maximize the ROI!
From the earliest days of CUDA, NVIDIA made sure to build software that was portable across operating systems and CPU architectures, and with the Tegra family of processors, they were able to build hardware that spanned the addressable market from the largest supercomputers on the planet to mobile applications like drones, cars, and robots.
In building platforms, it’s crucial to bear in mind the scope of the design decisions that contribute to the denominator. They can be roughly divided into three categories, according to whom they impact:
The platform implementor,
Developers targeting the platform, and
End users who utilize the platform.
Let’s examine a few case studies to see how platform designs impact ROI, from narrowest (impacting mainly the platform developer) to broadest (impacting the ecosystem).
Case Study: 80486 FPU Integration
We’ll start by examining a 1980s-era platform innovation that mainly impacted the platform vendor - not would-be developers targeting the platform, nor end customers considering procuring the hardware and software that implements the platform.
When Intel integrated the x87 floating point unit (FPU) into the 80486 processor, they took on a considerable design cost: they had to build a processor that correctly interpreted a family of instructions (the ESCape prefix) that had been designed to operate a physically separate coprocessor. This design choice impacted Intel – they had to invest considerable engineering resources to design and validate the integrated FPU – but developers’ lives were made strictly easier, because the 80486 represented a new baseline: developers would no longer have to detect the presence of an FPU and pursue different code paths if no FPU was available. After enough time had elapsed (akin to the “fleet turnover” statistic you see with automobiles), developers could write code that relied on x87 instructions, and it ran equally well on the 80486, its successor the Pentium, or subsequent x86 chips. Since hardware floating point implementations are 10x faster (or more) than software emulation, Intel seamlessly broadened the applicability of their CPU products without asking anything of their developer base. Of their customer base, they asked that customers pay for FPU functionality regardless of whether their workloads actually required it. (Later, Intel sold a chip called the 80486SX that did not include FPU capabilities; in a predictable side effect of semiconductor economics, they initially built the 80486SX by disabling the FPU: the transistors were still present, but no longer accessible to software.)
I am still skeptical as to whether a majority of 80486 processors actually executed any floating point instructions. NetWare, Novell’s server operating system, was a key driver of high-end CPU sales at the time. It featured a clock-for-clock optimized code path to minimize network latencies. What it did not include was any need for floating point instructions!
Since Intel’s manufacturing volume, and consequent economies of scale, enabled them to charge prices that seemed reasonable both to those procuring server hardware to run NetWare (which did not need the FPU) and to those building PC workstations that could run applications such as AutoCAD (which did), the FPU integration serves as an early example of Intel making an incremental investment (one-time engineering and verification costs, followed by increased manufacturing costs due to the larger die area) to expand its processors’ total addressable market.
I could just as easily have written a little essay on how integrating an 8K cache into the 80486, to reduce the effective latency of memory, was another platform innovation Intel was able to deliver that conferred benefits at little cost to both developers and end customers. But the instruction set extensions serve as a better example of platform innovation than adding a cache, since new instructions translate to new platform capabilities. Also, on-chip caches long have dominated the transistor budgets of CPUs; the MMX instruction set, 57 instructions strong, added only about 8% to the die area of the MMX-enabled Pentium. Intel has since expanded its platform capabilities with SSE and its variants, and AVX up to and including the latest AVX-512. Adding new instructions, while continuing to support all the previously-supported instructions, has been a reliable way to add platform features while minimizing blast radius.
Case Study: AMD Wave64
Our second case study examines a design decision that impacted developers considering adoption of the platform. Increasing their cost of adoption reduces ROI, bending the curve unfavorably for the platform.
NVIDIA, whose CUDA technology enjoys near-monopoly mindshare and market share in data parallel processors, has only built GPUs that execute using 32-thread bundles called “warps,” named for the parallel threads in a loom used to weave cloth. To write programs that run on these machines, NVIDIA’s CUDA software stack includes a compiler that translates CUDA C code into executable code that can run on their GPUs. The warp width of 32 is queryable by developers, but in practice, if hundreds of millions of devices are running applications, and they have only ever done something a certain way, developers inevitably build code that inadvertently relies on that behavior. (See my earlier article, The Implementation Is The Spec: CUDA Edition.)
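For completeness, here is a minimal sketch of how a CUDA application can query that width at runtime through the standard device-properties API (error handling pared down to the essentials):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    // Query properties of device 0; warpSize has been 32 on every
    // CUDA-capable GPU NVIDIA has shipped to date.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    printf("Warp size: %d threads\n", prop.warpSize);
    return 0;
}
```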
AMD’s Instinct family of GPUs, designed to compete with NVIDIA’s families of GPUs that target datacenter applications, always has executed its parallel programs in 64-thread chunks. (Just to keep things interesting, AMD’s consumer-targeted RDNA GPUs support wave32.)
There are contexts where differing warp and wavefront widths are not consequential for developers. The core code of the BLAS library’s AXPY operation (y ← αx + y, a purely elementwise computation) is not substantively impacted by this difference. But a 64-thread version of DOT (BLAS’s dot product function) or other reductions may require different addressing or loop handling as the threads exchange data with one another to compute the final output.
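To make the contrast concrete, here is a sketch of the warp-level tail of such a reduction, in the idiom commonly used inside CUDA kernels; the helper name warpReduceSum is mine, not drawn from any particular library. Notice how the shuffle offsets bake in the assumption of a 32-thread warp:

```cuda
// Sum a per-thread partial result across the 32 threads of a warp.
// The offsets 16, 8, 4, 2, 1 assume warpSize == 32; on a 64-wide
// wavefront the loop would need to start at 32, and the full
// participation mask would no longer fit in 32 bits.
__device__ float warpReduceSum(float value)
{
    for (int offset = 16; offset > 0; offset >>= 1) {
        value += __shfl_down_sync(0xffffffffu, value, offset);
    }
    return value;  // lane 0 ends up holding the warp-wide sum
}
```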
To make matters even more complicated, starting with Kepler (c. 2013), NVIDIA added warp intrinsics that enable thread-to-thread communication within the 32 threads of a warp without involving shared memory. Take the __ballot() intrinsic, for example, which causes each active thread to evaluate a predicate, then delivers every active thread’s predicate, packed into a bitmask, to all the threads in the warp. Here, given that CUDA hardware is a 32-bit machine (meaning, the atomic unit of measure for the register file is 32 bits), __ballot() benefits from the happy coincidence that the warp size also is 32, so the __ballot() return value comes perfectly packaged for processing by the threads in the warp.
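Here is a sketch of that pattern, using the modern __ballot_sync() spelling of the intrinsic; the kernel name and the choice of predicate are illustrative only, and the launch configuration is assumed to cover the input exactly with full warps:

```cuda
// Each thread votes on a predicate; __ballot_sync packs the warp's
// 32 votes into a single 32-bit mask, delivered to every thread.
__global__ void countPositive(const float *in, int *warpCounts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned mask = __ballot_sync(0xffffffffu, in[tid] > 0.0f);
    if (threadIdx.x % 32 == 0) {
        // One 32-bit word per warp: the mask fits exactly because
        // the warp is 32 threads wide.
        warpCounts[tid / 32] = __popc(mask);
    }
}
```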
If you suddenly introduce the idea that warps (or wavefronts, as AMD calls them, apparently under the theory that products can be differentiated by having different names for the same abstraction) are 64 threads wide, these intrinsics’ meaning changes in ways that radiate all the way up into the source code written by developers, starting with the simple fact that the return value is now 64 bits. Reductions and other computations must be formulated to accommodate the new wavefront width, generally in a way that doesn’t disturb the legacy code that was written on 32-thread warps.
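One common way to contain the damage, sketched here under the assumption that the same source must compile for both widths, is to hide the lane-mask type and warp width behind a typedef rather than hard-coding unsigned int and 32; the names (and the BUILD_FOR_WAVE64 flag) are hypothetical, not part of any vendor’s toolchain:

```cuda
// Hypothetical portability shim: choose the lane-mask width to match
// the target's warp/wavefront width. BUILD_FOR_WAVE64 is a made-up
// flag the build system would set when compiling for a 64-wide target.
#if defined(BUILD_FOR_WAVE64)
typedef unsigned long long lanemask_t;   // 64-bit ballot results
#define WARP_WIDTH 64
#else
typedef unsigned int lanemask_t;         // 32-bit results, as on CUDA
#define WARP_WIDTH 32
#endif

// Code written against lanemask_t and WARP_WIDTH, rather than against
// unsigned int and a literal 32, at least has a chance of recompiling
// cleanly for both widths.
```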
Predictably, wave64 has turned out to be the biggest porting barrier for developers who take their CUDA code and update it to run on AMD’s Instinct GPUs. Exacerbating AMD’s misstep, if they ever ship datacenter GPU products that support wave32, they will have to confront a compatibility burden that NVIDIA never has had to worry about: how to support wave64 on wave32-capable chips.
When chasing an incumbent who holds a near-monopoly position in the market, it is never wise to make extra work for oneself.
Case Study: Impacting End Customers
Our final case study puts the microscope on the imposition made by platform developers on the final paying customers of their products.
Here, too, contrasting AMD and NVIDIA GPU availability is instructive. But before we delve into that, let’s pause for a moment to reflect on the end customer impact of manycore processors such as Larrabee, Xeon Phi, and the Cell processor: even after developers have been persuaded to port to the platform in question (which may include porting to the supporting operating system), end customers then must decide to procure machines that include the hardware functionality. As I have blogged about in the past, it is a much bigger leap for end customers to buy machines with brand new capabilities than machines that incrementally expand existing capabilities, as Intel did by integrating the FPU into the 80486.
From a platform perspective, AMD’s GPUs have many of the same advantages as NVIDIA’s: they are peripherals that can be enumerated on the bus and, in theory, AMD could build portable driver software to enable their GPUs to work on any operating system or CPU architecture. But tactically, AMD instead has focused on Linux and x86, limiting ROCm’s total addressable market.
But the platform limitations don’t end there. The AQL (Architected Queueing Language) specification, a hardened, publicly (albeit sparsely) documented interface to submit work to AMD GPUs, only operates on machines that support PCIe atomics. Such machines are all-but-pervasive today, but when AQL was designed, atomic operations across PCI Express were just being introduced to the platform, and it was common for chipsets to ship broken implementations. It would have been a prudent investment in defensive engineering for AMD to figure out how to build a DMA language that did not require functional PCIe atomics.
Not satisfied with the platform flexibility afforded by GPUs’ status as a peripheral, NVIDIA also designed and built the Tegra family of SoCs, CUDA-capable chips small enough to design into mobile applications such as automobiles, drones, and robots. These devices lower the barrier to entry, not only for would-be platform adopters such as automobile, drone, and robot manufacturers, but also for would-be developers who want to learn how to program the platform. AMD’s strategic decision to focus on datacenter GPUs has limited budget-minded developers’ options to cloud vendors such as TensorWave.
Platform coverage is yet another way to minimize would-be adopters’ investment, hence driving up the ROI.
I write a great deal about CUDA and GPU programming, but any platform can be analyzed through the lens of ROI. Other useful lenses, more relevant to operating systems and utility computing, involve network effects and economies of scale - a possible topic of a future article.