Ten Years Later: CUDA Succeeded Despite...
After posting a list of reasons why CUDA succeeded, it seems worthwhile to reflect on some of CUDA's apparent vulnerabilities, and on why it has succeeded despite them.
CUDA Succeeded Despite...
1. Being Proprietary
NVIDIA builds the hardware and software to run CUDA applications and has never licensed the technology to anyone else. Conventional wisdom in the industry holds that proprietary software technologies are doomed to failure – they don’t get shepherded well by a single owner, and they don’t gain adoption by developers. But by making CUDA software portable to everything from Linux to Windows to macOS, and making CUDA hardware available in a broad range of products from SoCs (Tegra) to high-end servers (DGX-1), NVIDIA has staved off the risks they incurred by going it alone.
(Ed.- when this post was written, the Supreme Court hadn’t yet issued the Google v. Oracle decision, which deemed use of header files to be permissible under the fair use doctrine when building a clean-room implementation of an interface. I’ve written separately about the implications of this ruling here and here.)
2. Explicit Memory Management
It’s every new CUDA programmer’s rite of passage: As if allocating and copying input and output data to and from device memory weren’t enough trouble, developers must also explicitly manage shared memory to facilitate data interchange between threads.
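For a sense of what that rite of passage looks like, here is a minimal sketch (the kernel, names, and sizes are hypothetical, not drawn from any particular application): the host explicitly allocates device memory and copies data in each direction, and the kernel stages data through shared memory that the programmer manages by hand.

```cuda
// Minimal sketch (hypothetical example): explicit device allocations, explicit
// host<->device copies, and explicitly managed shared memory.
#include <cuda_runtime.h>
#include <stdlib.h>

// Reverses the elements within each 256-element block, staging them
// through programmer-managed shared memory.
__global__ void reverseWithinBlocks(const float *in, float *out, int n)
{
    __shared__ float tile[256];                    // managed explicitly by the programmer
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                               // threads exchange data via shared memory
    int j = blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x);
    if (j < n) out[j] = tile[threadIdx.x];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);                                  // explicit device allocations...
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // ...and an explicit copy in

    reverseWithinBlocks<<<n / 256, 256>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // explicit copy back out

    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}
```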
Fortunately for NVIDIA, due to the First Law of CUDA Development, developers haven’t been fazed by the need to learn these idiosyncrasies.
(Ed.- when this post was written, obviously deep learning hadn’t yet driven NVIDIA’s market cap into the stratosphere. Ever since the first AlexNet model was trained on NVIDIA GPUs, machine learning has been a workload native to discrete GPUs. No one has demonstrated a benefit from the copy elision made possible by unifying CPU and GPU memory, as in, e.g., AMD’s MI300A.)
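(Ed.: for illustration, the copy elision in question means doing away with the explicit cudaMemcpy() calls entirely; a minimal sketch using CUDA managed memory is below. The kernel and names are hypothetical. One pointer is valid on both the CPU and the GPU, and pages migrate on demand.)

```cuda
// Sketch (hypothetical example): managed memory removes the explicit
// host<->device copies -- the same pointer is used on both sides.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));   // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; i++) data[i] = 1.0f;    // written directly by the host

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();                       // required before the host touches the data again

    float first = data[0];                         // read directly -- no cudaMemcpy back
    cudaFree(data);
    return (first == 2.0f) ? 0 : 1;
}
```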
3. Limited Cache Coherency
Some rules of thumb have been internalized by hardware designers to such a degree that they are not so much sound engineering practices as religious edicts. One such rule is that caches have to be coherent. All the time. In hardware.
But CUDA is pervaded by violations of this tenet. Device memory is not coherent with host memory. Shared memory effectively resides in a separate address space, so it isn’t coherent in the same sense as an L1 cache. Constant and texture memory are not coherent with device memory, and when that memory is modified, the illusion of coherence is maintained via software invalidation. (Ed.: And writing to constant memory from a running kernel that subsequently reads from that memory results in undefined behavior.)
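Constant memory is the clearest example of coherence maintained in software: the hardware does not snoop the constant cache; instead, the host updates the constants with cudaMemcpyToSymbol() and kernels launched afterward observe the new values. A minimal sketch, with a hypothetical kernel and coefficient table:

```cuda
// Sketch (hypothetical example): constant memory is updated from the host;
// coherence with the constant cache is maintained in software around kernel
// launches, not by hardware snooping.
#include <cuda_runtime.h>

__constant__ float coefficients[16];   // resides in device memory, served by the constant cache

__global__ void applyCoefficients(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coefficients[i % 16];   // reads are broadcast from the constant cache
}

void runWithCoefficients(const float *d_in, float *d_out, int n, const float *h_coeffs)
{
    // Host-side update; kernels launched after this call see the new values.
    cudaMemcpyToSymbol(coefficients, h_coeffs, 16 * sizeof(float));
    applyCoefficients<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
}
```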
As with explicit memory management, developers are willing to treat the lack of cache coherency as a cost of doing business – as long as they get the performance they crave.
4. Limited PC Market Share
Discrete GPUs occupy only about 25% of the PC market by unit volume, and NVIDIA competes with AMD in that space. NVIDIA’s limited market share helps explain why CUDA has had limited success achieving developer adoption in packaged PC software, even when there’s a good fit with the software’s requirements.
Put yourself in the shoes of an engineering director at (say) Adobe. “Port this code to CUDA,” says NVIDIA, “and it will run much faster… on 18% of your potential customers’ machines.” Even that proposition is sketchy when accounting for the costs and benefits of supporting the full range of CUDA GPUs extant.
But for vertical applications (think HPC), CUDA developers build data centers with thousands of identical servers. And for embedded applications (think automotive), every GPU in a given design win has identical properties. In both cases, developers have a fixed hardware target to develop against, and they get a compelling return on the engineering investment of the CUDA port.
(Ed.: NVIDIA’s investments in Windows support for CUDA have paid off handsomely, as they now enjoy a monopoly position in the GPU workstation market. An estimated 1,200 workstation applications rely on CUDA, making the decision easy for whoever is procuring GPUs for engineers using CAD/CAM applications, for 3D animators, and so on. NVIDIA’s workstation GPU business would be the envy of most companies, but of course is overshadowed by their AI/ML-driven data center GPU business.)
In the longer term, companies like Adobe and Autodesk should be able to gain the same benefits by transitioning to cloud-provisioned GPU platforms.
(Ed.: the transition to cloud may have been hindered by the absence of alternatives to CUDA, which has enabled NVIDIA to charge premium prices to utility computing companies. For those deciding whether to rent or buy their GPUs, that tilts the playing field in favor of those who would buy, as long as the GPUs won’t be idle.)