Caches and Abstractions
No discussion of caches is complete without examining CUDA's explicit residency management semantics
Phil Eaton’s tweet of this short essay on caches as an abstraction by Justin Jaffray found its way into my Twitter feed the other day, and it inspired me to write something up on early CUDA history and what CUDA adoption can teach us about developers’ willingness to manage memory residency explicitly rather than rely on caches to manage it on their behalf.
The author argues that caches should be thought of as an abstraction, not an optimization, and that’s a useful contrarian view. For decades, hardware designers have interposed caches between CPU cores and their memory, and across those decades, it was customary for both CPU designers and their customers to think of them as an optimization. The integrated 8 KB cache on Intel’s 80486 (the first x86 implementation with more than 1M transistors) certainly felt more like an optimization than an abstraction: because the hardware team had ample real workloads to run simulations on, they could design caching policies that maximized the performance of those workloads. The primacy of tracing and simulation in the design of caches and their replacement policies stems from two assumptions:
1) developers shouldn’t have to know the details of the cache implementation - the presence of caches should be invisible, as alluded to by OP when they realize caches are a programming simplification, not an optimization; and
2) more can be discovered by the hardware at runtime than the developer could possibly know when building their application. The caches, after all, are an intermediate data structure built up by the hardware as the workload runs.
The primacy of trace-driven cache design is so ingrained in the computer hardware architecture community that I have never met a CPU architect who didn’t hate CUDA’s explicit caching mechanisms with the heat of 1,000 suns. More on that later!
As soon as caches became common in CPUs, developers started exploiting them intentionally in a way that broke Assumption 1) above; see, for example, the classic paper by Lam et al., The Cache Performance and Optimizations of Blocked Algorithms, which points out that matrix multiplication algorithms run much faster if “blocked” (a more contemporary term of art would be “tiled”) to reduce cache misses.
Such algorithms inevitably require the tile size to be parameterized according to the characteristics of the target machine. It’s not clear where this weaponization of the machine’s implementation characteristics falls in OP’s taxonomy of “optimization” versus “abstraction.” When running an optimized matrix multiply that’s tiled according to the caching properties of the target machine, the cache is neither an optimization that strictly improves performance (as Lam et al. pointed out in 1991, improper tiling can greatly hinder performance), nor an abstraction that simplifies programming. If anything, tiling for cacheability is much more difficult to program.
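To make the blocking idea concrete, here is a minimal sketch of a tiled matrix multiply in plain C++ (it is not Lam et al.’s code; BLOCK is a hypothetical tuning parameter that would be sized to the target machine’s cache):

    #include <cstddef>

    // Blocked (tiled) matrix multiply: C += A * B for square N x N matrices in
    // row-major order. BLOCK is chosen so that tiles of A, B, and C fit in cache.
    constexpr std::size_t BLOCK = 64;  // hypothetical tile size; tune per machine

    void matmul_blocked(const float* A, const float* B, float* C, std::size_t N)
    {
        for (std::size_t ii = 0; ii < N; ii += BLOCK)
            for (std::size_t kk = 0; kk < N; kk += BLOCK)
                for (std::size_t jj = 0; jj < N; jj += BLOCK)
                    // Multiply one tile; the bounds handle N not divisible by BLOCK.
                    for (std::size_t i = ii; i < ii + BLOCK && i < N; ++i)
                        for (std::size_t k = kk; k < kk + BLOCK && k < N; ++k) {
                            const float a = A[i * N + k];
                            for (std::size_t j = jj; j < jj + BLOCK && j < N; ++j)
                                C[i * N + j] += a * B[k * N + j];
                        }
    }

The inner loops reuse each tile many times while it is still resident in cache; pick BLOCK badly and the tiles evict one another, which is the kind of interference Lam et al. analyzed.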
Jaffray’s lament about their a priori knowledge of their data’s residency also resonates: “It seems like for my application I should have a much more fine-grained understanding of how things should work. Why am I outsourcing my understanding of the data to a generic ‘policy’?”
It’s an ancient pastime in this industry, debating the division of labor between developers and the hardware that their code will run on. Game developers are an example of a cohort that historically has insisted on maximum control.
When Microsoft started building the DirectX suite of APIs about 30 years ago, game developers were uncompromising in their demands to exercise fine-grained control over residency of assets. In fact, the Direct prefix was inspired by their universal cry for “direct control of the hardware.” Before Direct3D, there were DirectDraw (to control SuperVGA hardware), DirectSound (to control audio hardware), DirectInput (for mouse and keyboard input, which looked very different from the Win32 APIs), and so on. For purposes of this discussion, DirectDraw presented a particularly instructive case: it enabled developers to allocate buffers in video memory, and invoke BitBlt operations to copy them between different video memory locations. Eric Engstrom once told me that some of the most compelling demos for early DirectDraw were on old machines with the 16-bit ISA bus, because it was so much faster to leave bitmaps in video memory and just tell the graphics chip which video memory locations to copy between, rather than having the data round-trip across the bus.
In keeping with the DirectX imperative to give developers “direct access” to the hardware, DirectDraw empowered the developer to decide what data would reside in video memory and when and how its contents would be used. Using these APIs, developers could build their own caching policies. Early versions of Direct3D echoed this usage pattern, reverse-delegating to developers the problem of deciding which textures and other assets should be resident in video memory at any given time.
Almost a decade later, when I was designing the driver API for CUDA, the DirectX design sensibilities seemed like a good fit for the requirements, at least for the low-level APIs: after all, any caching policy can be implemented on top of a competent set of lower-level APIs. The common ancestry of the hardware (the first CUDA-capable GPU was also a VGA-compatible graphics chip that most customers understood to be the first DX10-capable GPU) only made that API design choice more natural.
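For readers who have never seen it, here is a minimal sketch of that driver API idiom (illustrative only, not the original implementation, with error checking omitted): the application explicitly allocates device memory and explicitly stages data into and out of it, so residency is always a decision the developer made.

    // Sketch of the CUDA driver API idiom: residency is explicit. The
    // application decides what lives in GPU memory and when it moves.
    #include <cuda.h>
    #include <stdlib.h>

    int main(void)
    {
        cuInit(0);

        CUdevice dev;
        cuDeviceGet(&dev, 0);

        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev);

        const size_t bytes = 1 << 20;
        float* host = (float*)malloc(bytes);

        CUdeviceptr device;                  // a buffer placed in GPU memory on purpose
        cuMemAlloc(&device, bytes);

        cuMemcpyHtoD(device, host, bytes);   // stage inputs into device memory
        // ... launch kernels that operate on 'device' here ...
        cuMemcpyDtoH(host, device, bytes);   // copy results back when you decide to

        cuMemFree(device);
        cuCtxDestroy(ctx);
        free(host);
        return 0;
    }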
Consequently, CUDA became a proving ground for OP’s conjecture that they needn’t “[outsource] the understanding of my data to a generic ‘policy’.” Early CUDA hardware could only process data that was resident in GPU memory: inputs had to be copied to buffers allocated in device memory, and outputs had to be copied back. Furthermore, CUDA kernels included a scratchpad memory called “shared memory” that was under explicit developer control. Between CUDA’s explicit management of GPU memory buffers and shared memory, it became a natural experiment to see what, exactly, developers could do when granted these superpowers.
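And here is a sketch of the second superpower: a kernel that explicitly stages its working set through the __shared__ scratchpad (a block-wise sum reduction, illustrative only; it assumes a launch with 256 threads per block).

    // Illustrative kernel: the developer, not a cache replacement policy,
    // decides what sits in the on-chip shared memory scratchpad and for how long.
    #include <cuda_runtime.h>

    __global__ void blockSum(const float* in, float* blockSums, int n)
    {
        __shared__ float tile[256];          // explicitly managed on-chip storage

        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        tile[tid] = (i < n) ? in[i] : 0.0f;  // stage this block's slice on-chip
        __syncthreads();

        // Tree reduction performed entirely out of shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            blockSums[blockIdx.x] = tile[0];
    }

    // Usage: blockSum<<<(n + 255) / 256, 256>>>(d_in, d_blockSums, n);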
It turns out…they can do a lot. CUDA has exceeded everyone’s expectations in terms of the diversity of workloads it can competently address. We built CUDA for dense linear algebra, but it turned out to be useful for so much more: sorting, image processing and computer vision, drug discovery, video transcoding, machine learning… the list goes on and on. Even NVIDIA’s own attempts to render explicit copying obsolete (in the form of managed memory) and shared memory obsolete (in the form of better L1 cache support) have not dented developers’ demands to explicitly control the residency of their data.
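For contrast, here is a sketch of the managed-memory style that paragraph alludes to (illustrative only): a single pointer is valid on both host and device, and the driver migrates pages on demand instead of the developer issuing explicit copies.

    // Illustrative contrast: with managed memory, residency is delegated back
    // to a generic policy; the driver migrates pages between host and device
    // on demand. Error checking omitted.
    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n, float alpha)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= alpha;
    }

    int main()
    {
        const int n = 1 << 20;
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));    // no explicit residency decision

        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // touched on the host...

        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // ...then on the device
        cudaDeviceSynchronize();                        // pages migrate as needed

        cudaFree(data);
        return 0;
    }

Many developers still prefer the explicit version, for exactly the reason Jaffray articulates: they already know where their data needs to be.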


