APUs: When and Why They Work
A brief history grapples with the question: Why hasn't unified memory taken over?
The history of computer engineering, especially around the hardware/software interface, is punctuated by the rise and fall of seductive ideas that seem inevitable at their peak, but eventually peter out for want of fuel, or are sometimes unceremoniously extinguished like snow falling on a newly-built fire1. Sometimes the time just isn’t right, like when Microsoft was flogging Talisman, a tiled rendering hardware architecture, in the mid-1990s2. Sometimes they turn out just to have been bad ideas, like Microsoft’s massive foray into managed code that resulted in a years-long schedule slip of their Longhorn operating system.
Usually these ideas take a decade or so for their viability to become evident, and the industry adopts them or doesn’t (as with GPU computing itself); but one such idea that has been repeatedly explored at different phases of industry history is what AMD calls an “APU” (accelerated processing unit): a CPU and accelerator in a single device, with shared, sometimes cache-coherent, access to the same memory pool.
On paper, APUs address the single biggest deficiency in GPU programming models, namely that GPUs have their own memory. Every new CUDA programmer remembers learning about cudaMalloc() and having to manually copy inputs to, and outputs from, device memory attached to the GPU. Every superficial analysis of CUDA’s programming model seizes on this additional complexity, and it’s intuitive to believe that not having to separately allocate GPU memory eliminates the single biggest barrier to GPU programming.
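For readers who haven’t written CUDA, the pattern in question looks roughly like the sketch below (error handling omitted; saxpy_kernel and saxpy_on_gpu are illustrative stand-ins, not anything from a real codebase):

```cpp
#include <cuda_runtime.h>

// A stand-in kernel: y = a*x + y (SAXPY). The computation doesn't matter;
// the choreography around it does.
__global__ void saxpy_kernel(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// The classic discrete-GPU workflow: allocate device memory, copy the
// inputs across the bus, launch the kernel, copy the outputs back, free.
void saxpy_on_gpu(int n, float a, const float *hostX, float *hostY) {
    float *devX = nullptr, *devY = nullptr;
    cudaMalloc((void **)&devX, n * sizeof(float));
    cudaMalloc((void **)&devY, n * sizeof(float));
    cudaMemcpy(devX, hostX, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devY, hostY, n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, devX, devY);
    cudaMemcpy(hostY, devY, n * sizeof(float), cudaMemcpyDeviceToHost); // synchronizes
    cudaFree(devX);
    cudaFree(devY);
}
```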
Before we delve further into the tradeoffs of APUs, I want to discuss some market forays (successes and failures) across decades of computer engineering history.
Late 1990s – Integrated CPU/GPU
My first encounter with the idea of CPU/GPU integration came at a late-1990s roadmap disclosure by one of Microsoft’s hardware partners, as Direct3D 6.0 was poised to bring multitexture support to the platform. A CPU vendor disclosed that they had licensed GPU IP and were building a single-die device that would feature both a competent x86 CPU implementation and a multitexture-capable GPU. It would have unified access to memory, but the main anticipated benefit of this device was the economies of scale for manufacturing3. The designers also spoke derisively about the low clock speeds targeted by GPUs, and said they expected the graphics portion of their device to outperform its contemporaries because it would operate at a higher clock speed.
The result? A device that went all the way through tapeout, through bringup, to a working Windows driver… and complete market failure. None of the OEMs would pick it up. We were told that the OEMs wanted to preserve the measure of flexibility offered by the PCI bus, which enabled a graphics board from any hardware vendor to be plugged into the OEM’s machine. Any OEM who preinstalled a discrete graphics card into a computer with an integrated CPU+GPU would have paid for a GPU that benefited neither them nor the customer.
Then, as now, the benefits of tight integration were more than offset by the inflexibility of the hardware design. (Knowing Intel’s history of anticompetitive behavior, another reasonable explanation for the product not getting any design wins would be that Intel stifled this product in the crib by exerting inappropriate market pressure. I don’t know of any evidence to support this hypothesis, though.)
Mid-2000s: Xbox Integration
Fast forward less than a decade to cost-optimized console implementations, and we can examine a market success: when the Xbox 360 initially shipped, it was a multi-chip hardware implementation with separate CPU and GPU. Microsoft later designed a single device that contained both, and furthermore was constrained to make that device timing-compatible with its predecessor – a difficult proposition. The device was a success partly because, for gaming consoles, uniformity of the platform is a strength, not a weakness. Additionally, since the chips were manufactured at high volume (into the dozens of millions), the up-front design and validation costs were more than offset by per-unit manufacturing savings. Game developers targeting consoles have had the opportunity to develop for unified memory architectures (with and without cache coherency) for two decades or so.
Late 2000s-2011: AMD APUs
Around the same time that the cost-optimized Xbox 360 was being built, AMD purchased the graphics chip vendor ATI Technologies, sparking the beginning of a 19-year preoccupation with APUs that has yet to yield a decisive market success. Immediately after the ATI acquisition, AMD set about creating a device akin to the ill-fated APU of the late 1990s: a single chip with both CPU and GPU on the die. Unfortunately, whether due to the teams’ unfamiliarity with each other or the inherent complexities of designing such a device, AMD slipped schedule so badly that they missed the window on the targeted fab process, which required a wholesale redesign of the chip, since all of the designers’ assumptions about transistor budget and clock rate had to be revisited. It wound up shipping around 2011 (~2 years late), which happens to be when AMD Corporate Fellow Phil Rogers kicked off the Heterogeneous System Architecture (HSA) Foundation. He recruited partners such as Qualcomm and Samsung, whose target markets necessitated SOCs, and they hosted a conference in Bellevue, Washington.
I was working at Amazon at the time, and I went to the conference out of curiosity and with a unique perspective: I’d had the dubious pleasure of designing CUDA’s APU/zero-copy enlightenments for CUDA 2.2. In the late 2000s, NVIDIA had a reasonably successful chipset business in the MCP product line, and they integrated GPUs into the north bridge. These devices proffered the same copy-elision benefits as other types of APU, and I’d looked far and wide for workloads that would benefit. I asked Scott Le Grand for a compute-intensive, elementwise workload, and he suggested the Black-Scholes options computation. But I was disappointed at how hard it was to find workloads that benefited from the copy elision – the Black-Scholes computation mostly benefited because our GPUs were so much faster at doing the computation, not because the copy could be avoided. As if to underscore that point, GT200, the biggest-ever Tesla-class GPU, had gotten some enhancements to improve system memory rendering performance, and those worked well enough on mapped pinned memory that the GPU could saturate both directions of the PCIe bus.
So I went to Bellevue to attend Phil’s HSA conference, and came away bemused. AMD and their partners were deeply invested in the idea that the benefits of unified memory (simplified memory management and copy elision) would be so disruptive that they’d be able to displace CUDA in the data parallel programming market. Needless to say, they turned out to be wrong, and I can’t identify a single cause of that market failure.
Part of the problem may have been that the partners AMD had recruited were targeting tiny SOCs for mobile applications, where the portability benefits were not nearly as pronounced as the portability of an API like OpenGL is on workstation platforms. Part of the problem may have been that HSA was a low-level API, and, as with OpenCL, no one bothered to layer a usable runtime API on top of it. Though it is cumbersome to allocate device memory and copy data to and fro, it is equally cumbersome to load compiled kernels and launch them through low-level APIs.
NVIDIA’s APU aspirations were disrupted when Intel sued to revoke NVIDIA’s chipset license, under the pretext that Intel CPUs now integrated the memory controller. NVIDIA countersued, and later raised this issue in its private antitrust lawsuit against Intel, a case whose $1.5B settlement was announced in 2011. As a result, for NVIDIA, their Tegra line of ARM-based SOCs became the only deployment vehicle for APU-like devices for most of the 2010s.
For AMD’s part, Phil Rogers left for NVIDIA, and that was the end of HSA4, although HSA lives on in AMD’s code bases as one of the lowest-level unprivileged APIs available to program their GPUs.
But AMD was not done with its APU fever dreams. In the late 2010s, they won a supercomputer design by proposing a chiplet-based APU architecture – what wound up being MI300A – and when I hired into AMD in 2021, AMD was flogging the benefits of APUs just as enthusiastically as Phil Rogers had in 2011: simplified memory management and copy elision.
For HPC, there is an intuitive appeal to this simplified programming model. There is 50-year-old FORTRAN code extant that must run on modern supercomputers, and that code certainly was not written with coprocessors that have their own memory in mind.
Aside: NVIDIA Also Loves APUs
In fairness, NVIDIA also believed (and believes) that simplifying memory management would unlock new platform opportunities for CUDA. They partnered with IBM to design Pascal and Volta GPUs that had a cache-coherent NVLink connection with IBM’s POWER processors. So although the memory was not unified, it gave the same illusion of a shared memory pool as multiple CPUs with integrated memory controllers could, with directory-based cache coherency protocols allowing one CPU to “borrow” cache lines as needed. As long as no false sharing occurred, these systems worked great for CPUs. Although NVIDIA undoubtedly learned a lot from the partnership and built a supercomputer out of it, the product was not a success and support for POWER was recently deprecated in the latest versions of CUDA. NVIDIA also has made heavy investments in “managed memory” (demand paging support in the GPU), coherency protocols across NVLink that enable the Grace Hopper superchip to more-tightly couple the CPU and GPU memory pools, and the DIGITS SOC that undoubtedly will feature excellent coherency between the ARM CPU cores and the Blackwell GPU cores on that device.
AMD MI300A: Conclusion
As far as MI300A was concerned, it was novel in ways that preceding APU attempts hadn’t been: it was chiplet-based, with packaging to enable the smaller chips to transfer data and administer coherency protocols at higher performance than previously had been possible. Nevertheless, its core value proposition was the same as other APUs: unified memory and copy elision. And… MI300A was not a market success. As with HSA, it may be instructive to examine the reasons.
Part of the reason is because AMD’s Instinct GPUs have been low-volume parts throughout their history. With design and validation costs in the hundreds of millions and tapeout costs running into the tens of millions of dollars, economies of scale are a must to help control per-unit costs.
Part of the reason is because despite NVIDIA’s best efforts, very little CUDA code takes advantage of the unified memory features that have been present since CUDA 2.2. Almost all CUDA code is modeled on a discrete GPU with its own memory, where inputs must be copied over and outputs must be copied back.
Part of the reason is because in contrast to MI300A, AMD’s MI300X variant was a better fit for the requirements of machine learning workloads, which always have been native to discrete GPUs. To my knowledge, no benefit to copy elision has been demonstrated for machine learning – for the most part, the weights stay resident in GPU memory while training data and inference inputs and outputs transit the interconnect.
But I think the main reason the MI300A didn’t achieve traction in the contest for clock cycles was a fatal flaw in the APU architecture, one that dates back to the earliest forays into using GPUs as coprocessors (as opposed to graphics accelerators).
APU As Coprocessor
Throughout APU history, if the GPU cores were being treated as a coprocessor (like 1980s-era 80x87 coprocessors), they’ve suffered from a two-fold problem: on the one hand, launch latency, and on the other, the need to synchronize execution with the CPU after the GPU is done processing. APUs for graphics don’t suffer from this problem: graphics APIs always have enabled (if not encouraged) CPU/GPU concurrency.
For MI300A, or other APU-like architectures such as Grace Hopper, consider one of the most obvious cases for offload: BLAS acceleration.
There are highly optimized BLAS libraries for both CPUs and GPUs; on an APU, it seems reasonable to interpose a shim between the application and the libraries, and delegate to the GPU when it “makes sense.” For one-dimensional, bandwidth-bound workloads like DOT and AXPY, any problem above a certain size makes sense to delegate. For multidimensional workloads like GEMM, a more sophisticated heuristic is needed; but the GPU definitely will be faster than the CPU above a certain problem size. Kernel launches take time, sometimes as long as several microseconds, but that overhead can be accounted for in the heuristic.
Now once the GPU has finished, it must be idled before returning control to the caller, lest the CPU try to read from an output buffer that the GPU hasn’t yet written.
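Here is a sketch of what such a shim might look like for DAXPY, assuming cuBLAS on the GPU side, a CBLAS implementation on the CPU side, and buffers that are visible to both processors (managed memory, or a coherent APU). The daxpy_shim name, the crossover threshold, and the global handle are all illustrative, not anyone’s shipping implementation:

```cpp
#include <cblas.h>          // CPU BLAS (e.g., OpenBLAS)
#include <cublas_v2.h>      // GPU BLAS
#include <cuda_runtime.h>

// Hypothetical shim: same shape as cblas_daxpy, but delegates to the GPU
// when the problem is big enough to amortize launch + synchronization
// overhead. Assumes x and y are accessible from both CPU and GPU.
static const int kGpuCrossover = 1 << 20;   // ~1M elements: made-up break-even point
static cublasHandle_t g_cublas;             // assume created once at startup

void daxpy_shim(int n, double alpha, const double *x, int incx,
                double *y, int incy) {
    if (n < kGpuCrossover) {
        cblas_daxpy(n, alpha, x, incx, y, incy);   // CPU path: done when it returns
        return;
    }
    // GPU path. The launch itself costs on the order of microseconds...
    cublasDaxpy(g_cublas, n, &alpha, x, incx, y, incy);
    // ...and we must idle the GPU before returning, because the caller is
    // entitled to read y immediately. These two bubbles bracket the kernel.
    cudaDeviceSynchronize();
}
```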
And therein lies the problem: when the GPU is treated as a coprocessor, enlisted to offload CPU computations opportunistically, the GPU’s work is bracketed by pipeline bubbles, due to kernel launch overhead on the one hand and CPU/GPU synchronization on the other.
And bubbles in the GPU pipeline amount to Amdahl’s Law slowdowns that inhibit the GPU’s utility.
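To put illustrative numbers on it: suppose a given call takes 200 microseconds on the CPU and 20 microseconds on the GPU, a 10x advantage in isolation. Add 5 microseconds of launch overhead and 25 microseconds of post-kernel synchronization, and the caller observes 50 microseconds, a realized speedup of only 4x; shrink the problem further and the bubbles swallow the advantage entirely.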
In the case of x87 integration and other instruction set enhancements, or the addition of Tensor Cores to NVIDIA GPUs, the incremental cost of the additional functionality is small compared to hanging an entire accelerator device off a CPU. The reason is that most of the circuitry in the CPU and GPU, such as caches and control flow logic, is needed regardless of the new functionality. With APUs, all of the additional circuitry is just deadweight until it gets asked to do something useful.
Conclusion: What To Do?
The only solution I’ve been able to think of to address this problem is a more coarse-grained version of hyperthreading: overschedule the GPU, so it can find useful work to do during what otherwise would be pipeline bubbles. Servicing multiple inference requests concurrently, or running in a virtualized environment, with the hardware servicing multiple VMs, might be ways to ensure the GPU is making itself useful more often.
Absent that, it seems as if the utility of APUs / SOCs will be limited to cost savings, whether through volume manufacturing (as with the Xbox SOC, or more recently the NVIDIA DIGITS platform) or power efficiency (one of the claimed benefits of MI300A), with those benefits offset by the more-rigid designs. In contexts where the designs’ inflexibility is not a deficiency, APUs can work. But as history has illustrated, the circumstances where APUs work aren’t always well-aligned with their designers’ expectations.
1. “High up in the tree one branch dropped its load of snow. This fell on the branches beneath. This process continued, spreading through the whole tree. The snow fell without warning upon the man and the fire, and the fire was dead. Where it had burned was a pile of fresh snow.” – To Build A Fire, Jack London
2. Tiled renderers are 3D rendering architectures that, instead of creating and editing an entire frame buffer as geometry is submitted, cache all of the geometry for a given frame and iteratively render it into “tiles” small enough for the intermediate data structures (Z buffer, pending pixels) to fit on-chip. Microsoft never managed to get them to penetrate the PC graphics market, but they are pervasive in mobile chipsets such as Qualcomm Snapdragon because they are more power-efficient.
3. For integrated GPUs on the PC, it is very common to have unified memory, with a “carveout” reserved for the GPU whose size typically is configurable in the SBIOS. In CUDA 2.2, when we added APU/unified memory/zero-copy enlightenments, developers could steer allocations into the carveout by specifying the WRITECOMBINED flag to cuMemHostAlloc(). The benefit to the GPU was higher performance, in exchange for the loss of cache coherency with the CPU.
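For concreteness, a sketch of what that setup looked like with the driver API (error handling omitted; allocZeroCopy is a hypothetical helper name, and the context must have been created with CU_CTX_MAP_HOST for the mapping to be available):

```cpp
#include <cuda.h>
#include <stddef.h>

// Zero-copy setup circa CUDA 2.2 (driver API). Allocates mapped,
// write-combined pinned host memory and returns a device pointer that
// kernels can dereference directly -- no cuMemcpy required.
// Write-combining speeds up GPU access over the bus, at the cost of the
// CPU losing cached (coherent) access to the buffer.
CUdeviceptr allocZeroCopy(size_t numBytes, void **hostPtrOut) {
    cuMemHostAlloc(hostPtrOut, numBytes,
                   CU_MEMHOSTALLOC_DEVICEMAP | CU_MEMHOSTALLOC_WRITECOMBINED);
    CUdeviceptr devPtr = 0;
    cuMemHostGetDevicePointer(&devPtr, *hostPtrOut, 0);
    return devPtr;   // pass this to kernels; the CPU writes through hostPtrOut
}
```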