The Implementation Is The Spec: CUDA Edition
Or, why does cudaMemcpyAsync() sometimes work with pageable memory?
Is your software bug compatible? Are you following Hyrum’s Law? It’s a concept in software engineering where future versions exactly replicate an *undesirable* feature of a previous version. But why the heck would you replicate unintended behavior? - LaurieWired on Twitter
I’d never heard of Hyrum’s Law before I saw this tweet. I guess I shouldn’t be surprised that someone thought to name this phenomenon, which Microsoft’s Windows team had deeply internalized before I joined the DirectX team in 1996. Raymond Chen had famously fixed a compatibility break in his favorite email client (he used Eudora, for reasons none of us understood, and made it his personal mission to keep it working), where he determined that its WndProc was looking up the stack into a Windows-internal data structure and relying on the value in the stack being zero. He fixed the bug by creating a new variable, declared at exactly the spot needed to ensure it would land in the expected stack location, initialized it to zero, and bequeathed it a naughty name that undoubtedly had to be sanitized before Microsoft could open source or license the source code.
Such compatibility fixes often had to be explained to new members of the team. The bug was unquestionably in the Eudora client! No Windows application had any business introducing dependencies on the layout of the caller’s stack. But what Microsoft was working hard to avoid was a customer experience in which upgrading to a newer version of Windows caused some applications to stop working. If the only thing that changed was the Windows upgrade, customers understandably would blame Windows and not the broken application.
In building CUDA, which started development about 10 years after Windows 95 shipped, we had a keener awareness of compatibility pitfalls, and did some defensive engineering to evade them. But CUDA has had its share of bewildering interface behaviors, and applications have introduced dependencies on them at times.
An Innocuous-Seeming Optimization
One such behavior dates back to the earliest versions of CUDA, when all inputs had to be copied to device memory, and outputs had to be copied back [1]. Host-to-device and device-to-host memcpy were workhorse operations. It was important to make them as fast as possible, especially since GPU kernels could not yet access host memory directly. For host-to-device memcpy, a special optimization was possible: for small memcpy’s, there was a benefit to copying the data directly into the command buffer that the GPU was reading.
The heuristic looked like this:
1. If the memcpy is “small”:
   a. Issue a memcpy-immediate command to copy from the command buffer to the destination.
   b. Copy the input data into the command buffer.
2. Else:
   a. Issue a memcpy-DMA command to copy from the source buffer to the destination.
   b. Synchronize (i.e. wait for the GPU to finish).
3. Return to the caller.
This optimization is sort of like CPU instructions with immediate operands: since the GPU was using fast DMA operations to read the push buffer (i.e. the command buffer) anyway, slipping the source data into the command buffer was faster because it was asynchronous: the function could return immediately after submitting the command to the GPU. The benefit of the asynchrony (avoiding the synchronization in step 2b above) outweighed the penalty of making the extra copy in step 1b.
Because of the direction of memory traffic (the GPU only reads command buffer data), this optimization only worked for host-to-device memcpy.
If memory serves, the threshold was 64KB.
So if an application called host-to-device memcpy, and the number of bytes copied was smaller than 64KB, our CUDA driver just surreptitiously copied the source data into the command buffer, issued a different command to the GPU, and returned to the caller.
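As a hedged sketch of how such a heuristic might be structured inside a driver (the helper functions and the exact threshold are hypothetical stand-ins, not actual CUDA internals):

#include <cuda.h>      /* for CUresult, CUdeviceptr, CUDA_SUCCESS */
#include <stddef.h>

#define SMALL_MEMCPY_THRESHOLD (64 * 1024)   /* "if memory serves, 64KB" */

/* Hypothetical command-buffer helpers: stand-ins for whatever the driver
 * actually uses to append commands and payload data for the GPU to read. */
void pushMemcpyImmediateCommand(CUdeviceptr dst, size_t bytes);
void pushInlinePayload(const void *src, size_t bytes);
void pushMemcpyDmaCommand(CUdeviceptr dst, const void *src, size_t bytes);
void synchronizeWithGpu(void);

CUresult memcpyHtoD_sketch(CUdeviceptr dst, const void *src, size_t bytes)
{
    if (bytes <= SMALL_MEMCPY_THRESHOLD) {
        /* 1a. Tell the GPU the payload follows inline in the command buffer. */
        pushMemcpyImmediateCommand(dst, bytes);
        /* 1b. Copy the caller's data into the command buffer right now. */
        pushInlinePayload(src, bytes);
        /* No synchronization needed: the data has already been captured. */
    } else {
        /* 2a. Ask the GPU to DMA directly from the caller's buffer... */
        pushMemcpyDmaCommand(dst, src, bytes);
        /* 2b. ...so we must wait before the caller can safely reuse it. */
        synchronizeWithGpu();
    }
    return CUDA_SUCCESS;   /* 3. Return to the caller. */
}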
Asynchronous Memcpy
Enter asynchronous memcpy’s, which were added in CUDA 1.1. Asynchronous memcpy’s transparently leveraged an existing abstraction in CUDA – the idea of page-locked memory – by requiring that participating host memory be page-locked. Used in conjunction with the new abstractions CUDA streams and CUDA events, asynchronous memcpy could be used to enable the GPU to run kernels while concurrently transferring data to or from GPU memory. For these new asynchronous memcpy functions, the specification was very clear: participating host memory had to have been allocated with cuMemAllocHost().
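For readers who only know the modern runtime API, here is a minimal sketch of the usage pattern those interfaces enabled. The original CUDA 1.1 functions were the driver-API equivalents (e.g. cuMemAllocHost() and cuMemcpyHtoDAsync()), but the shape of the pattern is the same.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *p, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= k;
}

int main()
{
    const int N = 1 << 20;
    float *h = nullptr, *d = nullptr;
    cudaStream_t stream;

    cudaMallocHost(&h, N * sizeof(float));   // page-locked, as the spec requires
    cudaMalloc(&d, N * sizeof(float));
    cudaStreamCreate(&stream);
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    // Copy, kernel, and copy-back all queued asynchronously in one stream.
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(N + 255) / 256, 256, 0, stream>>>(d, N, 2.0f);
    cudaMemcpyAsync(h, d, N * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);           // only now is h safe to read

    printf("h[0] = %f\n", h[0]);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}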
Given the foregoing discussion about the host-to-device memcpy optimization, you may already see where we are headed:
Due to an accidental ordering of parameter validation, we shipped a version of CUDA that, for host-to-device memcpy, included the small-memcpy optimization but did not require the source data to be page-locked.
The truth is, the optimization is valid regardless of whether the source data is page-locked. Page-locking memory only affects whether the GPU can read or write it: the CPU can always read or write its own memory, and the operating system is facile at handling page faults if the data is not resident.
The reason the optimization is valid is that, by copying the input data before returning, it evades the race condition that would be present if the driver requested a DMA operation to read from the buffer (the CPU could corrupt the input data before the GPU got around to reading it). In addition, that code path always had been asynchronous, even before CUDA 1.1 shipped. Explicitly calling it asynchronous seemed a bit redundant, in fact.
So starting with CUDA 1.1 (c. 2007), small host-to-device memcpy’s did not require source data to be page-locked, even though the specification stated otherwise.
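In practice, the pattern that slipped through looks something like the sketch below: a small host-to-device copy from ordinary pageable memory, queued into a stream. This describes observed implementation behavior, not anything the documentation promises.

#include <cuda_runtime.h>

int main()
{
    float params[16] = { 1.0f };   // e.g. a handful of per-frame constants
    float *dParams = nullptr;
    cudaStream_t stream;

    cudaMalloc(&dParams, sizeof(params));
    cudaStreamCreate(&stream);

    // 'params' lives on the stack -- pageable memory, not cudaMallocHost().
    // Per the specification this should require page-locked memory; per the
    // implementation, a copy this small is captured into the command buffer
    // before the call returns.
    cudaMemcpyAsync(dParams, params, sizeof(params),
                    cudaMemcpyHostToDevice, stream);

    // ... launch kernels that consume dParams in the same stream ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dParams);
    return 0;
}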
This behavior has caused an unbelievable amount of consternation inside and outside NVIDIA over the years.
The Controversy
On one side of the debate: interface design pedants who read the specification and are offended that the interface doesn’t return failure when pageable memory is specified. Their arguments seem to boil down to the lack of orthogonality (this optimization only applies to host-to-device memcpy’s) and engineering cleanliness.
On the other: myself and, apparently, many CUDA developers, who recognize the utility of this ‘loophole.’ Personally, I think it is very difficult to make a case against it, because 1) the memcpy is, in fact, asynchronous: arbitrary amounts of work can be piled up for the GPU, and the source data is copied into the command buffer at call time for future processing; and 2) there is no read-after-write data race, as there is with asynchronous memcpy from page-locked memory, because the payload data was copied away before control was returned to the caller.
As for the lack of orthogonality – the fact that the loophole exists for host-to-device memcpy, but not the other way around – that proceeds directly from the producer-consumer nature of the relationship between the CPU and GPU. The CPU tells the GPU what to do; the GPU listens, and takes heed. So copying data from the CPU to the GPU fits into that regime naturally, while going the other direction does not.
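To make the race-condition argument concrete, here is a sketch of why the staged small copy can be treated as fire-and-forget. It leans on the implementation behavior described above, not on anything the specification guarantees.

#include <cuda_runtime.h>
#include <cstring>

void enqueueSmallUpdate(float *dDst, cudaStream_t stream)
{
    float staging[64];                        // small, pageable, stack-allocated
    memset(staging, 0, sizeof(staging));

    cudaMemcpyAsync(dDst, staging, sizeof(staging),
                    cudaMemcpyHostToDevice, stream);

    // Under the small-copy behavior described above, the payload was copied
    // into the command buffer before cudaMemcpyAsync() returned, so reusing
    // the buffer here cannot corrupt the transfer. With page-locked memory
    // and a true asynchronous DMA, the same reuse would be a read-after-write
    // hazard until the stream is synchronized.
    memset(staging, 0xFF, sizeof(staging));
}   // 'staging' going out of scope is likewise harmless to the queued copy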
Closing The Loophole
After CUDA 1.1 had shipped and been available in the field for some time, a summer intern came in to work on CUDA and decided to deeply interrogate its error returns. They wrote unit tests to ensure that CUDA was not only returning errors when expected, but that it was returning the “correct” errors. We quickly learned that it is a formidable challenge to evolve a large, complicated code base and also keep such tests working as expected: when new error conditions crop up that hadn’t been possible when the tests were written, do we version the error return based on which version of CUDA was called? Reflecting on this episode, I think CUDA benefited from the near-universal practice of checking whether an error occurred, not the exact value of the error code. Exceptions to this pattern, like looping over an allocation (perhaps reducing its size until it succeeds, in order to allocate the largest possible buffer), tend to depend on very specific error returns.
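As a sketch of the two styles: most CUDA code only cares whether a call failed, while the allocation loop below is the rare pattern that keys off one specific error code.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// The common pattern: any error is fatal; the exact code only matters for
// the message.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s failed: %s\n", #call,              \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// The rare pattern: allocate the largest device buffer possible by halving
// the request whenever the failure is specifically an out-of-memory error.
void *allocLargest(size_t *bytes)
{
    void *p = nullptr;
    while (*bytes > 0) {
        cudaError_t err = cudaMalloc(&p, *bytes);
        if (err == cudaSuccess)
            return p;
        if (err != cudaErrorMemoryAllocation)
            return nullptr;      // some other failure: give up
        cudaGetLastError();      // clear the pending error before retrying
        *bytes /= 2;             // out of memory: try a smaller buffer
    }
    return nullptr;
}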
This summer intern, predictably enough, noticed the loophole in cudaMemcpyAsync() where it did not return an error if a small, pageable buffer was specified.
A huge outcry was raised!
Bugs were filed!
Meetings were scheduled!
Much Sturm und Drang ensued!
To many developers on the team, and management, it seemed obvious that we should amend the function to replace the existing behavior and instead return an error on pageable memory.
At the time, after pointing out the arguments above (that the memcpy is, in fact, asynchronous and does not present any race condition risks), I raised the following objection: there are liable to be applications in the wide world that rely on the existing behavior.
All other things being equal, it is never, ever a good idea to introduce functional regressions. When we replace APIs, it’s best if the new APIs can be used as drop-in replacements for their predecessors (think cudaHostAlloc(), which, when called with a Flags parameter of 0, is synonymous with cudaMallocHost()). But despite those objections, we “fixed” the bug in cudaMemcpyAsync() to fail if a pageable address range was specified, and shipped a beta.
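A quick illustration of that equivalence (cudaHostAllocDefault is simply 0):

#include <cuda_runtime.h>

int main()
{
    void *a = nullptr, *b = nullptr;
    cudaMallocHost(&a, 1 << 20);                          // original API
    cudaHostAlloc(&b, 1 << 20, cudaHostAllocDefault);     // superset API, Flags == 0
    cudaFreeHost(a);
    cudaFreeHost(b);
    return 0;
}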
Not two weeks after the beta was made available, we heard from an important developer customer called Elemental Technologies, who were building CUDA-based video transcoding appliances. The folks at Elemental were old friends, having made one of the earliest customer requests for a feature addition to CUDA. In CUDA’s earliest days, they had filed a bug requesting that we add an interrupt-based wait, because they had better things for the CPU core to do than poll a memory location waiting for the GPU to finish.
“Sure, blocking waits are on the roadmap, we’ll just move them up to an earlier CUDA release than planned,” we told them. “By the way, what are you doing? No one else has asked us for blocking waits yet.”
When they told us they were doing video transcoding, we were stunned. We’d known CUDA would be great at linear algebra and N-body computations like gravitational and molecular simulation; video transcoding was the first “surprise workload” for CUDA. So knowing Elemental well, we took heed when they picked up the phone and demanded to know why we’d “broken” cudaMemcpyAsync(). They had debugged the issue down to the point where they knew exactly the conditions where it had worked before, and what we’d changed.
After a quick back-and-forth, we reverted the change. Video transcoding, it seems, needs the occasional tiny host-to-device memcpy in order to work well – and it has to be fast.
This issue seemed to be settled, but I was surprised to discover that it’s still so contentious that it came up on the show floor at GTC 2024, some fifteen years after the conversation with Elemental.
To me, there are few, if any, downsides to continuing to support the current behavior. If anything, NVIDIA has had ample opportunity to improve hardware support for it.
You might look at that as leaning into Hyrum’s Law.
Rules Are Meant To Be Broken
The CUDA team had another discussion regarding Hyrum’s Law, though we certainly did not call it that, around a bug in early implementations of cuMemHostGetDevicePointer(). This now-obsolete function dates back to the early days of mapped pinned memory (CUDA 2.2, c. 2009), before 64-bit addressing had been introduced to CUDA hardware. In CUDA 2.2, the host and device address spaces were different, in part because it was possible to build systems with more memory than was addressable with a 32-bit address space. So mapped pinned memory worked like this: 1) it was an opt-in, i.e. developers had to ask for it; and 2) once it was allocated, developers who wanted to read or write it with GPU kernels had to separately query for the device memory address.
CUresult cuMemHostGetDevicePointer(CUdeviceptr *pdptr, void *p, unsigned int Flags);
Given a host pointer, this function returned the corresponding device pointer. It was implemented by looking up the input address in the same balanced binary tree that tracked the address ranges of all page-locked memory that CUDA had allocated.
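A driver-API sketch of that era’s opt-in flow (error checking omitted): the context must allow host mapping, the allocation must request device mapping, and the device address is a separate query.

#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    void *hostPtr = NULL;
    CUdeviceptr devPtr = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    /* Opt-in #1: the context must allow host-memory mapping. */
    cuCtxCreate(&ctx, CU_CTX_MAP_HOST, dev);

    /* Opt-in #2: the allocation itself must ask to be device-mapped. */
    cuMemHostAlloc(&hostPtr, 1 << 20, CU_MEMHOSTALLOC_DEVICEMAP);

    /* Separate query for the device-visible address of the same memory.
     * Before UVA, these two addresses were not the same value. */
    cuMemHostGetDevicePointer(&devPtr, hostPtr, 0 /* Flags must be 0 */);

    printf("host %p -> device 0x%llx\n", hostPtr, (unsigned long long)devPtr);

    cuMemFreeHost(hostPtr);
    cuCtxDestroy(ctx);
    return 0;
}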
The Flags word in this interface, like all the Flags words in CUDA interfaces, was intended to enable future opt-ins without having to revise the function interface. Some entry points, like cuInit() and cuMemHostGetDevicePointer(), have never taken advantage of this bit of future-proofing - their Flags words always must be set to 0 [2].
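The corresponding defensive check is trivial. A hypothetical entry point (not a real CUDA function) might guard its reserved Flags word like this:

#include <cuda.h>

/* Hypothetical entry point, for illustration only. Rejecting nonzero Flags
 * today keeps those bits available for opt-in behaviors tomorrow. */
CUresult cuSomeFutureProofEntryPoint(void *p, unsigned int Flags)
{
    if (p == NULL)
        return CUDA_ERROR_INVALID_VALUE;
    if (Flags != 0)
        return CUDA_ERROR_INVALID_VALUE;   /* reserved for future use */
    /* ... actual work would go here ... */
    return CUDA_SUCCESS;
}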
The bug was as follows: cuMemHostGetDevicePointer() always returned the base pointer of the corresponding device memory address range, instead of returning the corresponding device pointer address.
The fix was simple: compute the offset of the input pointer within its pinned memory range, then apply that offset to the base device pointer before passing it back.
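In code, the fix amounts to a one-line change. The structure below is a hypothetical stand-in for the driver’s record of a page-locked range, not a real CUDA type.

#include <cuda.h>
#include <stddef.h>

typedef struct {
    void       *hostBase;     /* base host address of the pinned range */
    CUdeviceptr deviceBase;   /* corresponding base device address     */
} PinnedRange;                /* hypothetical stand-in                 */

CUresult getDevicePointer_sketch(CUdeviceptr *pdptr,
                                 const PinnedRange *range, void *p)
{
    size_t offset = (size_t)((char *)p - (char *)range->hostBase);
    /* The bug:  *pdptr = range->deviceBase;   (always the base pointer) */
    /* The fix:  apply the caller's offset within the range.             */
    *pdptr = range->deviceBase + offset;
    return CUDA_SUCCESS;
}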
I am a little embarrassed to say that I thought this bug fix should be applied as an opt-in - we would implement the needed behavior change, and return the expected device memory pointers, only if specifically requested by the caller. The bug was live in the field, having been included in a formal release of CUDA, so my thought was that applications might depend on the legacy behavior. But it was a new feature, and it was admittedly difficult to come up with legitimate scenarios where applications depended on this behavior (after all, it was giving some insight into how CUDA’s handling of pinned memory worked, and those types of implementation details are a big source of Hyrum’s Law vulnerabilities).
So we fixed the bug, and everyone calling cuMemHostGetDevicePointer() got expected behavior until it was rendered obsolete by Unified Virtual Addressing.
[1] The alternative path to bulk memory copies between host and device memory was to map host memory for the GPU, so running kernels could read and write CPU memory directly. This feature was added in CUDA 2.2.
[2] It’s critical to rigorously validate such parameters, or Hyrum’s Law will guarantee they quickly bit-rot, as applications are written that accidentally depend on the function ignoring the Flags word.