AMD's GPU Software: A Software Architect's Take
A Deeper Look At What Is Wrong And How It Could Be Fixed
When I hired into AMD in 2021, I explained the role of a Software Architect (the same title I had held at NVIDIA) with an analogy likening software to poetry:
Everyone can tell when poetry is good or bad, but
Only some people can tell you why, and
Only some of those people can take a bad poem and tell you how to fix it.
So it is with software. Everyone can tell that AMD doesn’t build good software, including AMD. But only a software architect can tell them why their software is bad, or what to do about it. And unfortunately for AMD, they employ few, if any, software architects.
This article glosses over some glaring deficiencies in AMD’s software stack, most notably the absence of an intermediate language [1]. I may write about higher-level, strategic considerations at a later date, but the focus of this article is on the GPU computing stack: HIP, ROCm, and the libraries.
Overview: AMD’s GPU Computing Software
I am not going to waste time trying to establish whether the quality of AMD’s software needs improvement; the December 2024 SemiAnalysis white paper only scratched the surface:
The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.
From the outside, it may seem as if AMD just has a “bug problem”: if they fix enough bugs, the quality of the software will converge, the number of bugs will eventually approach zero, and all will end well.
But in the absence of a cogent, well-designed software architecture, such bug-fixing just amounts to performative flailing. And AMD’s software architecture is poor, not because it was poorly designed, but because it was never designed at all. It grew organically from the OpenCL and Heterogeneous System Architecture (HSA) code bases, both of which are too low-level to be broadly usable by developers.
Now I will don my Software Architect hat and, without offering prescriptive remedies just yet, endeavor to explain the architecture of AMD’s GPU computing software and why AMD has struggled so mightily to get it working well.