Discussion about this post

Robots and Chips

This is one of the most insightful technical retrospectives I've read on GPU architecture evolution. Your 2017 observation that "DON'T MOVE THE DATA" was already critical advice has only become more prescient: the FLOPS/byte disparity you documented has continued to widen despite HBM innovations. The table showing the B200 at 10.0 FLOPS/byte versus the H100 at 20.0 is particularly telling: by that measure, Blackwell actually regressed even while doubling HBM bandwidth to 8,000 GB/s, because the TensorCore performance gains outpaced the bandwidth improvements.

What strikes me most is your 2017 statement that Nvidia was "extremely fortunate that deep learning cropped up." You were RIGHT that without ML/AI workloads, there would have been no workload intensive enough to soak up those FLOPS without starving! The fact that Nvidia's market cap went from $117B to $4.2T (35x!) in under 8 years validates this analysis completely.

Your prescience about Grace was also remarkable. The Intel settlement restricting Nvidia from x86 and QPI access clearly pushed them toward ARM and custom interconnects, but it's ironic that those constraints may have positioned them better for the AI era than if they had remained dependent on Intel's roadmap. The DGX Spark GB10 system-in-package you discussed at the end is the perfect culmination of this "don't move the data" philosophy: 128 GB of unified memory eliminating the CPU-GPU boundary for workloads that need it. Your core thesis from 2017 remains the defining challenge of modern computing architecture.
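For readers who want to check the arithmetic-intensity reasoning above, here is a minimal sketch of how a FLOPS-per-byte ratio can be computed. The spec figures below are approximate public datasheet numbers I'm assuming (dense BF16 TFLOPS and HBM bandwidth), not the table values from the post; different precisions or sparsity assumptions will shift the ratios considerably.

```python
# Sketch: compute FLOPS per byte of HBM bandwidth for two GPUs.
# Spec numbers are approximate public figures (assumptions),
# not the article's table values.
specs = {
    # name: (dense BF16 TFLOPS, HBM bandwidth in GB/s) -- approximate
    "H100 SXM": (989, 3350),
    "B200": (2250, 8000),
}

for name, (tflops, gbps) in specs.items():
    # (TFLOPS * 1e12 FLOP/s) / (GB/s * 1e9 B/s) = FLOPs per byte moved
    flops_per_byte = (tflops * 1e12) / (gbps * 1e9)
    print(f"{name}: ~{flops_per_byte:.0f} FLOPs per byte of bandwidth")
```

The point of the exercise is that a kernel must perform at least this many FLOPs per byte it reads from HBM to avoid being bandwidth-bound, which is why "don't move the data" keeps getting more important as compute scales faster than memory.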

