CUDA Is a Moat, Not Just a Library

NVIDIA's dominance in AI hardware rests not primarily on chip performance but on a software ecosystem so deeply entrenched that competitors cannot dislodge it even with superior silicon.

"As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates." (Daniel Nishball)

CUDA is not merely a programming language; it is the accumulated weight of two decades of libraries, optimized operators, debugging tools, and developer muscle memory. When PyTorch ballooned to over 2,000 operators, each was quickly optimized for NVIDIA hardware but not for any competitor, so any AI hardware startup wanting full PyTorch support had to match this entire growing surface area. AMD's ROCm, despite significant investment, still ships with broken stable releases. In benchmarks, the MI300X achieves only 620 TFLOP/s against a marketed 1,307 TFLOP/s in BF16, while NVIDIA's H100 hits 720 TFLOP/s of its marketed 989.5, a gap driven almost entirely by software maturity, not silicon capability.
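The size of that software gap is easiest to see as a fraction of marketed throughput. A back-of-the-envelope check, using only the benchmark figures quoted above:

```python
# Back-of-the-envelope check of the BF16 utilization gap cited above.
# The throughput figures are the benchmark numbers from the text,
# not independent measurements.

specs = {
    # accelerator: (achieved TFLOP/s, marketed TFLOP/s)
    "AMD MI300X": (620.0, 1307.0),
    "NVIDIA H100": (720.0, 989.5),
}

for gpu, (achieved, marketed) in specs.items():
    utilization = achieved / marketed
    print(f"{gpu}: {utilization:.0%} of marketed BF16 throughput")
# AMD MI300X: 47% of marketed BF16 throughput
# NVIDIA H100: 73% of marketed BF16 throughput
```

Roughly 47% versus 73% of marketed throughput on otherwise comparable silicon is the moat, expressed as a single number.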

The moat operates at multiple levels simultaneously. NVIDIA's NCCL dominates distributed training communication, while AMD's RCCL is "very far behind." NVIDIA provides thousands of GPUs for PyTorch CI/CD, ensuring an out-of-the-box experience that just works, while AMD customers must set dozens of environment flags and rely on hand-crafted Docker images built by principal engineers. This ecosystem lock-in means that even when a competitor offers better hardware economics, the switching cost in engineering effort makes migration impractical.

However, the moat is not impregnable. OpenAI's Triton compiler and PyTorch 2.0's TorchDynamo/TorchInductor stack are reducing the operator surface from 2,000+ to roughly 250 primitives, making it dramatically easier to target non-NVIDIA hardware. DeepSeek demonstrated that heavy optimization can produce remarkable results on constrained hardware. The question is whether the open-source software stack can mature faster than NVIDIA can deepen its proprietary advantage.
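The compiler argument is worth making concrete. If composite operators are lowered onto a small set of primitives, a new hardware vendor only has to implement kernels for the primitives, not for every one of the 2,000+ operators. This is a toy sketch of that idea, not TorchInductor or Triton code; the primitive names and the `softmax` lowering are illustrative:

```python
import math

# Toy sketch of compiler lowering (illustrative, not TorchInductor itself):
# the backend implements only a handful of primitives, and composite
# operators are expressed as compositions of them. Supporting new hardware
# then means porting the primitives, not the full operator zoo.

PRIMITIVES = {
    "exp": lambda xs: [math.exp(x) for x in xs],
    "sum": lambda xs: [sum(xs)] * len(xs),           # reduction, broadcast back
    "max": lambda xs: [max(xs)] * len(xs),
    "sub": lambda xs, ys: [x - y for x, y in zip(xs, ys)],
    "div": lambda xs, ys: [x / y for x, y in zip(xs, ys)],
}

def softmax(xs):
    """A composite op lowered to primitives (numerically stabilized)."""
    shifted = PRIMITIVES["sub"](xs, PRIMITIVES["max"](xs))
    exps = PRIMITIVES["exp"](shifted)
    return PRIMITIVES["div"](exps, PRIMITIVES["sum"](exps))

print([round(p, 3) for p in softmax([1.0, 2.0, 3.0])])
```

Five primitives here cover softmax, and the same five would cover layer norm, attention scores, and much else; that reuse is why a ~250-primitive target is so much friendlier to non-NVIDIA backends than a 2,000-operator one.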

Takeaway: Software ecosystem lock-in, not raw chip performance, is what makes NVIDIA nearly impossible to displace, and the only real threat comes from open-source compiler stacks that make hardware interchangeable.


See also: The Memory Wall Limits Everything | Custom Silicon Will Eat General Purpose Computing | AI Infrastructure Is Insanely Hard to Build