Operator Fusion Is the Most Important Optimization in Deep Learning
Most of the time spent running neural networks on GPUs is not spent computing; it is spent moving data between memory and the compute units. Operator fusion eliminates redundant memory transfers by chaining operations together in a single kernel.
"Operator fusion is the most important optimization in deep learning compilers. Simply put, instead of writing our data to global memory just to read it again, we elide the extra memory accesses by performing several computations at once." (Horace He, Making Deep Learning Go BRRRR)
Modern GPUs like the A100 are lopsided: the tensor cores deliver 312 teraflops of matrix multiplication, the vector units manage about 19.5 teraflops for general fp32 arithmetic, and the memory bandwidth (roughly 1.5 TB/s) supports loading only about 400 billion fp32 numbers per second. So unless an operation performs on the order of 100 floating point operations per element it loads and stores, it spends more time shuffling data than computing. Non-matmul operations like layer normalization and activation functions account for only about 0.2% of total FLOPS in a model like BERT, yet they consume a disproportionate share of runtime because they are memory-bandwidth bound.
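The ~100-ops-per-element threshold falls out of a back-of-envelope calculation. The peak figures below are the published A100 fp32 specs; treating an elementwise op as exactly one read plus one write per element is a simplification.

```python
# Back-of-envelope arithmetic-intensity threshold for an A100.
# Published specs: ~19.5 TFLOP/s fp32 (vector units, not tensor cores)
# and ~1.555 TB/s of HBM bandwidth.
peak_flops = 19.5e12            # fp32 operations per second
peak_bandwidth = 1.555e12       # bytes per second
bytes_per_float = 4

floats_per_second = peak_bandwidth / bytes_per_float   # ~389 billion floats/s
ops_per_float_loaded = peak_flops / floats_per_second  # ~50 ops per float read

# An elementwise op also writes each result back, so every element costs
# two memory transfers and the break-even point roughly doubles.
ops_per_element = 2 * ops_per_float_loaded             # ~100

print(round(ops_per_float_loaded), round(ops_per_element))  # 50 100
```

Any kernel doing fewer operations per element than this threshold is memory-bandwidth bound: the ALUs sit idle waiting for data.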
The solution is fusion: instead of writing an intermediate result to global memory and reading it back for the next operation, keep the data in fast on-chip SRAM (or registers) and chain the operations together. A fused x.cos().cos() takes nearly the same time as a single x.cos(), because the bottleneck is the two global-memory transfers (one read, one write), not the arithmetic. This is also why GELU and ReLU cost almost the same despite GELU involving far more arithmetic operations.
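The memory-traffic argument can be made concrete by counting transfers. This is a toy simulation, not GPU code: the explicit loops stand in for kernels, and the counters stand in for global-memory reads and writes.

```python
import math

def unfused_cos_cos(x):
    """Two separate 'kernels': each one reads its input from simulated
    global memory and writes its output back, so every element moves
    four times in total (two reads, two writes)."""
    reads = writes = 0
    tmp = []
    for v in x:                      # kernel 1: y = cos(x)
        reads += 1
        tmp.append(math.cos(v))
        writes += 1
    out = []
    for v in tmp:                    # kernel 2: z = cos(y)
        reads += 1
        out.append(math.cos(v))
        writes += 1
    return out, reads + writes

def fused_cos_cos(x):
    """One fused 'kernel': each element is read once, both cosines are
    applied while it stays in registers, and the result is written once."""
    reads = writes = 0
    out = []
    for v in x:
        reads += 1
        out.append(math.cos(math.cos(v)))
        writes += 1
    return out, reads + writes

x = [0.1 * i for i in range(8)]
a, traffic_unfused = unfused_cos_cos(x)   # 32 memory accesses
b, traffic_fused = fused_cos_cos(x)       # 16 memory accesses
assert a == b                             # identical results, half the traffic
```

On a memory-bound workload, halving the traffic roughly halves the runtime, which is why the fused version runs at nearly the speed of a single cos.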
This insight has far-reaching implications. It explains why tools like Triton and PyTorch's torch.compile exist: they automate the generation of fused GPU kernels. It also explains the counterintuitive result that recomputation (activation checkpointing) can sometimes reduce both memory use and runtime: doing extra computation avoids extra memory transfers. Understanding the three-regime model (compute-bound, memory-bandwidth-bound, and overhead-bound) is the essential first step for any serious GPU performance work.
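The three-regime diagnosis can be sketched as a toy roofline classifier. The peak numbers are published A100 fp32 specs, but the 10x heuristic for flagging overhead is an illustrative assumption, not a profiling methodology.

```python
def classify_regime(flops, bytes_moved, measured_s,
                    peak_flops=19.5e12, peak_bw=1.555e12):
    """Toy roofline classifier. peak_flops/peak_bw default to A100
    fp32 specs (~19.5 TFLOP/s, ~1.555 TB/s); the 10x factor below is
    an arbitrary threshold for "mostly launch/dispatch overhead"."""
    compute_s = flops / peak_flops         # best case if compute-limited
    memory_s = bytes_moved / peak_bw       # best case if bandwidth-limited
    hardware_s = max(compute_s, memory_s)  # roofline lower bound on runtime
    if measured_s > 10 * hardware_s:
        return "overhead-bound"            # kernel launches, Python dispatch
    return "compute-bound" if compute_s >= memory_s else "memory-bound"

# A big fp32 matmul: 2*4096^3 FLOPs vs ~3 * 4096^2 * 4 bytes of traffic.
print(classify_regime(2 * 4096**3, 3 * 4096**2 * 4, 8e-3))  # compute-bound
# An elementwise op: ~1 FLOP per element, 8 bytes moved per element.
print(classify_regime(2**24, 2**24 * 8, 1e-4))              # memory-bound
# A tiny kernel whose measured time is dominated by launch cost.
print(classify_regime(2**10, 2**10 * 8, 1e-4))              # overhead-bound
```

The point of the sketch is the decision structure: compare achieved runtime against the roofline bound first, and only then ask whether compute or bandwidth sets that bound.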
Takeaway: Before optimizing anything in your deep learning pipeline, determine whether you are compute-bound, memory-bound, or overhead-bound; the right intervention depends entirely on which regime you are in.
See also: The Memory Wall Limits Everything | CUDA Is a Moat Not Just a Library | Dennard Scaling Ended and Everything Changed