Distributed Training Is a Systems Problem Not an ML Problem
Training large neural networks across many GPUs is less about gradient descent and more about networking, scheduling, fault tolerance, and systems engineering at extreme scale.
"Not all hardware is created equal. The variance of cluster quality across hardware providers is so high that it is literally a lottery pertaining to how much pain one would have to go through to train good models." (Yi Tay, "Training great LLMs from ground zero")
The naive view of distributed training is that you just split data across GPUs and average the gradients. In reality, the field has spawned an entire taxonomy of parallelism strategies: data parallelism, model parallelism, pipeline parallelism, and tensor parallelism, each with its own tradeoffs in memory, communication overhead, and bubble time. Lilian Weng's survey shows that pipeline parallelism alone requires careful scheduling of microbatches to avoid idle GPU time, and even GPipe only achieves near-linear speedup when model parameters are evenly distributed.
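The cost of that idle time can be made concrete. For a GPipe-style schedule with p pipeline stages and m microbatches, the fill-and-drain "bubble" occupies (p - 1) of the (m + p - 1) time slots in a step; a minimal sketch (assuming uniform per-stage compute time):

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of a training step a GPipe-style schedule spends idle.

    With p stages and m microbatches, each step takes (m + p - 1) slots,
    of which (p - 1) are pipeline fill/drain bubble.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# Few microbatches: over half the step is bubble.
print(pipeline_bubble_fraction(num_stages=8, num_microbatches=4))   # ≈ 0.636
# Many microbatches amortize the fill/drain cost.
print(pipeline_bubble_fraction(num_stages=8, num_microbatches=64))  # ≈ 0.099
```

This is why pipeline schedules push microbatch counts well above the stage count: the bubble shrinks roughly as 1/m.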
But the deeper truth is that the hardest problems are not algorithmic at all. Yi Tay's experience building Reka reveals that GPU clusters from different providers have wildly different failure rates, cabling quality, and networking performance. Some clusters fail every few hours. Model FLOPs Utilization (MFU) can tank when a teammate starts transferring data across the shared filesystem. The Photoroom team found that Infiniband systematically outperformed Ethernet interconnects by 3-10% at 128 GPUs, with the gap widening at larger scale. They insisted on 48-hour uninterrupted test runs before committing to a provider.
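MFU itself is a simple ratio: achieved FLOPs per second divided by the hardware's peak. A common back-of-envelope estimate for dense transformers is ~6 FLOPs per parameter per token (forward plus backward). A sketch, with the throughput numbers below being purely hypothetical:

```python
def mfu(model_params: float, tokens_per_second: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved throughput over peak hardware throughput.

    Uses the rough 6 * N FLOPs-per-token estimate for a dense transformer
    (forward + backward), ignoring activation recomputation.
    """
    achieved_flops = 6 * model_params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical run: a 7B-parameter model at 400k tokens/s on 128 A100s
# (~312 TFLOP/s BF16 peak each).
print(mfu(model_params=7e9, tokens_per_second=4e5,
          num_gpus=128, peak_flops_per_gpu=312e12))
```

A slow shared filesystem or flaky interconnect shows up directly here: tokens per second drops while the peak in the denominator stays fixed, so MFU falls even though no compute is "broken".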
Industry-wide, average cluster utilization is between 30 and 50%. Companies commit to GPU clusters not to optimize cost per hour, but simply to guarantee their ML teams can access GPUs when they need them. The infrastructure around training (checkpoint management, fault recovery, monitoring, data pipelines) is where most of the engineering effort actually goes.
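The checkpoint-and-resume loop at the heart of fault recovery is conceptually small; the engineering effort goes into making it robust. A minimal sketch (file names, interval, and the stand-in "training step" are all illustrative, not any particular framework's API):

```python
import os
import pickle
import tempfile

def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically: a crash mid-write must not corrupt the last good checkpoint."""
    tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(tmp_fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems

def train(total_steps: int, ckpt_path: str = "ckpt.pkl", every: int = 100) -> dict:
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "loss": None}
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        if state["step"] % every == 0:
            save_checkpoint(ckpt_path, state)
    return state
```

In a real cluster this same loop must also handle sharded optimizer state, checkpoint upload to durable storage, and coordinated restart across nodes, which is exactly where the systems engineering lives.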
Takeaway: If you are training at scale, hire systems engineers before you hire more ML researchers; the interconnect, storage, and fault tolerance will determine your throughput more than your optimizer.
See also: AI Infrastructure Is Insanely Hard to Build | The Memory Wall Limits Everything | CUDA Is a Moat Not Just a Library