Distributed training splits a model's computation, data, or optimizer state across multiple GPUs to handle models or datasets too large for a single device. The three primary forms are data parallelism (replicate the model on every device and split each batch across them), tensor parallelism (split the computation inside each layer across devices), and pipeline parallelism (assign consecutive groups of layers to different devices).
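To make data parallelism concrete, here is a minimal single-process sketch in plain Python. It assumes a toy one-parameter linear model with a mean-squared-error loss and equal-sized shards; the function names and the sample data are illustrative, and the averaging step only stands in for the all-reduce that a real framework would run across GPUs.

```python
import math


def gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_gradient(w, xs, ys, num_devices):
    """Simulate data parallelism: shard the batch, compute local gradients, average."""
    # Assumption for this sketch: the batch divides evenly across devices.
    assert len(xs) % num_devices == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_devices
    local_grads = []
    for i in range(num_devices):
        lo, hi = i * shard, (i + 1) * shard
        # Each "device" holds a full copy of w but sees only its own slice of data.
        local_grads.append(gradient(w, xs[lo:hi], ys[lo:hi]))
    # Averaging the local gradients plays the role of an all-reduce.
    return sum(local_grads) / num_devices


# Illustrative data: y = 2x, with a deliberately wrong starting weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full = gradient(w, xs, ys)
sharded = data_parallel_gradient(w, xs, ys, num_devices=4)
print(math.isclose(full, sharded))
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why every replica can apply the same update and stay in sync.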
Effective distributed training requires high-bandwidth interconnects: NVLink between GPUs within a single chassis (SXM form factor), and InfiniBand or RoCE between machines. Frontier-model training routinely uses hundreds or thousands of GPUs connected through hierarchical networking.
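A rough estimate shows why link bandwidth matters. In a bandwidth-optimal ring all-reduce, each GPU sends about 2(N-1)/N times the payload size, so a lower bound on synchronization time is that traffic divided by the per-GPU link bandwidth. The sketch below is a back-of-envelope model only: it ignores latency, protocol overhead, and overlap with computation, and the payload size and both bandwidth figures are hypothetical placeholders, not vendor specifications.

```python
def ring_allreduce_seconds(payload_bytes, num_gpus, link_bytes_per_second):
    """Bandwidth-only lower bound for one ring all-reduce.

    Ignores latency, protocol overhead, and compute/communication overlap.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    # Each GPU sends (and receives) 2 * (N - 1) / N times the payload.
    traffic_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_second


# Hypothetical inputs: 1e9 gradient values at 2 bytes each, 8 GPUs.
payload = 1e9 * 2
fast_link = 400e9  # bytes/s, placeholder for an intra-chassis link
slow_link = 50e9   # bytes/s, placeholder for an inter-machine link

fast = ring_allreduce_seconds(payload, 8, fast_link)
slow = ring_allreduce_seconds(payload, 8, slow_link)
print(f"fast link: {fast * 1e3:.2f} ms, slow link: {slow * 1e3:.2f} ms")
```

Because this cost is paid on every optimizer step, a slower link multiplies directly into total training time unless the communication can be hidden behind computation.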
Distributed training drives demand for SXM-based GPUs and multi-GPU server configurations. AIMC tracks per-GPU pricing; users planning distributed training should factor in interconnect requirements separately from raw hourly rates.