AI & ML March 30, 2026

How Does Distributed Training Work?

An 8-minute read

Training a frontier language model requires tens of thousands of GPUs working in concert. Distributed training is the engineering discipline that makes that possible, and it is as challenging as training the model itself.

In 2022, Google published a paper describing its Pathways system. The paper argued that scaling a single training job across thousands of accelerator chips (TPUs, in Google's case) leads to "fragmentation" in how the hardware is used: different parts of the model run on different hardware, and the cost of moving data between them can rival the cost of the computation itself. Pathways was an attempt to solve this. The details are revealing: the problem of getting thousands of chips to cooperate on one task is so hard that it requires its own dedicated system, designed by dozens of engineers, just to manage the coordination.

The short answer

Distributed training splits the work of training a machine learning model across multiple GPUs, machines, or data centers. It is necessary because training frontier models requires far more memory and compute than any single device can provide. The main approaches are data parallelism (each GPU trains on different data), model parallelism (the model itself is split across GPUs), and pipeline parallelism (a hybrid that keeps all GPUs busy by feeding them micro-batches in sequence). The core challenge is synchronizing work across devices fast enough that GPUs do not spend most of their time waiting for data.

The full picture

Why one GPU is never enough

Training a large language model involves two compounding demands. First, the model itself is large. GPT-3 has 175 billion parameters. Each parameter is a 32-bit or 16-bit floating-point number, taking 4 or 2 bytes of memory, so just storing GPT-3's parameters requires 350 to 700 gigabytes. No commercially available GPU holds that.
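The arithmetic is simple enough to check directly. This sketch computes parameter storage for a GPT-3-scale model at both precisions; the parameter count is from the article, and the figures are back-of-envelope arithmetic, not measurements:

```python
# Back-of-envelope parameter memory for a GPT-3-scale model.
def param_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to store the parameters, in gigabytes."""
    return n_params * bytes_per_param / 1e9

N = 175e9  # GPT-3 parameter count

fp32 = param_memory_gb(N, 4)  # 32-bit floats: 4 bytes each
fp16 = param_memory_gb(N, 2)  # 16-bit floats: 2 bytes each

print(f"fp32 weights: {fp32:.0f} GB")  # 700 GB
print(f"fp16 weights: {fp16:.0f} GB")  # 350 GB
```

Even at half precision, the weights alone are more than four times the capacity of an 80GB data-center GPU.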

Second, training requires memory beyond the model itself. The model must store activations (intermediate results during the forward pass), gradients (how each parameter should change), and optimizer states (the running estimates the optimizer maintains for each parameter). The optimizer states alone can be multiple times the size of the model.
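To see why the optimizer states dominate, here is the per-parameter accounting commonly used for mixed-precision training with Adam (the same accounting the ZeRO paper uses): 2-byte fp16 weights and gradients, plus three 4-byte fp32 values of optimizer state per parameter. The breakdown is a standard assumption, not a measurement of any specific system:

```python
# Per-parameter memory for mixed-precision training with Adam:
# fp16 weights and gradients, plus fp32 optimizer state
# (master weights, momentum, variance).
def training_bytes_per_param() -> dict:
    return {
        "fp16 weights": 2,
        "fp16 gradients": 2,
        "fp32 master weights": 4,   # optimizer state
        "fp32 momentum": 4,         # optimizer state
        "fp32 variance": 4,         # optimizer state
    }

costs = training_bytes_per_param()
total = sum(costs.values())   # 16 bytes per parameter in total
optimizer = costs["fp32 master weights"] + costs["fp32 momentum"] + costs["fp32 variance"]
print(f"total: {total} bytes/param, optimizer state: {optimizer} bytes/param")
print(f"optimizer state is {optimizer // 2}x the size of the fp16 model itself")
```

Under this accounting, the optimizer states alone are six times the size of the half-precision model, before counting activations at all.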

The result is that training GPT-3 class models requires hundreds to thousands of GPUs working together. This is not optional, and it is not simple.

Data parallelism: the straightforward approach

The simplest way to distribute training is data parallelism. Each GPU gets a complete copy of the model. Every GPU processes a different batch of training data, computes the loss, and calculates gradients independently. Then the GPUs must synchronize: each GPU shares its gradients with every other GPU, they compute the average, and each GPU updates its own copy of the model using that average.

This approach is conceptually clean because each GPU is doing essentially the same thing. But it has two significant costs. First, each GPU must send its gradients to every other GPU after every batch. With thousands of GPUs, the communication volume is enormous. Second, every GPU must hold a complete copy of the model. If the model is too large to fit in one GPU’s memory, data parallelism alone does not solve the problem.

The synchronization step uses an operation called all-reduce: every GPU contributes its gradients, the gradients are summed or averaged, and every GPU receives the result. In practice this is implemented with decentralized algorithms such as ring all-reduce rather than through a central node. The speed of this synchronization is a critical bottleneck.
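The communication pattern can be shown with a toy simulation. This sketch runs a ring all-reduce over plain Python lists standing in for per-GPU gradient buffers: a reduce-scatter phase accumulates each chunk as it travels around the ring, then an all-gather phase circulates the completed chunks. Real systems (NCCL and similar libraries) do the same data movement over hardware links:

```python
# Toy ring all-reduce: n "GPUs", each holding a gradient vector
# split into n chunks. After the two phases, every rank holds the
# average of all ranks' gradients.
def ring_all_reduce_mean(grads: list[list[float]]) -> list[list[float]]:
    n = len(grads)
    chunks = [list(g) for g in grads]   # copy so inputs are untouched
    # Phase 1: reduce-scatter. Each chunk accumulates as it moves
    # around the ring; after n-1 steps, rank i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n]) for i in range(n)]
        for i, c, val in sends:         # snapshot first, then apply
            chunks[(i + 1) % n][c] += val
    # Phase 2: all-gather. Circulate each completed chunk so every
    # rank ends up with every full sum.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n]) for i in range(n)]
        for i, c, val in sends:
            chunks[(i + 1) % n][c] = val
    return [[v / n for v in c] for c in chunks]

grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(ring_all_reduce_mean(grads))  # every rank: [4.0, 5.0, 6.0]
```

The appeal of the ring layout is that each rank only ever talks to its neighbor, so total traffic per GPU stays nearly constant as the ring grows.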

Model parallelism: splitting the model itself

When a model is too large for a single GPU, model parallelism splits the model across multiple GPUs. A common approach is to split along the model’s layers. GPU 1 holds the embedding layer and the first few transformer blocks. GPU 2 holds the next set of blocks, and so on. When a batch of data flows through, it passes from GPU to GPU as each part computes its share of the work.
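A minimal sketch of that layer-wise split, with plain functions standing in for transformer blocks and two lists standing in for the two devices (all names here are illustrative, not a real framework API):

```python
# Naive layer-wise model parallelism: each "device" holds a slice of
# the layer stack, and activations are handed from one device to the
# next. Each toy layer just doubles its input.
def make_layer(scale: float):
    return lambda x: [scale * v for v in x]

layers = [make_layer(2.0) for _ in range(8)]   # an 8-layer "model"
device_0, device_1 = layers[:4], layers[4:]    # 4 layers per device

def forward_on_device(device_layers, activations):
    for layer in device_layers:
        activations = layer(activations)
    return activations

x = [1.0, 1.0]
h = forward_on_device(device_0, x)   # device 0 computes; device 1 idles
y = forward_on_device(device_1, h)   # hand-off, then device 1 computes
print(y)  # doubled 8 times: [256.0, 256.0]
```

The hand-off from `device_0` to `device_1` is exactly the slow inter-GPU transfer described above, and while either device is computing, the other one is doing nothing.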

The challenge is that passing data between GPUs is slow. Worse, while GPU 2 processes the activations it just received, GPU 1 sits idle waiting for the next batch, and the GPUs further down the chain have nothing to do yet. With naive scheduling, most GPUs spend most of their time idle. These idle periods are called pipeline bubbles, and they waste hardware.

The naive version of model parallelism is therefore inefficient. Modern systems use sophisticated scheduling to minimize bubbles, but the fundamental challenge remains: splitting a sequential computation across independent devices creates unavoidable overhead.

Pipeline parallelism: keeping GPUs busy

Pipeline parallelism addresses the bubble problem by keeping multiple micro-batches in flight. Instead of pushing one large batch through the whole pipeline before starting the next, the system breaks each batch into many small micro-batches. While GPU 1 works on micro-batch 3, GPU 2 is processing micro-batch 2 and GPU 3 is processing micro-batch 1; once a micro-batch clears the final stage, its backward pass starts flowing back while later micro-batches are still moving forward.
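The effect of micro-batching can be quantified with a simplified model that counts forward-pass time slots only (real schedules interleave backward passes too, but the shape of the result is the same). With P pipeline stages and M micro-batches, a GPipe-style schedule takes M + P - 1 slots, during which each stage computes for only M of them:

```python
# Idle share ("bubble") of a simple pipeline schedule:
# with P stages and M micro-batches, the pipeline runs for
# M + P - 1 time slots but each stage computes during only M.
def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 32):
    frac = bubble_fraction(stages=4, micro_batches=m)
    print(f"4 stages, {m:>2} micro-batches: {frac:.0%} idle")
```

With a single batch the four stages sit idle 75% of the time; with 32 micro-batches the bubble shrinks to under 10%. This is why more micro-batches mean better utilization, at the cost of the bookkeeping described below.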

This keeps GPUs busier, but it introduces new complexity. The forward pass and backward pass for the same batch are separated in time, which means the system must carefully manage which parameters are needed when. Mistakes here cause numerical errors or memory explosions.

Google’s Pathways system, mentioned at the start, treats this kind of scheduling as a first-class design concern and supports pipelined execution among other parallelism patterns. Pathways allows a single training job to span thousands of chips across multiple accelerator pods, with the system routing data and computation across this distributed infrastructure. The Pathways paper describes managing the communication and scheduling as a full engineering problem in its own right.

ZeRO: smarter memory use

Microsoft Research’s DeepSpeed library introduced ZeRO (Zero Redundancy Optimizer), a technique that shards model states across GPUs rather than duplicating them. ZeRO-3, the most aggressive of its three stages, shards all three kinds of state: each GPU stores only a 1/N slice of the parameters, gradients, and optimizer states, where N is the number of GPUs. When a GPU needs a piece of data it does not hold, it fetches it from the GPU that does, computes what it needs to compute, and then discards it.

The result is that a model requiring 400GB of memory per GPU in standard data parallelism can run with as little as 1.2GB per GPU under ZeRO-3 when sharded across enough devices, since per-GPU memory shrinks roughly linearly with the number of GPUs. This is not a small improvement; it is the difference between a job that fits on available hardware and one that does not.
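The stage-by-stage savings follow directly from the mixed-precision accounting (2-byte fp16 weights, 2-byte fp16 gradients, 12 bytes of fp32 optimizer state per parameter) used in the ZeRO paper. This sketch computes per-GPU bytes per parameter for each stage; real footprints also include activations and communication buffers, which this ignores:

```python
# Per-GPU memory per parameter under ZeRO's three stages.
def zero_bytes_per_param(stage: int, n_gpus: int) -> float:
    weights, grads, opt = 2.0, 2.0, 12.0  # fp16, fp16, fp32 state
    if stage >= 1:
        opt /= n_gpus       # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus     # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus   # stage 3: also shard parameters
    return weights + grads + opt

n = 175e9  # GPT-3-scale parameter count
for stage in (0, 1, 2, 3):
    gb = n * zero_bytes_per_param(stage, n_gpus=1024) / 1e9
    print(f"ZeRO stage {stage} on 1024 GPUs: {gb:,.2f} GB per GPU")
```

Stage 0 (plain data parallelism) needs 16 bytes per parameter on every GPU; stage 3 divides nearly all of it by the GPU count, which is where the headline memory reductions come from.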

DeepSpeed has been used to train large models including BLOOM (176B parameters) and many models whose sizes were never made public, as documented in Microsoft’s DeepSpeed paper and the BLOOM training report.

The communication bottleneck

Every distributed training approach shares one fundamental constraint: the bandwidth between GPUs. Whether synchronizing gradients in data parallelism, passing activations in model parallelism, or communicating partial results in pipeline parallelism, training requires moving enormous amounts of data quickly.

This is why large AI labs cluster GPUs in the same physical location with high-speed interconnects. NVLink, NVIDIA’s proprietary interconnect, allows GPUs to communicate at hundreds of gigabytes per second; standard Ethernet, and even the InfiniBand fabrics typically used between machines, are considerably slower. The choice of interconnect can determine whether a training job is economically viable.
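A rough sense of the stakes: how long does it take just to move one full set of fp16 gradients for a 175B-parameter model over different links? The bandwidth figures below are order-of-magnitude assumptions for illustration, not vendor specifications:

```python
# Time to move one full gradient set over links of different speeds.
def transfer_seconds(gigabytes: float, gb_per_sec: float) -> float:
    return gigabytes / gb_per_sec

grad_gb = 175e9 * 2 / 1e9   # 350 GB of fp16 gradients

links = {                    # assumed, illustrative bandwidths
    "NVLink-class (~300 GB/s)": 300.0,
    "InfiniBand-class (~25 GB/s)": 25.0,
    "10 Gb Ethernet (~1.25 GB/s)": 1.25,
}
for name, bw in links.items():
    print(f"{name}: {transfer_seconds(grad_gb, bw):,.1f} s per transfer")
```

A transfer that takes about a second over an NVLink-class link takes minutes over commodity Ethernet, and this cost recurs on every synchronization step.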

This is also why the major AI labs are building custom silicon. Google’s TPU pods are designed as large grids in which each chip has high-bandwidth links to its neighbors, making gradient synchronization faster. The hardware and the software are co-designed.

Why it matters

Distributed training is not just an infrastructure detail. It determines what models can be built at all. A researcher who cannot afford to train a model on thousands of GPUs cannot test their ideas at scale. This concentrates AI development at the organizations that can afford massive compute.

The engineering challenges of distributed training are also where many practical innovations happen. Techniques like gradient checkpointing (trading compute for memory by recomputing activations rather than storing them), mixed precision training (using lower-precision numbers to fit more in memory), and asynchronous training (allowing GPUs to work slightly out of sync to reduce waiting) were all driven by the demands of large-scale training, and they have shaped which models get built.

Understanding distributed training explains why AI development is concentrated at a small number of well-funded organizations. Training a frontier model is not just a research problem; it is a logistics problem involving thousands of machines, custom software, and enormous energy expenditure, coordinated with precision over weeks or months.

Common misconceptions

“You can just buy more GPUs and train faster.” Doubling the number of GPUs does not halve the training time in most cases. Communication overhead means the speedup from additional GPUs diminishes. Adding GPUs also increases the probability of a hardware failure during a long training run, which can erase days of work if checkpoints are not saved frequently.
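The diminishing returns can be captured with a toy scaling model: per-step time equals compute time (which divides across GPUs) plus a fixed communication cost (which does not shrink). The numbers here are purely illustrative assumptions:

```python
# Toy scaling model: step time = compute / n_gpus + fixed comm cost.
# The communication term caps the achievable speedup.
def step_time(n_gpus: int, compute: float = 100.0, comm: float = 2.0) -> float:
    return compute / n_gpus + comm

base = step_time(1)
for n in (1, 2, 64, 1024):
    print(f"{n:>5} GPUs: {base / step_time(n):5.1f}x speedup")
```

With a 2% communication cost, speedup saturates around 50x no matter how many GPUs are added, and doubling from one GPU to two already falls short of 2x.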

“Distributed training is just splitting work across computers.” The complexity comes from the dependencies between parts of the model. Unlike splitting a web server across machines, where each request is independent, training requires all parts of the model to synchronize frequently. This synchronization is the bottleneck, not the computation itself.

“Cloud GPUs are sufficient for any training job.” Cloud providers offer GPU instances, but the networking between GPUs in a cloud cluster is often slower than the dedicated interconnects in purpose-built AI clusters. For large-scale training, the network topology matters as much as the GPU count. This is why companies like Google and Meta build their own data centers specifically for AI training.

Key terms

Data parallelism: A distributed training approach where each GPU holds a complete copy of the model and trains on different batches of data. Gradients are synchronized across GPUs after each step by averaging them. Simple and widely used, but requires each GPU to hold the full model.

Model parallelism: Splitting the model itself across GPUs, typically by layer. One GPU holds the early layers, another holds later layers, and data passes between them during the forward and backward passes. Necessary when models are too large to fit in a single GPU’s memory.

Pipeline parallelism: A hybrid approach that splits the model across GPUs and processes many micro-batches in flight. While one GPU is running the forward pass for batch 2, another is computing gradients for batch 1. This reduces the idle time (pipeline bubbles) that plague naive model parallelism.

ZeRO (Zero Redundancy Optimizer): A memory optimization technique developed by Microsoft Research that shards model states (optimizer states, gradients, and parameters) across GPUs. ZeRO-3 shards all three, allowing much larger models to be trained on the same hardware.

All-reduce: A collective communication operation used in data parallelism where each GPU sends its gradients to every other GPU, all gradients are averaged, and the result is distributed back to all GPUs. The speed of all-reduce operations is a critical bottleneck in large-scale training.

Gradient checkpointing: A memory optimization technique that recomputes certain intermediate values during the backward pass rather than storing them during the forward pass. It trades extra compute for significantly reduced memory usage, allowing larger models or batches to fit in GPU memory.

NVLink: NVIDIA’s high-speed interconnect technology that allows GPUs to communicate directly with each other at hundreds of gigabytes per second. Traditional interfaces such as PCIe, and cluster fabrics such as InfiniBand, offer lower bandwidth.