
Distributed Architectures: Parameter Server vs All-Reduce
How GPUs talk to each other: understanding Ring All-Reduce, the Parameter Server strategy, and when to use NCCL.
The Network is the Computer
When you scale to 128 GPUs, the bottleneck isn't math; it's communication. Every GPU must agree on the updated weights every few milliseconds.
1. Parameter Server Strategy (Async)
- Architecture:
- Workers: Compute gradients on their shard of the data and send them to the PS.
- Parameter Servers (PS): Hold the global weights, apply incoming gradients to update them, and send the new weights back.
- Pros: Robust. If one worker dies, the job continues. Good for massive embeddings (Wide & Deep).
- Cons: The network link into the PS becomes the bottleneck as workers multiply, and asynchronous updates mean workers often push gradients computed against stale weights.
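The async loop above can be sketched in a few lines of pure Python. This is a single-process toy, not a real distributed system; the names `ParameterServer` and `worker_step` are illustrative:

```python
# Toy sketch of the asynchronous Parameter Server update loop.
# In reality, push/pull are RPCs and many workers run concurrently.

class ParameterServer:
    def __init__(self, weights):
        self.weights = list(weights)  # global model weights

    def push(self, gradients, lr=0.1):
        # Apply a worker's gradients the moment they arrive (async).
        for i, g in enumerate(gradients):
            self.weights[i] -= lr * g

    def pull(self):
        # A worker fetches the latest global weights.
        return list(self.weights)

def worker_step(ps, target):
    w = ps.pull()                                # 1. fetch current weights
    grads = [2 * (wi - target) for wi in w]      # 2. toy gradient (MSE vs target)
    ps.push(grads)                               # 3. send gradients to the PS

ps = ParameterServer([1.0, 1.0])
for _ in range(3):       # three asynchronous worker updates toward target 0.0
    worker_step(ps, 0.0)
print(ps.weights)
```

Because `push` applies gradients immediately, two workers can interleave pulls and pushes; that interleaving is exactly where gradient staleness comes from.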
2. Ring All-Reduce Strategy (Sync)
- Architecture: No central server; the GPUs form a logical ring.
- GPU 1 passes a chunk of data to GPU 2.
- GPU 2 passes to GPU 3...
- GPU N passes back to GPU 1.
- Each gradient tensor is split into chunks that travel the ring twice: a scatter-reduce pass that sums them, then an all-gather pass that distributes the finished sums.
- Pros: Bandwidth optimal. Each GPU transfers roughly 2x the gradient size per iteration, regardless of cluster size. Scales to thousands of GPUs.
- Cons: Fragile. If one GPU dies, the whole ring halts.
- Tech: NVIDIA NCCL on GPUs; TensorFlow falls back to a gRPC-based ring for CPU and cross-host collectives. (TPUs use their own dedicated interconnect rather than gRPC.)
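The two-phase ring described above can be simulated in plain Python. This is a toy model of the algorithm, not an NCCL API; the function name and list-of-lists layout are illustrative:

```python
# Toy simulation of Ring All-Reduce (sum) across N "GPUs".
# Each GPU's gradient is pre-split into N chunks.

def ring_all_reduce(chunks_per_gpu):
    n = len(chunks_per_gpu)                 # N GPUs, N chunks per GPU
    data = [list(row) for row in chunks_per_gpu]

    # Phase 1: scatter-reduce. In each of N-1 steps, every GPU g
    # sends one partially summed chunk to its neighbor (g+1) % n.
    for step in range(n - 1):
        sends = [(g, (g - step) % n, data[g][(g - step) % n])
                 for g in range(n)]         # snapshot: sends are simultaneous
        for g, c, val in sends:
            data[(g + 1) % n][c] += val
    # Now GPU g holds the complete sum for chunk (g + 1) % n.

    # Phase 2: all-gather. For another N-1 steps, each GPU forwards a
    # finished chunk around the ring, overwriting stale copies.
    for step in range(n - 1):
        sends = [(g, (g + 1 - step) % n, data[g][(g + 1 - step) % n])
                 for g in range(n)]
        for g, c, val in sends:
            data[(g + 1) % n][c] = val
    return data                             # every GPU holds the full sum

# 3 GPUs, 3 chunks each: every GPU ends with [12, 15, 18].
print(ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Count the traffic: each GPU sends N-1 chunks per phase, so about 2(N-1)/N ≈ 2x the gradient size in total, which is the "bandwidth optimal" claim above.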
3. Vertex AI Reduction Server
Google Cloud offers a unique hybrid.
If you use MultiWorkerMirroredStrategy, you can enable Vertex AI Reduction Server.
It's a managed service that runs dedicated reducer nodes as a high-throughput all-reduce orchestrator, bypassing the need to hand-tune ring configurations on your VMs.
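On the training-code side, opting into NCCL-backed MultiWorkerMirroredStrategy looks roughly like the sketch below (TF 2.x API). This is a configuration sketch, not a runnable job: the cluster topology comes from the `TF_CONFIG` environment variable, and Reduction Server itself is enabled in the Vertex AI job spec, not in this code:

```python
import tensorflow as tf

# Sketch: request NCCL for the all-reduce collectives.
# Assumes TF 2.x with GPUs; topology is read from TF_CONFIG.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=options)

# Model variables created inside the scope are mirrored across workers.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```

Everything inside `strategy.scope()` is replicated; the strategy handles the gradient all-reduce behind the scenes during `model.fit`.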