Multi-GPU Setup for AI Training
Configure multiple GPUs for distributed training. Covers NVLink and PCIe considerations, hardware requirements, and software setup for PyTorch and JAX.
Table of Contents
1. When Do You Need Multi-GPU?
2. NVLink vs PCIe
3. Hardware Requirements
4. PyTorch Distributed Training
5. Common Issues and Solutions
1. When Do You Need Multi-GPU?
Multi-GPU setups become necessary when the model doesn't fit in a single GPU's VRAM, when training on one GPU takes too long, or when you need to scale inference throughput.
Common configurations: 2x RTX 4090 for homelab, 4-8x A100/H100 for production.
2. NVLink vs PCIe
NVLink provides direct GPU-to-GPU communication at 600-900 GB/s of total bidirectional bandwidth per GPU (600 GB/s on A100, 900 GB/s on H100).
PCIe 4.0 x16 offers ~32 GB/s per direction; PCIe 5.0 doubles this to ~64 GB/s.
For training: NVLink significantly faster for gradient synchronization.
For inference: PCIe often sufficient as GPUs work more independently.
Consumer cards (RTX 4090): No NVLink support, PCIe only.
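To see how the GPUs in a box are actually connected, nvidia-smi topo -m prints the interconnect matrix. From Python, a minimal sketch that checks CUDA peer-to-peer access for each GPU pair (the torch.cuda calls are standard; the script itself is only illustrative):

import torch

def report_p2p():
    # Print whether each GPU pair supports direct peer-to-peer (P2P) access.
    # NVLink-connected GPUs report peer access; other systems fall back to PCIe,
    # where P2P may or may not be enabled depending on the platform.
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")

if __name__ == "__main__":
    report_p2p()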
3. Hardware Requirements
Motherboard: Needs enough PCIe lanes to run each GPU at x8 or x16. HEDT/workstation platforms (Threadripper, Xeon) recommended for 4+ GPUs.
Power supply: Budget total GPU TDP plus ~200W for the rest of the system, with headroom for transient spikes (see the sizing sketch below). Quad 4090s at 450W each already exceed 1600W on GPUs alone, so plan for dual PSUs or power-limited cards.
Cooling: Ensure adequate case airflow. Consider blower-style cards for tight spacing.
CPU: Doesn't need to be top-tier, but should have enough cores that data loading doesn't become the bottleneck.
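As a back-of-the-envelope check for the power-supply guideline above, here is a sketch with illustrative numbers (450 W per card and a 20% transient margin are assumptions, not vendor specs):

# Rough PSU sizing: total GPU TDP + system overhead, with headroom for transient spikes.
GPU_TDP_W = 450          # assumed board power per GPU (e.g. RTX 4090)
NUM_GPUS = 4
SYSTEM_OVERHEAD_W = 200  # CPU, drives, fans
HEADROOM = 1.2           # margin for power transients

required_w = (GPU_TDP_W * NUM_GPUS + SYSTEM_OVERHEAD_W) * HEADROOM
print(f"Recommended PSU capacity: ~{required_w:.0f} W")  # ~2400 W: dual PSUs or power-limit the cards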
4. PyTorch Distributed Training
DataParallel (DP): Simplest, but single-process and inefficient. Avoid for serious training.
DistributedDataParallel (DDP): Recommended for most multi-GPU training. One process per GPU, NCCL backend.
FSDP (Fully Sharded Data Parallel): For models too large for DDP; shards parameters, gradients, and optimizer state across GPUs.
DeepSpeed: Microsoft's library for efficient large-model training (ZeRO sharding, CPU offloading).
Example: torchrun --nproc_per_node=4 train.py
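A minimal train.py to go with that launch command might look like the sketch below. The model and dataset are toy placeholders; the process-group setup, DistributedSampler, and DDP wrapper follow the standard PyTorch pattern.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, and the rendezvous address.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Placeholder dataset; DistributedSampler gives each rank its own shard.
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)          # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()               # gradients are all-reduced across GPUs here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

At a first approximation, moving from DDP to FSDP means wrapping the model with torch.distributed.fsdp.FullyShardedDataParallel instead; sharding policy and checkpoint saving then need extra care.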
5. Common Issues and Solutions
CUDA OOM: Enable gradient checkpointing, reduce batch size, use mixed precision (see the sketch at the end of this section).
Slow training: Check NCCL settings (NCCL_DEBUG=INFO prints the transports being used), and ensure GPUs sit on the same PCIe switch or NUMA node (check with nvidia-smi topo -m).
Uneven GPU usage: Verify data loading isn't bottlenecked, check batch distribution.
Driver issues: Keep the NVIDIA driver, CUDA toolkit, and NCCL versions consistent across every machine in the job.
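To illustrate the OOM fixes above, a sketch combining mixed precision and gradient checkpointing (toy model and data; the autocast, GradScaler, and torch.utils.checkpoint calls are standard PyTorch):

import torch
from torch.utils.checkpoint import checkpoint

device = "cuda"
heavy_block = torch.nn.Sequential(                    # toy stand-in for a memory-hungry block
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).to(device)
opt = torch.optim.AdamW(heavy_block.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

opt.zero_grad()
with torch.cuda.amp.autocast():                       # run the forward pass in reduced precision
    # Recompute this block's activations during backward instead of storing them.
    out = checkpoint(heavy_block, x, use_reentrant=False)
    loss = torch.nn.functional.cross_entropy(out, y)
scaler.scale(loss).backward()                         # scale the loss to avoid fp16 underflow
scaler.step(opt)
scaler.update()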