Multi-GPU Setup for AI Training
Configure multiple GPUs for distributed training. Covers NVLink and PCIe considerations, hardware requirements, and software setup for PyTorch and JAX.
Table of Contents
1. When Do You Need Multi-GPU?
2. NVLink vs PCIe
3. Hardware Requirements
4. PyTorch Distributed Training
5. Common Issues and Solutions
1. When Do You Need Multi-GPU?
Multi-GPU setups become necessary when the model doesn't fit in a single GPU's VRAM, when training on one GPU takes too long, or when you need to scale inference throughput.
Common configurations: 2x RTX 4090 for homelab, 4-8x A100/H100 for production.
2. NVLink vs PCIe
NVLink provides direct GPU-to-GPU communication at 600-900 GB/s of total bidirectional bandwidth per GPU (600 GB/s on A100, 900 GB/s on H100).
PCIe 4.0 x16 offers ~32 GB/s per direction; PCIe 5.0 doubles this to ~64 GB/s.
For training: NVLink significantly faster for gradient synchronization.
For inference: PCIe often sufficient as GPUs work more independently.
Consumer cards (RTX 4090): No NVLink support, PCIe only.
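To see how the GPUs in a box are actually connected, nvidia-smi topo -m prints the interconnect matrix. From Python, a minimal sketch that checks CUDA peer-to-peer access for each GPU pair (the torch.cuda calls are standard; the script itself is only illustrative):

import torch

def report_p2p():
    # Print whether each GPU pair supports direct peer-to-peer (P2P) access.
    # NVLink-connected GPUs report peer access; other systems fall back to PCIe,
    # where P2P may or may not be enabled depending on the platform.
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                ok = torch.cuda.can_device_access_peer(i, j)
                print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")

if __name__ == "__main__":
    report_p2p()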
3. Hardware Requirements
Motherboard: Needs enough PCIe lanes to run each GPU at x8 or x16. HEDT/workstation platforms (Threadripper, Xeon) recommended for 4+ GPUs.
Power supply: Budget total GPU TDP plus ~200W for the rest of the system, with headroom for transient spikes (see the sizing sketch below). Quad 4090s at 450W each already exceed 1600W on GPUs alone, so plan for dual PSUs or power-limited cards.
Cooling: Ensure adequate case airflow. Consider blower-style cards for tight spacing.
CPU: Doesn't need to be top-tier, but should have enough cores that data loading doesn't become the bottleneck.
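As a back-of-the-envelope check for the power-supply guideline above, here is a sketch with illustrative numbers (450 W per card and a 20% transient margin are assumptions, not vendor specs):

# Rough PSU sizing: total GPU TDP + system overhead, with headroom for transient spikes.
GPU_TDP_W = 450          # assumed board power per GPU (e.g. RTX 4090)
NUM_GPUS = 4
SYSTEM_OVERHEAD_W = 200  # CPU, drives, fans
HEADROOM = 1.2           # margin for power transients

required_w = (GPU_TDP_W * NUM_GPUS + SYSTEM_OVERHEAD_W) * HEADROOM
print(f"Recommended PSU capacity: ~{required_w:.0f} W")  # ~2400 W: dual PSUs or power-limit the cards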
4. PyTorch Distributed Training
DataParallel (DP): Simplest, but single-process and inefficient. Avoid for serious training.
DistributedDataParallel (DDP): Recommended for most multi-GPU training. One process per GPU, NCCL backend.
FSDP (Fully Sharded Data Parallel): For models too large for DDP; shards parameters, gradients, and optimizer state across GPUs.
DeepSpeed: Microsoft's library for efficient large-model training (ZeRO sharding, CPU offloading).
Example: torchrun --nproc_per_node=4 train.py
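A minimal train.py to go with that launch command might look like the sketch below. The model and dataset are toy placeholders; the process-group setup, DistributedSampler, and DDP wrapper follow the standard PyTorch pattern.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, and the rendezvous address.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Placeholder dataset; DistributedSampler gives each rank its own shard.
    data = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)          # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()               # gradients are all-reduced across GPUs here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

At a first approximation, moving from DDP to FSDP means wrapping the model with torch.distributed.fsdp.FullyShardedDataParallel instead; sharding policy and checkpoint saving then need extra care.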
5. Common Issues and Solutions
CUDA OOM: Enable gradient checkpointing, reduce batch size, use mixed precision (see the sketch at the end of this section).
Slow training: Check NCCL settings (NCCL_DEBUG=INFO prints the transports being used), and ensure GPUs sit on the same PCIe switch or NUMA node (check with nvidia-smi topo -m).
Uneven GPU usage: Verify data loading isn't bottlenecked, check batch distribution.
Driver issues: Keep the NVIDIA driver, CUDA toolkit, and NCCL versions consistent across every machine in the job.
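To illustrate the OOM fixes above, a sketch combining mixed precision and gradient checkpointing (toy model and data; the autocast, GradScaler, and torch.utils.checkpoint calls are standard PyTorch):

import torch
from torch.utils.checkpoint import checkpoint

device = "cuda"
heavy_block = torch.nn.Sequential(                    # toy stand-in for a memory-hungry block
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).to(device)
opt = torch.optim.AdamW(heavy_block.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

opt.zero_grad()
with torch.cuda.amp.autocast():                       # run the forward pass in reduced precision
    # Recompute this block's activations during backward instead of storing them.
    out = checkpoint(heavy_block, x, use_reentrant=False)
    loss = torch.nn.functional.cross_entropy(out, y)
scaler.scale(loss).backward()                         # scale the loss to avoid fp16 underflow
scaler.step(opt)
scaler.update()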