
VRAM Requirements for Popular LLMs

Complete reference for VRAM needs across Llama 3, Mistral, Mixtral, Qwen, and other popular models at various quantization levels.

By HardwareHQ Team · 7 min read · December 28, 2024

1. Understanding VRAM Requirements

VRAM usage for LLMs depends on three main factors: parameter count, quantization level, and context length (which determines KV cache size). This guide lists baseline requirements assuming a 4K context. Longer contexts add roughly 1-2GB per additional 4K tokens for a 7B model, and proportionally more for larger models.

Formula approximation: VRAM (GB) ≈ parameters (billions) × bits per weight / 8 + KV cache + overhead
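
As a sanity check, the approximation can be written out in a few lines of Python. This is a rough sketch, not a profiler: the linear KV-cache scaling and the fixed 1GB overhead are assumptions layered on the ~1-2GB-per-4K-tokens figure above, so its output lands slightly above the weight-only figures in the tables below.

def estimate_vram_gb(params_b, bits_per_weight, context_tokens=4096, overhead_gb=1.0):
    """Back-of-the-envelope VRAM estimate in GB.

    params_b        -- parameter count in billions (e.g. 8 for Llama 3 8B)
    bits_per_weight -- 16 for FP16, ~8.5 for Q8_0, ~4.8 for Q4_K_M
    """
    weights_gb = params_b * bits_per_weight / 8
    # KV cache: roughly 1.5 GB per 4K tokens for a 7B model, scaled linearly with
    # parameter count (an assumption; the real size depends on layers, heads, GQA).
    kv_cache_gb = 1.5 * (context_tokens / 4096) * (params_b / 7)
    return weights_gb + kv_cache_gb + overhead_gb

# Weights term alone for Llama 3 8B at FP16: 8 * 16 / 8 = 16 GB, matching the table;
# the KV cache and overhead terms are the extra headroom needed at runtime.
print(round(estimate_vram_gb(8, 16), 1))   # ~18.7 GB with 4K context and overhead
print(round(estimate_vram_gb(8, 4.8), 1))  # ~7.5 GB at Q4_K_M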

2. Llama 3 Family

Llama 3 8B: FP16: 16GB | Q8: 9GB | Q4_K_M: 5GB

Llama 3 70B: FP16: 140GB | Q8: 75GB | Q4_K_M: 40GB

Llama 3 405B: FP16: 810GB | Q8: 430GB | Q4_K_M: 230GB (multi-GPU required even at Q4; FP16 needs multiple nodes)

3. Mistral & Mixtral

Mistral 7B: FP16: 14GB | Q8: 8GB | Q4_K_M: 4.5GB

Mixtral 8x7B: FP16: 90GB | Q8: 48GB | Q4_K_M: 26GB (MoE architecture)

Mixtral 8x22B: FP16: 280GB | Q8: 150GB | Q4_K_M: 80GB
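
The MoE note above matters for memory planning: a mixture-of-experts model routes each token through only a few experts, but every expert's weights must still be resident in VRAM. Here is a back-of-the-envelope illustration using commonly cited approximate parameter counts for Mixtral 8x7B (about 47B total, about 13B active per token); these counts are approximations for illustration, not figures from the tables above.

BYTES_PER_PARAM_FP16 = 2

# Approximate parameter counts in billions (assumptions for illustration):
mixtral_8x7b_total_b = 47    # all experts plus shared layers; must all fit in VRAM
mixtral_8x7b_active_b = 13   # parameters actually used per token
mistral_7b_total_b = 7

resident_gb = mixtral_8x7b_total_b * BYTES_PER_PARAM_FP16    # ~94 GB of weights in VRAM
per_token_gb = mixtral_8x7b_active_b * BYTES_PER_PARAM_FP16  # ~26 GB of weights read per token
dense_gb = mistral_7b_total_b * BYTES_PER_PARAM_FP16         # ~14 GB for dense Mistral 7B

print(f"Mixtral 8x7B FP16 weights resident in VRAM: ~{resident_gb} GB")
print(f"Weights touched per token (compute cost):   ~{per_token_gb} GB")
print(f"Mistral 7B FP16 weights for comparison:     ~{dense_gb} GB")

The takeaway: VRAM scales with the total parameter count, while per-token compute behaves more like a ~13B dense model, which is why Mixtral 8x7B needs several times the memory of Mistral 7B.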

4. Qwen Family

Qwen 2 7B: FP16: 14GB | Q8: 8GB | Q4_K_M: 4.5GB

Qwen 2 72B: FP16: 145GB | Q8: 77GB | Q4_K_M: 42GB

Qwen 2.5 Coder 32B: FP16: 64GB | Q8: 34GB | Q4_K_M: 19GB

5. Other Popular Models

Phi-3 Mini (3.8B): FP16: 8GB | Q4: 2.5GB - Great for limited hardware

CodeLlama 34B: FP16: 68GB | Q4_K_M: 20GB

DeepSeek Coder 33B: FP16: 66GB | Q4_K_M: 19GB

Yi 34B: FP16: 68GB | Q4_K_M: 20GB

Command R+ (104B): FP16: 208GB | Q4_K_M: 60GB
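
For scripting around these figures, the 4-bit columns from sections 2-5 can be collected into a simple lookup table. This is a minimal sketch; the values are just the rounded numbers from this guide (Phi-3 Mini is listed above as plain Q4 rather than Q4_K_M).

# Approximate Q4_K_M VRAM requirements (GB), taken from the tables in this guide.
Q4_KM_VRAM_GB = {
    "llama-3-8b": 5,
    "llama-3-70b": 40,
    "llama-3-405b": 230,
    "mistral-7b": 4.5,
    "mixtral-8x7b": 26,
    "mixtral-8x22b": 80,
    "qwen2-7b": 4.5,
    "qwen2-72b": 42,
    "qwen2.5-coder-32b": 19,
    "phi-3-mini-3.8b": 2.5,   # listed as plain Q4
    "codellama-34b": 20,
    "deepseek-coder-33b": 19,
    "yi-34b": 20,
    "command-r-plus-104b": 60,
}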

6. GPU Recommendations by Model Size

7B models: RTX 3060 12GB, RTX 4060 Ti 16GB, or any 8GB+ GPU

13B models: RTX 3090/4090 (24GB) or RTX 4080 (16GB with Q4)

34B models: RTX 4090 (24GB) at Q4_K_M with limited headroom for long contexts, or dual GPUs

70B models: Dual RTX 4090s (48GB total), A100 80GB, or 48GB+ workstation cards; an A100 40GB only works with sub-4-bit quantization

100B+ models: Multi-GPU setups, A100 80GB, H100, or cloud instances
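
These recommendations boil down to a simple fit check: compare a model's quantized footprint against a card's VRAM while leaving headroom for KV cache and runtime overhead. The sketch below illustrates the idea; the 20% headroom figure is an assumption, and the capacities are simply the standard memory sizes of the cards named above.

GPU_VRAM_GB = {
    "RTX 3060 12GB": 12,
    "RTX 4060 Ti 16GB": 16,
    "RTX 4080": 16,
    "RTX 3090": 24,
    "RTX 4090": 24,
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100 80GB": 80,
}

def fits(model_vram_gb, gpu_vram_gb, headroom=0.2):
    """True if the quantized model fits with ~20% headroom for KV cache and overhead."""
    return model_vram_gb * (1 + headroom) <= gpu_vram_gb

# Example: Llama 3 70B at Q4_K_M (~40 GB from the table above)
for gpu, vram in GPU_VRAM_GB.items():
    verdict = "fits" if fits(40, vram) else "needs multi-GPU or a lower-bit quant"
    print(f"{gpu}: {verdict}")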
