LLM Quantization: GPTQ, AWQ, and GGUF Explained
Master quantization techniques to run larger models on consumer hardware. Covers 4-bit, 8-bit, and mixed precision approaches.
Table of Contents
1. What is Quantization?
2. GPTQ: GPU-Optimized Quantization
3. AWQ: Activation-Aware Quantization
4. GGUF: The Flexible Format
5. Choosing the Right Quantization
1. What is Quantization?
Quantization reduces the precision of model weights from 16-bit or 32-bit floating point to lower bit representations (8-bit, 4-bit, or even 2-bit). This dramatically reduces memory requirements and can speed up inference, with minimal impact on output quality when done correctly.
A 70B-parameter model at FP16 requires ~140GB of memory for its weights alone. At 4-bit quantization this drops to ~35-40GB, putting it within reach of high-end consumer setups such as dual 24GB GPUs or Apple Silicon Macs with enough unified memory.
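To make the arithmetic concrete, here is a minimal sketch (plain Python, no libraries) of the back-of-the-envelope calculation. It counts only the weights; the KV cache, activations, and quantization metadata add on top, which is why real-world numbers land a bit above the bare figure.

```python
def estimate_weight_memory_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Rough memory footprint of the weights alone.

    Real usage is higher: KV cache, activations, and quantization
    metadata add on top and grow with context length and batch size.
    """
    bytes_per_weight = bits_per_weight / 8.0
    # billions of parameters x bytes per parameter = gigabytes
    return num_params_billion * bytes_per_weight

# The 70B example from above:
print(f"FP16 : {estimate_weight_memory_gb(70, 16):.0f} GB")  # 140 GB
print(f"4-bit: {estimate_weight_memory_gb(70, 4):.0f} GB")   # 35 GB
```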
2. GPTQ: GPU-Optimized Quantization
GPTQ (GPT Quantization) is a post-training quantization method optimized for GPU inference. It uses a small calibration dataset to minimize layer-by-layer quantization error and produces models that run efficiently with libraries like AutoGPTQ and ExLlama.
Pros: Fast inference on NVIDIA GPUs, good quality retention, wide model availability.
Cons: Requires GPU for inference, calibration can be slow, less flexible than GGUF.
Best for: Users with dedicated NVIDIA GPUs who want maximum inference speed.
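As a rough illustration of the workflow, here is a minimal sketch of loading an already-quantized GPTQ checkpoint through the Hugging Face transformers integration. It assumes the transformers, accelerate, and auto-gptq/optimum packages are installed; the repo ID is a placeholder, not a specific recommendation, so check the model card of whatever GPTQ build you actually use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ID for a 4-bit GPTQ build; substitute the model you intend to run.
model_id = "your-org/your-model-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# transformers picks up the GPTQ quantization config stored in the checkpoint
# and places the layers on the GPU via device_map="auto".
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```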
3. AWQ: Activation-Aware Quantization
AWQ (Activation-aware Weight Quantization) improves on GPTQ by using activation statistics during quantization to identify the small fraction of salient weight channels and rescale them, protecting the weights that matter most from aggressive rounding. This often results in better quality at the same bit depth.
Pros: Better quality than GPTQ at same size, efficient inference.
Cons: Newer format with less model availability, still GPU-focused.
Best for: Users prioritizing output quality who can find AWQ versions of their target models.
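One way around the availability gap is to quantize a model yourself. The sketch below uses the AutoAWQ package (the awq import) with its commonly documented 4-bit settings; the paths are placeholders, and the quantization step downloads a calibration dataset and needs substantial GPU memory, so treat it as an outline of the process rather than a turnkey recipe.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-org/your-fp16-model"   # placeholder: original FP16 checkpoint
quant_path = "./your-model-awq"           # placeholder: output directory

# Typical AWQ settings: 4-bit weights with group size 128.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Runs calibration and rewrites the weights in 4-bit AWQ form.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```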
4. GGUF: The Flexible Format
GGUF (GPT-Generated Unified Format) is the successor to GGML, designed for llama.cpp. It supports CPU inference, partial GPU offloading, and various quantization levels (Q2_K through Q8_0).
Pros: Works on CPU+GPU, flexible offloading, excellent for mixed hardware, great tooling (Ollama, LM Studio).
Cons: Slightly slower than pure GPU solutions, larger file sizes at equivalent quality.
Best for: Users with limited VRAM, Apple Silicon users, anyone wanting flexibility.
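The offloading workflow is easiest to see through the llama-cpp-python bindings, one of several front ends for llama.cpp (Ollama and LM Studio wrap the same engine). In this sketch the GGUF path is a placeholder, and n_gpu_layers is the knob that splits the model between GPU and CPU.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",  # placeholder: download the GGUF file separately
    n_ctx=4096,        # context window
    n_gpu_layers=20,   # offload 20 layers to the GPU; 0 = CPU-only, -1 = offload all layers
)

out = llm(
    "Q: What does Q4_K_M mean in GGUF? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```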
5. Choosing the Right Quantization
Q4_K_M (GGUF): Best balance of size and quality for most users. ~4.5 bits per weight.
Q5_K_M (GGUF): Noticeably better quality, ~15% larger. Good if you have the VRAM.
Q8_0 (GGUF): Near-FP16 quality, double the size of Q4. For quality-critical applications.
4-bit GPTQ/AWQ: Best for pure GPU inference when speed is priority.
EXL2: Advanced format allowing custom bits-per-weight, excellent quality/size tradeoff.
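To compare these options concretely, the sketch below turns the rules of thumb from this list into approximate file sizes for a 70B model. The bits-per-weight figures are rough derivations from the statements above, not exact values; real GGUF files vary a little because different tensors get different quantization levels.

```python
# Rough bits-per-weight, derived from the rules of thumb above (approximate).
QUANT_BPW = {
    "Q4_K_M": 4.5,
    "Q5_K_M": 4.5 * 1.15,  # ~15% larger than Q4_K_M
    "Q8_0":   4.5 * 2.0,   # roughly double the size of Q4
    "FP16":   16.0,
}

def file_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weights-only size: billions of params x bytes per weight = GB."""
    return params_billion * bits_per_weight / 8.0

for name, bpw in QUANT_BPW.items():
    print(f"{name:>7}: ~{file_size_gb(70, bpw):.0f} GB for a 70B model")
```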