LLM Quantization: GPTQ, AWQ, and GGUF Explained
Master quantization techniques to run larger models on consumer hardware. Covers 4-bit, 8-bit, and mixed precision approaches.
Table of Contents
1. What is Quantization?
2. GPTQ: GPU-Optimized Quantization
3. AWQ: Activation-Aware Quantization
4. GGUF: The Flexible Format
5. Choosing the Right Quantization
1. What is Quantization?
Quantization reduces the precision of model weights from 16-bit or 32-bit floating point to lower bit representations (8-bit, 4-bit, or even 2-bit). This dramatically reduces memory requirements and can speed up inference, with minimal impact on output quality when done correctly.
A 70B-parameter model at FP16 requires ~140GB of memory for its weights alone. At 4-bit quantization this drops to ~35-40GB, putting it within reach of high-end consumer setups such as dual 24GB GPUs or Apple Silicon Macs with enough unified memory.
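To make the arithmetic concrete, here is a minimal sketch (plain Python, no libraries) of the back-of-the-envelope calculation. It counts only the weights; the KV cache, activations, and quantization metadata add on top, which is why real-world numbers land a bit above the bare figure.

```python
def estimate_weight_memory_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Rough memory footprint of the weights alone.

    Real usage is higher: KV cache, activations, and quantization
    metadata add on top and grow with context length and batch size.
    """
    bytes_per_weight = bits_per_weight / 8.0
    # billions of parameters x bytes per parameter = gigabytes
    return num_params_billion * bytes_per_weight

# The 70B example from above:
print(f"FP16 : {estimate_weight_memory_gb(70, 16):.0f} GB")  # 140 GB
print(f"4-bit: {estimate_weight_memory_gb(70, 4):.0f} GB")   # 35 GB
```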
2. GPTQ: GPU-Optimized Quantization
GPTQ (GPT Quantization) is a post-training quantization method optimized for GPU inference. It uses a small calibration dataset to minimize layer-by-layer quantization error and produces models that run efficiently with libraries like AutoGPTQ and ExLlama.
Pros: Fast inference on NVIDIA GPUs, good quality retention, wide model availability.
Cons: Requires GPU for inference, calibration can be slow, less flexible than GGUF.
Best for: Users with dedicated NVIDIA GPUs who want maximum inference speed.
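As a rough illustration of the workflow, here is a minimal sketch of loading an already-quantized GPTQ checkpoint through the Hugging Face transformers integration. It assumes the transformers, accelerate, and auto-gptq/optimum packages are installed; the repo ID is a placeholder, not a specific recommendation, so check the model card of whatever GPTQ build you actually use.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ID for a 4-bit GPTQ build; substitute the model you intend to run.
model_id = "your-org/your-model-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# transformers picks up the GPTQ quantization config stored in the checkpoint
# and places the layers on the GPU via device_map="auto".
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```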
3. AWQ: Activation-Aware Quantization
AWQ (Activation-aware Weight Quantization) improves on GPTQ by using activation statistics during quantization to identify the small fraction of salient weight channels and rescale them, protecting the weights that matter most from aggressive rounding. This often results in better quality at the same bit depth.
Pros: Better quality than GPTQ at same size, efficient inference.
Cons: Newer format with less model availability, still GPU-focused.
Best for: Users prioritizing output quality who can find AWQ versions of their target models.
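One way around the availability gap is to quantize a model yourself. The sketch below uses the AutoAWQ package (the awq import) with its commonly documented 4-bit settings; the paths are placeholders, and the quantization step downloads a calibration dataset and needs substantial GPU memory, so treat it as an outline of the process rather than a turnkey recipe.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-org/your-fp16-model"   # placeholder: original FP16 checkpoint
quant_path = "./your-model-awq"           # placeholder: output directory

# Typical AWQ settings: 4-bit weights with group size 128.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Runs calibration and rewrites the weights in 4-bit AWQ form.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```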
4. GGUF: The Flexible Format
GGUF (GPT-Generated Unified Format) is the successor to GGML, designed for llama.cpp. It supports CPU inference, partial GPU offloading, and various quantization levels (Q2_K through Q8_0).
Pros: Works on CPU+GPU, flexible offloading, excellent for mixed hardware, great tooling (Ollama, LM Studio).
Cons: Slightly slower than pure GPU solutions, larger file sizes at equivalent quality.
Best for: Users with limited VRAM, Apple Silicon users, anyone wanting flexibility.
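The offloading workflow is easiest to see through the llama-cpp-python bindings, one of several front ends for llama.cpp (Ollama and LM Studio wrap the same engine). In this sketch the GGUF path is a placeholder, and n_gpu_layers is the knob that splits the model between GPU and CPU.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",  # placeholder: download the GGUF file separately
    n_ctx=4096,        # context window
    n_gpu_layers=20,   # offload 20 layers to the GPU; 0 = CPU-only, -1 = offload all layers
)

out = llm(
    "Q: What does Q4_K_M mean in GGUF? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```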
5. Choosing the Right Quantization
Q4_K_M (GGUF): Best balance of size and quality for most users. ~4.5 bits per weight.
Q5_K_M (GGUF): Noticeably better quality, ~15% larger. Good if you have the VRAM.
Q8_0 (GGUF): Near-FP16 quality, double the size of Q4. For quality-critical applications.
4-bit GPTQ/AWQ: Best for pure GPU inference when speed is priority.
EXL2: Advanced format allowing custom bits-per-weight, excellent quality/size tradeoff.
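To compare these options concretely, the sketch below turns the rules of thumb from this list into approximate file sizes for a 70B model. The bits-per-weight figures are rough derivations from the statements above, not exact values; real GGUF files vary a little because different tensors get different quantization levels.

```python
# Rough bits-per-weight, derived from the rules of thumb above (approximate).
QUANT_BPW = {
    "Q4_K_M": 4.5,
    "Q5_K_M": 4.5 * 1.15,  # ~15% larger than Q4_K_M
    "Q8_0":   4.5 * 2.0,   # roughly double the size of Q4
    "FP16":   16.0,
}

def file_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weights-only size: billions of params x bytes per weight = GB."""
    return params_billion * bits_per_weight / 8.0

for name, bpw in QUANT_BPW.items():
    print(f"{name:>7}: ~{file_size_gb(70, bpw):.0f} GB for a 70B model")
```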