# Quantization Guide
Quantization reduces model size and inference cost by representing weights with fewer bits. This is essential for running large models on consumer hardware.
## What is Quantization?

Quantization maps high-precision values (e.g., FP16, FP32) to lower-precision values (e.g., INT8, INT4). The trade-off is a small accuracy loss for significant gains in speed and memory efficiency.
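As a rough illustration of the mapping, here is a minimal sketch of symmetric per-tensor INT8 quantization and dequantization. The function names and scaling scheme are illustrative only; formats such as Q4_K_M use more elaborate block-wise scales, but the principle is the same: store low-bit integers plus a small amount of scale metadata.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0   # one scale factor for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max absolute error:", np.abs(w - w_hat).max())  # small, but nonzero
```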
## Quantization Formats

### GGUF (llama.cpp)

The most popular format for local LLMs. It supports various quantization levels:
| Format | Bits | Size Reduction | Quality |
|---|---|---|---|
| FP16 | 16 | 1x | Reference |
| Q8_0 | 8 | 2x | Near lossless |
| Q5_K_M | 5.7 | 2.8x | Excellent |
| Q4_K_M | 4.3 | 3.7x | Very Good |
| Q3_K_M | 3.4 | 4.7x | Good |
| Q2_K | 2.3 | 7x | Acceptable |
Recommendation: Q4_K_M or Q5_K_M for most use cases.
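One common way to run a GGUF file locally is through the llama-cpp-python bindings. This is a minimal sketch, not the only option; the model path is a placeholder for whatever Q4_K_M (or other) file you downloaded.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path is a placeholder; point it at any GGUF file.
llm = Llama(
    model_path="./models/model-Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm("Q: What does quantization trade away? A:", max_tokens=64)
print(out["choices"][0]["text"])
```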
### AWQ (Activation-Aware Weight Quantization)

Optimized for GPU inference with vLLM and other engines:
| Format | Bits | Quality |
|---|---|---|
| FP16 | 16 | Reference |
| AWQ INT4 | 4 | Very Good |
| AWQ INT8 | 8 | Near lossless |
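For GPU serving, an AWQ INT4 checkpoint can be loaded with vLLM roughly as below. The model id is a placeholder, and the `quantization` argument assumes a reasonably recent vLLM release.

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Placeholder model id; substitute any AWQ-quantized checkpoint.
llm = LLM(model="org/model-awq", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)
```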
### GPTQ (Generative Post-Training Quantization)

Another GPU-friendly format:
| Format | Bits | Quality |
|---|---|---|
| FP16 | 16 | Reference |
| GPTQ INT4 | 4 | Very Good |
| GPTQ INT8 | 8 | Near lossless |
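GPTQ checkpoints can likewise be loaded through Hugging Face transformers, provided a GPTQ backend is installed. A sketch under that assumption; the repo id is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; use any GPTQ-quantized model. Requires a GPTQ backend
# (e.g., the auto-gptq/gptqmodel package) and accelerate alongside transformers.
model_id = "org/model-GPTQ"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("Quantization is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```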
## Choosing the Right Quantization

### For Consumer GPUs (8-12GB VRAM)

- Use Q4_K_M or Q5_K_M (GGUF)
- Model size: 4-7B parameters is ideal
- Good balance of speed and quality (a rough memory-sizing sketch follows at the end of this section)
### For High-End GPUs (24GB+ VRAM)

- Use Q8_0 or FP16 (GGUF)
- Model size: 13-70B parameters possible
- Maximum quality with reasonable speed
### For CPU-Only Systems

- Use Q4_K_M or Q5_K_M (GGUF)
- More threads help performance
- Consider smaller models (3-7B)
### For Edge Devices

- Use Q2_K or Q3_K_M (GGUF)
- Prioritize speed over quality
- Consider specialized models (e.g., Phi-3, Gemma)
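To sanity-check the VRAM guidance above, you can estimate a model's weight footprint from its parameter count and the bits per weight listed in the GGUF table. This back-of-the-envelope sketch covers weights only; the KV cache and runtime overhead add more on top.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

# Roughly matching the GGUF table above:
for name, bpw in [("FP16", 16), ("Q8_0", 8), ("Q5_K_M", 5.7), ("Q4_K_M", 4.3)]:
    print(f"7B @ {name}: ~{weight_footprint_gb(7, bpw):.1f} GiB")

# A 7B model at Q4_K_M is roughly 3.5 GiB of weights, which is why 4-7B models
# fit comfortably on 8-12 GB GPUs with room left over for context.
```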
## How to Convert Models

### Converting to GGUF

```bash
# Using llama.cpp conversion script
python convert_hf_to_gguf.py /path/to/hf/model \
  --outfile /path/to/output.gguf \
  --outtype f16
```
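The conversion above produces an FP16 GGUF; to get one of the quantized variants from the table, a second pass with llama.cpp's quantize tool is needed. A minimal sketch, assuming you have built llama.cpp and the llama-quantize binary is on your PATH (file names are placeholders):

```python
import subprocess

# Quantize the FP16 GGUF produced by the conversion step down to Q4_K_M.
# Paths are placeholders; the third positional argument selects the quant type.
subprocess.run(
    ["llama-quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```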
### Using Pre-Quantized Models

Most popular models are available pre-quantized on Hugging Face:
- TheBloke - Extensive GGUF collection
- Bartowski - AWQ and GPTQ models
- llama.cpp - Official GGUF models
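A single quantization level can be pulled from such repositories with the huggingface_hub client. A minimal sketch; the repo and file names are placeholders for whichever quantized model you pick.

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Placeholder repo/file names; most GGUF repos ship one file per quantization level.
path = hf_hub_download(
    repo_id="someuser/SomeModel-GGUF",
    filename="somemodel.Q4_K_M.gguf",
)
print("downloaded to:", path)
```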
## Performance Tips

### Memory Optimization

- Use the smallest quantization that meets your quality needs
- Reduce context length (num_ctx) to the minimum required
- Enable GPU offloading when possible
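If you serve models through Ollama (where the num_ctx option lives), context length and GPU offloading can be set per request. A rough sketch using the ollama Python client, with the model name as a placeholder; exact response handling may vary between client versions.

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

# Placeholder model name. num_ctx caps the context window (and the KV cache it
# implies); num_gpu controls how many layers are offloaded to the GPU.
response = ollama.generate(
    model="llama3.2",
    prompt="Summarize why smaller contexts use less memory.",
    options={"num_ctx": 2048, "num_gpu": 99},
)
print(response["response"])
```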
### Speed Optimization

- Use KV cache quantization (supported in newer llama.cpp builds)
- Enable flash attention (if supported by your hardware)
- Use continuous batching (vLLM) or speculative decoding
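As one concrete example of these switches, recent llama-cpp-python builds expose flash attention and full GPU offload on the constructor. This is a sketch under that assumption; KV cache quantization and speculative decoding are configured differently per engine and are omitted here.

```python
from llama_cpp import Llama

# Placeholder path. flash_attn requires hardware/build support and a recent
# llama-cpp-python release; n_gpu_layers=-1 offloads every layer it can.
llm = Llama(
    model_path="./models/model-Q4_K_M.gguf",
    n_gpu_layers=-1,
    flash_attn=True,
    n_batch=512,      # larger batches speed up prompt processing
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```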
### Quality Preservation

- Always test with your specific use case
- Q4_K_M is usually the sweet spot
- For critical applications, use Q5_K_M or Q8_0
- Keep FP16 as reference for comparison
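One lightweight way to follow this advice is to run the same prompts through the reference model and the quantized candidate and compare the outputs by hand. A minimal sketch using llama-cpp-python, with file paths as placeholders.

```python
from llama_cpp import Llama

prompts = [
    "Explain the trade-off quantization makes.",
    "List three uses of INT4 models.",
]

# Paths are placeholders: one higher-precision reference, one quantized candidate.
for label, path in [("reference-Q8_0", "./models/model-Q8_0.gguf"),
                    ("candidate-Q4_K_M", "./models/model-Q4_K_M.gguf")]:
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    for p in prompts:
        text = llm(p, max_tokens=64, temperature=0.0)["choices"][0]["text"]
        print(f"[{label}] {p}\n{text}\n")
```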
## Troubleshooting

### Garbled or Nonsensical Output

- Quantization too aggressive (try Q5_K_M or Q8_0)
- Model too small for the task
- Insufficient context length
### Slow Inference

- Model too large for your hardware
- Try a more quantized version
- Enable GPU offloading
### Out of Memory

- Use a smaller model
- Increase quantization (e.g., Q8_0 → Q4_K_M)
- Reduce context length
- Enable more GPU offloading