
Quantization Guide

Quantization reduces model size and inference cost by representing weights with fewer bits. This is essential for running large models on consumer hardware.

Quantization maps high-precision values (e.g., FP16, FP32) to lower-precision values (e.g., INT8, INT4). The trade-off is a small accuracy loss for significant gains in speed and memory efficiency.
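
As a rough illustration of the idea (not any particular format's algorithm), here is a minimal symmetric INT8 round-trip in NumPy; the tensor shape and the single per-tensor scale are simplifying assumptions:

```python
import numpy as np

# Toy FP32 weight tensor standing in for one model layer.
w = np.random.randn(4096, 4096).astype(np.float32)

# Symmetric INT8 quantization: one scale for the whole tensor.
scale = np.abs(w).max() / 127.0                      # map the largest magnitude to 127
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to approximate the original weights at inference time.
w_hat = q.astype(np.float32) * scale

print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)   # roughly 4x smaller
print("mean abs error:", np.abs(w - w_hat).mean())        # the small accuracy loss
```

Real formats such as GGUF's K-quants use per-block scales and mixed precisions rather than a single tensor-wide scale, but the underlying trade-off is the same.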

GGUF

The most popular format for local LLMs, with a range of quantization levels:

| Format | Bits per weight | Size Reduction | Quality |
|--------|-----------------|----------------|---------|
| FP16   | 16  | 1x   | Reference |
| Q8_0   | 8   | 2x   | Near lossless |
| Q5_K_M | 5.7 | 2.8x | Excellent |
| Q4_K_M | 4.3 | 3.7x | Very Good |
| Q3_K_M | 3.4 | 4.7x | Good |
| Q2_K   | 2.3 | 7x   | Acceptable |

Recommendation: Q4_K_M or Q5_K_M for most use cases.
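
To make the size-reduction column concrete, here is a back-of-the-envelope estimate of weight memory for an assumed 7B-parameter model, using the bits-per-weight values from the table (this counts weights only; KV cache and runtime buffers add to the real footprint):

```python
PARAMS = 7e9  # assumed 7B-parameter model

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q5_K_M", 5.7),
                   ("Q4_K_M", 4.3), ("Q3_K_M", 3.4), ("Q2_K", 2.3)]:
    gib = PARAMS * bits / 8 / 2**30     # bits -> bytes -> GiB
    print(f"{name:7s} ~{gib:4.1f} GiB of weights")
```

For 7B parameters this works out to roughly 13 GiB at FP16 versus about 3.5 GiB at Q4_K_M, which is why 4-5 bit quantization is the default choice on consumer hardware.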

AWQ (Activation-Aware Weight Quantization)


Optimized for GPU inference with vLLM and other engines:

| Format   | Bits | Quality |
|----------|------|---------|
| FP16     | 16   | Reference |
| AWQ INT4 | 4    | Very Good |
| AWQ INT8 | 8    | Near lossless |
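
A sketch of serving an AWQ checkpoint with vLLM; the model id is a placeholder, and recent vLLM versions can often detect the quantization from the checkpoint config, so the explicit quantization argument may be redundant:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ-quantized checkpoint; substitute a real repo id.
llm = LLM(model="someorg/some-model-awq", quantization="awq")

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```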

GPTQ (Generative Post-Training Quantization)


Another GPU-friendly format:

| Format    | Bits | Quality |
|-----------|------|---------|
| FP16      | 16   | Reference |
| GPTQ INT4 | 4    | Very Good |
| GPTQ INT8 | 8    | Near lossless |
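
GPTQ checkpoints can typically be loaded through Hugging Face Transformers once a GPTQ backend (e.g. optimum with auto-gptq or gptqmodel) is installed; a hedged sketch with a placeholder model id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "someorg/some-model-gptq"   # placeholder GPTQ repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization lets you", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```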

Hardware Recommendations

Typical consumer GPU (limited VRAM):

  • Use Q4_K_M or Q5_K_M (GGUF)
  • Model size: 4-7B parameters is ideal
  • Good balance of speed and quality

High-VRAM GPU or workstation:

  • Use Q8_0 or FP16 (GGUF)
  • Model size: 13-70B parameters is possible
  • Maximum quality with reasonable speed

CPU-only inference:

  • Use Q4_K_M or Q5_K_M (GGUF)
  • More threads help performance
  • Consider smaller models (3-7B)

Constrained hardware (low RAM, edge devices):

  • Use Q2_K or Q3_K_M (GGUF)
  • Prioritize speed over quality
  • Consider specialized models (e.g., Phi-3, Gemma)

A rough sizing helper follows this section.
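
As a rough sizing aid for the recommendations above, the helper below estimates the highest-quality GGUF level that fits a given memory budget. The bits-per-weight values come from the earlier table, the 1.2x overhead factor for KV cache and runtime buffers is an assumption, and the function itself is hypothetical rather than part of any tool:

```python
GGUF_BITS = {"Q2_K": 2.3, "Q3_K_M": 3.4, "Q4_K_M": 4.3,
             "Q5_K_M": 5.7, "Q8_0": 8.0, "FP16": 16.0}
OVERHEAD = 1.2  # assumed headroom for KV cache and runtime buffers

def largest_fitting_quant(params_billion: float, mem_gib: float) -> str | None:
    """Return the highest-quality level whose estimated footprint fits in mem_gib."""
    best = None
    for name, bits in sorted(GGUF_BITS.items(), key=lambda kv: kv[1]):
        gib = params_billion * 1e9 * bits / 8 / 2**30 * OVERHEAD
        if gib <= mem_gib:
            best = name  # keep upgrading while it still fits
    return best

print(largest_fitting_quant(7, 8))    # a 7B model in ~8 GiB of memory
print(largest_fitting_quant(70, 48))  # a 70B model in ~48 GiB of memory
```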

Converting Models to GGUF

To convert a Hugging Face model to GGUF yourself:

```bash
# Using llama.cpp's conversion script (the model directory is a positional argument)
python convert_hf_to_gguf.py /path/to/hf/model \
  --outfile /path/to/output.gguf \
  --outtype f16
```

Most popular models are available pre-quantized on Hugging Face, so you can often download a ready-made file instead of converting yourself.
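
For example, a single GGUF file can be fetched with huggingface_hub; the repo id and filename below are placeholders for whichever pre-quantized upload you choose:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo/filename; point these at a real pre-quantized GGUF upload.
path = hf_hub_download(
    repo_id="someorg/some-model-GGUF",
    filename="some-model.Q4_K_M.gguf",
)
print("downloaded to", path)
```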

Optimization Tips

  • Use the smallest quantization that meets your quality needs
  • Reduce context length (num_ctx) to the minimum required
  • Enable GPU offloading when possible
  • Use KV cache quantization (supported by newer GGUF runtimes such as llama.cpp)
  • Enable flash attention (if supported by your hardware)
  • Use continuous batching (vLLM) or speculative decoding

Several of these map directly to runtime options; see the sketch after this list.
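
As one concrete (and hedged) example, a few of these knobs correspond to constructor options in the llama-cpp-python bindings; the model path is a placeholder, and flash_attn availability depends on your build and hardware:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # keep context no larger than you actually need
    n_gpu_layers=-1,   # offload all layers to the GPU when VRAM allows
    flash_attn=True,   # flash attention, if supported by the build/hardware
)

out = llm("Summarize why quantization saves memory:", max_tokens=64)
print(out["choices"][0]["text"])
```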

Best Practices

  • Always test with your specific use case
  • Q4_K_M is usually the sweet spot
  • For critical applications, use Q5_K_M or Q8_0
  • Keep an FP16 version as a reference for comparison

A small side-by-side comparison sketch follows.
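
One minimal way to sanity-check a quantized file against the FP16 reference on your own prompts; this sketch assumes llama-cpp-python and placeholder file paths, and eyeballing outputs is no substitute for a proper perplexity or task-specific evaluation:

```python
from llama_cpp import Llama

PROMPTS = ["Summarize the trade-offs of 4-bit quantization."]  # use your real prompts

# Placeholder paths: the FP16 reference and the quantized candidate.
for path in ["models/some-model.F16.gguf", "models/some-model.Q4_K_M.gguf"]:
    llm = Llama(model_path=path, n_ctx=2048, verbose=False)
    for p in PROMPTS:
        out = llm(p, max_tokens=64, temperature=0.0)  # greedy decoding for comparability
        print(path, "->", out["choices"][0]["text"].strip())
```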

Troubleshooting

Poor output quality:

  • Quantization too aggressive (try Q5_K_M or Q8_0)
  • Model too small for the task
  • Insufficient context length

Slow inference:

  • Model too large for your hardware
  • Try a more quantized version
  • Enable GPU offloading
  • Use a smaller model

Out of memory:

  • Quantize more aggressively (e.g., Q8_0 → Q4_K_M)
  • Reduce context length
  • Enable more GPU offloading