Model Selection Guide

Choosing the right model means balancing performance, cost, and hardware requirements. This guide helps you select a model size category for your needs rather than a specific model, so the recommendations stay useful as new models are released.

Use Case | Recommended Size | Quantization | Runner
General Chat | Small (3–7B) | Q5_K_M | Ollama
Code Generation | Medium (7–22B) | Q4_K_M | vLLM
Math/Reasoning | Small–Medium (7–9B) | Q5_K_M | Ollama
SQL Generation | Small–Medium (7–8B) | Q4_K_M | llama.cpp
Document Analysis | Small–Medium (7B) | Q5_K_M | Ollama
Edge Deployment | Small (3–4B) | Q4_K_M | llama.cpp
High-Throughput | Large (70B+) | AWQ INT4 | vLLM
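If you want scripts to pick these defaults automatically, the quick-reference table can be encoded as data. The sketch below is only an illustration: the names `Recommendation`, `QUICK_REFERENCE`, and `recommend` are hypothetical, and the values are copied from the table above.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    size: str          # parameter range from the table
    quantization: str  # suggested quantization format
    runner: str        # suggested inference runner


# Values copied from the quick-reference table; adjust to your own findings.
QUICK_REFERENCE = {
    "General Chat": Recommendation("3–7B", "Q5_K_M", "Ollama"),
    "Code Generation": Recommendation("7–22B", "Q4_K_M", "vLLM"),
    "Math/Reasoning": Recommendation("7–9B", "Q5_K_M", "Ollama"),
    "SQL Generation": Recommendation("7–8B", "Q4_K_M", "llama.cpp"),
    "Document Analysis": Recommendation("7B", "Q5_K_M", "Ollama"),
    "Edge Deployment": Recommendation("3–4B", "Q4_K_M", "llama.cpp"),
    "High-Throughput": Recommendation("70B+", "AWQ INT4", "vLLM"),
}


def recommend(use_case: str) -> Recommendation:
    """Look up a use case, ignoring case and surrounding whitespace."""
    key = use_case.strip().lower()
    for name, rec in QUICK_REFERENCE.items():
        if name.lower() == key:
            return rec
    raise KeyError(f"No recommendation for use case: {use_case!r}")
```

For example, `recommend("sql generation")` returns the SQL Generation row as a named tuple.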

General-purpose models are best for everyday tasks, chat, and general assistance.

Size Category | Parameter Range | Strengths
Small | 1–4B | Fast, minimal hardware, good for basic tasks
Small–Medium | 7–9B | Balanced performance, versatile
Medium | 11–14B | Strong reasoning, well-rounded

Code-focused models are optimized for programming tasks. Aim for a model in the 7–22B range for the best balance of code quality and speed.

Size Category | Parameter Range | Strengths
Small | Up to 7B | Good for common languages (Python, JS, C++)
Medium | 15–22B | Multi-language, strong code generation
Large | 34B+ | Maximum capability, higher hardware requirements

Math and reasoning models are specialized for logical reasoning and mathematical tasks.

Size Category | Parameter Range | Strengths
Small–Medium | 7–9B | Solid math and reasoning capabilities
Medium | 9–14B | Strong analytical skills

SQL-focused models are optimized for SQL generation and data analysis.

Size Category | Parameter Range | Strengths
Small–Medium | 7–8B | Good SQL generation and structured data handling

  • Small model (3–4B, Q4+)
  • Small–Medium model (7B, Q4)
  • Small–Medium model (7–9B, Q4+)
  • Large model (70B, Q4)
  • Medium model (20–22B, Q4)
  • Mixture-of-Experts model (8x7B, Q4)
  • Small model (1–4B, Q4+)
  • Tiny model (1.5B, Q4+)

  • Test with your specific use case
  • Compare against baseline (cloud API)
  • Consider both qualitative and quantitative metrics

  • Time to first token (TTFT)
  • Tokens per second (TPS)
  • Context length impact

  • Model size (quantized)
  • Context window requirements
  • Batch size capabilities

  • Hardware investment
  • Electricity costs
  • Maintenance overhead
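Time to first token and tokens per second can be measured from any streaming response. The following is a minimal sketch, not tied to a particular runner: `measure_stream` is a hypothetical helper, it assumes each streamed chunk is one token (many runners stream larger chunks), and it reports decode-phase TPS, meaning tokens after the first divided by the time after the first token.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a token stream and report TTFT and decode-phase TPS."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # arrival time of the first token
        count += 1
    end = clock()

    if first_token_at is None:
        # The stream produced nothing, so there is no TTFT to report.
        return {"tokens": 0, "ttft_s": None, "tps": 0.0}

    decode_time = end - first_token_at
    # Exclude the first token so prompt processing does not skew throughput.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"tokens": count, "ttft_s": first_token_at - start, "tps": tps}
```

Pass the generator returned by your runner's streaming API as `tokens`. To see the context length impact, repeat the measurement with prompts of different lengths and compare the TTFT values.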

Create a representative set of prompts for your use case.

Run the prompts against a trusted baseline model (e.g., a cloud API) to get reference outputs.

Run your local model on the same prompts.

  • Accuracy: Does the local model match the baseline?
  • Speed: Is the response time acceptable?
  • Quality: Is the output useful for your use case?

  • Try different quantizations
  • Adjust model parameters
  • Consider different models in the same size category
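The testing workflow above can be sketched as a small comparison harness. This is an illustration under assumptions: `baseline_fn` and `local_fn` are hypothetical callables that wrap whatever clients you use, and the default matcher is a crude normalized string comparison that you would likely replace with a task-specific check.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Result:
    prompt: str
    baseline: str
    local: str
    local_seconds: float  # wall-clock time of the local model call
    matches: bool


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing outputs."""
    return " ".join(text.lower().split())


def compare(
    prompts: Sequence[str],
    baseline_fn: Callable[[str], str],
    local_fn: Callable[[str], str],
    matcher: Callable[[str, str], bool] = lambda a, b: normalize(a) == normalize(b),
) -> List[Result]:
    """Run every prompt through both models and record match and timing."""
    results = []
    for prompt in prompts:
        expected = baseline_fn(prompt)
        started = time.perf_counter()
        actual = local_fn(prompt)
        elapsed = time.perf_counter() - started
        results.append(Result(prompt, expected, actual, elapsed, matcher(expected, actual)))
    return results


def match_rate(results: Sequence[Result]) -> float:
    """Fraction of prompts where the local output matched the baseline."""
    return sum(r.matches for r in results) / len(results) if results else 0.0
```

Exact matching suits short, structured answers such as SQL or numeric results. For free-form chat output, a human review or a scoring rubric is usually a better matcher.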

Start with a small model (3–4B) on Ollama. Models of this size are fast, require minimal hardware, and perform well for most everyday tasks.
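As a starting point, a locally running Ollama server can be queried over its REST API using only the standard library. This is a minimal sketch: it assumes the server listens on Ollama's default address (`http://localhost:11434`) and that you have already pulled a model. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Default local endpoint for Ollama's generate API (assumes a default install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Call `generate("<your-model-tag>", "Explain quantization in one sentence.")`, replacing the placeholder with the tag of a small model you have pulled.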

Use a medium-sized model (7–22B) for code generation tasks. Choose Ollama or vLLM depending on your throughput needs.

Use a small–medium model (7–9B) with Q4_K_M quantization. This combination offers a good balance of quality and performance for production deployments.

Use the largest model your hardware can handle, running it at FP16 or with Q8_0 quantization for maximum accuracy.

  • Try a smaller model size category
  • Use more aggressive quantization
  • Enable GPU offloading
  • Reduce context length

  • Try a larger model size category
  • Use less aggressive quantization
  • Adjust temperature and other parameters
  • Consider a model specialized for your use case

  • Use a smaller model size category
  • Use more aggressive quantization (fewer bits per weight)
  • Reduce context length
  • Enable more GPU offloading
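Reducing context length helps because the key/value cache grows linearly with the number of tokens in context. The sketch below uses the standard transformer KV cache size formula. The model shape in the example (32 layers, 8 KV heads, head dimension 128, FP16 cache, batch size 1) is a hypothetical configuration, not a specific model.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes per value for an FP16 cache
) -> int:
    """KV cache size for one sequence: keys and values for every layer."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


GIB = 1024 ** 3

# Hypothetical model shape, for illustration only.
full = kv_cache_bytes(8192, n_layers=32, n_kv_heads=8, head_dim=128)
half = kv_cache_bytes(4096, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"8192 tokens: {full / GIB:.2f} GiB, 4096 tokens: {half / GIB:.2f} GiB")
```

For this assumed shape the cache takes 128 KiB per token, so halving the context from 8192 to 4096 tokens frees 0.5 GiB. Models with more layers or more KV heads save proportionally more.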