Model Selection Guide

Choosing the right model is critical for balancing performance, cost, and hardware requirements. This guide helps you select the best model for your needs.

| Use Case | Recommended Model | Quantization | Runner |
|---|---|---|---|
| General Chat | Llama 3.2 3B | Q5_K_M | Ollama |
| Code Generation | Codestral 22B | Q4_K_M | vLLM |
| Math/Reasoning | Qwen 2.5 7B | Q5_K_M | Ollama |
| SQL Generation | Llama 3 8B | Q4_K_M | llama.cpp |
| Document Analysis | Mistral 7B | Q5_K_M | Ollama |
| Edge Deployment | Phi-3 Mini 3.8B | Q4_K_M | llama.cpp |
| High-Throughput | Llama 3 70B | AWQ INT4 | vLLM |
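Whether a given model and quantization fit your hardware depends mostly on parameter count and bits per weight. The sketch below shows the rough arithmetic; the bits-per-weight figures are approximations for common GGUF quantization types, and the 20% overhead factor (KV cache, runtime buffers) is an assumption that varies with runner and context length.

```python
# Rough memory-footprint estimator for quantized models.
# Bits-per-weight values are approximate averages for common GGUF
# quantization types; the 1.2x overhead factor (KV cache, runtime
# buffers) is an assumption and varies with runner and context length.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def estimate_memory_gb(params_billions: float, quant: str,
                       overhead: float = 1.2) -> float:
    """Estimate RAM/VRAM needed to load a model, in gigabytes."""
    bytes_per_param = BITS_PER_WEIGHT[quant] / 8
    weights_gb = params_billions * bytes_per_param  # 1B params ~ 1 GB at 8 bits
    return round(weights_gb * overhead, 1)

# Examples from the table above:
print(estimate_memory_gb(3, "Q5_K_M"))    # Llama 3.2 3B
print(estimate_memory_gb(8, "Q4_K_M"))    # Llama 3 8B
print(estimate_memory_gb(70, "Q4_K_M"))   # Llama 3 70B
```

Treat the result as a lower bound when choosing hardware; a long context window can add several gigabytes of KV cache on top of the weights.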

General Purpose Models

Best for everyday tasks, chat, and general assistance.

| Model | Size | Strengths |
|---|---|---|
| Llama 3.2 3B | 3B parameters | Fast, good for most tasks |
| Llama 3.2 11B | 11B parameters | Balanced performance |
| Mistral 7B v0.3 | 7B parameters | Versatile, well-rounded |
| Gemma 2 9B | 9B parameters | Strong reasoning |

Code Generation Models

Optimized for programming tasks.

| Model | Size | Strengths |
|---|---|---|
| Codestral 22B | 22B parameters | Multi-language, strong code gen |
| CodeLlama 7B | 7B parameters | Python, JavaScript, C++ |
| StarCoder2 15B | 15B parameters | Multi-language, large codebases |
| DeepSeek Coder 6.7B | 6.7B parameters | Strong code completion |

Math and Reasoning Models

Specialized for logical reasoning and mathematical tasks.

| Model | Size | Strengths |
|---|---|---|
| Qwen 2.5 7B | 7B parameters | Strong math and reasoning |
| Llama 3 8B | 8B parameters | Good general reasoning |
| Mistral 7B | 7B parameters | Solid logical reasoning |
| Gemma 2 9B | 9B parameters | Strong analytical skills |

SQL and Data Analysis Models

Optimized for SQL generation and data analysis.

| Model | Size | Strengths |
|---|---|---|
| Llama 3 8B | 8B parameters | Good SQL generation |
| CodeLlama 7B | 7B parameters | Strong SQL and data queries |
| Qwen 2.5 7B | 7B parameters | Good at structured data |
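SQL generation quality depends heavily on giving the model the schema it is querying against. Below is a minimal, hypothetical prompt-building helper; the layout is one common pattern, not an API of any particular runner, so adapt it to your model's chat template.

```python
# Hypothetical prompt builder for SQL generation with a local model.
# The schema-then-question layout is one common pattern; adjust the
# wording and delimiters to match your model's chat template.
def build_sql_prompt(schema: str, question: str) -> str:
    return (
        "You are a SQL assistant. Given the schema below, write a single "
        "SQL query that answers the question. Return only SQL.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Question: {question}\nSQL:"
    )

schema = "CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, created_at DATE);"
prompt = build_sql_prompt(schema, "What was the total revenue in 2024?")
print(prompt)
```

Keeping the instruction ("Return only SQL") explicit makes the output easier to extract and execute programmatically.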
Hardware Recommendations

For low-memory systems:
  • Llama 3.2 3B (Q4+)
  • Mistral 7B (Q4)
  • Phi-3 Mini 3.8B (Q4+)

For mid-range systems:
  • Llama 3 8B (Q4)
  • Qwen 2.5 7B (Q4+)
  • Gemma 2 9B (Q4)

For high-end systems:
  • Llama 3 70B (Q4)
  • Mixtral 8x7B (Q4)
  • Codestral 22B (Q4)

For edge devices and CPU-only machines:
  • Phi-3 Mini 3.8B (Q4+)
  • Llama 3.2 3B (Q4+)
  • Qwen 2.5 1.5B (Q4+)
Benchmarking Your Setup

When evaluating a model:
  • Test with your specific use case
  • Compare against a baseline (cloud API)
  • Consider both qualitative and quantitative metrics

Performance metrics to track:
  • Time to first token (TTFT)
  • Tokens per second (TPS)
  • Context length impact

Memory considerations:
  • Model size (quantized)
  • Context window requirements
  • Batch size capabilities

Total cost of ownership:
  • Hardware investment
  • Electricity costs
  • Maintenance overhead
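The electricity line item is easy to estimate up front. A back-of-the-envelope sketch follows, where the power draw, duty cycle, and per-kWh rate are all assumptions you should replace with your own numbers:

```python
# Back-of-the-envelope monthly electricity cost for a local inference box.
# Power draw, hours per day, and the electricity rate are assumptions;
# substitute measured values for your own hardware and utility.
def monthly_electricity_cost(watts: float, hours_per_day: float,
                             rate_per_kwh: float, days: int = 30) -> float:
    kwh = watts / 1000 * hours_per_day * days
    return round(kwh * rate_per_kwh, 2)

# e.g. a 350 W GPU running 8 h/day at $0.15/kWh -> roughly $12.60/month
print(monthly_electricity_cost(350, 8, 0.15))
```

Comparing this figure (plus amortized hardware cost) against your expected cloud API bill gives a concrete break-even point.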

1. Create a representative set of prompts for your use case.

2. Run the prompts against a known-good model (e.g., GPT-4, Claude) to establish a baseline.

3. Run your local model with the same prompts.

4. Evaluate the results:
  • Accuracy: Does the local model match the baseline?
  • Speed: Is the response time acceptable?
  • Quality: Is the output useful for your use case?

5. If the results fall short, iterate:
  • Try different quantizations
  • Adjust model parameters
  • Consider a different model
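This workflow can be sketched as a small harness. In the version below, `generate` is a stub standing in for a call to your local runner (for example, an HTTP request to an Ollama or vLLM server); swap it out for a real client call. The whitespace token count is a crude assumption, since real runners report exact token counts.

```python
import time

# Minimal benchmark-harness sketch. `generate` is a stub standing in
# for a real call to a local model server; replace it with your client.
def generate(prompt: str) -> str:
    return "stub response to: " + prompt  # replace with a real model call

def benchmark(prompts):
    results = []
    for prompt in prompts:
        start = time.perf_counter()
        output = generate(prompt)
        elapsed = time.perf_counter() - start
        # Crude whitespace token count; real runners report exact counts.
        tokens = len(output.split())
        tps = tokens / elapsed if elapsed > 0 else float("inf")
        results.append({"prompt": prompt, "output": output,
                        "seconds": elapsed, "tokens_per_sec": tps})
    return results

for r in benchmark(["Summarize this paragraph.", "Write a SQL query."]):
    print(f"{r['seconds']:.4f}s  {r['prompt']}")
```

Run the same prompt set against both the baseline and the local model, then compare the recorded outputs and timings side by side.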

Recommendations

If you're just getting started: start with Llama 3.2 3B or Phi-3 Mini 3.8B on Ollama. These are fast, require minimal hardware, and perform well for most tasks.

For development and coding: use Codestral 22B or CodeLlama 7B for code generation tasks. vLLM is recommended for high-throughput development workflows.

For production workloads: use Llama 3 8B or Qwen 2.5 7B with Q4_K_M quantization. These offer the best balance of quality and performance for production deployments.

For maximum quality: use the largest model your hardware can handle, with FP16 or Q8_0 quantization for maximum accuracy.
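For scripting, these scenario recommendations can be encoded as a simple lookup table. The entries below mirror the guide's suggestions; the scenario keys, and the Q8_0/vLLM pairing for the maximum-quality case, are illustrative assumptions rather than fixed choices.

```python
# Scenario -> (model, quantization, runner) lookup mirroring the guide.
# The keys and the max-quality pairing are illustrative assumptions.
RECOMMENDATIONS = {
    "getting_started": ("Llama 3.2 3B", "Q4_K_M", "Ollama"),
    "code": ("Codestral 22B", "Q4_K_M", "vLLM"),
    "production": ("Llama 3 8B", "Q4_K_M", "Ollama"),
    "max_quality": ("Llama 3 70B", "Q8_0", "vLLM"),
}

def recommend(scenario: str) -> str:
    model, quant, runner = RECOMMENDATIONS[scenario]
    return f"{model} ({quant}) on {runner}"

print(recommend("production"))
```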

Troubleshooting

If the model won't fit in memory:
  • Try a smaller model
  • Use more aggressive quantization
  • Enable GPU offloading
  • Reduce context length

If output quality is poor:
  • Try a larger model
  • Use less aggressive quantization
  • Adjust temperature and other parameters
  • Consider a model specialized for your use case

If inference is too slow:
  • Use a smaller model
  • Increase quantization
  • Reduce context length
  • Enable more GPU offloading