llama.cpp Setup
llama.cpp is a highly optimized C/C++ library for running large language models with minimal resources. It’s ideal for edge devices, CPU-only environments, and maximum portability.
Installation
From Source
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)
```
The compiled binaries are placed in build/bin/.
Using pip (llama-cpp-python)
```bash
pip install llama-cpp-python
```
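Once installed, the bindings expose a model through the Llama class. A minimal sketch, assuming a GGUF model sits at ./model.gguf (the path and prompt are illustrative):

```python
from llama_cpp import Llama

# Load a local GGUF model (path is illustrative).
llm = Llama(model_path="./model.gguf", n_ctx=4096)

# Run a completion; the result is an OpenAI-style dict.
output = llm("Q: What is llama.cpp? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```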
Pre-built Binaries
Download pre-built binaries for your platform from the llama.cpp GitHub releases page.
Getting Started
1. Convert a Model
Convert models to GGUF format (llama.cpp’s native format):
```bash
python convert_hf_to_gguf.py /path/to/hf-model --outfile model.gguf
```
(The conversion script takes a Hugging Face model directory; older checkouts ship it as convert.py.)
2. Run a Model
```bash
./main -m model.gguf -n 512 -p "Hello, how are you?"
```
Here -m selects the model file, -n limits the number of generated tokens, and -p supplies the prompt. (Recent llama.cpp releases rename the main and server binaries to llama-cli and llama-server.)
3. Start the API Server
```bash
./server -m model.gguf --host 0.0.0.0 --port 8080
```
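With the server running, you can exercise it over HTTP. A minimal sketch against its /completion endpoint (exact response fields can vary between server versions):

```python
import requests

# POST a prompt to the llama.cpp server's /completion endpoint.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Hello, how are you?", "n_predict": 128},
)
resp.raise_for_status()
print(resp.json()["content"])  # generated text
```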
Configuration with Evonic AI
API Endpoint
Section titled “API Endpoint”llama.cpp server runs at http://localhost:8080 by default.
Configuration
In your Evonic AI configuration:
```yaml
model:
  provider: llama-cpp
  endpoint: http://localhost:8080
  model_name: model.gguf
```
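If a client speaks the OpenAI API instead, the llama.cpp server also exposes an OpenAI-compatible endpoint under /v1. A sketch using the openai Python package (the api_key value is a placeholder; the server only checks it when started with --api-key):

```python
from openai import OpenAI

# Point the OpenAI client at the local llama.cpp server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="model.gguf",  # informational for a single-model server
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(resp.choices[0].message.content)
```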
Section titled “Model Management”GGUF Format
Section titled “GGUF Format”- Native format for llama.cpp
- Supports various quantization levels
- Convert from Hugging Face models using conversion scripts
Model Sources
- Hugging Face GGUF models, e.g. TheBloke’s repositories (see the download sketch after this list)
- Official GGUF models from model creators
- Convert your own models to GGUF
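Ready-made GGUF files can also be fetched programmatically. A sketch using huggingface_hub (the repo and file names are examples, not an endorsement of a specific build):

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF file from a Hugging Face repository.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",   # example repo
    filename="llama-2-7b-chat.Q4_K_M.gguf",    # example file
)
print("Model saved to", path)
```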
Advanced Configuration
GPU Offloading
```bash
./server -m model.gguf -ngl 35
```
The -ngl (--n-gpu-layers) flag sets how many model layers are offloaded to the GPU. Offloading requires a build with GPU support enabled (for example, cmake -B build -DGGML_CUDA=ON for NVIDIA GPUs).
Context Length
```bash
./server -m model.gguf -c 8192
```
Sets the context length to 8192 tokens.
Multi-Threading
```bash
./server -m model.gguf -t 8
```
Uses 8 CPU threads for inference.
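The same three knobs map directly onto llama-cpp-python's constructor arguments, for a combined sketch (values are illustrative):

```python
from llama_cpp import Llama

# Mirror the CLI flags above: -ngl -> n_gpu_layers, -c -> n_ctx, -t -> n_threads.
llm = Llama(
    model_path="./model.gguf",
    n_gpu_layers=35,  # layers offloaded to the GPU
    n_ctx=8192,       # context length in tokens
    n_threads=8,      # CPU threads used for inference
)
```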
Quantization
Run a quantized model for lower memory use and, on most hardware, faster inference:
```bash
./server -m model-Q4_K_M.gguf
```
Quantized GGUF files can be downloaded directly or produced with the quantize tool built alongside the other binaries (llama-quantize in recent releases).
Troubleshooting
Out of Memory
Section titled “Out of Memory”- Use a quantized model (Q4, Q5, Q8)
- Reduce context length
- Increase GPU offloading (-ngl) to move layers out of system RAM
Slow Inference
- Enable GPU offloading
- Use more CPU threads
- Choose a smaller model
Model Not Loading
- Verify the file is valid GGUF (see the sanity check below)
- Check file permissions
- Ensure sufficient disk space
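A quick way to check the first point: every valid GGUF file begins with the 4-byte magic GGUF. A minimal Python check:

```python
# Inspect the magic bytes at the start of the file.
with open("model.gguf", "rb") as f:
    magic = f.read(4)

if magic == b"GGUF":
    print("Looks like a valid GGUF file")
else:
    print(f"Not a GGUF file (got {magic!r})")
```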