
llama.cpp Setup

llama.cpp is a highly optimized C/C++ library for running large language models with minimal resources. It’s ideal for edge devices, CPU-only environments, and maximum portability.

To build from source:

Terminal window
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j $(nproc)
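If you have a CUDA-capable NVIDIA GPU, you can enable GPU support at configure time. A sketch assuming a recent llama.cpp checkout, where the CMake option is GGML_CUDA (older releases used LLAMA_CUBLAS):

Terminal window
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)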
Alternatively, install the Python bindings:

Terminal window
pip install llama-cpp-python

You can also download pre-built binaries from the releases page (https://github.com/ggerganov/llama.cpp/releases).

Convert models to GGUF format (llama.cpp’s native format). The conversion script in the repository takes a Hugging Face model directory:

Terminal window
python convert_hf_to_gguf.py path/to/hf-model --outfile model.gguf
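The converter writes an unquantized GGUF. As a sketch, you can then shrink it with the bundled llama-quantize tool; Q4_K_M is one common preset, and the file names here are placeholders:

Terminal window
./build/bin/llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M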
Run a one-off generation from the command line:

Terminal window
./build/bin/llama-cli -m model.gguf -n 512 -p "Hello, how are you?"
Or start the HTTP server:

Terminal window
./build/bin/llama-server -m model.gguf --host 0.0.0.0 --port 8080

The server is then reachable at http://localhost:8080 (8080 is also llama-server’s default port).
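To confirm the server is responding, you can post to its built-in /completion endpoint (the prompt and token count here are arbitrary):

Terminal window
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "n_predict": 64}'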

In your Evonic AI configuration:

model:
  provider: llama-cpp
  endpoint: http://localhost:8080
  model_name: model.gguf
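The server also exposes an OpenAI-compatible API under /v1, which is typically what provider integrations talk to; you can exercise that route manually to verify the endpoint before wiring it into Evonic AI:

Terminal window
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model.gguf", "messages": [{"role": "user", "content": "Hello"}]}'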
About GGUF:

  • Native format for llama.cpp
  • Supports various quantization levels
  • Convert from Hugging Face models using the conversion scripts

Where to find GGUF models:

  • Hugging Face GGUF repositories such as TheBloke’s (see the download sketch below)
  • Official GGUF models from model creators
  • Convert your own models to GGUF
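As a sketch of pulling a ready-made GGUF from Hugging Face (the repository and file name below are examples; substitute the model you actually want):

Terminal window
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir .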
GPU offloading:

Terminal window
./build/bin/llama-server -m model.gguf -ngl 35

The -ngl flag sets the number of model layers to offload to the GPU (this requires a build with GPU support enabled).

Context length:

Terminal window
./build/bin/llama-server -m model.gguf -c 8192

Set context length to 8192 tokens.

CPU threads:

Terminal window
./build/bin/llama-server -m model.gguf -t 8

Use 8 threads for inference.
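These flags compose, so a typical tuned invocation combines them:

Terminal window
./build/bin/llama-server -m model.gguf -ngl 35 -c 8192 -t 8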

Run with quantized models for better performance:

Terminal window
./build/bin/llama-server -m model-Q4_K_M.gguf
If you run out of memory:

  • Use a quantized model (Q4, Q5, Q8)
  • Reduce context length
  • Increase GPU offloading (-ngl)

If inference is slow:

  • Enable GPU offloading
  • Use more CPU threads
  • Choose a smaller model

If the model won’t load:

  • Verify the file is valid GGUF (see the check below)
  • Check file permissions
  • Ensure sufficient disk space
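For the “won’t load” case, a quick sanity check: a valid GGUF file begins with the ASCII magic bytes GGUF:

Terminal window
head -c 4 model.gguf   # should print: GGUF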