
vLLM Setup

vLLM is a high-throughput and memory-efficient inference engine for large language models. It’s designed for production workloads with support for PagedAttention, continuous batching, and optimized GPU utilization.

Install vLLM from PyPI:
pip install vllm

Or install from source:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
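
Either installation path can be checked with a quick sanity test from Python (assuming the install succeeded):
import vllm
print(vllm.__version__)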

Start the OpenAI-compatible API server:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

Once the server is running, test it with a completion request:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 100
  }'

vLLM runs an OpenAI-compatible API at http://localhost:8000/v1.
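
Because the endpoint speaks the OpenAI API, you can also query it with the official openai Python client. A minimal sketch (the api_key value is a placeholder, since the local server does not require authentication by default):
from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Hello, how are you?",
    max_tokens=100,
)
print(completion.choices[0].text)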

In your Evonic AI configuration:

model:
  provider: vllm
  endpoint: http://localhost:8000/v1
  model_name: meta-llama/Meta-Llama-3-8B-Instruct

Control how much GPU memory vLLM reserves with --gpu-memory-utilization (a fraction of total GPU memory; the default is 0.9):
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9

For models that do not fit on a single GPU, shard them across GPUs with tensor parallelism:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4

To cut memory usage further, serve a quantized model (note that --quantization awq expects a checkpoint that has already been quantized with AWQ):
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization awq
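
The same knobs exist in vLLM's offline Python API if you want to run inference without the server. A sketch (parameter names mirror the CLI flags; adjust the values to your hardware):
from vllm import LLM, SamplingParams

# gpu_memory_utilization, tensor_parallel_size, and quantization mirror the server flags above
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    # tensor_parallel_size=4,  # shard a larger model across 4 GPUs
    # quantization="awq",      # requires an AWQ-quantized checkpoint
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)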

vLLM enables continuous batching by default for high throughput.
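
Continuous batching interleaves incoming requests at the token level instead of waiting for a fixed batch to fill. A rough way to observe the effect from the client side is to send several requests concurrently (a sketch using the async OpenAI client; total latency should be much closer to one request than to eight sequential ones):
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        prompt=prompt,
        max_tokens=50,
    )
    return resp.choices[0].text

async def main():
    prompts = [f"Write one sentence about the number {i}." for i in range(8)]
    # The server batches these in-flight requests together on the GPU
    results = await asyncio.gather(*(ask(p) for p in prompts))
    for text in results:
        print(text.strip())

asyncio.run(main())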

If you run into problems, common troubleshooting steps include:

  • Verify GPU compatibility
  • Check CUDA version compatibility
  • Ensure sufficient GPU memory
  • Pre-download model weights
  • Use quantized models
  • Reduce tensor parallelism
  • Reduce gpu_memory_utilization
  • Use smaller models
  • Enable quantization