# vLLM Setup
vLLM is a high-throughput and memory-efficient inference engine for large language models. It’s designed for production workloads with support for PagedAttention, continuous batching, and optimized GPU utilization.
## Installation

### From pip

```bash
pip install vllm
```

### From Source

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```
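A quick way to confirm the install succeeded, assuming vLLM is importable in the current Python environment:

```bash
# Print the installed vLLM version; fails immediately if the install is broken.
python -c "import vllm; print(vllm.__version__)"
```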
## Getting Started

### 1. Start the vLLM Server

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000
```

### 2. Test the Server
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 100
  }'
```
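The server also exposes the OpenAI chat completions endpoint, which is usually the better fit for chat-tuned models like this one. For example:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 100
  }'
```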
## Configuration with Evonic AI

### API Endpoint

vLLM serves an OpenAI-compatible API at http://localhost:8000/v1.
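To confirm the endpoint is reachable and see which models it is serving, query the standard models route:

```bash
# Lists the models currently hosted by the server.
curl http://localhost:8000/v1/models
```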
### Configuration

In your Evonic AI configuration:

```yaml
model:
  provider: vllm
  endpoint: http://localhost:8000/v1
  model_name: meta-llama/Meta-Llama-3-8B-Instruct
```

## Advanced Configuration
### GPU Memory

Control the fraction of GPU memory vLLM is allowed to reserve with `--gpu-memory-utilization` (the default is 0.9):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9
```

### Tensor Parallelism (Multi-GPU)
Shard a model that is too large for a single GPU across several GPUs with `--tensor-parallel-size`:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4
```

### Quantization
Serve quantized weights to reduce memory use, at some cost in accuracy. Note that the model repository must provide weights in the chosen format (here, AWQ):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization awq
```

### Continuous Batching
vLLM enables continuous batching by default: incoming requests join the running batch as soon as capacity frees up, rather than waiting for the current batch to finish, which keeps the GPU saturated under concurrent load.
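To see the effect, fire several requests at the server concurrently. A minimal sketch using the completions endpoint from earlier (the prompt and token budget are arbitrary):

```bash
# Send 8 requests in parallel; vLLM batches them on the fly.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "prompt": "Write a haiku about GPUs.",
      "max_tokens": 50
    }' &
done
wait
```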
## Troubleshooting

### CUDA Errors
Section titled “CUDA Errors”- Verify GPU compatibility
- Check CUDA version compatibility
- Ensure sufficient GPU memory
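Two quick checks, assuming an NVIDIA GPU and the PyTorch build that vLLM installed:

```bash
# Confirm the driver sees the GPU and report its memory usage.
nvidia-smi

# Confirm PyTorch can use CUDA and report the CUDA runtime version it was built against.
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```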
### Slow Startup

- Pre-download model weights (see the command below)
- Use quantized models
- Reduce tensor parallelism
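Weights can be fetched ahead of time so server startup skips the download. One way is the Hugging Face CLI (requires `huggingface_hub`; gated models such as Llama 3 also need an access token):

```bash
# Download the weights into the local Hugging Face cache ahead of time.
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
```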
### Memory Issues

- Reduce `--gpu-memory-utilization` (see the example below)
- Use smaller models
- Enable quantization
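For example, a more conservative launch that also caps the context length, which shrinks the KV cache (the 0.7 and 4096 values here are illustrative starting points, not recommendations):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.7 \
  --max-model-len 4096
```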