This project provides a Ray-based solution for deploying 64 instances of the Qwen2.5 0.5B model with llama.cpp on a single node, with efficient resource management.
- Scalable Deployment: Deploy 64 model instances in parallel using Ray
- HTTP API: Each instance exposes an OpenAI-compatible HTTP API
- Resource Management: Efficient GPU and CPU resource allocation
- Health Monitoring: Built-in health checks and monitoring
- Load Balancing: Round-robin request distribution
- Fault Tolerance: Automatic instance recovery and error handling
- CUDA-enabled GPU: Required for running 64 model instances efficiently
- llama.cpp: Built with CUDA support
- Python 3.8+
- Qwen2.5 0.5B GGUF model file
- Install Python dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Verify the llama.cpp build:

  ```bash
  ls -la /mnt/weka/home/jianshu.she/jianshu/llama.cpp/build/bin/llama-cli
  ```
- Download the Qwen2.5 0.5B model (if not already available):

  ```bash
  # Example download command
  wget https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_0.gguf
  ```
Edit `config.py` to set your model path:

```python
# Set this to your Qwen2.5 0.5B GGUF model path
MODEL_PATH = "/path/to/your/qwen2.5-0.5b-instruct-q4_0.gguf"

# Optionally adjust other parameters
NUM_INSTANCES = 64
BASE_PORT = 8000
CONTEXT_SIZE = 2048
GPU_LAYERS = 32
```

Start the deployment:

```bash
# Set your model path in config.py first
python ray_llama_server_deployment.py
```

This will:
- Initialize Ray cluster
- Deploy 64 server instances on ports 8000-8063
- Perform health checks
- Run example inference requests
- Keep running until interrupted
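To make the port and resource math concrete, here is a minimal sketch of how each instance's port and GPU share can be derived. `InstanceSpec`, `build_server_command`, and the llama-cli flags shown are illustrative assumptions, not the project's actual code:

```python
from dataclasses import dataclass
from typing import List

# Assumed values mirroring config.py
NUM_INSTANCES = 64
BASE_PORT = 8000
CONTEXT_SIZE = 2048
GPU_LAYERS = 32


@dataclass
class InstanceSpec:
    instance_id: int

    @property
    def port(self) -> int:
        # Instance i serves on BASE_PORT + i, i.e. ports 8000-8063
        return BASE_PORT + self.instance_id

    @property
    def gpu_fraction(self) -> float:
        # Each instance is granted an equal slice of the single GPU
        return 1.0 / NUM_INSTANCES


def build_server_command(spec: InstanceSpec, model_path: str) -> List[str]:
    # Hypothetical argv for one llama.cpp process; the real flags may differ
    return [
        "llama-cli",
        "--model", model_path,
        "--ctx-size", str(CONTEXT_SIZE),
        "--n-gpu-layers", str(GPU_LAYERS),
    ]
```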
In another terminal:

```bash
python client_example.py
```

This will:
- Check health of all instances
- Send test requests
- Run parallel benchmarks
- Show performance metrics
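The round-robin distribution the client performs can be sketched as follows. This is a simplified illustration; `RoundRobinRouter` is a hypothetical name, not the class in `client_example.py`:

```python
import itertools


class RoundRobinRouter:
    """Cycles requests across instance ports 8000..8000+N-1."""

    def __init__(self, base_port: int = 8000, num_instances: int = 64):
        self._ports = itertools.cycle(range(base_port, base_port + num_instances))

    def next_url(self) -> str:
        # Each call targets the next instance in turn, wrapping around
        return "http://localhost:{}/v1/chat/completions".format(next(self._ports))
```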
Send requests to any instance:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is AI?"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Health check:
```bash
curl http://localhost:8000/health
```

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Ray Head     │     │   Instance 0    │     │   Instance 63   │
│                 │     │   Port 8000     │     │   Port 8063     │
│ - Coordination  │     │ - Flask Server  │     │ - Flask Server  │
│ - Health Checks │     │ - llama-cli     │     │ - llama-cli     │
│ - Load Balance  │     │ - 1/64 GPU      │     │ - 1/64 GPU      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                        ┌─────────────────┐
                        │     Client      │
                        │ - Round Robin   │
                        │ - Health Check  │
                        │ - Benchmarking  │
                        └─────────────────┘
```
- Each instance uses ~1/64 of GPU memory
- Adjust `GPU_LAYERS` in config based on your GPU memory
- Monitor GPU utilization:

  ```bash
  nvidia-smi
  ```
- Each instance uses 1 CPU core
- Total: 64 CPU cores recommended
- Adjust `THREADS_PER_INSTANCE` if needed
- Uses ports 8000-8063 by default
- Ensure firewall allows these ports
- Monitor network bandwidth for high throughput
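Before starting the deployment, you can verify that the port range is free. This is a stdlib-only sketch, not a utility shipped with the project:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.2)
        return s.connect_ex((host, port)) == 0


def conflicting_ports(base_port: int = 8000, num_instances: int = 64):
    # Ports the deployment needs that are already taken
    return [p for p in range(base_port, base_port + num_instances) if port_in_use(p)]
```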
The deployment automatically monitors:
- Process health
- HTTP endpoint availability
- Response time
- Memory usage
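A per-instance health probe along these lines can be written with the standard library alone; `check_health` is an illustrative helper, not the project's monitoring code:

```python
import urllib.error
import urllib.request


def check_health(port: int, timeout: float = 2.0) -> bool:
    """Probe one instance's /health endpoint; False on any failure."""
    url = "http://localhost:{}/health".format(port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```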
Monitor deployment logs:

```bash
tail -f ray_deployment.log
```

Monitor system resources:

```bash
# GPU usage
nvidia-smi -l 1

# CPU and memory
htop

# Network
iftop
```

- Model file not found:
  - Verify `MODEL_PATH` in `config.py`
  - Check file permissions
- Port conflicts:
  - Adjust `BASE_PORT` in config
  - Check for other services using the ports
- GPU memory errors:
  - Reduce `GPU_LAYERS`
  - Reduce `NUM_INSTANCES`
  - Check available GPU memory
- Instance startup failures:
  - Check the llama.cpp build
  - Verify the CUDA installation
  - Review error logs
Enable debug logging:

```python
# In config.py
LOG_LEVEL = "DEBUG"
```

The deployed instances are compatible with the OpenAI API format:
```
POST /v1/chat/completions
{
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 100,
  "temperature": 0.7
}

GET /health
```
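For programmatic access, a request in this format can be built with the standard library; `build_chat_request` is a hypothetical helper shown for illustration:

```python
import json
import urllib.request


def build_chat_request(port: int, prompt: str, max_tokens: int = 100,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Construct (but do not send) an OpenAI-compatible chat request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        "http://localhost:{}/v1/chat/completions".format(port),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then `urllib.request.urlopen(build_chat_request(8000, "What is AI?"))`.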
- Increase `NUM_INSTANCES` in config
- Ensure sufficient GPU memory
- Monitor performance vs. resource usage
- Increase `CONTEXT_SIZE` for longer conversations
- Adjust `GPU_LAYERS` for the speed vs. memory trade-off
- Tune `THREADS_PER_INSTANCE` based on available CPU cores
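As back-of-the-envelope arithmetic for these trade-offs (illustrative helpers, not project utilities): splitting a 24 GiB GPU evenly across 64 instances leaves roughly 384 MB each, which bounds how many `GPU_LAYERS` each instance can offload.

```python
import os


def suggested_threads(num_instances: int = 64) -> int:
    # One CPU core per instance, as recommended above; never below 1
    cores = os.cpu_count() or num_instances
    return max(1, cores // num_instances)


def per_instance_vram_mb(total_vram_mb: float, num_instances: int = 64) -> float:
    # Even VRAM split across instances: 24576 MB / 64 = 384 MB each
    return total_vram_mb / num_instances
```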
This project follows the same license as llama.cpp.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request