Ray-based Llama.cpp Deployment for 64 Qwen2.5 0.5B Models

This project provides a Ray-based solution to deploy 64 instances of the Qwen2.5 0.5B model using llama.cpp on a single node, with efficient resource management.

Features

  • Scalable Deployment: Deploy 64 model instances in parallel using Ray
  • HTTP API: Each instance exposes an OpenAI-compatible HTTP API
  • Resource Management: Efficient GPU and CPU resource allocation
  • Health Monitoring: Built-in health checks and monitoring
  • Load Balancing: Round-robin request distribution
  • Fault Tolerance: Automatic instance recovery and error handling

Prerequisites

  1. CUDA-enabled GPU: Required for running 64 model instances efficiently
  2. llama.cpp: Built with CUDA support
  3. Python 3.8+
  4. Qwen2.5 0.5B GGUF model file

Installation

  1. Install Python dependencies:

    pip install -r requirements.txt
  2. Verify llama.cpp build:

    ls -la /mnt/weka/home/jianshu.she/jianshu/llama.cpp/build/bin/llama-cli
  3. Download Qwen2.5 0.5B model (if not already available):

    # Example download command
    wget https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_0.gguf

Configuration

Edit config.py to set your model path:

# Set this to your Qwen2.5 0.5B GGUF model path
MODEL_PATH = "/path/to/your/qwen2.5-0.5b-instruct-q4_0.gguf"

# Optionally adjust other parameters
NUM_INSTANCES = 64
BASE_PORT = 8000
CONTEXT_SIZE = 2048
GPU_LAYERS = 32

Usage

1. Deploy All Instances

# Set your model path in config.py first
python ray_llama_server_deployment.py

This will:

  • Initialize Ray cluster
  • Deploy 64 server instances on ports 8000-8063
  • Perform health checks
  • Run example inference requests
  • Keep running until interrupted
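
Conceptually, each instance is a Ray actor that owns one llama.cpp server process and a fractional GPU share. The sketch below is illustrative only: the project's actual code wraps llama-cli in a Flask server (see Architecture), whereas this sketch uses llama.cpp's built-in llama-server binary for brevity, and the class name and paths are assumptions.

# Illustrative sketch, not the project's actual deployment code.
import subprocess
import ray

@ray.remote(num_cpus=1, num_gpus=1 / 64)  # one CPU core and 1/64 of the GPU per instance
class LlamaInstance:
    def __init__(self, model_path: str, port: int, gpu_layers: int = 32):
        self.port = port
        # Launch one llama.cpp HTTP server bound to this instance's port.
        self.proc = subprocess.Popen([
            "/path/to/llama.cpp/build/bin/llama-server",
            "--model", model_path,
            "--port", str(port),
            "--ctx-size", "2048",
            "--n-gpu-layers", str(gpu_layers),
        ])

    def is_alive(self) -> bool:
        # Process-level health: poll() returns None while the server runs.
        return self.proc.poll() is None

ray.init()
instances = [
    LlamaInstance.remote("/path/to/qwen2.5-0.5b-instruct-q4_0.gguf", port=8000 + i)
    for i in range(64)
]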

2. Test the Deployment

In another terminal:

python client_example.py

This will:

  • Check health of all instances
  • Send test requests
  • Run parallel benchmarks
  • Show performance metrics
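
The round-robin pattern the client implements can be sketched in a few lines (this is an illustration, not client_example.py itself; the payload follows the API shown in the next section):

import itertools
import requests

# Cycle through all 64 instance endpoints in round-robin order.
endpoints = itertools.cycle(
    f"http://localhost:{port}" for port in range(8000, 8064)
)

def chat(prompt: str, max_tokens: int = 100) -> dict:
    base = next(endpoints)  # pick the next instance
    resp = requests.post(
        f"{base}/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
            "temperature": 0.7,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

print(chat("What is AI?"))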

3. Manual API Testing

Send requests to any instance:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is AI?"}],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Health check:

curl http://localhost:8000/health

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Ray Head      │    │  Instance 0     │    │  Instance 63    │
│                 │    │  Port 8000      │    │  Port 8063      │
│  - Coordination │    │  - Flask Server │    │  - Flask Server │
│  - Health Checks│    │  - llama-cli    │    │  - llama-cli    │
│  - Load Balance │    │  - 1/64 GPU     │    │  - 1/64 GPU     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────┐
                    │   Client        │
                    │  - Round Robin  │
                    │  - Health Check │
                    │  - Benchmarking │
                    └─────────────────┘

Performance Considerations

GPU Memory Management

  • Each instance uses ~1/64 of GPU memory
  • Adjust GPU_LAYERS in config.py based on your GPU memory
  • Monitor GPU utilization: nvidia-smi
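
As a back-of-envelope check (the 80 GB figure below assumes an A100-class GPU; substitute your own total):

# Rough per-instance VRAM budget; adjust total_vram_gb to your GPU.
total_vram_gb = 80.0                              # assumed A100 80GB
num_instances = 64
per_instance_gb = total_vram_gb / num_instances   # = 1.25 GB each
model_weights_gb = 0.4                            # qwen2.5-0.5b q4_0 weights are roughly this size
headroom_gb = per_instance_gb - model_weights_gb  # left for KV cache and runtime overhead
print(f"{per_instance_gb:.2f} GB/instance, ~{headroom_gb:.2f} GB headroom")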

CPU Allocation

  • Each instance uses 1 CPU core
  • Total: 64 CPU cores recommended
  • Adjust THREADS_PER_INSTANCE if needed

Network Resources

  • Uses ports 8000-8063 by default
  • Ensure firewall allows these ports
  • Monitor network bandwidth for high throughput

Monitoring

Health Checks

The deployment automatically monitors:

  • Process health
  • HTTP endpoint availability
  • Response time
  • Memory usage
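
A minimal health sweep over all instances could look like this (illustrative; client_example.py performs the project's actual checks):

import requests

def sweep_health(base_port: int = 8000, n: int = 64) -> list:
    """Return the ports whose /health endpoint did not answer 200 OK."""
    unhealthy = []
    for port in range(base_port, base_port + n):
        try:
            resp = requests.get(f"http://localhost:{port}/health", timeout=2)
            if resp.status_code != 200:
                unhealthy.append(port)
        except requests.RequestException:
            unhealthy.append(port)
    return unhealthy

print("Unhealthy ports:", sweep_health())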

Logs

Monitor deployment logs:

tail -f ray_deployment.log

Resource Monitoring

# GPU usage
nvidia-smi -l 1

# CPU and memory
htop

# Network
iftop

Troubleshooting

Common Issues

  1. Model file not found:

    • Verify MODEL_PATH in config.py
    • Check file permissions
  2. Port conflicts:

    • Adjust BASE_PORT in config.py
    • Check for other services using ports
  3. GPU memory errors:

    • Reduce GPU_LAYERS
    • Reduce NUM_INSTANCES
    • Check available GPU memory
  4. Instance startup failures:

    • Check llama.cpp build
    • Verify CUDA installation
    • Review error logs

Debug Mode

Enable debug logging:

# In config.py
LOG_LEVEL = "DEBUG"
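
One plausible way LOG_LEVEL could be wired up (the exact wiring in this project may differ):

# Illustrative logging setup; log file name follows the Logs section above.
import logging
from config import LOG_LEVEL  # e.g. "DEBUG"

logging.basicConfig(
    level=getattr(logging, LOG_LEVEL),
    filename="ray_deployment.log",
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)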

API Compatibility

The deployed instances are compatible with the OpenAI API format:

Chat Completions

POST /v1/chat/completions
{
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 100,
  "temperature": 0.7
}

Health Check

GET /health
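
Because the format is OpenAI-compatible, the official openai Python client can also be pointed at any instance. A short sketch (the model name is a placeholder; llama.cpp servers serve whatever model they were started with, largely regardless of this field):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-0.5b-instruct",  # placeholder; the server decides the actual model
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100,
    temperature=0.7,
)
print(resp.choices[0].message.content)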

Scaling

Horizontal Scaling

  • Increase NUM_INSTANCES in config.py
  • Ensure sufficient GPU memory
  • Monitor performance vs. resource usage

Vertical Scaling

  • Increase CONTEXT_SIZE for longer conversations
  • Adjust GPU_LAYERS for speed vs. memory trade-off
  • Tune THREADS_PER_INSTANCE based on CPU cores

License

This project follows the same license as llama.cpp.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request
