A containerized machine learning model serving platform that demonstrates how to deploy and manage ML models using Docker. This project provides a FastAPI-based REST API for model inference, with support for multiple models including Llama 3.2, health monitoring, and easy deployment.
- Docker-based deployment - Easy containerization and deployment
- FastAPI REST API - High-performance async API server
- Llama 3.2 Integration - Support for Meta's Llama 3.2 1B Instruct model
- Dynamic model management - Load/unload models at runtime
- Health monitoring - Built-in health checks and metrics
- Demo models included - Ready-to-use classification model + Llama
- Client library - Python client for easy testing and integration
- Performance testing - Built-in performance benchmarking
- Security - Non-root container execution
- CORS support - Cross-origin resource sharing enabled
- Mixed model support - Traditional ML models + Large Language Models
- Docker and Docker Compose (with at least 8GB RAM available)
- Python 3.11+ (for client testing)
- Note: Llama 3.2 models require significant memory. Ensure your system has at least 8GB RAM.
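To sanity-check the 8GB guideline on Linux, you could parse `/proc/meminfo` before starting the stack. This helper is an illustrative sketch, not part of the project:

```python
# Parse MemAvailable from /proc/meminfo-style text and compare it
# against the 8GB guideline above. Illustrative helper, not part of
# the project's codebase.
def mem_available_gb(meminfo_text: str) -> float:
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])  # value is reported in kB
            return kb / (1024 ** 2)
    raise ValueError("MemAvailable not found")

def enough_for_llama(meminfo_text: str, required_gb: float = 8.0) -> bool:
    return mem_available_gb(meminfo_text) >= required_gb

# Usage on Linux:
# with open("/proc/meminfo") as f:
#     print(enough_for_llama(f.read()))
```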
```bash
git clone https://github.com/prashplus/docker-model-runner.git
cd docker-model-runner

# Build and start the service
docker-compose up --build

# Or run in background
docker-compose up --build -d
```

Choose the script that works best for your environment:
```bash
# Install dependencies first
pip install -r client-requirements.txt

# Run the Python quick start script
python quick-start.py
```

Open PowerShell and run:
```powershell
.\scripts\quick-start.ps1

# With options:
.\scripts\quick-start.ps1 -SkipBuild   # Skip building if image exists
.\scripts\quick-start.ps1 -SkipTest    # Skip running tests
.\scripts\quick-start.ps1 -WaitTime 60 # Wait longer for startup

# If you have execution policy issues:
powershell -ExecutionPolicy Bypass -File .\scripts\quick-start.ps1
```

Run from Command Prompt:
```batch
.\scripts\quick-start.bat
```

Make the shell script executable and run it:

```bash
chmod +x scripts/quick-start.sh
./scripts/quick-start.sh
```

First, install client dependencies:
```bash
# Windows PowerShell
.\setup-client.ps1

# Windows Batch
.\setup-client.bat

# Linux/Mac
chmod +x setup-client.sh
./setup-client.sh

# Or install the dependencies directly
pip install -r client-requirements.txt
```

Then run the tests:
```bash
# Run automated tests
python client.py --mode test

# Or try interactive mode
python client.py --mode interactive
```

Build and run the container manually:

```bash
docker build -t model-runner .

# Basic run
docker run -p 8000:8000 model-runner

# With volume mounting for persistent models
docker run -p 8000:8000 -v $(pwd)/models:/app/models model-runner

# With environment variables
docker run -p 8000:8000 -e HOST=0.0.0.0 -e PORT=8000 model-runner
```

Once the server is running, visit:
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
- Model List: http://localhost:8000/models
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | API information |
| GET | `/health` | Health check with metrics |
| GET | `/models` | List all available models |
| POST | `/predict` | Make predictions (ML models) or generate text (Llama) |
| POST | `/generate` | Generate text using Llama models |
| POST | `/models/{name}/load` | Load a specific model |
| DELETE | `/models/{name}` | Unload a model |
| POST | `/models/upload` | Upload a new model file |
```bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [[1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 2.5, 3.5]],
    "model_name": "default"
  }'
```

Generate text with Llama:

```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain artificial intelligence in simple terms",
    "model_name": "llama3.2",
    "max_tokens": 150,
    "temperature": 0.7
  }'
```

Llama can also be reached through the `/predict` endpoint:

```bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "data": "What is Docker and how does it work?",
    "model_name": "llama3.2"
  }'
```

Health check and model list:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/models
```

The included Python client provides easy programmatic access to the API.
```python
import asyncio
from client import ModelRunnerClient

async def main():
    async with ModelRunnerClient("http://localhost:8000") as client:
        # Check health
        health = await client.health_check()
        print(f"Server status: {health['status']}")

        # Make traditional ML prediction
        data = [[1.0, 2.0, 3.0, 4.0]]
        result = await client.predict(data)
        print(f"ML Prediction: {result['predictions']}")

        # Generate text with Llama
        text_result = await client.generate_text("What is machine learning?")
        print(f"Llama Response: {text_result['generated_text']}")

asyncio.run(main())
```

Or run the client from the command line:

```bash
# Run automated tests
python client.py --mode test --url http://localhost:8000

# Interactive mode
python client.py --mode interactive --url http://localhost:8000
```

This project includes Meta's Llama 3.2 1B Instruct model, which provides:
- 1 billion parameters - Compact yet powerful
- Instruction following - Optimized for chat and instruction tasks
- Efficient inference - Suitable for containerized deployment
- Broad language support - Multilingual capabilities
```bash
# Simple question answering
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain Docker containers in 2 sentences",
    "max_tokens": 100,
    "temperature": 0.5
  }'

# Creative writing
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a haiku about machine learning",
    "max_tokens": 50,
    "temperature": 0.8
  }'
```

Using the Python client:

```python
async with ModelRunnerClient() as client:
    # Technical explanation
    result = await client.generate_text(
        "How do neural networks learn?",
        max_tokens=200,
        temperature=0.6
    )
    print(result['generated_text'])

    # Code explanation
    result = await client.generate_text(
        "Explain this Python code: def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
        max_tokens=150
    )
    print(result['generated_text'])
```

Save your trained scikit-learn model as a pickle file:
```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Train your model
model = RandomForestClassifier()
# ... training code ...

# Save the model
with open('my_model.pkl', 'wb') as f:
    pickle.dump(model, f)
```

Place your `.pkl` files in the `models/` directory before building:

```bash
cp my_model.pkl models/
docker-compose up --build
```

Use the upload endpoint:
```bash
curl -X POST "http://localhost:8000/models/upload" \
  -F "file=@my_model.pkl"
```

Mount your models directory:

```bash
docker run -p 8000:8000 -v /path/to/your/models:/app/models model-runner
```

Then load the model and make predictions:

```bash
# Load the model
curl -X POST "http://localhost:8000/models/my_model/load"

# Make predictions
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"data": [[1,2,3,4]], "model_name": "my_model"}'
```

| Variable | Default | Description |
|---|---|---|
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
Modify docker-compose.yml to customize:
- Port mappings
- Volume mounts
- Environment variables
- Resource limits
- Network settings
Example with resource limits:
```yaml
services:
  model-runner:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
```

Project structure:

```
docker-model-runner/
├── src/
│   ├── server.py            # FastAPI application
│   └── model_manager.py     # Model management logic
├── models/                  # Model storage directory
├── scripts/                 # Helper scripts
│   ├── quick-start.sh       # Linux/Mac quick start
│   └── quick-start.ps1      # Windows quick start
├── client.py                # Python client library
├── Dockerfile               # Container definition
├── docker-compose.yml       # Multi-container setup
├── requirements.txt         # Python dependencies
├── client-requirements.txt  # Client dependencies
└── README.md                # This file
```
For development with auto-reload:
```bash
# Install dependencies locally
pip install -r requirements.txt

# Run with uvicorn directly
cd src
uvicorn server:app --reload --host 0.0.0.0 --port 8000
```

The current implementation supports scikit-learn models and Llama models. To add support for other frameworks:

- Extend the `ModelManager` class in `src/model_manager.py`
- Add framework-specific loading logic
- Handle different prediction methods
- Update model info metadata
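The "different prediction methods" step could be sketched as a small dispatch helper. The attribute checks below are assumptions for illustration, not the project's actual logic:

```python
# Hypothetical sketch: route a request to whatever inference method
# the loaded model exposes. The attribute names checked here are
# common conventions (scikit-learn, Keras), not guaranteed by every
# framework.
def run_inference(model, data):
    if hasattr(model, "predict_proba"):   # scikit-learn classifiers
        return model.predict_proba(data)
    if hasattr(model, "predict"):         # scikit-learn / Keras models
        return model.predict(data)
    if callable(model):                   # generic callables, e.g. pipelines
        return model(data)
    raise TypeError(f"Unsupported model type: {type(model).__name__}")
```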
Example for TensorFlow:

```python
# In model_manager.py
import tensorflow as tf

async def load_tensorflow_model(self, model_path: str, model_name: str):
    model = tf.keras.models.load_model(model_path)
    self.models[model_name] = model
    # ... update model_info
```

Example for other Hugging Face models:

```python
# In model_manager.py
from transformers import AutoTokenizer, AutoModelForSequenceClassification

async def load_custom_transformer(self, model_name: str, hf_model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
    model = AutoModelForSequenceClassification.from_pretrained(hf_model_name)
    self.models[model_name] = model
    self.tokenizers[model_name] = tokenizer
    # ... update model_info
```

The client includes comprehensive tests:
```bash
python client.py --mode test
```

Test categories:

- Basic functionality (health, model listing)
- Traditional ML prediction accuracy and performance
- Llama text generation capabilities
- Mixed model testing (ML + LLM)
- Model management (load/unload)
- Error handling
- Performance benchmarking

Use interactive mode for manual exploration:

```bash
python client.py --mode interactive
```

Available commands:

- `health` - Check server health
- `models` - List available models
- `predict` - Make a prediction with sample data
- `generate <prompt>` - Generate text with Llama
- `llama` - Quick Llama test
- `load <model>` - Load a specific model
- `unload <model>` - Unload a model
- `quit` - Exit interactive mode
The client includes performance benchmarking:
```bash
# Run 100 prediction requests
python -c "
import asyncio
from client import ModelRunnerClient, test_performance  # test_performance is assumed to live in client.py

async def test():
    async with ModelRunnerClient() as client:
        await test_performance(client, num_requests=100)

asyncio.run(test())
"
```

The service includes built-in health checks:
```bash
# Docker health check
docker ps   # Shows health status

# Manual health check
curl http://localhost:8000/health
```

Health response includes:
- Service status
- Number of loaded models
- Uptime in seconds
- Timestamp
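A client might interpret that response as follows. The field names and sample payload are illustrative assumptions; the exact keys are defined by the server:

```python
# Summarize a /health response. The keys follow the documented fields
# above but are assumptions; the sample payload is illustrative, not
# captured from a live server.
def summarize_health(payload: dict) -> str:
    status = payload.get("status", "unknown")
    models = payload.get("models_loaded", 0)
    uptime = payload.get("uptime_seconds", 0)
    return f"{status}: {models} model(s) loaded, up {uptime / 60:.1f} min"

sample = {"status": "healthy", "models_loaded": 2, "uptime_seconds": 300}
print(summarize_health(sample))
```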
Logs are written to stdout and can be viewed with:
```bash
# Docker Compose logs
docker-compose logs -f

# Docker logs
docker logs <container_id> -f
```

Log levels can be configured via environment variables.
Basic metrics are available through the health endpoint. For production monitoring, consider integrating:
- Prometheus metrics
- Application Performance Monitoring (APM)
- Custom logging solutions
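As an illustration of the first option, basic health data could be rendered in the Prometheus text exposition format. The metric names here are assumptions, not part of the project:

```python
# Render basic health data in Prometheus text exposition format.
# Metric names are illustrative assumptions, not exposed by the
# project as-is.
def render_metrics(models_loaded: int, uptime_seconds: float) -> str:
    lines = [
        "# TYPE model_runner_models_loaded gauge",
        f"model_runner_models_loaded {models_loaded}",
        "# TYPE model_runner_uptime_seconds counter",
        f"model_runner_uptime_seconds {uptime_seconds}",
    ]
    return "\n".join(lines) + "\n"
```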
- Non-root execution: Container runs as non-root user
- Resource limits: Set appropriate CPU/memory limits
- Network security: Use proper firewall rules
- Model validation: Validate uploaded models
- Input sanitization: Validate API inputs
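As one concrete instance of the last two points, a prediction-payload validator might look like this. It is an illustrative sketch, not the project's actual validation code:

```python
# Illustrative input check for /predict payloads: every row must be a
# non-empty list of finite numbers with a consistent width. Sketch
# only; the project's real validation may differ.
import math

def validate_rows(data):
    if not isinstance(data, list) or not data:
        raise ValueError("data must be a non-empty list of rows")
    width = None
    for row in data:
        if not isinstance(row, list) or not row:
            raise ValueError("each row must be a non-empty list")
        if width is None:
            width = len(row)
        elif len(row) != width:
            raise ValueError("all rows must have the same length")
        for x in row:
            if isinstance(x, bool) or not isinstance(x, (int, float)) or not math.isfinite(x):
                raise ValueError("row values must be finite numbers")
    return data
```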
To scale horizontally, run multiple replicas:

```yaml
# docker-compose.yml
services:
  model-runner:
    # ... configuration ...
    deploy:
      replicas: 3
```

Use a reverse proxy like nginx:

```nginx
upstream model_runners {
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 80;
    location / {
        proxy_pass http://model_runners;
    }
}
```

For production, use persistent volumes:
```yaml
services:
  model-runner:
    volumes:
      - model_data:/app/models
      - log_data:/app/logs

volumes:
  model_data:
  log_data:
```

If the PowerShell script doesn't run:
```powershell
# Check current execution policy
Get-ExecutionPolicy

# Temporarily allow script execution (run as Administrator)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Or run the script with bypass
powershell -ExecutionPolicy Bypass -File .\scripts\quick-start.ps1

# Alternative: Use the batch file instead
.\scripts\quick-start.bat
```

If you get "ModuleNotFoundError" when running the client:
```bash
# Install client dependencies
pip install -r client-requirements.txt

# Or use the setup scripts
.\setup-client.ps1   # Windows PowerShell
.\setup-client.bat   # Windows Batch
./setup-client.sh    # Linux/Mac

# Check if dependencies are installed
python -c "import aiohttp; print('aiohttp OK')"
python -c "import requests; print('requests OK')"
```

If port 8000 is already in use:

```bash
# Find process using port 8000
lsof -i :8000                  # Linux/Mac
netstat -ano | findstr :8000   # Windows

# Kill the process or use a different port
docker run -p 8001:8000 model-runner
```

If a model fails to load:

```bash
# Check model files
docker exec -it <container> ls -la /app/models/

# Check logs
docker logs <container>

# Validate model format
python -c "import pickle; pickle.load(open('model.pkl', 'rb'))"
```
If the container runs out of memory:

```bash
# Check container resource usage
docker stats

# Increase memory limits for Llama models
docker run -m 8g model-runner
```

For Docker Compose, edit `docker-compose.yml`:

```yaml
deploy:
  resources:
    limits:
      memory: 8G
```

If the Llama model fails to download or load:

```bash
# Check if the model is downloading
docker logs <container> | grep -i llama

# Check available disk space for the model cache
docker exec -it <container> df -h /app/cache

# Manual model loading test
docker exec -it <container> python -c "
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')
print('Llama model accessible')
"
```

Enable debug logging:
```bash
docker run -e LOG_LEVEL=DEBUG -p 8000:8000 model-runner
```

To contribute:

- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes and add tests
- Ensure all tests pass: `python client.py --mode test`
- Submit a pull request
```bash
# Clone and setup
git clone <your-fork>
cd docker-model-runner

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt
pip install -r client-requirements.txt

# Run tests
python client.py --mode test
```

This project is open source. Please check the LICENSE file for details.
Note on Llama 3.2: The Llama 3.2 model is subject to Meta's custom license. Please review the Llama 3.2 Community License Agreement before commercial use.
For issues and questions:

- Create an issue on GitHub
- Check the troubleshooting section
- Review the API documentation at `/docs`
- For Llama-specific issues, check the model loading logs
- Minimum: 4GB RAM, 2 CPU cores
- Recommended for Llama: 8GB RAM, 4 CPU cores
- GPU: Optional but recommended for Llama inference
- Storage: ~2-3GB for Llama model cache
- First container start may take 5-10 minutes (Llama model download)
- Model files are cached in Docker volume for subsequent runs
- Use `docker logs <container>` to monitor loading progress
```bash
# Quick Llama test
python test_llama.py

# Interactive mode with Llama commands
python client.py --mode interactive
# Then try: llama, generate Hello!, etc.

# Full test suite
python client.py --mode test
```

Happy Model Serving with Llama 3.2!
- Prashant Piprotar - Prash+
Visit my blog for more Tech Stuff