A containerized machine learning model serving platform that demonstrates how to deploy and manage ML models using Docker. This project provides a FastAPI-based REST API for model inference, with support for multiple models including Llama 3.2, health monitoring, and easy deployment.
- Docker-based deployment - Easy containerization and deployment
- FastAPI REST API - High-performance async API server
- Llama 3.2 Integration - Support for Meta's Llama 3.2 1B Instruct model
- Dynamic model management - Load/unload models at runtime
- Health monitoring - Built-in health checks and metrics
- Demo models included - Ready-to-use classification model + Llama
- Client library - Python client for easy testing and integration
- Performance testing - Built-in performance benchmarking
- Security - Non-root container execution
- CORS support - Cross-origin resource sharing enabled
- Mixed model support - Traditional ML models + Large Language Models
- Docker and Docker Compose (with at least 8GB RAM available)
- Python 3.11+ (for client testing)
- Note: Llama 3.2 models require significant memory. Ensure your system has at least 8GB RAM.
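To sanity-check the 8GB guideline on Linux, you could parse `/proc/meminfo` before starting the stack. This helper is an illustrative sketch, not part of the project:

```python
# Parse MemAvailable from /proc/meminfo-style text and compare it
# against the 8GB guideline above. Illustrative helper, not part of
# the project's codebase.
def mem_available_gb(meminfo_text: str) -> float:
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])  # value is reported in kB
            return kb / (1024 ** 2)
    raise ValueError("MemAvailable not found")

def enough_for_llama(meminfo_text: str, required_gb: float = 8.0) -> bool:
    return mem_available_gb(meminfo_text) >= required_gb

# Usage on Linux:
# with open("/proc/meminfo") as f:
#     print(enough_for_llama(f.read()))
```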
```bash
git clone https://github.com/prashplus/docker-model-runner.git
cd docker-model-runner

# Build and start the service
docker-compose up --build

# Or run in background
docker-compose up --build -d
```

Choose the script that works best for your environment:
```bash
# Install dependencies first
pip install -r client-requirements.txt

# Run the Python quick start script
python quick-start.py
```

Open PowerShell and run:
```powershell
.\scripts\quick-start.ps1

# With options:
.\scripts\quick-start.ps1 -SkipBuild   # Skip building if image exists
.\scripts\quick-start.ps1 -SkipTest    # Skip running tests
.\scripts\quick-start.ps1 -WaitTime 60 # Wait longer for startup

# If you have execution policy issues:
powershell -ExecutionPolicy Bypass -File .\scripts\quick-start.ps1
```

Run from Command Prompt:
```batch
.\scripts\quick-start.bat
```

Make the shell script executable and run it:

```bash
chmod +x scripts/quick-start.sh
./scripts/quick-start.sh
```

First, install client dependencies:
```bash
# Windows PowerShell
.\setup-client.ps1

# Windows Batch
.\setup-client.bat

# Linux/Mac
chmod +x setup-client.sh
./setup-client.sh

# Or install the dependencies directly
pip install -r client-requirements.txt
```

Then run the tests:
```bash
# Run automated tests
python client.py --mode test

# Or try interactive mode
python client.py --mode interactive
```

Build and run the container manually:

```bash
docker build -t model-runner .

# Basic run
docker run -p 8000:8000 model-runner

# With volume mounting for persistent models
docker run -p 8000:8000 -v $(pwd)/models:/app/models model-runner

# With environment variables
docker run -p 8000:8000 -e HOST=0.0.0.0 -e PORT=8000 model-runner
```

Once the server is running, visit:
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
- Model List: http://localhost:8000/models
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | API information |
| GET | `/health` | Health check with metrics |
| GET | `/models` | List all available models |
| POST | `/predict` | Make predictions (ML models) or generate text (Llama) |
| POST | `/generate` | Generate text using Llama models |
| POST | `/models/{name}/load` | Load a specific model |
| DELETE | `/models/{name}` | Unload a model |
| POST | `/models/upload` | Upload a new model file |
```bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [[1.0, 2.0, 3.0, 4.0], [0.5, 1.5, 2.5, 3.5]],
    "model_name": "default"
  }'
```

Generate text with Llama:

```bash
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain artificial intelligence in simple terms",
    "model_name": "llama3.2",
    "max_tokens": 150,
    "temperature": 0.7
  }'
```

Llama can also be reached through the `/predict` endpoint:

```bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "data": "What is Docker and how does it work?",
    "model_name": "llama3.2"
  }'
```

Health check and model list:

```bash
curl http://localhost:8000/health
curl http://localhost:8000/models
```

The included Python client provides easy programmatic access to the API.
```python
import asyncio
from client import ModelRunnerClient

async def main():
    async with ModelRunnerClient("http://localhost:8000") as client:
        # Check health
        health = await client.health_check()
        print(f"Server status: {health['status']}")

        # Make traditional ML prediction
        data = [[1.0, 2.0, 3.0, 4.0]]
        result = await client.predict(data)
        print(f"ML Prediction: {result['predictions']}")

        # Generate text with Llama
        text_result = await client.generate_text("What is machine learning?")
        print(f"Llama Response: {text_result['generated_text']}")

asyncio.run(main())
```

Or run the client from the command line:

```bash
# Run automated tests
python client.py --mode test --url http://localhost:8000

# Interactive mode
python client.py --mode interactive --url http://localhost:8000
```

This project includes Meta's Llama 3.2 1B Instruct model, which provides:
- 1 billion parameters - Compact yet powerful
- Instruction following - Optimized for chat and instruction tasks
- Efficient inference - Suitable for containerized deployment
- Broad language support - Multilingual capabilities
```bash
# Simple question answering
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain Docker containers in 2 sentences",
    "max_tokens": 100,
    "temperature": 0.5
  }'

# Creative writing
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a haiku about machine learning",
    "max_tokens": 50,
    "temperature": 0.8
  }'
```

Using the Python client:

```python
async with ModelRunnerClient() as client:
    # Technical explanation
    result = await client.generate_text(
        "How do neural networks learn?",
        max_tokens=200,
        temperature=0.6
    )
    print(result['generated_text'])

    # Code explanation
    result = await client.generate_text(
        "Explain this Python code: def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
        max_tokens=150
    )
    print(result['generated_text'])
```

Save your trained scikit-learn model as a pickle file:
```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# Train your model
model = RandomForestClassifier()
# ... training code ...

# Save the model
with open('my_model.pkl', 'wb') as f:
    pickle.dump(model, f)
```

Place your `.pkl` files in the `models/` directory before building:

```bash
cp my_model.pkl models/
docker-compose up --build
```

Use the upload endpoint:
```bash
curl -X POST "http://localhost:8000/models/upload" \
  -F "file=@my_model.pkl"
```

Mount your models directory:

```bash
docker run -p 8000:8000 -v /path/to/your/models:/app/models model-runner
```

Then load the model and make predictions:

```bash
# Load the model
curl -X POST "http://localhost:8000/models/my_model/load"

# Make predictions
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"data": [[1,2,3,4]], "model_name": "my_model"}'
```

| Variable | Default | Description |
|---|---|---|
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
Modify docker-compose.yml to customize:
- Port mappings
- Volume mounts
- Environment variables
- Resource limits
- Network settings
Example with resource limits:
```yaml
services:
  model-runner:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
```

Project structure:

```
docker-model-runner/
├── src/
│   ├── server.py            # FastAPI application
│   └── model_manager.py     # Model management logic
├── models/                  # Model storage directory
├── scripts/                 # Helper scripts
│   ├── quick-start.sh       # Linux/Mac quick start
│   └── quick-start.ps1      # Windows quick start
├── client.py                # Python client library
├── Dockerfile               # Container definition
├── docker-compose.yml       # Multi-container setup
├── requirements.txt         # Python dependencies
├── client-requirements.txt  # Client dependencies
└── README.md                # This file
```
For development with auto-reload:
```bash
# Install dependencies locally
pip install -r requirements.txt

# Run with uvicorn directly
cd src
uvicorn server:app --reload --host 0.0.0.0 --port 8000
```

The current implementation supports scikit-learn models and Llama models. To add support for other frameworks:

- Extend the `ModelManager` class in `src/model_manager.py`
- Add framework-specific loading logic
- Handle different prediction methods
- Update model info metadata
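The "different prediction methods" step could be sketched as a small dispatch helper. The attribute checks below are assumptions for illustration, not the project's actual logic:

```python
# Hypothetical sketch: route a request to whatever inference method
# the loaded model exposes. The attribute names checked here are
# common conventions (scikit-learn, Keras), not guaranteed by every
# framework.
def run_inference(model, data):
    if hasattr(model, "predict_proba"):   # scikit-learn classifiers
        return model.predict_proba(data)
    if hasattr(model, "predict"):         # scikit-learn / Keras models
        return model.predict(data)
    if callable(model):                   # generic callables, e.g. pipelines
        return model(data)
    raise TypeError(f"Unsupported model type: {type(model).__name__}")
```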
Example for TensorFlow:

```python
# In model_manager.py
import tensorflow as tf

async def load_tensorflow_model(self, model_path: str, model_name: str):
    model = tf.keras.models.load_model(model_path)
    self.models[model_name] = model
    # ... update model_info
```

Example for other Hugging Face models:

```python
# In model_manager.py
from transformers import AutoTokenizer, AutoModelForSequenceClassification

async def load_custom_transformer(self, model_name: str, hf_model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
    model = AutoModelForSequenceClassification.from_pretrained(hf_model_name)
    self.models[model_name] = model
    self.tokenizers[model_name] = tokenizer
    # ... update model_info
```

The client includes comprehensive tests:
```bash
python client.py --mode test
```

Test categories:

- Basic functionality (health, model listing)
- Traditional ML prediction accuracy and performance
- Llama text generation capabilities
- Mixed model testing (ML + LLM)
- Model management (load/unload)
- Error handling
- Performance benchmarking

Use interactive mode for manual exploration:

```bash
python client.py --mode interactive
```

Available commands:

- `health` - Check server health
- `models` - List available models
- `predict` - Make a prediction with sample data
- `generate <prompt>` - Generate text with Llama
- `llama` - Quick Llama test
- `load <model>` - Load a specific model
- `unload <model>` - Unload a model
- `quit` - Exit interactive mode
The client includes performance benchmarking:
```bash
# Run 100 prediction requests
python -c "
import asyncio
from client import ModelRunnerClient, test_performance  # test_performance is assumed to live in client.py

async def test():
    async with ModelRunnerClient() as client:
        await test_performance(client, num_requests=100)

asyncio.run(test())
"
```

The service includes built-in health checks:
```bash
# Docker health check
docker ps   # Shows health status

# Manual health check
curl http://localhost:8000/health
```

Health response includes:
- Service status
- Number of loaded models
- Uptime in seconds
- Timestamp
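A client might interpret that response as follows. The field names and sample payload are illustrative assumptions; the exact keys are defined by the server:

```python
# Summarize a /health response. The keys follow the documented fields
# above but are assumptions; the sample payload is illustrative, not
# captured from a live server.
def summarize_health(payload: dict) -> str:
    status = payload.get("status", "unknown")
    models = payload.get("models_loaded", 0)
    uptime = payload.get("uptime_seconds", 0)
    return f"{status}: {models} model(s) loaded, up {uptime / 60:.1f} min"

sample = {"status": "healthy", "models_loaded": 2, "uptime_seconds": 300}
print(summarize_health(sample))
```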
Logs are written to stdout and can be viewed with:
```bash
# Docker Compose logs
docker-compose logs -f

# Docker logs
docker logs <container_id> -f
```

Log levels can be configured via environment variables.
Basic metrics are available through the health endpoint. For production monitoring, consider integrating:
- Prometheus metrics
- Application Performance Monitoring (APM)
- Custom logging solutions
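As an illustration of the first option, basic health data could be rendered in the Prometheus text exposition format. The metric names here are assumptions, not part of the project:

```python
# Render basic health data in Prometheus text exposition format.
# Metric names are illustrative assumptions, not exposed by the
# project as-is.
def render_metrics(models_loaded: int, uptime_seconds: float) -> str:
    lines = [
        "# TYPE model_runner_models_loaded gauge",
        f"model_runner_models_loaded {models_loaded}",
        "# TYPE model_runner_uptime_seconds counter",
        f"model_runner_uptime_seconds {uptime_seconds}",
    ]
    return "\n".join(lines) + "\n"
```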
- Non-root execution: Container runs as non-root user
- Resource limits: Set appropriate CPU/memory limits
- Network security: Use proper firewall rules
- Model validation: Validate uploaded models
- Input sanitization: Validate API inputs
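As one concrete instance of the last two points, a prediction-payload validator might look like this. It is an illustrative sketch, not the project's actual validation code:

```python
# Illustrative input check for /predict payloads: every row must be a
# non-empty list of finite numbers with a consistent width. Sketch
# only; the project's real validation may differ.
import math

def validate_rows(data):
    if not isinstance(data, list) or not data:
        raise ValueError("data must be a non-empty list of rows")
    width = None
    for row in data:
        if not isinstance(row, list) or not row:
            raise ValueError("each row must be a non-empty list")
        if width is None:
            width = len(row)
        elif len(row) != width:
            raise ValueError("all rows must have the same length")
        for x in row:
            if isinstance(x, bool) or not isinstance(x, (int, float)) or not math.isfinite(x):
                raise ValueError("row values must be finite numbers")
    return data
```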
To scale horizontally, run multiple replicas:

```yaml
# docker-compose.yml
services:
  model-runner:
    # ... configuration ...
    deploy:
      replicas: 3
```

Use a reverse proxy like nginx:

```nginx
upstream model_runners {
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 80;
    location / {
        proxy_pass http://model_runners;
    }
}
```

For production, use persistent volumes:
```yaml
services:
  model-runner:
    volumes:
      - model_data:/app/models
      - log_data:/app/logs

volumes:
  model_data:
  log_data:
```

If the PowerShell script doesn't run:
```powershell
# Check current execution policy
Get-ExecutionPolicy

# Temporarily allow script execution (run as Administrator)
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Or run the script with bypass
powershell -ExecutionPolicy Bypass -File .\scripts\quick-start.ps1

# Alternative: Use the batch file instead
.\scripts\quick-start.bat
```

If you get "ModuleNotFoundError" when running the client:
```bash
# Install client dependencies
pip install -r client-requirements.txt

# Or use the setup scripts
.\setup-client.ps1   # Windows PowerShell
.\setup-client.bat   # Windows Batch
./setup-client.sh    # Linux/Mac

# Check if dependencies are installed
python -c "import aiohttp; print('aiohttp OK')"
python -c "import requests; print('requests OK')"
```

If port 8000 is already in use:

```bash
# Find process using port 8000
lsof -i :8000                  # Linux/Mac
netstat -ano | findstr :8000   # Windows

# Kill the process or use a different port
docker run -p 8001:8000 model-runner
```

If a model fails to load:

```bash
# Check model files
docker exec -it <container> ls -la /app/models/

# Check logs
docker logs <container>

# Validate model format
python -c "import pickle; pickle.load(open('model.pkl', 'rb'))"
```
If the container runs out of memory:

```bash
# Check container resource usage
docker stats

# Increase memory limits for Llama models
docker run -m 8g model-runner
```

For Docker Compose, edit `docker-compose.yml`:

```yaml
deploy:
  resources:
    limits:
      memory: 8G
```

If the Llama model fails to download or load:

```bash
# Check if the model is downloading
docker logs <container> | grep -i llama

# Check available disk space for the model cache
docker exec -it <container> df -h /app/cache

# Manual model loading test
docker exec -it <container> python -c "
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')
print('Llama model accessible')
"
```

Enable debug logging:
```bash
docker run -e LOG_LEVEL=DEBUG -p 8000:8000 model-runner
```

To contribute:

- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Make your changes and add tests
- Ensure all tests pass: `python client.py --mode test`
- Submit a pull request
```bash
# Clone and setup
git clone <your-fork>
cd docker-model-runner

# Create a virtual environment
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or
venv\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt
pip install -r client-requirements.txt

# Run tests
python client.py --mode test
```

This project is open source. Please check the LICENSE file for details.
Note on Llama 3.2: The Llama 3.2 model is subject to Meta's custom license. Please review the Llama 3.2 Community License Agreement before commercial use.
For issues and questions:

- Create an issue on GitHub
- Check the troubleshooting section
- Review the API documentation at `/docs`
- For Llama-specific issues, check the model loading logs
- Minimum: 4GB RAM, 2 CPU cores
- Recommended for Llama: 8GB RAM, 4 CPU cores
- GPU: Optional but recommended for Llama inference
- Storage: ~2-3GB for Llama model cache
- First container start may take 5-10 minutes (Llama model download)
- Model files are cached in Docker volume for subsequent runs
- Use `docker logs <container>` to monitor loading progress
```bash
# Quick Llama test
python test_llama.py

# Interactive mode with Llama commands
python client.py --mode interactive
# Then try: llama, generate Hello!, etc.

# Full test suite
python client.py --mode test
```

Happy Model Serving with Llama 3.2!
- Prashant Piprotar - Prash+
Visit my blog for more Tech Stuff