Ollama Smart Proxy

Version 3.6 | Production Ready ✅

Intelligent request queue and load balancer for Ollama, optimizing GPU VRAM usage through smart request prioritization and model affinity scheduling.

What It Does

The Smart Proxy sits between your clients and Ollama, intelligently managing request order to minimize model swapping and maximize throughput:

VRAM-Aware Scheduling: Prioritizes requests for already-loaded models
Model Affinity: Batches requests for the same model together
Fair Queuing: Prevents IP monopolization while avoiding starvation
Analytics & Monitoring: Tracks performance, errors, and usage patterns
Production Ready: Database logging, fallback mechanisms, Docker deployment

Quick Start

Installation

# Clone repository
git clone <repo-url>
cd ollama_smart_proxy

# Setup environment (WSL/Ubuntu recommended)
conda activate ./.conda
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env to set OLLAMA_API_BASE and TOTAL_VRAM_MB

Run

# Development
./.conda/bin/python src/smart_proxy.py

# Docker (production)
docker-compose up -d

Test

# Send request via proxy
curl -X POST http://localhost:8003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Check status
curl http://localhost:8003/proxy/health | jq

How It Works

Priority Scoring

Requests are scored (lower = higher priority) based on:

VRAM Efficiency (0-800 points)
- 0: Model already loaded (instant)
- 200: Can fit alongside current models
- 400: Small model swap required
- 800: Large model swap (>40GB)
IP Fairness (+0-100 points)
- Penalties for IPs with many queued/recent requests
- Prevents monopolization
Wait Time Bonus (-1/second)
- Older requests gain priority
- Prevents starvation

Example: A request for an already-loaded model from a new IP that's been waiting 30 seconds gets a score of 0 + 0 - 30 = -30 (very high priority).

Architecture

Client Request → Smart Proxy → Priority Queue → Ollama
                      ↓                ↓
                 VRAM Monitor    Request Tracker
                      ↓                ↓
                  Database ←  Analytics Engine

Components:

Priority Queue: Async queue with dynamic recalculation
VRAM Monitor: Polls Ollama's /api/ps to track loaded models
Request Tracker: IP fairness, rate limiting, request history
Database: SQLite (dev) or PostgreSQL (prod) for logging
Analytics: Historical performance analysis

See docs/ARCHITECTURE.md for details.

API Endpoints

OpenAI-Compatible

POST /v1/chat/completions - Chat completions
POST /v1/completions - Text completions

Ollama-Compatible

POST /api/chat - Ollama chat format
POST /api/generate - Ollama generation format

Monitoring (Public)

GET /proxy/health - System health + queue stats
GET /proxy/queue - Real-time queue with priorities
GET /proxy/vram - VRAM usage and loaded models

Admin (Protected)

POST /proxy/auth - Authenticate for 24h (requires PROXY_ADMIN_KEY)
GET /proxy/analytics - Historical analytics (requests, errors, priorities)
POST /api/pull|push|create|delete - Model management

Admin Authentication:

Static IP whitelist (ADMIN_IPS env var)
Session via /proxy/auth with admin key
X-Admin-Key: <key> header

Configuration

Key environment variables (.env):

# Ollama connection
OLLAMA_API_BASE=http://localhost:11434

# Proxy settings
PROXY_HOST=0.0.0.0
PROXY_PORT=8003
OLLAMA_MAX_PARALLEL=3

# VRAM (adjust for your GPU)
TOTAL_VRAM_MB=80000  # 80GB
VRAM_POLL_INTERVAL=5

# Security
PROXY_ADMIN_KEY=your_secret_key
ADMIN_IPS=127.0.0.1,::1

# Database
DB_TYPE=sqlite  # or postgres for production
DB_PATH=./db/requests.db

# Logging
LOG_FORMAT=json  # or human
LOG_LEVEL=INFO

See .env.example for all options.

Tuning Priority Weights

# Base scores (lower = higher priority)
PRIORITY_BASE_LOADED=100          # Model already loaded
PRIORITY_BASE_PARALLEL=200        # Can fit in parallel
PRIORITY_BASE_SMALL_SWAP=400      # Small swap
PRIORITY_BASE_LARGE_SWAP=800      # Large swap (>40GB)

# Modifiers
PRIORITY_WAIT_TIME_MULTIPLIER=-1  # -1 point/second waiting
PRIORITY_RATE_LIMIT_MULTIPLIER=5  # +5 per recent request
RATE_LIMIT_WINDOW=600             # 10 minutes

Admin Dashboard

Interactive terminal dashboard for monitoring:

# Install dependencies
pip install -r scripts/requirements-dashboard.txt

# Run live dashboard
python scripts/admin_dashboard.py \
  --url http://localhost:8003 \
  --key YOUR_ADMIN_KEY \
  --refresh 5

# Snapshot mode (one-time)
python scripts/admin_dashboard.py --once

Features:

Live health, VRAM, and queue status
Historical analytics (models, IPs, errors, priorities)
Model bunching detection
Optimized for 1080p terminal

See scripts/README_DASHBOARD.md for details.

Testing

# Run all tests
./.conda/bin/pytest

# Specific test suite
./.conda/bin/pytest tests/test_scenarios.py -v

# With coverage
./.conda/bin/pytest --cov=src tests/

Test coverage: 35 tests covering priority logic, queue management, database operations, and fallback mechanisms.

See docs/TESTING_GUIDE.md for comprehensive scenarios.

Production Deployment

Docker Compose (Recommended)

# Configure
cp .env.example .env
# Edit .env: set OLLAMA_API_BASE, DB_TYPE=postgres, PROXY_ADMIN_KEY

# Deploy
docker-compose -f docker-compose.yml up -d

# Check logs
docker-compose logs -f smart-proxy

# View database
docker-compose exec postgres psql -U proxy -d proxy_db

PostgreSQL Setup

# Database automatically initializes via docker-compose
# Or manually:
docker-compose exec postgres psql -U proxy -d proxy_db -f /app/scripts/schema.sql

# Migration
./.conda/bin/python scripts/migrate_db.py migrate

# Backfill from fallback logs
./.conda/bin/python scripts/migrate_db.py backfill

See docs/DEPLOYMENT.md for complete guide.

Database & Analytics

Logging

All requests logged to database with:

Request ID, source IP, model name
Timestamps (received, started, completed)
Priority score, queue wait, processing time
Prompt and response text
Error messages (if failed)

Fallback: If database unavailable, logs written to db/fallback_logs/ and recovered on startup.

Analytics Queries

# Get analytics via API
curl -H "X-Admin-Key: YOUR_KEY" \
  "http://localhost:8003/proxy/analytics?hours=24" | jq

# Direct database query
./.conda/bin/python -c "
from data_access import get_analytics_repo
from datetime import datetime, timedelta
repo = get_analytics_repo()
end = datetime.utcnow()
start = end - timedelta(hours=24)
print(repo.get_request_count_by_model(start, end))
"

Available analytics:

Request counts by model/IP
Average duration by model
Priority score distribution
Error rate analysis
Model bunching detection (identifies request clustering)
Requests over time

See docs/DATABASE_INTEGRATION.md for schema and queries.

Troubleshooting

Proxy won't start

# Check Ollama
curl http://localhost:11434/api/tags

# Check port
lsof -i :8003

# View logs
tail -f logs/proxy.log

VRAM not detected

Wait 5-10 seconds after first request (poll interval)
Verify: curl http://localhost:8003/proxy/vram | jq
Check Ollama: curl http://localhost:11434/api/ps

Queue seems stuck

# Check queue
curl http://localhost:8003/proxy/queue | jq

# Check if paused (testing endpoint)
curl http://localhost:8003/proxy/health | jq .paused

Database errors

# Check fallback logs
ls -lh db/fallback_logs/

# Recover manually
./.conda/bin/python scripts/migrate_db.py backfill

# Reset database (DESTRUCTIVE)
rm db/requests.db
./.conda/bin/python src/smart_proxy.py  # reinitializes

Documentation

Architecture - System design and components
Deployment - Docker and production setup
Database Integration - Schema and queries
Testing Guide - Test scenarios and coverage
Admin Dashboard - Monitoring tool usage
Changelog - Version history

Project Structure

ollama_smart_proxy/
├── src/
│   ├── smart_proxy.py       # Main FastAPI application
│   ├── vram_monitor.py      # VRAM tracking
│   ├── database.py          # Database layer (SQLAlchemy)
│   ├── data_access.py       # Repository pattern
│   └── log_formatter.py     # Structured logging
├── scripts/
│   ├── admin_dashboard.py   # Terminal dashboard
│   ├── migrate_db.py        # Database migrations
│   └── example_usage.sh     # API examples
├── tests/
│   ├── test_scenarios.py    # Integration tests
│   ├── test_database.py     # DB tests
│   └── test_analytics.py    # Analytics tests
├── docs/
│   ├── ARCHITECTURE.md
│   ├── DEPLOYMENT.md
│   └── changelog/           # Version history
├── db/                      # SQLite database + fallback logs
├── docker-compose.yml       # Production deployment
├── Dockerfile
├── requirements.txt
└── README.md

Performance Tips

Tune VRAM polling: Increase VRAM_POLL_INTERVAL for more stability, decrease for faster response
Adjust parallel limit: Set OLLAMA_MAX_PARALLEL based on GPU memory and model sizes
Priority weights: Fine-tune based on your workload (see Configuration)
Database: Use PostgreSQL for production, SQLite for development
Monitoring: Use admin dashboard to identify bottlenecks

Version History

v3.6 (2026-02-04) - Current

✅ Analytics API endpoint (/proxy/analytics)
✅ Admin dashboard client with live monitoring
✅ Admin authentication for protected endpoints

v3.5 (2025-12-20)

✅ Database integration (SQLite/PostgreSQL)
✅ Analytics queries (bunching, errors, distributions)
✅ Fallback logging mechanism
✅ Docker deployment ready

v3.4 (2025-12-19)

✅ Structured logging (JSON/human)
✅ Request ID tracking
✅ Enhanced test suite (35 tests)

See docs/changelog/ for complete history.

Contributing

This project is currently in production use. For bugs or features, see the TODO.md for planned work.

License

[Add your license here]

Related Projects

Ollama - Run LLMs locally
LiteLLM - LLM proxy framework (used internally)

Built with: FastAPI, LiteLLM, SQLAlchemy, Rich

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
.cursor/rules		.cursor/rules
.github		.github
docs		docs
scripts		scripts
src		src
static/dashboard		static/dashboard
tests		tests
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
docker-compose.yml.example		docker-compose.yml.example
pytest.ini		pytest.ini
requirements.txt		requirements.txt
run_proxy.sh		run_proxy.sh

Folders and files

Latest commit

History

Repository files navigation

Ollama Smart Proxy

What It Does

Quick Start

Installation

Run

Test

How It Works

Priority Scoring

Architecture

API Endpoints

OpenAI-Compatible

Ollama-Compatible

Monitoring (Public)

Admin (Protected)

Configuration

Tuning Priority Weights

Admin Dashboard

Testing

Production Deployment

Docker Compose (Recommended)

PostgreSQL Setup

Database & Analytics

Logging

Analytics Queries

Troubleshooting

Proxy won't start

VRAM not detected

Queue seems stuck

Database errors

Documentation

Project Structure

Performance Tips

Version History

v3.6 (2026-02-04) - Current

v3.5 (2025-12-20)

v3.4 (2025-12-19)

Contributing

License

Related Projects

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages