vLLM Efficient Serving Stack — OpenAI-Compatible, Multi-LoRA, Autoscaling & Telemetry

Production-grade vLLM serving with an OpenAI-compatible API, per-request LoRA routing, KEDA autoscaling on Prometheus metrics, Grafana/OTel observability, and a benchmark comparing AWQ vs GPTQ vs GGUF on latency, throughput, quality, and cost per token.

✨ Features

  • OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions, /v1/models (via vLLM OpenAI server).
  • Multi‑LoRA: choose an adapter per request (header X-Adapter-Id) and fan out adapter loads to the backend pods.
  • Quantization matrix: three separate vLLM Deployments for AWQ, GPTQ, GGUF.
  • Autoscaling: KEDA ScaledObjects using PromQL over vLLM backlog metrics (e.g., sum(vllm:num_requests_waiting)).
  • Observability: Prometheus scraping + Grafana dashboard JSON; optional OpenTelemetry traces from the router.
  • Bench: async loadgen for latency; optional Prometheus pull for tokens/sec + cost/token; lm-eval-harness helper for quality.

🧭 Architecture

Client (OpenAI SDK)
   → Router (FastAPI; LoRA injection; OTel)
     → vLLM backends: AWQ | GPTQ | GGUF
        ↘ /metrics → Prometheus → Grafana
KEDA ← PromQL (queue depth) → scales vLLM Deployments
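
As a concrete example of the Client → Router path above, here is a minimal sketch using the official openai Python SDK. The base URL, port, API key, and model name are placeholders for your own deployment; whether the router enforces an API key depends on your configuration.

# Minimal sketch of the client → router → vLLM path shown above.
# Assumptions: the router Service is reachable on localhost:8080 (e.g. via
# port-forward, as in the quick start) and serves the model name used below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # router endpoint, not api.openai.com
    api_key="not-used",                   # placeholder; only needed if you add auth
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",        # must match a model served by a vLLM backend
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp.choices[0].message.content)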

🚀 Quick start (Kubernetes)

  1. Namespace & monitoring
kubectl apply -f kube/00-namespace.yaml
kubectl apply -f kube/20-prometheus-configmap.yaml \
  -f kube/21-prometheus-deployment.yaml \
  -f kube/22-prometheus-service.yaml \
  -f kube/30-grafana-configmap-datasource.yaml \
  -f kube/31-grafana-deployment.yaml \
  -f kube/32-grafana-service.yaml
  2. vLLM backends
    Edit kube/10-12* to point at model artifacts you can actually run (Hugging Face repos for AWQ/GPTQ, a local file path for GGUF).
kubectl apply -f kube/10-deploy-vllm-awq.yaml -f kube/13-svc-vllm-awq.yaml
kubectl apply -f kube/11-deploy-vllm-gptq.yaml -f kube/14-svc-vllm-gptq.yaml
kubectl apply -f kube/12-deploy-vllm-gguf.yaml -f kube/15-svc-vllm-gguf.yaml
  3. Router
# Build & push your router image (or use the GitHub Actions workflow below)
cd router
docker build -t ghcr.io/<your-username-or-org>/vllm-router:latest .
kubectl apply -f ../kube/01-configmap-router.yaml \
  -f ../kube/02-deploy-router.yaml \
  -f ../kube/03-svc-router.yaml
  4. (Optional) Autoscaling with KEDA (requires KEDA to be installed; a sketch for checking the scaling signal follows the smoke test)
kubectl apply -f kube/40-keda-scaledobject-awq.yaml \
  -f kube/41-keda-scaledobject-gptq.yaml \
  -f kube/42-keda-scaledobject-gguf.yaml
  5. Smoke test
kubectl -n vllm-demo port-forward svc/router 8080:8080 &
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "messages": [{"role":"user","content":"Say hi in one sentence."}]
}'
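
To confirm that the signal KEDA scales on is actually flowing, you can issue the same PromQL query the ScaledObjects use. A minimal sketch, assuming Prometheus is reachable on localhost:9090 and that your vLLM version exports vllm:num_requests_waiting:

# Sketch: query the KEDA scaling signal (queue depth) from Prometheus.
# Assumes Prometheus is port-forwarded to localhost:9090; the Service name and
# port you forward depend on your manifests and are placeholders here.
import requests

PROM_URL = "http://localhost:9090"  # placeholder; adjust to your setup
query = "sum(vllm:num_requests_waiting)"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if result:
    print(f"waiting requests: {result[0]['value'][1]}")
else:
    print("no data yet; send some traffic and re-check")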

🎛️ Multi‑LoRA usage

  • Send an X-Adapter-Id: <adapter_name> header; the router injects the vendor-specific LoRA parameters into extra_body (see the Python example below).
  • Load adapters to pods via the router admin:
curl -X POST http://<router>/admin/adapters/load \
  -H 'Content-Type: application/json' \
  -d '{"backend":"awq","source":"hf","adapter":"<org/repo>","name":"my-adapter"}'

Note: Runtime adapter loading is convenient for development; for multi‑replica production, coordinate adapter distribution across replicas and either use shared storage or pre‑load adapters at startup.
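
A sketch of per-request adapter selection from the client side, assuming the openai Python SDK, a router on localhost:8080, and an adapter named my-adapter already loaded on the target backend (all placeholders):

# Sketch: pick a LoRA adapter per request via the X-Adapter-Id header.
# Assumes an adapter called "my-adapter" was loaded beforehand, e.g. via the
# /admin/adapters/load call shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder base model name
    messages=[{"role": "user", "content": "Answer in the adapter's style."}],
    extra_headers={"X-Adapter-Id": "my-adapter"},  # router maps this to LoRA params
)
print(resp.choices[0].message.content)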

📈 Observability

  • Metrics: vLLM exposes Prometheus metrics at /metrics (queue depth, tokens/sec, and latency histograms such as TTFT (time to first token) and TPOT (time per output token)). Dashboard JSON is in dashboards/grafana.
  • Traces: Set OTEL_EXPORTER_OTLP_ENDPOINT on the router (and/or vLLM) to export spans to your OTLP collector (Jaeger/Tempo/Zipkin).
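
For orientation, the sketch below shows what minimal OTLP span export looks like in Python; it is not the router's exact instrumentation, and it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed. The exporter picks up OTEL_EXPORTER_OTLP_ENDPOINT from the environment.

# Minimal OTLP tracing sketch (not necessarily the router's exact wiring).
# Export target comes from OTEL_EXPORTER_OTLP_ENDPOINT, e.g.
# OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 (placeholder host).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "vllm-router"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("router")

with tracer.start_as_current_span("chat_completion") as span:
    span.set_attribute("adapter.id", "my-adapter")  # example attribute
    # ... forward the request to a vLLM backend here ...
    pass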

🧪 Benchmarking

  • bench/run_latency_throughput.py → p50/p95 latencies; optional --prom-url + --gpu-cost to compute cost/token from tokens/sec (a simplified sketch follows this list).
  • bench/run_lm_eval.sh → runs lm-evaluation-harness via the OpenAI Chat Completions provider (MMLU, HellaSwag, etc.).
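
For orientation, the sketch below captures the idea behind the latency run; it is not the actual bench/run_latency_throughput.py, and the URL, model name, and request counts are placeholders. It fires a fixed number of concurrent chat requests, records wall-clock latencies, and reports p50/p95. Given a tokens/sec figure from Prometheus, cost per token is then gpu_hourly_cost / (3600 * tokens_per_sec).

# Simplified loadgen sketch (illustrative, not the repo's bench script).
# Assumes the router is port-forwarded on localhost:8080 and serves the model below.
import asyncio, statistics, time
import httpx

URL = "http://localhost:8080/v1/chat/completions"   # placeholder
MODEL = "llama-3.1-8b-instruct"                      # placeholder
CONCURRENCY = 8
REQUESTS = 64

async def one_request(client: httpx.AsyncClient) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hi in one sentence."}],
        "max_tokens": 64,
    }
    start = time.perf_counter()
    r = await client.post(URL, json=payload, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        async def bounded() -> float:
            async with sem:
                return await one_request(client)
        latencies = sorted(await asyncio.gather(*(bounded() for _ in range(REQUESTS))))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"p50={p50:.2f}s  p95={p95:.2f}s")

asyncio.run(main())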

🗂️ Repository layout

/router        # FastAPI gateway (OpenAI passthrough + LoRA routing + OTel)
/bench         # loadgen + lm-eval helper
/dashboards    # Grafana JSON
/kube          # Kubernetes manifests (vLLM, Prometheus, Grafana, KEDA)
/infra/local   # optional docker-compose: Prometheus + Grafana

🛠️ CI: build & push the router image (GHCR)

This repo includes .github/workflows/docker-router.yml. It builds the router image on pushes to main/master, on releases, and on manual runs, then pushes it to ghcr.io using the automatically provided GITHUB_TOKEN.

  • Ensure the workflow has packages: write permission (set in the workflow’s permissions block).
  • The image is tagged as ghcr.io/<owner>/<repo>/vllm-router:<tag>.

📄 License

MIT
