Production-grade vLLM serving with an OpenAI-compatible API, per-request LoRA routing, KEDA autoscaling on Prometheus metrics, Grafana/OTel observability, and a benchmark comparing AWQ vs GPTQ vs GGUF on latency, throughput, quality, and cost per token.
- OpenAI-compatible endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/models` (via the vLLM OpenAI server).
- Multi-LoRA: choose an adapter per request (header `X-Adapter-Id`) and fan out adapter loads to pods.
- Quantization matrix: three separate vLLM Deployments for AWQ, GPTQ, GGUF.
- Autoscaling: KEDA ScaledObjects using PromQL over vLLM backlog metrics (e.g., `sum(vllm:num_requests_waiting)`; a query sanity check is sketched after this list).
- Observability: Prometheus scraping + Grafana dashboard JSON; optional OpenTelemetry traces from the router.
- Bench: async loadgen for latency; optional Prometheus pull for tokens/sec + cost/token; `lm-eval-harness` helper for quality.
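If you want to sanity-check that backlog query before handing it to KEDA, you can run the same PromQL against the Prometheus HTTP API. A minimal sketch, assuming Prometheus is port-forwarded to `localhost:9090` (the address and the port-forward are assumptions; the query string matches the example above):

```python
# check_queue_depth.py - minimal sketch; assumes a local port-forward to
# Prometheus and the same backlog metric used as the example above.
import requests

PROM_URL = "http://localhost:9090"          # assumption: local port-forward to Prometheus
QUERY = "sum(vllm:num_requests_waiting)"    # queue-depth metric from the bullet above

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

result = resp.json()["data"]["result"]
if result:
    # An instant query returns [timestamp, value]; the value is the current
    # number of waiting requests summed across vLLM pods.
    print("current queue depth:", result[0]["value"][1])
else:
    print("no data - check the metric name and Prometheus scrape config")
```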
Client (OpenAI SDK)
→ Router (FastAPI; LoRA injection; OTel)
→ vLLM backends: AWQ | GPTQ | GGUF
↘ /metrics → Prometheus → Grafana
KEDA ← PromQL (queue depth) → scales vLLM Deployments
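To make the router's role in this picture concrete, here is an illustrative FastAPI passthrough that forwards OpenAI-style requests to one backend and selects a LoRA adapter from the `X-Adapter-Id` header. It is a sketch only: the backend URL, the single hard-coded backend, and the adapter-exposed-as-model-name convention are assumptions, not the actual `/router` implementation.

```python
# router_sketch.py - illustrative only; the real gateway lives in /router and
# may differ. BACKEND_URL and the adapter-by-model-name convention are assumptions.
import os

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
BACKEND_URL = os.environ.get("BACKEND_URL", "http://vllm-awq:8000")  # assumption


@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> JSONResponse:
    # Streaming and the other OpenAI routes are omitted for brevity.
    payload = await request.json()
    adapter = request.headers.get("X-Adapter-Id")
    if adapter:
        # vLLM exposes loaded LoRA adapters under their own model names, so the
        # simplest per-request routing is to override the model field.
        payload["model"] = adapter
    async with httpx.AsyncClient(timeout=120.0) as client:
        upstream = await client.post(f"{BACKEND_URL}/v1/chat/completions", json=payload)
    return JSONResponse(status_code=upstream.status_code, content=upstream.json())
```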
- Namespace & monitoring
kubectl apply -f kube/00-namespace.yaml
kubectl apply -f kube/20-prometheus-configmap.yaml -f kube/21-prometheus-deployment.yaml -f kube/22-prometheus-service.yaml -f kube/30-grafana-configmap-datasource.yaml -f kube/31-grafana-deployment.yaml -f kube/32-grafana-service.yaml
- vLLM backends
Edit `kube/10-12*` to set model artifacts you can run (HF repos for AWQ/GPTQ, a local file path for GGUF).
kubectl apply -f kube/10-deploy-vllm-awq.yaml -f kube/13-svc-vllm-awq.yaml
kubectl apply -f kube/11-deploy-vllm-gptq.yaml -f kube/14-svc-vllm-gptq.yaml
kubectl apply -f kube/12-deploy-vllm-gguf.yaml -f kube/15-svc-vllm-gguf.yaml
- Router
# Build & push your router image (or use the GitHub Actions workflow below)
cd router
docker build -t ghcr.io/<your-username-or-org>/vllm-router:latest .
docker push ghcr.io/<your-username-or-org>/vllm-router:latest
kubectl apply -f ../kube/01-configmap-router.yaml -f ../kube/02-deploy-router.yaml -f ../kube/03-svc-router.yaml
- (Optional) Autoscaling with KEDA (requires KEDA installed)
kubectl apply -f kube/40-keda-scaledobject-awq.yaml -f kube/41-keda-scaledobject-gptq.yaml -f kube/42-keda-scaledobject-gguf.yaml
- Smoke test
kubectl -n vllm-demo port-forward svc/router 8080:8080 &
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "llama-3.1-8b-instruct",
"messages": [{"role":"user","content":"Say hi in one sentence."}]
}'
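The same smoke test can be driven through the OpenAI Python SDK by pointing `base_url` at the router. A minimal sketch, assuming the port-forward above is still running and that the router enforces no authentication (the API key below is a placeholder):

```python
# smoke_test_sdk.py - minimal sketch; assumes the router is reachable on
# localhost:8080 via the port-forward above and needs no real API key.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # the router, not api.openai.com
    api_key="not-needed",                 # placeholder; auth is assumed to be disabled
)

# Equivalent of `curl /v1/models`: list whatever the backends expose.
print([m.id for m in client.models.list().data])

# Equivalent of the chat completion curl above.
resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp.choices[0].message.content)
```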
- Send an `X-Adapter-Id: <adapter_name>` header; the router injects the vendor params into `extra_body` (see the SDK sketch after the note below).
- Load adapters to pods via the router admin:
curl -X POST http://<router>/admin/adapters/load -H 'Content-Type: application/json' -d '{"backend":"awq","source":"hf","adapter":"<org/repo>","name":"my-adapter"}'
Note: Runtime adapter loading is convenient for dev; for multi‑replica production, coordinate adapter distribution and ensure shared storage or pre‑load adapters at start.
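With the OpenAI Python SDK, the adapter header can be attached per request via `extra_headers`. A minimal sketch, assuming the router port-forward from the smoke test and an adapter registered under the name `my-adapter` (matching the admin example above):

```python
# lora_request.py - minimal sketch; the adapter name matches the admin load
# example above, and the router address assumes the earlier port-forward.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    # The router reads X-Adapter-Id and injects the vendor-specific LoRA
    # parameters into extra_body before forwarding to the vLLM backend.
    extra_headers={"X-Adapter-Id": "my-adapter"},
)
print(resp.choices[0].message.content)
```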
- Metrics: vLLM exposes Prometheus metrics at `/metrics` (queue depth, tokens/sec, latency histograms like TTFT/TPOT). Dashboard JSON is in `dashboards/grafana`.
- Traces: set `OTEL_EXPORTER_OTLP_ENDPOINT` on the router (and/or vLLM) to export spans to your OTLP collector (Jaeger/Tempo/Zipkin).
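For reference, the wiring behind that environment variable usually looks like the following on the router side. This is a sketch of typical OpenTelemetry SDK setup for a FastAPI app, not necessarily what `/router` does; the service name and collector endpoint are assumptions.

```python
# otel_sketch.py - illustrative OTLP export setup for a FastAPI service; the
# actual router code in /router may wire this differently.
import os

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = FastAPI()

# The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT (e.g. http://otel-collector:4317);
# passing it explicitly here just makes the dependency visible.
provider = TracerProvider(resource=Resource.create({"service.name": "vllm-router"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")))
)
trace.set_tracer_provider(provider)

# Auto-instrument every route so each proxied request produces a span.
FastAPIInstrumentor.instrument_app(app)
```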
- `bench/run_latency_throughput.py` → p50/p95 latencies; optional `--prom-url` + `--gpu-cost` to compute cost/token from tokens/sec (the arithmetic is sketched below).
- `bench/run_lm_eval.sh` → runs lm-evaluation-harness via the OpenAI Chat Completions provider (MMLU, HellaSwag, etc.).
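The cost-per-token arithmetic behind those flags is easy to reproduce by hand. A sketch, with a placeholder GPU price and a placeholder sustained throughput standing in for the tokens/sec figure pulled from Prometheus:

```python
# cost_per_token.py - minimal sketch of the cost arithmetic; both inputs are
# placeholders, not measured results from this repo.
GPU_COST_PER_HOUR = 2.50      # USD per GPU-hour for the serving node (assumption)
TOKENS_PER_SECOND = 1200.0    # sustained generation throughput from Prometheus (assumption)

tokens_per_hour = TOKENS_PER_SECOND * 3600
cost_per_token = GPU_COST_PER_HOUR / tokens_per_hour

print(f"${cost_per_token:.8f} per token (~${cost_per_token * 1_000_000:.2f} per 1M tokens)")
```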
/router # FastAPI gateway (OpenAI passthrough + LoRA routing + OTel)
/bench # loadgen + lm-eval helper
/dashboards # Grafana JSON
/kube # Kubernetes manifests (vLLM, Prometheus, Grafana, KEDA)
/infra/local # optional docker-compose: Prometheus + Grafana
This repo includes `.github/workflows/docker-router.yml`. It builds the router image on pushes to `main`/`master`, on releases, and on manual runs, and pushes it to ghcr.io using the automatically provided `GITHUB_TOKEN`.
- Ensure your repository grants workflows Packages: write permission (set in the workflow's `permissions` block).
- The image is tagged as `ghcr.io/<owner>/<repo>/vllm-router:<tag>`.
MIT