vLLM Efficient Serving Stack — OpenAI-Compatible, Multi-LoRA, Autoscaling & Telemetry

Production-grade vLLM serving with an OpenAI-compatible API, per-request LoRA routing, KEDA autoscaling on Prometheus metrics, Grafana/OTel observability, and a benchmark comparing AWQ vs GPTQ vs GGUF on latency, throughput, quality, and cost per token.

✨ Features

  • OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions, /v1/models (via vLLM OpenAI server).
  • Multi‑LoRA: choose an adapter per request (header X-Adapter-Id) and fan out adapter loads to the backend pods.
  • Quantization matrix: three separate vLLM Deployments for AWQ, GPTQ, GGUF.
  • Autoscaling: KEDA ScaledObjects using PromQL over vLLM backlog metrics (e.g., sum(vllm:num_requests_waiting)).
  • Observability: Prometheus scraping + Grafana dashboard JSON; optional OpenTelemetry traces from the router.
  • Bench: async loadgen for latency; optional Prometheus pull for tokens/sec + cost/token; lm-eval-harness helper for quality.

🧭 Architecture

Client (OpenAI SDK)
   → Router (FastAPI; LoRA injection; OTel)
     → vLLM backends: AWQ | GPTQ | GGUF
        ↘ /metrics → Prometheus → Grafana
KEDA ← PromQL (queue depth) → scales vLLM Deployments
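
As a concrete example of the Client → Router path above, here is a minimal sketch using the official openai Python SDK. The base URL, port, API key, and model name are placeholders for your own deployment; whether the router enforces an API key depends on your configuration.

# Minimal sketch of the client → router → vLLM path shown above.
# Assumptions: the router Service is reachable on localhost:8080 (e.g. via
# port-forward, as in the quick start) and serves the model name used below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # router endpoint, not api.openai.com
    api_key="not-used",                   # placeholder; only needed if you add auth
)

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",        # must match a model served by a vLLM backend
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp.choices[0].message.content)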

🚀 Quick start (Kubernetes)

  1. Namespace & monitoring
kubectl apply -f kube/00-namespace.yaml
kubectl apply -f kube/20-prometheus-configmap.yaml \
  -f kube/21-prometheus-deployment.yaml \
  -f kube/22-prometheus-service.yaml \
  -f kube/30-grafana-configmap-datasource.yaml \
  -f kube/31-grafana-deployment.yaml \
  -f kube/32-grafana-service.yaml
  2. vLLM backends
    Edit kube/10-12* to point at model artifacts you can actually run (Hugging Face repos for AWQ/GPTQ, a local file path for GGUF).
kubectl apply -f kube/10-deploy-vllm-awq.yaml -f kube/13-svc-vllm-awq.yaml
kubectl apply -f kube/11-deploy-vllm-gptq.yaml -f kube/14-svc-vllm-gptq.yaml
kubectl apply -f kube/12-deploy-vllm-gguf.yaml -f kube/15-svc-vllm-gguf.yaml
  3. Router
# Build & push your router image (or use the GitHub Actions workflow below)
cd router
docker build -t ghcr.io/<your-username-or-org>/vllm-router:latest .
kubectl apply -f ../kube/01-configmap-router.yaml \
  -f ../kube/02-deploy-router.yaml \
  -f ../kube/03-svc-router.yaml
  4. (Optional) Autoscaling with KEDA (requires KEDA to be installed; a sketch for checking the scaling signal follows the smoke test)
kubectl apply -f kube/40-keda-scaledobject-awq.yaml \
  -f kube/41-keda-scaledobject-gptq.yaml \
  -f kube/42-keda-scaledobject-gguf.yaml
  5. Smoke test
kubectl -n vllm-demo port-forward svc/router 8080:8080 &
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "llama-3.1-8b-instruct",
  "messages": [{"role":"user","content":"Say hi in one sentence."}]
}'
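
To confirm that the signal KEDA scales on is actually flowing, you can issue the same PromQL query the ScaledObjects use. A minimal sketch, assuming Prometheus is reachable on localhost:9090 and that your vLLM version exports vllm:num_requests_waiting:

# Sketch: query the KEDA scaling signal (queue depth) from Prometheus.
# Assumes Prometheus is port-forwarded to localhost:9090; the Service name and
# port you forward depend on your manifests and are placeholders here.
import requests

PROM_URL = "http://localhost:9090"  # placeholder; adjust to your setup
query = "sum(vllm:num_requests_waiting)"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if result:
    print(f"waiting requests: {result[0]['value'][1]}")
else:
    print("no data yet; send some traffic and re-check")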

🎛️ Multi‑LoRA usage

  • Send an X-Adapter-Id: <adapter_name> header; the router injects the vendor-specific LoRA parameters into extra_body (see the Python example below).
  • Load adapters to pods via the router admin:
curl -X POST http://<router>/admin/adapters/load \
  -H 'Content-Type: application/json' \
  -d '{"backend":"awq","source":"hf","adapter":"<org/repo>","name":"my-adapter"}'

Note: Runtime adapter loading is convenient for development; for multi‑replica production, coordinate adapter distribution across replicas and either use shared storage or pre‑load adapters at startup.
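
A sketch of per-request adapter selection from the client side, assuming the openai Python SDK, a router on localhost:8080, and an adapter named my-adapter already loaded on the target backend (all placeholders):

# Sketch: pick a LoRA adapter per request via the X-Adapter-Id header.
# Assumes an adapter called "my-adapter" was loaded beforehand, e.g. via the
# /admin/adapters/load call shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # placeholder base model name
    messages=[{"role": "user", "content": "Answer in the adapter's style."}],
    extra_headers={"X-Adapter-Id": "my-adapter"},  # router maps this to LoRA params
)
print(resp.choices[0].message.content)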

📈 Observability

  • Metrics: vLLM exposes Prometheus metrics at /metrics (queue depth, tokens/sec, and latency histograms such as TTFT (time to first token) and TPOT (time per output token)). Dashboard JSON is in dashboards/grafana.
  • Traces: Set OTEL_EXPORTER_OTLP_ENDPOINT on the router (and/or vLLM) to export spans to your OTLP collector (Jaeger/Tempo/Zipkin).
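
For orientation, the sketch below shows what minimal OTLP span export looks like in Python; it is not the router's exact instrumentation, and it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed. The exporter picks up OTEL_EXPORTER_OTLP_ENDPOINT from the environment.

# Minimal OTLP tracing sketch (not necessarily the router's exact wiring).
# Export target comes from OTEL_EXPORTER_OTLP_ENDPOINT, e.g.
# OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 (placeholder host).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "vllm-router"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("router")

with tracer.start_as_current_span("chat_completion") as span:
    span.set_attribute("adapter.id", "my-adapter")  # example attribute
    # ... forward the request to a vLLM backend here ...
    pass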

🧪 Benchmarking

  • bench/run_latency_throughput.py → p50/p95 latencies; optional --prom-url + --gpu-cost to compute cost/token from tokens/sec (a simplified sketch follows this list).
  • bench/run_lm_eval.sh → runs lm-evaluation-harness via the OpenAI Chat Completions provider (MMLU, HellaSwag, etc.).
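
For orientation, the sketch below captures the idea behind the latency run; it is not the actual bench/run_latency_throughput.py, and the URL, model name, and request counts are placeholders. It fires a fixed number of concurrent chat requests, records wall-clock latencies, and reports p50/p95. Given a tokens/sec figure from Prometheus, cost per token is then gpu_hourly_cost / (3600 * tokens_per_sec).

# Simplified loadgen sketch (illustrative, not the repo's bench script).
# Assumes the router is port-forwarded on localhost:8080 and serves the model below.
import asyncio, statistics, time
import httpx

URL = "http://localhost:8080/v1/chat/completions"   # placeholder
MODEL = "llama-3.1-8b-instruct"                      # placeholder
CONCURRENCY = 8
REQUESTS = 64

async def one_request(client: httpx.AsyncClient) -> float:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Say hi in one sentence."}],
        "max_tokens": 64,
    }
    start = time.perf_counter()
    r = await client.post(URL, json=payload, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        async def bounded() -> float:
            async with sem:
                return await one_request(client)
        latencies = sorted(await asyncio.gather(*(bounded() for _ in range(REQUESTS))))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"p50={p50:.2f}s  p95={p95:.2f}s")

asyncio.run(main())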

🗂️ Repository layout

/router        # FastAPI gateway (OpenAI passthrough + LoRA routing + OTel)
/bench         # loadgen + lm-eval helper
/dashboards    # Grafana JSON
/kube          # Kubernetes manifests (vLLM, Prometheus, Grafana, KEDA)
/infra/local   # optional docker-compose: Prometheus + Grafana

🛠️ CI: build & push the router image (GHCR)

This repo includes .github/workflows/docker-router.yml. It builds the router image on pushes to main/master, on releases, and on manual runs, then pushes it to ghcr.io using the automatically provided GITHUB_TOKEN.

  • Ensure the workflow has packages: write permission (set in the workflow’s permissions block).
  • The image is tagged as ghcr.io/<owner>/<repo>/vllm-router:<tag>.

📄 License

MIT
