Summary
Enable comprehensive network observability for the AppProxy-Traefik-Container traffic path by automatically registering metrics endpoints with Prometheus service discovery.
Background
Currently, when using Traefik as the backend for AppProxy:
- Traefik's own metrics are not collected: Traefik exposes a /metrics endpoint with valuable
  proxy performance data (request counts, latencies, connection states), but this endpoint is not
  registered with the Prometheus service discovery.
- Container metrics are not automatically registered: when Circuits and Routes are created via
  the Traefik backend path, the container's /metrics endpoints are not registered with service
  discovery, preventing Prometheus from scraping them.
Current Architecture
Manager → AppProxy Coordinator → etcd → Traefik → Container
│
╳ Missing: Service Discovery registration
│
Service Discovery ← Prometheus HTTP SD
Goals
- Traefik Metrics Collection: Register Traefik's /metrics endpoint with service discovery so
  Prometheus can scrape proxy-level metrics (request throughput, latency, error rates, active
  connections).
- Automatic Route Metrics Registration: When Circuits/Routes are created or updated via the
  Traefik backend, automatically register container metrics endpoints with service discovery.
- Metrics Correlation: Include Circuit ID, Session ID, and Worker ID as Prometheus labels to
  enable correlation between AppProxy routing decisions and container-level metrics.
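As an illustration of the Metrics Correlation goal, here is a minimal sketch of how one Route's metrics endpoint could be rendered as a target group in the JSON shape that Prometheus HTTP SD consumes (a list of objects with "targets" and "labels"). RouteTarget and to_target_group are hypothetical names made up for this example, not existing code. The label keys are assumed to be circuit_id, session_id, and worker_authority.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteTarget:
    """Hypothetical description of one container metrics endpoint."""

    host: str
    port: int
    circuit_id: str
    session_id: str
    worker_authority: str


def to_target_group(route: RouteTarget) -> dict:
    """Build one Prometheus HTTP SD target group for a single route."""
    return {
        "targets": [f"{route.host}:{route.port}"],
        # Prometheus label values must be strings.
        "labels": {
            "circuit_id": route.circuit_id,
            "session_id": route.session_id,
            "worker_authority": route.worker_authority,
        },
    }


# Example values are placeholders, not real identifiers.
route = RouteTarget("10.0.0.12", 8080, "circuit-1", "session-1", "worker-a")
payload = json.dumps([to_target_group(route)])
```

With these labels on every scraped series, a query can join proxy-level and container-level metrics on circuit_id without any relabeling on the Prometheus side.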
Scope
In Scope
- Register AppProxy Coordinator/Worker metrics endpoints with service discovery
- Register Traefik metrics endpoint with service discovery when Traefik backend is enabled
- Modify CircuitManager to register/deregister Route metrics endpoints on Circuit lifecycle events
- Add appropriate labels (circuit_id, session_id, worker_authority) for metrics correlation
Out of Scope
- Changes to Prometheus configuration (already supports HTTP SD)
- Changes to Traefik configuration (already exposes /metrics)
- Grafana dashboard creation (separate task)
Acceptance Criteria
- When AppProxy Coordinator starts, it registers its /metrics endpoint with service discovery
- When AppProxy Worker starts with Traefik enabled, it registers Traefik's /metrics endpoint
  with service discovery
- When a Circuit is created via Traefik backend, all healthy Route metrics endpoints are
  registered with service discovery
- When a Circuit's Routes are updated, service discovery registrations are updated accordingly
- When a Circuit is deleted, its Route metrics endpoints are deregistered from service discovery
- Prometheus can discover and scrape all registered endpoints via HTTP SD
- Metrics include labels for circuit_id, session_id, and worker_authority
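The create/update/delete criteria above amount to reconciling a desired set of healthy Route endpoints against what is currently registered. The sketch below shows that logic against an in-memory stand-in. InMemoryRegistry, RouteEndpoint, sync_circuit, and delete_circuit are hypothetical names for illustration only; the real ServiceDiscovery interface may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class RouteEndpoint:
    """Hypothetical view of one Route's metrics endpoint."""

    route_id: str
    address: str  # "host:port"
    healthy: bool = True


@dataclass
class InMemoryRegistry:
    """Hypothetical stand-in for a service discovery backend."""

    # circuit_id -> {route_id: address}
    entries: dict[str, dict[str, str]] = field(default_factory=dict)

    def sync_circuit(
        self, circuit_id: str, routes: list[RouteEndpoint]
    ) -> tuple[list[str], list[str]]:
        """Register healthy routes, drop stale ones; return (added, removed) route IDs."""
        desired = {r.route_id: r.address for r in routes if r.healthy}
        current = self.entries.get(circuit_id, {})
        added = sorted(desired.keys() - current.keys())
        removed = sorted(current.keys() - desired.keys())
        if desired:
            # Replacing the mapping also picks up address changes of kept routes.
            self.entries[circuit_id] = desired
        else:
            self.entries.pop(circuit_id, None)
        return added, removed

    def delete_circuit(self, circuit_id: str) -> list[str]:
        """Deregister every route of a deleted circuit; return the removed route IDs."""
        return sorted(self.entries.pop(circuit_id, {}))
```

Calling sync_circuit on both creation and update keeps the two paths identical, so a missed update is corrected by the next sync instead of leaving a stale target behind.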
Technical Notes
- Leverage existing ServiceDiscovery interface (src/ai/backend/common/service_discovery/)
- Use ModelServiceMetadata for Route metrics registration (already used by Manager's
  RouteExecutor)
- Reference Manager's sync_service_discovery() implementation in
  src/ai/backend/manager/sokovan/deployment/route/executor.py
Related Components
- AppProxy Coordinator: src/ai/backend/appproxy/coordinator/
- AppProxy Worker: src/ai/backend/appproxy/worker/
- Service Discovery: src/ai/backend/common/service_discovery/
- Prometheus HTTP SD endpoint: GET /metrics/service_discovery (Manager)
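One way to check that Prometheus would see every registered endpoint is to compare the HTTP SD response body against the expected targets. The sketch below assumes the Manager endpoint returns the standard Prometheus HTTP SD JSON (a list of target groups); discovered_targets and missing_targets are hypothetical helpers, and the sample payload is made up for illustration.

```python
import json


def discovered_targets(sd_payload: str) -> set[str]:
    """Collect every "host:port" target from an HTTP SD JSON response body."""
    groups = json.loads(sd_payload)
    return {target for group in groups for target in group.get("targets", [])}


def missing_targets(sd_payload: str, expected: set[str]) -> set[str]:
    """Return the expected targets that the SD response does not advertise."""
    return set(expected) - discovered_targets(sd_payload)


# Made-up response body in the Prometheus HTTP SD shape.
sample = json.dumps(
    [
        {"targets": ["10.0.0.1:9090"], "labels": {"circuit_id": "circuit-1"}},
        {"targets": ["10.0.0.2:8082"], "labels": {}},
    ]
)
gaps = missing_targets(sample, {"10.0.0.1:9090", "10.0.0.3:9090"})
```

An empty result means every expected endpoint is discoverable; anything left in the set points at a registration that did not happen.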
JIRA Issue: BA-4038