Automatic Prometheus Registration for Traefik #8251

@seedspirit

Summary

Enable comprehensive network observability for the AppProxy-Traefik-Container traffic path by automatically registering metrics endpoints with Prometheus service discovery.

Background

Currently, when using Traefik as the backend for AppProxy:

  1. Traefik's own metrics are not collected - Traefik exposes a /metrics endpoint with valuable
    proxy performance data (request counts, latencies, connection states), but this endpoint is not
    registered with the Prometheus service discovery.
  2. Container metrics are not automatically registered - When Circuits and Routes are created via
    the Traefik backend path, the container's /metrics endpoints are not registered with service
    discovery, preventing Prometheus from scraping them.

Current Architecture

Manager → AppProxy Coordinator → etcd → Traefik → Container

╳ Missing: Service Discovery registration

Service Discovery ← Prometheus HTTP SD

Goals

  1. Traefik Metrics Collection: Register Traefik's /metrics endpoint with service discovery so
    Prometheus can scrape proxy-level metrics (request throughput, latency, error rates, active
    connections).
  2. Automatic Route Metrics Registration: When Circuits/Routes are created or updated via the
    Traefik backend, automatically register container metrics endpoints with service discovery.
  3. Metrics Correlation: Attach the Circuit ID, Session ID, and worker authority as Prometheus
    labels (circuit_id, session_id, worker_authority) to enable correlation between AppProxy
    routing decisions and container-level metrics.
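
To make the third goal concrete, here is a minimal sketch of how one container metrics endpoint could be rendered as a Prometheus HTTP SD target group carrying the correlation labels. The `RouteTarget` class and both helper functions are hypothetical names used for illustration only; they are not part of the existing codebase. The only thing taken as given is the HTTP SD response shape: a JSON list of objects with `targets` and `labels`.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteTarget:
    """Hypothetical description of one container metrics endpoint."""

    host: str
    port: int
    circuit_id: str
    session_id: str
    worker_authority: str


def to_http_sd_group(target: RouteTarget) -> dict:
    """Render one target in the Prometheus HTTP SD target-group shape."""
    return {
        # HTTP SD expects "host:port" strings without a scheme or path.
        "targets": [f"{target.host}:{target.port}"],
        # Label values must be strings; these become labels on scraped series.
        "labels": {
            "circuit_id": target.circuit_id,
            "session_id": target.session_id,
            "worker_authority": target.worker_authority,
        },
    }


def render_http_sd(targets: list[RouteTarget]) -> str:
    """Serialize all targets as the JSON list an HTTP SD endpoint returns."""
    return json.dumps([to_http_sd_group(t) for t in targets], indent=2)
```

Emitting one target group per route keeps each endpoint's labels independent, so two routes in the same Circuit can still carry different session_id values.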

Scope

In Scope

  • Register AppProxy Coordinator/Worker metrics endpoints with service discovery
  • Register Traefik metrics endpoint with service discovery when Traefik backend is enabled
  • Modify CircuitManager to register/deregister Route metrics endpoints on Circuit lifecycle events
  • Add appropriate labels (circuit_id, session_id, worker_authority) for metrics correlation
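
The CircuitManager change could take roughly the following shape. This is a sketch under stated assumptions, not the actual implementation: `InMemoryRegistry` stands in for the real service discovery backend, and `Route`, `CircuitMetricsHooks`, and the hook method names are all hypothetical. The real ServiceDiscovery interface may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryRegistry:
    """Hypothetical stand-in for the service discovery backend."""

    entries: dict[str, dict[str, str]] = field(default_factory=dict)

    def register(self, endpoint: str, labels: dict[str, str]) -> None:
        self.entries[endpoint] = labels

    def deregister(self, endpoint: str) -> None:
        # Deregistering an unknown endpoint is a no-op, so retries are safe.
        self.entries.pop(endpoint, None)


@dataclass
class Route:
    """Hypothetical minimal view of a Route: where it lives and its health."""

    host: str
    port: int
    session_id: str
    healthy: bool = True

    @property
    def endpoint(self) -> str:
        return f"{self.host}:{self.port}"


class CircuitMetricsHooks:
    """Registers and deregisters Route endpoints on Circuit lifecycle events."""

    def __init__(self, registry: InMemoryRegistry, worker_authority: str) -> None:
        self._registry = registry
        self._worker_authority = worker_authority
        # Remember what each Circuit registered so deletion can undo it.
        self._by_circuit: dict[str, set[str]] = {}

    def on_circuit_created(self, circuit_id: str, routes: list[Route]) -> None:
        endpoints: set[str] = set()
        for route in routes:
            if not route.healthy:
                continue  # only healthy Routes are exposed to Prometheus
            labels = {
                "circuit_id": circuit_id,
                "session_id": route.session_id,
                "worker_authority": self._worker_authority,
            }
            self._registry.register(route.endpoint, labels)
            endpoints.add(route.endpoint)
        self._by_circuit[circuit_id] = endpoints

    def on_circuit_deleted(self, circuit_id: str) -> None:
        for endpoint in self._by_circuit.pop(circuit_id, set()):
            self._registry.deregister(endpoint)
```

Tracking registered endpoints per Circuit means deletion does not need to re-read the Circuit's Routes, which may already be gone by the time the delete event fires.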

Out of Scope

  • Changes to Prometheus configuration (already supports HTTP SD)
  • Changes to Traefik configuration (already exposes /metrics)
  • Grafana dashboard creation (separate task)

Acceptance Criteria

  • When AppProxy Coordinator starts, it registers its /metrics endpoint with service discovery
  • When AppProxy Worker starts with Traefik enabled, it registers Traefik's /metrics endpoint
    with service discovery
  • When a Circuit is created via Traefik backend, all healthy Route metrics endpoints are
    registered with service discovery
  • When a Circuit's Routes are updated, service discovery registrations are updated accordingly
  • When a Circuit is deleted, its Route metrics endpoints are deregistered from service discovery
  • Prometheus can discover and scrape all registered endpoints via HTTP SD
  • Metrics include labels for circuit_id, session_id, and worker_authority
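
The "updated accordingly" criterion can be read as a reconciliation step: compare what is currently registered against what the Circuit's healthy Routes require, then apply only the difference. The sketch below assumes registrations can be modelled as a mapping from endpoint to labels; `plan_sync` is a hypothetical helper, not an existing function.

```python
Labels = dict[str, str]


def plan_sync(
    registered: dict[str, Labels],
    desired: dict[str, Labels],
) -> tuple[dict[str, Labels], set[str]]:
    """Compute the registrations to apply and the endpoints to drop.

    An endpoint is (re-)registered when it is new or its labels changed,
    and deregistered when it no longer appears in the desired state.
    """
    to_register = {
        endpoint: labels
        for endpoint, labels in desired.items()
        if registered.get(endpoint) != labels
    }
    to_deregister = set(registered) - set(desired)
    return to_register, to_deregister
```

For example, if a Circuit had Routes at 10.0.0.5:8000 and 10.0.0.6:8000 and an update replaces the second with 10.0.0.7:8000, the plan registers only the new endpoint and deregisters only the removed one, leaving the unchanged Route untouched.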

Technical Notes

  • Leverage existing ServiceDiscovery interface (src/ai/backend/common/service_discovery/)
  • Use ModelServiceMetadata for Route metrics registration (already used by Manager's
    RouteExecutor)
  • Reference Manager's sync_service_discovery() implementation in
    src/ai/backend/manager/sokovan/deployment/route/executor.py

Related Components

  • AppProxy Coordinator: src/ai/backend/appproxy/coordinator/
  • AppProxy Worker: src/ai/backend/appproxy/worker/
  • Service Discovery: src/ai/backend/common/service_discovery/
  • Prometheus HTTP SD endpoint: GET /metrics/service_discovery (Manager)
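
When verifying the result end to end, it may help to check that the body returned by the HTTP SD endpoint has the shape Prometheus expects: a JSON list of groups, each with a `targets` list of strings and an optional `labels` object of string values. The checker below is a sketch for use in tests; `validate_http_sd` is a hypothetical helper, and fetching the body from the Manager is deliberately left out.

```python
import json


def validate_http_sd(body: str) -> list[str]:
    """Return a list of problems; an empty list means the shape looks valid."""
    try:
        groups = json.loads(body)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(groups, list):
        return ["top-level value must be a JSON list"]

    problems: list[str] = []
    for index, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append(f"group {index}: must be an object")
            continue
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(
            isinstance(t, str) for t in targets
        ):
            problems.append(f"group {index}: 'targets' must be a list of strings")
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(
            isinstance(v, str) for v in labels.values()
        ):
            problems.append(f"group {index}: 'labels' must map strings to strings")
    return problems
```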

JIRA Issue: BA-4038