Observability

AuthTranslator surfaces health probes, Prometheus metrics, and structured logs out‑of‑the‑box so you can plug it into your existing monitoring stack with minimal fuss.


1  Endpoints

| Path | Method | Purpose | Typical probe |
| --- | --- | --- | --- |
| /_at_internal/healthz | GET | Liveness: returns 200 OK once the HTTP server is up. No external deps are checked. | Kubernetes livenessProbe every 10 s |
| /_at_internal/metrics | GET | Exposes Prometheus text format. Includes Go runtime metrics and AuthTranslator‑specific counters. | Prometheus scrape_interval 15 s |

The health endpoint is always available and returns an X-Last-Reload header showing the most recent configuration reload time. The metrics endpoint is exposed by default but can be disabled with -enable-metrics=false. To require HTTP Basic credentials on the metrics endpoint, provide both -metrics-user and -metrics-pass; supplying only one of the two causes the service to exit on startup.
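
As a rough illustration of the credential behaviour described above, a scraper-side request for the metrics endpoint could be built like this. It is a minimal sketch using only the Python standard library; the function name metrics_request, the base URL, and the credentials are placeholders chosen for the example, not values AuthTranslator ships with.

```python
import base64
import urllib.request


def metrics_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for the metrics endpoint with HTTP Basic credentials.

    The user and password are assumed to match the values passed to
    -metrics-user and -metrics-pass when the proxy was started.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    url = f"{base_url.rstrip('/')}/_at_internal/metrics"
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Basic {token}")
    return req
```

Passing the returned object to urllib.request.urlopen performs the actual scrape. The same pattern without the Authorization header works for /_at_internal/healthz, where the X-Last-Reload value can be read from the response headers.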

Any non‑2xx response includes an X-AT-Upstream-Error header. true means the error came from the upstream service. false indicates AuthTranslator generated the response. When the proxy generates a 4xx or 5xx reply it also sets X-AT-Error-Reason with a short explanation such as "integration not found", "authentication failed", "caller rate limited", "integration rate limited", or "no proxy configured".
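
A client or test harness can use these two headers to decide who produced a failure. The sketch below is one possible way to do that, assuming the headers arrive as a plain mapping; classify_response is a hypothetical helper, not part of AuthTranslator.

```python
from typing import Mapping


def classify_response(status: int, headers: Mapping[str, str]) -> str:
    """Describe where a response came from, based on the X-AT-* headers."""
    if 200 <= status < 300:
        return "ok"
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    upstream = lowered.get("x-at-upstream-error", "").strip().lower()
    if upstream == "true":
        return "upstream error"
    if upstream == "false":
        reason = lowered.get("x-at-error-reason", "no reason given")
        return f"proxy error: {reason}"
    # Header missing or unexpected value: the origin cannot be determined.
    return "unknown origin"
```

For example, a 429 carrying X-AT-Upstream-Error: false and X-AT-Error-Reason: caller rate limited is reported as a proxy error with that reason, while a 502 with X-AT-Upstream-Error: true is attributed to the upstream service.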


2  Metrics cheat‑sheet

The exact metric list is taken from code; field names below match what ships today.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| authtranslator_requests_total | counter | integration | Total requests handled per integration, including local rejections. Requests that do not match a configured integration are labeled unknown. |
| authtranslator_upstream_responses_total | counter | integration, code | HTTP status codes returned by upstreams. |
| authtranslator_upstream_roundtrip_duration_seconds | histogram | integration | Time from proxy handoff until the upstream response is received. |
| authtranslator_end_to_end_duration_seconds | histogram | integration | Full request latency from handler entry until AuthTranslator finishes responding. |
| authtranslator_pre_proxy_duration_seconds | histogram | integration | Request-side processing time inside AuthTranslator before proxy handoff or a local response. |
| authtranslator_response_processing_duration_seconds | histogram | integration | Response-side processing time inside AuthTranslator after an upstream response is received. |
| authtranslator_rate_limit_events_total | counter | integration | Incremented when a request is rejected with 429. |
| authtranslator_auth_failures_total | counter | integration | Incoming and outgoing auth plugin failures. |
| authtranslator_internal_responses_total | counter | integration, code, reason | Proxy-generated non-upstream responses grouped by coarse reason. |
| authtranslator_last_reload | gauge | (none) | Timestamp of the most recent configuration reload. |

The reason label on authtranslator_internal_responses_total uses bounded categories such as integration_not_found, incoming_auth_failure, caller_rate_limited, integration_rate_limited, invalid_destination, and no_proxy_configured.
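
To spot-check these counters without a full Prometheus server, the text exposition can be parsed directly. The following is a minimal sketch that handles only simple sample lines; the helper name sum_by_label and the sample values are invented for illustration, and a production parser should rely on an official Prometheus client library instead.

```python
import re
from collections import defaultdict

# Matches sample lines such as: metric_name{label="value",...} 12.5
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(text: str, metric: str, label: str) -> dict[str, float]:
    """Sum the samples of one metric, grouped by the value of one label."""
    totals: dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


# Invented scrape output, shaped like the metrics in the table above.
SAMPLE = """\
# TYPE authtranslator_upstream_responses_total counter
authtranslator_upstream_responses_total{code="200",integration="slack"} 40
authtranslator_upstream_responses_total{code="502",integration="slack"} 2
authtranslator_upstream_responses_total{code="200",integration="github"} 7
"""

print(sum_by_label(SAMPLE, "authtranslator_upstream_responses_total", "integration"))
```

Grouping by code instead of integration gives a quick view of the upstream status distribution across all integrations.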

Missing a metric? Write a small metrics plugin to hook into requests and responses or open a PR—new counters are easy to wire in. WriteProm calls every registered plugin's own WriteProm method so any custom counters you output will appear alongside the built‑in ones. Plugins must manage their own state (typically in memory). See Metrics Plugins for a primer.


3  Prometheus scrape example

scrape_configs:
  - job_name: "authtranslator"
    metrics_path: "/_at_internal/metrics"
    static_configs:
      - targets: ["authtranslator.default.svc.cluster.local:8080"]

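When running multiple replicas behind a Service or Load Balancer, prefer the Prometheus ServiceMonitor CRD (Kube‑Prometheus stack) or Kubernetes pod service discovery, so that each replica is scraped individually instead of whichever pod the load balancer happens to select.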


4  Grafana jump-start

Use these sample PromQL queries to bootstrap your dashboard panels. They assume you scrape the metrics under the default job label authtranslator.

| Panel idea | PromQL | Why it helps |
| --- | --- | --- |
| Request rate per integration | sum(rate(authtranslator_requests_total{job="authtranslator"}[5m])) by (integration) | Highlights traffic leaders, local rejection spikes, and sudden drops. |
| Error ratio | sum(rate(authtranslator_upstream_responses_total{job="authtranslator",code=~"5.."}[5m])) / sum(rate(authtranslator_upstream_responses_total{job="authtranslator"}[5m])) | Surfaces spikes in upstream failures. |
| 95th percentile total latency | histogram_quantile(0.95, sum(rate(authtranslator_end_to_end_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration)) | Tracks the latency callers actually experience. |
| 95th percentile upstream latency | histogram_quantile(0.95, sum(rate(authtranslator_upstream_roundtrip_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration)) | Separates upstream slowness from local proxy overhead. |
| 95th percentile pre-proxy latency | histogram_quantile(0.95, sum(rate(authtranslator_pre_proxy_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration)) | Shows request-side latency introduced by AuthTranslator before proxying. |
| Rate-limit rejections | sum(rate(authtranslator_rate_limit_events_total{job="authtranslator"}[5m])) by (integration) | Shows when callers are constrained and need more quota. |
| Internal failures by reason | sum(rate(authtranslator_internal_responses_total{job="authtranslator"}[5m])) by (integration, reason) | Separates local proxy rejections from upstream failures. |

You can convert the table above into Grafana time-series panels by pasting the queries into new panels and turning on Legend → {{integration}}. For a snapshot of current conditions, duplicate the panels and switch the visualization type to Stat.
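
The latency panels rely on histogram_quantile, which estimates a percentile from cumulative bucket counts rather than from raw observations. The sketch below mirrors the general approach (locate the bucket containing the target rank, then interpolate linearly inside it) for classic histograms with non-negative bounds. It is a simplified illustration, not Prometheus's actual implementation, and the bucket boundaries and counts are invented; AuthTranslator's real bucket layout may differ.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan  # needs a +Inf bucket plus at least one finite bucket
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        return buckets[-2][0]  # rank lands in +Inf: report the highest finite bound
    upper, count = buckets[idx]
    lower, below = buckets[idx - 1] if idx > 0 else (0.0, 0.0)
    if count == below:
        return lower  # degenerate empty first bucket, only reachable when q == 0
    # Assume observations are spread evenly inside the bucket and interpolate.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Invented cumulative counts for le = 0.1, 0.5, 1.0 and +Inf (seconds).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, BUCKETS))
```

With these sample buckets, rank 95 falls in the 0.5 to 1.0 bucket, so the estimate is 0.5 + 0.5 × 5/8, roughly 0.81 s. The result can never be more precise than the bucket boundaries allow, which is worth keeping in mind when reading the p95 panels.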


5  Structured logs

The proxy logs in structured text by default. Pass -log-format json to emit JSON using Go’s slog. Fields:

| Key | Example | Meaning |
| --- | --- | --- |
| level | INFO / WARN / ERROR | Log severity |
| msg | "incoming request" / "upstream response" | Log message |
| integration | "slack" | Integration block name |
| caller_id | "user-123" | Identifier from incoming plugin |
| method | "POST" | HTTP method (request log) |
| path | "/api/chat.postMessage" | Request path (request log) |
| status | 200 | Upstream status code (response log) |

Sample lines (one request log and one response log):

{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"incoming request","method":"POST","integration":"slack","path":"/api/chat.postMessage","caller_id":"user-123"}
{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"upstream response","integration":"slack","status":200}
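
Because each JSON log entry is a single self-contained object, ad-hoc analysis needs little more than a JSON parser. The sketch below counts upstream responses per integration and status code using the two sample lines above as input; upstream_status_counts is a hypothetical helper written for this example.

```python
import json
from collections import Counter

# The two sample log lines shown above.
LOG_LINES = [
    '{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"incoming request",'
    '"method":"POST","integration":"slack","path":"/api/chat.postMessage",'
    '"caller_id":"user-123"}',
    '{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"upstream response",'
    '"integration":"slack","status":200}',
]


def upstream_status_counts(lines):
    """Count "upstream response" entries per (integration, status) pair."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON lines mixed into the stream
        if not isinstance(record, dict) or record.get("msg") != "upstream response":
            continue
        counts[(record.get("integration"), record.get("status"))] += 1
    return counts


print(upstream_status_counts(LOG_LINES))
```

The same loop can read from a file or from standard input, since both yield one line at a time.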

Log level

  • Default: INFO
  • Override: run the proxy with -log-level DEBUG (adds request/response headers—secrets redacted)

6  Alerting pointers

| Alert | Expression | Rationale |
| --- | --- | --- |
| High upstream 5xx rate | sum(rate(authtranslator_upstream_responses_total{code=~"5.."}[5m])) > 0.1 | Upstream failures or mis‑config. |
| Prolonged rate‑limit hits | increase(authtranslator_rate_limit_events_total[5m]) > 100 | Callers need higher quota. |
| Health endpoint down | Blackbox probe against /_at_internal/healthz fails | Pod crash or network break. |

Tune thresholds to your traffic patterns.
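
When picking a threshold for the rate-limit alert, it helps to see what increase() measures. The sketch below is a simplified model: it sums the growth between consecutive counter samples and treats any drop as a counter reset (for example after a pod restart). Prometheus additionally extrapolates to the edges of the time window, which this example deliberately omits; the function names and sample values are illustrative only.

```python
def counter_increase(samples: list[float]) -> float:
    """Total growth across time-ordered counter samples, treating drops as resets."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        # A counter only decreases when the process restarted and began again at zero,
        # so after a drop the whole current value counts as new growth.
        total += curr if curr < prev else curr - prev
    return total


def rate_limit_alert(samples: list[float], threshold: float = 100) -> bool:
    """Return True when the increase over the sampled window exceeds the threshold."""
    return counter_increase(samples) > threshold


# Invented samples of authtranslator_rate_limit_events_total over a 5-minute window.
print(rate_limit_alert([10, 60, 130]))    # steady growth of 120 events
print(rate_limit_alert([10, 60, 5, 20]))  # includes a restart between 60 and 5
```

Replaying recent counter samples through a helper like this gives a quick sense of how often a candidate threshold would have fired.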