Observability

AuthTranslator surfaces health probes, Prometheus metrics, and structured logs out‑of‑the‑box so you can plug it into your existing monitoring stack with minimal fuss.

1 Endpoints

| Path | Method | Purpose | Typical probe |
|---|---|---|---|
| `/_at_internal/healthz` | `GET` | Liveness: returns 200 OK once the HTTP server is up. No external deps are checked. | Kubernetes `livenessProbe` every 10 s |
| `/_at_internal/metrics` | `GET` | Exposes Prometheus text format. Includes Go runtime metrics and AuthTranslator‑specific counters. | Prometheus `scrape_interval` 15 s |

The health endpoint is always available and returns an `X-Last-Reload` header showing the time of the most recent configuration reload. The metrics endpoint is exposed by default and can be disabled with `-enable-metrics=false`. Provide both `-metrics-user` and `-metrics-pass` to require HTTP Basic credentials; supplying only one of the two causes the service to exit on startup.
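As a rough illustration of the two endpoints, the sketch below probes the health endpoint and builds a metrics request with HTTP Basic credentials. The function names, the base URL, and the credentials are hypothetical, and it assumes the credentials set via `-metrics-user` and `-metrics-pass` apply to the metrics endpoint. The format of the `X-Last-Reload` value is not specified here, so the header is returned as an opaque string.

```python
import base64
import urllib.request

HEALTH_PATH = "/_at_internal/healthz"
METRICS_PATH = "/_at_internal/metrics"


def check_health(base_url, timeout=2.0):
    """Return (healthy, last_reload) for one AuthTranslator instance.

    last_reload is the raw X-Last-Reload header value, or None if absent.
    """
    try:
        with urllib.request.urlopen(base_url + HEALTH_PATH, timeout=timeout) as resp:
            return resp.status == 200, resp.headers.get("X-Last-Reload")
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False, None


def metrics_request(base_url, user=None, password=None):
    """Build a GET request for the metrics endpoint.

    The Authorization header is added only when both credentials are given,
    mirroring the rule that the two flags must be supplied together.
    """
    req = urllib.request.Request(base_url + METRICS_PATH)
    if user is not None and password is not None:
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        req.add_header("Authorization", f"Basic {token}")
    return req
```

For example, `check_health("http://localhost:8080")` could back a simple external probe, and `urllib.request.urlopen(metrics_request("http://localhost:8080", "prom", "s3cret"))` would fetch the metrics text, assuming an instance is listening at that address with those credentials.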

Every non-2xx response includes an `X-AT-Upstream-Error` header: `true` means the error came from the upstream service, and `false` means AuthTranslator generated the response itself. When the proxy generates a 4xx or 5xx reply, it also sets `X-AT-Error-Reason` to a short explanation such as "integration not found", "authentication failed", "caller rate limited", "integration rate limited", or "no proxy configured".
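A client can use these headers to decide whether a failure is worth retrying or reporting upstream. The following is a minimal sketch, not part of AuthTranslator: `classify_error` is a hypothetical helper, and it assumes the header value is the literal string `true` or `false` (compared case-insensitively here).

```python
def classify_error(status, headers):
    """Classify a response as ("ok" | "upstream" | "proxy" | "unknown", reason).

    `headers` is any mapping of header names to values; names are matched
    case-insensitively. `reason` is only set for proxy-generated errors.
    """
    if 200 <= status < 300:
        return ("ok", None)
    lowered = {name.lower(): value for name, value in headers.items()}
    origin = lowered.get("x-at-upstream-error", "").strip().lower()
    if origin == "true":
        return ("upstream", None)
    if origin == "false":
        return ("proxy", lowered.get("x-at-error-reason"))
    # Header missing or unrecognised: the response may not have passed
    # through AuthTranslator at all.
    return ("unknown", None)


print(classify_error(429, {"X-AT-Upstream-Error": "false",
                           "X-AT-Error-Reason": "caller rate limited"}))
```

Returning a separate `unknown` category keeps responses from other hops, such as a load balancer in front of the proxy, from being attributed to either side.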

2 Metrics cheat‑sheet

The exact metric list is taken from code; field names below match what ships today.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `authtranslator_requests_total` | counter | `integration` | Total requests handled per integration, including local rejections. Requests that do not match a configured integration are labeled `unknown`. |
| `authtranslator_upstream_responses_total` | counter | `integration`, `code` | HTTP status codes returned by upstreams. |
| `authtranslator_upstream_roundtrip_duration_seconds` | histogram | `integration` | Time from proxy handoff until the upstream response is received. |
| `authtranslator_end_to_end_duration_seconds` | histogram | `integration` | Full request latency from handler entry until AuthTranslator finishes responding. |
| `authtranslator_pre_proxy_duration_seconds` | histogram | `integration` | Request-side processing time inside AuthTranslator before proxy handoff or a local response. |
| `authtranslator_response_processing_duration_seconds` | histogram | `integration` | Response-side processing time inside AuthTranslator after an upstream response is received. |
| `authtranslator_rate_limit_events_total` | counter | `integration` | Incremented when a request is rejected with 429. |
| `authtranslator_auth_failures_total` | counter | `integration` | Incoming and outgoing auth plugin failures. |
| `authtranslator_internal_responses_total` | counter | `integration`, `code`, `reason` | Proxy-generated non-upstream responses grouped by coarse reason. |
| `authtranslator_last_reload` | gauge | – | Timestamp of the most recent configuration reload. |

The reason label on authtranslator_internal_responses_total uses bounded categories such as integration_not_found, incoming_auth_failure, caller_rate_limited, integration_rate_limited, invalid_destination, and no_proxy_configured.
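To make the label structure concrete, here is a small sketch that parses Prometheus text-format output and totals `authtranslator_internal_responses_total` by `reason`. The sample values and the `github` integration are made up for illustration, and the parser is deliberately simplified: it does not unescape label values or handle every corner of the exposition format.

```python
import re
from collections import defaultdict

# Illustrative scrape output; the numbers are invented for this example.
SAMPLE = """\
# TYPE authtranslator_internal_responses_total counter
authtranslator_internal_responses_total{integration="slack",code="429",reason="caller_rate_limited"} 12
authtranslator_internal_responses_total{integration="github",code="429",reason="caller_rate_limited"} 4
authtranslator_internal_responses_total{integration="slack",code="401",reason="incoming_auth_failure"} 3
authtranslator_requests_total{integration="slack"} 250
"""

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def internal_responses_by_reason(text):
    """Sum proxy-generated responses across integrations and codes, per reason."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "authtranslator_internal_responses_total":
            totals[labels.get("reason", "")] += value
    return dict(totals)


print(internal_responses_by_reason(SAMPLE))
# {'caller_rate_limited': 16.0, 'incoming_auth_failure': 3.0}
```

Because the `reason` values come from a bounded set, an aggregation like this stays small regardless of traffic, which is also what keeps the label safe to use in dashboards.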

Missing a metric? Write a small metrics plugin to hook into requests and responses, or open a PR; new counters are easy to wire in. `WriteProm` calls every registered plugin's own `WriteProm` method, so any custom counters you emit appear alongside the built-in ones. Plugins must manage their own state (typically in memory). See Metrics Plugins for a primer.

3 Prometheus scrape example

scrape_configs:
  - job_name: "authtranslator"
    metrics_path: "/_at_internal/metrics"
    static_configs:
      - targets: ["authtranslator.default.svc.cluster.local:8080"]

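When running multiple replicas behind a Service or load balancer, scrape each replica individually rather than the shared address; otherwise each scrape lands on an arbitrary replica. Prefer the Prometheus Operator ServiceMonitor CRD (included in the kube-prometheus stack) or Prometheus's built-in Kubernetes service discovery (`kubernetes_sd_configs`).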

4 Grafana jump-start

Use these sample PromQL queries to bootstrap your dashboard panels. They assume you scrape the metrics under the default job label authtranslator.

| Panel idea | PromQL | Why it helps |
|---|---|---|
| Request rate per integration | `sum(rate(authtranslator_requests_total{job="authtranslator"}[5m])) by (integration)` | Highlights traffic leaders, local rejection spikes, and sudden drops. |
| Error ratio | `sum(rate(authtranslator_upstream_responses_total{job="authtranslator",code=~"5.."}[5m])) / sum(rate(authtranslator_upstream_responses_total{job="authtranslator"}[5m]))` | Surfaces spikes in upstream failures. |
| 95th percentile total latency | `histogram_quantile(0.95, sum(rate(authtranslator_end_to_end_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration))` | Tracks the latency callers actually experience. |
| 95th percentile upstream latency | `histogram_quantile(0.95, sum(rate(authtranslator_upstream_roundtrip_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration))` | Separates upstream slowness from local proxy overhead. |
| 95th percentile pre-proxy latency | `histogram_quantile(0.95, sum(rate(authtranslator_pre_proxy_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration))` | Shows request-side latency introduced by AuthTranslator before proxying. |
| Rate-limit rejections | `sum(rate(authtranslator_rate_limit_events_total{job="authtranslator"}[5m])) by (integration)` | Shows when callers are constrained and need more quota. |
| Internal failures by reason | `sum(rate(authtranslator_internal_responses_total{job="authtranslator"}[5m])) by (integration, reason)` | Separates local proxy rejections from upstream failures. |

You can convert the table above into Grafana time-series panels by pasting the queries into new panels and turning on Legend → {{integration}}. For a snapshot of current conditions, duplicate the panels and switch the visualization type to Stat.
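Before building panels, it can help to sanity-check a query outside Grafana. The sketch below runs the request-rate query through the Prometheus HTTP API (`/api/v1/query`) and maps each `integration` to its current value. The helper names and the Prometheus address in the usage note are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

REQUEST_RATE = (
    'sum(rate(authtranslator_requests_total{job="authtranslator"}[5m])) '
    "by (integration)"
)


def query_url(prometheus_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + params


def parse_vector(payload):
    """Map each series' `integration` label to its current sample value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    return {
        series["metric"].get("integration", ""): float(series["value"][1])
        for series in data["result"]
    }


def instant_query(prometheus_url, promql, timeout=5.0):
    """Run an instant query and return {integration: value}."""
    url = query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

With a Prometheus server reachable at, say, `http://prometheus:9090`, calling `instant_query("http://prometheus:9090", REQUEST_RATE)` returns one entry per integration, which should match what the corresponding panel shows at the same instant.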

5 Structured logs

The proxy logs in structured text by default. Pass -log-format json to emit JSON using Go’s slog. Fields:

| Key | Example | Meaning |
|---|---|---|
| `level` | `INFO` / `WARN` / `ERROR` | Log severity |
| `msg` | `"incoming request"` / `"upstream response"` | Log message |
| `integration` | `"slack"` | Integration block name |
| `caller_id` | `"user-123"` | Identifier from incoming plugin |
| `method` | `"POST"` | HTTP method (request log) |
| `path` | `"/api/chat.postMessage"` | Request path (request log) |
| `status` | `200` | Upstream status code (response log) |

Sample lines (one request log and one response log):

{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"incoming request","method":"POST","integration":"slack","path":"/api/chat.postMessage","caller_id":"user-123"}
{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"upstream response","integration":"slack","status":200}
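Since each JSON log line is a standalone object, ad hoc analysis needs very little code. As a minimal sketch, the hypothetical helper below counts upstream responses per integration and status code, using only the fields listed in the table and the two sample lines above as input.

```python
import json
from collections import Counter

SAMPLE_LOG = """\
{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"incoming request","method":"POST","integration":"slack","path":"/api/chat.postMessage","caller_id":"user-123"}
{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"upstream response","integration":"slack","status":200}
"""


def upstream_status_counts(lines):
    """Count "upstream response" records by (integration, status)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not JSON, e.g. text-format logs
        if not isinstance(record, dict):
            continue
        if record.get("msg") == "upstream response":
            counts[(record.get("integration"), record.get("status"))] += 1
    return counts


print(upstream_status_counts(SAMPLE_LOG.splitlines()))
# Counter({('slack', 200): 1})
```

The same loop works on a stream, for example by passing `sys.stdin` instead of the sample, because it processes one line at a time.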

Log level

Default: INFO
Override: run the proxy with -log-level DEBUG (adds request/response headers, with secrets redacted)

6 Alerting pointers

| Alert | Expression | Rationale |
|---|---|---|
| High upstream 5xx rate | `sum(rate(authtranslator_upstream_responses_total{code=~"5.."}[5m])) > 0.1` | Upstream failures or mis‑config. |
| Prolonged rate‑limit hits | `increase(authtranslator_rate_limit_events_total[5m]) > 100` | Callers need higher quota. |
| Health endpoint down | Blackbox probe against `/_at_internal/healthz` fails | Pod crash or network break. |

Tune thresholds to your traffic patterns.
