AuthTranslator surfaces health probes, Prometheus metrics, and structured logs out‑of‑the‑box so you can plug it into your existing monitoring stack with minimal fuss.
| Path | Method | Purpose | Typical probe |
|---|---|---|---|
| /_at_internal/healthz | GET | Liveness: returns 200 OK once the HTTP server is up. No external deps are checked. | Kubernetes livenessProbe every 10 s |
| /_at_internal/metrics | GET | Exposes Prometheus text format. Includes Go runtime metrics and AuthTranslator‑specific counters. | Prometheus scrape_interval 15 s |

The health endpoint is always available and returns an X-Last-Reload header
showing the most recent configuration reload time. The metrics endpoint is
exposed by default but can be disabled with -enable-metrics=false. Provide
both -metrics-user and -metrics-pass to require HTTP Basic
credentials; supplying only one of the two causes the service to exit on startup.
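As a client-side illustration of the credential rule above, the following Python sketch builds (without sending) a GET request for the metrics endpoint and attaches HTTP Basic credentials only when both values are supplied. The function name `build_metrics_request`, the base URL, and the credentials are assumptions made for the example; they are not part of AuthTranslator.

```python
import base64
import urllib.request


def build_metrics_request(base_url, user=None, password=None):
    """Build (but do not send) a GET request for the metrics endpoint.

    Hypothetical helper: mirrors the proxy's "both or neither" rule on the
    client side so a half-configured scraper fails early.
    """
    if (user is None) != (password is None):
        raise ValueError("provide both user and password, or neither")

    url = base_url.rstrip("/") + "/_at_internal/metrics"
    req = urllib.request.Request(url)  # no body, so the method is GET

    if user is not None:
        # HTTP Basic: base64("user:password") in the Authorization header.
        raw = f"{user}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        req.add_header("Authorization", f"Basic {token}")
    return req
```

The returned object can be passed to `urllib.request.urlopen(req, timeout=5)` to perform the scrape against a running instance.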
Any non‑2xx response includes an X-AT-Upstream-Error header: true means
the error came from the upstream service, and false means AuthTranslator
generated the response itself. When the proxy generates a 4xx or 5xx reply, it also
sets X-AT-Error-Reason with a short explanation such as "integration not found",
"authentication failed", "caller rate limited", "integration rate limited", or "no proxy configured".
The exact metric list is taken from code; field names below match what ships today.
| Metric | Type | Labels | Description |
|---|---|---|---|
| authtranslator_requests_total | counter | integration | Total requests handled per integration, including local rejections. Requests that do not match a configured integration are labeled unknown. |
| authtranslator_upstream_responses_total | counter | integration, code | HTTP status codes returned by upstreams. |
| authtranslator_upstream_roundtrip_duration_seconds | histogram | integration | Time from proxy handoff until the upstream response is received. |
| authtranslator_end_to_end_duration_seconds | histogram | integration | Full request latency from handler entry until AuthTranslator finishes responding. |
| authtranslator_pre_proxy_duration_seconds | histogram | integration | Request-side processing time inside AuthTranslator before proxy handoff or a local response. |
| authtranslator_response_processing_duration_seconds | histogram | integration | Response-side processing time inside AuthTranslator after an upstream response is received. |
| authtranslator_rate_limit_events_total | counter | integration | Incremented when a request is rejected with 429. |
| authtranslator_auth_failures_total | counter | integration | Incoming and outgoing auth plugin failures. |
| authtranslator_internal_responses_total | counter | integration, code, reason | Proxy-generated non-upstream responses grouped by coarse reason. |
| authtranslator_last_reload | gauge | – | Timestamp of the most recent configuration reload. |

The reason label on authtranslator_internal_responses_total uses bounded categories such as integration_not_found, incoming_auth_failure, caller_rate_limited, integration_rate_limited, invalid_destination, and no_proxy_configured.
Missing a metric? Write a small metrics plugin to hook into requests and responses or open a PR—new counters are easy to wire in. WriteProm calls every registered plugin's own WriteProm method so any custom counters you output will appear alongside the built‑in ones. Plugins must manage their own state (typically in memory). See Metrics Plugins for a primer.
```yaml
scrape_configs:
  - job_name: "authtranslator"
    metrics_path: "/_at_internal/metrics"
    static_configs:
      - targets: ["authtranslator.default.svc.cluster.local:8080"]
```

When running multiple replicas behind a Service or load balancer, a single static target reaches only one replica per scrape, so prefer the Prometheus ServiceMonitor CRD (Kube‑Prometheus stack) or scrape via the node exporter.
Use these sample PromQL queries to bootstrap your dashboard panels. They
assume you scrape the metrics under the default job label
authtranslator.
| Panel idea | PromQL | Why it helps |
|---|---|---|
| Request rate per integration | sum(rate(authtranslator_requests_total{job="authtranslator"}[5m])) by (integration) | Highlights traffic leaders, local rejection spikes, and sudden drops. |
| Error ratio | sum(rate(authtranslator_upstream_responses_total{job="authtranslator",code=~"5.."}[5m]))/sum(rate(authtranslator_upstream_responses_total{job="authtranslator"}[5m])) | Surfaces spikes in upstream failures. |
| 95th percentile total latency | histogram_quantile(0.95, sum(rate(authtranslator_end_to_end_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration)) | Tracks the latency callers actually experience. |
| 95th percentile upstream latency | histogram_quantile(0.95, sum(rate(authtranslator_upstream_roundtrip_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration)) | Separates upstream slowness from local proxy overhead. |
| 95th percentile pre-proxy latency | histogram_quantile(0.95, sum(rate(authtranslator_pre_proxy_duration_seconds_bucket{job="authtranslator"}[5m])) by (le, integration)) | Shows request-side latency introduced by AuthTranslator before proxying. |
| Rate-limit rejections | sum(rate(authtranslator_rate_limit_events_total{job="authtranslator"}[5m])) by (integration) | Shows when callers are constrained and need more quota. |
| Internal failures by reason | sum(rate(authtranslator_internal_responses_total{job="authtranslator"}[5m])) by (integration, reason) | Separates local proxy rejections from upstream failures. |

You can convert the table above into Grafana time-series panels by pasting the
queries into new panels and turning on Legend → {{integration}}. For a
snapshot of current conditions, duplicate the panels and switch the visualization
type to Stat.
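For a quick sanity check outside Prometheus, the Error ratio query in the table above can be roughly approximated from a single scrape of /_at_internal/metrics. The Python sketch below parses the text exposition format and divides the cumulative 5xx count by all upstream responses. It rests on stated assumptions: it uses lifetime counter totals rather than a 5-minute rate(), so it will not match the PromQL result exactly, and the function name `upstream_5xx_ratio` is hypothetical.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces, allowing backslash escapes in the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

METRIC = "authtranslator_upstream_responses_total"


def upstream_5xx_ratio(exposition_text):
    """Share of upstream responses with a 5xx code, from cumulative counters."""
    total = 0.0
    errors = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        value = float(match.group("value"))
        total += value
        # Same selection as code=~"5.." in the PromQL query.
        if re.fullmatch(r"5..", labels.get("code", "")):
            errors += value
    return errors / total if total else 0.0
```

Pass it the response body of a metrics scrape. A result that keeps climbing between scrapes is a hint to look at the rate-based panel for the affected integration.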
The proxy logs in structured text by default. Pass
-log-format json to emit JSON using Go’s slog. Fields:
| Key | Example | Meaning |
|---|---|---|
| level | INFO / WARN / ERROR | Log severity |
| msg | "incoming request" / "upstream response" | Log message |
| integration | "slack" | Integration block name |
| caller_id | "user-123" | Identifier from incoming plugin |
| method | "POST" | HTTP method (request log) |
| path | "/api/chat.postMessage" | Request path (request log) |
| status | 200 | Upstream status code (response log) |

Sample lines:

```json
{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"incoming request","method":"POST","integration":"slack","path":"/api/chat.postMessage","caller_id":"user-123"}
{"time":"2025-05-29T07:00:12Z","level":"INFO","msg":"upstream response","integration":"slack","status":200}
```

- Default log level: INFO
- Override: run the proxy with -log-level DEBUG (adds request/response headers, with secrets redacted)
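When the proxy runs with -log-format json, each log line is a standalone JSON object, so it can be post-processed with any JSON parser. The following Python sketch reads such lines and counts "upstream response" records per integration and status. It assumes only the field names listed in the table above; the function names are hypothetical, and lines that are not JSON objects are skipped rather than treated as errors.

```python
import json


def parse_log_lines(lines):
    """Parse JSON log lines, skipping blanks and anything that is not a JSON object."""
    records = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # e.g. a stray text-format line
        if isinstance(record, dict):
            records.append(record)
    return records


def count_upstream_statuses(records):
    """Count "upstream response" records per (integration, status) pair."""
    counts = {}
    for record in records:
        if record.get("msg") != "upstream response":
            continue
        key = (record.get("integration"), record.get("status"))
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Because `parse_log_lines` accepts any iterable of strings, an open file handle or `sys.stdin` can be passed directly.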
| Alert | Expression | Rationale |
|---|---|---|
| High upstream 5xx rate | sum(rate(authtranslator_upstream_responses_total{code=~"5.."}[5m])) > 0.1 | Upstream failures or mis‑config. |
| Prolonged rate‑limit hits | increase(authtranslator_rate_limit_events_total[5m]) > 100 | Callers need higher quota. |
| Health endpoint down | Blackbox probe against /_at_internal/healthz fails | Pod crash or network break. |

Tune thresholds to your traffic patterns.