Problem
A misconfigured observability exporter fails silently from the operator's point of view. Delivery failures (unset OBJSTORE_CRED_* / SLS_CRED_* / DD_CRED_* env vars, wrong endpoint, revoked key) surface only in DP logs; the dashboard shows the exporter as enabled forever. The docs promise "the sink reports unhealthy" — but that state never leaves the DP process.
What exists today (origin/main)
- crates/aisix-obs/src/pipeline.rs: SinkStatsSnapshot { sent, dropped, retries, failed_batches, last_error } is already tracked per sink in-process, with masked last_error (≤200 chars). Never transmitted anywhere.
- crates/aisix-obs/src/sink/mod.rs: ObservabilitySink::healthcheck() -> SinkHealth { healthy, detail } exists, but is a stub (always healthy) for otlp_http / sls / datadog; only object_store has a real connectivity probe.
- crates/aisix-server/src/telemetry.rs: a periodic mTLS reporting loop to the CP already exists (usage events, flush at 100 events / 5s to /dp/telemetry), and a separate heartbeat worker POSTs /dp/heartbeat with {dp_id, uptime_seconds, version, rejected_resources}.
- Config arrives via the kine/etcd watch (/aisix/<env>/observability_exporters/<id>), so the DP already has a CP→DP command path it watches.
Proposed design
- Passive health reporting (the core). Extend the heartbeat payload with a per-exporter block derived from SinkStatsSnapshot:
"exporter_health": [
{ "exporter_id": "…", "healthy": true, "sent": 1234, "dropped": 0,
"failed_batches": 0, "last_error": null, "last_success_at": "…" }
]
Health is derived from delivery outcomes (e.g. unhealthy when the most recent batch failed permanently or N consecutive batches failed), so it adds no extra network traffic against the customer's target.
- On-demand probe ("Send test event"). CP writes a probe request under a kine prefix the DP already watches (e.g. /aisix/<env>/observability_probes/<probe_id> carrying exporter_id); the DP executes ONE synthetic delivery through the real sink (resolving credential_ref locally as usual) and reports {probe_id, ok, error} in the next heartbeat. Probe records are short-lived (CP deletes after terminal state).
- Keep healthcheck() stubs as-is or implement them via the probe path. A separate always-on prober is NOT needed once passive health reporting and the on-demand probe exist.
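The passive health rule in the first item (unhealthy when the most recent batch failed permanently, or when N consecutive batches failed) could be sketched roughly as below. This is an illustration under assumptions, not code from aisix-obs: `BatchOutcome`, `HealthTracker`, and the `max_consecutive_failures` parameter are hypothetical names, and the real derivation would presumably read from SinkStatsSnapshot instead.

```rust
/// Outcome of one batch delivery attempt (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum BatchOutcome {
    Delivered,
    /// Failed, but a retry may still succeed (e.g. a timeout).
    FailedTransient,
    /// Failed for good (e.g. credentials rejected by the target).
    FailedPermanent,
}

/// Hypothetical per-exporter tracker; not a type from the codebase.
#[derive(Debug, Default)]
struct HealthTracker {
    consecutive_failures: u32,
    last_was_permanent: bool,
}

impl HealthTracker {
    /// Update the tracker from an outcome the pipeline already observes,
    /// so no extra request is sent to the customer's target.
    fn record(&mut self, outcome: BatchOutcome) {
        match outcome {
            BatchOutcome::Delivered => {
                self.consecutive_failures = 0;
                self.last_was_permanent = false;
            }
            BatchOutcome::FailedTransient => {
                self.consecutive_failures += 1;
                self.last_was_permanent = false;
            }
            BatchOutcome::FailedPermanent => {
                self.consecutive_failures += 1;
                self.last_was_permanent = true;
            }
        }
    }

    /// Healthy unless the latest batch failed permanently or the failure
    /// streak has reached the threshold (the "N" in the text; assumed tunable).
    fn healthy(&self, max_consecutive_failures: u32) -> bool {
        !self.last_was_permanent && self.consecutive_failures < max_consecutive_failures
    }
}
```

With a threshold of 3, two transient failures leave the exporter healthy, a third flips it to unhealthy, and a single successful delivery resets it. Whether a success should clear the state immediately is a design choice this sketch assumes rather than one the issue specifies.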
Security constraints (unchanged invariants)
- The CP never connects to the customer's telemetry target; only the DP delivers (probe included).
- No credential material ever leaves the DP: last_error stays masked, and the probe result carries no request/credential detail.
- The synthetic probe event must contain no end-user prompt/response content.
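As a rough sketch of how these constraints might shape the probe path, the example below runs one synthetic delivery and masks the error before it is reported. Everything here is assumed for illustration: `ProbeSink`, `ProbeResult`, `mask_error`, `run_probe`, and `FailingSink` are hypothetical names, the masking strategy (replace known secret values, cap at 200 characters) is a guess at the behaviour behind the "≤200 chars" note above, and `probe_id` is assumed to need no JSON escaping.

```rust
/// Cap on reported error length; the issue says masked last_error is <= 200 chars.
const MAX_ERROR_CHARS: usize = 200;

/// Replace every occurrence of each known secret value, then cap the length.
/// Hypothetical helper: the real masking logic may differ.
fn mask_error(raw: &str, secrets: &[&str]) -> String {
    let mut masked = raw.to_string();
    for secret in secrets.iter().filter(|s| !s.is_empty()) {
        masked = masked.replace(*secret, "***");
    }
    // Truncate by chars so a multi-byte UTF-8 sequence is never split.
    masked.chars().take(MAX_ERROR_CHARS).collect()
}

/// Stand-in for the real sink trait; only what the probe needs.
trait ProbeSink {
    fn deliver(&self, event: &str) -> Result<(), String>;
}

/// The shape reported in the next heartbeat: {probe_id, ok, error}.
#[derive(Debug, PartialEq)]
struct ProbeResult {
    probe_id: String,
    ok: bool,
    error: Option<String>,
}

/// Run exactly one synthetic delivery through the sink.
fn run_probe(sink: &dyn ProbeSink, probe_id: &str, secrets: &[&str]) -> ProbeResult {
    // Fixed synthetic payload: no prompt/response content, no credentials.
    let event = format!("{{\"kind\":\"probe\",\"probe_id\":\"{probe_id}\"}}");
    match sink.deliver(&event) {
        Ok(()) => ProbeResult { probe_id: probe_id.to_string(), ok: true, error: None },
        Err(e) => ProbeResult {
            probe_id: probe_id.to_string(),
            ok: false,
            error: Some(mask_error(&e, secrets)),
        },
    }
}

/// Test double whose error message leaks a secret value.
struct FailingSink;

impl ProbeSink for FailingSink {
    fn deliver(&self, _event: &str) -> Result<(), String> {
        Err("401 Unauthorized: api key s3cr3t-key rejected".to_string())
    }
}
```

Calling `run_probe(&FailingSink, "p-1", &["s3cr3t-key"])` yields a result with `ok == false` and the secret replaced by `***` in the error. The CP only ever sees that masked result, which keeps the delivery itself on the DP side.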
Out of scope
- CP-side persistence / dashboard UI (tracked in the AISIX-Cloud counterpart issue, linked below).
- Prometheus / OTLP-metrics egress for the DP itself.
Acceptance criteria
- […] exporter_health for every configured exporter (all four kinds), with masked last_error.
- […] probe_id.
- […] healthy=false with an actionable, masked error (e.g. names the missing env var; the var NAME is not a secret).
- […] object_store and datadog.