aisix-obs: report per-exporter delivery health to the control plane + on-demand probe delivery #583

Description

@moonming

Problem

A misconfigured observability exporter fails silently from the operator's point of view. Delivery failures (unset OBJSTORE_CRED_* / SLS_CRED_* / DD_CRED_* env vars, wrong endpoint, revoked key) surface only in DP logs; the dashboard shows the exporter as enabled forever. The docs promise "the sink reports unhealthy" — but that state never leaves the DP process.

What exists today (origin/main)

  • crates/aisix-obs/src/pipeline.rs: SinkStatsSnapshot { sent, dropped, retries, failed_batches, last_error } is already tracked per sink in-process, with masked last_error (≤200 chars). Never transmitted anywhere.
  • crates/aisix-obs/src/sink/mod.rs: ObservabilitySink::healthcheck() -> SinkHealth { healthy, detail } exists, but is a stub (always healthy) for otlp_http / sls / datadog; only object_store has a real connectivity probe.
  • crates/aisix-server/src/telemetry.rs — a periodic mTLS reporting loop to the CP already exists (usage events, flush at 100 events / 5s to /dp/telemetry), and a separate heartbeat worker POSTs /dp/heartbeat with {dp_id, uptime_seconds, version, rejected_resources}.
  • Config arrives via the kine/etcd watch (/aisix/<env>/observability_exporters/<id>), so the DP already has a CP→DP command path it watches.

Proposed design

  1. Passive health reporting (the core). Extend the heartbeat payload with a per-exporter block derived from SinkStatsSnapshot:
    "exporter_health": [
      { "exporter_id": "", "healthy": true, "sent": 1234, "dropped": 0,
        "failed_batches": 0, "last_error": null, "last_success_at": "" }
    ]
    Health is derived from delivery outcomes (e.g. unhealthy when the most recent batch failed permanently or N consecutive batches failed) — no extra network traffic against the customer's target.
  2. On-demand probe ("Send test event"). CP writes a probe request under a kine prefix the DP already watches (e.g. /aisix/<env>/observability_probes/<probe_id> carrying exporter_id); the DP executes ONE synthetic delivery through the real sink (resolving credential_ref locally as usual) and reports {probe_id, ok, error} in the next heartbeat. Probe records are short-lived (CP deletes after terminal state).
  3. Keep healthcheck() stubs as-is or implement them via the probe path — a separate always-on prober is NOT needed once 1 + 2 exist.
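
A minimal sketch of how item 1 might derive the per-exporter block from delivery outcomes alone. The counter names come from the SinkStatsSnapshot fields listed above. Everything else is an assumption made for illustration: the `consecutive_failures`, `last_batch_failed_permanently` and `last_success_at` fields, the `ExporterHealth` type, the `derive_health` function and the threshold of 3. The real struct in pipeline.rs may differ.

```rust
/// Stand-in for the per-sink counters named in this issue. The last three
/// fields are ASSUMPTIONS added for this sketch, not confirmed fields.
#[derive(Debug, Clone, Default)]
pub struct SinkStatsSnapshot {
    pub sent: u64,
    pub dropped: u64,
    pub retries: u64,
    pub failed_batches: u64,
    pub last_error: Option<String>,          // already masked by the pipeline
    pub consecutive_failures: u32,           // hypothetical
    pub last_batch_failed_permanently: bool, // hypothetical
    pub last_success_at: Option<String>,     // hypothetical, timestamp text
}

/// Hypothetical heartbeat entry mirroring the proposed JSON shape.
#[derive(Debug, Clone, PartialEq)]
pub struct ExporterHealth {
    pub exporter_id: String,
    pub healthy: bool,
    pub sent: u64,
    pub dropped: u64,
    pub failed_batches: u64,
    pub last_error: Option<String>,
    pub last_success_at: Option<String>,
}

/// Hypothetical value for the "N consecutive batches failed" rule.
pub const MAX_CONSECUTIVE_FAILURES: u32 = 3;

/// Derive health purely from recorded outcomes; no network call is made.
pub fn derive_health(exporter_id: &str, s: &SinkStatsSnapshot) -> ExporterHealth {
    let healthy = !s.last_batch_failed_permanently
        && s.consecutive_failures < MAX_CONSECUTIVE_FAILURES;
    ExporterHealth {
        exporter_id: exporter_id.to_string(),
        healthy,
        sent: s.sent,
        dropped: s.dropped,
        failed_batches: s.failed_batches,
        last_error: s.last_error.clone(),
        last_success_at: s.last_success_at.clone(),
    }
}
```

The heartbeat worker would call `derive_health` once per configured exporter and attach the resulting list as `exporter_health`. Because the function only reads counters, it adds no traffic against the customer's target.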

Security constraints (unchanged invariants)

  • The CP never connects to the customer's telemetry target; only the DP delivers (probe included).
  • No credential material ever leaves the DP: last_error stays masked, the probe result carries no request/credential detail.
  • The synthetic probe event must contain no end-user prompt/response content.
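
To illustrate the masking invariant, here is a rough sketch of a redaction helper. The issue states only that last_error is masked and capped at 200 characters; the `mask_error` name, the `***` placeholder and the approach of substituting known secret values are assumptions, and the existing masking in pipeline.rs may work differently.

```rust
/// Upper bound the issue states for `last_error`.
pub const MAX_ERROR_CHARS: usize = 200;

/// Hypothetical helper: replace every occurrence of a known secret value
/// with `***`, then cap the result at `MAX_ERROR_CHARS` characters.
/// Masking runs before truncation so a secret is never cut in half and
/// left partially visible.
pub fn mask_error(raw: &str, secrets: &[&str]) -> String {
    let mut out = raw.to_string();
    for secret in secrets {
        // Skip empty strings: an empty pattern matches at every position.
        if !secret.is_empty() {
            out = out.replace(*secret, "***");
        }
    }
    // Count characters, not bytes, so multi-byte text is never split.
    out.chars().take(MAX_ERROR_CHARS).collect()
}
```

The same helper could be applied to the probe result's error field, so both reporting paths share one redaction step.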

Out of scope

  • CP-side persistence / dashboard UI (tracked in the AISIX-Cloud counterpart issue, linked below).
  • Prometheus / OTLP-metrics egress for the DP itself.

Acceptance criteria

  • Heartbeat carries exporter_health for every configured exporter (all four kinds), with masked last_error.
  • A probe record written to kine triggers exactly one synthetic delivery and exactly one result report; records are idempotent per probe_id.
  • An exporter with a missing credential env var reports healthy=false with an actionable, masked error (e.g. names the missing env var — the var NAME is not a secret).
  • e2e: mock-edge test pins heartbeat payload shape + probe round-trip for at least object_store and datadog.
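
The probe criteria above (exactly one delivery and one report per probe_id, plus an actionable error for a missing credential env var) could be sketched as follows. All names here (`ProbeResult`, `ProbeTracker`, `resolve_credential`) are hypothetical, and the env lookup is passed in as a closure so the sketch stays independent of the real credential_ref resolution.

```rust
use std::collections::HashMap;

/// Hypothetical result shape, matching {probe_id, ok, error} in the proposal.
#[derive(Debug, Clone, PartialEq)]
pub struct ProbeResult {
    pub probe_id: String,
    pub ok: bool,
    pub error: Option<String>,
}

/// Resolve a credential by env var name. The error names the variable
/// (not a secret) and never includes a value.
pub fn resolve_credential(
    var_name: &str,
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<String, String> {
    match lookup(var_name) {
        Some(v) if !v.is_empty() => Ok(v),
        _ => Err(format!("credential env var {} is not set", var_name)),
    }
}

/// Tracks probes already handled so a re-delivered watch event is a no-op.
#[derive(Default)]
pub struct ProbeTracker {
    results: HashMap<String, ProbeResult>,
    pending_report: Vec<String>,
}

impl ProbeTracker {
    /// Runs `deliver` only the first time `probe_id` is seen.
    /// Returns true if a delivery was actually attempted.
    pub fn handle<F>(&mut self, probe_id: &str, deliver: F) -> bool
    where
        F: FnOnce() -> Result<(), String>,
    {
        if self.results.contains_key(probe_id) {
            return false;
        }
        let outcome = deliver();
        let result = ProbeResult {
            probe_id: probe_id.to_string(),
            ok: outcome.is_ok(),
            error: outcome.err(),
        };
        self.results.insert(probe_id.to_string(), result);
        self.pending_report.push(probe_id.to_string());
        true
    }

    /// Drain results for the next heartbeat; each is returned exactly once.
    pub fn take_pending(&mut self) -> Vec<ProbeResult> {
        let ids = std::mem::take(&mut self.pending_report);
        ids.iter()
            .filter_map(|id| self.results.get(id).cloned())
            .collect()
    }
}
```

In this sketch the `results` map grows without bound; a real implementation would presumably evict an entry once the CP deletes the probe record. A production lookup could pass `|name| std::env::var(name).ok()` as the closure.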

Metadata

Assignees

No one assigned

    Labels

    P1 (High-value differentiator), cross-repo (Requires changes in DP + CP + Dashboard UI + e2e), enhancement (New feature or request)
