Service Health Dashboard for Self-Hosted Prefect #20738
## Context
From the Slack thread: a user's Postgres degraded silently, causing docket tasks to pile up in Redis, which eventually caused a Lua table overflow panic. There was no single place to see that the system was degrading before it crashed.
Today, `prefect server services ls` shows enabled/disabled status via the CLI, `/health` returns a boolean, and `/admin/settings` dumps config. None of these tell you whether services are actually healthy at runtime.

## What Exists Today
### Two kinds of background services

1. **Consumer-based** (`Service` subclasses) - long-running message consumers: `EventPersister`, `EventLogger`, `ReactiveTriggers`, `Actions`, `TaskRunRecorder`, `Distributor`, `LogsStream`
2. **Docket perpetual tasks** (`@perpetual_service`) - periodic background jobs: `monitor_worker_health` (Foreman), `schedule_deployments`, `schedule_recent_deployments`, `monitor_late_runs`, `monitor_expired_pauses`, `monitor_cancelled_flow_runs`, `monitor_subflow_runs`, `monitor_expired_leases`, `send_telemetry_heartbeat`

### Existing health signals

- `/health` - returns `true` (just checks that the API server is up)
- `/admin/settings` - dumps full settings (no runtime health)
- `/admin/version` - version string
- `POST /work_pools/{name}/workers/heartbeat` - updates `last_heartbeat_time` in the DB
- `emit_event("prefect.flow-run.heartbeat", ...)` - stored as events (this is what bloated Tom's events table)
- `/metrics` - optional, behind `PREFECT_API_ENABLE_METRICS`

### Key infrastructure dependencies
## Proposed Design

### Core Idea

Each background service emits a lightweight heartbeat-style health update on every successful iteration. A new `/api/admin/health` endpoint aggregates these into a single health snapshot. The UI renders this on a new Settings > Service Health page (or a dedicated route).

### 1. Service Heartbeat Events

#### For perpetual services (docket-based)
Each `@perpetual_service` function already runs on a known cadence. After each successful execution, it emits a structured heartbeat. For example, `schedule_deployments` could report `{"runs_scheduled": 42}`, and `monitor_worker_health` could report `{"workers_marked_offline": 3}`.

Important: these heartbeat events should NOT be persisted to the events table (that's what caused Tom's problem!). They should live outside the event store - e.g. in memory, or in a small `service_health` table upserted on each heartbeat.

The simplest approach: a server-side in-memory `dict[str, ServiceHealthStatus]` that each service updates after completing an iteration. No events table, no Redis, no additional DB writes by default.

#### For consumer-based services
Consumer services (`EventPersister`, `TaskRunRecorder`, etc.) don't have a natural "iteration complete" moment - they process messages continuously. Instead, they could update their health snapshot at natural checkpoints, e.g. after each batch flush (where they already log lines like "Persisting N events...").

Example for EventPersister:
### 2. Infrastructure Health Checks

Beyond individual services, the dashboard should show the health of the underlying infrastructure:

- Database health
- Redis/Docket health (when configured)
- Events table size (the thing that bit Tom)
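These three checks could be sketched roughly as follows. The helper names and thresholds are assumptions for illustration; a real implementation would ping the database (e.g. `SELECT 1` via the server's SQLAlchemy session) and Redis (`PING`) rather than taking callables, and would get the event count from a `COUNT(*)` or a cheaper estimate.

```python
# Illustrative infrastructure checks: round-trip latency plus an
# events-table size warning. Names and thresholds are hypothetical.
import time


def check_latency(ping) -> dict:
    """Time a round-trip ping callable and classify the result."""
    start = time.perf_counter()
    try:
        ping()
    except Exception as exc:
        return {"status": "error", "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "status": "healthy" if latency_ms < 100 else "degraded",
        "latency_ms": round(latency_ms, 1),
    }


def check_events_table(event_count: int, warn_threshold: int = 1_000_000) -> dict:
    """Warn when the events table grows past a retention threshold."""
    if event_count > warn_threshold:
        return {
            "status": "warning",
            "event_count": event_count,
            "message": f"Events table has {event_count / 1e6:.1f}M rows - "
                       "consider enabling retention cleanup",
        }
    return {"status": "healthy", "event_count": event_count}


# Usage: pass DB / Redis round-trip callables and the current row count.
db_health = check_latency(lambda: None)  # stand-in for a real DB round trip
events_health = check_events_table(5_400_000)
```

With a 5.4M-row events table (as in Tom's case) this reports the "warning" status shown in the response shape below.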
### 3. API Endpoint
A new endpoint on the admin router:
Response shape:
```json
{
  "server": {
    "version": "3.x.y",
    "uptime_seconds": 86400,
    "started_at": "2026-02-16T12:00:00Z"
  },
  "database": {
    "status": "healthy",
    "latency_ms": 2.3,
    "dialect": "postgresql",
    "pool_size": 5,
    "pool_checked_out": 2,
    "pool_overflow": 0
  },
  "docket": {
    "status": "healthy",
    "backend": "redis",
    "latency_ms": 0.8,
    "pending_tasks": 12
  },
  "services": {
    "perpetual": [
      {
        "name": "schedule_deployments",
        "status": "healthy",
        "last_heartbeat": "2026-02-17T21:59:30Z",
        "last_duration_ms": 142,
        "expected_interval_seconds": 60,
        "details": {"runs_scheduled": 42}
      },
      {
        "name": "monitor_worker_health",
        "status": "degraded",
        "last_heartbeat": "2026-02-17T21:58:00Z",
        "last_duration_ms": 5200,
        "expected_interval_seconds": 15,
        "details": {"workers_marked_offline": 0},
        "error": null
      }
    ],
    "consumer": [
      {
        "name": "EventPersister",
        "status": "healthy",
        "last_heartbeat": "2026-02-17T21:59:55Z",
        "details": {
          "queue_depth": 15,
          "queue_capacity": 50000,
          "consecutive_failures": 0
        }
      },
      {
        "name": "TaskRunRecorder",
        "status": "healthy",
        "last_heartbeat": "2026-02-17T21:59:58Z",
        "details": {"queue_depth": 3}
      }
    ]
  },
  "events_table": {
    "status": "warning",
    "event_count": 5400000,
    "message": "Events table has 5.4M rows - consider enabling retention cleanup"
  }
}
```

Status derivation: a perpetual service is "degraded" if `now - last_heartbeat > expected_interval * 2`, and "error" if `now - last_heartbeat > expected_interval * 5` or if the last iteration raised an exception. Consumer services report their own status based on queue depth and consecutive failures.

### 4. UI
#### Where it lives

- **Option A: Dedicated "Service Health" page** - new route at `/service-health`, new nav item in the sidebar. Most discoverable, but adds clutter for users who don't self-host.
- **Option B: Section on the Settings page** - extends the existing Settings page with a "Service Health" card below the existing settings dump. Lower friction to implement, co-located with config.
- **Option C: Dashboard card** - a collapsible "System Health" card on the main dashboard, similar to `DashboardWorkPoolsCard`. Always visible.

Recommendation: Option A, with a link from Settings. Self-hosted users running into issues (like Tom) will look for this. It doesn't need to be in the main nav initially - it could be accessible from a Settings > "View Service Health" link, or as a sub-route under `/settings/health`.

#### What it shows
For the React UI (`ui-v2`), this would be a new route and component. The page would show one row per service, plus the database, docket, and events-table checks.

Visual design: similar to the existing work pools card, which shows status badges. Each service gets a row with a status badge and its latest heartbeat details.
### 5. Multi-process considerations

When running with `--workers > 1` (multiple uvicorn workers), the in-memory health dict only reflects the process that handles the GET request. Services run in a single process (either the main worker or `prefect server services start`), so when services run in a separate process the health endpoint needs some way to query that process.

Recommendation: start with in-memory state for the common single-process case. For a separate services process, write to a `service_health_status` table with upserts (one row per service, updated on each heartbeat), and have the API endpoint read from that table. This is a single small query regardless of how many services exist.

## Implementation Plan
### Phase 1: Backend health reporting (minimal)

- A `ServiceHealthRegistry` with an in-memory dict + optional DB persistence
- Extend the `@perpetual_service` decorator to auto-update health after each iteration (wrap the function call, measure duration, catch exceptions)
- Extend the `Service` base class `start()` to periodically report health
- `GET /api/admin/health` endpoint with DB + docket + services health

### Phase 2: UI

- Add `buildGetHealthQuery` to `ui-v2/src/api/admin/`
- New `service-health-page.tsx` component with TanStack Query polling
- Route at `/settings/health` (or `/service-health`)

### Phase 3: Alerting integration

- Emit `prefect.server.service.degraded` / `prefect.server.service.error` events (to the event bus, not persisted by default)

## Open Questions
- **Should health snapshots be persisted to DB or kept in-memory?** A `service_health` table is small (one row per service) and handles multi-process deployments.
- **Should we report docket task queue depth?** E.g. `XLEN` on the stream.
- **How much detail should the health endpoint expose?** Possibly behind a `?detailed=true` query param.
- **Should the health endpoint be authenticated?** It could follow the same policy as `/admin/settings`; `/health` (the existing boolean endpoint) should remain unauthenticated for load balancers.
- **How does this relate to Prometheus metrics?** The `/metrics` endpoint (behind `PREFECT_API_ENABLE_METRICS`) could export the same data as Prometheus gauges.
- **What about the Vue UI (`ui/`)?**