Service Health Dashboard for Self-Hosted Prefect #20738
## Context
From the Slack thread: a user's Postgres degraded silently, causing docket tasks to pile up in Redis, which eventually caused a Lua table overflow panic. There was no single place to see that the system was degrading before it crashed.
Today, `prefect server services ls` shows enabled/disabled status via the CLI, `/health` returns a boolean, and `/admin/settings` dumps config. None of these tell you whether services are actually healthy at runtime.

## What Exists Today
### Two kinds of background services

1. **Consumer-based** (`Service` subclasses) - long-running message consumers: `EventPersister`, `EventLogger`, `ReactiveTriggers`, `Actions`, `TaskRunRecorder`, `Distributor`, `LogsStream`
2. **Docket perpetual tasks** (`@perpetual_service`) - periodic background jobs: `monitor_worker_health` (Foreman), `schedule_deployments`, `schedule_recent_deployments`, `monitor_late_runs`, `monitor_expired_pauses`, `monitor_cancelled_flow_runs`, `monitor_subflow_runs`, `monitor_expired_leases`, `send_telemetry_heartbeat`

### Existing health signals

- `/health` - returns `true` (just checks that the API server is up)
- `/admin/settings` - dumps full settings (no runtime health)
- `/admin/version` - version string
- `POST /work_pools/{name}/workers/heartbeat` - updates `last_heartbeat_time` in the DB
- `emit_event("prefect.flow-run.heartbeat", ...)` - stored as events (this is what bloated Tom's events table)
- `/metrics` - optional, behind `PREFECT_API_ENABLE_METRICS`

### Key infrastructure dependencies
## Proposed Design

### Core Idea

Each background service emits a lightweight heartbeat-style health update on every successful iteration. A new `/api/admin/health` endpoint aggregates these into a single health snapshot. The UI renders this on a new Settings > Service Health page (or a dedicated route).

### 1. Service Heartbeat Events

#### For perpetual services (docket-based)
Each `@perpetual_service` function already runs on a known cadence. After each successful execution, it emits a structured heartbeat. For example, `schedule_deployments` could report `{"runs_scheduled": 42}`, and `monitor_worker_health` could report `{"workers_marked_offline": 3}`.

Important: these heartbeat events should NOT be persisted to the events table (that's what caused Tom's problem!). They should live outside the event store - e.g. in memory, or in a small `service_health` table upserted on each heartbeat.

The simplest approach: a server-side in-memory `dict[str, ServiceHealthStatus]` that each service updates after completing an iteration. No events table, no Redis, no additional DB writes by default.

#### For consumer-based services
Consumer services (`EventPersister`, `TaskRunRecorder`, etc.) don't have a natural "iteration complete" moment - they process messages continuously. Instead, they could update their health snapshot at natural checkpoints, e.g. after each batch flush (where they already log lines like "Persisting N events...").

Example for EventPersister:
### 2. Infrastructure Health Checks

Beyond individual services, the dashboard should show the health of the underlying infrastructure:

- Database health
- Redis/Docket health (when configured)
- Events table size (the thing that bit Tom)
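These three checks could be sketched roughly as follows. The helper names and thresholds are assumptions for illustration; a real implementation would ping the database (e.g. `SELECT 1` via the server's SQLAlchemy session) and Redis (`PING`) rather than taking callables, and would get the event count from a `COUNT(*)` or a cheaper estimate.

```python
# Illustrative infrastructure checks: round-trip latency plus an
# events-table size warning. Names and thresholds are hypothetical.
import time


def check_latency(ping) -> dict:
    """Time a round-trip ping callable and classify the result."""
    start = time.perf_counter()
    try:
        ping()
    except Exception as exc:
        return {"status": "error", "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "status": "healthy" if latency_ms < 100 else "degraded",
        "latency_ms": round(latency_ms, 1),
    }


def check_events_table(event_count: int, warn_threshold: int = 1_000_000) -> dict:
    """Warn when the events table grows past a retention threshold."""
    if event_count > warn_threshold:
        return {
            "status": "warning",
            "event_count": event_count,
            "message": f"Events table has {event_count / 1e6:.1f}M rows - "
                       "consider enabling retention cleanup",
        }
    return {"status": "healthy", "event_count": event_count}


# Usage: pass DB / Redis round-trip callables and the current row count.
db_health = check_latency(lambda: None)  # stand-in for a real DB round trip
events_health = check_events_table(5_400_000)
```

With a 5.4M-row events table (as in Tom's case) this reports the "warning" status shown in the response shape below.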
### 3. API Endpoint
A new endpoint on the admin router:
Response shape:
```json
{
  "server": {
    "version": "3.x.y",
    "uptime_seconds": 86400,
    "started_at": "2026-02-16T12:00:00Z"
  },
  "database": {
    "status": "healthy",
    "latency_ms": 2.3,
    "dialect": "postgresql",
    "pool_size": 5,
    "pool_checked_out": 2,
    "pool_overflow": 0
  },
  "docket": {
    "status": "healthy",
    "backend": "redis",
    "latency_ms": 0.8,
    "pending_tasks": 12
  },
  "services": {
    "perpetual": [
      {
        "name": "schedule_deployments",
        "status": "healthy",
        "last_heartbeat": "2026-02-17T21:59:30Z",
        "last_duration_ms": 142,
        "expected_interval_seconds": 60,
        "details": {"runs_scheduled": 42}
      },
      {
        "name": "monitor_worker_health",
        "status": "degraded",
        "last_heartbeat": "2026-02-17T21:58:00Z",
        "last_duration_ms": 5200,
        "expected_interval_seconds": 15,
        "details": {"workers_marked_offline": 0},
        "error": null
      }
    ],
    "consumer": [
      {
        "name": "EventPersister",
        "status": "healthy",
        "last_heartbeat": "2026-02-17T21:59:55Z",
        "details": {
          "queue_depth": 15,
          "queue_capacity": 50000,
          "consecutive_failures": 0
        }
      },
      {
        "name": "TaskRunRecorder",
        "status": "healthy",
        "last_heartbeat": "2026-02-17T21:59:58Z",
        "details": {"queue_depth": 3}
      }
    ]
  },
  "events_table": {
    "status": "warning",
    "event_count": 5400000,
    "message": "Events table has 5.4M rows - consider enabling retention cleanup"
  }
}
```

Status derivation: a perpetual service is "degraded" if `now - last_heartbeat > expected_interval * 2`, and "error" if `now - last_heartbeat > expected_interval * 5` or if the last iteration raised an exception. Consumer services report their own status based on queue depth and consecutive failures.

### 4. UI
#### Where it lives

- **Option A: Dedicated "Service Health" page** - new route at `/service-health`, new nav item in the sidebar. Most discoverable, but adds clutter for users who don't self-host.
- **Option B: Section on the Settings page** - extends the existing Settings page with a "Service Health" card below the existing settings dump. Lower friction to implement, co-located with config.
- **Option C: Dashboard card** - a collapsible "System Health" card on the main dashboard, similar to `DashboardWorkPoolsCard`. Always visible.

Recommendation: Option A, with a link from Settings. Self-hosted users running into issues (like Tom) will look for this. It doesn't need to be in the main nav initially - it could be accessible from a Settings > "View Service Health" link, or as a sub-route under `/settings/health`.

#### What it shows
For the React UI (`ui-v2`), this would be a new route and component. The page would show one row per service, plus the database, docket, and events-table checks.

Visual design: similar to the existing work pools card, which shows status badges. Each service gets a row with a status badge and its latest heartbeat details.
### 5. Multi-process considerations

When running with `--workers > 1` (multiple uvicorn workers), the in-memory health dict only reflects the process that handles the GET request. Services run in a single process (either the main worker or `prefect server services start`), so when services run in a separate process the health endpoint needs some way to query that process.

Recommendation: start with in-memory state for the common single-process case. For a separate services process, write to a `service_health_status` table with upserts (one row per service, updated on each heartbeat), and have the API endpoint read from that table. This is a single small query regardless of how many services exist.

## Implementation Plan
### Phase 1: Backend health reporting (minimal)

- A `ServiceHealthRegistry` with an in-memory dict + optional DB persistence
- Extend the `@perpetual_service` decorator to auto-update health after each iteration (wrap the function call, measure duration, catch exceptions)
- Extend the `Service` base class `start()` to periodically report health
- `GET /api/admin/health` endpoint with DB + docket + services health

### Phase 2: UI

- Add `buildGetHealthQuery` to `ui-v2/src/api/admin/`
- New `service-health-page.tsx` component with TanStack Query polling
- Route at `/settings/health` (or `/service-health`)

### Phase 3: Alerting integration

- Emit `prefect.server.service.degraded` / `prefect.server.service.error` events (to the event bus, not persisted by default)

## Open Questions
- **Should health snapshots be persisted to DB or kept in-memory?** A `service_health` table is small (one row per service) and handles multi-process deployments.
- **Should we report docket task queue depth?** E.g. `XLEN` on the stream.
- **How much detail should the health endpoint expose?** Possibly behind a `?detailed=true` query param.
- **Should the health endpoint be authenticated?** It could follow the same policy as `/admin/settings`; `/health` (the existing boolean endpoint) should remain unauthenticated for load balancers.
- **How does this relate to Prometheus metrics?** The `/metrics` endpoint (behind `PREFECT_API_ENABLE_METRICS`) could export the same data as Prometheus gauges.
- **What about the Vue UI (`ui/`)?**