🤖 Autopilot: Structured Health Check Endpoint #87

Merged
crshdn merged 4 commits into main from autopilot/structured-health-check-endpoint
Mar 22, 2026

@crshdn crshdn commented Mar 22, 2026

What was built

Structured health check endpoints for Mission Control, enabling integration with monitoring tools (UptimeRobot, Grafana, Prometheus).

Endpoints

| Route | Auth | Format | Purpose |
| --- | --- | --- | --- |
| GET /api/health | No | JSON | Summary: {status, uptime_seconds, version} |
| GET /api/health | Bearer token | JSON | Full component breakdown |
| GET /api/health/metrics | No | Prometheus text | Scrape target for Prometheus/Grafana |
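
The two GET /api/health rows above share one route and differ only by the presence of a valid Bearer token. A minimal sketch of that selection logic follows; the type names, the `components` field, and the helper names `parseBearer` and `selectHealthBody` are assumptions for illustration, not the code in src/app/api/health/route.ts.

```typescript
// Only status, uptime_seconds and version are named in the PR text.
interface HealthSummary {
  status: string;
  uptime_seconds: number;
  version: string;
}

// Hypothetical detail shape: the component names come from the PR,
// the container field and value types are assumptions.
interface HealthDetail extends HealthSummary {
  components: Record<string, unknown>;
}

// Extracts the token from an "Authorization: Bearer <token>" header value.
function parseBearer(header: string | null): string | null {
  if (!header) return null;
  const match = /^Bearer\s+(\S+)$/i.exec(header.trim());
  return match ? (match[1] ?? null) : null;
}

// Returns the detail payload only when a token is configured and the
// presented token matches it exactly; every other caller gets the summary.
function selectHealthBody(
  authHeader: string | null,
  expectedToken: string | undefined,
  detail: HealthDetail,
): HealthSummary | HealthDetail {
  const presented = parseBearer(authHeader);
  const authorized = Boolean(expectedToken) && presented === expectedToken;
  if (authorized) return detail;
  const { status, uptime_seconds, version } = detail;
  return { status, uptime_seconds, version };
}
```

Falling back to the summary rather than returning 401 keeps the endpoint usable by unauthenticated uptime monitors even when a stale token is sent.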

Component checks (authenticated detail)

  • db: PRAGMA integrity_check(1), user_version, page_count * page_size for size, writable test
  • gateway: OpenClaw WebSocket client isConnected() status
  • agents: Active/total counts from agents table + health state aggregation from agent_health
  • queue: Task status breakdown (assigned, in_progress, queued, testing, verification)
  • research: Per-product latest cycle phase, freshness, and status
  • costs: Per-cap utilization percentage from cost_caps table
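
As an illustration of the db check listed above, here is a hedged sketch built on the PRAGMAs the PR names. `PragmaDb` is a hypothetical slice of the better-sqlite3 `Database` interface (its `pragma(source, { simple: true })` call returns the first column of the first row), and `DbHealth` and `checkDb` are assumed names, not the contents of src/lib/health.ts.

```typescript
// Minimal slice of the better-sqlite3 Database API used here, so the sketch
// can run against a real connection or a stub.
interface PragmaDb {
  pragma(source: string, options?: { simple?: boolean }): unknown;
}

// Hypothetical result shape for the db component.
interface DbHealth {
  ok: boolean;
  schema_version: number;
  size_bytes: number;
  error?: string;
}

function checkDb(db: PragmaDb): DbHealth {
  try {
    // integrity_check returns the single string "ok" when no problems are found.
    const integrity = db.pragma("integrity_check(1)", { simple: true });
    const schemaVersion = Number(db.pragma("user_version", { simple: true }));
    const pageCount = Number(db.pragma("page_count", { simple: true }));
    const pageSize = Number(db.pragma("page_size", { simple: true }));
    return {
      ok: integrity === "ok",
      schema_version: schemaVersion,
      size_bytes: pageCount * pageSize,
    };
  } catch (err) {
    // A failing PRAGMA (locked or missing file, for example) marks the db unhealthy.
    return {
      ok: false,
      schema_version: 0,
      size_bytes: 0,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Catching the error inside the check lets the remaining components still be reported when the database is unavailable.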

Prometheus metrics exposed

autensa_up, autensa_uptime_seconds, autensa_db_ok, autensa_db_size_bytes, autensa_db_schema_version, autensa_gateway_connected, autensa_agents_active, autensa_agents_total, autensa_queue_assigned, autensa_queue_in_progress, autensa_queue_total_pending, autensa_cost_utilization_pct{product_id,cap_type}, autensa_research_last_cycle_age_seconds{product_id}
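
A rough sketch of how metrics like these can be rendered in the Prometheus text exposition format with plain string templates. The `Sample` type, the HELP strings, and typing every metric as a gauge are assumptions; the metric and label names are taken from the list above.

```typescript
interface Sample {
  name: string;
  help: string;
  value: number;
  labels?: Record<string, string>;
}

// Escapes backslash, double quote and newline in label values, as the
// text exposition format requires.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders samples one per line. HELP/TYPE headers are written once per
// metric name, so samples sharing a name should be passed adjacently.
function formatPrometheus(samples: Sample[]): string {
  const lines: string[] = [];
  const seen: Record<string, boolean> = {};
  for (const s of samples) {
    if (!seen[s.name]) {
      seen[s.name] = true;
      lines.push(`# HELP ${s.name} ${s.help}`);
      lines.push(`# TYPE ${s.name} gauge`);
    }
    const labelMap = s.labels ?? {};
    const labels = Object.keys(labelMap)
      .map((key) => `${key}="${escapeLabel(labelMap[key] ?? "")}"`)
      .join(",");
    lines.push(labels ? `${s.name}{${labels}} ${s.value}` : `${s.name} ${s.value}`);
  }
  // The format expects a trailing newline after the last sample.
  return lines.join("\n") + "\n";
}
```

A scrape response built this way would be served with a `text/plain; version=0.0.4` content type.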

Research backing

Research recommended "Add health check endpoint (/api/health) that reports gateway connectivity, database status, active agent count, and queue depth for monitoring." This is standard infrastructure that was missing from Mission Control.

Technical approach

  • No new dependencies: Prometheus format built with string templates, DB checks via existing better-sqlite3 PRAGMAs
  • Auth bypass: /api/health and /api/health/metrics added to middleware exclusion list (same pattern as /api/webhooks/)
  • Inline auth check: the health route handler checks the Bearer token independently to choose between the detailed and summary response
  • Read-only DB queries: all checks are SELECTs and PRAGMAs, no writes
  • Module-level uptime tracking: Date.now() captured at import time
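
The middleware exclusion and uptime points above could look roughly like the following. `AUTH_EXEMPT_PREFIXES`, `isAuthExempt`, `STARTED_AT_MS`, and `uptimeSeconds` are hypothetical names, and the exact matching rule used in src/middleware.ts is an assumption.

```typescript
// Paths that skip the auth middleware. "/api/webhooks/" is the existing
// pattern named in the PR; the matching rule itself is an assumption.
const AUTH_EXEMPT_PREFIXES = ["/api/webhooks/", "/api/health"];

function isAuthExempt(pathname: string): boolean {
  return AUTH_EXEMPT_PREFIXES.some((prefix) => {
    if (prefix.endsWith("/")) return pathname.startsWith(prefix);
    // Match the exact path or a sub-path, so "/api/healthy" is not exempted.
    return pathname === prefix || pathname.startsWith(prefix + "/");
  });
}

// Captured once, when the module is first imported.
const STARTED_AT_MS = Date.now();

// Whole seconds since import; never negative even if the clock steps back.
function uptimeSeconds(nowMs: number = Date.now()): number {
  return Math.max(0, Math.floor((nowMs - STARTED_AT_MS) / 1000));
}
```

Because the start time is captured at import, the reported uptime resets whenever the module is re-evaluated, for example after a hot reload in development.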

Risks / trade-offs

  • Gateway connectivity is a point-in-time check (WebSocket may reconnect between checks)
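  • PRAGMA integrity_check(1) stops at the first error it reports; the argument caps the number of errors returned, not the pages scanned, so a healthy database is still checked in full (PRAGMA quick_check is the cheaper alternative if check latency matters)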
  • Cost cap utilization depends on current_spend_usd being kept up to date by the cost tracking system
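
To make the cost cap dependency concrete, here is a small sketch of a per-cap utilization calculation. Only `current_spend_usd` is named in the PR; `cap_usd`, the other row fields, and the one-decimal rounding are assumptions about the cost_caps table.

```typescript
// Hypothetical row shape for the cost_caps table.
interface CostCapRow {
  product_id: string;
  cap_type: string;
  current_spend_usd: number;
  cap_usd: number;
}

// Utilization as a percentage rounded to one decimal place.
// Returns null when the cap is zero, negative or NaN, since the ratio
// is undefined in that case.
function utilizationPct(row: CostCapRow): number | null {
  if (!(row.cap_usd > 0)) return null;
  return Math.round((row.current_spend_usd / row.cap_usd) * 1000) / 10;
}
```

If the cost tracking system lags, `current_spend_usd` is stale and the percentage understates real spend, which is the risk the bullet above describes.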

Files changed

| File | Change |
| --- | --- |
| src/lib/health.ts | Core health check logic (summary, detail, Prometheus formatting) |
| src/app/api/health/route.ts | GET handler with auth-gated detail |
| src/app/api/health/metrics/route.ts | Prometheus text exposition endpoint |
| src/middleware.ts | Auth bypass for /api/health routes |
| src/lib/health.test.ts | Unit tests (4 passing) |

Testing

  • ✅ 4/4 unit tests pass (summary shape, detail components, Prometheus format, error state)
  • ✅ next build succeeds; both routes compiled
  • ✅ Auth middleware exclusion pattern matches existing webhook bypass

Task ID: db22214b-b5b3-4a0c-8e24-60f4d3e27755

crshdn and others added 4 commits March 21, 2026 21:51
…/metrics)

- GET /api/health (unauthenticated): returns {status, uptime_seconds, version}
- GET /api/health (authenticated): returns full component breakdown (db, gateway, agents, queue, research, costs)
- GET /api/health/metrics: Prometheus text exposition format for scraping
- Middleware updated to bypass auth for /api/health routes
- Core logic in src/lib/health.ts with separate component checks
- DB: integrity_check, user_version, page_count*page_size for size
- Gateway: checks OpenClaw client isConnected()
- Agents: active/total counts + health state aggregation
- Queue: task status breakdown (assigned, in_progress, queued, testing, verification)
- Research: per-product latest cycle phase and freshness
- Costs: per-cap utilization percentage from existing cost_caps
- Unit tests for summary, detail, and Prometheus formatting

Task: db22214b-b5b3-4a0c-8e24-60f4d3e27755

- Remove .mc-workspace.json, .tmp/db-backups/*, db-backups/* from repo
- Add db-backups/, .tmp/, .mc-workspace.json to .gitignore
- Fix checkResearch() query: MAX(id) on UUIDs doesn't give latest cycle,
  use MAX(started_at) instead
- Remove dead code block in checkGateway()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Integrity check already proves DB is accessible. Derive writable from
integrity_check result instead of a separate SELECT 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
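
The checkResearch() fix in the commits above replaces MAX(id) with MAX(started_at) because random UUIDs carry no time ordering. The sketch below illustrates the difference on in-memory rows; the `CycleRow` shape, the sample UUIDs and timestamps, and the helper names are made up for illustration, and it assumes `started_at` is stored as ISO 8601 UTC text, which compares correctly as a plain string.

```typescript
// Hypothetical row shape for a research cycle.
interface CycleRow {
  id: string; // random UUID, unrelated to creation order
  product_id: string;
  started_at: string; // assumed ISO 8601 UTC text
  phase: string;
}

// Picks the latest cycle per product by started_at.
function latestCycles(rows: CycleRow[]): Record<string, CycleRow> {
  const latest: Record<string, CycleRow> = {};
  for (const row of rows) {
    const current = latest[row.product_id];
    if (!current || row.started_at > current.started_at) {
      latest[row.product_id] = row;
    }
  }
  return latest;
}

// What MAX(id) effectively does: pick the lexicographically largest UUID.
function maxById(rows: CycleRow[]): CycleRow | undefined {
  return rows.reduce<CycleRow | undefined>(
    (best, row) => (!best || row.id > best.id ? row : best),
    undefined,
  );
}

const sampleCycles: CycleRow[] = [
  { id: "f47ac10b-58cc-4372-a567-0e02b2c3d479", product_id: "p1", started_at: "2026-03-01T10:00:00Z", phase: "done" },
  { id: "0a1b2c3d-4e5f-4a6b-8c7d-9e0f1a2b3c4d", product_id: "p1", started_at: "2026-03-20T10:00:00Z", phase: "research" },
];
```

With these rows, `maxById` selects the older March 1 cycle because its UUID sorts higher, while `latestCycles` selects the March 20 cycle.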
@crshdn crshdn merged commit 2389f65 into main Mar 22, 2026
1 check passed