Version: 4.0
Status: In Progress
Last Updated: 2026-01-28
- Overview
- One-File Orchestrator
- Healdec Auto-Healing Engine
- 12 Parallel Workers
- Database Schema
- Data Flows
- Full-Stack Applications
- UI/UX Design System
- Production Deployment
- Implementation Status
Brain-Doctor Hospital V4 (AlgoBrainDoctor) is a production-grade repository health monitoring and auto-healing platform. The system continuously scans Git repositories, tracks developer identities, computes health scores, and automatically remediates detected issues through intelligent recovery strategies.
- Self-Healing: Autonomous error detection, classification, and recovery via Healdec engine
- Parallel Execution: 12 specialized workers process jobs concurrently with minimal contention
- Single Orchestrator: One centralized scheduler manages all coordination and supervision
- Immutable Events: All actions logged as append-only events for audit and replay
- Graceful Degradation: System continues operating even when individual workers fail
- Real-time repository health scoring (0-100 scale)
- Developer identity claim resolution and merging
- Automated healing workflows for common failure patterns
- Multi-tenant support with organization-level isolation
- Full observability through structured logs and metrics
- Production-ready API with authentication and rate limiting
The One-File Orchestrator is the central nervous system of Brain-Doctor Hospital. It follows a single-responsibility pattern where all coordination logic resides in one entry point, making the system easier to reason about, debug, and deploy.
- Polls the
jobstable for pending work - Assigns jobs to available workers based on priority and capacity
- Enforces concurrency limits per worker type
- Implements fair scheduling to prevent starvation
- Monitors worker health via heartbeats
- Detects stuck or crashed workers
- Triggers graceful shutdown on termination signals
- Reports worker status to monitoring systems
- Manages database connection pools
- Controls Redis cache access
- Enforces memory and CPU limits
- Implements backpressure when queues grow
- Handles startup initialization (schema validation, migrations)
- Coordinates graceful shutdown (drain queues, finish in-flight jobs)
- Supports rolling deploys with zero downtime
- Manages configuration hot-reloading
1. INIT: Load config, connect to DB/cache, validate schema
2. SCHEDULE: Poll jobs table every N seconds
3. DISPATCH: Assign jobs to workers based on availability
4. SUPERVISE: Monitor worker heartbeats and health
5. HEAL: Invoke Healdec on detected failures
6. REPEAT: Loop back to SCHEDULE until shutdown signal
interface OrchestratorConfig {
pollIntervalMs: number; // Job polling frequency (default: 5000)
maxConcurrentJobs: number; // Global job limit (default: 100)
workerTimeoutMs: number; // Worker heartbeat timeout (default: 60000)
gracefulShutdownTimeoutMs: number; // Max time for shutdown (default: 30000)
healdecEnabled: boolean; // Enable auto-healing (default: true)
}Healdec (Health Decision Controller) is the autonomous recovery system that detects, classifies, and heals failures without human intervention. It operates on a feedback loop that learns from past recovery attempts to optimize future decisions.
When: Transient network errors, rate limiting, timeouts
How: Exponential backoff with jitter (1s, 2s, 4s, 8s, max 60s)
Invariants:
- Max 5 retry attempts per job
- Reset retry counter on successful completion
- Log each retry with failure context
When: Worker crashes, memory leaks, corrupted state
How: Terminate worker process, spawn new instance, reassign job
Invariants:
- Preserve job data and history
- Increment restart counter
- Alert on >3 restarts within 1 hour
When: Repeated failures, data corruption, invalid inputs
How: Move job to quarantine queue, mark as needs-human-review
Invariants:
- Job removed from active scheduling
- Original error preserved for debugging
- Operator notified via Slack/email
When: Partial completion with side effects (DB writes, API calls)
How: Invoke compensating transactions to undo changes
Invariants:
- Maintain operation log for rollback
- Verify rollback completion
- Mark job as failed after rollback
When: Critical system errors, resource exhaustion, security issues
How: Page on-call engineer, halt related jobs, capture full diagnostics
Invariants:
- Immediate notification
- System state snapshot preserved
- Safe mode activated (read-only operations)
graph TD
A[Job Failed] --> B{Error Type?}
B -->|Transient| C[Retry Strategy]
B -->|Worker Crash| D[Restart Strategy]
B -->|Data Error| E[Quarantine Strategy]
B -->|Partial Success| F[Rollback Strategy]
B -->|System Critical| G[Escalate Strategy]
C --> H{Max Retries?}
H -->|No| C
H -->|Yes| E
D --> I{Max Restarts?}
I -->|No| D
I -->|Yes| G
interface HealdecLog {
id: string;
timestamp: Date;
jobId: string;
workerId: string;
failureType: 'transient' | 'crash' | 'data' | 'partial' | 'critical';
strategy: 'retry' | 'restart' | 'quarantine' | 'rollback' | 'escalate';
attemptNumber: number;
success: boolean;
context: Record<string, any>;
operatorNotified: boolean;
}- No Infinite Loops: Every strategy has a bounded retry limit
- State Preservation: All failure context is logged before recovery
- Idempotency: Recovery actions can be safely repeated
- Observability: Every Healdec action emits structured logs
- Human Override: Operators can disable auto-healing per job type
Each worker is an independent, long-running process that:
- Polls for jobs of its designated type
- Executes job logic with proper error handling
- Reports progress via heartbeats
- Respects graceful shutdown signals
- Logs all actions to the events table
All workers implement the base Worker interface:
interface Worker {
readonly name: string;
readonly jobType: string;
readonly concurrency: number;
start(): Promise<void>;
stop(): Promise<void>;
execute(job: Job): Promise<JobResult>;
healthCheck(): Promise<boolean>;
}Purpose: Discover and index new repositories in organizations
Trigger: Scheduled hourly, manual API call
Concurrency: 5
Flow:
- Query GitHub API for org repos
- Compare against local
repostable - Insert new repos with initial status
- Emit
repo.discoveredevents - Trigger downstream identity and score jobs
Output: New rows in repos table
Purpose: Extract developer identities from Git commits
Trigger: New repo indexed, manual refresh
Concurrency: 10
Flow:
- Clone repo (shallow, last 1000 commits)
- Parse
git logfor author/committer data - Normalize email addresses (lowercase, trim)
- Insert unique identities into
identitiestable - Create identity claims linking identities to repos
Output: New rows in identities, identity_claims tables
Purpose: Compute repository health scores based on metrics
Trigger: Scheduled daily, post-commit hook
Concurrency: 8
Flow:
- Fetch repo metadata (stars, forks, issues, PRs)
- Analyze commit frequency (last 30/90/365 days)
- Check for README, LICENSE, CI config
- Compute weighted score (0-100)
- Store in
scorestable with breakdown
Scoring Formula:
score = (
activity_score * 0.40 +
documentation_score * 0.25 +
community_score * 0.20 +
maintenance_score * 0.15
)
Output: New rows in scores table
Purpose: Stream real-time events from GitHub webhooks
Trigger: GitHub webhook delivery (push, PR, issue)
Concurrency: 20
Flow:
- Validate webhook signature
- Parse event payload
- Store raw event in
eventstable - Trigger relevant downstream jobs (score recalc, identity update)
- Acknowledge webhook within 10s
Output: New rows in events table, spawned jobs
Purpose: Synchronize local repo state with GitHub source of truth
Trigger: Scheduled every 6 hours
Concurrency: 5
Flow:
- Fetch latest repo metadata from GitHub API
- Compare with local
repostable - Update changed fields (stars, description, archived status)
- Emit
repo.syncedevents - Handle deleted repos (mark as archived)
Output: Updated rows in repos table
Purpose: Clean up old data, prune stale jobs, reclaim disk space
Trigger: Scheduled nightly at 2 AM UTC
Concurrency: 1
Flow:
- Delete events older than 90 days (configurable retention)
- Remove quarantined jobs older than 30 days
- Vacuum database tables to reclaim space
- Prune unused identity claims
- Emit
gc.completedevent with stats
Output: Deleted rows, freed disk space
Purpose: Monitor system health and send notifications
Trigger: Real-time on threshold breach
Concurrency: 3
Flow:
- Subscribe to metrics stream (worker failures, queue depth)
- Evaluate alerting rules (e.g., failure rate >5% for 5 min)
- Deduplicate alerts (same alert within 1 hour)
- Send notifications via Slack, email, PagerDuty
- Log alert in
healdec_logtable
Output: External notifications, logged alerts
Purpose: Generate reports and export data to external systems
Trigger: Scheduled weekly, manual API call
Concurrency: 2
Flow:
- Query aggregated data (scores, identities, events)
- Generate CSV/JSON/PDF reports
- Upload to S3 or send via email
- Update
exportstable with status - Emit
export.completedevent
Output: Files in S3, rows in exports table
Purpose: Track and log all system actions for compliance
Trigger: Real-time on sensitive operations
Concurrency: 5
Flow:
- Subscribe to events stream (user actions, config changes)
- Enrich with user context (IP, session ID, permissions)
- Write to immutable
audit_logtable - Optionally forward to SIEM (Splunk, Datadog)
- Enforce retention policies (7 years for compliance)
Output: New rows in audit_log table
Purpose: Fix data inconsistencies and corrupted state
Trigger: Manual trigger by operator, Healdec escalation
Concurrency: 1
Flow:
- Receive repair job with target entity (repo ID, identity ID)
- Run diagnostic checks (orphaned claims, duplicate identities)
- Apply fixes (merge duplicates, rebuild indexes)
- Validate repair success
- Log repair actions in
healdec_log
Output: Corrected data, logged repair actions
Purpose: Populate historical data for newly added metrics
Trigger: Manual trigger after schema migration
Concurrency: 3
Flow:
- Query all repos missing new metric (e.g.,
security_score) - Fetch historical data from GitHub API
- Compute metric for past dates
- Insert into
scores_historytable - Emit
backfill.completedevent
Output: Historical rows in scores_history table
Purpose: Perform routine maintenance tasks (reindex, optimize)
Trigger: Scheduled weekly on Sunday 3 AM UTC
Concurrency: 1
Flow:
- Reindex database tables for query performance
- Update materialized views (top repos, trending identities)
- Optimize query plans
- Run database health checks (fragmentation, bloat)
- Emit
maintenance.completedevent with stats
Output: Optimized database, updated views
| Worker | Priority | Notes |
|---|---|---|
| IngestWorker | 10 (Highest) | Real-time webhook ingestion |
| AlertWorker | 9 | Critical for observability |
| AuditWorker | 8 | Compliance requirement |
| IdentityWorker | 6 | Blocks score computation |
| ScoreWorker | 6 | Core business logic |
| SyncWorker | 5 | Can tolerate delays |
| IndexWorker | 5 | Scheduled batch work |
| RepairWorker | 4 | On-demand only |
| ExportWorker | 3 | Low priority batch |
| BackfillWorker | 2 | One-time operations |
| MaintenanceWorker | 1 | Background optimization |
| GCWorker | 1 | Can run off-peak |
Primary entity representing a Git repository.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE repos (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL REFERENCES orgs(id),
name VARCHAR(255) NOT NULL,
full_name VARCHAR(512) NOT NULL UNIQUE,
description TEXT,
url VARCHAR(1024) NOT NULL,
stars INTEGER DEFAULT 0,
forks INTEGER DEFAULT 0,
archived BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
last_synced_at TIMESTAMP,
INDEX idx_repos_org (org_id),
INDEX idx_repos_archived (archived)
);Developer identities extracted from Git commits.
CREATE TABLE identities (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) NOT NULL UNIQUE,
name VARCHAR(255),
github_username VARCHAR(255),
avatar_url VARCHAR(1024),
merged_into UUID REFERENCES identities(id),
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
INDEX idx_identities_email (email),
INDEX idx_identities_merged (merged_into)
);Many-to-many relationship: identities ↔ repos.
CREATE TABLE identity_claims (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
identity_id UUID NOT NULL REFERENCES identities(id),
repo_id UUID NOT NULL REFERENCES repos(id),
commit_count INTEGER DEFAULT 0,
first_commit_at TIMESTAMP,
last_commit_at TIMESTAMP,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
UNIQUE (identity_id, repo_id),
INDEX idx_claims_identity (identity_id),
INDEX idx_claims_repo (repo_id)
);Repository health scores over time.
CREATE TABLE scores (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
repo_id UUID NOT NULL REFERENCES repos(id),
score INTEGER NOT NULL CHECK (score >= 0 AND score <= 100),
activity_score INTEGER,
documentation_score INTEGER,
community_score INTEGER,
maintenance_score INTEGER,
breakdown JSONB,
computed_at TIMESTAMP NOT NULL DEFAULT NOW(),
INDEX idx_scores_repo (repo_id),
INDEX idx_scores_computed (computed_at DESC)
);Immutable event log for audit and replay.
CREATE TABLE events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_type VARCHAR(100) NOT NULL,
entity_type VARCHAR(50) NOT NULL,
entity_id UUID,
payload JSONB NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
INDEX idx_events_type (event_type),
INDEX idx_events_entity (entity_type, entity_id),
INDEX idx_events_created (created_at DESC)
);Work queue for all worker types.
CREATE TABLE jobs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
job_type VARCHAR(50) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
priority INTEGER DEFAULT 5,
payload JSONB NOT NULL,
worker_id VARCHAR(100),
attempts INTEGER DEFAULT 0,
max_attempts INTEGER DEFAULT 5,
scheduled_at TIMESTAMP NOT NULL DEFAULT NOW(),
started_at TIMESTAMP,
completed_at TIMESTAMP,
error TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
INDEX idx_jobs_status (status, scheduled_at),
INDEX idx_jobs_type (job_type),
INDEX idx_jobs_worker (worker_id)
);Auto-healing decision log.
CREATE TABLE healdec_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
job_id UUID NOT NULL REFERENCES jobs(id),
worker_id VARCHAR(100),
failure_type VARCHAR(50) NOT NULL,
strategy VARCHAR(50) NOT NULL,
attempt_number INTEGER NOT NULL,
success BOOLEAN NOT NULL,
context JSONB,
operator_notified BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
INDEX idx_healdec_job (job_id),
INDEX idx_healdec_strategy (strategy),
INDEX idx_healdec_created (created_at DESC)
);Schema version tracking.
CREATE TABLE migrations (
id INTEGER PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
applied_at TIMESTAMP NOT NULL DEFAULT NOW()
);┌─────────────┐
│ GitHub Org │
└──────┬──────┘
│
v
┌─────────────────┐
│ IndexWorker │──> Discover repos via GitHub API
└────────┬────────┘
│
v
┌────────┐
│ repos │──> Insert new repos
└────┬───┘
│
├──> Spawn IdentityWorker job
│
└──> Spawn ScoreWorker job
┌──────────────────┐
│ IdentityWorker │──> Clone repo, extract identities
└─────────┬────────┘
│
v
┌────────────┐
│ identities │──> Insert unique identities
└──────┬─────┘
│
v
┌────────────────┐
│ identity_claims│──> Link identities to repos
└────────────────┘
┌──────────────────┐
│ ScoreWorker │──> Compute health score
└─────────┬────────┘
│
v
┌────────┐
│ scores │──> Store score + breakdown
└────────┘
1. IdentityWorker discovers new commits
└─> Extract author email/name from git log
2. Check if identity exists
├─> YES: Fetch existing identity_id
└─> NO: Insert new identity, get identity_id
3. Check if claim exists for (identity_id, repo_id)
├─> YES: Update commit_count, last_commit_at
└─> NO: Insert new claim
4. Emit identity.claimed event
5. Trigger identity merge job if duplicates detected
├─> Compare by email similarity
├─> Check GitHub username match
└─> Merge identities, update merged_into field
┌──────────┐
│ Job Fails│
└─────┬────┘
│
v
┌─────────────────┐
│ Healdec Invoked │
└────────┬────────┘
│
v
┌──────────────────┐
│ Classify Failure │
└──────┬───────────┘
│
├──> Transient? → Retry Strategy
├──> Worker Crash? → Restart Strategy
├──> Data Error? → Quarantine Strategy
├──> Partial Success? → Rollback Strategy
└──> Critical? → Escalate Strategy
Each strategy:
1. Log decision to healdec_log
2. Execute recovery action
3. Update job status
4. Emit healdec.action event
5. If success: Mark job completed
6. If failure: Apply next strategy or quarantine
Technology: Node.js + Express + TypeScript
Port: 3000
Auth: JWT tokens, API keys
Endpoints:
GET /api/v1/repos- List repositoriesGET /api/v1/repos/:id- Get repo detailsGET /api/v1/repos/:id/score- Get latest scoreGET /api/v1/identities- List identitiesPOST /api/v1/jobs- Create manual jobGET /api/v1/health- Health check
Features:
- Rate limiting (100 req/min per API key)
- Request validation with Zod schemas
- Structured logging with Pino
- OpenAPI spec auto-generation
- CORS support for web clients
Technology: React + TypeScript + Vite
URL: https://dashboard.algobrain.doctor
Pages:
- Home: System overview (active jobs, worker status, recent events)
- Repositories: Searchable table with health scores and filters
- Identities: Developer identity management and merge tools
- Jobs: Job queue monitor with retry/cancel actions
- Healdec: Auto-healing history and manual override controls
- Logs: Real-time log streaming with filtering
Access Control:
- Read-only: View data, no mutations
- Operator: Read + trigger jobs, approve merges
- Admin: Full access + user management
Technology: React + TypeScript + Vite
URL: https://admin.algobrain.doctor
Features:
- Organization management (create, configure, disable)
- User management (invite, assign roles, revoke access)
- System configuration (worker concurrency, Healdec rules)
- Database tools (run migrations, trigger backfills)
- Monitoring dashboards (Grafana embeds)
- Incident response (force job cancellation, quarantine inspection)
Security:
- Admin-only access (enforced at API level)
- Audit logging for all admin actions
- 2FA required for sensitive operations
The Brain-Doctor Hospital V4 interface combines Aura FX (soft neon diffusion effects), Neo-Glow (cyber-medical aesthetics), and GitHub Dark (familiar developer theme) to create a distinctive yet professional experience.
--violet-aura: #A78BFA; /* Soft purple glow for primary actions */
--aqua-pulse: #4FD1C5; /* Cyan pulse for health indicators */
--coral-heat: #F87171; /* Red-orange for alerts and warnings */
--cyber-yellow: #FACC15; /* Bright yellow for caution states */--bg-primary: #0d1117; /* Main background */
--bg-secondary: #161b22; /* Card/panel background */
--bg-tertiary: #21262d; /* Hover states */
--border-default: #30363d; /* Borders and dividers */
--text-primary: #c9d1d9; /* Primary text */
--text-secondary: #8b949e; /* Secondary text */
--text-muted: #6e7681; /* Muted/disabled text */--success: #3fb950; /* Success states */
--warning: #d29922; /* Warning states */
--error: #f85149; /* Error states */
--info: #58a6ff; /* Info states */--font-family-base: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Noto Sans', Helvetica, Arial, sans-serif;
--font-family-mono: ui-monospace, SFMono-Regular, 'SF Mono', Menlo, Consolas, 'Liberation Mono', monospace;
--font-size-xs: 0.75rem; /* 12px */
--font-size-sm: 0.875rem; /* 14px */
--font-size-base: 1rem; /* 16px */
--font-size-lg: 1.125rem; /* 18px */
--font-size-xl: 1.25rem; /* 20px */
--font-size-2xl: 1.5rem; /* 24px */
--font-size-3xl: 2rem; /* 32px */┌─────────────────────────────────────────────────────────┐
│ Top Nav Bar │
├──────────┬─────────────────────────────────┬────────────┤
│ │ │ │
│ LEFT │ CENTER │ RIGHT │
│ PANEL │ PANEL │ PANEL │
│ │ │ │
│ Repo │ Vitals + Scan Results │ Logs + │
│ Switcher │ Health Charts │ Health │
│ │ Action Buttons │ Stream │
│ │ │ │
│ 240px │ Flexible │ 320px │
│ │ │ │
└──────────┴─────────────────────────────────┴────────────┘
Breakpoints:
- Desktop: ≥1280px (3-column layout)
- Tablet: 768px - 1279px (2-column: left hidden, center + right stacked)
- Mobile: <768px (1-column: hamburger menu for left, stacked content)
| Component | Purpose | Visual Style | Interactions |
|---|---|---|---|
| VitalsModal | Display key repo health metrics | Glassmorphic card with neon borders | Hover glow, click to expand details |
| HealthReportModal | Full diagnostic report with breakdown | Fullscreen overlay with tabs | Scroll sections, download PDF |
| RepoCard | Repository list item with quick stats | Card with color-coded health bar | Hover lift effect, click to open |
| ScanBox | Individual scan result display | Bordered box with status icon | Pulse animation on active scan |
| SmartBrainTerminal | Real-time log viewer with AI insights | Monospace text with syntax highlighting | Auto-scroll, filter, search |
| BlackboxModal | Job queue inspector | Dark modal with JSON viewer | Expand/collapse, copy to clipboard |
| ImmunizerModal | Healdec healing controls | Cyber-medical theme with action buttons | Confirm dialogs, progress bars |
| ActionWorkflowModal | Trigger manual jobs and workflows | Step-by-step wizard | Form validation, real-time preview |
Applied to cards, buttons, and focus states.
.aura-glow {
box-shadow:
0 0 10px rgba(167, 139, 250, 0.3),
0 0 20px rgba(167, 139, 250, 0.2),
0 0 30px rgba(167, 139, 250, 0.1);
transition: box-shadow 0.3s ease;
}
.aura-glow:hover {
box-shadow:
0 0 15px rgba(167, 139, 250, 0.5),
0 0 30px rgba(167, 139, 250, 0.3),
0 0 45px rgba(167, 139, 250, 0.2);
}Used for active scans and real-time updates.
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.6; }
}
.pulse-effect {
animation: pulse 2s cubic-bezier(0.4, 0, 0.6, 1) infinite;
}Border animation for loading states.
@keyframes neon-sweep {
0% { border-color: var(--violet-aura); }
33% { border-color: var(--aqua-pulse); }
66% { border-color: var(--coral-heat); }
100% { border-color: var(--violet-aura); }
}
.neon-sweep {
animation: neon-sweep 3s linear infinite;
}- Cards: Lift effect (translateY -2px) + glow intensifies
- Buttons: Background lightens + border glow
- Links: Underline appears + text color shifts to accent
- Scale down: Scale 0.98 for 100ms on click
- Ripple effect: Radial gradient spreads from click point
- Success flash: Brief aqua-pulse glow on action completion
- Skeleton screens: Animated gradient shimmer
- Spinners: Rotating neon ring (violet-aura)
- Progress bars: Gradient fill with pulse effect
- Input validation: Red border + shake animation
- Failed actions: Coral-heat glow + error message toast
- Network errors: Dimmed overlay + retry button
- Minimum 44x44px for all tappable elements
- Increased padding on buttons and cards
- Swipe left/right: Navigate between dashboard tabs
- Pull to refresh: Reload current data view
- Pinch to zoom: Scale charts and graphs
- Hamburger menu: Left panel collapses into slide-out drawer
- Stacked cards: Center and right panels stack vertically
- Floating action button: Primary action always accessible
- WCAG AA compliance: Color contrast ratios ≥4.5:1
- Keyboard navigation: Tab order follows visual hierarchy
- Screen reader support: Semantic HTML + ARIA labels
- Focus indicators: High-contrast outline (not just glow)
- Reduced motion: Respects
prefers-reduced-motionmedia query
┌─────────────────────────────────────────────────────┐
│ Load Balancer (AWS ALB) │
└────────────┬──────────────────────────┬─────────────┘
│ │
┌────────v────────┐ ┌────────v────────┐
│ API Server │ │ API Server │
│ (ECS Fargate) │ │ (ECS Fargate) │
└────────┬────────┘ └────────┬────────┘
│ │
└─────────┬────────────────┘
│
┌─────────v─────────┐
│ PostgreSQL │
│ (RDS Aurora) │
│ Multi-AZ │
└─────────┬─────────┘
│
┌─────────v─────────┐
│ Redis Cache │
│ (ElastiCache) │
└───────────────────┘
┌─────────────────────────────────────────────────────┐
│ Orchestrator + Workers (ECS) │
│ ┌───────────────┐ ┌─────────────────────────┐ │
│ │ Orchestrator │ │ Worker Pool (12) │ │
│ │ (1 instance) │ │ (Auto-scaling 1-10x) │ │
│ └───────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Static Frontend (S3 + CloudFront) │
└─────────────────────────────────────────────────────┘
- Container: Node.js 20 + Express
- CPU: 1 vCPU
- Memory: 2 GB
- Auto-scaling: 2-10 tasks based on CPU >70%
- Health check:
GET /healthevery 30s
- Container: Node.js 20 + TypeScript
- CPU: 2 vCPU
- Memory: 4 GB
- Replicas: 1 (singleton pattern)
- Health check: Heartbeat to DynamoDB every 60s
- Container: Node.js 20 + TypeScript
- CPU: 1 vCPU per worker
- Memory: 2 GB per worker
- Auto-scaling: 1-10 instances per worker type
- Concurrency: Controlled by orchestrator
- Engine: PostgreSQL 15 (Aurora Serverless v2)
- ACU: 0.5 - 8 (auto-scaling)
- Backup: Daily snapshots, 7-day retention
- Replicas: 1 read replica in different AZ
- Engine: Redis 7
- Node type: cache.t4g.medium
- Replicas: 1 (automatic failover)
- Eviction: LRU policy
- Zero Downtime Deploys: Rolling updates with health checks
- Database Migrations: Run in transaction, with rollback on failure
- Worker Isolation: Each worker type in separate task definition
- Secrets Management: AWS Secrets Manager, rotated every 90 days
- Log Retention: CloudWatch logs kept for 30 days, then archived to S3
- Monitoring: Datadog APM + custom metrics dashboard
- Alerting: PagerDuty integration for critical failures
- Disaster Recovery: RTO 4 hours, RPO 1 hour (point-in-time restore)
# Database
DATABASE_URL=postgresql://user:pass@host:5432/dbname
DATABASE_POOL_SIZE=20
# Redis
REDIS_URL=redis://host:6379
REDIS_PASSWORD=secret
# GitHub
GITHUB_TOKEN=ghp_xxxxxxxxxxxxx
GITHUB_WEBHOOK_SECRET=whsec_xxxxxxxxxxxxx
# Orchestrator
ORCHESTRATOR_POLL_INTERVAL_MS=5000
ORCHESTRATOR_MAX_CONCURRENT_JOBS=100
HEALDEC_ENABLED=true
# Workers
WORKER_CONCURRENCY_INDEX=5
WORKER_CONCURRENCY_IDENTITY=10
WORKER_CONCURRENCY_SCORE=8
# ... (one per worker type)
# Observability
DATADOG_API_KEY=xxxxxxxxxxxxx
LOG_LEVEL=info
SENTRY_DSN=https://[email protected]/xxxxxxxxxxxxx
# API
JWT_SECRET=xxxxxxxxxxxxx
API_RATE_LIMIT_PER_MINUTE=100- Database schema design and migrations
- Core table implementation (repos, identities, identity_claims, scores, events)
- Orchestration tables (jobs, healdec_log, migrations)
- One-File Orchestrator skeleton
- Worker base interface and contracts
- Healdec engine framework (classification + strategies)
- Public API scaffold with health endpoint
- Basic authentication (JWT + API keys)
- Logging infrastructure (Pino + structured logs)
- CI/CD pipeline (GitHub Actions → AWS ECS)
- IndexWorker implementation (70% complete)
- IdentityWorker implementation (60% complete)
- ScoreWorker implementation (50% complete)
- Operator Dashboard UI (40% complete)
- Layout system
- VitalsModal component
- RepoCard component
- SmartBrainTerminal
- Other modals
- Healdec retry strategy (80% complete)
- Healdec restart strategy (40% complete)
- IngestWorker (webhook ingestion)
- SyncWorker (GitHub sync)
- GCWorker (garbage collection)
- AlertWorker (notifications)
- ExportWorker (reports)
- AuditWorker (compliance logging)
- RepairWorker (data repair)
- BackfillWorker (historical data)
- MaintenanceWorker (DB optimization)
- Admin Console UI
- Healdec quarantine strategy
- Healdec rollback strategy
- Healdec escalate strategy
- Full API endpoint suite
- Rate limiting implementation
- OpenAPI spec generation
- Monitoring dashboards (Grafana)
- Production load testing
- Disaster recovery drills
- v4.0.0 (2026-01-28): Initial production architecture specification
- v3.0.0 (2025-12): Prototype with 6 workers, basic healing
- v2.0.0 (2025-06): MVP with manual job triggering
- v1.0.0 (2024-12): Proof of concept, single-repo scoring
- MERMEDA.md - Complete Mermaid diagram suite
- docs/REPOSITORY_STRUCTURE.md - Suggested folder layout
- docs/OPERATOR_HANDBOOK.md - Deployment and operations guide
- docs/DOCS_SITE_STRUCTURE.md - Documentation site layout
Last Updated: 2026-01-28
Maintained By: AlgoBrainDoctor Core Team
License: MIT