Whaley Architecture Guide

This document is a comprehensive architecture reference for Whaley — the dedicated Docker instancer for CTF competitions. It covers every module, data flow, and design decision relevant to maintainers and contributors.

Project Overview
Repository Structure
Runtime Architecture
Startup & Shutdown Sequence
Core Subsystems
Request Flows
Data & Persistence
Frontend Architecture
Build & Deployment Pipeline
Security Guardrails
Prometheus Metrics
Admin Surface Map
Troubleshooting Playbook
Extension Points

Project Overview

Whaley provides isolated Docker challenge instances to CTF participants. Each user (or team, in team mode) gets their own set of containers, reachable via dynamically generated Traefik routes. The system handles the full lifecycle: spawn, extend, stop, and automatic expiry.

Core Capabilities

Capability	Description
Instance Lifecycle	Spawn / stop / extend with ownership and rate-limit enforcement
Dynamic Routing	Per-instance Traefik HTTP/TCP routers written to Redis KV at runtime
Port Allocation	Deterministic lane-based port assignment with database persistence
Authentication	CTFd Bearer-token validation or IP-based no-auth mode
Team Mode	Shared instances and flags per team, with team ownership semantics
Dynamic Flags	Per-owner unique flags injected into challenge files at spawn time
Suspicious Submission Detection	Cross-references CTFd submissions against flag ownership
Instance Forensics	Auto-capture or live-capture container logs on termination
Resource Monitoring	Per-container CPU, memory, network, block I/O, and PID metrics
Discord Notifications	Rich embed notifications for spawn/stop/extend/failure events
Admin Management	Web UI for challenges, settings, logs, flags, forensics, monitoring
Prometheus Metrics	30+ metric families for SLO tracking and operational observability

Repository Structure

whaley/
├── app/                          # FastAPI backend
│   ├── __init__.py
│   ├── main.py                   # FastAPI app, all route handlers, lifespan, metrics
│   ├── config.py                 # Pydantic Settings (50+ env vars with defaults)
│   ├── models.py                 # Pydantic request/response models
│   ├── auth.py                   # CTFd token validation, team mode, no-auth IP identity
│   ├── docker_manager.py         # Challenge loading, spawn/stop/extend lifecycle
│   ├── docker_client.py          # Docker SDK wrapper (async over docker-py)
│   ├── port_manager.py           # Lane-based port allocation with DB persistence
│   ├── traefik_redis.py          # Traefik Redis KV dynamic router/service keys
│   ├── distributed_lock.py       # Redis distributed lock with asyncio local fallback
│   ├── flag_manager.py           # Dynamic flag generation, CTFd registration, submission check
│   ├── forensics.py              # Container log capture, indexing, retention
│   ├── monitoring.py             # Container/system resource metrics via Docker stats
│   ├── logger.py                 # Event logging to SQLite/PostgreSQL + memory cache
│   ├── discord_webhook.py        # Discord rich-embed notifications for lifecycle events
│   ├── database/
│   │   ├── __init__.py
│   │   ├── connection.py         # Async SQLAlchemy engine & session factory
│   │   └── models.py             # ORM models (7 tables)
│   └── static/                   # Built frontend assets (output of Vite build)
│       ├── index.html            # User-facing React SPA
│       ├── admin.html            # Admin panel React SPA
│       ├── assets/               # Hashed JS/CSS/font bundles
│       ├── icon.png              # Favicon
│       ├── app.js                # Legacy user dashboard (no longer used)
│       └── style.css             # Legacy stylesheet (no longer used)
├── frontend/                     # React + TypeScript + Vite source
│   ├── package.json
│   ├── vite.config.js            # Multi-page build (main + admin), output to ../app/static
│   ├── tailwind.config.js
│   ├── tsconfig.json
│   ├── index.html                # User SPA entry point
│   ├── admin.html                # Admin SPA entry point
│   └── src/
│       ├── main.tsx              # User app bootstrap
│       ├── admin.tsx             # Admin app bootstrap
│       ├── admin/
│       │   ├── AdminApp.tsx      # Admin auth + 6-tab navigation
│       │   ├── pages/
│       │   │   ├── DashboardPage.tsx
│       │   │   ├── ChallengesPage.tsx
│       │   │   ├── FlagsPage.tsx
│       │   │   ├── LogsPage.tsx
│       │   │   ├── MonitoringPage.tsx
│       │   │   └── SettingsPage.tsx
│       │   └── types/
│       ├── user/
│       │   ├── UserApp.tsx       # User challenge spawning UI
│       │   └── types.ts
│       └── shared/
│           ├── api/              # HTTP client + admin/user API functions
│           ├── components/       # UI primitives (Badge, Button, Card, Modal, etc.)
│           ├── hooks/            # useConfirm, useToast
│           ├── types/            # Shared TypeScript types
│           └── utils/            # format, time, download helpers
├── challenges/                   # Challenge definitions (instance.toml + compose files)
├── data/                         # Persistent data directory
├── logs/
│   ├── forensics/                # Captured container logs (plain or gzipped)
│   └── events.jsonl              # Event log output
├── docs/
│   ├── ARCHITECTURE.md           # This file
│   ├── DOCUMENTATION.md          # User/operator documentation
│   ├── DYNAMIC-FLAGS.md          # Deep-dive on dynamic flags subsystem
│   └── IDENTITY.md               # MCTF 5.0 brand identity notes
├── images/                       # Screenshots
├── reports/                      # Security audit reports
├── Dockerfile                    # Multi-stage: Node frontend + Python backend
├── docker-compose.yaml           # Production deployment (instancer + Redis + PostgreSQL)
├── requirements.txt              # Python dependencies
├── .env.example                  # Full configuration reference
├── .env.prod                     # Production configuration (sensitive)
└── LICENSE

Runtime Architecture

In-Process Managers

All core managers are module-level singletons, instantiated at import or during lifespan startup:

Manager	Module	Role
`PortManager`	`port_manager.py`	Port allocation/release, lane distribution
`DockerManager`	`docker_manager.py`	Challenge lifecycle orchestration
`EventLogger`	`logger.py`	Event persistence to DB + memory cache
`FlagManager`	`flag_manager.py`	Dynamic flag lifecycle
`ForensicsManager`	`forensics.py`	Container log capture/retrieval
`MonitoringManager`	`monitoring.py`	Resource metrics collection
`DistributedLockManager`	`distributed_lock.py`	Redis or local asyncio locks
`DockerClient`	`docker_client.py`	Docker SDK wrapper
`TraefikRedisProvider`	`traefik_redis.py`	Redis KV router management
`DiscordWebhookNotifier`	`discord_webhook.py`	Lifecycle event notifications

Design Implication: Most runtime state (active instances, challenge configs, port maps) is process-local and memory-backed. Horizontal scaling requires Redis locks and external persistence (PostgreSQL). On restart, active-instance state is not recovered: cleanup_stale_instances_on_startup() tears down whatever the previous process left behind, while port mappings and flag indexes are reloaded from the database.

Singleton Access Pattern

Each manager exposes a lazy singleton via a get_*() function:

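# Example pattern
from typing import Optional

_flag_manager: Optional[FlagManager] = None

def get_flag_manager() -> FlagManager:
    """Create the manager on first use and reuse it afterwards."""
    global _flag_manager
    if _flag_manager is None:
        _flag_manager = FlagManager(...)  # constructor arguments elided in this guide
    return _flag_manager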

For managers requiring async initialization, init_*() and close_*() functions are called during the FastAPI lifespan.

Startup & Shutdown Sequence

Startup (in order)

1. init_database()                    # Async SQLAlchemy engine + table creation
2. init_lock_manager(REDIS_URL)       # Redis or local lock backend
3. init_event_logger()                # Determine max existing log ID
4. port_manager.initialize()          # Load persisted port mappings from DB
5. init_auth()                        # CTFd client initialization
6. docker_manager.load_challenges()   # Parse all instance.toml files
7. docker_manager.load_challenge_settings()  # Active/inactive + resource overrides
8. _load_settings_from_db()           # Apply persisted setting overrides (30+ keys)
9. init_traefik_provider()            # Bootstrap permanent Redis KV keys
10. docker_manager.cleanup_stale_instances_on_startup()  # Orphan cleanup
11. docker_manager.start_cleanup_task()    # Background 60s cleanup loop
12. init_team_mode()                   # Auto-detect or use configured team mode
13. Start _auto_check_submissions()    # Background 60s submission checker
14. Log SYSTEM_START event

Shutdown (in order)

1. Log SYSTEM_STOP event
2. Cancel submission checker background task
3. Cancel cleanup background task
4. close_traefik_provider()
5. close_lock_manager()
6. close_database()

Core Subsystems

1. Authentication & Identity (`auth.py`)

Two modes:

CTFd Mode (AUTH_MODE=ctfd):

Validates Authorization: Bearer <token> against CTFd /api/v1/users/me
Enriches with team metadata via CTFd team endpoints
Multi-strategy team member resolution (members endpoint, team details, /teams/me, user list filter)
Returns UserInfo with user_id, username, team_id, team_name

No-Auth Mode (AUTH_MODE=none):

Identifies users by IP from X-Forwarded-For or X-Real-IP headers
Creates pseudo-user with ID prefix user_
Falls back to "anonymous" if no IP available
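
The no-auth identity rules above can be sketched as follows. This is a minimal illustration, not the code in auth.py: the helper name pseudo_user_id is hypothetical, and the header precedence (X-Forwarded-For first, left-most entry wins) is an assumption.

```python
from typing import Mapping, Optional


def pseudo_user_id(headers: Mapping[str, str]) -> str:
    """Derive a pseudo-user ID from proxy headers (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {k.lower(): v for k, v in headers.items()}

    ip: Optional[str] = None
    forwarded = lowered.get("x-forwarded-for", "")
    if forwarded.strip():
        # Assumption: the left-most entry is the original client address.
        ip = forwarded.split(",")[0].strip()
    elif lowered.get("x-real-ip", "").strip():
        ip = lowered["x-real-ip"].strip()

    # Fall back to a shared "anonymous" identity when no IP is available.
    return f"user_{ip}" if ip else "anonymous"
```

Because both headers are client-controlled unless the reverse proxy overwrites them, this identity is only as trustworthy as the proxy in front of Whaley.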

Team Mode:

TEAM_MODE=auto: queries CTFd /api/v1/configs/user_mode to detect
TEAM_MODE=enabled: forces team mode, requires team membership
TEAM_MODE=disabled: forces user mode
When enabled, users without a CTFd team are refused (403)

2. Docker Client (`docker_client.py`)

Async wrapper over the docker-py SDK. All blocking operations run via loop.run_in_executor().
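
A minimal sketch of the executor pattern described above, assuming the default thread pool is used. The names run_blocking and slow_sdk_call are hypothetical; slow_sdk_call only stands in for a blocking docker-py call.

```python
import asyncio
import functools
import time
from typing import Any, Callable, List, TypeVar

T = TypeVar("T")


async def run_blocking(func: Callable[..., T], *args: Any, **kwargs: Any) -> T:
    """Run a blocking callable in the default thread pool (hypothetical helper)."""
    loop = asyncio.get_running_loop()
    # run_in_executor() forwards positional arguments only, so bind kwargs first.
    return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))


def slow_sdk_call(name: str, delay: float = 0.05) -> str:
    """Stand-in for a blocking SDK call; it just sleeps and returns a marker."""
    time.sleep(delay)
    return f"done:{name}"


async def demo() -> List[str]:
    # Each call runs in its own worker thread, so the event loop stays responsive.
    return await asyncio.gather(
        run_blocking(slow_sdk_call, "a"),
        run_blocking(slow_sdk_call, "b", delay=0.05),
    )
```

Calling asyncio.run(demo()) returns the two results in argument order, since asyncio.gather preserves ordering.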

Network Management:

create_isolated_network() — bridge driver with ICC/internal options, whaley labels
remove_network(), list_whaley_networks()

Image Management:

build_image() — with build args and cache control

Container Management:

run_container(), stop_container(), remove_container()
get_container_logs(), get_container_stats() — full resource snapshot

Compose Operations (subprocess bridge):

compose_up() / compose_down() — shells out to docker compose
list_compose_projects() / remove_compose_project() — force-cleans per-spawn resources
list_containers_by_project() — finds containers by compose project label

Resource Cleanup:

cleanup_whaley_resources() — orphan sweep for networks and per-spawn images
Uses a safety window to avoid deleting newly-created resources
Maintains a reserved network list for infrastructure-level networks

3. Docker Manager (`docker_manager.py`)

The core orchestrator. This is the largest module (~1400 lines).

Challenge Loading:

Parses instance.toml from each subdirectory of CHALLENGES_DIR
Supports .yaml and .yml compose files
_lint_compose_file() detects pinned subnets and IPs (non-fatal warnings)
Loads active/inactive status and resource overrides from ChallengeSettings DB table

Spawn Critical Section:

Validate challenge exists and is active
Enter global spawn semaphore (max 10 concurrent spawns)
Acquire distributed lock on spawn:{owner_id} key
Enforce MAX_INSTANCES_PER_USER / MAX_INSTANCES_PER_TEAM
Prevent duplicate running instance for same owner+challenge
Generate instance_id = {challenge_id}-{owner_id[:8]}-{uuid_hex[:8]}
Allocate ports via PortManager (with up to 3 retries on conflict)
Create dynamic flag if enabled
Set up connection hints (Traefik FQDN or direct host:port)
Copy challenge to temp directory, inject flags, enforce resource limits
Run docker compose up through the CLI subprocess bridge (DockerClient.compose_up())
Register Traefik route via Redis provider
On failure: best-effort cleanup (compose down, route delete, port release, network removal)
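
The guard portion of the critical section can be sketched like this. It is a simplified model under stated assumptions: SpawnGate is a hypothetical class, local asyncio locks stand in for the Redis-backed spawn:{owner_id} lock, and port allocation, flags, and compose steps are left out.

```python
import asyncio
import uuid
from collections import defaultdict
from typing import Dict, Tuple

MAX_CONCURRENT_SPAWNS = 10  # the global limit stated in the list above


def make_instance_id(challenge_id: str, owner_id: str) -> str:
    """Build {challenge_id}-{owner_id[:8]}-{uuid_hex[:8]}."""
    return f"{challenge_id}-{owner_id[:8]}-{uuid.uuid4().hex[:8]}"


class SpawnGate:
    """Hypothetical stand-in for the semaphore + per-owner lock pair."""

    def __init__(self, max_instances_per_owner: int) -> None:
        self._semaphore = asyncio.Semaphore(MAX_CONCURRENT_SPAWNS)
        self._owner_locks: Dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self._max = max_instances_per_owner
        self.running: Dict[str, Tuple[str, str]] = {}  # instance_id -> (owner, challenge)

    async def spawn(self, challenge_id: str, owner_id: str) -> str:
        async with self._semaphore:
            # Serialise spawns per owner so the checks below cannot race.
            async with self._owner_locks[f"spawn:{owner_id}"]:
                owned = [v for v in self.running.values() if v[0] == owner_id]
                if len(owned) >= self._max:
                    raise RuntimeError("instance limit reached")
                if (owner_id, challenge_id) in owned:
                    raise RuntimeError("instance already running for this challenge")
                instance_id = make_instance_id(challenge_id, owner_id)
                self.running[instance_id] = (owner_id, challenge_id)
                return instance_id
```

Holding the per-owner lock across both checks and the registration is what makes the limit and duplicate rules safe against two simultaneous requests from the same owner.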

Stop Flow:

Verify ownership (user or team member)
Optionally auto-capture forensics logs
docker compose down
Delete Traefik route
Remove isolated network
Clean per-spawn images/volumes
Release ports
Log event, send Discord notification

Extend Policy:

Extension step comes from instance.toml (extend_time, default 1800s)
Allowed only after half of timeout has elapsed
Total added extension capped at timeout (max extra = timeout)
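
The three rules can be expressed as a small pure function. This is a sketch, not the logic in docker_manager.py: evaluate_extend and ExtendDecision are hypothetical names, and refusing a step that would overshoot the cap (rather than truncating it) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ExtendDecision:
    allowed: bool
    reason: str
    new_expires_at: float


def evaluate_extend(
    created_at: float,
    expires_at: float,
    now: float,
    timeout: float,
    extend_time: float = 1800.0,
    extended_total: float = 0.0,
) -> ExtendDecision:
    """Apply the extend rules; all values are in seconds."""
    # Rule: extension is allowed only after half of the timeout has elapsed.
    if now - created_at < timeout / 2:
        return ExtendDecision(False, "too early", expires_at)
    # Rule: total added extension may not exceed the timeout itself.
    if extended_total + extend_time > timeout:
        return ExtendDecision(False, "extension cap reached", expires_at)
    return ExtendDecision(True, "ok", expires_at + extend_time)
```

With timeout=3600 and the default step of 1800, an instance can be extended at most twice, for a maximum lifetime of 7200 seconds.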

Background Cleanup (every 60s):

Stops expired RUNNING instances
Every ~10 minutes: sweeps orphan networks/images
Every ~60 minutes: cleans old forensics logs

4. Port Manager (`port_manager.py`)

Deterministic, lane-based port allocation with database persistence.

Lane-Based Algorithm:

Port range divided into N lanes (up to 32)
Blake2b hash of {user_id}:{challenge_id}:{internal_port} maps to a primary lane
Each lane maintains a cursor for next-available scanning
Falls back to other lanes in deterministic order if primary lane is full
Before use, verifies port is actually free via socket.bind() (scavenger check)
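
A compact sketch of the lane algorithm, under several assumptions: lanes are equal-sized slices of the range, the fallback order is a simple wrap-around from the primary lane, and the per-lane cursor is replaced by a scan from the start of each lane. All function names are hypothetical.

```python
import hashlib
import socket
from typing import Callable, Optional, Set

MAX_LANES = 32


def primary_lane(user_id: str, challenge_id: str, internal_port: int, lanes: int) -> int:
    """Map {user_id}:{challenge_id}:{internal_port} to a lane via BLAKE2b."""
    key = f"{user_id}:{challenge_id}:{internal_port}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % lanes


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Check that nothing on the host is bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def allocate(
    user_id: str,
    challenge_id: str,
    internal_port: int,
    start: int,
    end: int,
    lanes: int,
    taken: Set[int],
    probe: Callable[[int], bool] = port_is_free,
) -> Optional[int]:
    """Return the first usable port, trying the primary lane before the others."""
    total = end - start + 1
    lanes = max(1, min(lanes, MAX_LANES, total))
    lane_size = total // lanes
    first = primary_lane(user_id, challenge_id, internal_port, lanes)
    for offset in range(lanes):  # deterministic wrap-around fallback
        lane = (first + offset) % lanes
        low = start + lane * lane_size
        high = end if lane == lanes - 1 else low + lane_size - 1
        for port in range(low, high + 1):
            if port not in taken and probe(port):
                taken.add(port)
                return port
    return None
```

Hashing the owner, challenge, and internal port spreads unrelated allocations across lanes, so concurrent spawns rarely scan the same part of the range.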

Persistence:

Port assignments stored in user_port_mappings DB table
On respawn, same user+challenge gets the same ports (when available)
Allocation holds two distributed locks: one per user+challenge pair and one global allocator lock

Retry Logic:

Spawn retries up to 3 times on port conflicts
Each retry re-allocates fresh ports

5. Traefik Redis Provider (`traefik_redis.py`)

Writes dynamic router/service configuration into Redis KV for Traefik to consume.

Permanent Keys (bootstrapped at startup):

Default TCP catch-all route (block-all, priority 1)
Optional CTFd redirect middleware
Optional dashboard basic auth users
Additional keys from YAML file or JSON string

Per-Instance Routes:

HTTP: Router with Host({fqdn}) rule, TLS, load balancer targeting {backend_host}:{backend_port}
TCP: SNI router with HostSNI({fqdn}), TLS passthrough, load balancer address
Custom (SSH, etc.): Dedicated entrypoint, optional TLS
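
As an illustration of what a per-instance HTTP route could look like in Redis, the sketch below builds the key/value pairs without touching Redis. The key layout follows the Traefik KV provider convention with the default traefik root key; the helper name http_route_keys and the exact set of keys Whaley writes are assumptions.

```python
from typing import Dict


def http_route_keys(
    instance_id: str,
    fqdn: str,
    backend_host: str,
    backend_port: int,
    root: str = "traefik",
) -> Dict[str, str]:
    """Build KV entries for one HTTP router and its service (hypothetical helper)."""
    router = f"{root}/http/routers/{instance_id}"
    service = f"{root}/http/services/{instance_id}"
    return {
        # Match requests by host name.
        f"{router}/rule": f"Host(`{fqdn}`)",
        # Point the router at the service defined below.
        f"{router}/service": instance_id,
        # Enable TLS termination on this router.
        f"{router}/tls": "true",
        # Single backend server for the load balancer.
        f"{service}/loadBalancer/servers/0/url": f"http://{backend_host}:{backend_port}",
    }
```

Keeping every key under a prefix that contains the instance ID is what lets a later cleanup find and delete them with a pattern scan.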

Cleanup:

delete_instance_route() scans Redis with pattern matching to remove all keys for an instance
Uses Redis pipeline for atomic batch writes

6. Distributed Lock (`distributed_lock.py`)

Redis-based distributed lock with graceful fallback to local asyncio.Lock.

Redis mode: uses redis.asyncio with SET NX semantics
Local mode: transparent asyncio.Lock per name
acquire() is an async context manager with configurable timeout
acquire_multiple() acquires locks in sorted order to prevent deadlocks
is_distributed property signals whether Redis is active
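
The sorted-order rule in acquire_multiple() can be demonstrated with the local fallback alone. This is a sketch of the idea, not the module's code: LocalLockManager and its acquired_order attribute are hypothetical, and only asyncio.Lock is used.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager
from typing import AsyncIterator, Dict, Iterable, List


class LocalLockManager:
    """Hypothetical local-only lock manager built on asyncio.Lock."""

    def __init__(self) -> None:
        self._locks: Dict[str, asyncio.Lock] = {}
        self.acquired_order: List[str] = []  # recorded for illustration only

    @asynccontextmanager
    async def acquire(self, name: str, timeout: float = 5.0) -> AsyncIterator[None]:
        lock = self._locks.setdefault(name, asyncio.Lock())
        # Raises asyncio.TimeoutError if the lock is not obtained in time.
        await asyncio.wait_for(lock.acquire(), timeout)
        self.acquired_order.append(name)
        try:
            yield
        finally:
            lock.release()

    @asynccontextmanager
    async def acquire_multiple(
        self, names: Iterable[str], timeout: float = 5.0
    ) -> AsyncIterator[None]:
        # Sorting gives every caller the same acquisition order, which rules out
        # the A-then-B versus B-then-A cycle that causes deadlocks.
        async with AsyncExitStack() as stack:
            for name in sorted(set(names)):
                await stack.enter_async_context(self.acquire(name, timeout))
            yield
```

AsyncExitStack releases the locks in reverse order on exit, including when a later acquisition times out partway through.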

7. Flag Manager (`flag_manager.py`)

Complete dynamic flag lifecycle management. All data is stored in the database (FlagMappingModel, SuspiciousSubmissionModel, and WhaleySettings models); the legacy logs/flag_mappings.json file has been removed.

Full technical details (extraction algorithm, injection regex, ownership semantics, detection sequence diagrams) are in DYNAMIC-FLAGS.md.

Flag Format: PREFIX{base_content_<16 hex chars>}, where PREFIX comes from FLAG_PREFIX (default: FLAG{...})

base_content is the inner text extracted from the first PREFIX{...} placeholder found in challenge files
If no placeholder exists, falls back to fully random: PREFIX{<32 hex chars>}

Spawn-Time Flow:

_extract_base_flag_content() — scans for PREFIX{...}, extracts the inner text (priority: flag files > config > source files)
generate_flag(base_content) → FLAG{base_content_<16hex>}
create_flag_for_owner() — creates flag in CTFd via API, inserts into flag_mappings DB table, updates in-memory indexes (user_flags, owner_flags, flag_lookup)
_inject_flag_into_files() — replaces every PREFIX{...} occurrence with the dynamic flag (regex PREFIX{[^}\n]+})
Per-challenge override: If disable_dynamic_flags = true is set in instance.toml, the spawn skips flag creation entirely. Existing CTFd challenge mappings are pruned on load/reload. The admin Flags panel prevents mapping the challenge while this is set.
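
The extract, generate, and inject steps can be sketched with the standard library. The function names below are simplified stand-ins for the underscore-prefixed methods listed above, the prefix is assumed to be FLAG, and file discovery, priorities, and CTFd registration are omitted.

```python
import re
import secrets
from typing import Optional


def extract_base_content(text: str, prefix: str = "FLAG") -> Optional[str]:
    """Return the inner text of the first PREFIX{...} placeholder, if any."""
    match = re.search(re.escape(prefix) + r"\{([^}\n]+)\}", text)
    return match.group(1) if match else None


def generate_flag(base_content: Optional[str], prefix: str = "FLAG") -> str:
    """Build PREFIX{base_<16 hex>} or, without a placeholder, PREFIX{<32 hex>}."""
    if base_content is None:
        return f"{prefix}{{{secrets.token_hex(16)}}}"
    return f"{prefix}{{{base_content}_{secrets.token_hex(8)}}}"


def inject_flag(text: str, flag: str, prefix: str = "FLAG") -> str:
    """Replace every PREFIX{...} occurrence with the dynamic flag."""
    pattern = re.compile(re.escape(prefix) + r"\{[^}\n]+\}")
    # A callable replacement keeps backslashes in the flag from being
    # interpreted as regex escape sequences.
    return pattern.sub(lambda _match: flag, text)
```

secrets.token_hex(8) yields 16 hex characters, matching the suffix length in the flag format.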

Flag Reuse: Same owner+challenge always gets the same flag (looked up from in-memory index before creating a new one).

Suspicious Submission Detection:

Incremental mode (default): Tracks last_submission_id in whaley_settings. Only checks CTFd submissions with id > last_submission_id, avoiding repeated re-scanning of the same data
Full scan mode: POST /admin/api/flags/check-submissions?full_scan=true re-checks all recent submissions regardless of checkpoint
Fetches newest 5 pages (up to 250 submissions) from CTFd, cross-references provided flag against flag_lookup
Ownership comparison: user_id in user mode, team_id in team mode
Deduplication via SHA-256 unique key: hash(submitter_identity|owner_identity|flag_hash) stored in DB with unique index
New suspicious entries inserted into suspicious_submissions table immediately; admin API supports paginated queries
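
A minimal sketch of the ownership comparison and deduplication key, assuming identities are rendered as user:<id> or team:<id> strings and the flag is hashed with SHA-256 before being folded into the key. The function names and the in-memory seen set are hypothetical; the real implementation relies on a unique index in the database.

```python
import hashlib
from typing import Dict, Optional, Set


def identity(user_id: str, team_id: Optional[str], team_mode: bool) -> str:
    """Compare by team in team mode, otherwise by user."""
    return f"team:{team_id}" if team_mode else f"user:{user_id}"


def unique_key(submitter: str, owner: str, flag: str) -> str:
    """SHA-256 over submitter_identity|owner_identity|flag_hash."""
    flag_hash = hashlib.sha256(flag.encode()).hexdigest()
    return hashlib.sha256(f"{submitter}|{owner}|{flag_hash}".encode()).hexdigest()


def check_submission(
    provided_flag: str,
    submitter: str,
    flag_lookup: Dict[str, str],
    seen: Set[str],
) -> Optional[str]:
    """Return a dedup key for a new suspicious submission, else None."""
    owner = flag_lookup.get(provided_flag)
    if owner is None or owner == submitter:
        return None  # unknown flag, or the owner submitted their own flag
    key = unique_key(submitter, owner, provided_flag)
    if key in seen:
        return None  # this incident was already recorded
    seen.add(key)
    return key
```

Hashing the flag before building the key means the stored key never exposes the flag text itself.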

Startup: initialize() loads all indexes from DB (flag mappings, challenge mapping, suspicious unique keys, last submission checkpoint) into memory for fast lookups. Lazy — called on first await get_flag_manager().

8. Instance Forensics (`forensics.py`)

Captures Docker container logs for debugging and post-mortem analysis.

Capture Modes:

Auto Capture: on instance termination (configurable via FORENSICS_AUTO_CAPTURE)
Live Capture: on-demand from running instances

Capture Process:

Get container IDs via docker compose ps -q
For each container: docker inspect for name, docker logs --tail --timestamps for content
Enforce size limit (per capture) and tail line limit (per container)
Write header with metadata followed by per-container sections
Plain text or gzip-compressed output

Index: JSON-based index.json in forensics log directory. Thread-safe via Lock.

Retention: Auto-cleanup of logs older than FORENSICS_RETENTION_HOURS (default 168h / 7 days).

Concurrency: Semaphore-limited (max 5 concurrent captures).

9. Resource Monitoring (`monitoring.py`)

Real-time Docker container resource metrics.

Per-Container: CPU%, memory usage/limit/%, network RX/TX, block read/write, PIDs.

Per-Instance: Aggregated totals across all containers in a compose project.

System: Total/running containers, aggregate CPU/memory, host CPU cores, host memory (from /proc/meminfo on Linux).

CPU Calculation: Standard Docker formula — (cpu_delta / system_cpu_delta) * online_cpus * 100.
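
The formula can be applied to a Docker stats snapshot as follows. The field names (cpu_stats, precpu_stats, system_cpu_usage, online_cpus) are those of the Docker Engine stats payload; the function name cpu_percent and the zero-on-missing-data behaviour are assumptions of this sketch.

```python
from typing import Any, Dict


def cpu_percent(stats: Dict[str, Any]) -> float:
    """Compute CPU% as (cpu_delta / system_cpu_delta) * online_cpus * 100."""
    cpu = stats.get("cpu_stats", {})
    pre = stats.get("precpu_stats", {})

    cpu_delta = cpu.get("cpu_usage", {}).get("total_usage", 0) - pre.get(
        "cpu_usage", {}
    ).get("total_usage", 0)
    system_delta = cpu.get("system_cpu_usage", 0) - pre.get("system_cpu_usage", 0)

    # Older daemons omit online_cpus; fall back to the per-CPU list length.
    online = (
        cpu.get("online_cpus")
        or len(cpu.get("cpu_usage", {}).get("percpu_usage") or [])
        or 1
    )

    # The first sample has no previous reading, so both deltas may be unusable.
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    return (cpu_delta / system_delta) * online * 100.0
```

A container that used 0.2 of the system CPU time between two samples on a 4-core host therefore reports 80%, so values above 100% are expected on multi-core hosts.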

10. Event Logger (`logger.py`)

Structured event logging to SQLite/PostgreSQL with memory cache.

Event Types: INSTANCE_SPAWN, INSTANCE_SPAWN_FAILED, INSTANCE_STOP, INSTANCE_EXTEND, INSTANCE_EXPIRED, USER_LOGIN, USER_LOGIN_FAILED, AUTH_FAILURE, FLAG_CREATED, FLAG_DELETED, SUSPICIOUS_SUBMISSION, SYSTEM_START, SYSTEM_STOP.

Architecture:

Async log() writes to DB via background task, also stores in memory cache (max 1000 entries)
get_entries() queries DB with pagination and filtering
get_stats() returns totals, counts by type, unique users, 24h activity

11. Discord Webhook (`discord_webhook.py`)

Sends rich Discord embeds for lifecycle events:

Spawn: green embed with instance details, routing info, connection hints
Spawn Failure: red embed with challenge, requester, failure reason
Extend: yellow embed with extension amount, new expiry
Stop: orange embed with stop reason (user/admin/expired)

Disabled when DISCORD_WEBHOOK_URL is empty.

12. Runtime Metrics (`main.py` inline)

High-frequency operational metrics tracked in memory for Prometheus export:

Spawn latency histogram: 9 buckets (0.25s to 60s), inflight count
Operation counters: tuples of (operation, outcome, reason_label, challenge_id)
Failure reason classifiers: _classify_spawn_failure_reason() and _classify_operation_failure_reason() bucket failures into low-cardinality labels

Request Flows

Spawn Instance (`POST /instances/spawn`)

Client Request
  → User rate limit check (sliding window, 10/min)
  → Challenge active check
  → get_current_user() dependency
  → DockerManager.spawn_instance()
    → Determine owner_id (team_id or user_id)
    → Spawn semaphore acquire (max 10 concurrent)
    → Distributed lock acquire (spawn:{owner_id})
    → Instance limit check
    → Duplicate running check
    → _do_spawn_instance()
      → Generate instance_id
      → PortManager.allocate_ports_for_user() (with 3 retries)
      → FlagManager.create_flag_for_owner() (if enabled)
      → Build connection hints
      → Compose labels injection (whaley.managed, whaley.owner, whaley.created_at)
      → Resource limits enforcement (memory, CPU, PIDs)
      → Flag injection into challenge files
      → Isolated network creation (if enabled)
      → docker compose up -d --build
      → TraefikRedisProvider.register_instance_route()
      → Store instance in memory
      → Release lock
    → Release semaphore
  → Log event, Discord notification, metrics recording
  → Return PublicSpawnResponse
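
The sliding-window check at the top of the flow can be sketched with a deque of timestamps per caller. SlidingWindowLimiter is a hypothetical class; the 10-per-60-seconds default mirrors the limit stated above, and keying by user ID is an assumption.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within the trailing window."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Unlike a fixed-window counter, this never lets a caller burst twice the limit across a window boundary, at the cost of storing up to `limit` timestamps per key.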

Stop Instance (`DELETE /instances/{instance_id}`)

Client Request
  → User rate limit check
  → get_current_user() dependency
  → Find instance in memory
  → Ownership check (user or team member)
  → Forensics auto-capture (if enabled)
  → docker compose down
  → TraefikRedisProvider.delete_instance_route()
  → Remove isolated network (if created)
  → DockerClient.remove_compose_project() (clean per-spawn images/volumes)
  → PortManager.release_instance_ports()
  → Remove from memory
  → Log event, Discord notification

Extend Instance (`POST /instances/{instance_id}/extend`)

Client Request
  → User rate limit check
  → get_current_user() dependency
  → DockerManager.extend_instance()
    → Find instance, ownership check
    → Validate: RUNNING status
    → Validate: extend_time configured
    → Validate: half-timeout elapsed since creation
    → Validate: total extension ≤ timeout cap
    → Update expires_at in memory
  → Log event, Discord notification

Resource Cleanup Lifecycle

Whaley has a multi-layered cleanup strategy to prevent resource leaks (orphan containers, networks, images, volumes). Cleanup happens at four levels:

Level 1: Per-Instance Stop Cleanup (Synchronous)

When an instance is stopped (user action, admin force-stop, or expiry), the stop_instance() flow tears down everything tied to that instance:

Forensics auto-capture — if enabled, container logs are dumped before teardown
docker compose down — stops and removes containers for the project
Traefik route deletion — pattern-scans Redis for all http/routers/{id}*, http/services/{id}*, tcp/routers/{id}*, tcp/services/{id}* keys and deletes them
Isolated network removal — removes the per-instance bridge network (if network isolation was enabled)
remove_compose_project() — force-removes any remaining resources by label:
- Containers matching com.docker.compose.project={project_name}
- Networks matching the same project label (with force-disconnect fallback)
- Volumes matching the same project label (optional, default true)
- Per-spawn images — removes images whose tag starts with {project_name}- (e.g., web-challenge-abc123-def456-web:latest). Base/pulled images like nginx:alpine or python:3.11 are never matched since they don't start with a project-name prefix.
Port release — returns allocated ports to the available pool
In-memory state removal — instance deleted from docker_manager.instances dict

Level 2: Background Expiry Loop (Every 60s)

DockerManager.start_cleanup_task() runs a continuous asyncio loop:

Every 60 seconds:
  → Scan in-memory instances for expired RUNNING instances
  → Call stop_instance() for each (triggers Level 1 cleanup)
  → Log INSTANCE_EXPIRED event
  → Send Discord notification

Every ~10 minutes (10 iterations):
  → Orphan sweep via DockerClient.cleanup_whaley_resources()
    → Passes active project names as a safety set
    → 5-minute safety window (never removes resources younger than 300s)

Every ~60 minutes (60 iterations):
  → Forensics retention cleanup (delete logs older than FORENSICS_RETENTION_HOURS)
  → Reset iteration counter
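
The tick schedule can be modelled as a pure function of the iteration count. This is a sketch: tasks_for_tick and the task labels are hypothetical, and a modulo test is used in place of the counter reset described above, which yields the same cadence.

```python
from typing import List

ORPHAN_SWEEP_EVERY = 10      # ticks, roughly 10 minutes at a 60s interval
FORENSICS_CLEAN_EVERY = 60   # ticks, roughly 60 minutes


def tasks_for_tick(iteration: int) -> List[str]:
    """Return the work due on a given tick; iteration starts at 1."""
    tasks = ["stop_expired_instances"]  # runs on every tick
    if iteration % ORPHAN_SWEEP_EVERY == 0:
        tasks.append("orphan_sweep")
    if iteration % FORENSICS_CLEAN_EVERY == 0:
        tasks.append("forensics_retention_cleanup")
    return tasks


def past_safety_window(created_at: float, now: float, older_than_seconds: float = 300.0) -> bool:
    """True once a resource is old enough for the orphan sweep to consider it."""
    return now - created_at >= older_than_seconds
```

The intervals are approximate in practice because each tick also includes the time spent stopping expired instances.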

Level 3: Orphan Sweep (`cleanup_whaley_resources()`)

Catches resources that survived a failed teardown (crash, partial cleanup, labels stripped):

Reserved Networks (never touched): bridge, host, none, ctf-instances, mctf-monitoring_default, whaley-redis, whaley_default

Orphan Network Detection:

Scan all Docker networks
Skip reserved names and networks with attached containers
For each network, attempt to derive a project name:
- whaley-{project} → named isolation network → extract {project}
- {project}_default → compose-managed default network → extract {project}
If the project is NOT in the active set → candidate for removal
Safety window: skip networks younger than older_than_seconds (5 min during normal operation, 0 at startup)
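
The network rules above reduce to a name parser plus a few guards. The sketch below is illustrative only: both function names are hypothetical, and the inputs (attached container count, age) are assumed to have been read from the Docker API beforehand.

```python
from typing import Optional, Set

RESERVED_NETWORKS = {
    "bridge", "host", "none", "ctf-instances",
    "mctf-monitoring_default", "whaley-redis", "whaley_default",
}


def project_from_network(name: str) -> Optional[str]:
    """Derive the compose project name from a network name, if possible."""
    if name in RESERVED_NETWORKS:
        return None
    if name.startswith("whaley-"):        # named isolation network
        return name[len("whaley-"):] or None
    if name.endswith("_default"):         # compose-managed default network
        return name[: -len("_default")] or None
    return None


def is_orphan_network(
    name: str,
    attached_containers: int,
    age_seconds: float,
    active_projects: Set[str],
    older_than_seconds: float = 300.0,
) -> bool:
    """Apply the skip rules, the active-set check, and the safety window."""
    project = project_from_network(name)
    if project is None or attached_containers > 0:
        return False
    if project in active_projects:
        return False
    return age_seconds >= older_than_seconds
```

Checking the reserved list before the prefix rules matters because whaley-redis and whaley_default would otherwise parse as projects named redis and whaley.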

Orphan Per-Spawn Image Detection:

Scan all Docker images
Skip base images (contain / in the name part, e.g., docker.io/library/nginx)
Match tags against the Whaley project pattern: {challenge-id}-{num}-{6+hex}-{service}:{tag}
Extract the project name from the match
If the project is NOT in the active set → candidate for removal
Safety window: same older_than_seconds check via image Created timestamp

Level 4: Startup Stale Cleanup

On process restart, cleanup_stale_instances_on_startup() runs before any new instances are spawned:

Discover stale projects via three sources:
- list_compose_projects() — finds all compose projects by com.docker.compose.project container label
- list_whaley_networks() — finds networks with whaley.managed=true label, extracts project names from whaley-{project} naming pattern
- Static name heuristic — _is_managed_instance_project_name() pattern match
Identify managed projects — a project is considered Whaley-managed if it has:
- whaley.managed=true label, OR
- whaley.instance_id label matching project name, OR
- A matching whaley-prefixed network, OR
- A name matching the instance ID pattern ({challenge}-{owner}-{hex})
Tear down each stale project via remove_compose_project() — removes containers, networks, volumes, per-spawn images
Delete Traefik routes for each project name
Release ports for each project name
Final orphan sweep — calls cleanup_whaley_resources(active_project_names=set(), older_than_seconds=0) with empty active set and zero safety window (nothing is running yet, so no risk of race conditions)

Cleanup Summary Diagram

Instance Stop (user/admin/expiry)
  │
  ├─ docker compose down          ← containers gone
  ├─ Traefik route delete         ← Redis keys purged
  ├─ Isolated network remove      ← bridge net gone
  ├─ remove_compose_project()     ← residual containers, networks, volumes, per-spawn images
  └─ Port release                 ← ports back in pool

Background Loop (60s tick)
  │
  ├─ Expired instances → stop (triggers above)
  ├─ Every 10 min → orphan sweep
  │     Networks: whaley-* and *_default without active project → remove
  │     Images:   {project}-{service}:tag without active project → remove
  │     5-min safety window protects in-flight spawns
  └─ Every 60 min → forensics retention cleanup

Startup Stale Cleanup (before any spawn)
  │
  ├─ Discover via compose labels + network labels + name heuristic
  ├─ Tear down each stale project
  ├─ Clear Traefik routes + release ports
  └─ Final orphan sweep (zero safety window, no race risk)

Data & Persistence

In-Memory (process-local, volatile)

Data	Location
Active instances	`docker_manager.instances` dict
Loaded challenge configs	`docker_manager.challenges` dict
Allocated ports	`port_manager.allocated_ports` / `instance_ports`
Rate limit tracking	`_admin_rate_limit`, `_user_rate_limit` dicts in main.py
Runtime metrics	`_spawn_latency_*`, `_runtime_operation_counters`
Flag index	`FlagManager.flag_lookup`, `user_flags`, `owner_flags`
Auth mode cache	`_team_mode_enabled`, `_ctfd_mode_cache`

Database (SQLAlchemy — PostgreSQL default, SQLite fallback)

Table	Purpose
`user_port_mappings`	Persistent port allocation per user+challenge
`event_logs`	Full event audit trail
`challenge_settings`	Active/inactive toggles, per-challenge resource overrides
`whaley_settings`	Global runtime setting overrides (challenge_mapping, last_submission_id)
`instance_states`	Schema exists (for future instance recovery)
`flag_mappings`	Dynamic flag assignments per owner+challenge
`suspicious_submissions`	Detected flag-sharing incidents

File-Persisted

Path	Content
`logs/forensics/index.json`	Forensics log metadata index
`logs/forensics/*.log[.gz]`	Captured container logs
`logs/events.jsonl`	JSON-lines event log output

Note: All dynamic flag data (mappings, suspicious submissions, challenge mapping, last submission checkpoint) is stored exclusively in the database (flag_mappings, suspicious_submissions, and whaley_settings tables). The legacy logs/flag_mappings.json file has been removed.

External State

Docker daemon: containers, networks, images, compose projects (via SDK + CLI)
CTFd: users, teams, dynamic flags, submissions (via REST API)
Redis: lock keys, Traefik KV router/service keys

Frontend Architecture

Build Pipeline

The frontend is a React 18 + TypeScript + Vite application split into two SPAs:

frontend/src/
├── main.tsx → UserApp (mounts on #root)
└── admin.tsx → AdminApp (mounts on #admin-root)

Build output: frontend/ builds into app/static/ via Vite (configured in vite.config.js):

index.html → user SPA entry
admin.html → admin SPA entry
assets/ → hashed JS/CSS/font bundles

Docker build: Multi-stage — first stage runs npm ci && npm run build in Node 20 Alpine, second stage copies built assets into the Python image.

User Application (`UserApp.tsx`)

CTFd mode: Shows token login panel, stores token in sessionStorage
No-auth mode: Auto-authenticates
Challenge Deck: Lists active challenges with category badges, deploy buttons
Active Instances: Lifecycle cards with endpoint copy, connection hints, countdown timers, extend/stop buttons
Auto-refresh: Instances and health polled every 10s, clock ticks every 1s

Admin Application (`AdminApp.tsx`)

Six tabs, URL hash-based navigation:

Tab	Component	Features
Dashboard	`DashboardPage`	Stats cards, active instances list, force-stop
Logs	`LogsPage`	Event log viewer (filtered/paginated), port mappings, forensics (stats, toggle, live capture, log viewer)
Flags	`FlagsPage`	Flag mappings, suspicious submissions, CTFd sync wizard
Challenges	`ChallengesPage`	Upload zip, file browser/editor, active toggle, reload config
Monitoring	`MonitoringPage`	System metrics, per-instance container CPU/RAM, high-usage filter
Settings	`SettingsPage`	30+ editable settings with type-aware inputs, sectioned layout, change tracking

Shared UI Components

All UI components live in frontend/src/shared/components/ui/:

Badge — color-coded status tags (neutral/success/warning/danger/info)
Button — variants (primary/secondary/danger/ghost), sizes (sm/md)
Card — Dark gradient container with border
EmptyState — Placeholder for empty/no-data states
Input, Select, Textarea — Form inputs with consistent dark styling
Loader — Spinning border animation
Modal — Overlay dialog with backdrop blur
Tabs — Horizontal tab navigation

Design System

Colors: graphite #0D0D0D, anthracite #262626, steel #737373, deepviolet #200F38, mist #F2F2F2
Fonts: Chakra Petch (display), Public Sans (body), IBM Plex Mono (monospace)
Styling: Tailwind CSS with custom theme, dark background with purple accent
Background: Radial gradient with CSS grid overlay effect

Build & Deployment Pipeline

Docker Multi-Stage Build

# Stage 1: Node 20 Alpine — builds frontend
FROM node:20-alpine AS frontend-builder
# Vite writes to ../app/static, so building here yields /build/app/static
WORKDIR /build/frontend
COPY frontend/ ./
RUN npm ci --include=dev
RUN npm install -g vite
RUN npm run build

# Stage 2: Python 3.11 Slim — runs backend
FROM python:3.11-slim
# Install Docker CLI + compose plugin
# Copy Python deps + app code
# Overlay compiled SPA assets from Stage 1
COPY --from=frontend-builder /build/app/static/index.html ./app/static/
COPY --from=frontend-builder /build/app/static/admin.html ./app/static/
COPY --from=frontend-builder /build/app/static/assets ./app/static/assets/
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Docker Compose Deployment

Services:
  redis        — Redis 7 Alpine, append-only persistence, healthcheck
  postgres     — PostgreSQL 16 Alpine, persistent volume, healthcheck
  instancer    — Whaley app, depends on healthy redis + postgres

Networks:
  ctf-instances — bridge network for inter-service communication

Volumes:
  redis_data    — Redis AOF persistence
  postgres_data — PostgreSQL data persistence
  /var/run/docker.sock — Mounted for container management
  ./challenges  — Challenge definitions (rw)
  ./logs        — Event logs, forensics
  ./data        — Forensics, event logs, misc data

Security Guardrails

Implemented Controls

Control	Mechanism
Admin authentication	`X-Admin-Key` header, per-IP rate limiting (default 150/min)
User rate limiting	Sliding window, 10 requests/min for spawn/stop/extend
Metrics protection	Bearer token via `METRICS_SECRET`, constant-time comparison
Path traversal prevention	Symlink resolution + containment check for all file operations
Zip upload protection	Max size (50MB), max entries (1000), max extracted (200MB), zip-slip validation
Security headers	CSP, X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Referrer-Policy
Network isolation	Per-instance bridge network, with inter-container communication (ICC) optionally disabled
Resource caps	Memory, CPU, PID limits enforced on all containers
Fork bomb protection	`CONTAINER_PIDS_LIMIT` (default 256)
Ownership enforcement	Instance access checked against user identity + team membership
Trusted proxy support	CIDR-aware IP extraction for admin endpoints
String escaping	Prometheus label values escaped per exposition format spec
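
The path traversal control in the table above (symlink resolution followed by a containment check) can be illustrated with a minimal sketch. The function name `resolve_inside` and its parameters are hypothetical and do not come from the Whaley codebase; the sketch only shows the general technique, assuming Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir and reject anything that escapes it.

    Hypothetical helper for illustration, not Whaley's actual implementation.
    """
    base = Path(base_dir).resolve()
    # resolve() follows symlinks and collapses ".." segments, so the
    # containment check below runs against the real target on disk.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

An absolute `user_path` such as `/etc/passwd` replaces the base when joined, so it fails the containment check just like `../` sequences do.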

Known Considerations

CORS: Allows all origins (allow_credentials=false)
Admin key storage: Browser localStorage (admin UI)
No-auth IP trust: Uses forwarded headers directly (not trusted-proxy filtered)
Monitoring host checks: Linux-specific (/proc/meminfo, nproc)
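
To make the difference between trusted-proxy filtering and direct use of forwarded headers concrete, here is a rough sketch of CIDR-aware client IP extraction. The function `client_ip` and its arguments are assumptions made for illustration; Whaley's admin-side extraction may differ in detail.

```python
import ipaddress
from typing import List, Optional


def client_ip(peer_ip: str, forwarded_for: Optional[str], trusted_cidrs: List[str]) -> str:
    """Return the client IP, honouring X-Forwarded-For only behind trusted proxies.

    Hypothetical sketch; names and behaviour are not taken from Whaley.
    """
    networks = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]

    def parse(value: str):
        try:
            return ipaddress.ip_address(value)
        except ValueError:
            return None

    def trusted(addr) -> bool:
        return any(addr in net for net in networks)

    peer = parse(peer_ip)
    if not forwarded_for or peer is None or not trusted(peer):
        # The header is attacker-controlled unless a trusted proxy set it.
        return peer_ip
    hops = [h.strip() for h in forwarded_for.split(",") if h.strip()]
    # Walk right to left: the right-most untrusted hop is the real client.
    for hop in reversed(hops):
        addr = parse(hop)
        if addr is None:
            return peer_ip  # malformed header: fall back to the socket peer
        if not trusted(addr):
            return str(addr)
    return peer_ip
```

Using the header value directly, as the no-auth mode is described as doing, lets any client choose its own identity by sending a forged `X-Forwarded-For`.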

Prometheus Metrics

Exposed at GET /metrics (protected by METRICS_SECRET Bearer token).
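
A minimal sketch of the Bearer token check with constant-time comparison, as described in Security Guardrails. The function name `metrics_token_ok` is hypothetical, and denying access when the secret is empty is an assumption of this sketch rather than documented Whaley behaviour.

```python
import hmac
from typing import Optional


def metrics_token_ok(authorization_header: Optional[str], secret: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header against the secret.

    Hypothetical helper; the empty-secret policy below is an assumption.
    """
    if not authorization_header or not secret:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking the secret through response timing.
    return hmac.compare_digest(token.strip().encode(), secret.encode())
```

A scrape would then send the value of `METRICS_SECRET` as `Authorization: Bearer <secret>` on `GET /metrics`.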

Metric Families (30+)

Category	Metrics
Instances	Total by status, owner, team, challenge; owner saturation ratios
Lifecycle	Per-instance expiry timestamps, age seconds, connection info
Alerts	Instances expiring within 5/10 min, stale STARTING instances (>2 min)
Ports	Allocated count, available count, utilization percent
Challenges	Loaded count, active count
Flags	Total dynamic flags assigned, suspicious submission count
Forensics	Auto-capture on/off, total log count, auto/live breakdown, total size
Runtime Ops	Operation outcome counters (spawn/stop/extend with success/failure + reason)
Spawn Latency	Histogram (9 buckets: 0.25s–60s), inflight count
Event Logs	Total entries, counts by event type, unique users, 24h activity

Label Conventions

outcome: success | failure
operation: spawn | stop | extend
challenge_id: normalized, unknown if not applicable
reason: normalized failure reason (underscore-separated, max 64 chars)
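
The label conventions above, together with the exposition-format escaping listed under Security Guardrails, could be implemented along these lines. Both function names are hypothetical, and falling back to `unknown` for an empty reason is an assumption borrowed from the `challenge_id` convention.

```python
import re


def normalize_reason(raw: str, max_len: int = 64) -> str:
    """Lowercase, underscore-separate, and truncate a failure reason.

    Hypothetical sketch; the 'unknown' fallback is an assumption.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", raw.lower()).strip("_")
    return slug[:max_len].rstrip("_") or "unknown"


def escape_label_value(value: str) -> str:
    """Escape a label value per the Prometheus text exposition format.

    Backslash, newline, and double quote are the characters that need escaping.
    """
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')
```

Bounding the reason length and character set keeps label cardinality predictable even when the underlying error message contains user-controlled text.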

Admin Surface Map

Authentication: `verify_admin_key()` dependency

Checks X-Admin-Key header + per-IP rate limit.

Endpoints by Domain

Dashboard & Logs:

GET  /{admin_path}                          — Admin SPA
GET  /admin/api/stats                       — Event logger stats + instances + ports
GET  /admin/api/logs                        — Paginated event logs (filtered)
GET  /admin/api/instances                   — All active instances
DELETE /admin/api/instances/{id}             — Force-stop (admin override)

Port Management:

GET    /admin/api/user-ports                — All user port mappings
GET    /admin/api/port-stats                — Port usage statistics
DELETE /admin/api/user-ports                — Clear all mappings
DELETE /admin/api/user-ports/{user_id}       — Delete user's ports

Dynamic Flags:

GET    /admin/api/flags                     — Full flags state
POST   /admin/api/flags/check-submissions   — Manual submission check
GET    /admin/api/flags/suspicious          — List suspicious entries
DELETE /admin/api/flags/suspicious          — Clear suspicious list
GET    /admin/api/flags/mappings            — All flag mappings
DELETE /admin/api/flags/user/{user_id}       — Delete user's flags
DELETE /admin/api/flags/{flag_id}            — Delete single flag
POST   /admin/api/flags/sync-challenge       — Map local → CTFd challenge
DELETE /admin/api/flags/mapping/{id}         — Remove mapping
GET    /admin/api/ctfd/challenges            — Fetch CTFd challenges (sync wizard)

Forensics:

GET    /admin/api/forensics/stats           — Forensics statistics
POST   /admin/api/forensics/toggle          — Enable/disable auto-capture
GET    /admin/api/forensics/logs            — List logs (filtered)
GET    /admin/api/forensics/logs/{id}        — Get log content
DELETE /admin/api/forensics/logs/{id}        — Delete specific log
DELETE /admin/api/forensics/logs            — Clear all logs
POST   /admin/api/forensics/live-capture/{id} — On-demand capture
POST   /admin/api/forensics/cleanup          — Manual retention cleanup

Monitoring:

GET /admin/api/monitoring/system            — System-level metrics
GET /admin/api/monitoring/instances         — Per-instance container metrics

Challenge Management:

GET    /admin/api/challenges/list            — List all challenges
POST   /admin/api/challenges/upload          — Zip upload
DELETE /admin/api/challenges/{id}            — Delete challenge directory
GET    /admin/api/challenges/{id}/files      — Browse file tree
GET    /admin/api/challenges/{id}/files/{path} — Read file
PUT    /admin/api/challenges/{id}/files/{path} — Write file
POST   /admin/api/challenges/{id}/files/{path} — Create file
DELETE /admin/api/challenges/{id}/files/{path} — Delete file/directory
POST   /admin/api/challenges/{id}/reload     — Reload config
POST   /admin/api/challenges/{id}/toggle     — Toggle active/inactive
GET    /admin/api/challenges/settings         — All challenge settings
PUT    /admin/api/challenges/{id}/resources   — Set resource overrides
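
For the zip upload endpoint, the limits listed under Security Guardrails (50MB archive, 1000 entries, 200MB extracted, zip-slip validation) might be enforced with a pre-extraction check like the sketch below. The function name, the constants, and the interpretation of MB as 1024 * 1024 bytes are assumptions for illustration only.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import List

# Assumed values mirroring the documented limits (MB read as MiB here).
MAX_ZIP_BYTES = 50 * 1024 * 1024
MAX_ENTRIES = 1000
MAX_EXTRACTED_BYTES = 200 * 1024 * 1024


def validate_challenge_zip(data: bytes) -> List[str]:
    """Check an uploaded archive before extraction and return its entry names.

    Hypothetical helper, not Whaley's actual upload handler.
    """
    if len(data) > MAX_ZIP_BYTES:
        raise ValueError("archive exceeds upload size limit")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        if len(infos) > MAX_ENTRIES:
            raise ValueError("archive has too many entries")
        # file_size is the declared uncompressed size from the zip headers.
        if sum(info.file_size for info in infos) > MAX_EXTRACTED_BYTES:
            raise ValueError("archive expands beyond the extraction limit")
        for info in infos:
            path = PurePosixPath(info.filename.replace("\\", "/"))
            # Zip-slip: reject absolute paths and any ".." component.
            if path.is_absolute() or ".." in path.parts:
                raise ValueError(f"unsafe entry path: {info.filename!r}")
        return [info.filename for info in infos]
```

Because declared sizes can be forged, a production extractor would typically also count bytes while writing and abort once the extracted total passes the limit.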

Runtime Settings:

GET    /admin/api/settings                   — Current values with override status
PUT    /admin/api/settings                   — Update settings (validated, persisted, applied)
DELETE /admin/api/settings/{key}             — Reset to default
POST   /admin/api/settings/load              — Reload all from DB

Troubleshooting Playbook

Spawn fails immediately

Check challenge has valid instance.toml and compose file
Verify port range capacity (/admin/api/port-stats)
Check Docker daemon availability and compose output in error message
Verify resource caps aren't overly restrictive
Check Redis connectivity (lock acquisition failures)

User cannot spawn despite low load

Check owner already has same challenge running (duplicate block)
Verify MAX_INSTANCES_PER_USER / MAX_INSTANCES_PER_TEAM limits
Check challenge active flag (admin toggle)
Verify user isn't rate-limited (10 req/min window)
In team mode: verify user belongs to a CTFd team
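
When diagnosing a rate-limited user, it helps to picture how a sliding window of 10 requests per minute behaves. The class below is an in-memory sketch of that technique; the name `SlidingWindowLimiter` and the per-process storage are assumptions, and Whaley's limiter may keep its state elsewhere.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within the trailing window.

    Hypothetical sketch; state lives in this process only.
    """

    def __init__(self, limit: int = 10, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Under this model a user who sends 10 spawn/stop/extend requests in quick succession stays blocked until the oldest of those requests is 60 seconds old, not until the start of the next clock minute.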

Team mode behavior seems wrong

Check TEAM_MODE setting (enabled/disabled/auto)
Verify CTFd user_mode detection and API key permissions
Check whether user has team_id from CTFd token validation
Verify team member resolution via admin flags/logs

Dynamic flags not generated

Confirm DYNAMIC_FLAGS_ENABLED=true
Confirm AUTH_MODE=ctfd (required for dynamic flags)
Verify CTFD_URL and CTFD_API_KEY are valid
Check that a challenge mapping exists in the admin flags panel
Check that FLAG_PREFIX matches the placeholder convention in challenge files

Forensics missing

Check that auto-capture is enabled (toggle in the admin forensics tab)
Check that capture size/tail limits are not too small
Verify disk permissions for FORENSICS_LOG_DIR
For live capture, confirm the instance is still running

Port conflicts or exhaustion

Check port range is large enough for expected concurrent instances
Verify no external processes using ports in the allocation range
Check user_port_mappings table for stale entries
Port allocation retries up to 3 times automatically
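
To check whether external processes hold ports inside the allocation range, a quick probe can attempt to bind each port. This is a diagnostic sketch with a hypothetical function name; it is meant to be run on the Docker host, and the range passed in should be whatever the deployment actually configures.

```python
import socket
from typing import List


def ports_in_use(start: int, end: int, host: str = "0.0.0.0") -> List[int]:
    """Return TCP ports in [start, end] that cannot currently be bound.

    Hypothetical diagnostic helper, not part of Whaley.
    """
    busy = []
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                # Another listener holds the port, or binding is not permitted.
                busy.append(port)
    return busy
```

Ports reported here that do not belong to a running instance are candidates for either an external conflict or a stale entry in user_port_mappings.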

Extension Points

Safe Extension Points

New admin APIs: Add routes in main.py with verify_admin_key dependency
New challenge metadata: Extend ChallengeConfig in docker_manager.py
New event types: Add to EventType enum in logger.py, log in appropriate handlers
New monitoring metrics: Extend MonitoringManager data model
New settings: Add to EDITABLE_SETTINGS dict and _load_settings_from_db() type map in main.py
New frontend pages: Add component in frontend/src/admin/pages/, wire into AdminApp.tsx tabs

High-Risk Modification Zones

Spawn/stop critical sections in docker_manager.py — lock ordering and cleanup guarantees
Lock semantics in distributed_lock.py — deadlock prevention, timeout behavior
Port allocation persistence in port_manager.py — conflict safety guarantees
Dynamic flag ownership logic in flag_manager.py — user vs team comparison paths
Traefik route registration in traefik_redis.py — key naming and cleanup completeness

Known Gaps

Instance recovery: Active instance state is held in memory; on restart, the in-memory map is lost. The instance_states ORM table exists but is not wired into active recovery flows.
Auth IP trust asymmetry: Admin IP extraction is trusted-proxy-aware; no-auth user identity uses forwarded headers directly.
Platform assumptions: Monitoring host metrics rely on Linux-specific interfaces (/proc/meminfo, nproc).
Compose orchestration: While many Docker operations use the SDK, compose lifecycle still shells out to docker compose, requiring Docker CLI availability.
Flag manager logging: Some FlagManager logging calls invoke async functions without await. The resulting coroutine object is never run, so the corresponding log entries are silently dropped.
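
The missing-await gap is easy to reproduce in isolation. In the sketch below, `log_event` and both handlers are hypothetical stand-ins for the FlagManager logging path, not actual Whaley functions.

```python
import asyncio
from typing import List

written: List[str] = []


async def log_event(message: str) -> None:
    """Stand-in for an async logger call (hypothetical)."""
    await asyncio.sleep(0)
    written.append(message)


async def handler_buggy() -> None:
    # BUG: this only creates a coroutine object; its body never executes.
    log_event("flag assigned")


async def handler_fixed() -> None:
    # Awaiting runs the coroutine to completion before the handler returns.
    await log_event("flag assigned")
```

Running `asyncio.run(handler_buggy())` leaves `written` empty, and Python reports the problem only as a RuntimeWarning ("coroutine ... was never awaited"), which is easy to miss in service logs. Running the fixed handler appends the entry as expected.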
