This document is a comprehensive architecture reference for Whaley — the dedicated Docker instancer for CTF competitions. It covers every module, data flow, and design decision relevant to maintainers and contributors.
- Project Overview
- Repository Structure
- Runtime Architecture
- Startup & Shutdown Sequence
- Core Subsystems
- Request Flows
- Data & Persistence
- Frontend Architecture
- Build & Deployment Pipeline
- Security Guardrails
- Prometheus Metrics
- Admin Surface Map
- Troubleshooting Playbook
- Extension Points
Whaley provides isolated Docker challenge instances to CTF participants. Each user (or team, in team mode) gets their own set of containers, reachable via dynamically-generated Traefik routes. The system handles the full lifecycle: spawn, extend, stop, and automatic expiry.
| Capability | Description |
|---|---|
| Instance Lifecycle | Spawn / stop / extend with ownership and rate-limit enforcement |
| Dynamic Routing | Per-instance Traefik HTTP/TCP routers written to Redis KV at runtime |
| Port Allocation | Deterministic lane-based port assignment with database persistence |
| Authentication | CTFd Bearer-token validation or IP-based no-auth mode |
| Team Mode | Shared instances and flags per team, with team ownership semantics |
| Dynamic Flags | Per-owner unique flags injected into challenge files at spawn time |
| Suspicious Submission Detection | Cross-references CTFd submissions against flag ownership |
| Instance Forensics | Auto-capture or live-capture container logs on termination |
| Resource Monitoring | Per-container CPU, memory, network, block I/O, and PID metrics |
| Discord Notifications | Rich embed notifications for spawn/stop/extend/failure events |
| Admin Management | Web UI for challenges, settings, logs, flags, forensics, monitoring |
| Prometheus Metrics | 30+ metric families for SLO tracking and operational observability |
whaley/
├── app/ # FastAPI backend
│ ├── __init__.py
│ ├── main.py # FastAPI app, all route handlers, lifespan, metrics
│ ├── config.py # Pydantic Settings (50+ env vars with defaults)
│ ├── models.py # Pydantic request/response models
│ ├── auth.py # CTFd token validation, team mode, no-auth IP identity
│ ├── docker_manager.py # Challenge loading, spawn/stop/extend lifecycle
│ ├── docker_client.py # Docker SDK wrapper (async over docker-py)
│ ├── port_manager.py # Lane-based port allocation with DB persistence
│ ├── traefik_redis.py # Traefik Redis KV dynamic router/service keys
│ ├── distributed_lock.py # Redis distributed lock with asyncio local fallback
│ ├── flag_manager.py # Dynamic flag generation, CTFd registration, submission check
│ ├── forensics.py # Container log capture, indexing, retention
│ ├── monitoring.py # Container/system resource metrics via Docker stats
│ ├── logger.py # Event logging to SQLite/PostgreSQL + memory cache
│ ├── discord_webhook.py # Discord rich-embed notifications for lifecycle events
│ ├── database/
│ │ ├── __init__.py
│ │ ├── connection.py # Async SQLAlchemy engine & session factory
│ │ └── models.py # ORM models (7 tables)
│ └── static/ # Built frontend assets (output of Vite build)
│ ├── index.html # User-facing React SPA
│ ├── admin.html # Admin panel React SPA
│ ├── assets/ # Hashed JS/CSS/font bundles
│ ├── icon.png # Favicon
│ ├── app.js # Legacy user dashboard (no longer used)
│ └── style.css # Legacy stylesheet (no longer used)
├── frontend/ # React + TypeScript + Vite source
│ ├── package.json
│ ├── vite.config.js # Multi-page build (main + admin), output to ../app/static
│ ├── tailwind.config.js
│ ├── tsconfig.json
│ ├── index.html # User SPA entry point
│ ├── admin.html # Admin SPA entry point
│ └── src/
│ ├── main.tsx # User app bootstrap
│ ├── admin.tsx # Admin app bootstrap
│ ├── admin/
│ │ ├── AdminApp.tsx # Admin auth + 6-tab navigation
│ │ ├── pages/
│ │ │ ├── DashboardPage.tsx
│ │ │ ├── ChallengesPage.tsx
│ │ │ ├── FlagsPage.tsx
│ │ │ ├── LogsPage.tsx
│ │ │ ├── MonitoringPage.tsx
│ │ │ └── SettingsPage.tsx
│ │ └── types/
│ ├── user/
│ │ ├── UserApp.tsx # User challenge spawning UI
│ │ └── types.ts
│ └── shared/
│ ├── api/ # HTTP client + admin/user API functions
│ ├── components/ # UI primitives (Badge, Button, Card, Modal, etc.)
│ ├── hooks/ # useConfirm, useToast
│ ├── types/ # Shared TypeScript types
│ └── utils/ # format, time, download helpers
├── challenges/ # Challenge definitions (instance.toml + compose files)
├── data/ # Persistent data directory
├── logs/
│ ├── forensics/ # Captured container logs (plain or gzipped)
│ └── events.jsonl # Event log output
├── docs/
│ ├── ARCHITECTURE.md # This file
│ ├── DOCUMENTATION.md # User/operator documentation
│ ├── DYNAMIC-FLAGS.md # Deep-dive on dynamic flags subsystem
│ └── IDENTITY.md # MCTF 5.0 brand identity notes
├── images/ # Screenshots
├── reports/ # Security audit reports
├── Dockerfile # Multi-stage: Node frontend + Python backend
├── docker-compose.yaml # Production deployment (instancer + Redis + PostgreSQL)
├── requirements.txt # Python dependencies
├── .env.example # Full configuration reference
├── .env.prod # Production configuration (sensitive)
└── LICENSE
All core managers are module-level singletons, instantiated at import or during lifespan startup:
| Manager | Module | Role |
|---|---|---|
| `PortManager` | `port_manager.py` | Port allocation/release, lane distribution |
| `DockerManager` | `docker_manager.py` | Challenge lifecycle orchestration |
| `EventLogger` | `logger.py` | Event persistence to DB + memory cache |
| `FlagManager` | `flag_manager.py` | Dynamic flag lifecycle |
| `ForensicsManager` | `forensics.py` | Container log capture/retrieval |
| `MonitoringManager` | `monitoring.py` | Resource metrics collection |
| `DistributedLockManager` | `distributed_lock.py` | Redis or local asyncio locks |
| `DockerClient` | `docker_client.py` | Docker SDK wrapper |
| `TraefikRedisProvider` | `traefik_redis.py` | Redis KV router management |
| `DiscordWebhookNotifier` | `discord_webhook.py` | Lifecycle event notifications |

Design Implication: Most runtime state (active instances, challenge configs, port maps) is process-local and memory-backed. Horizontal scaling requires Redis locks and external persistence (PostgreSQL). On restart, in-memory state is rebuilt via cleanup_stale_instances_on_startup().
Each manager exposes a lazy singleton via a get_*() function:

    # Example pattern (FlagManager is defined in flag_manager.py)
    from typing import Optional

    _flag_manager: Optional[FlagManager] = None

    def get_flag_manager() -> FlagManager:
        global _flag_manager
        if _flag_manager is None:
            # Constructed on first use, then reused for the life of the process
            _flag_manager = FlagManager(...)
        return _flag_manager

For managers requiring async initialization, init_*() and close_*() functions are called during the FastAPI lifespan.
Startup sequence:
1. init_database() # Async SQLAlchemy engine + table creation
2. init_lock_manager(REDIS_URL) # Redis or local lock backend
3. init_event_logger() # Determine max existing log ID
4. port_manager.initialize() # Load persisted port mappings from DB
5. init_auth() # CTFd client initialization
6. docker_manager.load_challenges() # Parse all instance.toml files
7. docker_manager.load_challenge_settings() # Active/inactive + resource overrides
8. _load_settings_from_db() # Apply persisted setting overrides (30+ keys)
9. init_traefik_provider() # Bootstrap permanent Redis KV keys
10. docker_manager.cleanup_stale_instances_on_startup() # Orphan cleanup
11. docker_manager.start_cleanup_task() # Background 60s cleanup loop
12. init_team_mode() # Auto-detect or use configured team mode
13. Start _auto_check_submissions() # Background 60s submission checker
14. Log SYSTEM_START event
Shutdown sequence:
1. Log SYSTEM_STOP event
2. Cancel submission checker background task
3. Cancel cleanup background task
4. close_traefik_provider()
5. close_lock_manager()
6. close_database()
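
To illustrate how an ordered startup and shutdown like the one above can hang off a single lifespan context manager, here is a minimal sketch. It abbreviates the sequence to a few steps, and `step()` is a hypothetical stand-in that only records names; it is not Whaley's actual `main.py`.

```python
import asyncio
from contextlib import asynccontextmanager

calls: list = []  # records the order in which the stand-in steps run


async def step(name: str) -> None:
    """Hypothetical stand-in for an init_*/close_* coroutine."""
    calls.append(name)


@asynccontextmanager
async def lifespan(app=None):
    # Startup steps run in the documented order (abbreviated here).
    for name in ("init_database", "init_lock_manager", "init_event_logger"):
        await step(name)
    try:
        yield
    finally:
        # Shutdown steps run even if the application body raised.
        for name in ("close_traefik_provider", "close_lock_manager", "close_database"):
            await step(name)


async def main() -> list:
    async with lifespan():
        calls.append("serving")
    return calls


order = asyncio.run(main())
```

FastAPI accepts an async context manager of this shape through its `lifespan` parameter, so the same structure applies when `app` is a real application object.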
Two modes:
CTFd Mode (AUTH_MODE=ctfd):
- Validates `Authorization: Bearer <token>` against CTFd `/api/v1/users/me`
- Enriches with team metadata via CTFd team endpoints
- Multi-strategy team member resolution (members endpoint, team details, `/teams/me`, user list filter)
- Returns `UserInfo` with `user_id`, `username`, `team_id`, `team_name`

No-Auth Mode (`AUTH_MODE=none`):
- Identifies users by IP from `X-Forwarded-For` or `X-Real-IP` headers
- Creates pseudo-user with ID prefix `user_`
- Falls back to "anonymous" if no IP available

Team Mode:
- `TEAM_MODE=auto`: queries CTFd `/api/v1/configs/user_mode` to detect
- `TEAM_MODE=enabled`: forces team mode, requires team membership
- `TEAM_MODE=disabled`: forces user mode
- When enabled, users without a CTFd team are refused (403)
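
The no-auth identity rule described above can be sketched as follows. This is an illustration only: the function name `no_auth_identity`, the header precedence (`X-Forwarded-For` before `X-Real-IP`), and the exact ID shape `user_<ip>` are assumptions, not taken from `auth.py`.

```python
from typing import Mapping


def no_auth_identity(headers: Mapping[str, str]) -> str:
    """Derive a pseudo-user ID from proxy headers (illustrative sketch)."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for", "")
    # X-Forwarded-For may carry a comma-separated chain; the first entry is the client.
    ip = forwarded.split(",")[0].strip() or lowered.get("x-real-ip", "").strip()
    return f"user_{ip}" if ip else "anonymous"
```

Because these headers are client-controllable unless a proxy overwrites them, an identity built this way is only as trustworthy as the proxy in front of the application.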
Async wrapper over the docker-py SDK. All blocking operations run via loop.run_in_executor().
Network Management:
- `create_isolated_network()`: bridge driver with ICC/internal options, whaley labels
- `remove_network()`, `list_whaley_networks()`

Image Management:
- `build_image()`: with build args and cache control

Container Management:
- `run_container()`, `stop_container()`, `remove_container()`
- `get_container_logs()`, `get_container_stats()`: full resource snapshot

Compose Operations (subprocess bridge):
- `compose_up()` / `compose_down()`: shells out to `docker compose`
- `list_compose_projects()` / `remove_compose_project()`: force-cleans per-spawn resources
- `list_containers_by_project()`: finds containers by compose project label

Resource Cleanup:
- `cleanup_whaley_resources()`: orphan sweep for networks and per-spawn images
- Uses a safety window to avoid deleting newly-created resources
- Maintains a reserved network list for infrastructure-level networks
The core orchestrator. This is the largest module (~1400 lines).
Challenge Loading:
- Parses `instance.toml` from each subdirectory of `CHALLENGES_DIR`
- Supports `.yaml` and `.yml` compose files
- `_lint_compose_file()` detects pinned subnets and IPs (non-fatal warnings)
- Loads active/inactive status and resource overrides from the `ChallengeSettings` DB table

Spawn Critical Section:
- Validate challenge exists and is active
- Enter global spawn semaphore (max 10 concurrent spawns)
- Acquire distributed lock on `spawn:{owner_id}` key
- Enforce `MAX_INSTANCES_PER_USER` / `MAX_INSTANCES_PER_TEAM`
- Prevent duplicate running instance for same owner+challenge
- Generate `instance_id` = `{challenge_id}-{owner_id[:8]}-{uuid_hex[:8]}`
- Allocate ports via `PortManager` (with up to 3 retries on conflict)
- Create dynamic flag if enabled
- Set up connection hints (Traefik FQDN or direct host:port)
- Copy challenge to temp directory, inject flags, enforce resource limits
- Run `docker compose up` via the `DockerClient` compose bridge (`compose_up()`)
- Register Traefik route via Redis provider
- On failure: best-effort cleanup (compose down, route delete, port release, network removal)
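
The guard portion of the spawn critical section (semaphore, per-owner lock, limit and duplicate checks, ID generation) might look roughly like the sketch below. It is a single-process approximation: the per-owner `asyncio.Lock` stands in for the distributed lock, `SpawnRejected` and the limit value are hypothetical, and the actual compose and routing work is left out.

```python
import asyncio
import uuid
from collections import defaultdict

MAX_CONCURRENT_SPAWNS = 10   # global cap named in the text
MAX_INSTANCES_PER_USER = 2   # hypothetical value for this sketch

instances: dict = {}                      # instance_id -> {"owner": ..., "challenge": ...}
_owner_locks = defaultdict(asyncio.Lock)  # local stand-in for the distributed lock


class SpawnRejected(Exception):
    """Hypothetical error type for a refused spawn."""


def make_instance_id(challenge_id: str, owner_id: str) -> str:
    # {challenge_id}-{owner_id[:8]}-{uuid_hex[:8]}
    return f"{challenge_id}-{owner_id[:8]}-{uuid.uuid4().hex[:8]}"


async def spawn(sem: asyncio.Semaphore, challenge_id: str, owner_id: str) -> str:
    async with sem:                                       # bound concurrent spawns
        async with _owner_locks[f"spawn:{owner_id}"]:     # serialize per owner
            owned = [i for i in instances.values() if i["owner"] == owner_id]
            if len(owned) >= MAX_INSTANCES_PER_USER:
                raise SpawnRejected("instance limit reached")
            if any(i["challenge"] == challenge_id for i in owned):
                raise SpawnRejected("duplicate instance")
            instance_id = make_instance_id(challenge_id, owner_id)
            instances[instance_id] = {"owner": owner_id, "challenge": challenge_id}
            return instance_id


async def demo() -> tuple:
    sem = asyncio.Semaphore(MAX_CONCURRENT_SPAWNS)
    first = await spawn(sem, "web", "owner0001xyz")
    try:
        await spawn(sem, "web", "owner0001xyz")
    except SpawnRejected as exc:
        return first, str(exc)
    return first, "accepted"


first_id, second_outcome = asyncio.run(demo())
```

Holding the per-owner lock across both checks and the insert is what keeps two simultaneous requests from the same owner from both passing the duplicate check.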

Stop Flow:
- Verify ownership (user or team member)
- Optionally auto-capture forensics logs
- `docker compose down`
- Delete Traefik route
- Remove isolated network
- Clean per-spawn images/volumes
- Release ports
- Log event, send Discord notification

Extend Policy:
- Extension step comes from `instance.toml` (`extend_time`, default 1800s)
- Allowed only after half of `timeout` has elapsed
- Total added extension capped at `timeout` (max extra = `timeout`)
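
The extend policy reduces to three checks. The sketch below is one way to express them, assuming that an extension which would push the total past the cap is rejected outright rather than granted in part; the function name and the reason strings are hypothetical.

```python
def extension_allowed(elapsed_s: float, timeout_s: int,
                      extended_total_s: int, extend_time_s: int = 1800) -> tuple:
    """Return (allowed, reason) for one extend request (illustrative sketch)."""
    if extend_time_s <= 0:
        return False, "not_configured"
    # Only allowed once half of the base timeout has elapsed since creation.
    if elapsed_s < timeout_s / 2:
        return False, "too_early"
    # Total added time may never exceed the base timeout.
    if extended_total_s + extend_time_s > timeout_s:
        return False, "cap_reached"
    return True, "ok"
```

With a 3600s timeout and the default 1800s step, this allows exactly two extensions, the first no earlier than 1800s after creation.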
Background Cleanup (every 60s):
- Stops expired RUNNING instances
- Every ~10 minutes: sweeps orphan networks/images
- Every ~60 minutes: cleans old forensics logs
Deterministic, lane-based port allocation with database persistence.
Lane-Based Algorithm:
- Port range divided into N lanes (up to 32)
- Blake2b hash of `{user_id}:{challenge_id}:{internal_port}` maps to a primary lane
- Each lane maintains a cursor for next-available scanning
- Falls back to other lanes in deterministic order if primary lane is full
- Before use, verifies port is actually free via `socket.bind()` (scavenger check)

Persistence:
- Port assignments stored in `user_port_mappings` DB table
- On respawn, same user+challenge gets the same ports (when available)
- Allocation acquires distributed locks: per-user+challenge + global allocator

Retry Logic:
- Spawn retries up to 3 times on port conflicts
- Each retry re-allocates fresh ports
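
A simplified version of the lane-based algorithm is sketched below. It assumes lanes are equal contiguous slices of the range and that fallback lanes are visited in increasing order from the primary lane; the per-lane cursor, the DB persistence, and the locks are omitted, and the port range and helper names are hypothetical.

```python
import hashlib
import socket
from typing import Callable, Set


def primary_lane(user_id: str, challenge_id: str, internal_port: int, lanes: int) -> int:
    # Blake2b over the documented key; the same inputs always map to the same lane.
    key = f"{user_id}:{challenge_id}:{internal_port}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % lanes


def lane_ports(start: int, end: int, lanes: int, lane: int) -> range:
    # Assumption: each lane is an equal contiguous slice of [start, end].
    size = (end - start + 1) // lanes
    first = start + lane * size
    return range(first, first + size)


def port_is_free(port: int) -> bool:
    # Scavenger check: a successful bind means nothing else holds the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind(("0.0.0.0", port))
        except OSError:
            return False
    return True


def allocate(user_id: str, challenge_id: str, internal_port: int, taken: Set[int],
             start: int = 30000, end: int = 30999, lanes: int = 10,
             is_free: Callable[[int], bool] = port_is_free) -> int:
    first = primary_lane(user_id, challenge_id, internal_port, lanes)
    for offset in range(lanes):  # primary lane first, then the others in order
        lane = (first + offset) % lanes
        for port in lane_ports(start, end, lanes, lane):
            if port not in taken and is_free(port):
                taken.add(port)
                return port
    raise RuntimeError("no free port in range")
```

Hashing into a lane spreads owners across the range, so concurrent allocations for different owners tend to scan different slices instead of contending for the same low ports.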
Writes dynamic router/service configuration into Redis KV for Traefik to consume.
Permanent Keys (bootstrapped at startup):
- Default TCP catch-all route (block-all, priority 1)
- Optional CTFd redirect middleware
- Optional dashboard basic auth users
- Additional keys from YAML file or JSON string

Per-Instance Routes:
- HTTP: Router with `Host({fqdn})` rule, TLS, load balancer targeting `{backend_host}:{backend_port}`
- TCP: SNI router with `HostSNI({fqdn})`, TLS passthrough, load balancer address
- Custom (SSH, etc.): Dedicated entrypoint, optional TLS

Cleanup:
- `delete_instance_route()` scans Redis with pattern matching to remove all keys for an instance
- Uses Redis pipeline for atomic batch writes
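
For orientation, the key/value pairs behind a per-instance HTTP route could be assembled as in the sketch below. The key layout follows Traefik's KV provider conventions with the default `traefik` root key; the function name, the choice to name router and service after the instance ID, and the plain-HTTP backend URL are assumptions rather than details of `traefik_redis.py`.

```python
def http_route_keys(instance_id: str, fqdn: str, backend_host: str,
                    backend_port: int, root: str = "traefik") -> dict:
    """Build the KV pairs for one HTTP router and its service (illustrative sketch)."""
    router = f"{root}/http/routers/{instance_id}"
    service = f"{root}/http/services/{instance_id}"
    return {
        f"{router}/rule": f"Host(`{fqdn}`)",   # match on the instance FQDN
        f"{router}/service": instance_id,      # router points at the service below
        f"{router}/tls": "true",               # terminate TLS at Traefik
        f"{service}/loadbalancer/servers/0/url": f"http://{backend_host}:{backend_port}",
    }
```

Since every key for an instance shares the `{root}/http/routers/{instance_id}` or `{root}/http/services/{instance_id}` prefix, a pattern scan on the instance ID is enough to find and delete them at teardown.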
Redis-based distributed lock with graceful fallback to local asyncio.Lock.
- Redis mode: uses `redis.asyncio` with `SET NX` semantics
- Local mode: transparent `asyncio.Lock` per name
- `acquire()` is an async context manager with configurable timeout
- `acquire_multiple()` acquires locks in sorted order to prevent deadlocks
- `is_distributed` property signals whether Redis is active
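
The sorted-order rule is what prevents two callers from deadlocking when they request the same locks in opposite orders. The sketch below shows the idea for the local fallback only; `LocalLockManager` and the lock names are hypothetical, and the Redis path is not modeled.

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager


class LocalLockManager:
    """Local-only sketch of named locks with ordered multi-acquire."""

    def __init__(self) -> None:
        self._locks: dict = {}

    @asynccontextmanager
    async def acquire(self, name: str, timeout: float = 10.0):
        lock = self._locks.setdefault(name, asyncio.Lock())
        await asyncio.wait_for(lock.acquire(), timeout)
        try:
            yield
        finally:
            lock.release()

    @asynccontextmanager
    async def acquire_multiple(self, names, timeout: float = 10.0):
        async with AsyncExitStack() as stack:
            # Sorting gives every caller the same acquisition order.
            for name in sorted(set(names)):
                await stack.enter_async_context(self.acquire(name, timeout))
            yield


async def demo() -> list:
    manager = LocalLockManager()
    done = []

    async def worker(tag: str, names: list) -> None:
        async with manager.acquire_multiple(names, timeout=1.0):
            await asyncio.sleep(0.01)
            done.append(tag)

    # The two workers ask for the same locks in opposite orders.
    await asyncio.gather(
        worker("w1", ["ports:global", "ports:alice"]),
        worker("w2", ["ports:alice", "ports:global"]),
    )
    return done


finished = asyncio.run(demo())
```

Without the `sorted()` call, each worker could take its first lock and then wait on the other's, and both would hit the timeout.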
Complete dynamic flag lifecycle management. All data is stored in the database (FlagMappingModel, SuspiciousSubmissionModel, and WhaleySettings models); the legacy logs/flag_mappings.json file has been removed.
Full technical details (extraction algorithm, injection regex, ownership semantics, detection sequence diagrams) are in DYNAMIC-FLAGS.md.
Flag Format: `PREFIX{base_content_<16 hex chars>}`, where `PREFIX` is the configured `FLAG_PREFIX` (default: `FLAG{...}`)
- `base_content` is the inner text extracted from the first `PREFIX{...}` placeholder found in challenge files
- If no placeholder exists, falls back to fully random: `PREFIX{<32 hex chars>}`

Spawn-Time Flow:
- `_extract_base_flag_content()`: scans for `PREFIX{...}`, extracts the inner text (priority: flag files > config > source files)
- `generate_flag(base_content)` → `FLAG{base_content_<16hex>}`
- `create_flag_for_owner()`: creates flag in CTFd via API, inserts into `flag_mappings` DB table, updates in-memory indexes (`user_flags`, `owner_flags`, `flag_lookup`)
- `_inject_flag_into_files()`: replaces every `PREFIX{...}` occurrence with the dynamic flag (regex `PREFIX{[^}\n]+}`)
- Per-challenge override: If `disable_dynamic_flags = true` is set in `instance.toml`, the spawn skips flag creation entirely. Existing CTFd challenge mappings are pruned on load/reload. The admin Flags panel prevents mapping the challenge while this is set.
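
The extract, generate, and inject steps can be approximated in a few lines. This sketch works on a single string rather than on challenge files, and the helper names are simplified stand-ins for the `FlagManager` methods; see DYNAMIC-FLAGS.md for the actual behavior.

```python
import re
import secrets
from typing import Optional


def placeholder_pattern(prefix: str = "FLAG") -> "re.Pattern":
    # PREFIX{...} where the inner text contains no closing brace or newline.
    return re.compile(re.escape(prefix) + r"\{([^}\n]+)\}")


def extract_base(text: str, prefix: str = "FLAG") -> Optional[str]:
    match = placeholder_pattern(prefix).search(text)
    return match.group(1) if match else None


def generate_flag(base_content: Optional[str], prefix: str = "FLAG") -> str:
    if base_content:
        # token_hex(8) yields 16 hex characters.
        return f"{prefix}{{{base_content}_{secrets.token_hex(8)}}}"
    # No placeholder found: fully random, 32 hex characters.
    return f"{prefix}{{{secrets.token_hex(16)}}}"


def inject_flag(text: str, flag: str, prefix: str = "FLAG") -> str:
    # A callable replacement avoids backslash interpretation in the flag text.
    return placeholder_pattern(prefix).sub(lambda _m: flag, text)
```

Keeping the author's base content in the flag makes a leaked flag recognizable at a glance, while the random suffix is what ties it to one owner.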
Flag Reuse: Same owner+challenge always gets the same flag (looked up from in-memory index before creating a new one).
Suspicious Submission Detection:
- Incremental mode (default): Tracks `last_submission_id` in `whaley_settings`. Only checks CTFd submissions with `id > last_submission_id`, avoiding repeated re-scanning of the same data
- Full scan mode: `POST /admin/api/flags/check-submissions?full_scan=true` re-checks all recent submissions regardless of checkpoint
- Fetches newest 5 pages (up to 250 submissions) from CTFd, cross-references the `provided` flag against `flag_lookup`
- Ownership comparison: `user_id` in user mode, `team_id` in team mode
- Deduplication via SHA-256 unique key: `hash(submitter_identity|owner_identity|flag_hash)` stored in DB with unique index
- New suspicious entries inserted into `suspicious_submissions` table immediately; admin API supports paginated queries
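
The ownership comparison and the deduplication key can be sketched as below. It assumes `flag_hash` is itself a SHA-256 of the flag text and that identities are plain strings; both are guesses about details the text does not spell out, and the function names are hypothetical.

```python
import hashlib
from typing import Mapping


def is_suspicious(submitter: str, provided_flag: str, flag_lookup: Mapping[str, str]) -> bool:
    """True when the submitted flag belongs to a different owner."""
    owner = flag_lookup.get(provided_flag)
    return owner is not None and owner != submitter


def dedup_key(submitter: str, owner: str, provided_flag: str) -> str:
    # Assumption: flag_hash is a SHA-256 of the flag text.
    flag_hash = hashlib.sha256(provided_flag.encode()).hexdigest()
    raw = f"{submitter}|{owner}|{flag_hash}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Because the key depends only on who submitted, who owns, and which flag, rescanning the same CTFd submissions produces the same key and the unique index rejects the repeat.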
Startup: initialize() loads all indexes from DB (flag mappings, challenge mapping, suspicious unique keys, last submission checkpoint) into memory for fast lookups. Lazy — called on first await get_flag_manager().
Captures Docker container logs for debugging and post-mortem analysis.
Capture Modes:
- Auto Capture: on instance termination (configurable via `FORENSICS_AUTO_CAPTURE`)
- Live Capture: on-demand from running instances

Capture Process:
- Get container IDs via `docker compose ps -q`
- For each container: `docker inspect` for name, `docker logs --tail --timestamps` for content
- Enforce size limit (per capture) and tail line limit (per container)
- Write header with metadata followed by per-container sections
- Plain text or gzip-compressed output

Index: JSON-based index.json in forensics log directory. Thread-safe via Lock.
Retention: Auto-cleanup of logs older than FORENSICS_RETENTION_HOURS (default 168h / 7 days).
Concurrency: Semaphore-limited (max 5 concurrent captures).
Real-time Docker container resource metrics.
Per-Container: CPU%, memory usage/limit/%, network RX/TX, block read/write, PIDs.
Per-Instance: Aggregated totals across all containers in a compose project.
System: Total/running containers, aggregate CPU/memory, host CPU cores, host memory (from /proc/meminfo on Linux).
CPU Calculation: Standard Docker formula — (cpu_delta / system_cpu_delta) * online_cpus * 100.
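
As a worked instance of the formula, the sketch below computes CPU% from one Docker stats snapshot. The field names (`cpu_stats`, `precpu_stats`, `total_usage`, `system_cpu_usage`, `online_cpus`) are those of the Docker Engine stats payload; the function name and the zero-on-missing-delta behavior are choices made for this example.

```python
def cpu_percent(stats: dict) -> float:
    """CPU usage in percent from a single Docker stats snapshot."""
    cpu, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu.get("system_cpu_usage", 0) - pre.get("system_cpu_usage", 0)
    # Older daemons omit online_cpus; fall back to the per-CPU list length.
    online_cpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage") or []) or 1
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0  # first sample or stopped container: no meaningful delta
    return cpu_delta / system_delta * online_cpus * 100.0


sample = {
    "cpu_stats": {"cpu_usage": {"total_usage": 1500}, "system_cpu_usage": 12000, "online_cpus": 2},
    "precpu_stats": {"cpu_usage": {"total_usage": 1000}, "system_cpu_usage": 10000},
}
usage = cpu_percent(sample)  # (500 / 2000) * 2 * 100
```

With this scaling a container saturating both cores would report 200%, which matches how `docker stats` presents multi-core usage.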
Structured event logging to SQLite/PostgreSQL with memory cache.
Event Types: INSTANCE_SPAWN, INSTANCE_SPAWN_FAILED, INSTANCE_STOP, INSTANCE_EXTEND, INSTANCE_EXPIRED, USER_LOGIN, USER_LOGIN_FAILED, AUTH_FAILURE, FLAG_CREATED, FLAG_DELETED, SUSPICIOUS_SUBMISSION, SYSTEM_START, SYSTEM_STOP.
Architecture:
- Async `log()` writes to DB via background task, also stores in memory cache (max 1000 entries)
- `get_entries()` queries DB with pagination and filtering
- `get_stats()` returns totals, counts by type, unique users, 24h activity

Sends rich Discord embeds for lifecycle events:
- Spawn: green embed with instance details, routing info, connection hints
- Spawn Failure: red embed with challenge, requester, failure reason
- Extend: yellow embed with extension amount, new expiry
- Stop: orange embed with stop reason (user/admin/expired)
Disabled when DISCORD_WEBHOOK_URL is empty.
High-frequency operational metrics tracked in memory for Prometheus export:
- Spawn latency histogram: 9 buckets (0.25s to 60s), inflight count
- Operation counters: tuples of (operation, outcome, reason_label, challenge_id)
- Failure reason classifiers: `_classify_spawn_failure_reason()` and `_classify_operation_failure_reason()` bucket failures into low-cardinality labels
Client Request
→ User rate limit check (sliding window, 10/min)
→ Challenge active check
→ get_current_user() dependency
→ DockerManager.spawn_instance()
→ Determine owner_id (team_id or user_id)
→ Spawn semaphore acquire (max 10 concurrent)
→ Distributed lock acquire (spawn:{owner_id})
→ Instance limit check
→ Duplicate running check
→ _do_spawn_instance()
→ Generate instance_id
→ PortManager.allocate_ports_for_user() (with 3 retries)
→ FlagManager.create_flag_for_owner() (if enabled)
→ Build connection hints
→ Compose labels injection (whaley.managed, whaley.owner, whaley.created_at)
→ Resource limits enforcement (memory, CPU, PIDs)
→ Flag injection into challenge files
→ Isolated network creation (if enabled)
→ docker compose up -d --build
→ TraefikRedisProvider.register_instance_route()
→ Store instance in memory
→ Release lock
→ Release semaphore
→ Log event, Discord notification, metrics recording
→ Return PublicSpawnResponse
Client Request
→ User rate limit check
→ get_current_user() dependency
→ Find instance in memory
→ Ownership check (user or team member)
→ Forensics auto-capture (if enabled)
→ docker compose down
→ TraefikRedisProvider.delete_instance_route()
→ Remove isolated network (if created)
→ DockerClient.remove_compose_project() (clean per-spawn images/volumes)
→ PortManager.release_instance_ports()
→ Remove from memory
→ Log event, Discord notification
Client Request
→ User rate limit check
→ get_current_user() dependency
→ DockerManager.extend_instance()
→ Find instance, ownership check
→ Validate: RUNNING status
→ Validate: extend_time configured
→ Validate: half-timeout elapsed since creation
→ Validate: total extension ≤ timeout cap
→ Update expires_at in memory
→ Log event, Discord notification
Whaley has a multi-layered cleanup strategy to prevent resource leaks (orphan containers, networks, images, volumes). Cleanup happens at four levels:
When an instance is stopped (user action, admin force-stop, or expiry), the stop_instance() flow tears down everything tied to that instance:
- Forensics auto-capture: if enabled, container logs are dumped before teardown
- `docker compose down`: stops and removes containers for the project
- Traefik route deletion: pattern-scans Redis for all `http/routers/{id}*`, `http/services/{id}*`, `tcp/routers/{id}*`, `tcp/services/{id}*` keys and deletes them
- Isolated network removal: removes the per-instance bridge network (if network isolation was enabled)
- `remove_compose_project()`: force-removes any remaining resources by label:
  - Containers matching `com.docker.compose.project={project_name}`
  - Networks matching the same project label (with force-disconnect fallback)
  - Volumes matching the same project label (optional, default true)
  - Per-spawn images: removes images whose tag starts with `{project_name}-` (e.g., `web-challenge-abc123-def456-web:latest`). Base/pulled images like `nginx:alpine` or `python:3.11` are never matched since they don't start with a project-name prefix.
- Port release: returns allocated ports to the available pool
- In-memory state removal: instance deleted from `docker_manager.instances` dict

DockerManager.start_cleanup_task() runs a continuous asyncio loop:
Every 60 seconds:
→ Scan in-memory instances for expired RUNNING instances
→ Call stop_instance() for each (triggers Level 1 cleanup)
→ Log INSTANCE_EXPIRED event
→ Send Discord notification
Every ~10 minutes (10 iterations):
→ Orphan sweep via DockerClient.cleanup_whaley_resources()
→ Passes active project names as a safety set
→ 5-minute safety window (never removes resources younger than 300s)
Every ~60 minutes (60 iterations):
→ Forensics retention cleanup (delete logs older than FORENSICS_RETENTION_HOURS)
→ Reset iteration counter
Catches resources that survived a failed teardown (crash, partial cleanup, labels stripped):
Reserved Networks (never touched):
bridge, host, none, ctf-instances, mctf-monitoring_default, whaley-redis, whaley_default
Orphan Network Detection:
- Scan all Docker networks
- Skip reserved names and networks with attached containers
- For each network, attempt to derive a project name:
  - `whaley-{project}` → named isolation network → extract `{project}`
  - `{project}_default` → compose-managed default network → extract `{project}`
- If the project is NOT in the active set → candidate for removal
- Safety window: skip networks younger than `older_than_seconds` (5 min during normal operation, 0 at startup)

Orphan Per-Spawn Image Detection:
- Scan all Docker images
- Skip base images (contain `/` in the name part, e.g., `docker.io/library/nginx`)
- Match tags against the Whaley project pattern: `{challenge-id}-{num}-{6+hex}-{service}:{tag}`
- Extract the project name from the match
- If the project is NOT in the active set → candidate for removal
- Safety window: same `older_than_seconds` check via image `Created` timestamp
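
The network half of the orphan sweep comes down to a name-to-project mapping plus three guards. The sketch below models that decision on plain values instead of Docker objects; the function names are hypothetical, and the reserved list is copied from the one above.

```python
from typing import Optional, Set

RESERVED_NETWORKS = frozenset({
    "bridge", "host", "none", "ctf-instances",
    "mctf-monitoring_default", "whaley-redis", "whaley_default",
})


def project_from_network(name: str) -> Optional[str]:
    """Derive the compose project a network belongs to, if any."""
    if name in RESERVED_NETWORKS:
        return None  # checked first: some reserved names also match the patterns
    if name.startswith("whaley-"):
        return name[len("whaley-"):] or None
    if name.endswith("_default"):
        return name[: -len("_default")] or None
    return None


def is_orphan_network(name: str, attached_containers: int, age_seconds: float,
                      active_projects: Set[str], older_than_seconds: float = 300) -> bool:
    project = project_from_network(name)
    if project is None or attached_containers > 0:
        return False
    if project in active_projects:
        return False
    # Safety window: leave anything younger than the threshold alone.
    return age_seconds >= older_than_seconds
```

Passing `older_than_seconds=0` with an empty active set reproduces the startup sweep, where every derivable, unattached network is a removal candidate.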
On process restart, cleanup_stale_instances_on_startup() runs before any new instances are spawned:
- Discover stale projects via three sources:
  - `list_compose_projects()`: finds all compose projects by `com.docker.compose.project` container label
  - `list_whaley_networks()`: finds networks with `whaley.managed=true` label, extracts project names from `whaley-{project}` naming pattern
  - Static name heuristic: `_is_managed_instance_project_name()` pattern match
- Identify managed projects: a project is considered Whaley-managed if it has:
  - `whaley.managed=true` label, OR
  - `whaley.instance_id` label matching project name, OR
  - A matching whaley-prefixed network, OR
  - A name matching the instance ID pattern (`{challenge}-{owner}-{hex}`)
- Tear down each stale project via `remove_compose_project()`: removes containers, networks, volumes, per-spawn images
- Delete Traefik routes for each project name
- Release ports for each project name
- Final orphan sweep: calls `cleanup_whaley_resources(active_project_names=set(), older_than_seconds=0)` with empty active set and zero safety window (nothing is running yet, so no risk of race conditions)

Instance Stop (user/admin/expiry)
│
├─ docker compose down ← containers gone
├─ Traefik route delete ← Redis keys purged
├─ Isolated network remove ← bridge net gone
├─ remove_compose_project() ← residual containers, networks, volumes, per-spawn images
└─ Port release ← ports back in pool
Background Loop (60s tick)
│
├─ Expired instances → stop (triggers above)
├─ Every 10 min → orphan sweep
│ Networks: whaley-* and *_default without active project → remove
│ Images: {project}-{service}:tag without active project → remove
│ 5-min safety window protects in-flight spawns
└─ Every 60 min → forensics retention cleanup
Startup Stale Cleanup (before any spawn)
│
├─ Discover via compose labels + network labels + name heuristic
├─ Tear down each stale project
├─ Clear Traefik routes + release ports
└─ Final orphan sweep (zero safety window, no race risk)
| Data | Location |
|---|---|
| Active instances | docker_manager.instances dict |
| Loaded challenge configs | docker_manager.challenges dict |
| Allocated ports | port_manager.allocated_ports / instance_ports |
| Rate limit tracking | _admin_rate_limit, _user_rate_limit dicts in main.py |
| Runtime metrics | _spawn_latency_*, _runtime_operation_counters |
| Flag index | FlagManager.flag_lookup, user_flags, owner_flags |
| Auth mode cache | _team_mode_enabled, _ctfd_mode_cache |
| Table | Purpose |
|---|---|
| `user_port_mappings` | Persistent port allocation per user+challenge |
| `event_logs` | Full event audit trail |
| `challenge_settings` | Active/inactive toggles, per-challenge resource overrides |
| `whaley_settings` | Global runtime setting overrides (`challenge_mapping`, `last_submission_id`) |
| `instance_states` | Schema exists (for future instance recovery) |
| `flag_mappings` | Dynamic flag assignments per owner+challenge |
| `suspicious_submissions` | Detected flag-sharing incidents |

| Path | Content |
|---|---|
| `logs/forensics/index.json` | Forensics log metadata index |
| `logs/forensics/*.log[.gz]` | Captured container logs |
| `logs/events.jsonl` | JSON-lines event log output |

Note: All dynamic flag data (mappings, suspicious submissions, challenge mapping, last submission checkpoint) is stored exclusively in the database (`flag_mappings`, `suspicious_submissions`, and `whaley_settings` tables). The legacy `logs/flag_mappings.json` file has been removed.

- Docker daemon: containers, networks, images, compose projects (via SDK + CLI)
- CTFd: users, teams, dynamic flags, submissions (via REST API)
- Redis: lock keys, Traefik KV router/service keys
The frontend is a React 18 + TypeScript + Vite application split into two SPAs:
frontend/src/
├── main.tsx → UserApp (mounts on #root)
└── admin.tsx → AdminApp (mounts on #admin-root)
Build output: `frontend/` builds into `app/static/` via Vite (configured in `vite.config.js`):
- `index.html` → user SPA entry
- `admin.html` → admin SPA entry
- `assets/` → hashed JS/CSS/font bundles

Docker build: Multi-stage — first stage runs npm ci && npm run build in Node 20 Alpine, second stage copies built assets into the Python image.
- CTFd mode: Shows token login panel, stores token in `sessionStorage`
- No-auth mode: Auto-authenticates
- Challenge Deck: Lists active challenges with category badges, deploy buttons
- Active Instances: Lifecycle cards with endpoint copy, connection hints, countdown timers, extend/stop buttons
- Auto-refresh: Instances and health polled every 10s, clock ticks every 1s
Six tabs, URL hash-based navigation:
| Tab | Component | Features |
|---|---|---|
| Dashboard | `DashboardPage` | Stats cards, active instances list, force-stop |
| Logs | `LogsPage` | Event log viewer (filtered/paginated), port mappings, forensics (stats, toggle, live capture, log viewer) |
| Flags | `FlagsPage` | Flag mappings, suspicious submissions, CTFd sync wizard |
| Challenges | `ChallengesPage` | Upload zip, file browser/editor, active toggle, reload config |
| Monitoring | `MonitoringPage` | System metrics, per-instance container CPU/RAM, high-usage filter |
| Settings | `SettingsPage` | 30+ editable settings with type-aware inputs, sectioned layout, change tracking |

All UI components live in `frontend/src/shared/components/ui/`:
- `Badge`: color-coded status tags (neutral/success/warning/danger/info)
- `Button`: variants (primary/secondary/danger/ghost), sizes (sm/md)
- `Card`: Dark gradient container with border
- `EmptyState`: Placeholder for empty/no-data states
- `Input`, `Select`, `Textarea`: Form inputs with consistent dark styling
- `Loader`: Spinning border animation
- `Modal`: Overlay dialog with backdrop blur
- `Tabs`: Horizontal tab navigation

- Colors: graphite #0D0D0D, anthracite #262626, steel #737373, deepviolet #200F38, mist #F2F2F2
- Fonts: Chakra Petch (display), Public Sans (body), IBM Plex Mono (monospace)
- Styling: Tailwind CSS with custom theme, dark background with purple accent
- Background: Radial gradient with CSS grid overlay effect
# Stage 1: Node 20 Alpine — builds frontend
FROM node:20-alpine AS frontend-builder
COPY frontend/ ./
RUN npm ci --include=dev
RUN npm install -g vite
RUN npm run build
# Stage 2: Python 3.11 Slim — runs backend
FROM python:3.11-slim
# Install Docker CLI + compose plugin
# Copy Python deps + app code
# Overlay compiled SPA assets from Stage 1
COPY --from=frontend-builder /build/app/static/index.html ./app/static/
COPY --from=frontend-builder /build/app/static/admin.html ./app/static/
COPY --from=frontend-builder /build/app/static/assets ./app/static/
CMD ["python", "-m", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Services:
redis — Redis 7 Alpine, append-only persistence, healthcheck
postgres — PostgreSQL 16 Alpine, persistent volume, healthcheck
instancer — Whaley app, depends on healthy redis + postgres
Networks:
ctf-instances — bridge network for inter-service communication
Volumes:
redis_data — Redis AOF persistence
postgres_data — PostgreSQL data persistence
/var/run/docker.sock — Mounted for container management
./challenges — Challenge definitions (rw)
./logs — Event logs, forensics
./data — Forensics, event logs, misc data
| Control | Mechanism |
|---|---|
| Admin authentication | X-Admin-Key header, per-IP rate limiting (default 150/min) |
| User rate limiting | Sliding window, 10 requests/min for spawn/stop/extend |
| Metrics protection | Bearer token via METRICS_SECRET, constant-time comparison |
| Path traversal prevention | Symlink resolution + containment check for all file operations |
| Zip upload protection | Max size (50MB), max entries (1000), max extracted (200MB), zip-slip validation |
| Security headers | CSP, X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Referrer-Policy |
| Network isolation | Per-instance bridge network with optional ICC disabled |
| Resource caps | Memory, CPU, PID limits enforced on all containers |
| Fork bomb protection | CONTAINER_PIDS_LIMIT (default 256) |
| Ownership enforcement | Instance access checked against user identity + team membership |
| Trusted proxy support | CIDR-aware IP extraction for admin endpoints |
| String escaping | Prometheus label values escaped per exposition format spec |
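
The path traversal control in the table (symlink resolution followed by a containment check) can be illustrated with `pathlib`. This is a sketch of the general technique, not the code in `main.py`; `safe_join` is a hypothetical name.

```python
from pathlib import Path


def safe_join(base: Path, user_path: str) -> Path:
    """Resolve user_path under base and refuse anything that escapes it."""
    root = base.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Resolving before comparing is the important part: a check on the raw string would miss both `..` segments and symlinks that point outside the challenge directory. An absolute `user_path` replaces the base when joined, so it is rejected by the same containment test.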

- CORS: Allows all origins (`allow_credentials=false`)
- Admin key storage: Browser `localStorage` (admin UI)
- No-auth IP trust: Uses forwarded headers directly (not trusted-proxy filtered)
- Monitoring host checks: Linux-specific (`/proc/meminfo`, `nproc`)

Exposed at GET /metrics (protected by METRICS_SECRET Bearer token).
| Category | Metrics |
|---|---|
| Instances | Total by status, owner, team, challenge; owner saturation ratios |
| Lifecycle | Per-instance expiry timestamps, age seconds, connection info |
| Alerts | Instances expiring within 5/10 min, stale STARTING instances (>2 min) |
| Ports | Allocated count, available count, utilization percent |
| Challenges | Loaded count, active count |
| Flags | Total dynamic flags assigned, suspicious submission count |
| Forensics | Auto-capture on/off, total log count, auto/live breakdown, total size |
| Runtime Ops | Operation outcome counters (spawn/stop/extend with success/failure + reason) |
| Spawn Latency | Histogram (9 buckets: 0.25s–60s), inflight count |
| Event Logs | Total entries, counts by event type, unique users, 24h activity |

- `outcome`: `success|failure`
- `operation`: `spawn|stop|extend`
- `challenge_id`: normalized, `unknown` if not applicable
- `reason`: normalized failure reason (underscore-separated, max 64 chars)
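
To show how these label rules fit together, the sketch below normalizes a free-text failure reason and renders one counter sample in the Prometheus text exposition format, which requires backslash, double quote, and newline to be escaped in label values. The metric name, the helper names, and the `unknown` fallback for an empty reason are assumptions made for this example.

```python
import re


def escape_label_value(value: str) -> str:
    # Exposition format: escape backslash first, then double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def normalize_reason(reason: str, max_len: int = 64) -> str:
    # Lowercase, collapse anything non-alphanumeric into single underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", reason.lower()).strip("_")
    return slug[:max_len] or "unknown"


def counter_line(name: str, labels: dict, value: int) -> str:
    body = ",".join(f'{key}="{escape_label_value(val)}"' for key, val in labels.items())
    return f"{name}{{{body}}} {value}"
```

Normalizing before export keeps the number of distinct `reason` values small, which matters because each distinct label combination becomes its own time series.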
Checks X-Admin-Key header + per-IP rate limit.
Dashboard & Logs:
GET /{admin_path} — Admin SPA
GET /admin/api/stats — Event logger stats + instances + ports
GET /admin/api/logs — Paginated event logs (filtered)
GET /admin/api/instances — All active instances
DELETE /admin/api/instances/{id} — Force-stop (admin override)
Port Management:
GET /admin/api/user-ports — All user port mappings
GET /admin/api/port-stats — Port usage statistics
DELETE /admin/api/user-ports — Clear all mappings
DELETE /admin/api/user-ports/{user_id} — Delete user's ports
Dynamic Flags:
GET /admin/api/flags — Full flags state
POST /admin/api/flags/check-submissions — Manual submission check
GET /admin/api/flags/suspicious — List suspicious entries
DELETE /admin/api/flags/suspicious — Clear suspicious list
GET /admin/api/flags/mappings — All flag mappings
DELETE /admin/api/flags/user/{user_id} — Delete user's flags
DELETE /admin/api/flags/{flag_id} — Delete single flag
POST /admin/api/flags/sync-challenge — Map local → CTFd challenge
DELETE /admin/api/flags/mapping/{id} — Remove mapping
GET /admin/api/ctfd/challenges — Fetch CTFd challenges (sync wizard)
Forensics:
GET /admin/api/forensics/stats — Forensics statistics
POST /admin/api/forensics/toggle — Enable/disable auto-capture
GET /admin/api/forensics/logs — List logs (filtered)
GET /admin/api/forensics/logs/{id} — Get log content
DELETE /admin/api/forensics/logs/{id} — Delete specific log
DELETE /admin/api/forensics/logs — Clear all logs
POST /admin/api/forensics/live-capture/{id} — On-demand capture
POST /admin/api/forensics/cleanup — Manual retention cleanup
Monitoring:
GET /admin/api/monitoring/system — System-level metrics
GET /admin/api/monitoring/instances — Per-instance container metrics
Challenge Management:
GET /admin/api/challenges/list — List all challenges
POST /admin/api/challenges/upload — Zip upload
DELETE /admin/api/challenges/{id} — Delete challenge directory
GET /admin/api/challenges/{id}/files — Browse file tree
GET /admin/api/challenges/{id}/files/{path} — Read file
PUT /admin/api/challenges/{id}/files/{path} — Write file
POST /admin/api/challenges/{id}/files/{path} — Create file
DELETE /admin/api/challenges/{id}/files/{path} — Delete file/directory
POST /admin/api/challenges/{id}/reload — Reload config
POST /admin/api/challenges/{id}/toggle — Toggle active/inactive
GET /admin/api/challenges/settings — All challenge settings
PUT /admin/api/challenges/{id}/resources — Set resource overrides
Runtime Settings:
GET /admin/api/settings — Current values with override status
PUT /admin/api/settings — Update settings (validated, persisted, applied)
DELETE /admin/api/settings/{key} — Reset to default
POST /admin/api/settings/load — Reload all from DB
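The endpoints listed above can be exercised from a script by sending the `X-Admin-Key` header. The helper below is a minimal sketch using only the standard library; the base URL, the key value, and the assumption that responses are JSON are illustrative and not confirmed by this document.

```python
import json
import urllib.request


def admin_request(base_url: str, admin_key: str, method: str, path: str, body=None):
    """Build a request for an admin endpoint with the X-Admin-Key header set."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(base_url.rstrip("/") + path, data=data, method=method)
    req.add_header("X-Admin-Key", admin_key)
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def send(req: urllib.request.Request):
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode())


# Hypothetical host and key; nothing is sent until send() is called.
req = admin_request("http://localhost:8000", "change-me", "GET", "/admin/api/port-stats")
print(req.get_method(), req.full_url)
```

Because admin routes are rate-limited per IP, a script that polls several endpoints should space out its calls.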
- Check challenge has valid `instance.toml` and compose file
- Verify port range capacity (`/admin/api/port-stats`)
- Check Docker daemon availability and compose output in error message
- Verify resource caps aren't overly restrictive
- Check Redis connectivity (lock acquisition failures)
- Check owner already has same challenge running (duplicate block)
- Verify `MAX_INSTANCES_PER_USER` / `MAX_INSTANCES_PER_TEAM` limits
- Check challenge active flag (admin toggle)
- Verify user isn't rate-limited (10 req/min window)
- In team mode: verify user belongs to a CTFd team
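When checking whether a user is rate-limited, it helps to have a mental model of a 10 requests per minute window. The sketch below implements a sliding window per key; whether Whaley uses a sliding or a fixed window is not stated here, so treat the algorithm and the class name as assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window_seconds` for each key."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
results = [limiter.allow("user-1") for _ in range(11)]
print(results.count(True), results.count(False))
```

Under this model a blocked user regains capacity gradually as old requests age out, rather than all at once at a minute boundary.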
- Check `TEAM_MODE` setting (enabled/disabled/auto)
- Verify CTFd `user_mode` detection and API key permissions
- Check whether user has `team_id` from CTFd token validation
- Verify team member resolution via admin flags/logs
- `DYNAMIC_FLAGS_ENABLED=true`
- `AUTH_MODE=ctfd` (required for dynamic flags)
- `CTFD_URL` and `CTFD_API_KEY` valid
- Challenge mapping exists in admin flags panel
- `FLAG_PREFIX` matches placeholder convention in challenge files
- Auto-capture enabled (check toggle in admin forensics tab)
- Capture size/tail limits not too small
- Disk permissions for `FORENSICS_LOG_DIR`
- Instance must be running for live capture
- Check port range is large enough for expected concurrent instances
- Verify no external processes using ports in the allocation range
- Check `user_port_mappings` table for stale entries
- Port allocation retries up to 3 times automatically
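To verify that no external process holds ports in the allocation range, a quick bind probe is often enough. This is a diagnostic sketch, not part of Whaley; the function name and the example range are hypothetical, and the probe also reports ports held by Whaley's own running instances.

```python
import socket


def ports_in_use(start: int, end: int, host: str = "0.0.0.0") -> list:
    """Return the ports in [start, end] that cannot be bound on this host."""
    busy = []
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                # Typically EADDRINUSE; permission errors also land here.
                busy.append(port)
    return busy


# Hypothetical range; substitute the configured allocation range.
print(ports_in_use(20000, 20010))
```

Compare the result against `/admin/api/user-ports`: a busy port with no corresponding mapping points to an external process or a stale container.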
- New admin APIs: Add routes in `main.py` with `verify_admin_key` dependency
- New challenge metadata: Extend `ChallengeConfig` in `docker_manager.py`
- New event types: Add to `EventType` enum in `logger.py`, log in appropriate handlers
- New monitoring metrics: Extend `MonitoringManager` data model
- New settings: Add to `EDITABLE_SETTINGS` dict and `_load_settings_from_db()` type map in `main.py`
- New frontend pages: Add component in `frontend/src/admin/pages/`, wire into `AdminApp.tsx` tabs
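Adding a setting involves two places because a value stored as text has to be coerced back to its type on load. The sketch below shows one possible shape for such a registry. The dict layout, the `min` constraint, and the `coerce_setting` helper are hypothetical; the real `EDITABLE_SETTINGS` and type map in `main.py` may look quite different.

```python
# Hypothetical registry shape, for illustration only.
EDITABLE_SETTINGS = {
    "MAX_INSTANCES_PER_USER": {"type": int, "min": 1},
    "DYNAMIC_FLAGS_ENABLED": {"type": bool},
}


def coerce_setting(key: str, raw: str):
    """Convert a stored string back to its typed value, validating as it goes."""
    spec = EDITABLE_SETTINGS.get(key)
    if spec is None:
        raise KeyError(f"{key} is not an editable setting")
    kind = spec["type"]
    if kind is bool:
        lowered = raw.strip().lower()
        if lowered not in {"true", "false", "1", "0"}:
            raise ValueError(f"{key}: expected a boolean, got {raw!r}")
        value = lowered in {"true", "1"}
    else:
        value = kind(raw)  # raises ValueError on malformed input
    if "min" in spec and value < spec["min"]:
        raise ValueError(f"{key}: must be >= {spec['min']}")
    return value


print(coerce_setting("MAX_INSTANCES_PER_USER", "3"))
print(coerce_setting("DYNAMIC_FLAGS_ENABLED", "true"))
```

Booleans get explicit parsing because `bool("false")` is `True` in Python, a common source of settings bugs when values round-trip through text.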
- Spawn/stop critical sections in `docker_manager.py`: lock ordering and cleanup guarantees
- Lock semantics in `distributed_lock.py`: deadlock prevention, timeout behavior
- Port allocation persistence in `port_manager.py`: conflict safety guarantees
- Dynamic flag ownership logic in `flag_manager.py`: user vs team comparison paths
- Traefik route registration in `traefik_redis.py`: key naming and cleanup completeness
- Instance recovery: Active instance state is held in memory; on restart, the in-memory map is lost. The `instance_states` ORM table exists but is not wired into active recovery flows.
- Auth IP trust asymmetry: Admin IP extraction is trusted-proxy-aware; no-auth user identity uses forwarded headers directly.
- Platform assumptions: Monitoring host metrics rely on Linux-specific interfaces (`/proc/meminfo`, `nproc`).
- Compose orchestration: While many Docker operations use the SDK, compose lifecycle still shells out to `docker compose`, requiring Docker CLI availability.
- Flag manager logging: Some `FlagManager` logging calls are invoked synchronously (missing `await`), which may result in log entries being silently dropped.
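The flag manager logging gap is easy to reproduce in isolation: calling an `async` function without `await` only creates a coroutine object, and its body never runs. The names below are hypothetical stand-ins, not the actual `FlagManager` call sites.

```python
import asyncio

recorded = []


async def log_event(message: str) -> None:
    """Stand-in for an async logger method."""
    await asyncio.sleep(0)
    recorded.append(message)


async def handler_missing_await() -> None:
    # Bug: this creates a coroutine object that is never scheduled.
    # Python later emits "RuntimeWarning: coroutine ... was never awaited".
    log_event("dropped")


async def handler_with_await() -> None:
    await log_event("kept")


asyncio.run(handler_missing_await())
asyncio.run(handler_with_await())
print(recorded)  # ['kept']
```

Running the service with `PYTHONWARNINGS=error::RuntimeWarning` during testing, or enabling asyncio debug mode, is one way to surface these call sites.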