Commit cbe9329

Update validator operations runbook
1 parent 57801df commit cbe9329

1 file changed

Lines changed: 58 additions & 93 deletions

docs/operations/validator.md

@@ -1,125 +1,90 @@
11
# Validator Operations
22

3-
![Platform Banner](../../assets/banner.jpg)
3+
Run these commands from the repository root. This runbook covers only routine
4+
validator installation and operation on Kubernetes.
45

5-
## Scope
6+
## Install Or Update
67

7-
Run these commands from the repository root. They are the local validation surfaces used for the corrected Python `platform-network` work. If Docker, Helm, kubeconform, kind, kubectl, or a Python tool is missing, record the blocker in evidence and don't mark that surface as tested.
8-
9-
## Python Validation
10-
11-
```bash
12-
uv sync --extra dev --extra master
13-
uv run ruff check .
14-
uv run ruff format --check .
15-
uv run mypy src tests
16-
uv run pytest --cov=platform_network --cov-report=term-missing --cov-fail-under=80
17-
```
18-
19-
Known baseline notes belong in evidence. Current Task 12 evidence records these Python quality gates as passing without changing the documented gates: Ruff check, Ruff format check, mypy, and full coverage. Historical Task 11 evidence recorded Ruff format and mypy blockers, but those blockers are resolved in the current validation state.
20-
21-
## Docker Compose Validation
22-
23-
Validate the base stack, the development stack, and the local or staging Watchtower overlay:
8+
Automatic Kubernetes install:
249

2510
```bash
26-
docker compose -f docker/compose.yml config --quiet
27-
docker compose -f docker/compose.dev.yml config --quiet
28-
docker compose -f docker/compose.yml -f docker/compose.watchtower.yml config --quiet
11+
./scripts/install-validator.sh
2912
```
3013

31-
Clean up local Compose validation resources with:
14+
Dry-run the generated objects without changing the cluster:
3215

3316
```bash
34-
docker compose -f docker/compose.yml -f docker/compose.watchtower.yml down --remove-orphans
17+
./scripts/install-validator.sh --dry-run --skip-hotkey-import
3518
```
3619

37-
The Compose files intentionally expose local development defaults. Production safety comes from the production policy checks, production Helm values, external PostgreSQL, and digest-pinned images.
38-
39-
## Local and Staging Watchtower Overlay
40-
41-
`docker/compose.watchtower.yml` is an explicit local and staging overlay for updating the Docker Compose control plane. It uses the maintained `nickfedor/watchtower:1.17.1` image for Docker 29 API compatibility and runs Watchtower with `--label-enable`, so containers are ignored unless they carry `com.centurylinklabs.watchtower.enable=true`.
42-
43-
The only Compose services opted in are:
44-
45-
- `master-admin`
46-
- `master-proxy`
47-
- `platform-docker-broker`
48-
- `validator`
49-
- `gpu-agent`
50-
51-
Challenge containers, broker-created job containers, database services, and Kubernetes manifests must not receive Watchtower labels. Production Kubernetes uses Helm and Kubernetes rollout controls instead of Watchtower.
52-
53-
Before using the overlay in local or staging environments, render it with the Compose command above and confirm the watched services are healthy after each update. Keep a rollback image tag available. Watchtower can replace containers, but it doesn't prove application health or perform production rollback orchestration.
54-
55-
## Docker Socket Risk
56-
57-
The local Compose control plane mounts `/var/run/docker.sock` for services that need to create local Docker containers or inspect/update local Compose services. The host Docker socket is root-equivalent host access. Treat these socket mounts as local control-plane risk, not production isolation.
58-
59-
The Compose labels `platform.security.docker-socket` and `platform.security.docker-socket-risk` must stay on socket-owning services. Broker-created challenge containers and Kubernetes jobs must not receive the host Docker socket.
60-
61-
## Helm and Kubernetes Validation
62-
63-
Validate the chart with default values and the production policy fixture:
20+
Remove only installer-managed validator objects:
6421

6522
```bash
66-
helm lint deploy/helm/platform
67-
helm template platform deploy/helm/platform > /tmp/platform-default.yaml
68-
kubeconform -strict -summary /tmp/platform-default.yaml
69-
helm template platform deploy/helm/platform -f deploy/helm/platform/values.production.example.yaml > /tmp/platform-production.yaml
70-
kubeconform -strict -summary /tmp/platform-production.yaml
23+
./scripts/install-validator.sh --cleanup
7124
```
7225

73-
Run kind-backed server dry-run validation when Docker and kind are available:
26+
## Runtime Commands
7427

7528
```bash
76-
kind delete cluster --name platform-validation
77-
kind create cluster --name platform-validation
78-
kind get kubeconfig --name platform-validation > /tmp/platform-validation-kubeconfig
79-
KUBECONFIG=/tmp/platform-validation-kubeconfig kubectl apply --dry-run=server -f /tmp/platform-default.yaml
80-
KUBECONFIG=/tmp/platform-validation-kubeconfig kubectl apply --dry-run=server -f /tmp/platform-production.yaml
81-
kind delete cluster --name platform-validation
29+
kubectl -n platform-validator get deployment platform-validator
30+
kubectl -n platform-validator get pods
31+
kubectl -n platform-validator logs -f deployment/platform-validator
32+
kubectl -n platform-validator describe deployment platform-validator
8233
```
8334

84-
Remove `/tmp/platform-validation-kubeconfig` after use if it contains live cluster access. Never commit kubeconfigs or paste them into evidence.
85-
86-
## Database, Image, and TLS Policy
87-
88-
Use different policy expectations for local validation and production operations:
35+
## Secret Handling
8936

90-
- Local development and tests may use the default SQLite database URL and local mutable images. These defaults are intended for fast iteration only.
91-
- Production and Kubernetes deployments must provide an external PostgreSQL database secret or URL before the control plane starts. Do not use SQLite for production or Kubernetes master state.
92-
- Production images must be pinned with a semver tag and `sha256` digest. Do not deploy `latest`, untagged images, or images without a digest in production.
93-
- Production remote GPU servers and Kubernetes targets must keep `verify_tls=true`. Disable TLS verification only for local test endpoints that are not part of production.
94-
- Production Kubernetes agent targets must use HTTPS and `verify_tls=true`. Multi-server routing should trust only enabled, healthy, non-draining targets with available GPU capacity, and it should clear stale persisted assignments when those checks fail.
37+
The only secret requested during install is the validator hotkey mnemonic. Never
38+
enter coldkey material. Do not store mnemonics in `.env`, shell history, support
39+
threads, screenshots, or evidence logs.
9540

96-
For Helm, render production values with `deploy/helm/platform/values.production.example.yaml` and verify failures for unsafe overrides such as `image.tag=latest`, missing `image.digest`, missing database secret references, or target `verify_tls=false`.
41+
The installer creates a Kubernetes Secret named `platform-validator-wallet` from
42+
generated hotkey files and deletes the temporary local wallet directory when it
43+
exits. Kubernetes Secrets are readable by cluster admins and any subject with
44+
Secret read RBAC; use a dedicated namespace and enable encryption at rest for
45+
production clusters.
9746

98-
## Kubernetes PID and Swap Policy
47+
## Registry And Wallet Defaults
9948

100-
Kubernetes jobs and challenge workloads map CPU and memory to PodSpec requests and limits. Docker-only `pids_limit`, `memory_swap`, and custom Docker network modes are rejected for Kubernetes requests because Kubernetes won't enforce those fields through this PodSpec path. If a production cluster needs PID or swap ceilings, document the cluster or admission policy that enforces them.
49+
```text
50+
PLATFORM_VALIDATOR_REGISTRY_URL=https://chain.platform.network
51+
PLATFORM_NAMESPACE=platform-validator
52+
PLATFORM_WALLET_NAME=platform-validator
53+
PLATFORM_WALLET_HOTKEY=validator
54+
PLATFORM_DATABASE_URL=postgresql+asyncpg://platform:<password>@postgres.platform.svc.cluster.local/platform
55+
PLATFORM_BROKER_ALLOWED_IMAGES=ghcr.io/platformnetwork/,registry.example.com/platform/
56+
```
10157

102-
## Broker Archive and Cleanup Checks
58+
The validator pod sees the hotkey at:
10359

104-
Broker archive input is untrusted. Validation evidence for broker changes should show that Docker and Kubernetes paths reject absolute paths, parent traversal, links, device members, malformed images, and unsafe mount sources before creating runtime resources.
60+
```text
61+
/var/lib/platform/wallets/platform-validator/hotkeys/validator
62+
```
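
The path above appears to combine the wallet defaults from the previous section (`PLATFORM_WALLET_NAME` and `PLATFORM_WALLET_HOTKEY`). A minimal sketch of that layout, assuming the wallet root is the `/var/lib/platform/wallets` prefix shown above; the `hotkey_path` helper is hypothetical and is not part of the installer:

```python
from pathlib import PurePosixPath

# Wallet root copied from the path shown in the runbook.
WALLET_ROOT = PurePosixPath("/var/lib/platform/wallets")


def hotkey_path(
    wallet_name: str = "platform-validator",
    hotkey: str = "validator",
) -> PurePosixPath:
    """Build <root>/<wallet name>/hotkeys/<hotkey name> (hypothetical helper)."""
    return WALLET_ROOT / wallet_name / "hotkeys" / hotkey
```

With the defaults, `hotkey_path()` yields the path listed above; changing either default would move the file the pod expects.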
10563

106-
Kubernetes broker cleanup evidence should cover deletion attempts for the Job, NetworkPolicy, and mount Secret on success, failure, timeout, apply-error, wait-error, and log-error paths. Keep archive payloads and bearer credentials out of evidence logs.
64+
## Kubernetes Scope
10765

108-
## Evidence Expectations
66+
The installer applies only resources scoped to the validator namespace:
67+
Namespace, ServiceAccount, Role, RoleBinding, PVC, ConfigMap, Secret, and
68+
Deployment. Cleanup removes only the installer-managed validator Deployment,
69+
ConfigMap, Secret, Role, RoleBinding, and ServiceAccount. The PVC is preserved
70+
intentionally so validator state is not destroyed by an update; delete it manually
71+
only when you want to erase local validator state.
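
The applied and cleaned-up lists above can be compared mechanically. The sketch below uses hypothetical constant and function names; the kind names are copied from the paragraph above, with PVC spelled out as `PersistentVolumeClaim`:

```python
# Kinds the installer applies, in the order the runbook lists them.
APPLIED_KINDS = [
    "Namespace",
    "ServiceAccount",
    "Role",
    "RoleBinding",
    "PersistentVolumeClaim",
    "ConfigMap",
    "Secret",
    "Deployment",
]

# Kinds the runbook says --cleanup removes.
CLEANUP_KINDS = {
    "Deployment",
    "ConfigMap",
    "Secret",
    "Role",
    "RoleBinding",
    "ServiceAccount",
}


def preserved_kinds() -> list[str]:
    """Return the applied kinds that cleanup leaves in place."""
    return [kind for kind in APPLIED_KINDS if kind not in CLEANUP_KINDS]
```

Under these lists, `preserved_kinds()` returns `["Namespace", "PersistentVolumeClaim"]`, which matches the note that the PVC survives cleanup and must be deleted by hand.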
10972

110-
Save validation output in a local, gitignored evidence directory with task-scoped names. Evidence should include:
73+
Kubernetes mode requires an external PostgreSQL `PLATFORM_DATABASE_URL` and
74+
registry-scoped `PLATFORM_BROKER_ALLOWED_IMAGES`. SQLite URLs, wildcards, and
75+
broad prefixes such as `platformnetwork/` fail settings validation.
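
As a rough illustration of the rule above, the sketch below rejects SQLite URLs, wildcards, and prefixes without a registry host. It is not the project's settings code: the function names and the "registry host contains a dot" heuristic are assumptions made for this example.

```python
from urllib.parse import urlsplit


def is_external_postgres_url(url: str) -> bool:
    """Return True for PostgreSQL URLs that name a database host."""
    parts = urlsplit(url)
    # SQLAlchemy-style schemes look like "postgresql+asyncpg"; keep the dialect.
    dialect = parts.scheme.split("+", 1)[0]
    return dialect == "postgresql" and bool(parts.hostname)


def is_registry_scoped_prefix(prefix: str) -> bool:
    """Return True for prefixes such as "ghcr.io/platformnetwork/"."""
    if "*" in prefix or not prefix.endswith("/"):
        return False
    host, _, repository = prefix.partition("/")
    # Assumed heuristic: a registry host contains a dot, so "platformnetwork/"
    # (no host) and "ghcr.io/" (no repository path) are both rejected.
    return "." in host and bool(repository)


def parse_allowed_images(value: str) -> list[str]:
    """Split a comma-separated setting and reject any unsafe prefix."""
    prefixes = [item.strip() for item in value.split(",") if item.strip()]
    rejected = [p for p in prefixes if not is_registry_scoped_prefix(p)]
    if not prefixes or rejected:
        raise ValueError(f"unsafe or empty image allowlist: {rejected}")
    return prefixes
```

The real validator may apply stricter or different checks; the point is only that each allowlist entry has to carry both a registry host and a repository path.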
11176

112-
- command logs for Python, Compose, Helm, kubeconform, kind, and kubectl dry-run surfaces that were actually executed;
113-
- policy guard output showing Watchtower scope, Docker socket risk wording, production PostgreSQL, semver plus digest images, `verify_tls=true`, Kubernetes PID boundary, multi-server target trust, and cleanup commands are documented;
114-
- explicit limitations for unavailable tools or historical blockers, including the resolved Task 11 Ruff format and mypy blockers only when labeled historical or resolved, plus current Task 12 evidence showing Ruff check, Ruff format check, mypy, and full coverage passing;
115-
- a redaction or grep check showing evidence does not contain bearer tokens, private keys, kubeconfigs, credentialed database URLs, private registry credentials, or Docker registry auth.
77+
## Validation
11678

117-
## Master Deployment Checklist
79+
```bash
80+
bash -n scripts/install-validator.sh
81+
./scripts/install-validator.sh --dry-run --skip-hotkey-import
82+
uv run pytest tests/unit/test_validator_install_docs.py
83+
uv run ruff check .
84+
uv run ruff format --check .
85+
uv run mypy src tests
86+
uv run pytest --cov=platform_network --cov-report=term-missing --cov-fail-under=80
87+
```
11888

119-
1. Configure `config/master.example.yaml` or provide environment overrides.
120-
2. Provide an admin token file.
121-
3. Run Alembic migrations.
122-
4. Start the master API.
123-
5. Start the proxy API.
124-
6. Register and activate challenge images.
125-
7. Monitor logs, Sentry, and OpenTelemetry.
89+
If Kubernetes or a Python tool is unavailable, record the missing tool as a
90+
blocker instead of marking that surface as tested.
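
One way to detect missing tools before running the validation commands is a PATH lookup. The sketch below is an assumption about how such a check could look, not part of the repository; `missing_tools` is a hypothetical helper, and the injectable `which` parameter exists only to make it testable.

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools(["bash", "kubectl", "uv"])` returns the names to record as blockers; an empty list means every listed tool was found.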
