|
1 | 1 | # Validator Operations |
2 | 2 |
|
3 | | - |
| 3 | +Run these commands from the repository root. This runbook covers only routine |
| 4 | +validator installation and operation on Kubernetes. |
4 | 5 |
|
5 | | -## Scope |
| 6 | +## Install Or Update |
6 | 7 |
|
7 | | -Run these commands from the repository root. They are the local validation surfaces used for the corrected Python `platform-network` work. If Docker, Helm, kubeconform, kind, kubectl, or a Python tool is missing, record the blocker in evidence and don't mark that surface as tested. |
8 | | - |
9 | | -## Python Validation |
10 | | - |
11 | | -```bash |
12 | | -uv sync --extra dev --extra master |
13 | | -uv run ruff check . |
14 | | -uv run ruff format --check . |
15 | | -uv run mypy src tests |
16 | | -uv run pytest --cov=platform_network --cov-report=term-missing --cov-fail-under=80 |
17 | | -``` |
18 | | - |
19 | | -Known baseline notes belong in evidence. Current Task 12 evidence records these Python quality gates as passing without changing the documented gates: Ruff check, Ruff format check, mypy, and full coverage. Historical Task 11 evidence recorded Ruff format and mypy blockers, but those blockers are resolved in the current validation state. |
20 | | - |
21 | | -## Docker Compose Validation |
22 | | - |
23 | | -Validate the base stack, the development stack, and the local or staging Watchtower overlay: |
| 8 | +Automatic Kubernetes install: |
24 | 9 |
|
25 | 10 | ```bash |
26 | | -docker compose -f docker/compose.yml config --quiet |
27 | | -docker compose -f docker/compose.dev.yml config --quiet |
28 | | -docker compose -f docker/compose.yml -f docker/compose.watchtower.yml config --quiet |
| 11 | +./scripts/install-validator.sh |
29 | 12 | ``` |
30 | 13 |
|
31 | | -Clean up local Compose validation resources with: |
| 14 | +Dry-run the generated objects without changing the cluster: |
32 | 15 |
|
33 | 16 | ```bash |
34 | | -docker compose -f docker/compose.yml -f docker/compose.watchtower.yml down --remove-orphans |
| 17 | +./scripts/install-validator.sh --dry-run --skip-hotkey-import |
35 | 18 | ``` |
36 | 19 |
|
37 | | -The Compose files intentionally expose local development defaults. Production safety comes from the production policy checks, production Helm values, external PostgreSQL, and digest-pinned images. |
38 | | - |
39 | | -## Local and Staging Watchtower Overlay |
40 | | - |
41 | | -`docker/compose.watchtower.yml` is an explicit local and staging overlay for updating the Docker Compose control plane. It uses the maintained `nickfedor/watchtower:1.17.1` image for Docker 29 API compatibility and runs Watchtower with `--label-enable`, so containers are ignored unless they carry `com.centurylinklabs.watchtower.enable=true`. |
42 | | - |
43 | | -The only Compose services opted in are: |
44 | | - |
45 | | -- `master-admin` |
46 | | -- `master-proxy` |
47 | | -- `platform-docker-broker` |
48 | | -- `validator` |
49 | | -- `gpu-agent` |
50 | | - |
51 | | -Challenge containers, broker-created job containers, database services, and Kubernetes manifests must not receive Watchtower labels. Production Kubernetes uses Helm and Kubernetes rollout controls instead of Watchtower. |
52 | | - |
53 | | -Before using the overlay in local or staging environments, render it with the Compose command above and confirm the watched services are healthy after each update. Keep a rollback image tag available. Watchtower can replace containers, but it doesn't prove application health or perform production rollback orchestration. |
54 | | - |
55 | | -## Docker Socket Risk |
56 | | - |
57 | | -The local Compose control plane mounts `/var/run/docker.sock` for services that need to create local Docker containers or inspect/update local Compose services. The host Docker socket is root-equivalent host access. Treat these socket mounts as local control-plane risk, not production isolation. |
58 | | - |
59 | | -The Compose labels `platform.security.docker-socket` and `platform.security.docker-socket-risk` must stay on socket-owning services. Broker-created challenge containers and Kubernetes jobs must not receive the host Docker socket. |
60 | | - |
61 | | -## Helm and Kubernetes Validation |
62 | | - |
63 | | -Validate the chart with default values and the production policy fixture: |
| 20 | +Remove only the installer-managed validator objects: |
64 | 21 |
|
65 | 22 | ```bash |
66 | | -helm lint deploy/helm/platform |
67 | | -helm template platform deploy/helm/platform > /tmp/platform-default.yaml |
68 | | -kubeconform -strict -summary /tmp/platform-default.yaml |
69 | | -helm template platform deploy/helm/platform -f deploy/helm/platform/values.production.example.yaml > /tmp/platform-production.yaml |
70 | | -kubeconform -strict -summary /tmp/platform-production.yaml |
| 23 | +./scripts/install-validator.sh --cleanup |
71 | 24 | ``` |
72 | 25 |
|
73 | | -Run kind-backed server dry-run validation when Docker and kind are available: |
| 26 | +## Runtime Commands |
74 | 27 |
|
75 | 28 | ```bash |
76 | | -kind delete cluster --name platform-validation |
77 | | -kind create cluster --name platform-validation |
78 | | -kind get kubeconfig --name platform-validation > /tmp/platform-validation-kubeconfig |
79 | | -KUBECONFIG=/tmp/platform-validation-kubeconfig kubectl apply --dry-run=server -f /tmp/platform-default.yaml |
80 | | -KUBECONFIG=/tmp/platform-validation-kubeconfig kubectl apply --dry-run=server -f /tmp/platform-production.yaml |
81 | | -kind delete cluster --name platform-validation |
| 29 | +kubectl -n platform-validator get deployment platform-validator |
| 30 | +kubectl -n platform-validator get pods |
| 31 | +kubectl -n platform-validator logs -f deployment/platform-validator |
| 32 | +kubectl -n platform-validator describe deployment platform-validator |
82 | 33 | ``` |
83 | 34 |
|
84 | | -Remove `/tmp/platform-validation-kubeconfig` after use if it contains live cluster access. Never commit kubeconfigs or paste them into evidence. |
85 | | - |
86 | | -## Database, Image, and TLS Policy |
87 | | - |
88 | | -Use different policy expectations for local validation and production operations: |
| 35 | +## Secret Handling |
89 | 36 |
|
90 | | -- Local development and tests may use the default SQLite database URL and local mutable images. These defaults are intended for fast iteration only. |
91 | | -- Production and Kubernetes deployments must provide an external PostgreSQL database secret or URL before the control plane starts. Do not use SQLite for production or Kubernetes master state. |
92 | | -- Production images must be pinned with a semver tag and `sha256` digest. Do not deploy `latest`, untagged images, or images without a digest in production. |
93 | | -- Production remote GPU servers and Kubernetes targets must keep `verify_tls=true`. Disable TLS verification only for local test endpoints that are not part of production. |
94 | | -- Production Kubernetes agent targets must use HTTPS and `verify_tls=true`. Multi-server routing should trust only enabled, healthy, non-draining targets with available GPU capacity, and it should clear stale persisted assignments when those checks fail. |
| 37 | +The only secret requested during install is the validator hotkey mnemonic. Never |
| 38 | +enter coldkey material. Do not store mnemonics in `.env`, shell history, support |
| 39 | +threads, screenshots, or evidence logs. |
95 | 40 |
|
96 | | -For Helm, render production values with `deploy/helm/platform/values.production.example.yaml` and verify failures for unsafe overrides such as `image.tag=latest`, missing `image.digest`, missing database secret references, or target `verify_tls=false`. |
| 41 | +The installer creates a Kubernetes Secret named `platform-validator-wallet` from |
| 42 | +generated hotkey files and deletes the temporary local wallet directory when it |
| 43 | +exits. Kubernetes Secrets are readable by cluster admins and by any subject whose |
| 44 | +RBAC grants read access to Secrets; use a dedicated namespace and enable |
| 45 | +encryption at rest for production clusters. |
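
One way to back up the rule about keeping mnemonics out of evidence logs is a quick scan before saving output. The sketch below is a heuristic, not part of the installer: it assumes the mnemonic is a BIP-39-style phrase (12 or more lowercase words of 3 to 8 letters), and `find_possible_mnemonics` is a hypothetical helper name.

```python
import re

# Assumption: a mnemonic shows up as 12 or more consecutive lowercase words of
# 3 to 8 letters on one line, which matches the shape of BIP-39 English phrases.
MNEMONIC_RUN = re.compile(r"\b[a-z]{3,8}(?:\s+[a-z]{3,8}){11,}\b")


def find_possible_mnemonics(text: str) -> list[int]:
    """Return 1-based line numbers that contain a mnemonic-shaped word run."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if MNEMONIC_RUN.search(line)
    ]
```

A hit is only a prompt to inspect and redact the line by hand; ordinary prose rarely matches because short words such as "a" or "to" break the run, but a clean result does not prove the log is free of secrets.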
97 | 46 |
|
98 | | -## Kubernetes PID and Swap Policy |
| 47 | +## Registry And Wallet Defaults |
99 | 48 |
|
100 | | -Kubernetes jobs and challenge workloads map CPU and memory to PodSpec requests and limits. Docker-only `pids_limit`, `memory_swap`, and custom Docker network modes are rejected for Kubernetes requests because Kubernetes won't enforce those fields through this PodSpec path. If a production cluster needs PID or swap ceilings, document the cluster or admission policy that enforces them. |
| 49 | +```text |
| 50 | +PLATFORM_VALIDATOR_REGISTRY_URL=https://chain.platform.network |
| 51 | +PLATFORM_NAMESPACE=platform-validator |
| 52 | +PLATFORM_WALLET_NAME=platform-validator |
| 53 | +PLATFORM_WALLET_HOTKEY=validator |
| 54 | +PLATFORM_DATABASE_URL=postgresql+asyncpg://platform:<password>@postgres.platform.svc.cluster.local/platform |
| 55 | +PLATFORM_BROKER_ALLOWED_IMAGES=ghcr.io/platformnetwork/,registry.example.com/platform/ |
| 56 | +``` |
101 | 57 |
|
102 | | -## Broker Archive and Cleanup Checks |
| 58 | +The validator pod sees the hotkey at: |
103 | 59 |
|
104 | | -Broker archive input is untrusted. Validation evidence for broker changes should show that Docker and Kubernetes paths reject absolute paths, parent traversal, links, device members, malformed images, and unsafe mount sources before creating runtime resources. |
| 60 | +```text |
| 61 | +/var/lib/platform/wallets/platform-validator/hotkeys/validator |
| 62 | +``` |
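
The path above appears to follow a `<wallet root>/<wallet name>/hotkeys/<hotkey name>` layout built from `PLATFORM_WALLET_NAME` and `PLATFORM_WALLET_HOTKEY`. A minimal sketch of that composition, assuming the wallet root is `/var/lib/platform/wallets` (inferred from the path shown, not confirmed against the installer) and using a hypothetical helper `hotkey_path`:

```python
from pathlib import PurePosixPath

# Assumption: wallet root inferred from the documented in-pod path.
WALLET_ROOT = PurePosixPath("/var/lib/platform/wallets")


def hotkey_path(wallet_name: str, hotkey_name: str) -> PurePosixPath:
    """Build the in-pod hotkey path for a wallet/hotkey pair (hypothetical helper)."""
    for part in (wallet_name, hotkey_name):
        # Reject empty names and anything that could escape the wallet root.
        if not part or "/" in part or part in {".", ".."}:
            raise ValueError(f"unsafe path component: {part!r}")
    return WALLET_ROOT / wallet_name / "hotkeys" / hotkey_name


print(hotkey_path("platform-validator", "validator"))
# /var/lib/platform/wallets/platform-validator/hotkeys/validator
```

If you override the wallet name or hotkey name, expect the mounted path to change accordingly.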
105 | 63 |
|
106 | | -Kubernetes broker cleanup evidence should cover deletion attempts for the Job, NetworkPolicy, and mount Secret on success, failure, timeout, apply-error, wait-error, and log-error paths. Keep archive payloads and bearer credentials out of evidence logs. |
| 64 | +## Kubernetes Scope |
107 | 65 |
|
108 | | -## Evidence Expectations |
| 66 | +The installer applies only the Namespace and the namespaced resources the |
| 67 | +validator needs: ServiceAccount, Role, RoleBinding, PVC, ConfigMap, Secret, and |
| 68 | +Deployment. Cleanup removes only the installer-managed validator Deployment, |
| 69 | +ConfigMap, Secret, Role, RoleBinding, and ServiceAccount. The PVC is preserved |
| 70 | +so that an update or cleanup does not destroy validator state; delete it |
| 71 | +manually only when you intend to erase local validator state. |
109 | 72 |
|
110 | | -Save validation output in a local, gitignored evidence directory with task-scoped names. Evidence should include: |
| 73 | +Kubernetes mode requires an external PostgreSQL `PLATFORM_DATABASE_URL` and |
| 74 | +registry-scoped `PLATFORM_BROKER_ALLOWED_IMAGES`. SQLite URLs, wildcards, and |
| 75 | +broad prefixes such as `platformnetwork/` fail settings validation. |
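
The rules in the paragraph above can be sketched as a pre-flight check. This is an illustration under stated assumptions, not the project's actual settings validation: `validate_kubernetes_settings` is a hypothetical name, and "registry-scoped" is interpreted here with the common Docker convention that a registry host contains a dot or a port, or is `localhost`.

```python
from urllib.parse import urlsplit


def validate_kubernetes_settings(database_url: str, allowed_images: str) -> list[str]:
    """Return a list of problems; an empty list means the values look acceptable."""
    problems: list[str] = []

    # The scheme may carry a driver suffix, e.g. "postgresql+asyncpg".
    dialect = urlsplit(database_url).scheme.split("+", 1)[0]
    if dialect not in {"postgresql", "postgres"}:
        problems.append(f"database URL must be PostgreSQL, got {dialect or 'no scheme'!r}")

    prefixes = [p.strip() for p in allowed_images.split(",") if p.strip()]
    if not prefixes:
        problems.append("at least one allowed image prefix is required")
    for prefix in prefixes:
        registry = prefix.split("/", 1)[0]
        if "*" in prefix:
            problems.append(f"wildcard not allowed: {prefix!r}")
        # Assumption: a registry host has a dot or a port, or is "localhost".
        elif "/" not in prefix or not (
            "." in registry or ":" in registry or registry == "localhost"
        ):
            problems.append(f"prefix is not registry-scoped: {prefix!r}")
    return problems
```

With the defaults shown earlier (`ghcr.io/platformnetwork/,registry.example.com/platform/` and a `postgresql+asyncpg://` URL) this sketch reports no problems, while `sqlite:///validator.db` with `platformnetwork/,*` yields three.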
111 | 76 |
|
112 | | -- command logs for Python, Compose, Helm, kubeconform, kind, and kubectl dry-run surfaces that were actually executed; |
113 | | -- policy guard output showing Watchtower scope, Docker socket risk wording, production PostgreSQL, semver plus digest images, `verify_tls=true`, Kubernetes PID boundary, multi-server target trust, and cleanup commands are documented; |
114 | | -- explicit limitations for unavailable tools or historical blockers, including the resolved Task 11 Ruff format and mypy blockers only when labeled historical or resolved, plus current Task 12 evidence showing Ruff check, Ruff format check, mypy, and full coverage passing; |
115 | | -- a redaction or grep check showing evidence does not contain bearer tokens, private keys, kubeconfigs, credentialed database URLs, private registry credentials, or Docker registry auth. |
| 77 | +## Validation |
116 | 78 |
|
117 | | -## Master Deployment Checklist |
| 79 | +```bash |
| 80 | +bash -n scripts/install-validator.sh |
| 81 | +./scripts/install-validator.sh --dry-run --skip-hotkey-import |
| 82 | +uv run pytest tests/unit/test_validator_install_docs.py |
| 83 | +uv run ruff check . |
| 84 | +uv run ruff format --check . |
| 85 | +uv run mypy src tests |
| 86 | +uv run pytest --cov=platform_network --cov-report=term-missing --cov-fail-under=80 |
| 87 | +``` |
118 | 88 |
|
119 | | -1. Configure `config/master.example.yaml` or provide environment overrides. |
120 | | -2. Provide an admin token file. |
121 | | -3. Run Alembic migrations. |
122 | | -4. Start the master API. |
123 | | -5. Start the proxy API. |
124 | | -6. Register and activate challenge images. |
125 | | -7. Monitor logs, Sentry, and OpenTelemetry. |
| 89 | +If Kubernetes or a Python tool is unavailable, record the missing tool as a |
| 90 | +blocker instead of marking that surface as tested. |
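
A small pre-check can make blocker recording routine. The sketch below assumes `bash`, `kubectl`, and `uv` are the executables the validation commands depend on; the tool list and the `find_blockers` helper are illustrative, not part of the repository.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumption: these are the executables the validation commands rely on.
REQUIRED_TOOLS = ("bash", "kubectl", "uv")


def find_blockers(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that are not on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for tool in find_blockers():
        print(f"BLOCKER: {tool} not found on PATH; surface not tested")
```

Passing `which` as a parameter keeps the check easy to unit test without touching the real `PATH`.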