dennys246
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
Lines changed: 47 additions & 0 deletions
diff --git a/docs/plans/README.md b/docs/plans/README.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/plans/reactive_peer_mesh_roadmap.md b/docs/plans/reactive_peer_mesh_roadmap.md
Lines changed: 4 additions & 4 deletions
diff --git a/docs/troubleshooting/mesh_debug.md b/docs/troubleshooting/mesh_debug.md
Lines changed: 21 additions & 1 deletion
diff --git a/docs/troubleshooting/remote_update.md b/docs/troubleshooting/remote_update.md
Lines changed: 6 additions & 0 deletions
diff --git a/docs/user/cli-reference.md b/docs/user/cli-reference.md
Lines changed: 5 additions & 1 deletion
@@ -194,4 +194,51 @@ jobs:
             echo "in the same commit so the rule stays coherent."
             exit 1
           fi
+          # Plan 4 C3.5 admin_core single-source-of-truth rules:
+          # update_on_target, restart_on_target, llm_swap_on_target are
+          # the only CLIENT-side functions that POST to the leader's
+          # admin endpoints. Same pattern as install_core (C3.3).
+          ADMIN_UPDATE_MATCHES=$(grep -rn --include="*.py" -F "/v1/admin/update" src/maxim/ tests/ \
+            | grep -vE "^src/maxim/peer/admin_core\.py:" \
+            | grep -vE "^tests/unit/test_admin_core\.py:" \
+            | grep -vE "^src/maxim/runtime/leader_proxy\.py:" \
+            || true)
+          if [ -n "$ADMIN_UPDATE_MATCHES" ]; then
+            echo "ERROR: new caller of /v1/admin/update detected (Plan 4 C3.5)"
+            echo "$ADMIN_UPDATE_MATCHES"
+            echo ""
+            echo "update_on_target in src/maxim/peer/admin_core.py is the"
+            echo "single source of truth for the /v1/admin/update wire shape."
+            echo "Import update_on_target from peer/admin_core instead."
+            exit 1
+          fi
+          ADMIN_RESTART_MATCHES=$(grep -rn --include="*.py" -F "/v1/admin/restart" src/maxim/ tests/ \
+            | grep -vE "^src/maxim/peer/admin_core\.py:" \
+            | grep -vE "^tests/unit/test_admin_core\.py:" \
+            | grep -vE "^src/maxim/runtime/leader_proxy\.py:" \
+            || true)
+          if [ -n "$ADMIN_RESTART_MATCHES" ]; then
+            echo "ERROR: new caller of /v1/admin/restart detected (Plan 4 C3.5)"
+            echo "$ADMIN_RESTART_MATCHES"
+            echo ""
+            echo "restart_on_target in src/maxim/peer/admin_core.py is the"
+            echo "single source of truth for the /v1/admin/restart wire shape."
+            echo "Import restart_on_target from peer/admin_core instead."
+            exit 1
+          fi
+          ADMIN_LLM_MATCHES=$(grep -rn --include="*.py" -F "/v1/admin/llm-swap" src/maxim/ tests/ \
+            | grep -vE "^src/maxim/peer/admin_core\.py:" \
+            | grep -vE "^tests/unit/test_admin_core\.py:" \
+            | grep -vE "^src/maxim/runtime/leader_proxy\.py:" \
+            | grep -vE "^src/maxim/runtime/lane_backends\.py:.*Called by LeaderProxy" \
+            || true)
+          if [ -n "$ADMIN_LLM_MATCHES" ]; then
+            echo "ERROR: new caller of /v1/admin/llm-swap detected (Plan 4 C3.6)"
+            echo "$ADMIN_LLM_MATCHES"
+            echo ""
+            echo "llm_swap_on_target in src/maxim/peer/admin_core.py is the"
+            echo "single source of truth for the /v1/admin/llm-swap wire shape."
+            echo "Import llm_swap_on_target from peer/admin_core instead."
+            exit 1
+          fi
           echo "CI grep invariants clean"
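
The allow-list rule enforced by the three grep pipelines above can be sketched in Python to make the logic easier to read. This is only an illustration, not part of the workflow: `find_violations`, `ALLOWED_PREFIXES`, and the sample grep lines are all hypothetical, and the extra content-based exclusion for `lane_backends.py` in the `/v1/admin/llm-swap` rule is not modeled.

```python
# Hypothetical sketch of the CI allow-list check; the real workflow uses grep.
# Files allowed to mention the admin endpoints (taken from the grep -vE filters).
ALLOWED_PREFIXES = (
    "src/maxim/peer/admin_core.py:",
    "tests/unit/test_admin_core.py:",
    "src/maxim/runtime/leader_proxy.py:",
)


def find_violations(grep_lines, needle, allowed_prefixes=ALLOWED_PREFIXES):
    """Return grep-style lines ("path:lineno:text") that contain `needle`
    and do not come from an allow-listed file."""
    return [
        line
        for line in grep_lines
        if needle in line and not line.startswith(allowed_prefixes)
    ]


# Made-up sample input shaped like `grep -rn` output.
sample = [
    'src/maxim/peer/admin_core.py:40:    url = base + "/v1/admin/update"',
    'src/maxim/peer/cli.py:212:    post(base + "/v1/admin/update")',
    'tests/unit/test_admin_core.py:9:    assert "/v1/admin/update" in url',
]
violations = find_violations(sample, "/v1/admin/update")
```

In this sample only the `peer/cli.py` line would be reported, which corresponds to the case where the workflow prints the error and exits 1.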
@@ -121,14 +121,14 @@ Earlier archives (2026-04-11/12, S1–S4 shipped 2026-04-12):
 Three tracks run in parallel:
 - **Track A — Substrate:** the bio-inspired research claim. ~~F0 → P0 → P1 → P2 → P3a → P3b → P3.5 → P4~~ ALL SHIPPED → P5 → P6 → P8.
 - **Track B — Prompt layer:** ~~B1~~ SHIPPED → B3 → B4 → B5.
-- **Track C — Infrastructure:** ~~LLM path Plans 1–3.5~~ SHIPPED → Reactive peer mesh (C3.5/C3.6/C4.6 remaining).
+- **Track C — Infrastructure:** ~~LLM path Plans 1–3.5~~ SHIPPED → Reactive peer mesh (~~C3.5/C3.6~~ SHIPPED, C4.6 remaining).
 - **Track D — Behavioral convergence (NEW):** ~~Tier 1 + Tier 2 + Tier 3~~ ALL PASS (41/41 hypotheses) → Scale validation (20+ seeds).
 
 | Version | What ships | What it proves | Status |
 |---|---|---|---|
 | ~~**0.2.x**~~ | Foundations, cleanup, peer flexibility | Friction removed, infrastructure stable | ✅ SHIPPED |
 | **0.3.0** | SEM learning loop, valence annotation, cerebellum activation, concept decomposition, behavioral convergence (Tier 1+2+3), reactive mesh (C4+C4.5) | **Cross-session learning without fine-tuning.** Agent learns from own actions, persists, behaves differently. 41/41 experiments. | ✅ **CURRENT** |
-| **0.4** | Tier 3 at scale (20+ seeds), episode boundary enrichment, P5 stress persistence, peer mesh completion (C3.5/C3.6/C4.6) | Learning is robust under variance + load. Substrate persists at 10k+ nodes. Mesh fully operational. | **NEXT** |
+| **0.4** | Tier 3 at scale (20+ seeds), episode boundary enrichment, P5 stress persistence, peer mesh completion (~~C3.5/C3.6~~ SHIPPED, C4.6) | Learning is robust under variance + load. Substrate persists at 10k+ nodes. Mesh fully operational. | **NEXT** |
 | **0.5** | P6 (extinction vs LRU), P8 (sleep replay), B3 (acting coach), B4 (replanning) | Agent forgets appropriately, consolidates offline, has coherent voice, recovers from failures. | Planned |
 | **1.0** | All phases passing, B4 gating, behavioral convergence at scale with statistical rigor | Cross-session learning at realistic scale, coherent voice, ongoing research program | Target |
 
@@ -139,7 +139,7 @@ Three tracks run in parallel:
 | **D — Tier 3 at scale** | Run organic learning experiment with 20+ seeds, report mean ± std | ~1 session | 0.3 proves the mechanism with 1 run; 0.4 proves it's not a fluke |
 | **A — Episode boundaries** | Tool execution boundary + semantic shift detection (Rules 1-2) | ~200 LOC | Pre-P5 polish, observe_episode_event is now wired |
 | **A — P5 stress persistence** | 10k+ node persistence stress test | ~500 LOC | Validates substrate robustness under realistic load |
-| **C — Peer mesh completion** | C3.5 (`--node update/restart/llm`), C3.6, C4.6 (auto-undrain) | In progress | Complete the reactive mesh story |
+| **C — Peer mesh completion** | ~~C3.5 (`--node update/restart`)~~ SHIPPED, ~~C3.6 (`--node llm`)~~ SHIPPED, C4.6 (auto-undrain) | C4.6 remaining | Complete the reactive mesh story |
 
 ### What 0.3 proved
 
 
@@ -63,10 +63,10 @@ These complete the manage-the-mesh-by-hand surface. Each is a small ship.
 
 - **C3.3 ✅ SHIPPED (PR #128, 2026-04-15):** `maxim peer --node <name> install <extras>` — mesh-aware install composing drain → install → resume around the shared `install_on_target` core in [install_core.py](../../src/maxim/peer/install_core.py). Cross-confirmed pre-merge review found + folded 17 items including the probe-cache URL mismatch (CC1) and drain TOCTOU (CC2). New `drain_node_if_absent` atomic primitive closes the TOCTOU window. Exit code 3 introduced for post-install-resume-failure distinguishability.
 - **C3.4 ✅ SHIPPED (PR #142, 2026-04-17):** `GET /v1/debug/vram` admin endpoint. Returns live nvidia-smi ratio + projected model footprint from `project_vram_usage()` as JSON. 503 when nvidia-smi unavailable. Auth via bearer or localhost. Also lifted `_current_llama_server_n_ctx` to `leader_proxy.py` as canonical probe location, and fixed pre-existing `_is_debug_path`/`_route_debug` desync (deps + install-status bypassed auth gate). 2-lens pre-merge review, 11 new tests.
-- **C3.5:** `maxim peer --node <name> update` and `--node restart` — mesh-aware versions of the existing positional-URL verbs, composing drain/op/resume. Same shape as C3.3; will reuse the `install_core.py` pattern (lift a shared `<op>_on_target` core, extend the CI grep allow-list). Probably one PR for both.
-- **C3.6:** `maxim peer --node <name> llm <model>` — per-node model swap. Today `maxim peer llm <model>` operates on the connected leader only.
+- **C3.5 ✅ SHIPPED (2026-04-17):** `maxim peer --node <name> update [--dry-run] [--force] [--branch <b>]` and `--node restart` — mesh-aware versions of the existing positional-URL verbs, composing drain → op → resume. HTTP wire-level logic extracted from `peer/cli.py` into shared `admin_core.py` (mirrors `install_core.py` pattern). CI grep allow-lists enforce single source of truth for `/v1/admin/update`, `/v1/admin/restart`. 2-lens pre-merge review found 1 cross-confirmed BLOCKING (dry-run bypassed self-guard) + folded 6 total findings. 42 new tests (22 mesh verb + 20 wire-level).
+- **C3.6 ✅ SHIPPED (2026-04-17):** `maxim peer --node <name> llm <model>` — per-node model swap with drain → swap → resume composition. Key enabler for C5 capacity-aware routing (per-node model assignment). CI grep allow-list for `/v1/admin/llm-swap`. Shipped in the same PR as C3.5.
 
-**Estimated effort:** C3.4 + C3.5 + C3.6 ≈ 3 small PRs over 3 sessions. Mostly composition over existing primitives, low review surface each.
+**Stage C3 operator surface COMPLETE.** All planned mesh management verbs shipped.
 
 ### Stage C4: Wire the router to drain state ✅ SHIPPED (PR #148, 2026-04-17)
 
@@ -155,7 +155,7 @@ Standardized small-document (`.md` / `.json`) exchange between mesh nodes. The m
 
 | Version | Includes | Status |
 |---|---|---|
-| **0.4** (in flight) | Plan 4 C3.3 → C3.6 (operator verb surface) + C3.4 VRAM + C4/C4.5 reactive drain | **C3.3-C3.4 SHIPPED**; **C4+C4.5 SHIPPED** (PRs #148, #152); C3.5/C3.6/C4.6 pending |
+| **0.4** (in flight) | Plan 4 C3.3 → C3.6 (operator verb surface) + C3.4 VRAM + C4/C4.5 reactive drain | **C3.3-C3.6 SHIPPED**; **C4+C4.5 SHIPPED** (PRs #148, #152); C4.6 pending |
 | **0.5** | C4.6 auto-undrain + C5 capacity-aware routing + substrate P3a / P4 / B3-B5 | C4.6 design needed |
 | **0.6** | C5 capacity-aware routing + C6 admin API + dashboard + **C9 mesh doc transport** | not started |
 | **0.7+** | C7 security hardening + C8 cross-version compat | not started |
 
@@ -1,7 +1,7 @@
 # Mesh Debug — operator runbook for the Plan 4 Stage C surface
 
 **Audience:** operators running a Maxim mesh (one leader + N peers, or N peers without a single leader).
-**Scope:** the `mesh.yml` declarative config, the `init-mesh` / `add-node` / `remove-node` setup verbs, the `drain` / `resume` / `list-drained` runtime state surface shipped across PRs #112 (C1), #113 (C2), #118 (C3.1), C3.2 follow-up, `--node install` (C3.3, PR #128), and `GET /v1/debug/vram` VRAM observability endpoint (C3.4, PR #142).
+**Scope:** the `mesh.yml` declarative config, the `init-mesh` / `add-node` / `remove-node` setup verbs, the `drain` / `resume` / `list-drained` runtime state surface shipped across PRs #112 (C1), #113 (C2), #118 (C3.1), C3.2 follow-up, `--node install` (C3.3, PR #128), `GET /v1/debug/vram` VRAM observability endpoint (C3.4, PR #142), and `--node update` / `--node restart` (C3.5) + `--node llm` (C3.6).
 
 If you are debugging a network-layer problem (DNS, TCP, TLS, Cloudflare tunnel) start in [peer_leader_connectivity.md](peer_leader_connectivity.md) instead. This doc covers symptoms above the network layer — the cluster is reachable, but routing is doing the wrong thing.
 
@@ -112,6 +112,26 @@ Expected if the install **failed**. Check the exit code (non-zero) and the stder
 
 If the install **succeeded** (exit 0, "Resumed 'X' after install" in stdout) but the node is still drained, check whether you pre-drained it before the install — the was-drained sticky rule means pre-drained nodes never get auto-resumed. Run `maxim peer --node <name> resume` to clear it.
 
+### Walkthrough: update, restart, or swap LLM on a named mesh node (Plan 4 C3.5 + C3.6)
+
+```bash
+maxim peer --node mac-studio update                    # drain → update → resume
+maxim peer --node mac-studio update --dry-run          # preview only, no drain
+maxim peer --node mac-studio update --branch dev       # target a specific branch
+maxim peer --node mac-studio restart                   # drain → restart → resume
+maxim peer --node mac-studio llm qwen2.5-14b           # drain → swap → resume
+```
+
+These three verbs follow **the same drain → op → resume composition pattern** as `--node install` (C3.3). The HTTP wire-level logic lives in shared core functions in [admin_core.py](../../src/maxim/peer/admin_core.py) (single source of truth, CI grep enforced). The mesh CLI wires the drain/resume bookkeeping around the core.
+
+All three verbs share the same invariants:
+- **Self-guard:** refuses to operate on `mesh.yml::self` (use the direct `maxim peer update`/`restart`/`llm` verbs instead)
+- **Was-drained sticky semantics:** if the node was already drained, the verb skips drain/resume bookkeeping
+- **Failure leaves drained:** if the operation fails, the node stays drained with a loud hint
+- **Exit code 3:** operation succeeded but resume failed (same as install)
+
+**`--node update --dry-run`** is special: it skips the drain step entirely (preview is read-only). The self-guard still fires even for dry-run.
+
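
The invariants listed above can be condensed into a minimal Python sketch. It is an in-memory model under stated assumptions, not the code in `peer/cli.py` or `admin_core.py`: `run_mesh_verb`, the `drained` set, and the `op`/`resume` callables are hypothetical stand-ins, and the atomic file-locked drain check is not modeled.

```python
from typing import Callable, Set

# Exit-code contract as documented for the --node verbs.
EXIT_OK, EXIT_OP_FAILED, EXIT_REFUSED, EXIT_RESUME_FAILED = 0, 1, 2, 3


def run_mesh_verb(
    node: str,
    self_name: str,
    drained: Set[str],
    op: Callable[[str], bool],
    resume: Callable[[str], bool],
    dry_run: bool = False,
) -> int:
    """Hypothetical drain -> op -> resume composition for one mesh verb."""
    # Self-guard fires first, even for --dry-run.
    if node == self_name:
        return EXIT_REFUSED
    if dry_run:
        # Preview is read-only: no drain, no resume.
        return EXIT_OK if op(node) else EXIT_OP_FAILED

    was_drained = node in drained
    if not was_drained:
        drained.add(node)

    if not op(node):
        # Failure leaves the node drained; the hint text here is illustrative.
        print(f"STILL DRAINED -> run `maxim peer --node {node} resume`")
        return EXIT_OP_FAILED

    if was_drained:
        # Was-drained sticky: never undo a drain the operator set up earlier.
        return EXIT_OK
    if not resume(node):
        return EXIT_RESUME_FAILED
    drained.discard(node)
    return EXIT_OK
```

A caller would pass the real operation as `op`, for example a wrapper around `update_on_target`, and use the returned integer as the process exit code.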
 ---
 
 ## Symptoms → first place to look
 
@@ -52,6 +52,12 @@ maxim peer llm qwen2.5-14b
 
 # Check what model is running:
 maxim peer llm --status
+
+# Mesh-aware versions (use node names from mesh.yml):
+maxim peer --node mac-studio update              # drain → update → resume
+maxim peer --node mac-studio update --dry-run    # preview only, no drain
+maxim peer --node mac-studio restart             # drain → restart → resume
+maxim peer --node mac-studio llm qwen2.5-14b     # drain → swap → resume
 ```
 
 ## Decision tree
 
@@ -136,9 +136,13 @@ Subcommands for managing a remote leader node over a Cloudflare tunnel.
 | `maxim peer remove-node <name>` | Remove a node from `mesh.yml::nodes`. **Side effect:** clears any drain state for `<name>` with a visible "also cleared from drain state" message so removing a drained node doesn't leave an orphan. Refuses if `<name>` is `mesh.yml::self` (you can't delete the running daemon's own identity — the error message documents the workaround: edit `mesh.yml::self` by hand, restart, then re-remove). Refuses if `mesh.yml` doesn't exist or if removing would leave 0 nodes (parser requires ≥1). (Plan 4 Stage C3.2) |
 | `maxim peer --node <name> install <extras_or_packages>` | Mesh-aware install on a named node. Composes **drain → install → resume** around the shared `install_on_target` core in [install_core.py](../../src/maxim/peer/install_core.py). Resolves the target URL + cluster key from `mesh.yml::nodes`, **not** `peer.yml`. Note that if the two files have diverged (e.g. after an unrelated cluster-key rotation), this verb sends `mesh.yml::cluster_key` while the positional-URL `maxim peer install` verb sends `peer.yml::api_key`. If you see 401s from one but not the other, the two secrets are out of sync. Accepts comma-separated `KNOWN_EXTRAS` (routed as `pymaxim[<extra>]`) plus raw pip package names in the same token list. **Refuses self-install** (points at local `pip install pymaxim[<extras>]`; no `--force` flag). **Refuses a positional URL** (points at `maxim peer install <extras> <url>` as the no-mesh.yml fallback; URL is redacted to scheme-only in the error message to avoid leaking secrets typed into argv). **Was-drained sticky:** if the operator drained the node BEFORE running install, the verb skips both the drain step AND the auto-resume: the prior drain stays intact after a successful install, so `install` never mutates your drain intent. The was-drained check is atomic with the drain mutation via `drain_node_if_absent` under filelock, closing the TOCTOU window under concurrent admin. **Leaves drained on failure:** if drain succeeds but the install itself fails, the node is left drained with a loud "STILL DRAINED → run `maxim peer --node <name> resume`" message. No auto-resume-on-failure. Exit codes: `0` ok, `1` install failed, `2` refused pre-install (self, unknown, bad tokens, drain failed), **`3` install succeeded but post-install auto-resume failed** (distinguishable from `1` so operators tailing exit codes can tell "upgraded but stuck in drain" from "failed and stuck in drain"). Any mid-install exception (`KeyboardInterrupt` included) still prints the still-drained hint before re-raising. (Plan 4 Stage C3.3) |
 
+| `maxim peer --node <name> update [--dry-run] [--force] [--branch <b>]` | Mesh-aware update on a named node. Composes **drain → update → resume** around the shared `update_on_target` core in [admin_core.py](../../src/maxim/peer/admin_core.py). `--dry-run` previews pending commits without draining (read-only). `--force` stashes a dirty tree. `--branch <b>` targets a specific branch. The self-guard refuses to update the local node (use `maxim peer update` directly). Same was-drained sticky semantics and exit-code contract as `--node install` (0 ok, 1 failed, 2 refused, 3 resume failed). (Plan 4 Stage C3.5) |
+| `maxim peer --node <name> restart` | Mesh-aware restart on a named node. Composes **drain → restart → resume** around the shared `restart_on_target` core. Two-phase recovery poll waits for the proxy to respond (~90s), then waits for the LLM model to load (~150s for large models). Self-guard refuses restarting yourself (use `maxim peer restart` directly). Same exit-code contract as `--node install`. (Plan 4 Stage C3.5) |
+| `maxim peer --node <name> llm <model>` | Mesh-aware LLM swap on a named node. Composes **drain → swap → resume** around the shared `llm_swap_on_target` core. Key enabler for C5 capacity-aware routing — per-node model assignment means the router can know which node runs which model. Self-guard refuses swapping yourself (use `maxim peer llm` directly). Same exit-code contract as `--node install`. (Plan 4 Stage C3.6) |
+
 | `GET /v1/debug/vram` | (admin endpoint, not a CLI verb) Returns the leader's live VRAM state as JSON: nvidia-smi ratio, utilization, temperature, projected model footprint from `project_vram_usage()`, spillover/warning flags, and recommended n_ctx. Auth via bearer (cluster key) or localhost. Returns 503 if nvidia-smi is unavailable (not a GPU node). Prerequisite for peer-mode doctor VRAM visibility and C5 capacity-aware routing. (Plan 4 Stage C3.4) |
 
-**Future (post-Stage C3.4):** `--node update` / `--node restart` (C3.5), `--node llm` (C3.6), `/v1/mesh/*` admin API, per-agent rate limiting, request-trace ring buffer, cluster key rotation, router ↔ drain-state coupling (C4 — the actual reactivity gate). Full arc tracked in [docs/plans/reactive_peer_mesh_roadmap.md](../plans/reactive_peer_mesh_roadmap.md).
+**Future (post-Stage C3.6):** `/v1/mesh/*` admin API, per-agent rate limiting, request-trace ring buffer, cluster key rotation, C4.6 auto-undrain via periodic health probe. Full arc tracked in [docs/plans/reactive_peer_mesh_roadmap.md](../plans/reactive_peer_mesh_roadmap.md).
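
The two-phase recovery poll described in the `--node restart` row above could look roughly like the following Python sketch. It assumes the ~90s and ~150s figures are per-phase timeouts; `wait_until`, `two_phase_recovery`, the probe callables, and the returned status strings are hypothetical names, not the `restart_on_target` implementation.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def two_phase_recovery(
    proxy_up: Callable[[], bool],
    model_loaded: Callable[[], bool],
    proxy_timeout_s: float = 90.0,   # assumed from the "~90s" in the table row
    model_timeout_s: float = 150.0,  # assumed from the "~150s" in the table row
    **poll_kwargs,
) -> str:
    """Phase 1: wait for the proxy to respond. Phase 2: wait for the model."""
    if not wait_until(proxy_up, proxy_timeout_s, **poll_kwargs):
        return "proxy_timeout"
    if not wait_until(model_loaded, model_timeout_s, **poll_kwargs):
        return "model_timeout"
    return "ready"
```

Injecting `clock` and `sleep` keeps the poll testable without real waiting; a caller would supply probes that hit whatever health endpoints the node actually exposes.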
 
 ### Drain state layer