Commit 783f65f

Merge pull request #154 from dennys246/feat/plan4-c3.5-c3.6-mesh-verbs
feat(mesh): C3.5+C3.6 — mesh-aware update, restart, and LLM swap verbs
2 parents b11f427 + f85ca37 commit 783f65f

12 files changed

Lines changed: 1531 additions & 325 deletions

File tree

.github/workflows/test.yml

Lines changed: 47 additions & 0 deletions
@@ -194,4 +194,51 @@ jobs:
194194
echo "in the same commit so the rule stays coherent."
195195
exit 1
196196
fi
197+
# Plan 4 C3.5 admin_core single-source-of-truth rules:
198+
# update_on_target, restart_on_target, llm_swap_on_target are
199+
# the only CLIENT-side functions that POST to the leader's
200+
# admin endpoints. Same pattern as install_core (C3.3).
201+
ADMIN_UPDATE_MATCHES=$(grep -rn --include="*.py" -F "/v1/admin/update" src/maxim/ tests/ \
202+
| grep -vE "^src/maxim/peer/admin_core\.py:" \
203+
| grep -vE "^tests/unit/test_admin_core\.py:" \
204+
| grep -vE "^src/maxim/runtime/leader_proxy\.py:" \
205+
|| true)
206+
if [ -n "$ADMIN_UPDATE_MATCHES" ]; then
207+
echo "ERROR: new caller of /v1/admin/update detected (Plan 4 C3.5)"
208+
echo "$ADMIN_UPDATE_MATCHES"
209+
echo ""
210+
echo "update_on_target in src/maxim/peer/admin_core.py is the"
211+
echo "single source of truth for the /v1/admin/update wire shape."
212+
echo "Import update_on_target from peer/admin_core instead."
213+
exit 1
214+
fi
215+
ADMIN_RESTART_MATCHES=$(grep -rn --include="*.py" -F "/v1/admin/restart" src/maxim/ tests/ \
216+
| grep -vE "^src/maxim/peer/admin_core\.py:" \
217+
| grep -vE "^tests/unit/test_admin_core\.py:" \
218+
| grep -vE "^src/maxim/runtime/leader_proxy\.py:" \
219+
|| true)
220+
if [ -n "$ADMIN_RESTART_MATCHES" ]; then
221+
echo "ERROR: new caller of /v1/admin/restart detected (Plan 4 C3.5)"
222+
echo "$ADMIN_RESTART_MATCHES"
223+
echo ""
224+
echo "restart_on_target in src/maxim/peer/admin_core.py is the"
225+
echo "single source of truth for the /v1/admin/restart wire shape."
226+
echo "Import restart_on_target from peer/admin_core instead."
227+
exit 1
228+
fi
229+
ADMIN_LLM_MATCHES=$(grep -rn --include="*.py" -F "/v1/admin/llm-swap" src/maxim/ tests/ \
230+
| grep -vE "^src/maxim/peer/admin_core\.py:" \
231+
| grep -vE "^tests/unit/test_admin_core\.py:" \
232+
| grep -vE "^src/maxim/runtime/leader_proxy\.py:" \
233+
| grep -vE "^src/maxim/runtime/lane_backends\.py:.*Called by LeaderProxy" \
234+
|| true)
235+
if [ -n "$ADMIN_LLM_MATCHES" ]; then
236+
echo "ERROR: new caller of /v1/admin/llm-swap detected (Plan 4 C3.6)"
237+
echo "$ADMIN_LLM_MATCHES"
238+
echo ""
239+
echo "llm_swap_on_target in src/maxim/peer/admin_core.py is the"
240+
echo "single source of truth for the /v1/admin/llm-swap wire shape."
241+
echo "Import llm_swap_on_target from peer/admin_core instead."
242+
exit 1
243+
fi
197244
echo "CI grep invariants clean"
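The three grep rules above share one shape: search `src/maxim/` and `tests/` for an endpoint literal, drop hits from allow-listed files, and fail if anything remains. A minimal Python sketch of that shape follows, in case it helps to reason about the rule outside of CI. The function name `find_unlisted_callers` and the constant names are hypothetical and are not part of the repository; only the paths and patterns are copied from the workflow.

```python
import re
from pathlib import Path

# Allow-list patterns copied from the workflow's `grep -vE` filters.
# Each pattern is matched against a "path:lineno:line" record,
# the same shape that `grep -rn` prints.
UPDATE_ALLOW_PATTERNS = [
    r"^src/maxim/peer/admin_core\.py:",
    r"^tests/unit/test_admin_core\.py:",
    r"^src/maxim/runtime/leader_proxy\.py:",
]
# The llm-swap rule adds one content-sensitive exception.
LLM_SWAP_ALLOW_PATTERNS = UPDATE_ALLOW_PATTERNS + [
    r"^src/maxim/runtime/lane_backends\.py:.*Called by LeaderProxy",
]


def find_unlisted_callers(root, needle, allow_patterns,
                          subdirs=("src/maxim", "tests")):
    """Return grep-style records for lines containing `needle` that are
    not covered by any allow-list pattern. (Hypothetical helper.)"""
    root = Path(root)
    allow = [re.compile(p) for p in allow_patterns]
    hits = []
    for sub in subdirs:
        base = root / sub
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*.py")):
            rel = path.relative_to(root).as_posix()
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle not in line:  # fixed-string match, like grep -F
                    continue
                record = f"{rel}:{lineno}:{line}"
                if any(p.search(record) for p in allow):
                    continue
                hits.append(record)
    return hits
```

Calling `find_unlisted_callers(".", "/v1/admin/update", UPDATE_ALLOW_PATTERNS)` from the repository root and treating a non-empty result as a failure would approximate the first rule. The workflow's shell version remains the authoritative check.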

docs/plans/README.md

Lines changed: 3 additions & 3 deletions
@@ -121,14 +121,14 @@ Earlier archives (2026-04-11/12, S1–S4 shipped 2026-04-12):
121121
Three tracks run in parallel:
122122
- **Track A — Substrate:** the bio-inspired research claim. ~~F0 → P0 → P1 → P2 → P3a → P3b → P3.5 → P4~~ ALL SHIPPED → P5 → P6 → P8.
123123
- **Track B — Prompt layer:** ~~B1~~ SHIPPED → B3 → B4 → B5.
124-
- **Track C — Infrastructure:** ~~LLM path Plans 1–3.5~~ SHIPPED → Reactive peer mesh (C3.5/C3.6/C4.6 remaining).
124+
- **Track C — Infrastructure:** ~~LLM path Plans 1–3.5~~ SHIPPED → Reactive peer mesh (~~C3.5/C3.6~~ SHIPPED, C4.6 remaining).
125125
- **Track D — Behavioral convergence (NEW):** ~~Tier 1 + Tier 2 + Tier 3~~ ALL PASS (41/41 hypotheses) → Scale validation (20+ seeds).
126126

127127
| Version | What ships | What it proves | Status |
128128
|---|---|---|---|
129129
| ~~**0.2.x**~~ | Foundations, cleanup, peer flexibility | Friction removed, infrastructure stable | ✅ SHIPPED |
130130
| **0.3.0** | SEM learning loop, valence annotation, cerebellum activation, concept decomposition, behavioral convergence (Tier 1+2+3), reactive mesh (C4+C4.5) | **Cross-session learning without fine-tuning.** Agent learns from own actions, persists, behaves differently. 41/41 experiments. | **CURRENT** |
131-
| **0.4** | Tier 3 at scale (20+ seeds), episode boundary enrichment, P5 stress persistence, peer mesh completion (C3.5/C3.6/C4.6) | Learning is robust under variance + load. Substrate persists at 10k+ nodes. Mesh fully operational. | **NEXT** |
131+
| **0.4** | Tier 3 at scale (20+ seeds), episode boundary enrichment, P5 stress persistence, peer mesh completion (~~C3.5/C3.6~~ SHIPPED, C4.6) | Learning is robust under variance + load. Substrate persists at 10k+ nodes. Mesh fully operational. | **NEXT** |
132132
| **0.5** | P6 (extinction vs LRU), P8 (sleep replay), B3 (acting coach), B4 (replanning) | Agent forgets appropriately, consolidates offline, has coherent voice, recovers from failures. | Planned |
133133
| **1.0** | All phases passing, B4 gating, behavioral convergence at scale with statistical rigor | Cross-session learning at realistic scale, coherent voice, ongoing research program | Target |
134134

@@ -139,7 +139,7 @@ Three tracks run in parallel:
139139
| **D — Tier 3 at scale** | Run organic learning experiment with 20+ seeds, report mean ± std | ~1 session | 0.3 proves the mechanism with 1 run; 0.4 proves it's not a fluke |
140140
| **A — Episode boundaries** | Tool execution boundary + semantic shift detection (Rules 1-2) | ~200 LOC | Pre-P5 polish, observe_episode_event is now wired |
141141
| **A — P5 stress persistence** | 10k+ node persistence stress test | ~500 LOC | Validates substrate robustness under realistic load |
142-
| **C — Peer mesh completion** | C3.5 (`--node update/restart/llm`), C3.6, C4.6 (auto-undrain) | In progress | Complete the reactive mesh story |
142+
| **C — Peer mesh completion** | ~~C3.5 (`--node update/restart`)~~ SHIPPED, ~~C3.6 (`--node llm`)~~ SHIPPED, C4.6 (auto-undrain) | C4.6 remaining | Complete the reactive mesh story |
143143

144144
### What 0.3 proved
145145

docs/plans/reactive_peer_mesh_roadmap.md

Lines changed: 4 additions & 4 deletions
@@ -63,10 +63,10 @@ These complete the manage-the-mesh-by-hand surface. Each is a small ship.
6363

6464
- **C3.3 ✅ SHIPPED (PR #128, 2026-04-15):** `maxim peer --node <name> install <extras>` — mesh-aware install composing drain → install → resume around the shared `install_on_target` core in [install_core.py](../../src/maxim/peer/install_core.py). Cross-confirmed pre-merge review found + folded 17 items including the probe-cache URL mismatch (CC1) and drain TOCTOU (CC2). New `drain_node_if_absent` atomic primitive closes the TOCTOU window. Exit code 3 introduced for post-install-resume-failure distinguishability.
6565
- **C3.4 ✅ SHIPPED (PR #142, 2026-04-17):** `GET /v1/debug/vram` admin endpoint. Returns live nvidia-smi ratio + projected model footprint from `project_vram_usage()` as JSON. 503 when nvidia-smi unavailable. Auth via bearer or localhost. Also lifted `_current_llama_server_n_ctx` to `leader_proxy.py` as canonical probe location, and fixed pre-existing `_is_debug_path`/`_route_debug` desync (deps + install-status bypassed auth gate). 2-lens pre-merge review, 11 new tests.
66-
- **C3.5:** `maxim peer --node <name> update` and `--node restart` — mesh-aware versions of the existing positional-URL verbs, composing drain/op/resume. Same shape as C3.3; will reuse the `install_core.py` pattern (lift a shared `<op>_on_target` core, extend the CI grep allow-list). Probably one PR for both.
67-
- **C3.6:** `maxim peer --node <name> llm <model>` — per-node model swap. Today `maxim peer llm <model>` operates on the connected leader only.
66+
- **C3.5 ✅ SHIPPED (2026-04-17):** `maxim peer --node <name> update [--dry-run] [--force] [--branch <b>]` and `--node restart` — mesh-aware versions of the existing positional-URL verbs, composing drain → op → resume. HTTP wire-level logic extracted from `peer/cli.py` into shared `admin_core.py` (mirrors `install_core.py` pattern). CI grep allow-lists enforce single source of truth for `/v1/admin/update`, `/v1/admin/restart`. 2-lens pre-merge review found 1 cross-confirmed BLOCKING (dry-run bypassed self-guard) + folded 6 total findings. 42 new tests (22 mesh verb + 20 wire-level).
67+
- **C3.6 ✅ SHIPPED (2026-04-17):** `maxim peer --node <name> llm <model>` — per-node model swap with drain → swap → resume composition. Key enabler for C5 capacity-aware routing (per-node model assignment). CI grep allow-list for `/v1/admin/llm-swap`. Shipped in the same PR as C3.5.
6868

69-
**Estimated effort:** C3.4 + C3.5 + C3.6 ≈ 3 small PRs over 3 sessions. Mostly composition over existing primitives, low review surface each.
69+
**Stage C3 operator surface COMPLETE.** All planned mesh management verbs shipped.
7070

7171
### Stage C4: Wire the router to drain state ✅ SHIPPED (PR #148, 2026-04-17)
7272

@@ -155,7 +155,7 @@ Standardized small-document (`.md` / `.json`) exchange between mesh nodes. The m
155155

156156
| Version | Includes | Status |
157157
|---|---|---|
158-
| **0.4** (in flight) | Plan 4 C3.3 → C3.6 (operator verb surface) + C3.4 VRAM + C4/C4.5 reactive drain | **C3.3-C3.4 SHIPPED**; **C4+C4.5 SHIPPED** (PRs #148, #152); C3.5/C3.6/C4.6 pending |
158+
| **0.4** (in flight) | Plan 4 C3.3 → C3.6 (operator verb surface) + C3.4 VRAM + C4/C4.5 reactive drain | **C3.3-C3.6 SHIPPED**; **C4+C4.5 SHIPPED** (PRs #148, #152); C4.6 pending |
159159
| **0.5** | C4.6 auto-undrain + C5 capacity-aware routing + substrate P3a / P4 / B3-B5 | C4.6 design needed |
160160
| **0.6** | C5 capacity-aware routing + C6 admin API + dashboard + **C9 mesh doc transport** | not started |
161161
| **0.7+** | C7 security hardening + C8 cross-version compat | not started |

docs/troubleshooting/mesh_debug.md

Lines changed: 21 additions & 1 deletion
@@ -1,7 +1,7 @@
11
# Mesh Debug — operator runbook for the Plan 4 Stage C surface
22

33
**Audience:** operators running a Maxim mesh (one leader + N peers, or N peers without a single leader).
4-
**Scope:** the `mesh.yml` declarative config, the `init-mesh` / `add-node` / `remove-node` setup verbs, the `drain` / `resume` / `list-drained` runtime state surface shipped across PRs #112 (C1), #113 (C2), #118 (C3.1), C3.2 follow-up, `--node install` (C3.3, PR #128), and `GET /v1/debug/vram` VRAM observability endpoint (C3.4, PR #142).
4+
**Scope:** the `mesh.yml` declarative config, the `init-mesh` / `add-node` / `remove-node` setup verbs, the `drain` / `resume` / `list-drained` runtime state surface shipped across PRs #112 (C1), #113 (C2), #118 (C3.1), C3.2 follow-up, `--node install` (C3.3, PR #128), `GET /v1/debug/vram` VRAM observability endpoint (C3.4, PR #142), and `--node update` / `--node restart` (C3.5) + `--node llm` (C3.6).
55

66
If you are debugging a network-layer problem (DNS, TCP, TLS, Cloudflare tunnel) start in [peer_leader_connectivity.md](peer_leader_connectivity.md) instead. This doc covers symptoms above the network layer — the cluster is reachable, but routing is doing the wrong thing.
77

@@ -112,6 +112,26 @@ Expected if the install **failed**. Check the exit code (non-zero) and the stder
112112

113113
If the install **succeeded** (exit 0, "Resumed 'X' after install" in stdout) but the node is still drained, check whether you pre-drained it before the install — the was-drained sticky rule means pre-drained nodes never get auto-resumed. Run `maxim peer --node <name> resume` to clear it.
114114

115+
### Walkthrough: update, restart, or swap LLM on a named mesh node (Plan 4 C3.5 + C3.6)
116+
117+
```bash
118+
maxim peer --node mac-studio update # drain → update → resume
119+
maxim peer --node mac-studio update --dry-run # preview only, no drain
120+
maxim peer --node mac-studio update --branch dev # target a specific branch
121+
maxim peer --node mac-studio restart # drain → restart → resume
122+
maxim peer --node mac-studio llm qwen2.5-14b # drain → swap → resume
123+
```
124+
125+
These three verbs follow **the same drain → op → resume composition pattern** as `--node install` (C3.3). The HTTP wire-level logic lives in shared core functions in [admin_core.py](../../src/maxim/peer/admin_core.py) (single source of truth, CI grep enforced). The mesh CLI wires the drain/resume bookkeeping around the core.
126+
127+
All three verbs share the same invariants:
128+
- **Self-guard:** refuses to operate on `mesh.yml::self` (use the direct `maxim peer update`/`restart`/`llm` verbs instead)
129+
- **Was-drained sticky semantics:** if the node was already drained, the verb skips drain/resume bookkeeping
130+
- **Failure leaves drained:** if the operation fails, the node stays drained with a loud hint
131+
- **Exit code 3:** operation succeeded but resume failed (same as install)
132+
133+
**`--node update --dry-run`** is special: it skips the drain step entirely (preview is read-only). The self-guard still fires even for dry-run.
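The invariants listed in this walkthrough can be sketched as one small control-flow function. This is an illustration under stated assumptions, not the code in `admin_core.py` or the mesh CLI: `run_mesh_verb`, `InMemoryDrainState`, and the `op` callable are hypothetical names, and the in-memory set stands in for the real drain state (which the C3.3 notes describe as file-locked via `drain_node_if_absent`). Only the exit codes 0/1/2/3 and the behaviors come from the docs above.

```python
from typing import Callable, Set

# Exit-code contract from the docs: 0 ok, 1 failed, 2 refused, 3 resume failed.
EXIT_OK, EXIT_FAILED, EXIT_REFUSED, EXIT_RESUME_FAILED = 0, 1, 2, 3


class InMemoryDrainState:
    """Hypothetical stand-in for the persisted drain state."""

    def __init__(self) -> None:
        self.drained: Set[str] = set()

    def drain_if_absent(self, node: str) -> bool:
        """Drain `node`; return True only if this call newly drained it."""
        if node in self.drained:
            return False
        self.drained.add(node)
        return True

    def resume(self, node: str) -> None:
        self.drained.discard(node)


def run_mesh_verb(node: str, self_name: str, state: InMemoryDrainState,
                  op: Callable[[str], bool], *, read_only: bool = False) -> int:
    """Compose drain -> op -> resume around `op` (hypothetical sketch)."""
    # Self-guard fires first, even for read-only previews such as --dry-run.
    if node == self_name:
        return EXIT_REFUSED
    # Read-only previews skip the drain step entirely.
    if read_only:
        return EXIT_OK if op(node) else EXIT_FAILED
    newly_drained = state.drain_if_absent(node)
    hint = f"'{node}' is STILL DRAINED: run `maxim peer --node {node} resume`"
    try:
        ok = op(node)
    except BaseException:
        # Any interruption still prints the hint before re-raising.
        print(hint)
        raise
    if not ok:
        # Failure leaves the node drained; no auto-resume.
        print(hint)
        return EXIT_FAILED
    if not newly_drained:
        # Was-drained sticky: the operator's earlier drain stays intact.
        return EXIT_OK
    try:
        state.resume(node)
    except Exception:
        return EXIT_RESUME_FAILED
    return EXIT_OK
```

In this sketch a node that was drained before the verb ran is still drained after a successful operation, which matches the sticky rule above. Returning 3 separately from 1 lets a caller distinguish "operation succeeded but the node is stuck in drain" from "operation failed".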
134+
115135
---
116136

117137
## Symptoms → first place to look

docs/troubleshooting/remote_update.md

Lines changed: 6 additions & 0 deletions
@@ -52,6 +52,12 @@ maxim peer llm qwen2.5-14b
5252

5353
# Check what model is running:
5454
maxim peer llm --status
55+
56+
# Mesh-aware versions (use node names from mesh.yml):
57+
maxim peer --node mac-studio update # drain → update → resume
58+
maxim peer --node mac-studio update --dry-run # preview only, no drain
59+
maxim peer --node mac-studio restart # drain → restart → resume
60+
maxim peer --node mac-studio llm qwen2.5-14b # drain → swap → resume
5561
```
5662

5763
## Decision tree

docs/user/cli-reference.md

Lines changed: 5 additions & 1 deletion
@@ -136,9 +136,13 @@ Subcommands for managing a remote leader node over a Cloudflare tunnel.
136136
| `maxim peer remove-node <name>` | Remove a node from `mesh.yml::nodes`. **Side effect:** clears any drain state for `<name>` with a visible "also cleared from drain state" message so removing a drained node doesn't leave an orphan. Refuses if `<name>` is `mesh.yml::self` (you can't delete the running daemon's own identity — the error message documents the workaround: edit `mesh.yml::self` by hand, restart, then re-remove). Refuses if `mesh.yml` doesn't exist or if removing would leave 0 nodes (parser requires ≥1). (Plan 4 Stage C3.2) |
137137
| `maxim peer --node <name> install <extras_or_packages>` | Mesh-aware install on a named node. Composes **drain → install → resume** around the shared `install_on_target` core in [install_core.py](../../src/maxim/peer/install_core.py). Resolves the target URL + cluster key from `mesh.yml::nodes`, **not** `peer.yml` — note that if the two files have diverged (e.g. after an unrelated cluster-key rotation), this verb sends `mesh.yml::cluster_key` while the positional-URL `maxim peer install` verb sends `peer.yml::api_key`. If you see 401s from one but not the other, the two secrets are out of sync. Accepts comma-separated `KNOWN_EXTRAS` (routed as `pymaxim[<extra>]`) plus raw pip package names in the same token list. **Refuses self-install** (points at local `pip install pymaxim[<extras>]` — no `--force` flag). **Refuses a positional URL** (points at `maxim peer install <extras> <url>` as the no-mesh.yml fallback; URL is redacted to scheme-only in the error message to avoid leaking secrets typed into argv). **Was-drained sticky:** if the operator drained the node BEFORE running install, the verb skips both the drain step AND the auto-resume — the prior drain stays intact after a successful install, so `install` never mutates your drain intent. The was-drained check is atomic with the drain mutation via `drain_node_if_absent` under filelock, closing the TOCTOU window under concurrent admin. **Leaves drained on failure:** if drain succeeds but the install itself fails, the node is left drained with a loud "STILL DRAINED → run `maxim peer --node <name> resume`" message. No auto-resume-on-failure. Exit codes: `0` ok, `1` install failed, `2` refused pre-install (self, unknown, bad tokens, drain failed), **`3` install succeeded but post-install auto-resume failed** (distinguishable from `1` so operators tailing exit codes can tell "upgraded but stuck in drain" from "failed and stuck in drain"). Any mid-install exception (`KeyboardInterrupt` included) still prints the still-drained hint before re-raising. (Plan 4 Stage C3.3) |
138138

139+
| `maxim peer --node <name> update [--dry-run] [--force] [--branch <b>]` | Mesh-aware update on a named node. Composes **drain → update → resume** around the shared `update_on_target` core in [admin_core.py](../../src/maxim/peer/admin_core.py). `--dry-run` previews pending commits without draining (read-only). `--force` stashes a dirty working tree. `--branch <b>` targets a specific branch. The self-guard refuses to update the local node (use `maxim peer update` directly). Same was-drained sticky semantics and exit-code contract as `--node install` (0 ok, 1 failed, 2 refused, 3 resume failed). (Plan 4 Stage C3.5) |
140+
| `maxim peer --node <name> restart` | Mesh-aware restart on a named node. Composes **drain → restart → resume** around the shared `restart_on_target` core. Two-phase recovery poll waits for the proxy to respond (~90s), then waits for the LLM model to load (~150s for large models). Self-guard refuses restarting yourself (use `maxim peer restart` directly). Same exit-code contract as `--node install`. (Plan 4 Stage C3.5) |
141+
| `maxim peer --node <name> llm <model>` | Mesh-aware LLM swap on a named node. Composes **drain → swap → resume** around the shared `llm_swap_on_target` core. Key enabler for C5 capacity-aware routing — per-node model assignment means the router can know which node runs which model. Self-guard refuses swapping yourself (use `maxim peer llm` directly). Same exit-code contract as `--node install`. (Plan 4 Stage C3.6) |
142+
139143
| `GET /v1/debug/vram` | (admin endpoint, not a CLI verb) Returns the leader's live VRAM state as JSON: nvidia-smi ratio, utilization, temperature, projected model footprint from `project_vram_usage()`, spillover/warning flags, and recommended n_ctx. Auth via bearer (cluster key) or localhost. Returns 503 if nvidia-smi is unavailable (not a GPU node). Prerequisite for peer-mode doctor VRAM visibility and C5 capacity-aware routing. (Plan 4 Stage C3.4) |
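The `--node restart` row above describes a two-phase recovery poll: first wait for the proxy to respond, then wait for the model to load. A minimal sketch of that idea follows. The function names, the 2-second interval, and the string return values are assumptions for illustration; only the approximate 90 s and 150 s budgets come from the table, and the real probes in `restart_on_target` may differ.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], timeout_s: float,
               interval_s: float = 2.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `check` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def wait_for_recovery(proxy_up: Callable[[], bool],
                      model_loaded: Callable[[], bool], *,
                      proxy_timeout_s: float = 90.0,   # "~90s" in the table
                      model_timeout_s: float = 150.0,  # "~150s" in the table
                      interval_s: float = 2.0,         # assumed interval
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Phase 1: proxy answers. Phase 2: model finishes loading.
    Returns "ready", "proxy_timeout", or "model_timeout" (hypothetical)."""
    if not poll_until(proxy_up, proxy_timeout_s, interval_s, clock, sleep):
        return "proxy_timeout"
    if not poll_until(model_loaded, model_timeout_s, interval_s, clock, sleep):
        return "model_timeout"
    return "ready"
```

Splitting the wait into two phases with separate budgets lets a caller tell "the daemon never came back" apart from "the daemon is up but the model is still loading". The `clock` and `sleep` parameters are injected only so the loop can be exercised without real delays.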
140144

141-
**Future (post-Stage C3.4):** `--node update` / `--node restart` (C3.5), `--node llm` (C3.6), `/v1/mesh/*` admin API, per-agent rate limiting, request-trace ring buffer, cluster key rotation, router ↔ drain-state coupling (C4 — the actual reactivity gate). Full arc tracked in [docs/plans/reactive_peer_mesh_roadmap.md](../plans/reactive_peer_mesh_roadmap.md).
145+
**Future (post-Stage C3.6):** `/v1/mesh/*` admin API, per-agent rate limiting, request-trace ring buffer, cluster key rotation, C4.6 auto-undrain via periodic health probe. Full arc tracked in [docs/plans/reactive_peer_mesh_roadmap.md](../plans/reactive_peer_mesh_roadmap.md).
142146

143147
### Drain state layer
144148
