|**0.3.0**| SEM learning loop, valence annotation, cerebellum activation, concept decomposition, behavioral convergence (Tier 1+2+3), reactive mesh (C4+C4.5) |**Cross-session learning without fine-tuning.** Agent learns from own actions, persists, behaves differently. 41/41 experiments. | ✅ **CURRENT**|
-
|**0.4**| Tier 3 at scale (20+ seeds), episode boundary enrichment, P5 stress persistence, peer mesh completion (C3.5/C3.6/C4.6) | Learning is robust under variance + load. Substrate persists at 10k+ nodes. Mesh fully operational. |**NEXT**|
+
|**0.4**| Tier 3 at scale (20+ seeds), episode boundary enrichment, P5 stress persistence, peer mesh completion (~~C3.5/C3.6~~ SHIPPED, C4.6) | Learning is robust under variance + load. Substrate persists at 10k+ nodes. Mesh fully operational. |**NEXT**|
|**0.5**| P6 (extinction vs LRU), P8 (sleep replay), B3 (acting coach), B4 (replanning) | Agent forgets appropriately, consolidates offline, has coherent voice, recovers from failures. | Planned |
|**1.0**| All phases passing, B4 gating, behavioral convergence at scale with statistical rigor | Cross-session learning at realistic scale, coherent voice, ongoing research program | Target |
@@ -139,7 +139,7 @@ Three tracks run in parallel:
|**D — Tier 3 at scale**| Run organic learning experiment with 20+ seeds, report mean ± std |~1 session | 0.3 proves the mechanism with 1 run; 0.4 proves it's not a fluke |
|**A — Episode boundaries**| Tool execution boundary + semantic shift detection (Rules 1-2) |~200 LOC | Pre-P5 polish, observe_episode_event is now wired |
|**A — P5 stress persistence**| 10k+ node persistence stress test |~500 LOC | Validates substrate robustness under realistic load |
-
|**C — Peer mesh completion**| C3.5 (`--node update/restart/llm`), C3.6, C4.6 (auto-undrain) |In progress| Complete the reactive mesh story |
docs/plans/reactive_peer_mesh_roadmap.md (4 additions & 4 deletions)
@@ -63,10 +63,10 @@ These complete the manage-the-mesh-by-hand surface. Each is a small ship.
- **C3.3 ✅ SHIPPED (PR #128, 2026-04-15):** `maxim peer --node <name> install <extras>`: mesh-aware install composing drain → install → resume around the shared `install_on_target` core in [install_core.py](../../src/maxim/peer/install_core.py). Cross-confirmed pre-merge review found and folded 17 items, including the probe-cache URL mismatch (CC1) and drain TOCTOU (CC2). A new `drain_node_if_absent` atomic primitive closes the TOCTOU window. Exit code 3 was introduced so that a post-install resume failure is distinguishable from an install failure.
- **C3.4 ✅ SHIPPED (PR #142, 2026-04-17):** `GET /v1/debug/vram` admin endpoint. Returns the live nvidia-smi ratio plus the projected model footprint from `project_vram_usage()` as JSON. Returns 503 when nvidia-smi is unavailable. Auth via bearer or localhost. Also lifted `_current_llama_server_n_ctx` to `leader_proxy.py` as the canonical probe location, and fixed a pre-existing `_is_debug_path`/`_route_debug` desync (deps + install-status bypassed the auth gate). 2-lens pre-merge review, 11 new tests.
-
- **C3.5:** `maxim peer --node <name> update` and `--node restart`: mesh-aware versions of the existing positional-URL verbs, composing drain/op/resume. Same shape as C3.3; will reuse the `install_core.py` pattern (lift a shared `<op>_on_target` core, extend the CI grep allow-list). Probably one PR for both.
-
- **C3.6:** `maxim peer --node <name> llm <model>`: per-node model swap. Today `maxim peer llm <model>` operates on the connected leader only.
+
- **C3.5 ✅ SHIPPED (2026-04-17):** `maxim peer --node <name> update [--dry-run] [--force] [--branch <b>]` and `--node restart`: mesh-aware versions of the existing positional-URL verbs, composing drain → op → resume. HTTP wire-level logic was extracted from `peer/cli.py` into a shared `admin_core.py` (mirrors the `install_core.py` pattern). CI grep allow-lists enforce a single source of truth for `/v1/admin/update` and `/v1/admin/restart`. 2-lens pre-merge review found 1 cross-confirmed BLOCKING issue (dry-run bypassed the self-guard); 6 findings were folded in total. 42 new tests (22 mesh verb + 20 wire-level).
+
- **C3.6 ✅ SHIPPED (2026-04-17):** `maxim peer --node <name> llm <model>`: per-node model swap with drain → swap → resume composition. Key enabler for C5 capacity-aware routing (per-node model assignment). CI grep allow-list for `/v1/admin/llm-swap`. Shipped in the same PR as C3.5.
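
The C3.3 entry above credits an atomic `drain_node_if_absent` primitive with closing the drain TOCTOU window. The following is a minimal sketch of that idea only, not the project's implementation: the `DrainState` class, its method signatures, and the in-memory set are all assumptions made for illustration, and a `threading.Lock` stands in for the file lock the docs mention. The point it shows is that the "was it already drained?" check and the drain mutation happen inside one critical section, so two concurrent callers cannot both believe they performed the drain.

```python
import threading


class DrainState:
    """Hypothetical in-memory stand-in for persisted drain state."""

    def __init__(self):
        # Assumption: a process-local lock replaces the file lock here.
        self._lock = threading.Lock()
        self._drained = set()

    def drain_node_if_absent(self, name):
        """Drain `name` unless it is already drained.

        Returns True if this call performed the drain, False if the node
        was already drained. Check and mutation share one critical section.
        """
        with self._lock:
            if name in self._drained:
                return False
            self._drained.add(name)
            return True

    def resume(self, name):
        """Clear the drain flag for `name` (no-op if it was not drained)."""
        with self._lock:
            self._drained.discard(name)

    def is_drained(self, name):
        with self._lock:
            return name in self._drained
```

A caller can use the boolean return value as its "was-drained" signal: `False` means an operator drained the node earlier, so the caller should leave that drain in place afterwards.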
-
**Estimated effort:** C3.4 + C3.5 + C3.6 ≈ 3 small PRs over 3 sessions. Mostly composition over existing primitives, low review surface each.
docs/troubleshooting/mesh_debug.md (21 additions & 1 deletion)
@@ -1,7 +1,7 @@
# Mesh Debug — operator runbook for the Plan 4 Stage C surface
**Audience:** operators running a Maxim mesh (one leader + N peers, or N peers without a single leader).
-
**Scope:** the `mesh.yml` declarative config, the `init-mesh` / `add-node` / `remove-node` setup verbs, the `drain` / `resume` / `list-drained` runtime state surface shipped across PRs #112 (C1), #113 (C2), #118 (C3.1), C3.2 follow-up, `--node install` (C3.3, PR #128), and `GET /v1/debug/vram` VRAM observability endpoint (C3.4, PR #142).
+
**Scope:** the `mesh.yml` declarative config, the `init-mesh` / `add-node` / `remove-node` setup verbs, the `drain` / `resume` / `list-drained` runtime state surface shipped across PRs #112 (C1), #113 (C2), #118 (C3.1), C3.2 follow-up, `--node install` (C3.3, PR #128), `GET /v1/debug/vram` VRAM observability endpoint (C3.4, PR #142), and `--node update` / `--node restart` (C3.5) + `--node llm` (C3.6).
If you are debugging a network-layer problem (DNS, TCP, TLS, Cloudflare tunnel) start in [peer_leader_connectivity.md](peer_leader_connectivity.md) instead. This doc covers symptoms above the network layer — the cluster is reachable, but routing is doing the wrong thing.
@@ -112,6 +112,26 @@ Expected if the install **failed**. Check the exit code (non-zero) and the stder
If the install **succeeded** (exit 0, "Resumed 'X' after install" in stdout) but the node is still drained, check whether you pre-drained it before the install — the was-drained sticky rule means pre-drained nodes never get auto-resumed. Run `maxim peer --node <name> resume` to clear it.
+
### Walkthrough: update, restart, or swap LLM on a named mesh node (Plan 4 C3.5 + C3.6)
These three verbs follow **the same drain → op → resume composition pattern** as `--node install` (C3.3). The HTTP wire-level logic lives in shared core functions in [admin_core.py](../../src/maxim/peer/admin_core.py) (single source of truth, CI grep enforced). The mesh CLI wires the drain/resume bookkeeping around the core.
+
+
All three verbs share the same invariants:
+
- **Self-guard:** refuses to operate on `mesh.yml::self` (use the direct `maxim peer update`/`restart`/`llm` verbs instead)
+
- **Was-drained sticky semantics:** if the node was already drained, the verb skips drain/resume bookkeeping
+
- **Failure leaves drained:** if the operation fails, the node stays drained with a loud hint
+
- **Exit code 3:** operation succeeded but resume failed (same as install)
+
+
**`--node update --dry-run`** is special: it skips the drain step entirely (preview is read-only). The self-guard still fires even for dry-run.
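
The invariants above can be sketched as a single wrapper. This is an illustrative outline under stated assumptions, not the code in `admin_core.py` or the mesh CLI: `run_mesh_verb`, its parameters, the plain `set` used for drain state, and the hint wording are all hypothetical. The exit-code values follow the contract described in this runbook (0 ok, 1 failed, 2 refused, 3 resume failed), and printing the hint before re-raising an exception mirrors what the CLI reference describes for `--node install`, which is assumed here to carry over.

```python
from typing import Callable, Set

EXIT_OK = 0             # operation succeeded
EXIT_OP_FAILED = 1      # operation failed; node is left drained
EXIT_REFUSED = 2        # refused before doing anything (e.g. self-guard)
EXIT_RESUME_FAILED = 3  # operation succeeded but resume failed


def run_mesh_verb(
    node: str,
    self_node: str,
    drained: Set[str],
    op: Callable[[], bool],
    resume: Callable[[str], bool],
    dry_run: bool = False,
) -> int:
    """Hypothetical drain -> op -> resume wrapper (illustration only)."""
    # Self-guard fires first, even for a dry run.
    if node == self_node:
        print(f"refusing to operate on self ({node}); use the direct verb")
        return EXIT_REFUSED

    # A dry run is read-only, so it never touches drain state.
    if dry_run:
        return EXIT_OK if op() else EXIT_OP_FAILED

    # Was-drained sticky: remember whether the operator drained first.
    # The docs describe this check as atomic; this sketch is not.
    was_drained = node in drained
    if not was_drained:
        drained.add(node)

    def hint() -> None:
        print(f"STILL DRAINED: run `maxim peer --node {node} resume`")

    try:
        ok = op()
    except BaseException:  # includes KeyboardInterrupt
        hint()
        raise
    if not ok:
        hint()
        return EXIT_OP_FAILED

    # Pre-drained nodes are never auto-resumed.
    if was_drained:
        return EXIT_OK

    if not resume(node):
        hint()
        return EXIT_RESUME_FAILED
    return EXIT_OK
```

Keeping the self-guard ahead of the dry-run branch reflects the blocking review finding noted for C3.5, where a dry run had bypassed the self-guard.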
docs/user/cli-reference.md (5 additions & 1 deletion)
@@ -136,9 +136,13 @@ Subcommands for managing a remote leader node over a Cloudflare tunnel.
|`maxim peer remove-node <name>`| Remove a node from `mesh.yml::nodes`. **Side effect:** clears any drain state for `<name>` with a visible "also cleared from drain state" message so removing a drained node doesn't leave an orphan. Refuses if `<name>` is `mesh.yml::self` (you can't delete the running daemon's own identity — the error message documents the workaround: edit `mesh.yml::self` by hand, restart, then re-remove). Refuses if `mesh.yml` doesn't exist or if removing would leave 0 nodes (parser requires ≥1). (Plan 4 Stage C3.2) |
| `maxim peer --node <name> install <extras_or_packages>` | Mesh-aware install on a named node. Composes **drain → install → resume** around the shared `install_on_target` core in [install_core.py](../../src/maxim/peer/install_core.py). Resolves the target URL + cluster key from `mesh.yml::nodes`, **not** `peer.yml`. If the two files have diverged (e.g. after an unrelated cluster-key rotation), this verb sends `mesh.yml::cluster_key` while the positional-URL `maxim peer install` verb sends `peer.yml::api_key`. If you see 401s from one but not the other, the two secrets are out of sync. Accepts comma-separated `KNOWN_EXTRAS` (routed as `pymaxim[<extra>]`) plus raw pip package names in the same token list. **Refuses self-install** (points at local `pip install pymaxim[<extras>]`; there is no `--force` flag). **Refuses a positional URL** (points at `maxim peer install <extras> <url>` as the no-mesh.yml fallback; URL is redacted to scheme-only in the error message to avoid leaking secrets typed into argv). **Was-drained sticky:** if the operator drained the node BEFORE running install, the verb skips both the drain step AND the auto-resume: the prior drain stays intact after a successful install, so `install` never mutates your drain intent. The was-drained check is atomic with the drain mutation via `drain_node_if_absent` under filelock, closing the TOCTOU window under concurrent admin. **Leaves drained on failure:** if drain succeeds but the install itself fails, the node is left drained with a loud "STILL DRAINED → run `maxim peer --node <name> resume`" message. No auto-resume-on-failure. Exit codes: `0` ok, `1` install failed, `2` refused pre-install (self, unknown, bad tokens, drain failed), **`3` install succeeded but post-install auto-resume failed** (distinguishable from `1` so operators tailing exit codes can tell "upgraded but stuck in drain" from "failed and stuck in drain"). Any mid-install exception (`KeyboardInterrupt` included) still prints the still-drained hint before re-raising. (Plan 4 Stage C3.3) |
+
|`maxim peer --node <name> update [--dry-run] [--force] [--branch <b>]`| Mesh-aware update on a named node. Composes **drain → update → resume** around the shared `update_on_target` core in [admin_core.py](../../src/maxim/peer/admin_core.py). `--dry-run` previews pending commits without draining (read-only). `--force` stashes a dirty working tree. The self-guard refuses to update the local node (use `maxim peer update` directly). Same was-drained sticky semantics and exit-code contract as `--node install` (0 ok, 1 failed, 2 refused, 3 resume failed). (Plan 4 Stage C3.5) |
+
|`maxim peer --node <name> restart`| Mesh-aware restart on a named node. Composes **drain → restart → resume** around the shared `restart_on_target` core. Two-phase recovery poll waits for the proxy to respond (~90s), then waits for the LLM model to load (~150s for large models). Self-guard refuses restarting yourself (use `maxim peer restart` directly). Same exit-code contract as `--node install`. (Plan 4 Stage C3.5) |
+
|`maxim peer --node <name> llm <model>`| Mesh-aware LLM swap on a named node. Composes **drain → swap → resume** around the shared `llm_swap_on_target` core. Key enabler for C5 capacity-aware routing — per-node model assignment means the router can know which node runs which model. Self-guard refuses swapping yourself (use `maxim peer llm` directly). Same exit-code contract as `--node install`. (Plan 4 Stage C3.6) |
+
|`GET /v1/debug/vram`| (admin endpoint, not a CLI verb) Returns the leader's live VRAM state as JSON: nvidia-smi ratio, utilization, temperature, projected model footprint from `project_vram_usage()`, spillover/warning flags, and recommended n_ctx. Auth via bearer (cluster key) or localhost. Returns 503 if nvidia-smi is unavailable (not a GPU node). Prerequisite for peer-mode doctor VRAM visibility and C5 capacity-aware routing. (Plan 4 Stage C3.4) |
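
The `--node restart` row above describes a two-phase recovery poll: first wait for the proxy to respond, then wait for the LLM to load. A rough sketch of that shape follows, assuming hypothetical names throughout (`wait_until`, `two_phase_recovery`, the probe callables, the returned status strings, and the 2-second interval are not from the project). The default timeouts simply mirror the approximate ~90s and ~150s figures quoted in the table. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def two_phase_recovery(
    proxy_up: Callable[[], bool],
    model_loaded: Callable[[], bool],
    proxy_timeout_s: float = 90.0,   # assumption: mirrors the "~90s" figure
    model_timeout_s: float = 150.0,  # assumption: mirrors the "~150s" figure
    **poll_kwargs,
) -> str:
    """Phase 1: proxy responds. Phase 2: model finishes loading."""
    if not wait_until(proxy_up, proxy_timeout_s, **poll_kwargs):
        return "proxy-timeout"
    if not wait_until(model_loaded, model_timeout_s, **poll_kwargs):
        return "model-timeout"
    return "ready"
```

Giving each phase its own deadline keeps a slow model load from being misreported as an unreachable proxy, which is presumably why the two waits are described separately.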
-
**Future (post-Stage C3.4):**`--node update` / `--node restart` (C3.5), `--node llm` (C3.6), `/v1/mesh/*` admin API, per-agent rate limiting, request-trace ring buffer, cluster key rotation, router ↔ drain-state coupling (C4 — the actual reactivity gate). Full arc tracked in [docs/plans/reactive_peer_mesh_roadmap.md](../plans/reactive_peer_mesh_roadmap.md).
+
**Future (post-Stage C3.6):**`/v1/mesh/*` admin API, per-agent rate limiting, request-trace ring buffer, cluster key rotation, C4.6 auto-undrain via periodic health probe. Full arc tracked in [docs/plans/reactive_peer_mesh_roadmap.md](../plans/reactive_peer_mesh_roadmap.md).