Collapsed verbose per-stage Plan 4 entries into compact shipped summaries.
Added 5 root plan files that were missing from the index (reactive mesh
roadmap, cross-platform file lock, mesh doc transport, pain bus bridge
unification, node security simplification).
Biosystem unification entry condensed from 7 lines to 1 (all waves archived).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
-[tool_refinement_plan.md](tool_refinement_plan.md) — living doc for agent tool surface curation
-[agent_factory_canonicalization.md](agent_factory_canonicalization.md) — **RUNNING DOC, not scheduled** (2026-04-14). The Option D follow-up to `executor_bootstrap_unification.md` — make `AgentFactory.create_agent` the only door for constructing an agent in Maxim. Becomes a downhill rewrite once Wave 3 `build_bio_stack` merges. Subsumes `sem_execution_hook.md` Stage 2b. Trigger conditions documented inline.
-[llm_path_refinement.md](llm_path_refinement.md) — meta-plan for the LLM routing path refactor. Motivated by two 2026-04-12 peer-leader incidents + an audit that revealed `_OpenAIBackend` has a hidden ~52s retry loop. **Ships as the 0.4 stability version.** Plans 1, 2, 3, 3.5 fully shipped and archived; Plan 3.6 R5 (VRAM spillover detection) shipped; Plan 4 Stages A+B (agent_id observability + recovery-time bench) shipped; **Plan 4 Stage C1+C2+C3.1 shipped** (`mesh.yml` schema + `list-nodes` + drain/resume + `init-mesh`); **C3.2+C3.3+C3.4 shipped** (`add-node`/`remove-node`, `--node install`, `/v1/debug/vram` VRAM endpoint); C3.5/C3.6 + C4 (router-drain coupling) remain in scope. Authoritative architecture reference at [../architecture/llm_routing.md](../architecture/llm_routing.md); stress test protocol at [../experiments/protocols/llm_path_stress_test.md](../experiments/protocols/llm_path_stress_test.md).
-
-**✅ Plan 1 (Foundation) — SHIPPED, ARCHIVED** → [archive/llm_path_foundation.md](archive/llm_path_foundation.md). R0 deleted ~1,250 LOC dead mesh (commit `e811787`); R1 shipped `maxim/utils/http.py` with endpoint registry + typed `HTTPError` + `RequestContext` contextvars + `X-Maxim-*` header propagation (PRs #88, #90, #91). See [project_llm_path_r1_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_llm_path_r1_shipped.md).
-
-**✅ Plan 2 (Typed Errors + Role Detection) — SHIPPED, ARCHIVED** → [archive/llm_path_typed_errors.md](archive/llm_path_typed_errors.md). R2a-d: role detection at CLI boot, typed `BackendError` hierarchy with `.fix_hint`, two-stage probe, SSRF moved to `utils/net.py` (PRs #92, #93). See [project_llm_path_r2_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_llm_path_r2_shipped.md).
-
-**✅ Plan 3 (Fast Failover) — SHIPPED, ARCHIVED** → [archive/llm_path_fast_failover.md](archive/llm_path_fast_failover.md). R2.5 `_MaximPeerBackend` purpose-built single-HTTP-call backend + router typed-exception dispatch + `BACKEND_CLASSES`; R2.6 probe consolidation. **The 52s fail-slow is dead.** PR #94, commit `ce5f034`. Programmatic gate: < 5s p99 against mocked-dead-peer fixture. See [project_llm_path_r3_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_llm_path_r3_shipped.md) for the 10 load-bearing invariants.
-
-**✅ Plan 3.5 (Cancellation Hygiene) — SHIPPED, ARCHIVED** → [archive/llm_path_cancellation_hygiene.md](archive/llm_path_cancellation_hygiene.md). R1-R6: cooperative cancellation primitives in `maxim/utils/cancellation.py` + "HTTP fires first" timeout contract (HTTP authoritative at 300s, agent layer strict safety net above). PR #96, commit `6a4f505`. See [project_llm_path_cancellation_hygiene_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_llm_path_cancellation_hygiene_shipped.md).
-
-[llm_path_peer_failover.md](llm_path_peer_failover.md) — **Plan 3.6: Peer Failover — PARTIAL SHIP (2026-04-14).** **R5 VRAM spillover detection ✅ SHIPPED** (PR #99, commit `2884e58`): doctor `check_vram_pressure` + spawn-time `_check_vram_spillover_risk` + shared `project_vram_usage` math + fix for pre-existing `check_llm_model_active` mutable-global bug. Dynamic headroom `max(1.5, 0.55 × weights_gb)` calibrated to the 2026-04-13 incident. R1–R4 (multi-leader `peer.yml`) **remain draft** — on hold until the user's second GPU comes online. See [project_vram_spillover_detection_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_vram_spillover_detection_shipped.md) for the 5 R5 load-bearing invariants.
-
-[llm_path_operator_visibility.md](llm_path_operator_visibility.md) — **Plan 4: Operator Visibility — PARTIAL SHIP (2026-04-14).** Split into three sequential stages:
-
-**✅ Stage A — agent_id observability fix** (PR in review on `feat/llm-path-operator-visibility`). Three complementary changes close the Phase D observability gap: router capability-flag kwarg forwarding, `set_context` boundary binding in `LLMWorker._call_llm_with_timeout`, and contextvar fallback in `_normalize_request_context`. 11 new regression tests.
-
-**✅ Stage B — recovery-time bench harness** (same PR). New `maxim bench recovery-time` CLI subcommand at `src/maxim/bench/` (NOT `benchmark/` — name collision with `maxim.api.benchmark` public verb). Uses `_MaximPeerBackend` directly to measure peer recovery without sim-cadence workload artifacts. 21 new tests. **Phase D2 hardware validation:** 58.68s recovery window on real RTX 5080 (matches 53s leader self-report + ~5s proxy gap), 750/750 `agent_id` coverage, 199/199 typed `BackendDown` failures, fast-fail p99=614ms. See [llm_path_stress_plan4_20260414.md](../experiments/results/llm_path_stress_plan4_20260414.md) and [bench_recovery_time_rerun.md](../experiments/protocols/bench_recovery_time_rerun.md). See [project_llm_path_operator_visibility_ab_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_llm_path_operator_visibility_ab_shipped.md) for the 8 load-bearing invariants.
-
-**✅ Stage C1 — `mesh.yml` schema + `list-nodes` + `--node status|health`** (PR #108, merged 2026-04-14). Hand-rolled `mesh.yml` parser (FROZEN dialect — no PyYAML), `peer/probe_classify.py` shared classifier (single source of truth for probe outcome → CheckResult mapping across mesh_cli + doctor), peer.yml→mesh fallback for zero-breaking-change. 2 review rounds caught 31 findings incl. silent `cluster_key` `#` truncation. See [project_plan4_c1_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_plan4_c1_shipped.md).
-
-**✅ Stage C2 — drain/resume with runtime state layer** (PR #113, merged 2026-04-14). `~/.maxim/util/drained_nodes.{role}.txt` with `filelock.FileLock` cross-process serialization. Pivoted from Option A1 (config-only drain + TOML migration) to Option B (runtime state) after pre-design review caught 3 criticals. Pre-merge review caught 15 findings incl. 1 triple-confirmed orphan retry_id bug. Renamed in-house `maxim/utils/filelock.py` → `process_lock.py` to avoid name collision with 3rd-party. New `atomic_write_secret` wrapper. See [project_plan4_c2_shipped.md](../../.claude/projects/-Users-dennyschaedig-Scripts-Maxim/memory/project_plan4_c2_shipped.md).
-
-**✅ Stage C3.1 — `init-mesh` verb + `mesh.yml` writer infrastructure** (PR #118, merged 2026-04-14). `MeshConfig.to_yaml()` round-trip serializer + `write_mesh_config()` disk-I/O wrapper using `atomic_write_secret`. Strict CI grep allow-list enforces `write_mesh_config` callers (only `mesh_setup.py` + tests). Pre-merge review caught 14 findings incl. 6 cross-confirmed. See `project_plan4_c3.1_shipped.md` (will be added to memory).
-
-**🚧 Stage C3.2 — `add-node` + `remove-node` verbs** (PR pending, branch `feat/plan4-c3.2-add-remove-node`, fold complete). Closes the gap C3.1 left open: operators can now grow/shrink `mesh.yml::nodes` from the CLI without hand-editing. Renamed `init_mesh.py` → `mesh_setup.py` to group the 3 setup verbs. `MeshConfig.__post_init__` now validates `self_name in nodes` (hoisted from parser per A1 cross-confirmed fold). Pre-merge review caught 16 findings incl. 4 cross-confirmed. See `project_plan4_c3.2_shipped.md` (will be added to memory).
-
-**Stage C3 remaining (DEFERRED):** `--node install` + VRAM precheck, `--node refresh`, `/v1/mesh/*` admin API, per-agent rate limiting, request-trace ring buffer, cluster key rotation. The full scope is still in [llm_path_operator_visibility.md](llm_path_operator_visibility.md) under "Phases".
-
- Deferred shell plans (revive on stress-test-defined triggers):
-
-[deferred/llm_path_multi_peer_dispatch.md](deferred/llm_path_multi_peer_dispatch.md) — multi-peer reactive overflow with rendezvous-hash distribution. **Partially triggered (2026-04-13)** by the user's RTX 3070 hardware; awaiting Plan 3.6 R1-R4 + Plan 4 Stage C ship.
**Long-term mesh roadmap** (current state → true reactive mesh): see the "Long-term roadmap" section in [llm_path_refinement.md](llm_path_refinement.md). Five concrete steps from leader/peer to peer-to-peer mesh with leader election.
-[reactive_peer_mesh_roadmap.md](reactive_peer_mesh_roadmap.md) — living roadmap for the full reactive peer mesh arc (C3→C9). C3-C4.6 COMPLETE. C5+ remain.
+
-[cross_platform_file_lock.md](cross_platform_file_lock.md) — shell plan to unify `utils/process_lock` and `filelock.FileLock`. Blocks nothing.
+
-[mesh_doc_transport.md](mesh_doc_transport.md) — shell plan for mesh-to-mesh structured doc exchange (C9). Not started.
+
-[pain_bus_bridge_subscriber_unification.md](pain_bus_bridge_subscriber_unification.md) — shell plan for bridge×subscriber attribution-asymmetry fix. Not started.
+
-[llm_path_refinement.md](llm_path_refinement.md) — meta-plan for the LLM routing path refactor. Plans 1-3.5 archived; Plan 3.6 R5 shipped; **Plan 4 Stages A+B + C1-C3.6 + C4+C4.5+C4.6 ALL SHIPPED.** Reactive mesh self-healing loop complete. Only stress phases B/C/E remain in scope. Architecture ref: [../architecture/llm_routing.md](../architecture/llm_routing.md).
+
-[llm_path_peer_failover.md](llm_path_peer_failover.md) — Plan 3.6 R5 (VRAM spillover) ✅ SHIPPED. R1-R4 (multi-leader) remain draft, on hold until second GPU.
+
-[llm_path_operator_visibility.md](llm_path_operator_visibility.md) — Plan 4. **Core stages ALL SHIPPED** (A, B, C1-C3.6, C4, C4.5, C4.6). Remaining deferred scope (admin API, rate limiting, key rotation) tracked in [reactive_peer_mesh_roadmap.md](reactive_peer_mesh_roadmap.md) as C6/C7.
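
For readers unfamiliar with the Plan 3.6 R5 headroom rule quoted in the index (`max(1.5, 0.55 × weights_gb)`), here is a minimal worked sketch. Only the formula comes from the index entry. The function names, the assumption that all quantities are in GB, and the comparison against total VRAM are hypothetical illustrations, not the actual `project_vram_usage` or `_check_vram_spillover_risk` code.

```python
def vram_headroom_gb(weights_gb: float) -> float:
    """Headroom from the quoted rule: max(1.5, 0.55 * weights_gb).

    Assumes weights_gb is the model weight size in GB (an assumption;
    the index entry does not state units beyond the variable name).
    """
    return max(1.5, 0.55 * weights_gb)


def would_spill(weights_gb: float, total_vram_gb: float) -> bool:
    """Hypothetical check: weights plus headroom exceed available VRAM.

    This comparison is an illustrative guess at how the headroom might
    be applied, not the project's real spillover check.
    """
    return weights_gb + vram_headroom_gb(weights_gb) > total_vram_gb


if __name__ == "__main__":
    # Small models hit the 1.5 floor; larger ones scale at 55% of weights.
    for weights in (2.0, 10.0, 12.0):
        print(weights, vram_headroom_gb(weights), would_spill(weights, 16.0))
```

Under this rule the 1.5 floor applies to any model below roughly 2.7 GB of weights (1.5 / 0.55); above that, the headroom grows proportionally with the weights.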