Merge pull request #238 from dennys246/fix/roy-g3-preflight

dennys246 · web-flow · commit ff3c33aaa498 · 2026-05-11T17:57:46.000-06:00
fix(roy): G3 — peer.yml fallback in _preflight_llm
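
The resolution order this change introduces (env vars first, then `peer.yml`, otherwise skip) can be sketched in isolation. This is a minimal illustration, not the code in `roy_runner.py`: `PeerCfgSketch` and `resolve_probe_target` are hypothetical names, and the config is passed in as an argument instead of being read from `~/.config/maxim/peer.yml`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class PeerCfgSketch:
    """Hypothetical stand-in for the object returned by read_peer_config()."""

    url: str = ""
    api_key: Optional[str] = None
    model: Optional[str] = None


def resolve_probe_target(
    env: Mapping[str, str], peer_cfg: Optional[PeerCfgSketch]
) -> dict:
    """Pick the URL/key/model to probe, mirroring the order described above."""
    prefix = "MAXIM_LANE_LARGE_REMOTE_"
    url = (env.get(prefix + "URL") or "").strip()
    api_key = (env.get(prefix + "API_KEY") or "").strip() or None
    model = (env.get(prefix + "MODEL") or "").strip() or None
    source = "env"

    # Fall back to the peer config only when the env var gave no URL.
    # Key and model already set in the env are kept.
    if not url and peer_cfg is not None and peer_cfg.url:
        url = peer_cfg.url.strip()
        api_key = api_key or (peer_cfg.api_key or None)
        model = model or (peer_cfg.model or None)
        source = "peer.yml"

    if not url:
        # Nothing to probe: local-LLM or cloud-only setup.
        return {"skipped": True}
    return {"url": url, "api_key": api_key, "model": model, "source": source}
```

Called with an empty env and a populated config, the sketch reports `source == "peer.yml"`; with a URL in the env it reports `source == "env"` and ignores the config entirely, which is the precedence the two new regression tests pin down.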
diff --git a/docs/experiments/14_g3_roy_preflight_probe.md b/docs/experiments/14_g3_roy_preflight_probe.md
@@ -2,7 +2,7 @@
 
 **Date:** 2026-05-11
 **Plan:** [persona_convergence_crucible.md](../plans/persona_convergence_crucible.md) (Roy harness § Roy-0 iteration log)
-**Status:** Shipped; unit-verified end-to-end. Empirical "abort in <2s on unreachable URL" check still recommended on first user-driven run.
+**Status:** Shipped, with a follow-up fold adding a peer.yml fallback. The G4 Roy-0 re-run on 2026-05-11 showed that the original probe was a no-op for the standard peer-with-peer.yml setup: `apply_peer_config_to_env` exports the env vars only at lane resolution, which happens AFTER `_preflight_llm`. The follow-up reads `~/.config/maxim/peer.yml` directly when the env vars are absent.
 **Companion:** [G4 — substrate-primary cluster_id reward wire](15_g4_cluster_reward_wire.md) (the substrate-primary closure that motivated splitting these as paired PRs).
 
 ## What was caught
@@ -31,6 +31,20 @@ Test seam: production path (no fake `sim_runner`) defaults to `_preflight_llm`.
 | Pre-existing flake | `test_context_index.py::test_similar_text_found` (unrelated; documented as load-order-dependent) |
 | Failure window | Probe budget ≤ ~3.3s on standard health_check timeouts (first 0.8s + retry 2.5s) |
 
+## Follow-up fold: peer.yml fallback (post-Roy-0)
+
+The 2026-05-11 Roy-0 re-run (G4 empirical validation) revealed that the original probe was silently skipped under the canonical peer-leader setup. `apply_peer_config_to_env` in [runtime/lane_backends.py:1073](../../src/maxim/runtime/lane_backends.py) reads `~/.config/maxim/peer.yml` and exports the `MAXIM_LANE_LARGE_REMOTE_*` env vars, but only at lane resolution, which fires AFTER `_preflight_llm`. An operator who ran `maxim roy run` with a valid `peer.yml` but no env vars exported got `result.preflight = {"skipped": True, "reason": "MAXIM_LANE_LARGE_REMOTE_URL not set"}`, leaving the broken-leader failure mode uncaught.
+
+The fix: `_preflight_llm` now reads `peer.yml` directly when env vars are absent, falling back to that config source before deciding to skip. Resolution order:
+
+1. `MAXIM_LANE_LARGE_REMOTE_URL` / `_API_KEY` / `_MODEL` env vars (explicit per-session override).
+2. `~/.config/maxim/peer.yml` via `read_peer_config()` (the canonical peer-leader setup).
+3. Otherwise: skip the probe (local-LLM / cloud-only setups don't have the 10-min grind failure mode).
+
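+The `result.preflight.source` field records which path supplied the URL (`"env"` or `"peer.yml"`) so operators can verify their config was picked up. Env always wins when both are present. When the URL falls back to `peer.yml`, an API key or model already set in the env is still kept.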
+
+Regression guards: `TestPreflightHelper::test_peer_yml_fallback_when_env_not_set` (asserts URL/key/model are read from peer.yml when env is absent) and `TestPreflightHelper::test_env_takes_precedence_over_peer_yml` (asserts env wins when both are present).
+
 ## What this does NOT prove
 
 - That a real `maxim roy run` against an intentionally unreachable URL actually aborts in <3s. Unit-verified, not empirically verified. **Recommended first check:** `MAXIM_LANE_LARGE_REMOTE_URL=https://wrong.example.com maxim roy run docs/plans/roy/roy_0_smoke.yaml` against a known-bad URL — should print `aborted_at="preflight"` and exit quickly.
diff --git a/docs/experiments/protocols/14_g3_preflight_reproduction.md b/docs/experiments/protocols/14_g3_preflight_reproduction.md
@@ -28,9 +28,11 @@ Each test exercises one branch:
 | `TestRoyPreflight::test_preflight_pass_runs_full_iteration` | Passing probe → all 3 arms run, `aborted_at=None`, `preflight.outcome="ok"` + `latency_ms` recorded. |
 | `TestRoyPreflight::test_preflight_skipped_when_fake_sim_runner_injected` | Fake `sim_runner` + no explicit `preflight_fn` → probe skipped, `result.preflight == {}`, iteration completes. |
 | `TestRoyPreflight::test_preflight_raising_treated_as_failure` | `preflight_fn` raises → treated as preflight failure (no crash), `preflight.outcome="preflight_raised"`. |
-| `TestPreflightHelper::test_skips_when_no_remote_url_configured` | No `MAXIM_LANE_LARGE_REMOTE_URL` → returns `(True, {skipped: True})`. |
-| `TestPreflightHelper::test_probes_when_remote_url_configured` | URL + key + model env vars → helper calls `_MaximPeerBackend.for_url(url, api_key=k, model=m).health_check()` and surfaces the outcome. |
+| `TestPreflightHelper::test_skips_when_no_remote_url_configured` | No env var AND no peer.yml → returns `(True, {skipped: True})`. |
+| `TestPreflightHelper::test_probes_when_remote_url_configured` | URL + key + model env vars → helper calls `_MaximPeerBackend.for_url(url, api_key=k, model=m).health_check()` and surfaces the outcome. `info.source == "env"`. |
 | `TestPreflightHelper::test_auth_rejected_is_soft_pass` | `auth_rejected` outcome → returns `(True, {soft_pass: True})`. |
+| `TestPreflightHelper::test_peer_yml_fallback_when_env_not_set` | No env vars but peer.yml present → reads URL/key/model from peer.yml; `info.source == "peer.yml"`. (Roy-0 re-run fold.) |
+| `TestPreflightHelper::test_env_takes_precedence_over_peer_yml` | Both env and peer.yml present → env wins; `info.source == "env"`. |
 | `TestPreflightHelper::test_health_check_exception_treated_as_failure` | `health_check` raises → returns `(False, {outcome: "probe_error"})`. |
 
 ## B. Live reproduction (unreachable URL, ~3s)
diff --git a/src/maxim/simulation/roy_runner.py b/src/maxim/simulation/roy_runner.py
@@ -304,14 +304,16 @@ def _preflight_llm() -> tuple[bool, dict[str, Any]]:
     call — the user pays for ~10 min of wall clock to learn the LLM was
     never reachable. This probe makes that failure surface BEFORE priming.
 
-    Resolution:
-      * If ``MAXIM_LANE_LARGE_REMOTE_URL`` is set (peer mode hitting a
-        leader, the canonical Roy setup), probe via the canonical
-        :meth:`_MaximPeerBackend.for_url(...).health_check` entry point.
-      * If the env var is unset (local-LLM leader or cloud-only), skip
-        the probe and return ``ok``: there's no peer URL to probe, and
-        the local llama.cpp / cloud failure modes surface fast enough at
-        first dispatch (no 10-min grind).
+    Resolution (in order):
+      * ``MAXIM_LANE_LARGE_REMOTE_URL`` / ``_API_KEY`` / ``_MODEL`` env
+        vars (the explicit per-session override path).
+      * ``peer.yml`` at ``~/.config/maxim/peer.yml`` (the canonical
+        peer-leader setup; runtime applies it via
+        :func:`apply_peer_config_to_env` at lane resolution, which
+        happens AFTER ``_preflight_llm`` — so we read the file ourselves
+        when the env vars are absent).
+      * Otherwise: skip the probe and return ``ok`` — local llama.cpp /
+        cloud-only setups don't have the 10-min grind failure mode.
 
     The probe is one HTTP call (the health_check method handles its own
     two-stage liveness/readiness budget — do NOT add a retry loop here;
@@ -323,15 +325,39 @@ def _preflight_llm() -> tuple[bool, dict[str, Any]]:
     import os
 
     url = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_URL") or "").strip()
+    api_key = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_API_KEY") or "").strip() or None
+    model = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_MODEL") or "").strip() or None
+    source = "env"
+
+    # Roy-0 re-measurement (2026-05-11) surfaced a gap: the standard
+    # peer-with-peer.yml setup doesn't export the env vars at the shell
+    # level — the runtime applies the file via
+    # ``apply_peer_config_to_env`` only when lanes are resolved (in
+    # ``runtime/lane_backends.py``), which happens after this preflight.
+    # Without the fallback here, peer-with-peer.yml users get a no-op
+    # preflight even when the leader is dead. Read peer.yml directly so
+    # the preflight catches that failure mode too.
+    if not url:
+        try:
+            from maxim.peer.config import read_peer_config
+
+            cfg = read_peer_config()
+            if cfg is not None and cfg.url:
+                url = cfg.url.strip()
+                api_key = api_key or (cfg.api_key or None)
+                model = model or (getattr(cfg, "model", None) or None)
+                source = "peer.yml"
+        except Exception as e:  # noqa: BLE001 — peer.yml read must not crash preflight
+            logger.debug("preflight: peer.yml read failed: %s", e, exc_info=True)
+
     if not url:
         return True, {
             "skipped": True,
-            "reason": "MAXIM_LANE_LARGE_REMOTE_URL not set — local/cloud lane, no peer probe applicable",
+            "reason": (
+                "No MAXIM_LANE_LARGE_REMOTE_URL env var and no peer.yml — local/cloud lane, no peer probe applicable"
+            ),
         }
 
-    api_key = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_API_KEY") or "").strip() or None
-    model = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_MODEL") or "").strip() or None
-
     try:
         from maxim.models.language.maxim_peer_backend import _MaximPeerBackend
     except ImportError as e:
@@ -360,6 +386,7 @@ def _preflight_llm() -> tuple[bool, dict[str, Any]]:
         "outcome": outcome,
         "detail": detail,
         "latency_ms": latency_ms,
+        "source": source,  # "env" | "peer.yml" — which config the probe used
     }
 
     # ``ok`` and ``auth_rejected`` both mean the listener is alive.
diff --git a/tests/integration/test_roy_runner.py b/tests/integration/test_roy_runner.py
@@ -814,7 +814,7 @@ class TestPreflightHelper:
     """
 
     def test_skips_when_no_remote_url_configured(self, monkeypatch):
-        """No ``MAXIM_LANE_LARGE_REMOTE_URL`` → skip with reason.
+        """No env var AND no peer.yml → skip with reason.
 
         Local-LLM leader and cloud-only configurations have no peer URL
         to probe; their failure modes surface fast at first dispatch
@@ -824,10 +824,15 @@ def test_skips_when_no_remote_url_configured(self, monkeypatch):
         from maxim.simulation.roy_runner import _preflight_llm
 
         monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_URL", raising=False)
+        # Roy-0 re-measurement folded in the peer.yml fallback — mock the
+        # config reader so this test still exercises the "no config at
+        # all" path even on operator boxes that have a real peer.yml.
+        monkeypatch.setattr("maxim.peer.config.read_peer_config", lambda: None)
+
         ok, info = _preflight_llm()
         assert ok is True
         assert info.get("skipped") is True
-        assert "MAXIM_LANE_LARGE_REMOTE_URL" in info.get("reason", "")
+        assert "peer.yml" in info.get("reason", "")
 
     def test_probes_when_remote_url_configured(self, monkeypatch):
         """When the URL is set, the helper invokes
@@ -910,6 +915,119 @@ def health_check(self):
         assert info["outcome"] == "auth_rejected"
         assert info.get("soft_pass") is True
 
+    def test_peer_yml_fallback_when_env_not_set(self, monkeypatch):
+        """Roy-0 re-measurement (2026-05-11) caught this gap: when the
+        operator's leader is configured via ``~/.config/maxim/peer.yml``
+        (the canonical peer-leader setup) and no env vars are exported,
+        ``apply_peer_config_to_env`` only runs at lane resolution — which
+        happens AFTER ``_preflight_llm``. Without this fallback the
+        preflight is a no-op for that entire setup class.
+        """
+        from maxim.peer.config import PeerConfig
+        from maxim.simulation import roy_runner as _rr
+
+        monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_URL", raising=False)
+        monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_API_KEY", raising=False)
+        monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_MODEL", raising=False)
+
+        # Patch read_peer_config to return a synthetic config.
+        fake_cfg = PeerConfig(
+            url="https://leader.from.peer.yml",
+            api_key="sk-from-peer-yml",
+            is_cloud=False,
+            model="qwen2.5-14b",
+        )
+
+        def fake_read_peer_config():
+            return fake_cfg
+
+        monkeypatch.setattr("maxim.peer.config.read_peer_config", fake_read_peer_config)
+
+        captured: dict[str, Any] = {}
+
+        class FakeProbeResult:
+            outcome = "ok"
+            detail = ""
+            latency_ms = 25.0
+
+        class FakeBackend:
+            @classmethod
+            def for_url(cls, url, *, api_key=None, model=None):
+                captured["url"] = url
+                captured["api_key"] = api_key
+                captured["model"] = model
+                return cls()
+
+            def health_check(self):
+                return FakeProbeResult()
+
+        import sys
+
+        fake_module = type(sys)("maxim.models.language.maxim_peer_backend")
+        fake_module._MaximPeerBackend = FakeBackend  # type: ignore[attr-defined]
+        monkeypatch.setitem(sys.modules, "maxim.models.language.maxim_peer_backend", fake_module)
+
+        ok, info = _rr._preflight_llm()
+        assert ok is True
+        assert info["outcome"] == "ok"
+        assert info["url"] == "https://leader.from.peer.yml"
+        assert info.get("source") == "peer.yml"
+        # The peer.yml-sourced values were threaded into the probe.
+        assert captured["url"] == "https://leader.from.peer.yml"
+        assert captured["api_key"] == "sk-from-peer-yml"
+        assert captured["model"] == "qwen2.5-14b"
+
+    def test_env_takes_precedence_over_peer_yml(self, monkeypatch):
+        """When both env vars AND peer.yml are present, env wins
+        (matches ``apply_peer_config_to_env``'s ``_setdefault_nonempty``
+        semantics elsewhere in the runtime — env is the per-session
+        override path).
+        """
+        from maxim.peer.config import PeerConfig
+        from maxim.simulation import roy_runner as _rr
+
+        monkeypatch.setenv("MAXIM_LANE_LARGE_REMOTE_URL", "https://leader.from.env")
+        monkeypatch.setenv("MAXIM_LANE_LARGE_REMOTE_API_KEY", "sk-from-env")
+
+        def fake_read_peer_config():
+            return PeerConfig(
+                url="https://leader.from.peer.yml",  # should be ignored
+                api_key="sk-from-peer-yml",
+                is_cloud=False,
+                model="qwen2.5-14b",
+            )
+
+        monkeypatch.setattr("maxim.peer.config.read_peer_config", fake_read_peer_config)
+
+        captured: dict[str, Any] = {}
+
+        class FakeProbeResult:
+            outcome = "ok"
+            detail = ""
+            latency_ms = 25.0
+
+        class FakeBackend:
+            @classmethod
+            def for_url(cls, url, *, api_key=None, model=None):
+                captured["url"] = url
+                captured["api_key"] = api_key
+                return cls()
+
+            def health_check(self):
+                return FakeProbeResult()
+
+        import sys
+
+        fake_module = type(sys)("maxim.models.language.maxim_peer_backend")
+        fake_module._MaximPeerBackend = FakeBackend  # type: ignore[attr-defined]
+        monkeypatch.setitem(sys.modules, "maxim.models.language.maxim_peer_backend", fake_module)
+
+        ok, info = _rr._preflight_llm()
+        assert ok is True
+        assert info.get("source") == "env"
+        assert captured["url"] == "https://leader.from.env"
+        assert captured["api_key"] == "sk-from-env"
+
     def test_health_check_exception_treated_as_failure(self, monkeypatch):
         """If health_check raises (network error, import surprise), the
         helper returns ``(False, info_with_probe_error)`` rather than