Skip to content

Commit ff3c33a

Browse files
authored
Merge pull request #238 from dennys246/fix/roy-g3-preflight
fix(roy): G3 — peer.yml fallback in _preflight_llm
2 parents c93dc60 + 8c69822 commit ff3c33a

4 files changed

Lines changed: 178 additions & 17 deletions

File tree

docs/experiments/14_g3_roy_preflight_probe.md

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
**Date:** 2026-05-11
44
**Plan:** [persona_convergence_crucible.md](../plans/persona_convergence_crucible.md) (Roy harness § Roy-0 iteration log)
5-
**Status:** Shipped; unit-verified end-to-end. Empirical "abort in <2s on unreachable URL" check still recommended on first user-driven run.
5+
**Status:** Shipped + follow-up fold for peer.yml fallback. The G4 Roy-0 re-run on 2026-05-11 surfaced that the original probe was a no-op for the standard peer-with-peer.yml setup (env vars are exported by `apply_peer_config_to_env` only at lane resolution, which happens AFTER `_preflight_llm`); the follow-up reads `~/.config/maxim/peer.yml` directly when env vars are absent.
66
**Companion:** [G4 — substrate-primary cluster_id reward wire](15_g4_cluster_reward_wire.md) (the substrate-primary closure that motivated splitting these as paired PRs).
77

88
## What was caught
@@ -31,6 +31,20 @@ Test seam: production path (no fake `sim_runner`) defaults to `_preflight_llm`.
3131
| Pre-existing flake | `test_context_index.py::test_similar_text_found` (unrelated; documented as load-order-dependent) |
3232
| Failure window | Probe budget ≤ ~3.3s on standard health_check timeouts (first 0.8s + retry 2.5s) |
3333

34+
## Follow-up fold: peer.yml fallback (post-Roy-0)
35+
36+
The 2026-05-11 Roy-0 re-run (G4 empirical validation) revealed that the original probe was silently skipping under the canonical peer-leader setup. `apply_peer_config_to_env` in [runtime/lane_backends.py:1073](../../src/maxim/runtime/lane_backends.py) reads `~/.config/maxim/peer.yml` and exports `MAXIM_LANE_LARGE_REMOTE_*` env vars — but only at lane resolution, which fires AFTER `_preflight_llm`. Operator who runs `maxim roy run` with no env vars exported but a valid `peer.yml` got `result.preflight = {"skipped": True, "reason": "MAXIM_LANE_LARGE_REMOTE_URL not set"}`, leaving the broken-leader failure mode uncaught.
37+
38+
The fix: `_preflight_llm` now reads `peer.yml` directly when env vars are absent, falling back to that config source before deciding to skip. Resolution order:
39+
40+
1. `MAXIM_LANE_LARGE_REMOTE_URL` / `_API_KEY` / `_MODEL` env vars (explicit per-session override).
41+
2. `~/.config/maxim/peer.yml` via `read_peer_config()` (the canonical peer-leader setup).
42+
3. Otherwise: skip the probe (local-LLM / cloud-only setups don't have the 10-min grind failure mode).
43+
44+
`result.preflight.source` field records which path was used (`"env"` or `"peer.yml"`) so operators can verify their config was picked up. Env always wins when both are present.
45+
46+
Regression guards: `TestPreflightHelper::test_peer_yml_fallback_when_env_not_set` (asserts URL/key/model are read from peer.yml when env is absent) + `TestPreflightHelper::test_env_takes_precedence_over_peer_yml` (env wins when both present).
47+
3448
## What this does NOT prove
3549

3650
- That a real `maxim roy run` against an intentionally unreachable URL actually aborts in <3s. Unit-verified, not empirically verified. **Recommended first check:** `MAXIM_LANE_LARGE_REMOTE_URL=https://wrong.example.com maxim roy run docs/plans/roy/roy_0_smoke.yaml` against a known-bad URL — should print `aborted_at="preflight"` and exit quickly.

docs/experiments/protocols/14_g3_preflight_reproduction.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -28,9 +28,11 @@ Each test exercises one branch:
2828
| `TestRoyPreflight::test_preflight_pass_runs_full_iteration` | Passing probe → all 3 arms run, `aborted_at=None`, `preflight.outcome="ok"` + `latency_ms` recorded. |
2929
| `TestRoyPreflight::test_preflight_skipped_when_fake_sim_runner_injected` | Fake `sim_runner` + no explicit `preflight_fn` → probe skipped, `result.preflight == {}`, iteration completes. |
3030
| `TestRoyPreflight::test_preflight_raising_treated_as_failure` | `preflight_fn` raises → treated as preflight failure (no crash), `preflight.outcome="preflight_raised"`. |
31-
| `TestPreflightHelper::test_skips_when_no_remote_url_configured` | No `MAXIM_LANE_LARGE_REMOTE_URL` → returns `(True, {skipped: True})`. |
32-
| `TestPreflightHelper::test_probes_when_remote_url_configured` | URL + key + model env vars → helper calls `_MaximPeerBackend.for_url(url, api_key=k, model=m).health_check()` and surfaces the outcome. |
31+
| `TestPreflightHelper::test_skips_when_no_remote_url_configured` | No env var AND no peer.yml → returns `(True, {skipped: True})`. |
32+
| `TestPreflightHelper::test_probes_when_remote_url_configured` | URL + key + model env vars → helper calls `_MaximPeerBackend.for_url(url, api_key=k, model=m).health_check()` and surfaces the outcome. `info.source == "env"`. |
3333
| `TestPreflightHelper::test_auth_rejected_is_soft_pass` | `auth_rejected` outcome → returns `(True, {soft_pass: True})`. |
34+
| `TestPreflightHelper::test_peer_yml_fallback_when_env_not_set` | No env vars but peer.yml present → reads URL/key/model from peer.yml; `info.source == "peer.yml"`. (Roy-0 re-run fold.) |
35+
| `TestPreflightHelper::test_env_takes_precedence_over_peer_yml` | Both env and peer.yml present → env wins; `info.source == "env"`. |
3436
| `TestPreflightHelper::test_health_check_exception_treated_as_failure` | `health_check` raises → returns `(False, {outcome: "probe_error"})`. |
3537

3638
## B. Live reproduction (unreachable URL, ~3s)

src/maxim/simulation/roy_runner.py

Lines changed: 39 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -304,14 +304,16 @@ def _preflight_llm() -> tuple[bool, dict[str, Any]]:
304304
call — the user pays for ~10 min of wall clock to learn the LLM was
305305
never reachable. This probe makes that failure surface BEFORE priming.
306306
307-
Resolution:
308-
* If ``MAXIM_LANE_LARGE_REMOTE_URL`` is set (peer mode hitting a
309-
leader, the canonical Roy setup), probe via the canonical
310-
:meth:`_MaximPeerBackend.for_url(...).health_check` entry point.
311-
* If the env var is unset (local-LLM leader or cloud-only), skip
312-
the probe and return ``ok``: there's no peer URL to probe, and
313-
the local llama.cpp / cloud failure modes surface fast enough at
314-
first dispatch (no 10-min grind).
307+
Resolution (in order):
308+
* ``MAXIM_LANE_LARGE_REMOTE_URL`` / ``_API_KEY`` / ``_MODEL`` env
309+
vars (the explicit per-session override path).
310+
* ``peer.yml`` at ``~/.config/maxim/peer.yml`` (the canonical
311+
peer-leader setup; runtime applies it via
312+
:func:`apply_peer_config_to_env` at lane resolution, which
313+
happens AFTER ``_preflight_llm`` — so we read the file ourselves
314+
when the env vars are absent).
315+
* Otherwise: skip the probe and return ``ok`` — local llama.cpp /
316+
cloud-only setups don't have the 10-min grind failure mode.
315317
316318
The probe is one HTTP call (the health_check method handles its own
317319
two-stage liveness/readiness budget — do NOT add a retry loop here;
@@ -323,15 +325,39 @@ def _preflight_llm() -> tuple[bool, dict[str, Any]]:
323325
import os
324326

325327
url = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_URL") or "").strip()
328+
api_key = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_API_KEY") or "").strip() or None
329+
model = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_MODEL") or "").strip() or None
330+
source = "env"
331+
332+
# Roy-0 re-measurement (2026-05-11) surfaced a gap: the standard
333+
# peer-with-peer.yml setup doesn't export the env vars at the shell
334+
# level — the runtime applies the file via
335+
# ``apply_peer_config_to_env`` only when lanes are resolved (in
336+
# ``runtime/lane_backends.py``), which happens after this preflight.
337+
# Without the fallback here, peer-with-peer.yml users get a no-op
338+
# preflight even when the leader is dead. Read peer.yml directly so
339+
# the preflight catches that failure mode too.
340+
if not url:
341+
try:
342+
from maxim.peer.config import read_peer_config
343+
344+
cfg = read_peer_config()
345+
if cfg is not None and cfg.url:
346+
url = cfg.url.strip()
347+
api_key = api_key or (cfg.api_key or None)
348+
model = model or (getattr(cfg, "model", None) or None)
349+
source = "peer.yml"
350+
except Exception as e: # noqa: BLE001 — peer.yml read must not crash preflight
351+
logger.debug("preflight: peer.yml read failed: %s", e, exc_info=True)
352+
326353
if not url:
327354
return True, {
328355
"skipped": True,
329-
"reason": "MAXIM_LANE_LARGE_REMOTE_URL not set — local/cloud lane, no peer probe applicable",
356+
"reason": (
357+
"No MAXIM_LANE_LARGE_REMOTE_URL env var and no peer.yml — local/cloud lane, no peer probe applicable"
358+
),
330359
}
331360

332-
api_key = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_API_KEY") or "").strip() or None
333-
model = (os.environ.get("MAXIM_LANE_LARGE_REMOTE_MODEL") or "").strip() or None
334-
335361
try:
336362
from maxim.models.language.maxim_peer_backend import _MaximPeerBackend
337363
except ImportError as e:
@@ -360,6 +386,7 @@ def _preflight_llm() -> tuple[bool, dict[str, Any]]:
360386
"outcome": outcome,
361387
"detail": detail,
362388
"latency_ms": latency_ms,
389+
"source": source, # "env" | "peer.yml" — which config the probe used
363390
}
364391

365392
# ``ok`` and ``auth_rejected`` both mean the listener is alive.

tests/integration/test_roy_runner.py

Lines changed: 120 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -814,7 +814,7 @@ class TestPreflightHelper:
814814
"""
815815

816816
def test_skips_when_no_remote_url_configured(self, monkeypatch):
817-
"""No ``MAXIM_LANE_LARGE_REMOTE_URL`` → skip with reason.
817+
"""No env var AND no peer.yml → skip with reason.
818818
819819
Local-LLM leader and cloud-only configurations have no peer URL
820820
to probe; their failure modes surface fast at first dispatch
@@ -824,10 +824,15 @@ def test_skips_when_no_remote_url_configured(self, monkeypatch):
824824
from maxim.simulation.roy_runner import _preflight_llm
825825

826826
monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_URL", raising=False)
827+
# Roy-0 re-measurement folded in the peer.yml fallback — mock the
828+
# config reader so this test still exercises the "no config at
829+
# all" path even on operator boxes that have a real peer.yml.
830+
monkeypatch.setattr("maxim.peer.config.read_peer_config", lambda: None)
831+
827832
ok, info = _preflight_llm()
828833
assert ok is True
829834
assert info.get("skipped") is True
830-
assert "MAXIM_LANE_LARGE_REMOTE_URL" in info.get("reason", "")
835+
assert "peer.yml" in info.get("reason", "")
831836

832837
def test_probes_when_remote_url_configured(self, monkeypatch):
833838
"""When the URL is set, the helper invokes
@@ -910,6 +915,119 @@ def health_check(self):
910915
assert info["outcome"] == "auth_rejected"
911916
assert info.get("soft_pass") is True
912917

918+
def test_peer_yml_fallback_when_env_not_set(self, monkeypatch):
919+
"""Roy-0 re-measurement (2026-05-11) caught this gap: when the
920+
operator's leader is configured via ``~/.config/maxim/peer.yml``
921+
(the canonical peer-leader setup) and no env vars are exported,
922+
``apply_peer_config_to_env`` only runs at lane resolution — which
923+
happens AFTER ``_preflight_llm``. Without this fallback the
924+
preflight is a no-op for that entire setup class.
925+
"""
926+
from maxim.peer.config import PeerConfig
927+
from maxim.simulation import roy_runner as _rr
928+
929+
monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_URL", raising=False)
930+
monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_API_KEY", raising=False)
931+
monkeypatch.delenv("MAXIM_LANE_LARGE_REMOTE_MODEL", raising=False)
932+
933+
# Patch read_peer_config to return a synthetic config.
934+
fake_cfg = PeerConfig(
935+
url="https://leader.from.peer.yml",
936+
api_key="sk-from-peer-yml",
937+
is_cloud=False,
938+
model="qwen2.5-14b",
939+
)
940+
941+
def fake_read_peer_config():
942+
return fake_cfg
943+
944+
monkeypatch.setattr("maxim.peer.config.read_peer_config", fake_read_peer_config)
945+
946+
captured: dict[str, Any] = {}
947+
948+
class FakeProbeResult:
949+
outcome = "ok"
950+
detail = ""
951+
latency_ms = 25.0
952+
953+
class FakeBackend:
954+
@classmethod
955+
def for_url(cls, url, *, api_key=None, model=None):
956+
captured["url"] = url
957+
captured["api_key"] = api_key
958+
captured["model"] = model
959+
return cls()
960+
961+
def health_check(self):
962+
return FakeProbeResult()
963+
964+
import sys
965+
966+
fake_module = type(sys)("maxim.models.language.maxim_peer_backend")
967+
fake_module._MaximPeerBackend = FakeBackend # type: ignore[attr-defined]
968+
monkeypatch.setitem(sys.modules, "maxim.models.language.maxim_peer_backend", fake_module)
969+
970+
ok, info = _rr._preflight_llm()
971+
assert ok is True
972+
assert info["outcome"] == "ok"
973+
assert info["url"] == "https://leader.from.peer.yml"
974+
assert info.get("source") == "peer.yml"
975+
# The peer.yml-sourced values were threaded into the probe.
976+
assert captured["url"] == "https://leader.from.peer.yml"
977+
assert captured["api_key"] == "sk-from-peer-yml"
978+
assert captured["model"] == "qwen2.5-14b"
979+
980+
def test_env_takes_precedence_over_peer_yml(self, monkeypatch):
981+
"""When both env vars AND peer.yml are present, env wins
982+
(matches ``apply_peer_config_to_env``'s ``_setdefault_nonempty``
983+
semantics elsewhere in the runtime — env is the per-session
984+
override path).
985+
"""
986+
from maxim.peer.config import PeerConfig
987+
from maxim.simulation import roy_runner as _rr
988+
989+
monkeypatch.setenv("MAXIM_LANE_LARGE_REMOTE_URL", "https://leader.from.env")
990+
monkeypatch.setenv("MAXIM_LANE_LARGE_REMOTE_API_KEY", "sk-from-env")
991+
992+
def fake_read_peer_config():
993+
return PeerConfig(
994+
url="https://leader.from.peer.yml", # should be ignored
995+
api_key="sk-from-peer-yml",
996+
is_cloud=False,
997+
model="qwen2.5-14b",
998+
)
999+
1000+
monkeypatch.setattr("maxim.peer.config.read_peer_config", fake_read_peer_config)
1001+
1002+
captured: dict[str, Any] = {}
1003+
1004+
class FakeProbeResult:
1005+
outcome = "ok"
1006+
detail = ""
1007+
latency_ms = 25.0
1008+
1009+
class FakeBackend:
1010+
@classmethod
1011+
def for_url(cls, url, *, api_key=None, model=None):
1012+
captured["url"] = url
1013+
captured["api_key"] = api_key
1014+
return cls()
1015+
1016+
def health_check(self):
1017+
return FakeProbeResult()
1018+
1019+
import sys
1020+
1021+
fake_module = type(sys)("maxim.models.language.maxim_peer_backend")
1022+
fake_module._MaximPeerBackend = FakeBackend # type: ignore[attr-defined]
1023+
monkeypatch.setitem(sys.modules, "maxim.models.language.maxim_peer_backend", fake_module)
1024+
1025+
ok, info = _rr._preflight_llm()
1026+
assert ok is True
1027+
assert info.get("source") == "env"
1028+
assert captured["url"] == "https://leader.from.env"
1029+
assert captured["api_key"] == "sk-from-env"
1030+
9131031
def test_health_check_exception_treated_as_failure(self, monkeypatch):
9141032
"""If health_check raises (network error, import surprise), the
9151033
helper returns ``(False, info_with_probe_error)`` rather than

0 commit comments

Comments
 (0)