Merge pull request #237 from dennys246/feat/substrate-primary-cluster-reward-wire

dennys246 · web-flow · commit 9050a5559929 · 2026-05-11T17:57:59.000-06:00
feat(substrate): Roy-0 empirical confirmation + summary rendering
diff --git a/docs/experiments/15_g4_cluster_reward_wire.md b/docs/experiments/15_g4_cluster_reward_wire.md
@@ -2,7 +2,7 @@
 
 **Date:** 2026-05-11
 **Plan:** [grounded_language_acquisition.md § Phase 0 G4](../plans/grounded_language_acquisition.md)
-**Status:** Wire shipped; unit-verified end-to-end. Empirical re-measurement on a live Roy-0 run still pending.
+**Status:** Wire shipped; unit-verified end-to-end; **empirically confirmed on a live Roy-0 run (2026-05-11 14:35-14:51)** — `cluster_reward_bias_l2 = 2.4587` on both A-vs-blank pairs, with the expected `sense_food_source` cluster updates at the `+1.0` per-key cap.
 **Companion:** [G3 — Roy preflight probe](14_g3_roy_preflight_probe.md) (paired PRs; G4 branched from G3).
 
 ## What was caught
@@ -70,15 +70,61 @@ The 6 new tests cover the chain end-to-end:
 
 ## What this DOES prove
 
-- The wire exists end-to-end. Substrate-primary tool outcomes now populate `_cluster_reward_bias`.
+- The wire exists end-to-end. Substrate-primary tool outcomes populate `_cluster_reward_bias`.
 - The dict serialises to `aut_nac.json` under the `cluster_reward_bias` JSON key.
 - `substrate_diff` reads the dict and computes L2 + top deltas when both sides have it.
-- Roy `result.json` will carry `nac.cluster_reward_bias.{available, l2, top_deltas}` so operators can read the metric directly.
+- Roy `result.json` carries `nac.cluster_reward_bias.{available, l2, top_deltas}` so operators can read the metric directly.
 
-## What this does NOT prove
+## Live Roy-0 re-measurement (empirical confirmation)
 
-- That a fresh Roy-0 run will actually populate the dict with substantive entries at sim-time. The wire fires per unit test, but the next gate is empirical: with `min_confidence=0.3` on `NAc.recommend_action` and only a few cluster updates per substrate-primary tick, the proposer may still hit the score-threshold gate before cluster bias accumulates. That's a tuning question (path-specific `min_confidence` for substrate-primary?), answered by a real Roy-0 re-measurement.
-- That `reward_bias_l2` (the per-node ATL recognition bias, distinct from cluster_reward_bias) will become non-zero. That dict populates only via `credit_node` from reaction-driven `distribute_reward`, not from tool outcomes — G4 doesn't touch that path.
+Re-ran `maxim roy run docs/plans/roy/roy_0_smoke.yaml` against the same healthy leader that the pre-G4 Roy-0 run (2026-05-10) used. Wall time: 926.2s (~15.4 min), the same shape as pre-G4. Priming completed 5/5 stages; all 3 arms completed at the warmup fixture's 3-percept exhaustion (`finish_reason=cancel`), unchanged from pre-G4.
+
+**Headline:**
+
+| Pair | `reward_bias_l2` | `cluster_reward_bias_l2` | `causal_link_count_delta` | `goal_reward_bias_l2` |
+|---|---|---|---|---|
+| **a_vs_b** | 0.0 | **2.4587** | +155 | 0.1918 |
+| **a_vs_c** | 0.0 | **2.4587** | +155 | 0.1918 |
+| b_vs_c | 0.0 | 0.2121 | 0 | 0.1918 |
+
+**Top deltas (`a_vs_b` representative; `a_vs_c` identical shape):**
+
+```
+6× tool:sense_food_source  delta=+1.0  (at the per-key cap `max_cluster_reward_bias=1.0`)
+2× tool:infant_humanoid_pick_up  delta=±0.15  (one positive, one negative — substrate is learning the affordance failed)
+```
+
+The 6 `sense_food_source` updates dominate the L2. Each comes from a distinct EC cluster id (the sensor encoder produces a fresh cluster every time drives shift past the min-delta gate, and arm A had a full 50-turn priming run); the substrate-primary path correctly accumulated cluster-keyed positive bias on the only tool it ever successfully invoked. The two `infant_humanoid_pick_up` entries carry opposite signs because arm A's priming hit a failure case the blank arms didn't see: signed evidence that the wire propagates the direction of each outcome, not just its magnitude.
+
+`b_vs_c` shows `cluster_reward_bias_l2 = 0.2121` even though both arms started blank: each arm's 3 test-time turns produced a small number of `infant_humanoid_pick_up` updates that landed on slightly different stochastic cluster ids. This is the expected stochastic noise floor for blank-vs-blank under this fixture; the A-vs-blank ratio of **11.6×** (2.4587 / 0.2121) is the meaningful signal.
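
As a sanity check on the headline table, the reported L2 values are consistent with a plain Euclidean norm over the per-key deltas in the top-deltas block. The sketch below assumes that this is how `substrate_diff` computes `cluster_reward_bias_l2` (the implementation is not shown here) and that the eight listed deltas are the only non-zero ones. `l2_norm` is a hypothetical helper, and the two-key decomposition of `b_vs_c` is an assumption chosen because it matches the reported 0.2121.

```python
import math

def l2_norm(deltas):
    """Euclidean norm of a sequence of per-key deltas."""
    return math.sqrt(sum(d * d for d in deltas))

# A-vs-blank: six sense_food_source keys at the +1.0 cap,
# plus two infant_humanoid_pick_up keys at +0.15 and -0.15.
a_vs_blank = [1.0] * 6 + [0.15, -0.15]

# Blank-vs-blank: ASSUMED to be two pick_up keys at +/-0.15.
b_vs_c = [0.15, -0.15]

l2_a = l2_norm(a_vs_blank)
l2_noise = l2_norm(b_vs_c)
ratio = l2_a / l2_noise

print(f"{l2_a:.4f} {l2_noise:.4f} {ratio:.1f}")  # prints "2.4587 0.2121 11.6"
```

Under those assumptions the cap-limited `sense_food_source` keys contribute 6.0 of the 6.045 squared norm, which is why they dominate the A-vs-blank figure.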
+
+**What changed vs Roy-0 pre-G4:**
+
+| Metric | Pre-G4 (2026-05-10) | Post-G4 (2026-05-11) |
+|---|---|---|
+| `cluster_reward_bias_l2` (a_vs_b) | n/a (field not serialised) | **2.4587** |
+| `cluster_reward_bias.available` | `false` (field absent) | `true` |
+| `causal_link_count_delta` (a_vs_b) | +133 | +155 |
+| `reward_bias_l2` | 0.0 | 0.0 (expected — different code path) |
+| `goal_reward_bias_l2` | 0.196 | 0.192 |
+| Wall time | ~15 min | 15.4 min |
+
+The `cluster_reward_bias` field is the headline metric the wire was built to make legible to `substrate_diff`. Pre-G4 it didn't exist in `aut_nac.json` at all (`substrate_diff` returned `available=false`); post-G4 it's serialised, populated, and differentiates arm A's primed substrate from blank arms at L2 ≈ 2.46.
+
+## Two latent issues surfaced by the live run
+
+These are minor and tracked as follow-ups on the same PRs:
+
+1. **G3 preflight skipped under peer.yml.** `result.preflight = {"skipped": True, "reason": "MAXIM_LANE_LARGE_REMOTE_URL not set"}` even though `~/.config/maxim/peer.yml` carried a valid leader URL. Cause: `apply_peer_config_to_env` in [runtime/lane_backends.py:1073](../../src/maxim/runtime/lane_backends.py) only runs when lanes are first resolved — that happens after `_preflight_llm`. The preflight is conservative (skip when no URL), which protects local/cloud setups, but it means peer-with-peer.yml users get a no-op preflight. Real broken-leader failure modes are still caught when env vars are exported explicitly.
+2. **`_format_summary` doesn't render `cluster_reward_bias`.** The `summary.md` only shows the old `reward_bias L2 = 0.0000` (correct, but misleading without the new metric next to it). Operators reading `summary.md` instead of `result.json` won't see the headline. Cosmetic fix.
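
The ordering problem in issue 1 can be illustrated with a toy model. Everything in this sketch is a hypothetical stand-in: `apply_peer_config`, `preflight`, the `leader_url` key, and the example URL are invented for illustration and do not mirror the real `apply_peer_config_to_env` or `_preflight_llm` signatures. Only the env var name and the skip payload come from the result above.

```python
ENV_KEY = "MAXIM_LANE_LARGE_REMOTE_URL"

def apply_peer_config(env, peer_config):
    """Copy the leader URL from a peer config into env if not already set."""
    url = peer_config.get("leader_url")  # hypothetical key name
    if url and ENV_KEY not in env:
        env[ENV_KEY] = url

def preflight(env):
    """Skip conservatively when no remote URL is visible in env."""
    url = env.get(ENV_KEY)
    if not url:
        return {"skipped": True, "reason": f"{ENV_KEY} not set"}
    return {"skipped": False, "url": url}

peer_config = {"leader_url": "http://leader.example:8080"}  # made-up URL

# Observed order: preflight runs before the peer config reaches env.
env = {}
before = preflight(env)
apply_peer_config(env, peer_config)

# Alternative order: apply the peer config first, then preflight.
env = {}
apply_peer_config(env, peer_config)
after = preflight(env)

print(before["skipped"], after["skipped"])  # prints "True False"
```

Whether the follow-up reorders the calls or teaches the preflight to read `peer.yml` directly is a design choice this sketch does not settle.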
+
+Both follow-ups land on their respective PRs in this same session.
+
+## What this still does NOT prove
+
+- That the wire would still produce non-zero divergence on a held-out test fixture (Roy-0 reuses the priming arc). Roy-1 with a real holdout is the next test.
+- That `min_confidence=0.3` is the right threshold for substrate-primary cold start. The current run had arm A exposing 6 distinct clusters all on `sense_food_source` — that's a single-tool monoculture, not the cluster diversity Phase 0 wants. Tuning question for the next experiment.
+- That `reward_bias_l2` (the per-ATL-node recognition bias from `credit_node`) will become non-zero. That path is reaction-driven via `distribute_reward`, not tool-outcome-driven. G4 doesn't touch it; it stays 0 by design.
 
 ## Reproduction
 
diff --git a/docs/plans/grounded_language_acquisition.md b/docs/plans/grounded_language_acquisition.md
@@ -104,9 +104,16 @@ So: do the cheap thesis-tests first, then earn the right to the expensive build.
   - `substrate_diff.NacDiff` carries `cluster_reward_bias_{available,l2,top_deltas}`; `substrate_diff_to_json` surfaces the field in `result.json` under `nac.cluster_reward_bias` so Roy iterations can read the metric without re-running the diff.
   - 6 regression tests in [tests/integration/test_substrate_primary_aut.py::TestG4ClusterRewardWire](../../tests/integration/test_substrate_primary_aut.py) cover: proposal envelope stashes cluster_id; record_outcome populates `_cluster_reward_bias` on success/failure; cluster_id=None is a no-op; persistence roundtrip preserves the dict; pre-G4 snapshot loads cleanly; substrate_diff surfaces non-zero L2 when arms differ. All 94 tests in the cluster surface pass; full fast suite green (6484 passed; one pre-existing flake `test_context_index.py::test_similar_text_found` unrelated).
 
+  **Empirical confirmation (2026-05-11 Roy-0 re-measurement):**
+  - `cluster_reward_bias_l2 = 2.4587` on both A-vs-blank pairs (vs `n/a` pre-G4 — the field didn't exist in `aut_nac.json`).
+  - `cluster_reward_bias.available = true` on every pair (vs `false` pre-G4).
+  - Top deltas: 6× `tool:sense_food_source` at the `+1.0` per-key cap (six distinct EC cluster ids accumulated during arm A's 50-turn priming) + 2× `tool:infant_humanoid_pick_up` at ±0.15 (arm A hit a failure case blank arms didn't).
+  - `b_vs_c`: `cluster_reward_bias_l2 = 0.21` (stochastic-cluster noise floor for blank-vs-blank under this fixture); A-vs-blank ratio ≈ **11.6×**.
+  - Full numbers + pre/post comparison: [docs/experiments/15_g4_cluster_reward_wire.md](../experiments/15_g4_cluster_reward_wire.md). Roy-0 iteration log carries the same table.
+
   **What's NOT in this closure (intentional):**
-  - **Empirical re-measurement on a fresh Roy-0 run.** The wire is in place and unit-confirmed end-to-end. A live re-run on the leader would confirm Roy-0's `reward_bias_l2 = 0.0000` → non-zero **and** would expose the next gate (likely the `min_confidence=0.3` threshold at [nac.py:1300](../../src/maxim/decisions/nac.py) — small cluster bias accumulations won't cross it for a while). Deferred to the next user-driven Roy run; the closure here proves the substrate now learns, not how fast.
-  - **Substrate-primary score-threshold tuning.** `min_confidence=0.3` may need to be path-specific (lower for substrate-primary's cold-start regime) — but that's its own measurement question, not the deferred wire.
+  - **Substrate-primary score-threshold tuning.** `min_confidence=0.3` may need to be path-specific (lower for substrate-primary's cold-start regime) — but that's its own measurement question, not the deferred wire. The 2026-05-11 re-run produced 6 cluster updates all on the same tool (`sense_food_source`), suggesting the cold-start path collapses to one drive-affinity match and loops on it. Roy-1 with a diverse holdout fixture is the next test.
+  - **`reward_bias_l2`** (the per-ATL-node recognition bias from `credit_node` via reaction-driven `distribute_reward`) **stays 0** — that's a different code path G4 doesn't touch. Pre-G4 readers conflated it with cluster bias because cluster_reward_bias wasn't serialised; the re-measurement makes the distinction empirically visible.
 
   **Original architectural-gap analysis (preserved for context — the wire that closed is the one this describes):** Roy-0 ran 15 min end-to-end against a healthy leader with `aut_mode=substrate-primary` and produced **zero action proposals** (`proposal=none` × hundreds of loop ticks). Three parallel investigations (static gate trace + commit forensics + persisted-state inspection) converge on the same root cause: **the cluster-keyed reward update wire was explicitly deferred when cluster-keyed action *selection* shipped, so NAc has nothing learned to recommend from.**
 
diff --git a/docs/plans/persona_convergence_crucible.md b/docs/plans/persona_convergence_crucible.md
@@ -237,19 +237,49 @@ What we'd change for Roy-2: [specific next steps]
 
   **What this closure proves:** the wire exists and is unit-confirmed. `NAc.update_cluster_reward` will populate `_cluster_reward_bias` on every substrate-primary tool outcome. `aut_nac.json` will carry the dict so Roy iterations can compare it across arms. `substrate_diff` will report non-zero `cluster_reward_bias_l2` between an arm that learned and a blank arm.
 
-  **What this closure does NOT prove:** that Roy-0 re-run will produce non-zero divergence on a fresh leader. A real re-run is the next empirical step — it would confirm the wire fires at sim-time AND surface the next gate (likely the `min_confidence=0.3` threshold in `NAc.recommend_action`, which won't cross from a few cluster updates alone). That re-measurement is a user-driven Roy run, not a precondition for shipping the wire.
+  **What this closure did NOT prove when shipped:** that a Roy-0 re-run would produce non-zero divergence on a fresh leader. **The re-measurement below has since confirmed this empirically.**
 
   **Implication for Roy-1:** with G4 closed, Roy-1 on substrate-primary is structurally unblocked. The remaining open question for substrate-primary is "how many cluster updates cross the `min_confidence` gate" — measurement, not architecture. LLM-primary remains the validated alternative for persona-convergence methodology validation if substrate-primary's threshold tuning needs more iterations.
 
+**Roy-0 re-measurement (2026-05-11 14:35-14:51 — G4 wire empirically validated):**
+
+Re-ran the same spec against the same healthy leader after merging the G4 wire onto the leader. 926.2s wall (~15.4 min, unchanged from pre-G4). Priming completed 5/5 stages; all 3 arms completed at the warmup fixture's 3-percept exhaustion (`finish_reason=cancel`, unchanged).
+
+| Pair | `reward_bias_l2` | **`cluster_reward_bias_l2`** | `causal_link_count_delta` |
+|---|---|---|---|
+| **a_vs_b** | 0.0 | **2.4587** | +155 |
+| **a_vs_c** | 0.0 | **2.4587** | +155 |
+| b_vs_c | 0.0 | 0.2121 | 0 |
+
+**A-vs-blank top deltas:** 6× `tool:sense_food_source` at the `+1.0` per-key cap (six distinct EC cluster ids accumulated during arm A's 50-turn priming), plus 2× `tool:infant_humanoid_pick_up` at ±0.15 (one positive, one negative — arm A's priming hit a failure case the blank arms didn't). `b_vs_c` shows the stochastic-cluster-id noise floor for blank-vs-blank under this fixture; **A-vs-blank ratio is ~11.6×**, the meaningful signal.
+
+**Pre-G4 → post-G4 comparison:**
+
+| Metric | Pre-G4 (2026-05-10) | Post-G4 (2026-05-11) |
+|---|---|---|
+| `cluster_reward_bias_l2` (a_vs_b) | n/a (field not serialised) | **2.4587** |
+| `cluster_reward_bias.available` | `false` (field absent in JSON) | `true` |
+| `reward_bias_l2` | 0.0 | 0.0 (expected — different code path; G4 doesn't touch `credit_node`) |
+
+The Phase 0 architectural-gap writeup ([grounded_language_acquisition.md](grounded_language_acquisition.md)) and the [G4 experiment outcome doc](../experiments/15_g4_cluster_reward_wire.md) carry the full empirical detail. Reproduction runbook: [protocols/15_g4_cluster_reward_wire_reproduction.md](../experiments/protocols/15_g4_cluster_reward_wire_reproduction.md).
+
+**Two latent issues surfaced by the live run (tracked as follow-ups on the same PRs):**
+
+- **G3 preflight skipped under peer.yml.** Result reports `preflight = {skipped: True, reason: "MAXIM_LANE_LARGE_REMOTE_URL not set"}` despite `~/.config/maxim/peer.yml` carrying a valid leader URL. `apply_peer_config_to_env` in [lane_backends.py](../../src/maxim/runtime/lane_backends.py) only runs at lane resolution, which happens after `_preflight_llm`. The conservative skip protects local/cloud setups, but it means peer-with-peer.yml users get a no-op preflight. Real broken-leader failure modes are still caught when env vars are exported explicitly.
+- **`_format_summary` doesn't surface `cluster_reward_bias`.** `summary.md` shows only the old `reward_bias L2 = 0.0000`. JSON has the right data; rendering is the gap. Cosmetic.
+
 **What to change before Roy-1 (concrete next steps, prioritised):**
 
-1. ~~**G4 (blocking — substrate-primary track)**~~ — **CLOSED in this session.** See above.
-2. **Roy-0 re-measurement (recommended next):** rerun `maxim roy run docs/plans/roy/roy_0_smoke.yaml` against the leader to confirm `cluster_reward_bias_l2 > 0` on the A-vs-blank pair (the first empirical proof of the closed wire). Expect `reward_bias_l2` to remain near 0 (that's the per-node ATL recognition bias, populated by reaction-driven `distribute_reward`, not the G4 wire). If cluster_reward_bias_l2 ALSO comes back 0, the next gate is the score threshold at [nac.py:1300](../../src/maxim/decisions/nac.py) — substrate-primary may need a path-specific `min_confidence` lower than 0.3.
-3. **Roy-1 needs a held-out test fixture distinct from the priming arc.** Reusing cradle_prelinguistic warmup for both means the test scenario doesn't actually test generalisation. Hand-author `scenarios/roy/roy_1_holdout.yaml` with novel + familiar + unrelated stimuli per the methodology table.
+1. ~~**G4 (blocking — substrate-primary track)**~~ — **CLOSED + empirically confirmed.** See re-measurement table above.
+2. **Roy-1 needs a held-out test fixture distinct from the priming arc.** Reusing cradle_prelinguistic warmup for both means the test scenario doesn't actually test generalisation. Hand-author `scenarios/roy/roy_1_holdout.yaml` with novel + familiar + unrelated stimuli per the methodology table.
+3. **Cluster monoculture during priming.** Arm A accumulated 6 distinct cluster ids all on `sense_food_source` — single-tool exposure, not the cluster diversity Phase 0 wants. The substrate-primary cold-start regime is picking one drive-affinity tool and looping on it. Diagnostic for the next experiment: does Roy-1 with a diverse fixture produce cross-tool cluster bias, or does it still collapse to one tool?
 4. **G2 (cosmetic):** gate `simulation/spinner.py` on interactive mode or `stderr.isatty()`. Spinner ANSI pollutes JSONL logs during script runs.
 5. **G5/G6 (environmental):** auto-spawn path mismatch (claude-sonnet GGUF vs qwen2.5 profile) and smollm auto-download blocked by non-TTY. Pre-existing, outside Roy code.
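
For item 4, one way the gate could look is sketched below. `spinner_enabled` and its `interactive` override are hypothetical names, not taken from `simulation/spinner.py`; the only real API relied on is the standard `isatty()` method on stream objects.

```python
import io
import sys

def spinner_enabled(stream=None, interactive=None):
    """Decide whether to draw the spinner.

    An explicit interactive flag wins; otherwise fall back to whether the
    target stream is attached to a terminal.
    """
    if interactive is not None:
        return bool(interactive)
    stream = sys.stderr if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    return bool(isatty()) if callable(isatty) else False

# An in-memory log buffer is not a TTY, so the spinner stays off and no
# ANSI escape codes reach the JSONL output.
log_buffer = io.StringIO()
print(spinner_enabled(stream=log_buffer))                    # prints "False"
print(spinner_enabled(stream=log_buffer, interactive=True))  # prints "True"
```

Checking the stream rather than a global flag keeps script runs clean without requiring callers to pass anything new.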
 
-**Artifacts:** [`result.json`](/Users/dennyschaedig/.maxim/roy/roy-0-smoke/result.json) · protocol [`roy_0_smoke.md`](../experiments/protocols/roy_0_smoke.md) · spec [`roy_0_smoke.yaml`](./roy/roy_0_smoke.yaml) · LLM trace `/tmp/roy_0_live.jsonl` (23 peer_backend_call events)
+**Artifacts:**
+- Pre-G4 (2026-05-10): `~/.maxim/roy/roy-0-smoke/result.json` (overwritten by the re-measurement; pre-G4 snapshot lives in `~/.maxim/sim_reports/20260510_*` session dirs). LLM trace `/tmp/roy_0_live.jsonl` (23 peer_backend_call events).
+- Post-G4 (2026-05-11): [`result.json`](/Users/dennyschaedig/.maxim/roy/roy-0-smoke/result.json) carries the new `cluster_reward_bias` field. LLM trace `/tmp/roy_g4_live/roy.jsonl`.
+- Protocol: [`roy_0_smoke.md`](../experiments/protocols/roy_0_smoke.md). Spec: [`roy_0_smoke.yaml`](./roy/roy_0_smoke.yaml).
 <!-- /roy-iteration:roy-0-smoke -->
 
 Empty until Roy-1 runs.
diff --git a/src/maxim/simulation/roy_runner.py b/src/maxim/simulation/roy_runner.py
@@ -873,6 +873,19 @@ def _format_summary(r: RoyIterationResult) -> str:
                 f"  - NAc: reward_bias L2={nac.get('reward_bias_l2', 0.0):.4f}  "
                 f"causal_link Δ={nac.get('causal_link_count_delta', 0):+d}"
             )
+            # G4: cluster-keyed reward bias (Track 2 of
+            # grounded_language_acquisition.md Phase 0+). Surface it on its
+            # own line so operators reading summary.md see the
+            # substrate-primary learning signal even when reward_bias_l2 is
+            # 0. That is always the case for tool-outcome-driven runs:
+            # reward_bias is populated by credit_node from reaction-driven
+            # distribute_reward, not by tool outcomes.
+            crb = nac.get("cluster_reward_bias", {})
+            if crb.get("available"):
+                lines.append(
+                    f"  - NAc cluster_reward_bias L2={crb.get('l2', 0.0):.4f}  "
+                    f"({len(crb.get('top_deltas', []))} keys differ)"
+                )
         if ec.get("available"):
             lines.append(f"  - EC: nodes Δ={ec.get('node_count_delta', 0):+d}")
         if hipp.get("available"):
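
To see what the new summary line looks like, the sketch below lifts the two NAc f-strings from the hunk above into a standalone function. `render_nac_lines` is a hypothetical wrapper, not part of `roy_runner.py`, and the shape of each `top_deltas` entry is unknown here, so placeholder dicts stand in for them; the numeric values are the ones reported for `a_vs_b`.

```python
def render_nac_lines(nac):
    """Render the NAc summary lines for one arm pair."""
    lines = [
        f"  - NAc: reward_bias L2={nac.get('reward_bias_l2', 0.0):.4f}  "
        f"causal_link Δ={nac.get('causal_link_count_delta', 0):+d}"
    ]
    crb = nac.get("cluster_reward_bias", {})
    if crb.get("available"):
        lines.append(
            f"  - NAc cluster_reward_bias L2={crb.get('l2', 0.0):.4f}  "
            f"({len(crb.get('top_deltas', []))} keys differ)"
        )
    return lines

# Post-G4 shape: the field is present and marked available.
post_g4 = {
    "reward_bias_l2": 0.0,
    "causal_link_count_delta": 155,
    "cluster_reward_bias": {
        "available": True,
        "l2": 2.4587,
        # Placeholder entries; the real entry schema is not shown here.
        "top_deltas": [{"key": f"placeholder-{i}"} for i in range(8)],
    },
}

# Pre-G4 shape: the field is absent, so only the old line renders.
pre_g4 = {"reward_bias_l2": 0.0, "causal_link_count_delta": 133}

for line in render_nac_lines(post_g4):
    print(line)
```

Because the new line is gated on `available`, a pre-G4 `result.json` renders exactly as before.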