Commit 9050a55

Merge pull request #237 from dennys246/feat/substrate-primary-cluster-reward-wire
feat(substrate): Roy-0 empirical confirmation + summary rendering
2 parents ff3c33a + 35e8536 commit 9050a55

4 files changed

Lines changed: 109 additions & 13 deletions

docs/experiments/15_g4_cluster_reward_wire.md

Lines changed: 52 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
**Date:** 2026-05-11
44
**Plan:** [grounded_language_acquisition.md § Phase 0 G4](../plans/grounded_language_acquisition.md)
5-
**Status:** Wire shipped; unit-verified end-to-end. Empirical re-measurement on a live Roy-0 run still pending.
5+
**Status:** Wire shipped; unit-verified end-to-end; **empirically confirmed on a live Roy-0 run (2026-05-11 14:35-14:51)**: `cluster_reward_bias_l2 = 2.4587` on both A-vs-blank pairs, with the expected `sense_food_source` cluster updates at the `+1.0` per-key cap.
66
**Companion:** [G3 — Roy preflight probe](14_g3_roy_preflight_probe.md) (paired PRs; G4 branched from G3).
77

88
## What was caught
@@ -70,15 +70,61 @@ The 6 new tests cover the chain end-to-end:
7070

7171
## What this DOES prove
7272

73-
- The wire exists end-to-end. Substrate-primary tool outcomes now populate `_cluster_reward_bias`.
73+
- The wire exists end-to-end. Substrate-primary tool outcomes populate `_cluster_reward_bias`.
7474
- The dict serialises to `aut_nac.json` under the `cluster_reward_bias` JSON key.
7575
- `substrate_diff` reads the dict and computes L2 + top deltas when both sides have it.
76-
- Roy `result.json` will carry `nac.cluster_reward_bias.{available, l2, top_deltas}` so operators can read the metric directly.
76+
- Roy `result.json` carries `nac.cluster_reward_bias.{available, l2, top_deltas}` so operators can read the metric directly.
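
As a rough illustration of the L2 and top-delta bullet above, the comparison over two serialised bias dicts might look like the sketch below. The function name `diff_cluster_reward_bias`, the flat `key -> float` dict shape, the example keys, and the `top_n` parameter are assumptions for illustration; the real `substrate_diff` code is not part of this commit's diff.

```python
import math

def diff_cluster_reward_bias(a, b, top_n=5):
    """Compare two bias dicts (hypothetical flat key -> float shape)."""
    # Mirror the "when both sides have it" condition: no dict, no metric.
    if a is None or b is None:
        return {"available": False, "l2": 0.0, "top_deltas": []}
    keys = set(a) | set(b)
    # A key missing on one side counts as 0.0 on that side.
    deltas = {k: a.get(k, 0.0) - b.get(k, 0.0) for k in keys}
    l2 = math.sqrt(sum(d * d for d in deltas.values()))
    # Largest magnitude first; ties broken by key so output is stable.
    top = sorted(
        ((k, d) for k, d in deltas.items() if d != 0.0),
        key=lambda kv: (-abs(kv[1]), kv[0]),
    )[:top_n]
    return {"available": True, "l2": l2, "top_deltas": top}

# Made-up example data, not values from the Roy-0 run.
primed = {"c1|tool:sense_food_source": 1.0, "c2|tool:infant_humanoid_pick_up": 0.15}
blank = {"c9|tool:infant_humanoid_pick_up": 0.15}
result = diff_cluster_reward_bias(primed, blank)
```

Note that the same tool landing on different cluster ids (`c2` vs `c9` here) contributes two separate deltas rather than cancelling out.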
7777

78-
## What this does NOT prove
78+
## Live Roy-0 re-measurement (empirical confirmation)
7979

80-
- That a fresh Roy-0 run will actually populate the dict with substantive entries at sim-time. The wire fires per unit test, but the next gate is empirical: with `min_confidence=0.3` on `NAc.recommend_action` and only a few cluster updates per substrate-primary tick, the proposer may still hit the score-threshold gate before cluster bias accumulates. That's a tuning question (path-specific `min_confidence` for substrate-primary?), answered by a real Roy-0 re-measurement.
81-
- That `reward_bias_l2` (the per-node ATL recognition bias, distinct from cluster_reward_bias) will become non-zero. That dict populates only via `credit_node` from reaction-driven `distribute_reward`, not from tool outcomes — G4 doesn't touch that path.
80+
Re-ran `maxim roy run docs/plans/roy/roy_0_smoke.yaml` against the same healthy leader that the original Roy-0 run (2026-05-10) used. Wall time: 926.2s (~15.4 min), the same shape as pre-G4. Priming completed 5/5 stages; all 3 arms completed at the warmup fixture's 3-percept exhaustion (`finish_reason=cancel`), unchanged from pre-G4.
81+
82+
**Headline:**
83+
84+
| Pair | `reward_bias_l2` | `cluster_reward_bias_l2` | `causal_link_count_delta` | `goal_reward_bias_l2` |
85+
|---|---|---|---|---|
86+
| **a_vs_b** | 0.0 | **2.4587** | +155 | 0.1918 |
87+
| **a_vs_c** | 0.0 | **2.4587** | +155 | 0.1918 |
88+
| b_vs_c | 0.0 | 0.2121 | 0 | 0.1918 |
89+
90+
**Top deltas (`a_vs_b` representative; `a_vs_c` identical shape):**
91+
92+
```
93+
6× tool:sense_food_source delta=+1.0 (at the per-key cap `max_cluster_reward_bias=1.0`)
94+
2× tool:infant_humanoid_pick_up delta=±0.15 (one positive, one negative — substrate is learning the affordance failed)
95+
```
96+
97+
The 6 `sense_food_source` updates dominate the L2. Each comes from a distinct EC cluster id (the sensor encoder produces a fresh cluster every time drives shift past the min-delta gate, and arm A had a full 50-turn priming run); the substrate-primary path correctly accumulated cluster-keyed positive bias on the only tool it ever successfully invoked. The two `infant_humanoid_pick_up` entries differentiated because arm A's priming hit a failure case the blank arms didn't see — net signed evidence the wire propagates outcomes faithfully, not just magnitudes.
98+
99+
`b_vs_c` shows `cluster_reward_bias_l2 = 0.21` even though both arms started blank: each arm's 3 test-time turns produced a small number of `infant_humanoid_pick_up` updates that landed on slightly different stochastic cluster ids. This is the expected stochastic noise floor for blank-vs-blank under this fixture; the A-vs-blank ratio of **11.6×** (2.4587 / 0.2121) is the meaningful signal.
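
The headline numbers can be sanity-checked by hand from the reported top deltas. The sketch below assumes the `a_vs_b` L2 is made up of exactly the eight listed deltas, and that the `b_vs_c` floor comes from two deltas of magnitude 0.15. The second assumption is an inference from the value 0.2121, not something the run output states.

```python
import math

# Deltas as reported for a_vs_b: six keys at the +1.0 cap, two at +/-0.15.
a_vs_b_deltas = [1.0] * 6 + [0.15, -0.15]
a_vs_b_l2 = math.sqrt(sum(d * d for d in a_vs_b_deltas))

# Assumed shape for b_vs_c: two pick_up updates on non-matching cluster ids.
b_vs_c_deltas = [0.15, -0.15]
b_vs_c_l2 = math.sqrt(sum(d * d for d in b_vs_c_deltas))

ratio = a_vs_b_l2 / b_vs_c_l2
print(f"a_vs_b L2 = {a_vs_b_l2:.4f}")
print(f"b_vs_c L2 = {b_vs_c_l2:.4f}")
print(f"ratio     = {ratio:.1f}x")
```

Under these assumptions the arithmetic reproduces the table: sqrt(6 × 1.0² + 2 × 0.15²) = sqrt(6.045) ≈ 2.4587, sqrt(2 × 0.15²) ≈ 0.2121, and a ratio of about 11.6×.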
100+
101+
**What changed vs Roy-0 pre-G4:**
102+
103+
| Metric | Pre-G4 (2026-05-10) | Post-G4 (2026-05-11) |
104+
|---|---|---|
105+
| `cluster_reward_bias_l2` (a_vs_b) | n/a (field not serialised) | **2.4587** |
106+
| `cluster_reward_bias.available` | `false` (field absent) | `true` |
107+
| `causal_link_count_delta` (a_vs_b) | +133 | +155 |
108+
| `reward_bias_l2` | 0.0 | 0.0 (expected — different code path) |
109+
| `goal_reward_bias_l2` | 0.196 | 0.192 |
110+
| Wall time | ~15 min | 15.4 min |
111+
112+
The `cluster_reward_bias` field is the headline metric the wire was built to make legible to `substrate_diff`. Pre-G4 it didn't exist in `aut_nac.json` at all (`substrate_diff` returned `available=false`); post-G4 it's serialised, populated, and differentiates arm A's primed substrate from blank arms at L2 ≈ 2.46.
113+
114+
## Two latent issues surfaced by the live run
115+
116+
These are minor and tracked as follow-ups on the same PRs:
117+
118+
1. **G3 preflight skipped under peer.yml.** `result.preflight = {"skipped": True, "reason": "MAXIM_LANE_LARGE_REMOTE_URL not set"}` even though `~/.config/maxim/peer.yml` carried a valid leader URL. Cause: `apply_peer_config_to_env` in [runtime/lane_backends.py:1073](../../src/maxim/runtime/lane_backends.py) only runs when lanes are first resolved — that happens after `_preflight_llm`. The preflight is conservative (skip when no URL), which protects local/cloud setups, but it means peer-with-peer.yml users get a no-op preflight. Real broken-leader failure modes are still caught when env vars are exported explicitly.
119+
2. **`_format_summary` doesn't render `cluster_reward_bias`.** The `summary.md` only shows the old `reward_bias L2 = 0.0000` (correct, but misleading without the new metric next to it). Operators reading `summary.md` instead of `result.json` won't see the headline. Cosmetic fix.
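
The ordering problem in item 1 can be shown in isolation. In the sketch below, `apply_peer_config` and `preflight` are simplified stand-ins, not the real `apply_peer_config_to_env` or `_preflight_llm`, and the `leader_url` config key and example URL are hypothetical. Only the env-var name and the skip payload are taken from the run output.

```python
ENV_KEY = "MAXIM_LANE_LARGE_REMOTE_URL"

def apply_peer_config(env, peer_cfg):
    """Stand-in: copy the peer config's leader URL into env if unset."""
    url = peer_cfg.get("leader_url")  # hypothetical key name
    if url and ENV_KEY not in env:
        env[ENV_KEY] = url

def preflight(env):
    """Stand-in: skip conservatively when no remote URL is visible."""
    url = env.get(ENV_KEY)
    if not url:
        return {"skipped": True, "reason": f"{ENV_KEY} not set"}
    return {"skipped": False, "url": url}

peer_cfg = {"leader_url": "http://leader.example:8080"}

# Observed order: preflight runs before the peer config reaches the env.
env = {}
before = preflight(env)
apply_peer_config(env, peer_cfg)

# Reordered: peer config is applied first, so preflight sees the URL.
after = preflight(env)
```

Here `before` reports a skip while `after` does not, which is the shape of the fix: resolve peer configuration before the preflight reads the environment.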
120+
121+
Both follow-ups land on their respective PRs in this same session.
122+
123+
## What this still does NOT prove
124+
125+
- That the wire would still produce non-zero divergence on a held-out test fixture (Roy-0 reuses the priming arc). Roy-1 with a real holdout is the next test.
126+
- That `min_confidence=0.3` is the right threshold for substrate-primary cold start. The current run had arm A exposing 6 distinct clusters all on `sense_food_source` — that's a single-tool monoculture, not the cluster diversity Phase 0 wants. Tuning question for the next experiment.
127+
- That `reward_bias_l2` (the per-ATL-node recognition bias from `credit_node`) will become non-zero. That path is reaction-driven via `distribute_reward`, not tool-outcome-driven. G4 doesn't touch it; it stays 0 by design.
82128

83129
## Reproduction
84130

docs/plans/grounded_language_acquisition.md

Lines changed: 9 additions & 2 deletions
@@ -104,9 +104,16 @@ So: do the cheap thesis-tests first, then earn the right to the expensive build.
104104
- `substrate_diff.NacDiff` carries `cluster_reward_bias_{available,l2,top_deltas}`; `substrate_diff_to_json` surfaces the field in `result.json` under `nac.cluster_reward_bias` so Roy iterations can read the metric without re-running the diff.
105105
- 6 regression tests in [tests/integration/test_substrate_primary_aut.py::TestG4ClusterRewardWire](../../tests/integration/test_substrate_primary_aut.py) cover: proposal envelope stashes cluster_id; record_outcome populates `_cluster_reward_bias` on success/failure; cluster_id=None is a no-op; persistence roundtrip preserves the dict; pre-G4 snapshot loads cleanly; substrate_diff surfaces non-zero L2 when arms differ. All 94 tests in the cluster surface pass; full fast suite green (6484 passed; one pre-existing flake `test_context_index.py::test_similar_text_found` unrelated).
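
To make the tested behaviour concrete, here is a minimal sketch of a capped, cluster-keyed reward update. It is not the `NAc` implementation: the class name, the `(cluster_id, tool)` key shape, the symmetric lower bound, and the `step` size of 0.15 are assumptions. Only the `+1.0` per-key cap and the `cluster_id=None` no-op come from this commit's docs.

```python
class ClusterRewardBias:
    """Illustrative tracker; names and step size are hypothetical."""

    def __init__(self, max_cluster_reward_bias=1.0, step=0.15):
        self.cap = max_cluster_reward_bias
        self.step = step  # assumed per-outcome increment
        self.bias = {}

    def record_outcome(self, cluster_id, tool, success):
        # No cluster id stashed on the proposal: nothing to credit.
        if cluster_id is None:
            return
        key = (cluster_id, tool)
        delta = self.step if success else -self.step
        value = self.bias.get(key, 0.0) + delta
        # Clamp to [-cap, +cap] so one key cannot dominate without bound.
        self.bias[key] = max(-self.cap, min(self.cap, value))

tracker = ClusterRewardBias()
tracker.record_outcome(None, "sense_food_source", True)  # no-op
for _ in range(10):
    tracker.record_outcome("ec-1", "sense_food_source", True)
tracker.record_outcome("ec-2", "infant_humanoid_pick_up", False)
```

After ten successes the `ec-1` key sits at the cap rather than at 1.5, and the single failure leaves `ec-2` at a negative bias.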
106106

107+
**Empirical confirmation (2026-05-11 Roy-0 re-measurement):**
108+
- `cluster_reward_bias_l2 = 2.4587` on both A-vs-blank pairs (vs `n/a` pre-G4 — the field didn't exist in `aut_nac.json`).
109+
- `cluster_reward_bias.available = true` on every pair (vs `false` pre-G4).
110+
- Top deltas: 6× `tool:sense_food_source` at the `+1.0` per-key cap (six distinct EC cluster ids accumulated during arm A's 50-turn priming) + 2× `tool:infant_humanoid_pick_up` at ±0.15 (arm A hit a failure case blank arms didn't).
111+
- `b_vs_c` cluster_reward_bias_l2 = 0.21 (stochastic-cluster noise floor for blank-vs-blank under this fixture); A-vs-blank ratio ≈ **11.6×**.
112+
- Full numbers + pre/post comparison: [docs/experiments/15_g4_cluster_reward_wire.md](../experiments/15_g4_cluster_reward_wire.md). Roy-0 iteration log carries the same table.
113+
107114
**What's NOT in this closure (intentional):**
108-
- **Empirical re-measurement on a fresh Roy-0 run.** The wire is in place and unit-confirmed end-to-end. A live re-run on the leader would confirm Roy-0's `reward_bias_l2 = 0.0000` → non-zero **and** would expose the next gate (likely the `min_confidence=0.3` threshold at [nac.py:1300](../../src/maxim/decisions/nac.py) — small cluster bias accumulations won't cross it for a while). Deferred to the next user-driven Roy run; the closure here proves the substrate now learns, not how fast.
109-
- **Substrate-primary score-threshold tuning.** `min_confidence=0.3` may need to be path-specific (lower for substrate-primary's cold-start regime) — but that's its own measurement question, not the deferred wire.
115+
- **Substrate-primary score-threshold tuning.** `min_confidence=0.3` may need to be path-specific (lower for substrate-primary's cold-start regime) — but that's its own measurement question, not the deferred wire. The 2026-05-11 re-run produced 6 cluster updates all on the same tool (`sense_food_source`), suggesting the cold-start path collapses to one drive-affinity match and loops on it. Roy-1 with a diverse holdout fixture is the next test.
116+
- **`reward_bias_l2`** (the per-ATL-node recognition bias from `credit_node` via reaction-driven `distribute_reward`) **stays 0** — that's a different code path G4 doesn't touch. Pre-G4 readers conflated it with cluster bias because cluster_reward_bias wasn't serialised; the re-measurement makes the distinction empirically visible.
110117

111118
**Original architectural-gap analysis (preserved for context — the wire that closed is the one this describes):** Roy-0 ran 15 min end-to-end against a healthy leader with `aut_mode=substrate-primary` and produced **zero action proposals** (`proposal=none` × hundreds of loop ticks). Three parallel investigations (static gate trace + commit forensics + persisted-state inspection) converge on the same root cause: **the cluster-keyed reward update wire was explicitly deferred when cluster-keyed action *selection* shipped, so NAc has nothing learned to recommend from.**
112119

docs/plans/persona_convergence_crucible.md

Lines changed: 35 additions & 5 deletions
@@ -237,19 +237,49 @@ What we'd change for Roy-2: [specific next steps]
237237

238238
**What this closure proves:** the wire exists and is unit-confirmed. `NAc.update_cluster_reward` will populate `_cluster_reward_bias` on every substrate-primary tool outcome. `aut_nac.json` will carry the dict so Roy iterations can compare it across arms. `substrate_diff` will report non-zero `cluster_reward_bias_l2` between an arm that learned and a blank arm.
239239

240-
**What this closure does NOT prove:** that Roy-0 re-run will produce non-zero divergence on a fresh leader. A real re-run is the next empirical step — it would confirm the wire fires at sim-time AND surface the next gate (likely the `min_confidence=0.3` threshold in `NAc.recommend_action`, which won't cross from a few cluster updates alone). That re-measurement is a user-driven Roy run, not a precondition for shipping the wire.
240+
**What this closure did NOT prove when shipped:** that a Roy-0 re-run would produce non-zero divergence on a fresh leader. **Since confirmed empirically; see the re-measurement below.**
241241

242242
**Implication for Roy-1:** with G4 closed, Roy-1 on substrate-primary is structurally unblocked. The remaining open question for substrate-primary is "how many cluster updates cross the `min_confidence` gate" — measurement, not architecture. LLM-primary remains the validated alternative for persona-convergence methodology validation if substrate-primary's threshold tuning needs more iterations.
243243

244+
**Roy-0 re-measurement (2026-05-11 14:35-14:51 — G4 wire empirically validated):**
245+
246+
Re-ran the same spec against the same healthy leader after merging the G4 wire onto the leader. 926.2s wall (~15.4 min, unchanged from pre-G4). Priming completed 5/5 stages; all 3 arms completed at the warmup fixture's 3-percept exhaustion (`finish_reason=cancel`, unchanged).
247+
248+
| Pair | `reward_bias_l2` | **`cluster_reward_bias_l2`** | `causal_link_count_delta` |
249+
|---|---|---|---|
250+
| **a_vs_b** | 0.0 | **2.4587** | +155 |
251+
| **a_vs_c** | 0.0 | **2.4587** | +155 |
252+
| b_vs_c | 0.0 | 0.2121 | 0 |
253+
254+
**A-vs-blank top deltas:** 6× `tool:sense_food_source` at the `+1.0` per-key cap (six distinct EC cluster ids accumulated during arm A's 50-turn priming), plus 2× `tool:infant_humanoid_pick_up` at ±0.15 (one positive, one negative; arm A's priming hit a failure case the blank arms didn't). `b_vs_c` shows the stochastic-cluster-id noise floor for blank-vs-blank under this fixture; the **A-vs-blank ratio is ~11.6×**, which is the meaningful signal.
255+
256+
**Pre-G4 → post-G4 comparison:**
257+
258+
| Metric | Pre-G4 (2026-05-10) | Post-G4 (2026-05-11) |
259+
|---|---|---|
260+
| `cluster_reward_bias_l2` (a_vs_b) | n/a (field not serialised) | **2.4587** |
261+
| `cluster_reward_bias.available` | `false` (field absent in JSON) | `true` |
262+
| `reward_bias_l2` | 0.0 | 0.0 (expected — different code path; G4 doesn't touch `credit_node`) |
263+
264+
The Phase 0 architectural-gap writeup ([grounded_language_acquisition.md](grounded_language_acquisition.md)) and the [G4 experiment outcome doc](../experiments/15_g4_cluster_reward_wire.md) carry the full empirical detail. Reproduction runbook: [protocols/15_g4_cluster_reward_wire_reproduction.md](../experiments/protocols/15_g4_cluster_reward_wire_reproduction.md).
265+
266+
**Two latent issues surfaced by the live run (tracked as follow-ups on the same PRs):**
267+
268+
- **G3 preflight skipped under peer.yml.** Result reports `preflight = {skipped: True, reason: "MAXIM_LANE_LARGE_REMOTE_URL not set"}` despite `~/.config/maxim/peer.yml` carrying a valid leader URL. `apply_peer_config_to_env` in [lane_backends.py](../../src/maxim/runtime/lane_backends.py) only runs at lane resolution — that happens after `_preflight_llm`. Conservative skip protects local/cloud setups; means peer-with-peer.yml users get a no-op preflight. Real broken-leader failure modes are still caught with explicit env-var setup.
269+
- **`_format_summary` doesn't surface `cluster_reward_bias`.** `summary.md` shows only the old `reward_bias L2 = 0.0000`. JSON has the right data; rendering is the gap. Cosmetic.
270+
244271
**What to change before Roy-1 (concrete next steps, prioritised):**
245272

246-
1. ~~**G4 (blocking — substrate-primary track)**~~**CLOSED in this session.** See above.
247-
2. **Roy-0 re-measurement (recommended next):** rerun `maxim roy run docs/plans/roy/roy_0_smoke.yaml` against the leader to confirm `cluster_reward_bias_l2 > 0` on the A-vs-blank pair (the first empirical proof of the closed wire). Expect `reward_bias_l2` to remain near 0 (that's the per-node ATL recognition bias, populated by reaction-driven `distribute_reward`, not the G4 wire). If cluster_reward_bias_l2 ALSO comes back 0, the next gate is the score threshold at [nac.py:1300](../../src/maxim/decisions/nac.py) — substrate-primary may need a path-specific `min_confidence` lower than 0.3.
248-
3. **Roy-1 needs a held-out test fixture distinct from the priming arc.** Reusing cradle_prelinguistic warmup for both means the test scenario doesn't actually test generalisation. Hand-author `scenarios/roy/roy_1_holdout.yaml` with novel + familiar + unrelated stimuli per the methodology table.
273+
1. ~~**G4 (blocking: substrate-primary track)**~~ **CLOSED + empirically confirmed.** See re-measurement table above.
274+
2. **Roy-1 needs a held-out test fixture distinct from the priming arc.** Reusing cradle_prelinguistic warmup for both means the test scenario doesn't actually test generalisation. Hand-author `scenarios/roy/roy_1_holdout.yaml` with novel + familiar + unrelated stimuli per the methodology table.
275+
3. **Cluster monoculture during priming.** Arm A accumulated 6 distinct cluster ids all on `sense_food_source` — single-tool exposure, not the cluster diversity Phase 0 wants. The substrate-primary cold-start regime is picking one drive-affinity tool and looping on it. Diagnostic for the next experiment: does Roy-1 with a diverse fixture produce cross-tool cluster bias, or does it still collapse to one tool?
249276
4. **G2 (cosmetic):** gate `simulation/spinner.py` on interactive mode or `stderr.isatty()`. Spinner ANSI pollutes JSONL logs during script runs.
250277
5. **G5/G6 (environmental):** auto-spawn path mismatch (claude-sonnet GGUF vs qwen2.5 profile) and smollm auto-download blocked by non-TTY. Pre-existing, outside Roy code.
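
For the G2 item, one possible shape of the gate is sketched below. The helper name `spinner_enabled` and its `interactive` override are hypothetical; the current structure of `simulation/spinner.py` is not shown in this commit. The only library call relied on is the standard `isatty()` method on stream objects.

```python
import io
import sys

def spinner_enabled(stream=None, interactive=None):
    """Decide whether to draw the spinner (hypothetical helper)."""
    # An explicit interactive-mode flag wins over terminal detection.
    if interactive is not None:
        return bool(interactive)
    stream = sys.stderr if stream is None else stream
    # Some stream-like objects may lack isatty; treat those as non-TTY.
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())

# A redirected or in-memory stream is not a TTY, so no ANSI output.
captured = io.StringIO()
draw_when_captured = spinner_enabled(captured)
draw_when_forced = spinner_enabled(captured, interactive=True)
```

With this kind of check, script runs that redirect stderr into a JSONL log would skip the spinner, while an explicit interactive flag could still force it on.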
251278

252-
**Artifacts:** [`result.json`](/Users/dennyschaedig/.maxim/roy/roy-0-smoke/result.json) · protocol [`roy_0_smoke.md`](../experiments/protocols/roy_0_smoke.md) · spec [`roy_0_smoke.yaml`](./roy/roy_0_smoke.yaml) · LLM trace `/tmp/roy_0_live.jsonl` (23 peer_backend_call events)
279+
**Artifacts:**
280+
- Pre-G4 (2026-05-10): `~/.maxim/roy/roy-0-smoke/result.json` (overwritten by the re-measurement; pre-G4 snapshot lives in `~/.maxim/sim_reports/20260510_*` session dirs). LLM trace `/tmp/roy_0_live.jsonl` (23 peer_backend_call events).
281+
- Post-G4 (2026-05-11): [`result.json`](/Users/dennyschaedig/.maxim/roy/roy-0-smoke/result.json) carries the new `cluster_reward_bias` field. LLM trace `/tmp/roy_g4_live/roy.jsonl`.
282+
- Protocol: [`roy_0_smoke.md`](../experiments/protocols/roy_0_smoke.md). Spec: [`roy_0_smoke.yaml`](./roy/roy_0_smoke.yaml).
253283
<!-- /roy-iteration:roy-0-smoke -->
254284

255285
Empty until Roy-1 runs.

src/maxim/simulation/roy_runner.py

Lines changed: 13 additions & 0 deletions
@@ -873,6 +873,19 @@ def _format_summary(r: RoyIterationResult) -> str:
873873
f" - NAc: reward_bias L2={nac.get('reward_bias_l2', 0.0):.4f} "
874874
f"causal_link Δ={nac.get('causal_link_count_delta', 0):+d}"
875875
)
876+
# G4: cluster-keyed reward bias (Track 2 of grounded_language
877+
# _acquisition.md Phase 0+). Surface this on its own line so
878+
# operators reading summary.md see the substrate-primary
879+
# learning signal even when reward_bias_l2 is 0 (which it
880+
# always is for tool-outcome-driven runs — reward_bias is
881+
# populated by credit_node from reaction-driven
882+
# distribute_reward, not by tool outcomes).
883+
crb = nac.get("cluster_reward_bias", {})
884+
if crb.get("available"):
885+
lines.append(
886+
f" - NAc cluster_reward_bias L2={crb.get('l2', 0.0):.4f} "
887+
f"({len(crb.get('top_deltas', []))} keys differ)"
888+
)
876889
if ec.get("available"):
877890
lines.append(f" - EC: nodes Δ={ec.get('node_count_delta', 0):+d}")
878891
if hipp.get("available"):
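
For readers who want to check the new summary line outside the runner, the added branch can be exercised on its own roughly as follows. The function name `format_cluster_reward_bias_line` and the sample payload are assumptions; the `or {}` fallback is an extra guard against a `null` JSON value that the committed code does not include.

```python
def format_cluster_reward_bias_line(nac):
    """Return the summary line for cluster_reward_bias, or None if absent."""
    # `or {}` also covers an explicit None value, not just a missing key.
    crb = nac.get("cluster_reward_bias") or {}
    if not crb.get("available"):
        return None
    l2 = crb.get("l2", 0.0)
    n_keys = len(crb.get("top_deltas") or [])
    return f" - NAc cluster_reward_bias L2={l2:.4f} ({n_keys} keys differ)"

# Sample payload shaped like nac.cluster_reward_bias.{available, l2, top_deltas};
# the keys inside top_deltas are placeholders.
sample_nac = {
    "cluster_reward_bias": {
        "available": True,
        "l2": 2.4587,
        "top_deltas": [["placeholder-key", 1.0]] * 8,
    }
}
line = format_cluster_reward_bias_line(sample_nac)
missing = format_cluster_reward_bias_line({"cluster_reward_bias": None})
```

A pre-G4 `result.json` without the field yields `None`, so the summary simply omits the line.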
