diff --git a/html-guides/maxim-experiments.html b/html-guides/maxim-experiments.html index 2f394d5b..e4fb7a37 100644 --- a/html-guides/maxim-experiments.html +++ b/html-guides/maxim-experiments.html @@ -2,9 +2,9 @@ {% block title %}Experiments & Results — Maxim Docs | Bio-Inspired Cognitive Architecture{% endblock %} -{% block meta_description %}Experiments proving Maxim's bio-inspired learning pipeline works without LLM fine-tuning. 41/41 hypotheses across 3 tiers: substrate learning (Tier 1), LLM acts on learning (Tier 2), and organic LLM learning (Tier 3). Cross-session affective memory, energy-driven consumable learning, valence annotation, and SEM learning loop.{% endblock %} +{% block meta_description %}Experiments proving Maxim's bio-inspired learning pipeline works without LLM fine-tuning. 41/41 hypotheses across 3 tiers plus 7 Roy persona-convergence iterations (2026-05) localizing the LinguisticEncoder → EC alignment gap. Cross-session affective memory, energy-driven consumable learning, valence annotation, SEM learning loop, and the Roy-4 cheap-gate experiment that cancelled the 1.1 Hebbian binding plan.{% endblock %} -{% block meta_keywords %}Maxim experiments, substrate testing, cross-session learning, affective memory, consumable learning, valence annotation, SEM learning loop, bio-inspired AI, organic learning, cognitive architecture validation, Tier 3 learning{% endblock %} +{% block meta_keywords %}Maxim experiments, substrate testing, cross-session learning, affective memory, consumable learning, valence annotation, SEM learning loop, bio-inspired AI, organic learning, cognitive architecture validation, Tier 3 learning, Roy harness, persona convergence, substrate-primary AUT, encoder alignment{% endblock %} {% block meta_author %}Maxim Project{% endblock %} {% block og_site_name %}Maxim{% endblock %} @@ -46,13 +46,267 @@
Deterministic Validation of the Bio-Inspired Learning Pipeline
-41/41 hypotheses confirmed across 3 testing tiers, plus 3 additional validation experiments (B4 replanning, P6 extinction, P8 sleep replay) shipped in v0.5.0. Tier 1 experiments run on the substrate layer alone (deterministic, no LLM). Tier 2 uses scripted training with real LLM decisions. Tier 3 is the ultimate proof: fully organic LLM-driven training and testing with no scripted reactions. B4 replanning closes the last 1.0 gate besides embodiment. Each hypothesis is stated falsifiably, each result is a pass/fail count, and each experiment includes a reproduction command.
+41/41 hypotheses confirmed across 3 testing tiers, plus 3 additional v0.5.0 validation experiments (B4 replanning, P6 extinction, P8 sleep replay), plus 7 Roy persona-convergence iterations (2026-05). Tier 1 experiments run on the substrate layer alone (deterministic, no LLM). Tier 2 uses scripted training with real LLM decisions. Tier 3 is the ultimate proof: fully organic LLM-driven training and testing with no scripted reactions. The 2026-05 Roy iteration arc extends the surface to "the substrate's learning translates to behavioral divergence under substrate-primary action selection" — seven iterations established the cluster-keyed reward wire is structurally healthy and localized the failing layer to LinguisticEncoder → EC alignment (Roy-2c). Roy-4 cancelled the 1.1 Hebbian binding plan via a pre-registered cheap-gate experiment; Roy-5a (queued) is the disambiguator that scopes the actual fix.
Tier 1 (deterministic, no LLM): isolates the bio-pipeline's learning signal from LLM variance. Tier 2 (scripted training, LLM test): proves the LLM acts on bio-system learning with masked entity names to prevent language priors. Tier 3 (organic LLM training + test): the ultimate proof — the agent learns from its own actions with no scripted reactions, and a fresh control agent fails the same scenario.
The 2026-05 Roy harness iterations (Roy-0 through Roy-4, summarized below) extend the validation surface from "the substrate learns" to "the substrate's learning translates to behavioral divergence under substrate-primary action selection". Seven iterations established that the cluster-keyed reward wire is structurally healthy and the gap is at the encoder-alignment layer; Roy-4 cancelled the 1.1 Hebbian binding plan; Roy-5a (queued) is the cheap disambiguator that scopes the actual fix.
+The Roy harness is the long-running persona-convergence experiment: a three-arm runner (A = primed substrate, neutral prompt; B = blank substrate, persona-injected prompt; C = blank substrate, neutral prompt) that asks whether substrate-acquired bias translates to behavioral divergence under substrate-primary action selection. Earlier experiments (Exp 1–4 below) proved the substrate learns; Roy iterations measure whether that learning changes which action the agent picks on percepts the priming arc didn't directly drill.
+ +Seven iterations have shipped against a live LLM leader (qwen2.5-14b-instruct via cloudflared tunnel). Each iteration is a single-variable change vs the previous; each ships a YAML spec, a held-out fixture, an outcome doc, a reproduction protocol, and an iteration log entry on persona_convergence_crucible.md.
| Iteration | Date | Variable | cluster_reward_bias_l2 (a_vs_b) | Headline finding |
|---|---|---|---|---|
| Roy-0 | 2026-05-10 | substrate-primary smoke | 2.4587 | First end-to-end Roy. Cluster wire fires; A ≈ 11.6× blank-vs-blank noise floor. |
| Roy-1a | 2026-05-11 | llm-primary at test, original holdout | 2.4671 | Wire structurally preserved but behaviorally inert: LLM proposer doesn't consume cluster bias. |
| Roy-1b | 2026-05-12 | substrate-primary at test, original holdout | 2.4632 | Wire consumed but held-out percepts don't fire priming clusters. |
| Roy-2 | 2026-05-12 | llm-primary + multi-arc priming | 2.4708 | Multi-arc priming did NOT widen cluster vocabulary. Clean tool-family divergence (17/3/2 vs 21/5/1/1) via salience-mediated LLM-prompt path only. |
| Roy-2pc | 2026-05-13 | substrate-primary + engineered-overlap fixture | 2.4678 | Byte-identical 2× FAILED infant_humanoid_pick_up across all 3 arms. Engineering semantic overlap is insufficient. |
| Roy-2c | 2026-05-13 | min_confidence=0.0 probe (H1 vs H2) | 2.5661 | H1 confirmed: LinguisticEncoder → EC alignment is the block. Gate-tuning does NOT rescue the wire (zero sense_food_source calls even at gate=0.0). Wire-A is the architectural fix. |
| Roy-4 | 2026-05-13 | EC-activation instrumentation + Hebbian rule sweep | 2.4678 | FAIL: zero priming↔test bound edges at every sweep point. Cancels cross_modal_substrate_binding.md Stages 2–6. |
Pairwise cluster_reward_bias_l2 reproduces within 5% across all seven iterations on the same priming — the substrate-primary tool-outcome wire is rock-solid. The failing signal is which cluster ID the bias attaches to, not whether the bias forms.
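The "within 5%" figure can be checked directly from the values in the iteration table. The following is a minimal sketch, assuming the spread is measured relative to the Roy-0 baseline; the document does not state the exact formula, and the helper name `max_relative_spread` is hypothetical.

```python
# cluster_reward_bias_l2 (a_vs_b) values copied from the iteration table.
L2_BY_ITERATION = {
    "Roy-0": 2.4587,
    "Roy-1a": 2.4671,
    "Roy-1b": 2.4632,
    "Roy-2": 2.4708,
    "Roy-2pc": 2.4678,
    "Roy-2c": 2.5661,
    "Roy-4": 2.4678,
}


def max_relative_spread(values, baseline):
    """Largest absolute deviation from the baseline, as a fraction of it.

    Assumption: "within 5%" means relative to the Roy-0 value.
    """
    return max(abs(v - baseline) / baseline for v in values)


spread = max_relative_spread(L2_BY_ITERATION.values(), L2_BY_ITERATION["Roy-0"])
print(f"max relative spread vs Roy-0: {spread:.2%}")
```

Under that assumption the largest deviation comes from Roy-2c and is roughly 4.4%, which is consistent with the 5% claim.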
Across every iteration the surviving signal is always tool name — NAc's cluster_reward_bias map has the right tool keys (the priming sense_food_source reward survives all seven iterations). The failing signal is always EC cluster identity — the priming cluster UUIDs are never the active cluster on test percepts that ought to be semantically similar. The substrate is using two different identity schemes for the same concept: a coarse-grained tool-symbol scheme that's stable across encoder drift, AND a fine-grained EC-cluster scheme that's susceptible to it. Wire-A in the 0.9.1 release exploits the surviving granularity by surfacing tool-level bias at the LLM prompt; the architectural fix at the cluster layer needs encoder work, not signal-surfacing.
Roy-4 was the cheap experimental gate for cross_modal_substrate_binding.md's proposed Hebbian binding rule: EC nodes that activate in the same tick window across modalities acquire a binding edge. The gate tests the 1.1 design before any implementation is built. It uses the same priming, fixture, and arms as Roy-2c with one structural change: MAXIM_EC_TRACE_ACTIVATIONS=1 in the runner environment emits per-tick sim_ec_activation JSONL events from EntorhinalCortex.pattern_complete_or_separate. A post-hoc analyzer (scripts/analyze_roy_4_coactivation.py) computes the pairwise co-activation matrix and applies the proposed Hebbian rule.
| Outcome | Diagnosis |
|---|---|
| At least one test-phase node has a would-have-bound edge to a priming sense_food_source cluster | PASS: binding mechanism would have closed Roy-2c. Greenlight 1.1 Stages 2–6. |
| No would-have-bound edges between priming and test clusters under the proposed rule | FAIL: encoder misalignment is too severe for Hebbian binding alone. Cancel Stages 2–6; redirect to 1.2+ encoder-replacement direction. |
Node-set overlap
+37 priming nodes vs 10 / 13 / 9 nodes per arm (A/B/C). Zero EC node-ID overlap between priming and any test arm. Both linguistic and drive modalities reproduce the disjointness — not a channel-mismatch artifact.
+Food-cluster co-firing
+61 priming ticks where any of the 6 food clusters fired. Only 1 tick had a non-food co-firing partner. Of the 7 partner nodes during priming, zero appear in arm A's test-phase active set.
| min_cofire | min_weight | Priming would-have-bound edges | Matching priming↔test edges |
|---|---|---|---|
| 1 | 0.01 | 256 | 0 |
| 2 | 0.01 | 5 | 0 |
| 3 | 0.01 | 3 | 0 |
| 5 (default) | 0.5 (default) | 2 | 0 |
The most permissive rule (min_cofire=1, min_weight=0.01 — "any two nodes that fired in the same tick, ever, with vanishing weight floor") yields 256 priming bound edges. Zero connect a priming food cluster to a test-phase node. The temporal-coincidence signal the binding rule depends on does not exist in the priming trajectory.
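The sweep logic can be illustrated with a small sketch. This is not the code in scripts/analyze_roy_4_coactivation.py. The event shape `(tick, node_id)`, the Jaccard-style weight, and the function names are all assumptions made for illustration. Only the parameter names `min_cofire` and `min_weight` and their defaults come from the sweep table.

```python
from collections import Counter, defaultdict
from itertools import combinations


def would_have_bound(events, min_cofire=5, min_weight=0.5):
    """Return {frozenset({a, b}): weight} for node pairs that pass the rule.

    events: iterable of (tick, node_id) pairs (assumed shape).
    Weight is assumed to be co-fire ticks / ticks where either node fired.
    """
    nodes_by_tick = defaultdict(set)
    for tick, node in events:
        nodes_by_tick[tick].add(node)

    fired = Counter()    # ticks on which each node was active
    cofired = Counter()  # ticks on which each unordered pair was active together
    for nodes in nodes_by_tick.values():
        fired.update(nodes)
        cofired.update(frozenset(p) for p in combinations(sorted(nodes), 2))

    edges = {}
    for pair, n in cofired.items():
        a, b = tuple(pair)
        weight = n / (fired[a] + fired[b] - n)
        if n >= min_cofire and weight >= min_weight:
            edges[pair] = weight
    return edges


def matching_edges(edges, food_clusters, test_nodes):
    """Edges linking a priming food cluster to a test-phase node."""
    matches = []
    for pair in edges:
        a, b = tuple(pair)
        if (a in food_clusters and b in test_nodes) or (
            b in food_clusters and a in test_nodes
        ):
            matches.append(pair)
    return matches


# Toy data: one food cluster co-fires with one partner during priming,
# while the test-phase node set shares no IDs with priming.
priming = [(0, "food_a"), (0, "partner_x"), (1, "food_a"), (1, "partner_x")]
edges = would_have_bound(priming, min_cofire=1, min_weight=0.01)
matches = matching_edges(edges, {"food_a"}, {"test_1", "test_2"})
print(len(edges), len(matches))
```

The toy data mirrors the reported situation: when priming and test node sets are disjoint, no choice of `min_cofire` or `min_weight` can produce a priming↔test edge, because the rule only ever binds nodes that were observed together.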
Reproduce
+MAXIM_EC_TRACE_ACTIVATIONS=1 MAXIM_LOG_FILE=/tmp/roy_4_ec_trace.jsonl maxim roy run scenarios/roy/roy_4_iteration.yaml
+ Outcome doc: 21_roy_4.md — Cancelled plan: cross_modal_substrate_binding.md — Supersession: roy_5_encoder_alignment_disambiguator.md
+After Roy-4 cancelled the Hebbian binding plan, the user proposed a "linguistic funneling + lexicon" direction. Two parallel pre-merge reviews (architecture lens + bio-fidelity lens) independently rejected the lexicon-as-central-module shape, citing that (a) hand-curated lexicons have no biological warrant when ComponentIndex / ATL.find_or_create / AffordanceDecompositionStrategy already implement per-domain lookup, and (b) the lexicon doesn't actually solve Roy-2c — there are no surface tokens to normalize between cradle sensor/drive snapshots and CLI fixture text.
The reframed plan ships a diagnostic-first ladder. Roy-5a — Stage 1 — computes the priming↔arm-A cosine matrices on existing Roy-4 data (zero new sim runs, hours of work) to localize Roy-2c's gap to one of three sub-hypotheses:
| ID | max(M_tt) | Hypothesis | Fix shape | Cost |
|---|---|---|---|---|
| H1c | ≥ 0.40 | Text close, EC threshold + centroid drift miss the completion | Threshold/centroid sweep | ~80 LOC, days |
| H1b | 0.20 – 0.40 | Wrong encoder model for substrate-relevant phrasing | Encoder A/B | ~200 LOC, weeks |
| H1a | < 0.20 | Sensor-drive and text live in incomparable subspaces | Stage 3 cradle-arc redesign → then resurrect binding OR promote encoder replacement to 1.2+ | Months (if needed) |
Plan: roy_5_encoder_alignment_disambiguator.md — The plan's "explicit non-introductions" section names what's out of scope: no central lexicon module, no LinguisticFactory class, no commitment to any sub-hypothesis fix before the diagnostic verdict. Same framing rule that saved months via Roy-4: cheap experimental gate, then implementation scoped by outcome.
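The decode step of the Roy-5a diagnostic can be sketched as follows. The thresholds come from the sub-hypothesis table. Everything else is an assumption for illustration: that M_tt is a matrix of cosine similarities between priming vectors and arm-A test vectors, that those vectors are plain lists of floats, and the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_matrix(priming_vecs, test_vecs):
    """Rows index priming vectors, columns index arm-A test vectors."""
    return [[cosine(p, t) for t in test_vecs] for p in priming_vecs]


def decode_sub_hypothesis(max_cosine):
    """Map max(M_tt) to a sub-hypothesis ID using the table's thresholds."""
    if max_cosine >= 0.40:
        return "H1c"  # text is close; threshold/centroid sweep
    if max_cosine >= 0.20:
        return "H1b"  # wrong encoder model; encoder A/B
    return "H1a"      # incomparable subspaces; cradle-arc redesign


# Hypothetical toy vectors, not real Roy-4 data.
priming_vecs = [[1.0, 0.0], [0.6, 0.8]]
test_vecs = [[0.0, 1.0]]
m_tt = cosine_matrix(priming_vecs, test_vecs)
max_cosine = max(max(row) for row in m_tt)
print(decode_sub_hypothesis(max_cosine))
```

Keeping the decode as a pure function of a single scalar matches the plan's framing: the verdict is fixed by thresholds stated before the analysis runs, so the result scopes the fix rather than the other way around.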
Phase −1 ✓ shipped (NAc action proposal + 11 tests). Phase 0 harness ✓ shipped: --aut-mode substrate-primary CLI flag + cradle-prelinguistic arc variant + motor-only AUT prompt + per-tick telemetry. Roy harness ✓ shipped (2026-05-10): R1 curriculum runner + R2 substrate_diff + R3 three-arm iteration runner + R4 idempotent log generator + R5 process-global invariants. G3 fail-fast LLM preflight ✓ shipped (2026-05-11, PR #235+#238): _MaximPeerBackend.health_check() probe with env-then-peer.yml resolution; aborts in ≤3s on unreachable leader. G4 cluster_id reward wire ✓ shipped (2026-05-11, PR #236+#237): closes the deferred Track 2 wire — substrate-primary tool outcomes now populate NAc._cluster_reward_bias, persist to aut_nac.json, surface in substrate_diff. Empirically validated: live Roy-0 run produced cluster_reward_bias_l2 = 2.4587 on A-vs-blank pairs (~11.6× blank-vs-blank noise floor). Hivemind shareability infrastructure remains: portable substrate-snapshot bundle format + nac.merge() / ec.merge() Bayesian aggregation + provenance tags + identity-bearing concept detection + substrate domains + export/import CLI verbs.
Phase −1 ✓ shipped (NAc action proposal + 11 tests). Phase 0 harness ✓ shipped: --aut-mode substrate-primary CLI flag + cradle-prelinguistic arc variant + motor-only AUT prompt + per-tick telemetry. Roy harness ✓ shipped (2026-05-10): R1 curriculum runner + R2 substrate_diff + R3 three-arm iteration runner + R4 idempotent log generator + R5 process-global invariants. G3 fail-fast LLM preflight ✓ shipped (2026-05-11, PR #235+#238): _MaximPeerBackend.health_check() probe with env-then-peer.yml resolution; aborts in ≤3s on unreachable leader. G4 cluster_id reward wire ✓ shipped (2026-05-11, PR #236+#237): closes the deferred Track 2 wire — substrate-primary tool outcomes now populate NAc._cluster_reward_bias, persist to aut_nac.json, surface in substrate_diff. Empirically validated: live Roy-0 run produced cluster_reward_bias_l2 = 2.4587 on A-vs-blank pairs (~11.6× blank-vs-blank noise floor). Roy iteration arc Roy-1a–Roy-4 ✓ shipped (2026-05-11 → 2026-05-13): six follow-up iterations reproducing the wire 6× on the same priming and localizing the behavioral-expression gap to LinguisticEncoder → EC alignment; Roy-4 (PR #246) cancelled the 1.1 Hebbian binding plan via a pre-registered cheap-gate experiment. 0.9.1 Wire-A ships as the operator-visible interim that surfaces the surviving tool-level signal at the LLM prompt regardless of encoder drift. 1.1+ reframe: roy_5_encoder_alignment_disambiguator.md (PR #247) replaces the cancelled binding plan with a diagnostic-first ladder; Stage 1 (Roy-5a cosine analysis on existing Roy-4 data, zero new sim runs) decodes the gap to one of three sub-hypotheses that scope the 1.1+ fix. Hivemind shareability infrastructure remains: portable substrate-snapshot bundle format + nac.merge() / ec.merge() Bayesian aggregation + provenance tags + identity-bearing concept detection + substrate domains + export/import CLI verbs.
cluster_reward_bias_l2 = 2.4587 on Roy-0 A-vs-blank pairs
│
+ ├──> 2026-05-11 → 2026-05-13 Roy iteration arc (Roy-1a/1b/2/2pc/2c)
+ │ Cluster wire reproduces 6×; H1 confirmed: LinguisticEncoder → EC alignment is the block
+ │ 0.9.1 Wire-A surfaces the surviving tool-level signal at the LLM prompt
+ ├──> 2026-05-13 Roy-4 EC instrumentation + Hebbian sweep (PR #246)
+ │ FAIL — cancels cross_modal_substrate_binding.md Stages 2-6
+ │ Zero priming↔test bound edges at every parameter sweep point
+ ├──> 2026-05-13 roy_5_encoder_alignment_disambiguator.md opened (PR #247)
+ │ Stage 1 (Roy-5a cosine analysis on existing Roy-4 data) queued; verdict scopes 1.1+ fix
+ │
├──> B3.2-B3.3 Acting Coach extensions
├──> B5 substrate-primary harness flagged (Phase −1 ✓, Phase 0 harness ✓, validation 1.1+)
├──> Hivemind shareability infra (export/import, merge(), provenance — 1.0 reservation)
@@ -462,7 +471,7 @@ All architectural gates have been closed and v0.8.0 ships P5 (the final gate). The Roy harness shipped 2026-05-10 with G3 + G4 follow-ups landing 2026-05-11, giving 1.0 a working substrate-primary closed-loop validated end-to-end against a live leader. Remaining work is D1–D3 docs + stale-tagging cleanup; 1.0 is a packaging milestone.
+All architectural gates have been closed and v0.8.0 ships P5 (the final gate). The Roy harness shipped 2026-05-10 with G3 + G4 follow-ups landing 2026-05-11, giving 1.0 a working substrate-primary closed loop validated end-to-end against a live leader. The Roy iteration arc of 2026-05-11 through 2026-05-13 (Roy-1a through Roy-4) extends the validation surface from "the wire fires" to "the wire's behavioral expression is gated at the encoder-alignment layer"; the 1.1 Hebbian binding plan was cancelled by Roy-4 in favor of a diagnostic-first reframe (roy_5_encoder_alignment_disambiguator.md). Remaining 1.0 work is D1–D3 docs + stale-tagging cleanup; 1.0 is a packaging milestone independent of the 1.1+ encoder-alignment research direction.
A-vs-blank ratio ≈ 11.6× over the blank-vs-blank stochastic-cluster noise floor — the first empirical proof the substrate-primary tool-outcome wire fires end-to-end. reward_bias_l2 = 0 is the expected per-ATL-node bias from credit_node (a different code path the tool-outcome wire doesn't touch).
Two G-marked gaps were caught + closed in the same session: G3 (fail-fast LLM preflight probe — aborts in ≤3s when the leader is unreachable, so Roy doesn't grind for 10 minutes on dispatch_exhausted), and G4 (the cluster_id reward-feedback wire deferred when cluster-keyed action selection shipped — substrate-primary tool outcomes now populate NAc._cluster_reward_bias, persist to aut_nac.json, and surface in Roy result.json).
Full Roy methodology + iteration log: persona_convergence_crucible.md. CLI: maxim roy run <spec.yaml> — see the CLI reference for spec shape + subcommands.
After Roy-0 proved the wire fires, six follow-up iterations probed whether the substrate-acquired bias actually changes behavior at test time. Across every iteration the priming-side cluster_reward_bias_l2 reproduces within ~5% (the substrate-primary tool-outcome wire is rock-solid). The failing signal is always which cluster ID the bias attaches to — priming-acquired UUIDs are never the active cluster on test percepts.
| Iter | Date | Variable | cluster_l2 a_vs_b | Finding |
|---|---|---|---|---|
| Roy-1a | 2026-05-11 | llm-primary at test | 2.4671 | Wire structurally preserved, behaviorally inert: LLM proposer doesn't consume cluster bias. |
| Roy-1b | 2026-05-12 | substrate-primary at test | 2.4632 | Wire consumed but held-out percepts don't fire priming clusters. |
| Roy-2 | 2026-05-12 | multi-arc priming | 2.4708 | Multi-arc priming did NOT widen cluster vocabulary. Tool-family divergence via salience-mediated path only. |
| Roy-2pc | 2026-05-13 | engineered-overlap fixture | 2.4678 | Byte-identical action distributions across all 3 arms even with engineered semantic overlap. |
| Roy-2c | 2026-05-13 | min_confidence=0.0 probe | 2.5661 | H1 confirmed: LinguisticEncoder → EC alignment is the block. Gate-tuning does not rescue the wire. |
| Roy-4 | 2026-05-13 | EC trace + Hebbian sweep | 2.4678 | FAIL: zero priming↔test bound edges across the full parameter sweep. Cancels the 1.1 cross-modal binding plan. |
Cross-iteration pattern: tool name survives, cluster identity doesn't. NAc's cluster_reward_bias map has the right tool keys (priming sense_food_source reward survives all seven iterations) but the wrong cluster keys. The substrate is using two different identity schemes for the same concept — coarse tool-symbol (stable) AND fine EC-cluster (encoder-drift-susceptible). Wire-A in the 0.9.1 release exploits the surviving granularity by surfacing tool-level bias to the LLM prompt; the architectural fix at the cluster layer needs encoder work.
Roy-4 ran the pre-registered cheap-gate experiment for the proposed Hebbian binding rule from cross_modal_substrate_binding.md. Result: zero EC node-ID overlap between priming (37 nodes) and any test arm (10/13/9); priming food clusters fired on 61 ticks, only 1 of which had a non-food co-firing partner; the most permissive parameter setting (min_cofire=1, min_weight=0.01) yielded 256 priming bound edges but zero priming↔test connections. The temporal-coincidence signal the binding rule depends on doesn't exist in the priming trajectory. Outcome doc: 21_roy_4.md.
The 1.1 plan that supersedes the cancelled binding work is roy_5_encoder_alignment_disambiguator.md — a diagnostic-first reframe written after two parallel pre-merge reviews (architecture + bio-fidelity) independently rejected a user-proposed "central lexicon" direction. Stage 1 (Roy-5a) computes priming↔arm-A cosine matrices on existing Roy-4 data with zero new sim runs; the max cosine value decodes to one of three sub-hypotheses (threshold tuning vs encoder A/B vs cradle-arc redesign) that scopes the implementation.
+ +Full Roy methodology + iteration log: persona_convergence_crucible.md. CLI: maxim roy run <spec.yaml> — see the CLI reference for spec shape + subcommands. Detailed experiment writeups for every iteration: Experiments & Results — Roy iteration arc.