Commit 561e4b4

dennys246 and claude committed
feat(substrate): 0.4 scale validation + episode boundaries + concept decomposition S2-3 + P5 stress
Episode boundary enrichment Stage 1:
- after_tool_execution field on CaptureEvent + tool_execution_rule()
- Wired at both agent loop callsites (Section 3 + Section 4)
- 5 new tests in TestToolExecutionBoundary

Concept decomposition Stage 2 (role-tagged edges):
- _classify_relation extracts spatial/possessive/temporal/action/descriptive from spaCy dependency parse on noun chunks
- node_relations on CaptureEvent + PendingEpisodeState
- apply_hebbian_on_close annotates edges with relation metadata
- 4 new tests in TestRoleTaggedEdges, 5 new decomposer relation tests

Concept decomposition Stage 3 (ConceptExtractor convergence):
- ConceptExtractor accepts optional decomposer parameter
- Goal text extracted via NLP noun-phrase chunker when available, falls back to legacy _is_structured_goal heuristic
- _decomposer init order fixed (pre-merge review fold: _wire_substrate_encoder before _wire_multi_layer so ConceptExtractor sees the decomposer)

P5 stress persistence (Stages 1+2):
- 1k-node mechanism: 10 reload cycles, zero F1 degradation
- 10k+ node mid-scale: 12k nodes, 3k episodes, 1.15MB, 0.06s load
- State size sub-linear growth verified
- NAc reward bias round-trip verified

Tier 3 scale validation script:
- 20+ seed runner with resume support + incremental JSONL saves
- Wilcoxon signed-rank (S3 > S1) + Mann-Whitney (S3 > control)
- Pass gates: mean S3 teal >= 70%, p < 0.05, S3 escape >= 80%

Pre-merge review findings folded (5 fixes):
- _decomposer init order (CRITICAL, Arch cross-confirmed)
- Scale script import path + JSON exit code (IMPORTANT, Exec)
- Relation priority comment corrected (IMPORTANT, cross-confirmed)
- Unknown preposition returns None not "spatial" (MINOR, Exec)

5141 tests pass, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a544f1b commit 561e4b4

16 files changed

Lines changed: 1340 additions & 22 deletions
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Tier 3 Scale Validation — Experiment Protocol

**Experiment:** Run organic LLM learning (Exp 4, Tier 3) at 20+ independent seeds to prove the learning effect is statistically robust.

## Quick run

```bash
# Full scale run (~60-100 min, requires leader with qwen2.5-14b):
PYTHONPATH=src python scripts/behavioral_convergence_exp4_scale.py --seeds 20

# Resume from a partial run:
PYTHONPATH=src python scripts/behavioral_convergence_exp4_scale.py --resume-dir /path/to/output
```

## Prerequisites

- Leader online with LLM loaded (`maxim peer llm --status`)
- Network connectivity to leader (`maxim peer version`)
- scipy installed (for statistical tests)

## What each seed does

Each of the 20+ seeds runs the full Exp 4 pipeline independently:

1. **Session 1 (exploration):** Fresh agent, no bio-state. Random exploration.
2. **Session 2 (early learning):** Reloaded from Session 1. Should show preference shift.
3. **Session 3 (convergence):** Reloaded from Session 2. Should converge to teal (antidote).
4. **Fresh control:** No bio-state. Baseline comparison.

Bio-state (hippocampus + NAc) is saved/loaded between sessions within each seed, but seeds are fully isolated from each other.
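The session chain and seed isolation described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's code: `run_session`, `run_seed`, and the `sessions_seen` field are hypothetical, and the stub only tracks how many earlier sessions fed the loaded state. The `seed_NNN/` directory and `hippocampus.json` file names come from this document; the three-digit zero padding is a guess at what `NNN` means.

```python
import json
import tempfile
from pathlib import Path


def run_session(persist_dir, load_state):
    """Hypothetical stand-in for one Exp 4 session.

    Reads how many earlier sessions contributed to the saved state (if
    loading is requested), then writes an updated state file.
    """
    prior = 0
    if load_state and persist_dir is not None:
        state_file = persist_dir / "hippocampus.json"
        if state_file.exists():
            prior = json.loads(state_file.read_text())["sessions_seen"]
    if persist_dir is not None:
        persist_dir.mkdir(parents=True, exist_ok=True)
        (persist_dir / "hippocampus.json").write_text(
            json.dumps({"sessions_seen": prior + 1})
        )
    return {"prior_sessions": prior}


def run_seed(output_dir, seed):
    """Chain S1 -> S2 -> S3 through one seed-private directory, plus a control."""
    persist_dir = Path(output_dir) / f"seed_{seed:03d}"
    s1 = run_session(persist_dir, load_state=False)  # fresh agent
    s2 = run_session(persist_dir, load_state=True)   # reloaded from Session 1
    s3 = run_session(persist_dir, load_state=True)   # reloaded from Session 2
    control = run_session(None, load_state=False)    # never touches saved state
    return {"seed": seed, "s1": s1, "s2": s2, "s3": s3, "control": control}


with tempfile.TemporaryDirectory() as tmp:
    first = run_seed(tmp, 0)
    second = run_seed(tmp, 1)
```

Because each seed writes only under its own directory, the second seed starts from an empty state even though the first seed has already finished.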

## Primary metrics

- **Teal rate per session:** fraction of choices that select the antidote vial
- **S3-S1 improvement:** paired difference in teal rate between Session 3 and Session 1
- **Control comparison:** experienced S3 vs fresh control

## Pass gates

| Gate | Criterion | Rationale |
|------|-----------|-----------|
| Mean S3 teal rate | >= 70% | Strong convergence on average |
| Mean S3-S1 improvement | > 0% | Learning occurred |
| Wilcoxon signed-rank (S3 > S1) | p < 0.05 | Statistically significant |
| S3 escape rate | >= 80% | Most seeds converge to escape |
| Control death rate | >= 60% | Fresh agents reliably fail |
| S3 teal > control teal | mean comparison | Experienced beats fresh |
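As a rough sketch, the gate table could be checked like this. The function name, the record keys (`s1_teal`, `s3_escaped`, and so on), and the example records are all hypothetical and made up for illustration; only the thresholds come from the table. It also assumes escape and death rates are fractions of seeds, which the protocol implies but does not state outright.

```python
from statistics import mean


def evaluate_gates(seeds, wilcoxon_p):
    """Return {gate name: passed} using the thresholds from the table.

    `seeds` is a list of per-seed dicts (hypothetical keys); `wilcoxon_p`
    is the one-sided S3 > S1 p-value computed elsewhere.
    """
    s1 = [s["s1_teal"] for s in seeds]
    s3 = [s["s3_teal"] for s in seeds]
    control = [s["control_teal"] for s in seeds]
    n = len(seeds)
    return {
        "mean_s3_teal": mean(s3) >= 0.70,
        "mean_improvement": mean(b - a for a, b in zip(s1, s3)) > 0.0,
        "wilcoxon": wilcoxon_p < 0.05,
        "s3_escape_rate": sum(s["s3_escaped"] for s in seeds) / n >= 0.80,
        "control_death_rate": sum(s["control_died"] for s in seeds) / n >= 0.60,
        "s3_beats_control": mean(s3) > mean(control),
    }


# Made-up records, only to exercise the function; not experimental results.
example = [
    {"s1_teal": 0.2, "s3_teal": 0.8, "control_teal": 0.3, "s3_escaped": True, "control_died": True},
    {"s1_teal": 0.3, "s3_teal": 0.9, "control_teal": 0.2, "s3_escaped": True, "control_died": True},
    {"s1_teal": 0.1, "s3_teal": 0.7, "control_teal": 0.4, "s3_escaped": True, "control_died": False},
    {"s1_teal": 0.4, "s3_teal": 0.8, "control_teal": 0.1, "s3_escaped": True, "control_died": True},
    {"s1_teal": 0.2, "s3_teal": 0.6, "control_teal": 0.5, "s3_escaped": False, "control_died": False},
]
gates = evaluate_gates(example, wilcoxon_p=0.03)
failed = [name for name, ok in gates.items() if not ok]
```

Returning one boolean per gate keeps it easy to report which specific gate failed.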

## Statistical tests

- **Wilcoxon signed-rank test:** paired, one-sided (S3 > S1). Non-parametric: it makes no normality assumption, which matters because teal rates are bounded in [0, 1].
- **Mann-Whitney U:** unpaired, one-sided (S3 > control). Different sample sizes possible if some controls have different choice counts.
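To make the two tests concrete, here is a standard-library sketch of what they compute. It is not the script's implementation: the prerequisites list scipy, which provides `scipy.stats.wilcoxon` and `scipy.stats.mannwhitneyu` with an `alternative="greater"` option, so the script presumably relies on those. The function names and the example counts below are hypothetical. The signed-rank function returns an exact one-sided p-value; the Mann-Whitney function returns only the U statistic, with no p-value.

```python
from collections import Counter


def signed_rank_p_greater(after, before):
    """Exact one-sided p-value for H1: `after` tends to exceed `before`.

    Zero differences are dropped and tied |differences| share the average
    rank. Ranks are doubled so that half-ranks stay integers.
    """
    diffs = [a - b for a, b in zip(after, before) if a != b]
    if not diffs:
        return 1.0
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    doubled = [0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled[order[k]] = i + j + 2  # twice the average of ranks i+1 .. j+1
        i = j + 1
    observed = sum(r for r, d in zip(doubled, diffs) if d > 0)
    # Null distribution: each rank joins the positive sum with probability 1/2.
    counts = Counter({0: 1})
    for r in doubled:
        nxt = Counter()
        for total, ways in counts.items():
            nxt[total] += ways
            nxt[total + r] += ways
        counts = nxt
    tail = sum(ways for total, ways in counts.items() if total >= observed)
    return tail / 2 ** len(diffs)


def mann_whitney_u(x, y):
    """U statistic for x vs y: pairs where x wins, ties counting one half."""
    return sum((a > b) + 0.5 * (a == b) for a in x for b in y)


# Made-up teal counts per seed, only to exercise the functions.
s1_counts = [2, 3, 1, 4, 2]
s3_counts = [8, 10, 7, 9, 5]
control_counts = [3, 2, 4, 1]
p_value = signed_rank_p_greater(s3_counts, s1_counts)
u_stat = mann_whitney_u(s3_counts, control_counts)
```

Integer counts are used in the example because tie detection on floating-point rates can be unreliable; rounding the rates before ranking would be one way around that.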

## Sources of variance

The only source of per-seed variance is **LLM sampling noise** (temperature 0.4). Vial order shuffling is deterministic per turn (`random.Random(turn * 7 + 13)`). This is intentional — we're measuring whether the bio-system's learned valence is strong enough to overcome LLM sampling noise across 20+ independent trials.
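A minimal sketch of the deterministic shuffle, assuming the order is produced by `shuffle` on a generator seeded as quoted above. The `vial_order` helper and the vial list are hypothetical: teal and purple are mentioned in this document, the other names are placeholders, and the real experiment masks the names shown to the LLM.

```python
import random

# teal and purple appear in the protocol; the other names are placeholders.
VIALS = ["teal", "purple", "amber", "crimson"]


def vial_order(turn):
    """Shuffle a copy of the vial list with the per-turn seed from the text."""
    order = list(VIALS)
    random.Random(turn * 7 + 13).shuffle(order)
    return order


# The same turn always yields the same order, in every seed and session.
repeatable = vial_order(3) == vial_order(3)
```

Since the seed depends only on the turn number, every experimental seed sees the same vial order on a given turn, which leaves LLM sampling as the remaining source of variation.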

## Resume support

Results are saved incrementally to `partial_results.jsonl` in the output directory. If a run is interrupted, use `--resume-dir` to continue from where it stopped. Completed seeds are skipped automatically.
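The resume behaviour could look roughly like the sketch below. The helper names and the `seed` key in each record are assumptions; only the `partial_results.jsonl` file name comes from this document.

```python
import json
import tempfile
from pathlib import Path


def completed_seeds(output_dir):
    """Collect seed ids already recorded in partial_results.jsonl."""
    path = Path(output_dir) / "partial_results.jsonl"
    done = set()
    if not path.exists():
        return done
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["seed"])
            except json.JSONDecodeError:
                continue  # ignore a half-written line from an interrupted run
    return done


def append_result(output_dir, record):
    """Append one finished seed as a single JSON line."""
    path = Path(output_dir) / "partial_results.jsonl"
    with path.open("a") as fh:
        fh.write(json.dumps(record) + "\n")


def run_all(output_dir, seeds, run_seed):
    """Run only the seeds that are not yet recorded; return the ones that ran."""
    done = completed_seeds(output_dir)
    ran = []
    for seed in seeds:
        if seed in done:
            continue
        append_result(output_dir, run_seed(seed))
        ran.append(seed)
    return ran


with tempfile.TemporaryDirectory() as tmp:
    first_pass = run_all(tmp, range(2), lambda s: {"seed": s})
    second_pass = run_all(tmp, range(4), lambda s: {"seed": s})
```

Appending one line per finished seed means an interruption loses at most the seed that was in progress.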

## Output

- `partial_results.jsonl` — one line per completed seed (incremental)
- `scale_results.json` — full aggregate results with statistics
- `seed_NNN/` — per-seed persist directories (hippocampus.json, nac.json)

## If gates fail

1. **Mean S3 teal rate < 70%:** Check if valence context is being surfaced in the prompt. Inspect a failing seed's persist dir for hippocampus/NAc state. The bio-pipeline may not be producing strong enough valence signal.

2. **Wilcoxon p >= 0.05:** May need more seeds (try 30). Or the effect size is smaller than expected — check the improvement distribution for outlier seeds that regress.

3. **Control death rate < 60%:** The LLM may have a language prior about teal/antidote despite masked names. Check the control choices — if teal is selected by name preference (not learning), the masking is insufficient.

4. **S3 escape rate < 80%:** Some seeds may get stuck in a local minimum (always picking purple/heal but never finding the antidote). Check if those seeds' Session 2 ever tried teal — if not, the exploration in Session 1 may have been too narrow.

docs/plans/substrate_concept_decomposition.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
# Substrate — Concept decomposition (noun-phrase extraction before EC)

-**Status:** Stage 1 COMPLETE (shipped `723dbee` 2026-04-16, validated 2026-04-17). Stage 2/3 pending.
+**Status:** Stage 1 COMPLETE (shipped `723dbee` 2026-04-16, validated 2026-04-17). Stages 2+3 SHIPPED in 0.4.
**Scope:** ~400–600 LOC (protocol + spaCy strategy + encoder integration + tests). New optional dep: `spacy` (MIT license).
**Target version:** post-0.3. Ships AFTER P4 Stage 3 proves the base cross-modal claim on bare class names. **P4 Stage 3 PASSED (2026-04-16) — trigger fired.**
**Parent:** None (standalone). Extends `similarity/encoder.py` → `similarity/ec.py` capture path.

docs/plans/substrate_episode_boundary_enrichment.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
# Substrate — Episode boundary enrichment

-**Status:** PARTIAL (2026-04-17). Stage 3 (pain/salience spike) SHIPPED via sem_learning_loop.md. observe_episode_event now wired into production agent loop via behavioral_convergence_wiring.md. Stages 1-2 (tool execution + semantic shift) remain — ship before P5.
+**Status:** PARTIAL (2026-04-18). Stage 1 (tool execution boundary) SHIPPED in 0.4. Stage 3 (pain/salience spike) SHIPPED via sem_learning_loop.md. Stage 2 (semantic shift) remains — ship before P6.
**Scope:** ~200–400 LOC (3 new boundary rules + CaptureEvent extensions + tests).
**Target version:** post-0.3. Ships AFTER P4 Stage 3 proves the base substrate claim.
**Parent:** None (standalone). Extends `memory/episode.py` boundary rule surface.

docs/plans/substrate_p5_stress_persistence.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
# Substrate P5 — Robust Cross-Session Persistence Under Stress

-**Status:** Draft — opens after P4 (CLOSED) + concept decomposition land.
+**Status:** Stages 1+2 SHIPPED in 0.4. Stage 3 (10-seed sweep) marked `@pytest.mark.slow`.
**Scope:** ~400 LOC + ~100 metric extractor
**Target version:** 0.5
**Gates:** null (not 1.0-gating, but blocks P6 and P8)
