feat(0.9.1): Stages 0b + 0c — action JSONL telemetry + recommend_action emission by dennys246 · Pull Request #254 · dennys246/Maxim

dennys246 · 2026-05-16T03:59:37Z

Summary

Stages 0b + 0c of release_0_9_1.md. Telemetry instrumentation prereqs for Roy-3 measurement. The wires (Wire 3 / Wire-A / Wire 2 / Wire 1) ship as behavioral changes; 0b/0c ship the observability infrastructure those wires need to be measured.

Per the plan: "Stage 0a-d first because telemetry blocks observation. Roy-2c (one env var) lands before any wire work; Stage 0b-c lands as the structural prerequisite for measuring whether subsequent wires produced behavioral signal."

What ships

| Stage | What | Where |
| --- | --- | --- |
| 0b: Action JSONL telemetry | `ActionRecord` gains `agent_id` / `session_id` / `entity_class` (CC3-additive on the frozen dataclass) | src/maxim/simulation/sinks.py |
| | `RequestContext` bound at sim orchestrator entry via `context_scope()` on AUT + orch threads | src/maxim/simulation/orchestrator.py |
| | `InstrumentedExecutor` reads `current_context()` + derives `entity_class` strictly opt-in (params-only after pre-merge fold) | src/maxim/simulation/instrumented_executor.py |
| | `save_action_log` writes `_format_version: "1.1"` header line + new fields per record | src/maxim/simulation/report.py |
| | `sim_action` grows `entity_class` kwarg threaded through `sim_log`'s data dict | src/maxim/simulation/sim_logger.py |
| 0c: recommend_action emission | `NAc.recommend_action` emits one `sim_log("NAc_RECOMMEND", ...)` event per call (all 4 return paths; Roy-3 must distinguish "no tools available" / "no scores" / "sub-threshold" / "success") | src/maxim/decisions/nac.py |
| | tick aligned with Stage 0d's `sim_ec_activation` (`int(time.time() - _sim_start)`) so Roy-3 cross-channel joins work | src/maxim/decisions/nac.py |
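The first and fourth 0b rows can be illustrated with a minimal sketch. It is not the code in sinks.py or report.py: `ActionRecordSketch`, its `tool_name` / `success` fields, and `dump_action_log` are hypothetical stand-ins, and the `_record_kind` key on the header is an assumption based on the reader pattern named in the fold commit. It shows why appending optional fields with `None` defaults to a frozen dataclass is non-breaking, and what a header-first JSONL file looks like.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ActionRecordSketch:
    """Hypothetical stand-in for ActionRecord; the real field set differs."""

    tool_name: str
    success: bool
    # Telemetry fields appended at the end with None defaults, so callers
    # that construct the record the old way keep working unchanged.
    agent_id: Optional[str] = None
    session_id: Optional[str] = None
    entity_class: Optional[str] = None


ACTIONS_JSONL_FORMAT_VERSION = "1.1"


def dump_action_log(records: Iterable[ActionRecordSketch]) -> str:
    """Return JSONL text: one header line, then one line per record."""
    header = {
        "_record_kind": "header",  # assumed marker key
        "_format_version": ACTIONS_JSONL_FORMAT_VERSION,
    }
    lines = [json.dumps(header)]
    lines.extend(json.dumps(asdict(record)) for record in records)
    return "\n".join(lines) + "\n"


legacy = ActionRecordSketch("examine", True)  # pre-0b call shape
tagged = ActionRecordSketch(
    "pet", True, agent_id="sim_aut", session_id="s1", entity_class="dog"
)
text = dump_action_log([legacy, tagged])
```

The header is written even when `records` is empty, which matches the "header even with 0 records" test layer described in the commit message.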

Total: +1,028 / -148, 8 files. 32 unit tests passing across 8 layers.

NAc snapshots at session boundary

Per the plan: "Save NAc snapshots at session boundary (not just final) so reward_bias evolution is plottable." Roy's multi-stage harness pattern (each stage produces its own session_id with its own aut_nac.json) already satisfies this since PR #248 wired EC + ATL into save_aut_state. Reward_bias evolution is plottable across the priming-stage sequence today. Intra-session checkpoints (within a single sim_id) are a follow-up if/when cradle continuous developmental sims need them — flagged by the bio-fidelity reviewer as a 1.0 follow-up, not a 0.9.1 gap.

Two-lens pre-merge review

Per feedback_review_before_ship.md, spawned parallel architecture + bio-fidelity reviews. Folded into commit 011c995 before opening this PR:

3 Critical (architecture lens):

| Finding | Fix |
| --- | --- |
| `int(time.time())` did not align with Stage 0d's `int(time.time() - _sim_start)` tick space, so Roy-3 cross-channel joins would have returned zero matches every time | Imported `sim_logger._sim_start`, subtracted; pinned by `test_tick_aligned_with_sim_logger_start` |
| `_format_version: "1.0"` vs plan-specified `"1.1"`; readers branching on version would have read the wrong dialect | Bumped to "1.1" + extracted to `_ACTIONS_JSONL_FORMAT_VERSION` constant |
| Fourth silent-return-None early-exit (`if not available_tools`) skipped emission, so Roy-3 couldn't distinguish "no tools available" from "no tools scored above gate" | Added emission with `best_tool=None, best_score=0.0, passed_gate=False`; pinned by `test_emission_on_empty_available_tools_path` |
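The tick finding is easy to see with numbers. The sketch below is illustrative only: `epoch_tick`, `elapsed_tick`, and the fixed `SIM_START` value are hypothetical, standing in for the pre-fold and post-fold tick computations. It joins two channels on tick and shows that epoch-based ticks never match elapsed-second ticks.

```python
def epoch_tick(now: float) -> int:
    """Pre-fold behaviour: raw epoch seconds (about 1.7e9)."""
    return int(now)


def elapsed_tick(now: float, sim_start: float) -> int:
    """Post-fold behaviour: whole seconds since the sim started."""
    return int(now - sim_start)


SIM_START = 1_700_000_000.0  # hypothetical sim start time
event_times = [SIM_START + 3.2, SIM_START + 7.9]

# Stage 0d channel (sim_ec_activation) already uses elapsed seconds.
ec_ticks = {elapsed_tick(t, SIM_START) for t in event_times}

# recommend_action channel, before and after the fold.
pre_fold_ticks = {epoch_tick(t) for t in event_times}
post_fold_ticks = {elapsed_tick(t, SIM_START) for t in event_times}

# A join on tick is an intersection of the two tick sets.
pre_fold_join = ec_ticks & pre_fold_ticks
post_fold_join = ec_ticks & post_fold_ticks
```

With these inputs `pre_fold_join` is empty while `post_fold_join` contains both ticks, which is the zero-match failure the fix removes.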

Cross-confirmed (Arch I2 + Bio I1/I2) — entity_class heuristic:

Pre-fold _derive_entity_class included a verb-prefix-strip + role-suffix-strip path that produced noise on non-entity tools (get_status → "status", set_entity_sensor → "entity_sensor", do_something_clever → "something_clever"). Roy-3 normalization would have silently attributed pain events to fake entity classes. Dropped the verb-strip path entirely. Tool authors now opt into Roy-3 attribution by passing entity_class through params. Roy-3 normalization skips None, so being conservative is strictly safer than producing wrong buckets. Docstring adds two bio-fidelity guardrails: explicit "DO NOT consume from substrate write paths" + 1.1 TODO pointing to a declared Tool.entity_class field (tracks feedback_two_identity_schemes.md).
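A minimal sketch of the post-fold, params-only derivation, assuming the opt-in keys and priority order listed in the fold commit message (`entity_class`, then `target`, `entity`, `object`). The function name and the string-only check are hypothetical; the real `_derive_entity_class` may differ in both.

```python
from typing import Any, Optional

# Opt-in keys in assumed priority order.
_OPT_IN_KEYS = ("entity_class", "target", "entity", "object")


def derive_entity_class_sketch(tool_name: str, params: Any) -> Optional[str]:
    """Return an entity class only when the caller opted in via params.

    tool_name is accepted but deliberately ignored: the verb-strip
    heuristic on tool names was dropped because it produced fake classes.
    """
    if not isinstance(params, dict):
        return None
    for key in _OPT_IN_KEYS:
        value = params.get(key)
        if isinstance(value, str) and value:
            return value
    return None


no_opt_in = derive_entity_class_sketch("get_status", {})
from_target = derive_entity_class_sketch("pet", {"target": "dog"})
explicit_wins = derive_entity_class_sketch(
    "pet", {"target": "dog", "entity_class": "canine"}
)
```

Under these assumptions `get_status` with empty params yields `None` rather than `"status"`, so a consumer that skips `None` loses a data point instead of gaining a wrong bucket.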

4 other Important findings folded:

| Lens | Finding | Fix |
| --- | --- | --- |
| Bio I3 | `cluster_reward_bias_consulted=None` on empty-scores conflates "cluster unknown" with "cluster known, no tool scored" | `0.0` sentinel when cluster_id known, `None` only when truly absent. Pinned. |
| Arch I5 | Manual `set_context`/`reset_context` vs `context_scope()` helper | Switched to `context_scope()` on both AUT + orch thread bindings, so future sim entry points cannot forget the reset |
| Arch I4 | Header-line breaks "every line is a record" reader expectations | Explicit reader contract in `save_action_log` docstring + `test_consumer_can_skip_header_line` regression guard |
| Arch N2 | `except Exception` swallows real bugs in `_emit_recommend_action_event` | Narrowed to `except ImportError` (the only documented non-sim-runtime case) |
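For the Arch I4 row, a reader that honors the header-skip contract might look like the sketch below. It assumes the header line carries `_record_kind == "header"`, as the fold commit message describes; `iter_action_records` and the sample lines are hypothetical, not taken from the repository.

```python
import json
from typing import Iterable, Iterator


def iter_action_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield action records from actions.jsonl lines, skipping the header."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        obj = json.loads(raw)
        if obj.get("_record_kind") == "header":
            continue  # metadata line, not an action record
        yield obj


# Hypothetical file contents: one header, then two records.
sample_lines = [
    '{"_record_kind": "header", "_format_version": "1.1"}',
    '{"tool_name": "examine", "agent_id": "sim_aut", "entity_class": null}',
    '{"tool_name": "pet", "agent_id": "sim_aut", "entity_class": "dog"}',
]
records = list(iter_action_records(sample_lines))
```

Filtering on the marker key rather than on line position keeps the reader correct for pre-0b files that have no header at all.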

Plus one bio nice-to-have: comment on the AUT/orch agent_id binding documenting that the current sim-fixed strings are correct for single-AUT topology but NPCs spawned via AgentFactory would need per-spawn context_scope.
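The binding pattern behind Arch I5 and the note above can be sketched with the standard `contextvars` module. This is an assumption-laden illustration, not the helper in utils/http.py: `context_scope_sketch`, `current_context_sketch`, and the dict-shaped context are hypothetical. It shows each thread seeing its own `agent_id` and the context being reset even when the body raises.

```python
import contextvars
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

_ctx: contextvars.ContextVar = contextvars.ContextVar(
    "request_context", default=None
)


@contextmanager
def context_scope_sketch(agent_id: str, session_id: str) -> Iterator[None]:
    """Bind a context for the duration of the block, then always reset it."""
    token = _ctx.set({"agent_id": agent_id, "session_id": session_id})
    try:
        yield
    finally:
        _ctx.reset(token)  # runs on normal exit and on exceptions


def current_context_sketch() -> Optional[Dict[str, str]]:
    return _ctx.get()


seen: Dict[str, str] = {}


def worker(agent_id: str) -> None:
    # Each spawned agent or thread opens its own scope.
    with context_scope_sketch(agent_id, "session-1"):
        seen[agent_id] = current_context_sketch()["agent_id"]


threads = [
    threading.Thread(target=worker, args=(name,))
    for name in ("sim_aut", "sim_orchestrator")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Reset-on-exception: the binding does not leak past the failed block.
try:
    with context_scope_sketch("npc_1", "session-1"):
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
leaked = current_context_sketch()
```

A per-spawn scope for NPCs would follow the same shape as `worker`: one `with` block per spawned agent, with its own `agent_id`.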

Deferred (not blocking): sub-second t ordering (docs only, no code change), third-thread LLM worker context inheritance test (existing Plan 4 A.2 fallback already covers it), nice-to-have polish.

Test plan

python -m pytest tests/unit/test_stage_0b_0c_telemetry.py -q — 32 passed across 8 layers (ActionRecord, entity_class derivation, InstrumentedExecutor, compression, save_action_log, sim_action, sim_recommend_action emission, RequestContext binding).
python -m pytest tests/unit/test_stage_0b_0c_telemetry.py tests/unit/test_nac*.py tests/unit/test_simulation_agent.py tests/unit/test_save_aut_state.py -q — 142 passed (no regression).
python -m pytest tests/ -x -q -m "not slow" --ignore=tests/integration/test_memory_hub.py — 6600 passed (full fast suite before fold).
ruff check + ruff format clean on touched files (2 pre-existing F821/F841 errors in orchestrator.py are unrelated — confirmed against main).
Next: Roy-3 validation iteration uses this telemetry to measure whether the wires reach the LLM proposer's decision pathway. Not this PR.

What's next in 0.9.1

Per release_0_9_1.md:

✅ Stage 0a (Roy-2c probe) — shipped earlier
✅ Stages 0b + 0c (telemetry) — this PR
⏳ Stage 1 (Wire 3: embodiment-state → tool filter)
⏳ Stage 2 (Wire-A: cluster-bias annotation) — PR #253 open
⏳ Stage 3 (Wire 2: Pavlovian percept aversion)
⏳ Stage 4 (Wire 1: risk-sensitive action annotation)
⏳ Stage 5 (Roy-3 validation)

🤖 Generated with Claude Code

…on emission

Telemetry instrumentation prereqs from release_0_9_1.md (lifted verbatim from bio_emergent_persona_foundations.md Stage 0). Ships the measurement infrastructure Roy-3 needs to disambiguate whether Wire-A's annotation actually reached the LLM proposer's decision pathway. Without this, Roy-3 reads tool counts at the action layer but can't trace WHY each call was made.

Stage 0b (Action JSONL telemetry):

- ActionRecord (CC3-additive on the frozen dataclass): appends agent_id, session_id, entity_class optional fields with None defaults. Existing third-party ActionSink consumers continue to work unchanged.
- RequestContext binding at sim orchestrator entry: AUT thread binds in _aut_worker; main-thread orchestrator binds in the orchestrator-loop try/finally. Per-thread ContextVar scoping means each worker sees its own agent_id ("sim_aut" vs "sim_orchestrator") + the shared session_id from the entry-built timestamp. reset_context in finally guards against per-call leak.
- InstrumentedExecutor.execute() reads utils/http.py::current_context() + derives entity_class via best-effort heuristic (params["entity_class"] → params["target"]/"entity"/"object" → tool-name verb-prefix-strip + role-suffix-strip). Verb-only tools (respond, examine) return None; Roy-3 normalization aggregates with None skipped from exposure counts.
- save_action_log writes a _format_version header line at the head of actions.jsonl per CC1 contract, plus the three new telemetry fields per record.
- RecordingSink._compress_oldest preserves telemetry through compression (tiny fields, full attribution survives audit trail).

Stage 0c (recommend_action emission):

- NAc.recommend_action emits exactly one sim_log("NAc_RECOMMEND", ...) event per call, including all three early-exit paths (no scores, sub-threshold, success). Per the plan: "the event MUST emit even when recommend_action returns None". Roy-3 needs to distinguish "gate fired, consumer did nothing" from "consumer didn't run at all."
- Fields: tick (int(time.time()) bucket), current_cluster_id, cluster_reward_bias_consulted (the value read from _cluster_reward_bias for the active cluster only, NOT the agent-wide Wire-A aggregate; mismatch between rendered Wire-A signal and consulted recommend_action signal is the H1 failure mode), best_tool, best_score, min_confidence, passed_gate.
- _emit_recommend_action_event helper at top of nac.py is fail-soft: non-sim runtime calls (e.g., headless API, unit tests without sim logging enabled) don't crash on missing telemetry plumbing.

Stage 0b/0c NAc snapshots interpretation: per-stage save_aut_state calls (already wired since PR #248) satisfy "session boundary" for Roy's multi-stage harness pattern. Each Roy stage produces its own session_id with its own aut_nac.json, so reward_bias evolution is plottable across the priming-stage sequence. Intra-session checkpoints (within a single sim_id) are a follow-up if needed.

Sim-action interface change: sim_action() grows entity_class + **kwargs keyword-only parameters. Existing 2/3-arg positional calls keep working; new callers can pass entity_class. Field omitted from JSONL when None (avoids null-noise in Roy-3 records).

Test surface (27 tests, 8 layers):

- Layer 1 (ActionRecord): back-compat shape + new fields populated.
- Layer 2 (entity_class derivation): all 4 fallback paths, priority order, verb-only tools → None, non-dict params → None.
- Layer 3 (InstrumentedExecutor): context-bound + context-unbound paths; record_block also carries telemetry.
- Layer 4 (compression): tiny fields survive _compress_oldest.
- Layer 5 (save_action_log): _format_version header, telemetry fields per record, header even with 0 records.
- Layer 6 (sim_action): legacy call shape, entity_class threading, None-omission.
- Layer 7 (sim_recommend_action): emission on all 3 paths + fail-soft when sim logging disabled.
- Layer 8 (RequestContext binding): round-trip + reset-on-exception.

All passing. ruff clean on touched files (2 pre-existing ruff errors in orchestrator.py are unrelated to this PR).

Frozen contract impact: ActionRecord SHAPE-FROZEN at 1.0 (CC3). Optional fields appended at end with defaults are non-breaking; docstring updated to declare the new fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two-lens pre-merge review of feat/0-9-1-stage-0b-0c-telemetry surfaced 3 Critical architecture findings + 1 cross-confirmed Important finding (both lenses) + 4 Important findings (mixed lens). All folded before opening the PR per feedback_review_before_ship.md. 32 tests passing (up from 27, with 5 new fold-regression guards).

Critical (architecture):

1. tick mismatch with Stage 0d. Pre-fold, sim_recommend_action emitted tick=int(time.time()) (raw epoch ~1.7e9); Stage 0d's sim_ec_activation emits tick=int(time.time() - _sim_start) (elapsed seconds). A 1e9 offset would have made Roy-3 left-joins on tick return zero matches every time. Fix imports sim_logger and subtracts _sim_start. Pinned by test_tick_aligned_with_sim_logger_start.
2. _format_version "1.0" → "1.1". Plan's "Cross-cutting: persistence schema" section explicitly pins this at "1.1" (minor bump from pre-0b unversioned per CC1's "0.x" sentinel rule). Pre-fold shipped "1.0", so readers branching on version would have read the wrong dialect. Bumped + extracted to _ACTIONS_JSONL_FORMAT_VERSION module constant so the next bump is a one-line change.
3. Fourth silent-return-None early-exit was missing emission. recommend_action's `if not available_tools: return None` bailed before the emitter, so Roy-3 couldn't distinguish "no tools available" from "no tools scored above gate." Pinned by test_emission_on_empty_available_tools_path.

Cross-confirmed (architecture I2 + bio I1+I2): entity_class heuristic scoped to strict opt-in. Pre-fold, _derive_entity_class included a tool-name verb-prefix-strip heuristic that produced noise on non-entity tools: get_status → "status", set_entity_sensor → "entity_sensor", do_something_clever → "something_clever". Roy-3 normalization would have silently attributed pain events to fake entity classes. Bio-lens flagged the contamination question; arch-lens flagged the false-positive rate. Fix drops the verb-strip path entirely. Tool authors opt into Roy-3 attribution by passing entity_class through params (entity_class / target / entity / object). The post-fold heuristic is conservative: Roy-3 normalization skips None, so being more conservative is strictly safer than producing wrong buckets (silent miscount is worse than missing data).

Docstring adds two bio-fidelity guardrails:

- "DO NOT consume this field from any substrate write path": walls entity_class off from NAc/EC/ATL/Hippocampus/PainBus, making the contamination question structurally unambiguous.
- 1.1 TODO pointing to a declared `Tool.entity_class` field as the future shape; tracks the same surface as feedback_two_identity_schemes.md.

Tests: test_tool_name_alone_does_not_derive verifies the dropped heuristic; test_non_entity_tools_with_underscores_return_none pins the false-positive regression guard.

Bio I3: empty-scores cluster sentinel. On the empty-scores path, cluster_reward_bias_consulted was always None, conflating "agent had no active cluster" with "agent had a cluster but no tools scored." Roy-3 H1 disambiguation needs the distinction (the Wire-A vs recommend_action gap is exactly here). Post-fold: 0.0 sentinel when current_cluster_id is set, None when truly absent. Pinned by test_empty_scores_sentinel_distinguishes_cluster_known_vs_unknown.

Architecture I5: use context_scope() helper instead of manual set_context/reset_context. Both AUT and orchestrator thread bindings now use the canonical helper from utils/http.py; future sim entry points cannot forget the reset.

Architecture I4: explicit header-skip reader contract in save_action_log docstring. Pinned by test_consumer_can_skip_header_line which simulates the documented "skip _record_kind == 'header'" reader pattern.

Bio nice-to-have: comment on the AUT/orch agent_id binding documenting that the current sim-fixed strings are correct for the single-AUT topology but NPCs spawned via AgentFactory in this orchestrator session would need per-spawn context_scope.

Architecture nice-to-have: narrowed `except Exception` to `except ImportError` in _emit_recommend_action_event. Non-sim runtime is the only documented swallow case; other exceptions propagate so a real sim_logger bug surfaces.

Deferred (architecture I1, I6, N1-N3): sub-second t ordering (documentation, no code), third-thread LLM worker test (existing Plan 4 A.2 inheritance), nice-to-have polish. Tracked in fold review thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
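The emit-on-every-path rule and the Bio I3 sentinel described in this commit message can be sketched as follows. This is a toy under stated assumptions, not the logic in nac.py: `recommend_action_sketch`, the `EVENTS` list standing in for `sim_log`, the scoring inputs, and the `path` label are all hypothetical (`path` is added only to make the four cases visible and is not a field named in the source). Reusing the sentinel on the no-tools path is also an assumption.

```python
from typing import Dict, List, Optional

EVENTS: List[dict] = []  # stand-in for the sim_log sink


def recommend_action_sketch(
    available_tools: List[str],
    scores: Dict[str, float],
    cluster_id: Optional[str],
    cluster_bias: Dict[str, float],
    min_confidence: float = 0.5,
) -> Optional[str]:
    """Pick the best-scoring tool, emitting exactly one event per call."""

    def emit(path, best_tool, best_score, passed_gate, consulted):
        EVENTS.append({
            "path": path,  # illustration-only label
            "current_cluster_id": cluster_id,
            "cluster_reward_bias_consulted": consulted,
            "best_tool": best_tool,
            "best_score": best_score,
            "min_confidence": min_confidence,
            "passed_gate": passed_gate,
        })

    # 0.0 when a cluster is known, None only when no cluster is active.
    sentinel = 0.0 if cluster_id is not None else None

    if not available_tools:
        emit("no_tools", None, 0.0, False, sentinel)
        return None

    candidates = {t: scores[t] for t in available_tools if t in scores}
    if not candidates:
        emit("no_scores", None, 0.0, False, sentinel)
        return None

    consulted = (
        cluster_bias.get(cluster_id, 0.0) if cluster_id is not None else None
    )
    best_tool = max(candidates, key=candidates.get)
    best_score = candidates[best_tool]
    passed = best_score >= min_confidence
    emit("success" if passed else "sub_threshold",
         best_tool, best_score, passed, consulted)
    return best_tool if passed else None


bias = {"c1": 0.3}
results = [
    recommend_action_sketch([], {}, "c1", bias),               # no tools
    recommend_action_sketch(["pet"], {}, "c1", bias),          # no scores
    recommend_action_sketch(["pet"], {"pet": 0.2}, "c1", bias),  # below gate
    recommend_action_sketch(["pet"], {"pet": 0.9}, "c1", bias),  # success
]
```

Three of the four calls return `None`, yet each leaves a distinct event behind, which is what lets a downstream consumer tell the cases apart.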

dennys246 and others added 2 commits May 15, 2026 16:26

dennys246 merged commit 242235a into main May 16, 2026
5 checks passed

dennys246 deleted the feat/0-9-1-stage-0b-0c-telemetry branch May 16, 2026 04:21

dennys246 mentioned this pull request May 17, 2026

feat(0.9.1): Wire 1 — risk-sensitive action annotation (Stage 4) #257

Merged

5 tasks
