feat(0.9.1): Stages 0b + 0c — action JSONL telemetry + recommend_action emission #254

Merged
dennys246 merged 2 commits into main from feat/0-9-1-stage-0b-0c-telemetry
May 16, 2026

@dennys246
Owner

Summary

Stages 0b + 0c of release_0_9_1.md. Telemetry instrumentation prereqs for Roy-3 measurement. The wires (Wire 3 / Wire-A / Wire 2 / Wire 1) ship as behavioral changes; 0b/0c ship the observability infrastructure those wires need to be measured.

Per the plan: "Stage 0a-d first because telemetry blocks observation. Roy-2c (one env var) lands before any wire work; Stage 0b-c lands as the structural prerequisite for measuring whether subsequent wires produced behavioral signal."

What ships

| Stage | What | Where |
| --- | --- | --- |
| 0b: Action JSONL telemetry | `ActionRecord` gains `agent_id` / `session_id` / `entity_class` (CC3-additive on the frozen dataclass) | `src/maxim/simulation/sinks.py` |
| | `RequestContext` bound at sim orchestrator entry via `context_scope()` on AUT + orch threads | `src/maxim/simulation/orchestrator.py` |
| | `InstrumentedExecutor` reads `current_context()` + derives `entity_class` strictly opt-in (params-only after pre-merge fold) | `src/maxim/simulation/instrumented_executor.py` |
| | `save_action_log` writes `_format_version: "1.1"` header line + new fields per record | `src/maxim/simulation/report.py` |
| | `sim_action` grows `entity_class` kwarg threaded through `sim_log`'s data dict | `src/maxim/simulation/sim_logger.py` |
| 0c: recommend_action emission | `NAc.recommend_action` emits one `sim_log("NAc_RECOMMEND", ...)` event per call (all 4 return paths; Roy-3 must distinguish "no tools available" / "no scores" / "sub-threshold" / "success") | `src/maxim/decisions/nac.py` |
| | `tick` aligned with Stage 0d's `sim_ec_activation` (`int(time.time() - _sim_start)`) so Roy-3 cross-channel joins work | same file |

Total: +1,028 / -148, 8 files. 32 unit tests passing across 8 layers.
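
The "CC3-additive" change in the 0b rows can be illustrated with a minimal sketch. Only the three telemetry field names come from this PR; `tool_name` and `params` are hypothetical stand-ins for the real pre-existing `ActionRecord` fields, and the session id value is made up.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ActionRecord:
    # Hypothetical stand-ins for the pre-existing fields.
    tool_name: str
    params: Dict[str, Any]
    # Telemetry fields appended at the end with None defaults, so
    # callers that construct the record the old way keep working.
    agent_id: Optional[str] = None
    session_id: Optional[str] = None
    entity_class: Optional[str] = None


# Legacy call shape: no telemetry arguments at all.
legacy = ActionRecord("examine", {"target": "door"})

# New call shape: telemetry passed by keyword.
tagged = ActionRecord(
    "examine",
    {"target": "door"},
    agent_id="sim_aut",
    session_id="example-session",  # made-up value
    entity_class="door",
)
```

Appending defaulted fields after the required ones is what keeps the frozen shape non-breaking: positional construction of the old fields is unaffected, and consumers that never read the new attributes see no change.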

NAc snapshots at session boundary

Per the plan: "Save NAc snapshots at session boundary (not just final) so reward_bias evolution is plottable." Roy's multi-stage harness pattern (each stage produces its own session_id with its own aut_nac.json) already satisfies this since PR #248 wired EC + ATL into save_aut_state. Reward_bias evolution is plottable across the priming-stage sequence today. Intra-session checkpoints (within a single sim_id) are a follow-up if/when cradle continuous developmental sims need them — flagged by the bio-fidelity reviewer as a 1.0 follow-up, not a 0.9.1 gap.

Two-lens pre-merge review

Per feedback_review_before_ship.md, spawned parallel architecture + bio-fidelity reviews. Folded into commit 011c995 before opening this PR:

3 Critical (architecture lens):

| Finding | Fix |
| --- | --- |
| `int(time.time())` did not align with Stage 0d's `int(time.time() - _sim_start)` tick space, so Roy-3 cross-channel joins would have returned zero matches every time | Imported `sim_logger._sim_start`, subtracted; pinned by `test_tick_aligned_with_sim_logger_start` |
| `_format_version: "1.0"` vs plan-specified `"1.1"`; readers branching on version would have read the wrong dialect | Bumped to `"1.1"` + extracted to `_ACTIONS_JSONL_FORMAT_VERSION` constant |
| Fourth silent-return-None early-exit (`if not available_tools`) skipped emission, so Roy-3 couldn't distinguish "no tools available" from "no tools scored above gate" | Added emission with `best_tool=None`, `best_score=0.0`, `passed_gate=False`; pinned by `test_emission_on_empty_available_tools_path` |
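
To make the first finding concrete, here is a small sketch of why a raw-epoch tick never joins against an elapsed-seconds tick. The event dicts, the `join_on_tick` helper and the fixed start time are illustrative assumptions, not the Roy-3 join code.

```python
from collections import defaultdict


def elapsed_tick(now: float, sim_start: float) -> int:
    """Tick in elapsed whole seconds since sim start (Stage 0d's space)."""
    return int(now - sim_start)


def join_on_tick(left, right):
    """Inner-join two event lists on their integer 'tick' field."""
    by_tick = defaultdict(list)
    for row in right:
        by_tick[row["tick"]].append(row)
    return [(l, r) for l in left for r in by_tick.get(l["tick"], [])]


SIM_START = 1_700_000_000.0  # fixed fake epoch so the demo is deterministic
wall_clock = [SIM_START + 3.2, SIM_START + 5.9]

ec_events = [{"tick": elapsed_tick(t, SIM_START)} for t in wall_clock]
nac_pre_fold = [{"tick": int(t)} for t in wall_clock]  # raw epoch ticks
nac_post_fold = [{"tick": elapsed_tick(t, SIM_START)} for t in wall_clock]

print(len(join_on_tick(nac_pre_fold, ec_events)))   # 0
print(len(join_on_tick(nac_post_fold, ec_events)))  # 2
```

The failure is silent in the pre-fold case: the join runs without error and simply returns nothing, which is why a dedicated alignment test is worth having.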

Cross-confirmed (Arch I2 + Bio I1/I2) — entity_class heuristic:

Pre-fold, _derive_entity_class included a verb-prefix-strip + role-suffix-strip path that produced noise on non-entity tools (get_status → "status", set_entity_sensor → "entity_sensor", do_something_clever → "something_clever"). Roy-3 normalization would have silently attributed pain events to fake entity classes. Dropped the verb-strip path entirely. Tool authors now opt into Roy-3 attribution by passing entity_class through params. Roy-3 normalization skips None, so being conservative is strictly safer than producing wrong buckets. Docstring adds two bio-fidelity guardrails: an explicit "DO NOT consume from substrate write paths" warning and a 1.1 TODO pointing to a declared Tool.entity_class field (tracks feedback_two_identity_schemes.md).
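
A minimal sketch of the post-fold, params-only derivation follows. The key names and their priority order are taken from the commit message (entity_class, then target / entity / object); the function name, the string-only check and the unused `tool_name` argument are assumptions for illustration and may differ from the real `_derive_entity_class`.

```python
from typing import Any, Optional

# Param keys consulted, in priority order.
_ENTITY_KEYS = ("entity_class", "target", "entity", "object")


def derive_entity_class(tool_name: str, params: Any) -> Optional[str]:
    """Return an entity class only when the caller opted in via params.

    The tool name is deliberately ignored: stripping verb prefixes from
    names such as "get_status" is what produced fake buckets pre-fold.
    """
    if not isinstance(params, dict):
        return None
    for key in _ENTITY_KEYS:
        value = params.get(key)
        # Assumption: only non-empty strings count as a class label.
        if isinstance(value, str) and value:
            return value
    return None


print(derive_entity_class("get_status", {}))                  # None
print(derive_entity_class("approach", {"target": "wolf"}))    # wolf
```

Returning None for anything ambiguous matches the stated trade-off: a skipped record costs some coverage, while a wrong bucket corrupts the exposure counts.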

4 other Important findings folded:

| Lens | Finding | Fix |
| --- | --- | --- |
| Bio I3 | `cluster_reward_bias_consulted=None` on empty-scores conflates "cluster unknown" with "cluster known, no tool scored" | `0.0` sentinel when `cluster_id` known, `None` only when truly absent. Pinned. |
| Arch I5 | Manual `set_context`/`reset_context` vs `context_scope()` helper | Switched to `context_scope()` on both AUT + orch thread bindings, so future sim entry points cannot forget the reset |
| Arch I4 | Header-line breaks "every line is a record" reader expectations | Explicit reader contract in `save_action_log` docstring + `test_consumer_can_skip_header_line` regression guard |
| Arch N2 | `except Exception` swallows real bugs in `_emit_recommend_action_event` | Narrowed to `except ImportError` (the only documented non-sim-runtime case) |
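
The Arch I4 reader contract can be sketched as below. The `_format_version` key and the "skip `_record_kind == 'header'`" rule come from this PR; the exact header layout, the helper names and the sample records are assumptions.

```python
import io
import json

_ACTIONS_JSONL_FORMAT_VERSION = "1.1"


def write_action_log(fp, records):
    """Write one header line, then one JSON object per record."""
    header = {
        "_record_kind": "header",
        "_format_version": _ACTIONS_JSONL_FORMAT_VERSION,
    }
    fp.write(json.dumps(header) + "\n")
    for record in records:
        fp.write(json.dumps(record) + "\n")


def read_action_records(fp):
    """Yield records only, skipping the header line and blank lines."""
    for line in fp:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if obj.get("_record_kind") == "header":
            continue
        yield obj


buf = io.StringIO()
write_action_log(buf, [{"tool": "examine", "agent_id": "sim_aut"}])
buf.seek(0)
print(list(read_action_records(buf)))
```

A reader written this way also handles a log with zero records, since the header is filtered by kind rather than by position.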

Plus one bio nice-to-have: comment on the AUT/orch agent_id binding documenting that the current sim-fixed strings are correct for single-AUT topology but NPCs spawned via AgentFactory would need per-spawn context_scope.
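
For readers unfamiliar with the pattern behind Arch I5 and the note above, here is a hedged sketch of a `context_scope()` helper built on `contextvars`. The names `RequestContext`, `current_context` and `context_scope` mirror the ones in this PR, but the implementation is an assumption and not the code in utils/http.py.

```python
import contextvars
import threading
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    agent_id: str
    session_id: str


_current = contextvars.ContextVar("request_context", default=None)


def current_context():
    return _current.get()


@contextmanager
def context_scope(ctx):
    """Bind ctx for the duration of the block; always reset on exit."""
    token = _current.set(ctx)
    try:
        yield ctx
    finally:
        _current.reset(token)


seen = {}


def aut_worker():
    # Each thread binds its own agent_id with the shared session_id.
    with context_scope(RequestContext("sim_aut", "s1")):
        seen["worker"] = current_context().agent_id


with context_scope(RequestContext("sim_orchestrator", "s1")):
    thread = threading.Thread(target=aut_worker)
    thread.start()
    thread.join()
    seen["main"] = current_context().agent_id

print(seen["worker"], seen["main"])  # sim_aut sim_orchestrator
```

Because the reset lives in the helper's `finally`, a new entry point (for example a per-spawn NPC binding) only has to write the `with` line; there is no separate cleanup call to forget.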

Deferred (not blocking): sub-second t ordering (docs only, no code change), third-thread LLM worker context inheritance test (existing Plan 4 A.2 fallback already covers it), nice-to-have polish.

Test plan

  • python -m pytest tests/unit/test_stage_0b_0c_telemetry.py -q → 32 passed across 8 layers (ActionRecord, entity_class derivation, InstrumentedExecutor, compression, save_action_log, sim_action, sim_recommend_action emission, RequestContext binding).
  • python -m pytest tests/unit/test_stage_0b_0c_telemetry.py tests/unit/test_nac*.py tests/unit/test_simulation_agent.py tests/unit/test_save_aut_state.py -q → 142 passed (no regression).
  • python -m pytest tests/ -x -q -m "not slow" --ignore=tests/integration/test_memory_hub.py → 6600 passed (full fast suite before fold).
  • ruff check + ruff format clean on touched files (2 pre-existing F821/F841 errors in orchestrator.py are unrelated — confirmed against main).
  • Next: Roy-3 validation iteration uses this telemetry to measure whether the wires reach the LLM proposer's decision pathway. Not this PR.

What's next in 0.9.1

Per release_0_9_1.md:

  1. ✅ Stage 0a (Roy-2c probe) — shipped earlier
  2. Stages 0b + 0c (telemetry) — this PR
  3. ⏳ Stage 1 (Wire 3: embodiment-state → tool filter)
  4. ⏳ Stage 2 (Wire-A: cluster-bias annotation) — PR #253 open
  5. ⏳ Stage 3 (Wire 2: Pavlovian percept aversion)
  6. ⏳ Stage 4 (Wire 1: risk-sensitive action annotation)
  7. ⏳ Stage 5 (Roy-3 validation)

🤖 Generated with Claude Code

dennys246 and others added 2 commits May 15, 2026 16:26
…on emission

Telemetry instrumentation prereqs from release_0_9_1.md (lifted
verbatim from bio_emergent_persona_foundations.md Stage 0). Ships
the measurement infrastructure Roy-3 needs to disambiguate whether
Wire-A's annotation actually reached the LLM proposer's decision
pathway — without this, Roy-3 reads tool counts at the action layer
but can't trace WHY each call was made.

Stage 0b — Action JSONL telemetry:

- ActionRecord (CC3-additive on the frozen dataclass): appends
  agent_id, session_id, entity_class optional fields with None
  defaults. Existing third-party ActionSink consumers continue to
  work unchanged.
- RequestContext binding at sim orchestrator entry: AUT thread
  binds in _aut_worker; main-thread orchestrator binds in the
  orchestrator-loop try/finally. Per-thread ContextVar scoping
  means each worker sees its own agent_id ("sim_aut" vs
  "sim_orchestrator") + the shared session_id from the entry-built
  timestamp. reset_context in finally guards against per-call leak.
- InstrumentedExecutor.execute() reads utils/http.py::current_context()
  + derives entity_class via best-effort heuristic
  (params["entity_class"] → params["target"]/"entity"/"object" →
  tool-name verb-prefix-strip + role-suffix-strip). Verb-only tools
  (respond, examine) return None — Roy-3 normalization aggregates
  with None skipped from exposure counts.
- save_action_log writes a _format_version header line at the head
  of actions.jsonl per CC1 contract, plus the three new telemetry
  fields per record.
- RecordingSink._compress_oldest preserves telemetry through
  compression (tiny fields, full attribution survives audit trail).

Stage 0c — recommend_action emission:

- NAc.recommend_action emits exactly one sim_log("NAc_RECOMMEND", ...)
  event per call, including all three early-exit paths (no scores,
  sub-threshold, success). Per the plan: "the event MUST emit even
  when recommend_action returns None" — Roy-3 needs to distinguish
  "gate fired, consumer did nothing" from "consumer didn't run at all."
- Fields: tick (int(time.time()) bucket), current_cluster_id,
  cluster_reward_bias_consulted (the value read from
  _cluster_reward_bias for the active cluster only — NOT the
  agent-wide Wire-A aggregate; mismatch between rendered Wire-A
  signal and consulted recommend_action signal is the H1 failure
  mode), best_tool, best_score, min_confidence, passed_gate.
- _emit_recommend_action_event helper at top of nac.py is fail-soft —
  non-sim runtime calls (e.g., headless API, unit tests without
  sim logging enabled) don't crash on missing telemetry plumbing.

Stage 0b/0c NAc snapshots interpretation: per-stage save_aut_state
calls (already wired since PR #248) satisfy "session boundary" for
Roy's multi-stage harness pattern. Each Roy stage produces its own
session_id with its own aut_nac.json — reward_bias evolution is
plottable across the priming-stage sequence. Intra-session
checkpoints (within a single sim_id) are a follow-up if needed.

Sim-action interface change: sim_action() grows entity_class +
**kwargs keyword-only parameters. Existing 2/3-arg positional calls
keep working; new callers can pass entity_class. Field omitted
from JSONL when None (avoids null-noise in Roy-3 records).

Test surface (27 tests, 8 layers):

- Layer 1 (ActionRecord): back-compat shape + new fields populated.
- Layer 2 (entity_class derivation): all 4 fallback paths, priority
  order, verb-only tools → None, non-dict params → None.
- Layer 3 (InstrumentedExecutor): context-bound + context-unbound
  paths; record_block also carries telemetry.
- Layer 4 (compression): tiny fields survive _compress_oldest.
- Layer 5 (save_action_log): _format_version header, telemetry
  fields per record, header even with 0 records.
- Layer 6 (sim_action): legacy call shape, entity_class threading,
  None-omission.
- Layer 7 (sim_recommend_action): emission on all 3 paths +
  fail-soft when sim logging disabled.
- Layer 8 (RequestContext binding): round-trip + reset-on-exception.

All passing. ruff clean on touched files (2 pre-existing ruff
errors in orchestrator.py are unrelated to this PR).

Frozen contract impact: ActionRecord SHAPE-FROZEN at 1.0 (CC3) —
optional fields appended at end with defaults are non-breaking;
docstring updated to declare the new fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-lens pre-merge review of feat/0-9-1-stage-0b-0c-telemetry surfaced
3 Critical architecture findings + 1 cross-confirmed Important
finding (both lenses) + 4 Important findings (mixed lens). All folded
before opening the PR per feedback_review_before_ship.md. 32 tests
passing (up from 27 — 5 new fold-regression guards).

Critical (architecture):

1. tick mismatch with Stage 0d. Pre-fold, sim_recommend_action emitted
   tick=int(time.time()) (raw epoch ~1.7e9); Stage 0d's
   sim_ec_activation emits tick=int(time.time() - _sim_start)
   (elapsed seconds). A 1e9 offset would have made Roy-3 left-joins
   on tick return zero matches every time. Fix imports sim_logger
   and subtracts _sim_start. Pinned by
   test_tick_aligned_with_sim_logger_start.

2. _format_version "1.0" → "1.1". Plan's "Cross-cutting: persistence
   schema" section explicitly pins this at "1.1" (minor bump from
   pre-0b unversioned per CC1's "0.x" sentinel rule). Pre-fold
   shipped "1.0" — readers branching on version would have read the
   wrong dialect. Bumped + extracted to _ACTIONS_JSONL_FORMAT_VERSION
   module constant so the next bump is a one-line change.

3. Fourth silent-return-None early-exit was missing emission.
   recommend_action's `if not available_tools: return None` bailed
   before the emitter — Roy-3 couldn't distinguish "no tools
   available" from "no tools scored above gate." Pinned by
   test_emission_on_empty_available_tools_path.
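
The "exactly one event on every return path" rule can be sketched as follows. The `best_tool`, `best_score`, `min_confidence` and `passed_gate` fields come from the commit messages; the `reason` field, the in-memory `events` list and the scoring logic are hypothetical, added only so the four paths are visible in the sketch.

```python
from typing import Dict, List, Optional

events: List[dict] = []  # stand-in for the sim_log("NAc_RECOMMEND", ...) sink


def recommend_action(available_tools: List[str],
                     scores: Dict[str, float],
                     min_confidence: float = 0.5) -> Optional[str]:
    def emit(reason, best_tool, best_score, passed_gate):
        # "reason" is a hypothetical field, not part of the real event.
        events.append({
            "reason": reason,
            "best_tool": best_tool,
            "best_score": best_score,
            "min_confidence": min_confidence,
            "passed_gate": passed_gate,
        })

    if not available_tools:
        emit("no_tools_available", None, 0.0, False)
        return None
    scored = {t: scores[t] for t in available_tools if t in scores}
    if not scored:
        emit("no_scores", None, 0.0, False)
        return None
    best_tool = max(scored, key=scored.get)
    if scored[best_tool] < min_confidence:
        emit("sub_threshold", best_tool, scored[best_tool], False)
        return None
    emit("success", best_tool, scored[best_tool], True)
    return best_tool
```

Three of the four paths return None, so without an event per path a downstream reader sees identical outcomes for very different causes.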

Cross-confirmed (architecture I2 + bio I1+I2): entity_class heuristic
scoped to strict opt-in.

  Pre-fold, _derive_entity_class included a tool-name verb-prefix-
  strip heuristic that produced noise on non-entity tools: get_status
  → "status", set_entity_sensor → "entity_sensor", do_something_clever
  → "something_clever". Roy-3 normalization would have silently
  attributed pain events to fake entity classes. Bio-lens flagged
  the contamination question; arch-lens flagged the false-positive
  rate.

  Fix drops the verb-strip path entirely. Tool authors opt into
  Roy-3 attribution by passing entity_class through params
  (entity_class / target / entity / object). The post-fold heuristic
  is conservative: Roy-3 normalization skips None, so being more
  conservative is strictly safer than producing wrong buckets —
  silent miscount is worse than missing data.

  Docstring adds two bio-fidelity guardrails:
    - "DO NOT consume this field from any substrate write path" —
      walls entity_class off from NAc/EC/ATL/Hippocampus/PainBus,
      making the contamination question structurally unambiguous.
    - 1.1 TODO pointing to a declared `Tool.entity_class` field as
      the future shape — tracks the same surface as
      feedback_two_identity_schemes.md.

  Tests: test_tool_name_alone_does_not_derive verifies the dropped
  heuristic; test_non_entity_tools_with_underscores_return_none
  pins the false-positive regression guard.

Bio I3: empty-scores cluster sentinel.

  On the empty-scores path, cluster_reward_bias_consulted was always
  None — conflating "agent had no active cluster" with "agent had a
  cluster but no tools scored." Roy-3 H1 disambiguation needs the
  distinction (the Wire-A vs recommend_action gap is exactly here).
  Post-fold: 0.0 sentinel when current_cluster_id is set, None when
  truly absent. Pinned by
  test_empty_scores_sentinel_distinguishes_cluster_known_vs_unknown.
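
A small sketch of the sentinel rule described above. The function name and arguments are illustrative; only the "0.0 when the cluster is known but nothing scored, None when no cluster is active" behavior is from this commit, and the non-empty-scores branch is an assumption.

```python
from typing import Dict, Optional


def cluster_reward_bias_consulted(
    current_cluster_id: Optional[str],
    cluster_reward_bias: Dict[str, float],
    scores: Dict[str, float],
) -> Optional[float]:
    if current_cluster_id is None:
        return None  # no active cluster: nothing could be consulted
    if not scores:
        return 0.0   # cluster known, but no tool scored
    # Assumption: otherwise report the bias read for the active cluster.
    return cluster_reward_bias.get(current_cluster_id, 0.0)


print(cluster_reward_bias_consulted(None, {}, {}))          # None
print(cluster_reward_bias_consulted("c1", {"c1": 0.4}, {})) # 0.0
```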

Architecture I5: use context_scope() helper instead of manual
set_context/reset_context. Both AUT and orchestrator thread bindings
now use the canonical helper from utils/http.py; future sim entry
points cannot forget the reset.

Architecture I4: explicit header-skip reader contract in
save_action_log docstring. Pinned by test_consumer_can_skip_header_line
which simulates the documented "skip _record_kind == 'header'"
reader pattern.

Bio nice-to-have: comment on the AUT/orch agent_id binding
documenting that the current sim-fixed strings are correct for the
single-AUT topology but NPCs spawned via AgentFactory in this
orchestrator session would need per-spawn context_scope.

Architecture nice-to-have: narrowed `except Exception` to
`except ImportError` in _emit_recommend_action_event. Non-sim
runtime is the only documented swallow case; other exceptions
propagate so a real sim_logger bug surfaces.
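
The narrowed exception handling can be sketched like this. The helper below takes the logger module name as a parameter purely so the sketch is runnable in isolation; that parameter, the fake module and the `sim_log(kind, data)` signature are assumptions, not the real `_emit_recommend_action_event`.

```python
import importlib
import sys
import types


def emit_recommend_action_event(logger_module: str, **fields) -> bool:
    """Emit if sim logging is importable; let every other error surface."""
    try:
        logger = importlib.import_module(logger_module)
    except ImportError:
        # Non-sim runtime: telemetry plumbing is absent, so skip quietly.
        return False
    logger.sim_log("NAc_RECOMMEND", fields)  # real bugs propagate from here
    return True


# A fake logger module whose sim_log has a genuine bug.
broken = types.ModuleType("fake_sim_logger")


def _buggy_sim_log(kind, data):
    raise ValueError("real bug in sim_log")


broken.sim_log = _buggy_sim_log
sys.modules["fake_sim_logger"] = broken

print(emit_recommend_action_event("no_such_sim_logger_module", tick=3))  # False
```

Calling the helper with "fake_sim_logger" raises the ValueError instead of hiding it, which is the behavior the narrowing is meant to guarantee.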

Deferred (architecture I1, I6, N1-N3): sub-second t ordering
(documentation, no code), third-thread LLM worker test (existing
Plan 4 A.2 inheritance), nice-to-have polish. Tracked in fold
review thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dennys246 dennys246 merged commit 242235a into main May 16, 2026
5 checks passed
@dennys246 dennys246 deleted the feat/0-9-1-stage-0b-0c-telemetry branch May 16, 2026 04:21