feat(0.9.1): Stages 0b + 0c — action JSONL telemetry + recommend_action emission#254
Merged
Telemetry instrumentation prereqs from release_0_9_1.md (lifted
verbatim from bio_emergent_persona_foundations.md Stage 0). Ships
the measurement infrastructure Roy-3 needs to disambiguate whether
Wire-A's annotation actually reached the LLM proposer's decision
pathway — without this, Roy-3 reads tool counts at the action layer
but can't trace WHY each call was made.
Stage 0b — Action JSONL telemetry:
- ActionRecord (CC3-additive on the frozen dataclass): appends
agent_id, session_id, entity_class optional fields with None
defaults. Existing third-party ActionSink consumers continue to
work unchanged.
- RequestContext binding at sim orchestrator entry: AUT thread
binds in _aut_worker; main-thread orchestrator binds in the
orchestrator-loop try/finally. Per-thread ContextVar scoping
means each worker sees its own agent_id ("sim_aut" vs
"sim_orchestrator") + the shared session_id from the entry-built
timestamp. reset_context in finally guards against per-call leak.
- InstrumentedExecutor.execute() reads utils/http.py::current_context()
+ derives entity_class via best-effort heuristic
(params["entity_class"] → params["target"]/"entity"/"object" →
tool-name verb-prefix-strip + role-suffix-strip). Verb-only tools
(respond, examine) return None — Roy-3 normalization aggregates
with None skipped from exposure counts.
- save_action_log writes a _format_version header line at the head
of actions.jsonl per CC1 contract, plus the three new telemetry
fields per record.
- RecordingSink._compress_oldest preserves telemetry through
compression (tiny fields, full attribution survives audit trail).
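The additive change to ActionRecord can be sketched as below. This is a minimal illustration, not the project's dataclass: the pre-existing fields (tool, params) are hypothetical stand-ins, and only the three appended telemetry fields come from the description above.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ActionRecord:
    # Hypothetical pre-existing fields; the real frozen shape is not shown here.
    tool: str
    params: dict[str, Any]
    # Stage 0b telemetry: appended at the end, all optional with None defaults,
    # so callers that construct records the old way are unaffected.
    agent_id: Optional[str] = None
    session_id: Optional[str] = None
    entity_class: Optional[str] = None


# A legacy call site that knows nothing about telemetry still works.
legacy = ActionRecord("examine", {"target": "door"})

# A new call site fills in the attribution fields.
tagged = ActionRecord(
    "open",
    {"target": "door"},
    agent_id="sim_aut",
    session_id="session-1",
    entity_class="door",
)
```

Because defaults may only follow non-default fields in a dataclass, appending optional fields at the end is the one placement that keeps existing positional construction valid.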
Stage 0c — recommend_action emission:
- NAc.recommend_action emits exactly one sim_log("NAc_RECOMMEND", ...)
event per call on all three return paths (no scores,
sub-threshold, success). Per the plan: "the event MUST emit even
when recommend_action returns None" — Roy-3 needs to distinguish
"gate fired, consumer did nothing" from "consumer didn't run at all."
- Fields: tick (int(time.time()) bucket), current_cluster_id,
cluster_reward_bias_consulted (the value read from
_cluster_reward_bias for the active cluster only — NOT the
agent-wide Wire-A aggregate; mismatch between rendered Wire-A
signal and consulted recommend_action signal is the H1 failure
mode), best_tool, best_score, min_confidence, passed_gate.
- _emit_recommend_action_event helper at top of nac.py is fail-soft —
non-sim runtime calls (e.g., headless API, unit tests without
sim logging enabled) don't crash on missing telemetry plumbing.
Stage 0b/0c NAc snapshots interpretation: per-stage save_aut_state
calls (already wired since PR #248) satisfy "session boundary" for
Roy's multi-stage harness pattern. Each Roy stage produces its own
session_id with its own aut_nac.json — reward_bias evolution is
plottable across the priming-stage sequence. Intra-session
checkpoints (within a single sim_id) are a follow-up if needed.
Sim-action interface change: sim_action() grows entity_class +
**kwargs keyword-only parameters. Existing 2/3-arg positional calls
keep working; new callers can pass entity_class. Field omitted
from JSONL when None (avoids null-noise in Roy-3 records).
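A rough sketch of the None-omission behaviour follows. The positional parameters (tool, detail) and the return value are assumptions made for illustration; only the keyword-only entity_class, the **kwargs passthrough, and the omit-when-None rule come from the text above.

```python
import json


def sim_action(tool, detail=None, *, entity_class=None, **kwargs):
    """Build one JSONL line for an action event (illustrative shape only)."""
    data = {"tool": tool}
    if detail is not None:
        data["detail"] = detail
    # entity_class is written only when set, so records never carry a null.
    if entity_class is not None:
        data["entity_class"] = entity_class
    data.update(kwargs)
    return json.dumps(data, sort_keys=True)


legacy_line = sim_action("examine", "looked at the door")
tagged_line = sim_action("open", "opened the door", entity_class="door")
```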
Test surface (27 tests, 8 layers):
- Layer 1 (ActionRecord): back-compat shape + new fields populated.
- Layer 2 (entity_class derivation): all 4 fallback paths, priority
order, verb-only tools → None, non-dict params → None.
- Layer 3 (InstrumentedExecutor): context-bound + context-unbound
paths; record_block also carries telemetry.
- Layer 4 (compression): tiny fields survive _compress_oldest.
- Layer 5 (save_action_log): _format_version header, telemetry
fields per record, header even with 0 records.
- Layer 6 (sim_action): legacy call shape, entity_class threading,
None-omission.
- Layer 7 (sim_recommend_action): emission on all 3 paths +
fail-soft when sim logging disabled.
- Layer 8 (RequestContext binding): round-trip + reset-on-exception.
All passing. ruff clean on touched files (2 pre-existing ruff
errors in orchestrator.py are unrelated to this PR).
Frozen contract impact: ActionRecord SHAPE-FROZEN at 1.0 (CC3) —
optional fields appended at end with defaults are non-breaking;
docstring updated to declare the new fields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-lens pre-merge review of feat/0-9-1-stage-0b-0c-telemetry surfaced
3 Critical architecture findings + 1 cross-confirmed Important
finding (both lenses) + 4 Important findings (mixed lens). All folded
before opening the PR per feedback_review_before_ship.md. 32 tests
passing (up from 27 — 5 new fold-regression guards).
Critical (architecture):
1. tick mismatch with Stage 0d. Pre-fold, sim_recommend_action emitted
tick=int(time.time()) (raw epoch ~1.7e9); Stage 0d's
sim_ec_activation emits tick=int(time.time() - _sim_start)
(elapsed seconds). A 1e9 offset would have made Roy-3 left-joins
on tick return zero matches every time. Fix imports sim_logger
and subtracts _sim_start. Pinned by
test_tick_aligned_with_sim_logger_start.
2. _format_version "1.0" → "1.1". Plan's "Cross-cutting: persistence
schema" section explicitly pins this at "1.1" (minor bump from
pre-0b unversioned per CC1's "0.x" sentinel rule). Pre-fold
shipped "1.0" — readers branching on version would have read the
wrong dialect. Bumped + extracted to _ACTIONS_JSONL_FORMAT_VERSION
module constant so the next bump is a one-line change.
3. Fourth silent-return-None early-exit was missing emission.
recommend_action's `if not available_tools: return None` bailed
before the emitter — Roy-3 couldn't distinguish "no tools
available" from "no tools scored above gate." Pinned by
test_emission_on_empty_available_tools_path.
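Findings 1 and 3 together amount to "emit exactly once on every return path, with an elapsed-seconds tick". A simplified sketch, assuming a module-level events list in place of sim_log and a local _sim_start in place of sim_logger._sim_start (the scoring logic is invented for the example):

```python
import time

_sim_start = time.time()  # stand-in for sim_logger._sim_start
events = []               # stand-in for the sim_log sink


def _emit(best_tool, best_score, min_confidence, passed_gate):
    events.append({
        "event": "NAc_RECOMMEND",
        # Elapsed seconds, the same tick space described for sim_ec_activation.
        "tick": int(time.time() - _sim_start),
        "best_tool": best_tool,
        "best_score": best_score,
        "min_confidence": min_confidence,
        "passed_gate": passed_gate,
    })


def recommend_action(scores, available_tools, min_confidence=0.5):
    if not available_tools:                      # path 1: no tools available
        _emit(None, 0.0, min_confidence, False)
        return None
    candidates = {t: s for t, s in scores.items() if t in available_tools}
    if not candidates:                           # path 2: no scores
        _emit(None, 0.0, min_confidence, False)
        return None
    best_tool = max(candidates, key=candidates.get)
    best_score = candidates[best_tool]
    if best_score < min_confidence:              # path 3: sub-threshold
        _emit(best_tool, best_score, min_confidence, False)
        return None
    _emit(best_tool, best_score, min_confidence, True)  # path 4: success
    return best_tool
```

With one event per call, a consumer can count calls directly and tell "gate evaluated, nothing recommended" apart from "never called".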
Cross-confirmed (architecture I2 + bio I1+I2): entity_class heuristic
scoped to strict opt-in.
Pre-fold, _derive_entity_class included a tool-name
verb-prefix-strip heuristic that produced noise on non-entity tools: get_status
→ "status", set_entity_sensor → "entity_sensor", do_something_clever
→ "something_clever". Roy-3 normalization would have silently
attributed pain events to fake entity classes. Bio-lens flagged
the contamination question; arch-lens flagged the false-positive
rate.
Fix drops the verb-strip path entirely. Tool authors opt into
Roy-3 attribution by passing entity_class through params
(entity_class / target / entity / object). The post-fold heuristic
is conservative: Roy-3 normalization skips None, so being more
conservative is strictly safer than producing wrong buckets —
silent miscount is worse than missing data.
Docstring adds two bio-fidelity guardrails:
- "DO NOT consume this field from any substrate write path" —
walls entity_class off from NAc/EC/ATL/Hippocampus/PainBus,
making the contamination question structurally unambiguous.
- 1.1 TODO pointing to a declared `Tool.entity_class` field as
the future shape — tracks the same surface as
feedback_two_identity_schemes.md.
Tests: test_tool_name_alone_does_not_derive verifies the dropped
heuristic; test_non_entity_tools_with_underscores_return_none
pins the false-positive regression guard.
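The post-fold, params-only derivation might look roughly like this. The function name and the string-only check are assumptions; the key priority order (entity_class, target, entity, object) and the None results for tool names and non-dict params follow the description above.

```python
from typing import Any, Optional

# Priority order for the opt-in params keys.
_ENTITY_KEYS = ("entity_class", "target", "entity", "object")


def derive_entity_class(tool_name: str, params: Any) -> Optional[str]:
    """Return an entity class only when the caller supplied one in params.

    tool_name is accepted but deliberately ignored: no verb-stripping.
    """
    if not isinstance(params, dict):
        return None
    for key in _ENTITY_KEYS:
        value = params.get(key)
        if isinstance(value, str) and value:
            return value
    return None
```

Returning None for anything not explicitly declared keeps such records out of the exposure counts instead of placing them in a wrong bucket.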
Bio I3: empty-scores cluster sentinel.
On the empty-scores path, cluster_reward_bias_consulted was always
None — conflating "agent had no active cluster" with "agent had a
cluster but no tools scored." Roy-3 H1 disambiguation needs the
distinction (the Wire-A vs recommend_action gap is exactly here).
Post-fold: 0.0 sentinel when current_cluster_id is set, None when
truly absent. Pinned by
test_empty_scores_sentinel_distinguishes_cluster_known_vs_unknown.
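The sentinel rule for the empty-scores path reduces to a two-way branch. A tiny sketch, with a hypothetical helper name:

```python
from typing import Optional


def empty_scores_bias(current_cluster_id: Optional[str]) -> Optional[float]:
    """Value reported as cluster_reward_bias_consulted when no tool scored."""
    # None means "no active cluster"; 0.0 means "cluster known, nothing scored".
    if current_cluster_id is None:
        return None
    return 0.0
```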
Architecture I5: use context_scope() helper instead of manual
set_context/reset_context. Both AUT and orchestrator thread bindings
now use the canonical helper from utils/http.py; future sim entry
points cannot forget the reset.
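A sketch of how a context_scope() helper over a ContextVar behaves across the two threads. The helper bodies here are assumptions written for illustration (the real ones live in utils/http.py and are not shown); the agent_id strings come from the first commit message.

```python
import contextvars
import threading
from contextlib import contextmanager

_ctx = contextvars.ContextVar("request_context", default=None)


@contextmanager
def context_scope(agent_id, session_id):
    token = _ctx.set({"agent_id": agent_id, "session_id": session_id})
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so the binding cannot leak.
        _ctx.reset(token)


def current_context():
    return _ctx.get()


seen = {}


def _aut_worker(session_id):
    # The worker thread binds its own agent_id alongside the shared session_id.
    with context_scope("sim_aut", session_id):
        seen["aut"] = current_context()["agent_id"]


session_id = "session-1"
with context_scope("sim_orchestrator", session_id):
    worker = threading.Thread(target=_aut_worker, args=(session_id,))
    worker.start()
    worker.join()
    seen["orch"] = current_context()["agent_id"]
seen["after"] = current_context()
```

The with-block form removes the manual set/reset pairing, which is the part a new entry point is most likely to get wrong.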
Architecture I4: explicit header-skip reader contract in
save_action_log docstring. Pinned by test_consumer_can_skip_header_line
which simulates the documented "skip _record_kind == 'header'"
reader pattern.
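The header-skip contract can be illustrated with a writer and reader pair. The exact header shape is an assumption (only the _format_version value and the _record_kind == 'header' check appear in the text), and the file-handle signature is simplified for the example.

```python
import io
import json

_ACTIONS_JSONL_FORMAT_VERSION = "1.1"


def save_action_log(records, fh):
    # The header is always the first line, even when there are no records.
    header = {
        "_record_kind": "header",
        "_format_version": _ACTIONS_JSONL_FORMAT_VERSION,
    }
    fh.write(json.dumps(header) + "\n")
    for record in records:
        fh.write(json.dumps(record) + "\n")


def read_actions(fh):
    """Return action records, skipping the header line and blank lines."""
    out = []
    for line in fh:
        if not line.strip():
            continue
        obj = json.loads(line)
        if obj.get("_record_kind") == "header":
            continue
        out.append(obj)
    return out


buf = io.StringIO()
save_action_log([{"tool": "open", "agent_id": "sim_aut"}], buf)
buf.seek(0)
actions = read_actions(buf)
```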
Bio nice-to-have: comment on the AUT/orch agent_id binding
documenting that the current sim-fixed strings are correct for the
single-AUT topology but NPCs spawned via AgentFactory in this
orchestrator session would need per-spawn context_scope.
Architecture nice-to-have: narrowed `except Exception` to
`except ImportError` in _emit_recommend_action_event. Non-sim
runtime is the only documented swallow case; other exceptions
propagate so a real sim_logger bug surfaces.
Deferred (architecture I1, I6, N1-N3): sub-second t ordering
(documentation, no code), third-thread LLM worker test (existing
Plan 4 A.2 inheritance), nice-to-have polish. Tracked in fold
review thread.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stages 0b + 0c of release_0_9_1.md. Telemetry instrumentation prereqs for Roy-3 measurement. The wires (Wire 3 / Wire-A / Wire 2 / Wire 1) ship as behavioral changes; 0b/0c ship the observability infrastructure those wires need to be measured.

Per the plan: "Stage 0a-d first because telemetry blocks observation. Roy-2c (one env var) lands before any wire work; Stage 0b-c lands as the structural prerequisite for measuring whether subsequent wires produced behavioral signal."
What ships
- ActionRecord gains agent_id / session_id / entity_class (CC3-additive on the frozen dataclass)
- RequestContext bound at sim orchestrator entry via context_scope() on AUT + orch threads
- InstrumentedExecutor reads current_context() + derives entity_class strictly opt-in (params-only after pre-merge fold)
- save_action_log writes _format_version: "1.1" header line + new fields per record
- sim_action grows entity_class kwarg threaded through sim_log's data dict
- NAc.recommend_action emits one sim_log("NAc_RECOMMEND", ...) event per call (all 4 early-return paths; Roy-3 must distinguish "no tools available" / "no scores" / "sub-threshold" / "success")
- tick aligned with Stage 0d's sim_ec_activation (int(time.time() - _sim_start)) so Roy-3 cross-channel joins work

Total: +1,028 / -148, 8 files. 32 unit tests passing across 8 layers.
NAc snapshots at session boundary
Per the plan: "Save NAc snapshots at session boundary (not just final) so reward_bias evolution is plottable." Roy's multi-stage harness pattern (each stage produces its own session_id with its own aut_nac.json) already satisfies this since PR #248 wired EC + ATL into save_aut_state. Reward_bias evolution is plottable across the priming-stage sequence today. Intra-session checkpoints (within a single sim_id) are a follow-up if/when cradle continuous developmental sims need them; the bio-fidelity reviewer flagged this as a 1.0 follow-up, not a 0.9.1 gap.

Two-lens pre-merge review
Per feedback_review_before_ship.md, spawned parallel architecture + bio-fidelity reviews. Folded into commit 011c995 before opening this PR:

3 Critical (architecture lens):
- tick mismatch: int(time.time()) did not align with Stage 0d's int(time.time() - _sim_start) tick space; Roy-3 cross-channel joins would have returned zero matches every time. Fix: imported sim_logger._sim_start and subtracted it; pinned by test_tick_aligned_with_sim_logger_start.
- _format_version: "1.0" vs plan-specified "1.1"; readers branching on version would have read the wrong dialect. Fix: bumped and extracted to the _ACTIONS_JSONL_FORMAT_VERSION constant.
- Fourth early-exit (if not available_tools) skipped emission; Roy-3 couldn't distinguish "no tools available" from "no tools scored above gate". Fix: emits with best_tool=None, best_score=0.0, passed_gate=False; pinned by test_emission_on_empty_available_tools_path.

Cross-confirmed (Arch I2 + Bio I1/I2): entity_class heuristic
Pre-fold _derive_entity_class included a verb-prefix-strip + role-suffix-strip path that produced noise on non-entity tools (get_status → "status", set_entity_sensor → "entity_sensor", do_something_clever → "something_clever"). Roy-3 normalization would have silently attributed pain events to fake entity classes. Dropped the verb-strip path entirely. Tool authors now opt into Roy-3 attribution by passing entity_class through params. Roy-3 normalization skips None, so being conservative is strictly safer than producing wrong buckets. Docstring adds two bio-fidelity guardrails: explicit "DO NOT consume from substrate write paths" + 1.1 TODO pointing to a declared Tool.entity_class field (tracks feedback_two_identity_schemes.md).

4 other Important findings folded:
- cluster_reward_bias_consulted=None on empty-scores conflates "cluster unknown" with "cluster known, no tool scored". Fix: 0.0 sentinel when cluster_id known, None only when truly absent. Pinned.
- Manual set_context/reset_context vs context_scope() helper. Fix: context_scope() on both AUT + orch thread bindings; future sim entry points cannot forget the reset.
- Header-skip reader contract was not explicit. Fix: documented in the save_action_log docstring + test_consumer_can_skip_header_line regression guard.
- except Exception swallows real bugs in _emit_recommend_action_event. Fix: narrowed to except ImportError (the only documented non-sim-runtime case).

Plus one bio nice-to-have: comment on the AUT/orch agent_id binding documenting that the current sim-fixed strings are correct for single-AUT topology but NPCs spawned via AgentFactory would need per-spawn context_scope.

Deferred (not blocking): sub-second t ordering (docs only, no code change), third-thread LLM worker context inheritance test (existing Plan 4 A.2 fallback already covers it), nice-to-have polish.

Test plan
- `python -m pytest tests/unit/test_stage_0b_0c_telemetry.py -q`: 32 passed across 8 layers (ActionRecord, entity_class derivation, InstrumentedExecutor, compression, save_action_log, sim_action, sim_recommend_action emission, RequestContext binding).
- `python -m pytest tests/unit/test_stage_0b_0c_telemetry.py tests/unit/test_nac*.py tests/unit/test_simulation_agent.py tests/unit/test_save_aut_state.py -q`: 142 passed (no regression).
- `python -m pytest tests/ -x -q -m "not slow" --ignore=tests/integration/test_memory_hub.py`: 6600 passed (full fast suite before fold).
- ruff check + ruff format clean on touched files (2 pre-existing F821/F841 errors in orchestrator.py are unrelated; confirmed against main).

What's next in 0.9.1
Per release_0_9_1.md:

🤖 Generated with Claude Code