- Status: Tracking
- Refs: #68, RFC 0001 design note
- Last updated: 2026-05-12 (all nine items landed: F-1, F-2, F-3, F-4 incl. #1/#2/#3/#4, F-5, F-6 Phase 1+2, F-7, F-8, F-9 incl. quota-pause)
The "foundation PR" described at the end of RFC 0001 is too big for one reviewable change (~5 KLoC including stdlib HTTP server, auth, JSON-RPC + SSE, SQLite schema, daemon CLI, bridges-into-daemon, subprocess-per-agent, and conservative cost defaults). This document splits it into nine stackable PRs and pins the acceptance criteria for each. Implementation follows this index in order; later items can land in parallel once F-1 and F-2 are merged.
| ID | Scope | Depends on | Est LoC | Status |
|---|---|---|---|---|
| F-1 | daemon/ package skeleton; serve + daemon CLI | none | ~1500 | MERGED #80 |
| F-2 | SQLite schema + events persistence + jobs migration | F-1 | ~700 | MERGED #101 + follow-ups (#fix-f2) |
| F-3 | monitor/scheduler runs in daemon | F-2 | ~700 | MERGED #101 + follow-ups (#fix-f2) |
| F-4 | agent_runner becomes subprocess-per-agent | F-2 | ~1000 | skeleton complete; F-4.1 perm routing + F-4.2 bridge notify + F-4.3 restart policy + F-4.4 e2e all landed (see §F-4 below); Windows path remains out of scope |
| F-5 | proactive watcher runs in daemon | F-2 | ~200 | LANDED (see §F-5 below) |
| F-6 | Telegram bridge in daemon | F-2 | ~500 | LANDED — Phase 1 + Phase 2 inbound refactor both live (see §F-6 below) |
| F-7 | Slack bridge in daemon | F-6 | ~500 | LANDED — same Phase 1 + Phase 2 surface as F-6 |
| F-8 | WeChat bridge in daemon | F-6 | ~500 | LANDED — same Phase 1 + Phase 2 surface as F-6 (QR-login still REPL-driven) |
| F-9 | Conservative cost-guardrail defaults under serve | F-1 | ~150 | LANDED (see §F-9 below): defaults + RPCs + per-runner quota-pause hook all live |
Scope. Adopt the cc_daemon/ reference scaffolding from
feature/daemon-spike
(server, auth, originator, rpc, events, permission, methods)
as-is — those modules encode the contract the maintainer reviewed in
PR #74. Layer the foundation glue on top:
- `cc_daemon/discovery.py`: atomic `~/.cheetahclaws/daemon.json` so REPL / Web / bridge clients can locate the running daemon (transport, address, version). Spike's pid file stays for "is anything running?" liveness; discovery answers "where is it?".
- `cc_daemon/system_methods.py`: registers `system.ping` (returns `"pong"`) and `system.shutdown` (sets `DaemonState.shutdown_event`, giving us cross-platform graceful exit since Windows can't deliver SIGTERM cleanly to another Python process).
- `cc_daemon/cli.py`: rewritten `serve_main(argv)` that calls `bootstrap()`, pins `log_file` to `<data_dir>/logs/daemon.log`, threads the loaded `config` and the `--unauthenticated-metrics` flag through `DaemonState`, writes the discovery file on bind, watches the shutdown event, and clears discovery on exit.
- `cc_daemon/server.py` (minimal patch): route `/healthz`, `/readyz`, and `/metrics` through `health.payload_for(path, config)` instead of the spike's stub `{"status": "ok"}`. Auth-gated by default; opt out via `--unauthenticated-metrics`. Adds a Windows guard around `socketserver.UnixStreamServer` (unavailable on Windows).
- `commands/daemon_cmd.py`: `cheetahclaws daemon {status, stop, logs, rotate-token}` subcommand handlers. `status` reads discovery + pings `system.ping`; `stop` calls the `system.shutdown` RPC then falls back to SIGTERM / TerminateProcess; `logs` tails `~/.cheetahclaws/logs/daemon.log`; `rotate-token` regenerates the token (notes that existing TCP clients receive 401 until they re-read the file).
- `health.py` (refactor): extract module-level `healthz_payload(config)` / `readyz_payload(config)` / `metrics_payload(config)` / `payload_for(path, config)` so both the existing standalone health HTTP server and `cc_daemon/server.py` reuse the same circuit-breaker / quota / runtime-registry probes. No behaviour change for existing `health_check_port` users.
- `cheetahclaws.py` (`main()` short-circuit): `cheetahclaws serve` dispatches to `cc_daemon.cli.serve_main`; `cheetahclaws daemon <action>` dispatches to `commands.daemon_cmd.dispatch`. Replaces the spike's `spike-daemon` shim.
Acceptance.
- `cheetahclaws serve` starts; `cheetahclaws daemon status` reports pid, transport, address, uptime, ping outcome.
- Unix socket (POSIX): `curl --unix-socket <path> -X POST /rpc -H "Cheetahclaws-Api-Version: 0" -d '{"jsonrpc":"2.0","id":1,"method":"system.ping"}'` returns `{"jsonrpc":"2.0","id":1,"result":"pong"}`.
- TCP: same call without `Authorization: Bearer <token>` returns 401; with valid token returns 200; sustained bad-token attempts trip the spike's brute-force throttle (429).
- `curl … GET /events` keeps the stream open; heartbeats arrive at spike's 15 s cadence.
- `cheetahclaws daemon stop` → `system.shutdown` RPC → discovery file cleared and process exits 0.
- `cheetahclaws daemon rotate-token` regenerates the token; existing TCP clients receive 401 on next request until they re-read the file.
- pytest green on Linux, macOS, Windows (TCP-only on Windows; Unix socket tests skip on Windows).
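
The "atomic" discovery file in the scope list can be illustrated with a minimal sketch. It assumes a write-to-temp-then-rename approach; the helper names `write_discovery`, `read_discovery`, and `clear_discovery` are hypothetical and are not taken from `cc_daemon/discovery.py`.

```python
import json
import os
import tempfile
from pathlib import Path


def write_discovery(path: Path, info: dict) -> None:
    """Write the discovery record so readers never see a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(info, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # atomic swap on POSIX and Windows
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def read_discovery(path: Path):
    """Return the record, or None when no daemon has registered (or the file is corrupt)."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def clear_discovery(path: Path) -> None:
    """Remove the record on daemon exit; a missing file is not an error."""
    path.unlink(missing_ok=True)
```

A client would call `read_discovery` first and treat `None` as "no daemon running", which matches the split between pid-file liveness and discovery described above.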
Scope. Seven additive tables in ~/.cheetahclaws/sessions.db; swap
the F-1 in-memory event ring for a SQLite-backed channel; migrate
jobs.py JSON storage to SQLite. Originator-tracked permission flow
is already provided by spike's cc_daemon/originator.py +
cc_daemon/permission.py (see PR #80) — this PR doesn't re-do it.
Tables (additive — sessions from session_store.py untouched).
schema_meta, daemon_events, agent_runs, agent_iterations,
jobs, monitor_subscriptions, monitor_reports, bridges.
Deliverables.
- `cc_daemon/schema.py`: DDL + `init_schema(db_path)` (idempotent, internally locked) + `get_conn()` (thread-local, mirrors `session_store` pattern) + `get_schema_version()` accessor; future migrations land in `_apply_migrations()`.
- `cc_daemon/cli.py:cmd_serve` calls `init_schema()` right after `bootstrap()` so tables exist before the first publish.
- `cc_daemon/events.py` (rewritten): `EventBus.publish` does an INSERT into `daemon_events` (id from `AUTOINCREMENT`, monotonic across restarts and prunes), still fans out to in-process subscribers for live tail; `replay_since(N)` reads from SQLite and emits a synthetic `gap` event when `N` is older than the oldest surviving row. Default retention: 24 h / 100 K rows; opportunistic prune every 100 publishes.
- `jobs.py`: `_persist` / `_row_to_job` hit SQLite; `_ensure_migrated()` imports legacy `~/.cheetahclaws/jobs.json` once (tracked via `schema_meta.jobs_migrated_from_json`). Migration is one-way: after the marker is set, edits to the JSON file are no longer read. The file is left on disk for backward viewing only (prior-release users, backup tooling); SQLite is the source of truth from then on. Public API unchanged.
Follow-ups (#fix-f2).
- `cc_daemon/schema.py` sets `PRAGMA synchronous=NORMAL` on init and on every thread-local connection. Safe under WAL: only the most recent transactions can be lost on hard kernel crash, which for an event log already retention-pruned in 24 h windows is an acceptable trade. Microbenchmark: `EventBus.publish` of 10 K `text_chunk` events drops from 305 μs/event to 39 μs/event (~8×; chauncygu #74 review §7 follow-up).
- `jobs.py` and `monitor/store.py` migration docstrings now make the one-way semantics explicit (the original "kept readable for one release as fallback" wording in PR #101 implied a fallback read path that didn't exist; users editing the JSON expecting it to be picked up would have been silently surprised).
Acceptance.
- `init_schema()` is idempotent across daemon restarts and concurrent callers (verified by 12 unit tests in `tests/test_cc_daemon_schema.py`).
- Spike's 13 contract tests in `tests/test_daemon_spike.py` keep passing on the SQLite-backed bus (only the two ring-buffer tests needed an in-place rewrite to test retention-based eviction instead of the deleted in-memory cap).
- New `tests/test_cc_daemon_events_sqlite.py` (15 tests) covers persistence, retention by row count + age, gap-on-old-since, cross-instance replay (simulated daemon restart), and the `reset_bus_for_tests()` truncate path.
- New `tests/test_jobs_sqlite.py` (14 tests) covers create / start / add_step / lifecycle / list_recent / list_running / `_MAX_JOBS` pruning + JSON-file migration (idempotency, corrupt-file tolerance, legacy-file kept readable).
- New e2e `tests/e2e_daemon_skeleton.py::test_events_persist_in_sqlite_across_daemon_restart` publishes events on daemon A via `echo.ping`, stops A, starts B against the same data dir, and verifies `GET /events?since=0` replays the events from SQLite.
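
To make the replay and gap behaviour concrete, here is a minimal sketch of a SQLite-backed event log. It is not the project's `EventBus`: the class `ToyEventLog`, its column layout, and the exact gap condition (`since + 1 < oldest surviving id`) are assumptions made for illustration, and only row-count retention is shown.

```python
import json
import sqlite3
import time


class ToyEventLog:
    """Toy persistent event log with AUTOINCREMENT ids, pruning, and gap detection."""

    def __init__(self, db_path=":memory:", max_rows=100_000):
        self.conn = sqlite3.connect(db_path)
        self.max_rows = max_rows
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS daemon_events ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " ts REAL NOT NULL, type TEXT NOT NULL, payload TEXT NOT NULL)"
        )

    def publish(self, type_, payload):
        cur = self.conn.execute(
            "INSERT INTO daemon_events (ts, type, payload) VALUES (?, ?, ?)",
            (time.time(), type_, json.dumps(payload)),
        )
        self.conn.commit()
        return cur.lastrowid  # AUTOINCREMENT never reuses ids, even after prunes

    def prune(self):
        # Keep only the newest max_rows rows; a no-op on an empty table.
        self.conn.execute(
            "DELETE FROM daemon_events WHERE id <= "
            "(SELECT MAX(id) FROM daemon_events) - ?",
            (self.max_rows,),
        )
        self.conn.commit()

    def replay_since(self, since):
        oldest = self.conn.execute("SELECT MIN(id) FROM daemon_events").fetchone()[0]
        events = []
        # Assumed rule: rows between the caller's cursor and the oldest survivor were pruned.
        if oldest is not None and since + 1 < oldest:
            events.append({"id": None, "type": "gap",
                           "payload": {"since": since, "oldest": oldest}})
        rows = self.conn.execute(
            "SELECT id, type, payload FROM daemon_events WHERE id > ? ORDER BY id",
            (since,),
        )
        for id_, type_, payload in rows:
            events.append({"id": id_, "type": type_, "payload": json.loads(payload)})
        return events
```

A client that receives the synthetic `gap` entry knows its cursor fell behind retention and can resynchronise instead of silently missing events.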
Scope. monitor/scheduler.py runs daemon-side; subscription store
moves from JSON to the F-2 monitor_subscriptions table; reports
persist + emit SSE events; REPL skips its local scheduler when a
daemon is detected.
Deliverables.
- `monitor/store.py`: SQLite-backed (`monitor_subscriptions` and `monitor_reports` tables). One-shot import of legacy `~/.cheetahclaws/monitor_subscriptions.json` on first call (tracked in `schema_meta.monitor_migrated_from_json`); JSON kept readable for one release. New helpers: `save_report`, `list_reports`. Public API of the legacy store unchanged.
- `monitor/scheduler.py`: `run_one()` persists the full report body via `save_report` and publishes a `monitor_report` event on `cc_daemon.events.get_bus()` with `{topic, report_id, body, sent_to, errors}`. Loop's idle wait switched from `time.sleep(30)`×60 to a single `Event.wait(60)` so daemon shutdown isn't stalled by the scheduler thread napping.
- `cc_daemon/monitor_methods.py`: registers `monitor.subscribe`, `monitor.unsubscribe`, `monitor.list`, `monitor.run` for external clients (Web UI / third-party tools). `DaemonState.__init__` calls `monitor_methods.register` next to `system_methods`.
- `cc_daemon/cli.py:cmd_serve`: starts the scheduler with `monitor.scheduler.start(config)` after schema init; the existing shutdown watcher calls `monitor.scheduler.stop()` before triggering HTTP-server shutdown.
- `commands/monitor_cmd.py`: `/monitor start` and `/monitor stop` detect a live daemon via `cc_daemon.discovery.locate()` and no-op with a friendly message. `/monitor subscribe/unsubscribe/list` continue to work in REPL because they hit SQLite directly.
Follow-ups (#fix-f2).
- `cc_daemon/cli.py:cmd_serve` now starts `monitor.scheduler.start(...)` after the listener has bound and the discovery file is on disk (PR #101 had it before the bind). Order matters: if a due subscription fires before the daemon is reachable, an LLM/network error in fetch/summarize/deliver surfaces in the log before the user sees the listening line, and external clients can't yet act on the resulting `monitor_report` SSE event.
- `monitor/scheduler.py`: `_foreign_daemon_running()` step-aside check at the top of every loop tick. Closes the race where REPL `/monitor start` fires in the brief window before the daemon writes its discovery file: both schedulers would otherwise race on `last_run_at` and double-fire subscriptions. Daemon passes `owned_by_daemon=True` to `start(...)` to opt out of the check (otherwise it would defer to its own discovery entry forever).
Acceptance.
- `cheetahclaws serve` running → `monitor.subscribe` over RPC persists to SQLite; daemon scheduler fires on cadence; reports show up in `monitor_reports` and on the SSE channel as `monitor_report` events.
- Daemon stop → start with same data dir → `monitor.list` over RPC returns the previously-subscribed topics. (Verified by `tests/e2e_daemon_skeleton.py::test_monitor_subscribe_via_rpc_survives_daemon_restart`.)
- REPL `/monitor subscribe` while daemon is running: subscription visible via `monitor.list` from outside. Daemon picks up the new row on its next 60 s poll.
- Without daemon: today's REPL-only behaviour unchanged (in-process scheduler thread).
- Telegram / Slack / WeChat delivery from daemon: out of scope for F-3 (waits for F-6/F-7/F-8). Reports + `monitor_report` events still fire so the digest isn't lost; bridges deliver only when REPL is running with the channel connected.
Tests. tests/test_monitor_store_sqlite.py (18),
tests/test_monitor_scheduler_events.py (7),
tests/test_cc_daemon_monitor_methods.py (12), plus 1 new e2e in
tests/e2e_daemon_skeleton.py for the survive-restart case.
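
The switch from a sleeping loop to a single `Event.wait(60)` is what keeps shutdown prompt. A minimal sketch of that pattern follows; `TickLoop` is a hypothetical stand-in, not the code in `monitor/scheduler.py`.

```python
import threading
import time


class TickLoop:
    """Background loop whose idle wait can be interrupted by stop()."""

    def __init__(self, interval_s, on_tick):
        self._interval_s = interval_s
        self._on_tick = on_tick
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        # A second start while the thread is alive is refused.
        if self._thread is not None and self._thread.is_alive():
            return False
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, name="tick-loop", daemon=True)
        self._thread.start()
        return True

    def _run(self):
        while not self._stop.is_set():
            self._on_tick()
            # Event.wait returns True as soon as stop() sets the event,
            # so the loop never naps through a shutdown request.
            if self._stop.wait(self._interval_s):
                break

    def stop(self, timeout=5.0):
        self._stop.set()
        if self._thread is None:
            return True
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

With `time.sleep`, a stop request would have to wait out the remaining sleep; here `stop()` returns as soon as the in-flight tick (if any) finishes.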
Scope. Each AgentRunner is its own subprocess. From #68: "subprocess-per-agent rather than threads — one leaking/crashing runner shouldn't take down the scheduler and bridges."
Deliverables.
- `cc_daemon/runner_supervisor.py`: spawn / monitor / restart agent-runner subprocesses.
- `cc_daemon/runner_ipc.py`: line-delimited JSON over stdin/stdout between supervisor and runner.
- `agent_runner.py`: main entry point usable as `python -m agent_runner --pipe …`; iteration-log writes flow back to the daemon and land in `agent_iterations`.
- Permission requests from runners routed through supervisor → `cc_daemon/permission.py`.
Acceptance.
- Runner crash (`kill -9 <runner_pid>`) does not kill the daemon; supervisor logs the crash and emits `agent_runner_crash` event.
- Runner OOM does not affect monitor or bridges.
- Runner subprocess stops within 5 s of `agent.stop` RPC.
- Iteration-log entries match in-process behavior (status, duration, summary, token counts).
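
As an illustration of the line-delimited JSON framing named in the deliverables, the sketch below frames one message per line over any pair of text streams. `ToyJsonLineChannel` is a hypothetical class written for this note; it is not `cc_kernel.runner.ipc.JsonLineChannel`, and its handling of malformed lines (skip and continue) is an assumption.

```python
import io
import json


class ToyJsonLineChannel:
    """One JSON object per line over a pair of text streams."""

    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer

    def send(self, msg: dict) -> None:
        # json.dumps escapes newlines inside strings, so one message is always one line.
        self._writer.write(json.dumps(msg, separators=(",", ":")) + "\n")
        self._writer.flush()

    def recv(self):
        """Return the next message dict, or None at end of stream."""
        while True:
            line = self._reader.readline()
            if line == "":
                return None  # EOF: the peer closed its end
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                continue  # assumed policy: skip stray non-JSON output
            if isinstance(msg, dict):
                return msg


# Usage: round-trip two frames through an in-memory buffer.
buf = io.StringIO()
tx = ToyJsonLineChannel(io.StringIO(), buf)
tx.send({"op": "notify", "text": "line1\nline2"})
tx.send({"op": "stop"})
rx = ToyJsonLineChannel(io.StringIO(buf.getvalue()), io.StringIO())
first = rx.recv()
```

In a real supervisor the reader and writer would be the subprocess's stdout and stdin pipes opened in text mode.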
A POSIX-only skeleton landed under the agent_runner_subprocess /
CHEETAHCLAWS_ENABLE_F4 feature flag (off by default; REPL is byte-for-byte
unchanged). Files:
| File | LoC | Role |
|---|---|---|
| `cc_daemon/runner_supervisor.py` | ~610 | Lifecycle (`start` / `stop` / `stop_all` / `get` / `list_all`), three-phase stop (IPC stop → SIGTERM → SIGKILL, ≤5 s), reader loop, crash classification, SQLite persistence helpers |
| `cc_daemon/runner_ipc.py` | 33 | Thin re-export of `cc_kernel.runner.ipc.JsonLineChannel` |
| `cc_daemon/agent_methods.py` | ~100 | `agent.start` / `agent.stop` / `agent.list` / `agent.status` RPCs, registered from `cc_daemon/server.py:DaemonState.__init__` |
| `agent_runner.py` | +231 | `python -m agent_runner --pipe` subprocess entry, `_PipeAgentRunner` shim that bridges `send_fn` and `iteration_done` to IPC, dispatch in `start_runner` / `stop_runner` |
| `tests/test_cc_daemon_runner_supervisor.py` | ~430 | 17 unit tests: handshake, graceful stop, SIGKILL escalation on hung runner, crash via external SIGKILL, IPC shim identity, 9 SQLite persistence cases |
| `tests/test_cc_daemon_agent_methods.py` | ~210 | 10 RPC tests: registration, param validation, list/status when empty, end-to-end list→stop with inline runner |
Acceptance status:
- ✅ Crash detection. `kill -9 <runner_pid>` flips `handle.status` to `"crashed"`, finalizes the `agent_runs` row (`status='crashed'`, `error="exit_code=-9; stderr_tail=..."`), and publishes `agent_runner_crash` on the event bus.
- ✅ OOM resilience. Same code path as `kill -9`; the OOM killer's SIGKILL is observed via `proc.poll()` from the reader loop.
- ✅ Stop within 5 s. Verified by `test_graceful_stop_within_5s` and `test_hanging_runner_escalates_to_sigkill`. Graceful IPC `stop` first; SIGTERM after 2 s; SIGKILL after another 3 s.
- ✅ Iteration log parity. jsonl format is byte-identical to today's in-thread `AgentRunner._persist_record`. `agent_iterations` and `agent_runs` SQLite rows are populated end-to-end (verified by 9 persistence tests). `INSERT OR IGNORE` makes re-delivery idempotent.
- Permission routing. ✅ Landed (see §F-4.1 below). The supervisor now routes `permission_request` IPC through `cc_daemon/permission.py:PermissionStore` when the runner was started with `auto_approve=False`. The originator (the client_id that called `agent.start`) answers via `permission.answer` and the supervisor forwards the response back to the runner as `permission_response`.
- Bridge `notify` forwarding. ✅ Landed (see §F-4.2 below). The supervisor's reader now routes `{"op":"notify", "text": ...}` IPC frames through `bridge_supervisor.notify(kind, text)` and publishes an `agent_runner_notify` event. The runner can target a specific bridge via `msg["bridge"]` or omit it for a `"*"` broadcast.
- Restart policy. ✅ Landed (see §F-4.3 below). The originator picks a `restart_policy` (`"none"` | `"on-crash"`) at `agent.start` time along with `max_restarts` / `backoff_base_s` / `backoff_cap_s`. The supervisor's reader hooks the lineage's restart counter into a `threading.Timer` after a crash; `stop()` cancels any pending Timer. Exhaustion publishes `agent_runner_restart_exhausted` so observers can take over.
- e2e test against the real `python -m agent_runner`. ✅ Landed (see §F-4.4 below). `tests/e2e_f4_runner.py` spawns the real `python -m agent_runner --pipe` subprocess via `runner_supervisor.start` and verifies the `agent_runs` insert, the `agent_iterations` → `last_iteration` update, the `ended_at` finalisation on graceful stop, and the full F-4.1 permission round-trip end-to-end.
- Windows path. Out of scope per RFC; `enabled()` returns False on `sys.platform.startswith("win")` and the dispatch in `agent_runner.start_runner` falls back to threads.
The supervisor and runner now agree on one IPC round-trip per permission prompt:
permission_request {request_id, tool, input, rationale}
runner ────────────────────────────────────────────────────────────────► supervisor
│
permission_store.create(
originator = handle.originator,
on_answer = _forward_back )
│
┌── originator answers via JSON-RPC ──┐
│ permission.answer(req_id, result) │
└─────────────────────────────────────┘
│
permission_response {request_id, granted} ▼
runner ◄──────────────────────────────────────────────────────────────── supervisor
Files touched:
| File | What changed |
|---|---|
| `cc_daemon/permission.py` | `PermissionRequest` gains an optional `on_answer(req)` callback; `PermissionStore.create()` accepts it, `answer()` fires it after the store has been mutated (outside the lock), the janitor synthesises `{"approve": False, "timeout": True}` and fires it on expiry. |
| `cc_daemon/runner_supervisor.py` | `RunnerHandle` gains `originator: str` + `permission_store: Optional`. `start()` takes both as kwargs. `_reader_loop`'s `permission_request` branch now: (a) keeps the auto-approve fast path when `auto_approve=True` or no store is wired in (back-compat), (b) otherwise calls `store.create(originator=…, on_answer=…)` and the callback ships `permission_response` back to the runner. |
| `cc_daemon/agent_methods.py` | `agent.start` reads `ctx.client_id` for the originator and passes `daemon_state.permissions` as the store. `agent.list` / `agent.status` results now include `originator`. |
| `agent_runner.py` | Extracted today's inline `PermissionRequest` handling into `AgentRunner._handle_permission_request(event) -> rec_status`. `_PipeAgentRunner` overrides it: emit `permission_request` with a fresh correlation id, wait on a `threading.Event` populated by the control-loop's `permission_response` handler (already in place), then set `event.granted` and either continue or stop. |
Semantics:
- `auto_approve=True`: runner doesn't even bother the supervisor; permission requests are granted in-process. Identical to today's REPL behaviour.
- `auto_approve=False` and store present: the originator's RPC client is the only client that can answer (`NotOriginator` for everyone else, just like spike's permission tests).
- `auto_approve=False` and store absent: the supervisor still grants (treated as the back-compat safety path so a misconfigured caller doesn't lock the runner up). Only RPC-driven flows pass a store.
- Timeout: the store's janitor fires the same callback path with `{"approve": False, "timeout": True}` so the runner unblocks rather than waiting `_PERMISSION_WAIT_S` (30 min) for an IPC frame that's never coming.
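
The originator check and the callback-outside-the-lock rule can be sketched as follows. `ToyPermissionStore` and its method signatures are hypothetical and simplified; the real `PermissionStore` API in `cc_daemon/permission.py` may differ, and the janitor is reduced here to an explicit `expire()` call.

```python
import threading
import uuid


class NotOriginator(Exception):
    """Raised when a client other than the originator tries to answer."""


class ToyPermissionStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def create(self, originator, on_answer=None):
        req_id = uuid.uuid4().hex
        with self._lock:
            self._pending[req_id] = {"originator": originator, "on_answer": on_answer}
        return req_id

    def answer(self, req_id, client_id, result):
        with self._lock:
            req = self._pending.get(req_id)
            if req is None:
                raise KeyError(req_id)
            if req["originator"] != client_id:
                raise NotOriginator(client_id)
            del self._pending[req_id]
        # Fire the callback after the store is mutated and outside the lock,
        # so a slow or re-entrant callback cannot deadlock other callers.
        if req["on_answer"] is not None:
            req["on_answer"](req_id, result)

    def expire(self, req_id):
        """Stand-in for the janitor: deny with a timeout marker."""
        with self._lock:
            req = self._pending.pop(req_id, None)
        if req is not None and req["on_answer"] is not None:
            req["on_answer"](req_id, {"approve": False, "timeout": True})
```

In the supervisor, `on_answer` would be the function that ships `permission_response` back to the runner, so approval, denial, and timeout all unblock the runner through one path.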
Tests live in tests/test_cc_daemon_runner_permission_routing.py (10
new cases — store callback unit, supervisor approve/deny/timeout
round-trips, non-originator guard, missing-store fallback, and the
agent.start RPC wiring). The existing 17 supervisor tests + 10
agent-method tests + 11 spike tests all still pass.
The supervisor's reader used to drop the runner's {"op":"notify"}
IPC frames on the floor — fine for F-4's skeleton because no bridge
was running in-daemon yet. With F-6's Phase 1 mailbox landed, the
forward path is one delegation:
runner._notify("text")
│
▼
chan.send({"op":"notify", "text":"...", "bridge":"telegram"|"*"|...})
│
▼
supervisor._reader_loop (op == "notify"):
text = msg.get("text") or msg.get("msg") or ""
target = msg.get("bridge", "*")
if text:
delivered = bridge_supervisor.notify(target, text)
bus.publish("agent_runner_notify",
{name, run_id, bridge, delivered, text[:500]})
Defaults & semantics:
- Default target = `"*"`. A runner that doesn't know which bridge its originator owns (the common case for `agent_runner._notify`) sends the message to every live bridge. Useful for "agent finished" pings.
- Empty text is silently dropped. No event, no bus traffic; this keeps the iteration log from spamming when an agent template emits an empty notify (common during shutdown).
- No backpressure. If `bridge_supervisor.notify` raises, we capture it on `handle.error` and keep the reader thread alive. Bridges becoming unreachable mid-iteration must never crash a runner.
- No retry. `delivered: false` events are visible to observers via the bus; retry policy (if any) belongs to the originator, not the supervisor.
Tests live in tests/test_cc_daemon_runner_notify_routing.py (3
cases — single-bridge dispatch, broadcast default, empty-text drop).
Verifies via an inline python -c runner that speaks the IPC
protocol and a patch.object(bs, "notify", ...) so we don't need a
real network bridge to exercise the wiring.
The originator picks a policy at agent.start time. The supervisor's
reader-loop finally consults the policy after a crash and arms a
threading.Timer for the next attempt. A graceful stop() is never
followed by a restart.
RestartPolicy (dataclass, frozen):
| Field | Default | Meaning |
|---|---|---|
| `mode` | `"none"` | `"none"` (no auto-restart) or `"on-crash"` (respawn only on crash). |
| `max_restarts` | `0` | Total restarts for the whole lineage. Zero disables even when `mode='on-crash'`. |
| `backoff_base_s` | `1.0` | First delay; doubles each subsequent attempt. |
| `backoff_cap_s` | `60.0` | Hard ceiling on the doubled delay. |
| `backoff_jitter_s` | `0.5` | Symmetric uniform jitter (clipped at zero). |
next_delay(restart_count) is a pure function — given the lineage's
running counter it returns the next delay in seconds, or None when
the policy is exhausted or disabled. Tests cover the full decision
matrix without touching a clock or a subprocess.
agent.start accepts these as flat params (restart_policy,
max_restarts, backoff_base_s, backoff_cap_s, backoff_jitter_s).
Bad values raise TypeError → JSON-RPC -32602 invalid params. The
nastiest footgun — backoff_cap_s < backoff_base_s (would clamp every
attempt down to the cap and "feel" disabled) — is rejected at config
time.
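
A minimal sketch of the pure delay function, using the defaults from the table above. `ToyRestartPolicy` is a hypothetical stand-in for the real dataclass; applying the jitter after the cap and passing an optional `rng` are assumptions made here, not behaviour confirmed by the source.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyRestartPolicy:
    mode: str = "none"
    max_restarts: int = 0
    backoff_base_s: float = 1.0
    backoff_cap_s: float = 60.0
    backoff_jitter_s: float = 0.5

    def __post_init__(self):
        # TypeError mirrors the text above (mapped to JSON-RPC -32602 by the caller).
        if self.mode not in ("none", "on-crash"):
            raise TypeError(f"unknown restart mode: {self.mode!r}")
        if self.max_restarts < 0 or self.backoff_base_s <= 0 or self.backoff_jitter_s < 0:
            raise TypeError("restart parameters out of range")
        if self.backoff_cap_s < self.backoff_base_s:
            raise TypeError("backoff_cap_s must be >= backoff_base_s")

    def next_delay(self, restart_count: int,
                   rng: Optional[random.Random] = None) -> Optional[float]:
        """Delay before the next attempt, or None when disabled or exhausted."""
        if self.mode != "on-crash" or restart_count >= self.max_restarts:
            return None
        # Exponential backoff: base, 2*base, 4*base, ... clamped to the cap.
        delay = min(self.backoff_base_s * (2 ** restart_count), self.backoff_cap_s)
        if self.backoff_jitter_s:
            source = rng if rng is not None else random
            delay += source.uniform(-self.backoff_jitter_s, self.backoff_jitter_s)
        return max(delay, 0.0)  # jitter is clipped at zero


policy = ToyRestartPolicy(mode="on-crash", max_restarts=8, backoff_jitter_s=0.0)
delays = [policy.next_delay(i) for i in range(8)]
# delays == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Because the function takes the running counter as input and touches no clock, the whole decision matrix can be tested with plain assertions.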
Lifecycle:
runner crashes
│
▼
reader.finally:
status='crashed'; agent_runs.error filled in; bus.publish(agent_runner_crash)
│
▼
_maybe_schedule_restart(handle):
delay = policy.next_delay(restart_count)
if delay is None:
if restart_count > 0 → bus.publish(agent_runner_restart_exhausted)
stop # default path for mode='none'
else:
bus.publish(agent_runner_restart_scheduled, {delay_s})
handle._restart_timer = threading.Timer(delay, _do_restart, (handle,))
↓ later …
_do_restart(prev):
if registry slot is empty or has a newer run_id → abort silently
new_handle = _RESTART_SPAWNER(**prev._start_kwargs,
_restart_count_carry=prev.restart_count + 1)
bus.publish(agent_runner_restart, {old_run_id, new_run_id, restart_count})
Files touched:
| File | What changed |
|---|---|
| `cc_daemon/runner_supervisor.py` | Adds `RestartPolicy` dataclass + `RunnerHandle.restart_policy` / `restart_count` / `_start_kwargs` / `_restart_timer` / `_restart_decided`. `start()` gains kwargs (`restart_policy`, `_restart_count_carry`) and stashes `_start_kwargs` for successor calls. Reader's `finally` invokes `_maybe_schedule_restart()` on crash. `stop()` cancels the pending Timer before the kill ladder. New `_RESTART_SPAWNER` module hook for tests. |
| `cc_daemon/agent_methods.py` | `agent.start` parses `RestartPolicy.from_params(params)` and threads it through. `_handle_to_dict` now reports `restart_count` + flattened `restart_policy` on `agent.list` / `agent.status`. |
| `tests/test_cc_daemon_runner_restart_policy.py` | New, 16 cases: 10 pure-function (`next_delay` matrix, `from_params` validation), 3 reader-loop integration (disabled → no timer, on-crash → spawner called with carry+1, exhaustion publishes the event), 1 `stop()` cancellation, 2 handle serialisation / sanity. |
| `tests/test_cc_daemon_runner_permission_routing.py` | `_FakeHandle` stub gains `restart_policy` + `restart_count` so `_handle_to_dict` doesn't AttributeError. |
Events:
- `agent_runner_crash` (unchanged): first signal a lineage is in trouble.
- `agent_runner_restart_scheduled`: `{name, run_id, restart_count, delay_s}`.
- `agent_runner_restart`: successor handle spawned, `{name, old_run_id, new_run_id, restart_count, pid}`.
- `agent_runner_restart_failed`: successor handshake itself failed, `{name, run_id, restart_count, error}`.
- `agent_runner_restart_exhausted`: `max_restarts` hit, `{name, run_id, restart_count, max_restarts}`.
Race-safety notes:
- `_do_restart` re-checks the registry under `_handles_lock` and aborts if the slot no longer holds the original handle (covers the case where `stop()` raced with the Timer firing).
- `stop()` cancels `_restart_timer` before the alive-check, so a lineage whose previous process already died is still properly killed off without a respawn.
- A failed spawn (handshake timeout, exception in `start`) does not chain into another retry; the lineage stops via `agent_runner_restart_failed`. Otherwise an import-time bug in the agent template would burn through `max_restarts` instantly.
Tests:
pytest tests/test_cc_daemon_runner_restart_policy.py — 16/16 green in
~3 s. The wider F-4 regression
(test_cc_daemon_runner_supervisor.py + test_cc_daemon_agent_methods.py +
test_cc_daemon_runner_permission_routing.py) is 55/55 green in ~13 s,
plus the F-4.4 e2e (4/4 in ~2 s) was rerun unchanged.
tests/e2e_f4_runner.py covers the gap between the unit tests (which
use an inline -c subprocess that speaks the protocol) and a real
deployment. It spawns python -m agent_runner --pipe via
runner_supervisor.start, with the agent runtime stubbed in a tightly
scoped way:
- `agent_runner._pipe_main` checks `CHEETAHCLAWS_E2E_FAKE_AGENT=1` after the handshake and, if set, replaces `agent.run` with a small scripted generator (`TextChunk` → optional `PermissionRequest` → `TurnDone`). The hook is env-gated so production paths can never reach it.
- A companion env var `CHEETAHCLAWS_E2E_FAKE_PERMISSION=1` makes the stub emit one `PermissionRequest` so the test can drive the F-4.1 routing through real IPC.
Cases:
- `test_start_creates_agent_runs_row`: `supervisor.start` returns with the `agent_runs` row already inserted (sync write before reader thread starts, per the F-4 skeleton invariants).
- `test_iteration_lands_in_sqlite_under_real_runner`: the real `_PipeAgentRunner._persist_record` emits `iteration_done` over IPC, the supervisor writes `agent_iterations` and bumps `agent_runs.last_iteration`. Tolerates 15 s for cold subprocess startup.
- `test_graceful_stop_finalises_agent_runs_status`: `rs.stop()` delivers IPC "stop", runner exits, supervisor's reader finalises the row with `status='stopped'` and a non-null `ended_at`.
- `test_real_runner_permission_routing_round_trip`: the stubbed `agent.run` yields a `PermissionRequest`, the real `_PipeAgentRunner._handle_permission_request` ships `permission_request` IPC, the supervisor opens a pending request in `PermissionStore` under originator `"alice"`, the test answers via `store.answer(..., "alice", {"approve": True})`, and the runner's iteration completes. The `agent_iterations` row arrives, proving the approval flowed all the way back through the real subprocess.
All four pass in ~2.5 s on a developer laptop and ~3 s under the wider F-4 regression suite (82 tests across supervisor unit, agent_methods unit, permission routing, spike contract, dup-stop integration, and e2e).
Scope. _proactive_watcher_loop from cheetahclaws.py becomes a daemon-owned task.
Acceptance.
- `/proactive 5m` while daemon is running: setting persists, sentinel runs in daemon, survives REPL exit.
- Without daemon: unchanged.
| File | Role |
|---|---|
| `cc_daemon/proactive_state.py` | `schema_meta`-backed KV for `proactive.enabled` / `proactive.interval_s` / `proactive.last_tick_at`. Public surface: `get_state()`, `set_state()`, `disable()`, `tickle()`, `record_tick()`. Survives daemon restarts because it's on the same `sessions.db` the F-2 schema owns. |
| `cc_daemon/proactive_scheduler.py` | Single background thread (`proactive-scheduler`). Ticks at `TICK_INTERVAL_S = 1.0`, reads `proactive_state`, publishes `proactive_tick` on the SSE bus when the idle threshold is crossed, and resets `last_tick_at` using one `now` reading so the event and the row share a clock. Mirrors F-3's `monitor.scheduler` (`owned_by_daemon`, `_foreign_daemon_running()`, interruptible `Event.wait` so shutdown doesn't stall). |
| `cc_daemon/proactive_methods.py` | `proactive.set` / `proactive.get` / `proactive.tickle` RPCs. Same param-validation conventions as `monitor.*`. Registered next to the other method modules in `DaemonState.__init__`. |
| `cc_daemon/cli.py:cmd_serve` | Starts the proactive scheduler after bind + discovery (so external clients can subscribe to `proactive_tick` before the first tick lands), with `owned_by_daemon=True`. Shutdown watcher stops it alongside `monitor.scheduler`. |
| `cc_daemon/server.py` | `DaemonState.__init__` registers `proactive_methods` alongside `system_methods`, `monitor_methods`, and `agent_methods`. |
| `commands/core.py:cmd_proactive` | When a foreign daemon is registered, the slash command routes through the `proactive.set` / `proactive.get` RPCs instead of mutating `RuntimeContext`. On RPC failure, falls back to today's in-process path so a misbehaving daemon doesn't break the REPL UX. |
| `cheetahclaws.py:_proactive_watcher_loop` | Polls `_proactive_foreign_daemon_running()` and steps aside when a daemon owns the watcher; this prevents double-fire across REPL + daemon. |
Event payload (proactive_tick):
Consumers (REPL, bridges, future agents) decide what to do with it — typically inject the same "review previous messages" prompt the old in-REPL watcher used to fire. The scheduler itself never reaches into agent / bridge state; that coupling lives in the consumer, where it belongs.
- `tests/test_cc_daemon_proactive.py`: 20 cases across:
  - `proactive_state`: defaults, round-trip, validation (rejects 0/negative), `disable()` keeps interval, `tickle()` bumps timestamp, corrupt-row tolerance.
  - `proactive_scheduler`: disabled state silent, idle threshold publishes one event, `owned_by_daemon=True` disables foreign-check, `stop()` joins within 5 s, double-start returns False.
  - `proactive_methods`: round-trip, missing `enabled` rejected, non-int interval rejected, zero rejected, `tickle` bumps `last_tick_at`, `get` reports scheduler-running flag.
  - REPL step-aside helper: returns False for none / own pid, True for foreign pid.
Full daemon regression: 143 tests passing across schema, events, supervisor, agent_methods, monitor_methods, system_methods, permission routing, spike, discovery, CLI, the F-4 e2e, and the new F-5 module.
Scope per PR. The named bridge (telegram, then slack, then wechat) runs inside daemon; incoming messages enter via POST /rpc {"method":"session.send", …}; outgoing replies come from an SSE subscription to that session's events.
Per-bridge deliverables.
- Move `bridges/<kind>.py` poll loop into a daemon-owned worker.
- Drop `RuntimeContext.<kind>_send` / `<kind>_input_event` and friends; replace with the API-mediated path.
- `bridge.start` / `bridge.stop` / `bridge.list` RPC methods.
- Persist bridge state to `bridges` table.
Acceptance per bridge.
- Phone message → daemon `session.send` → REPL/Web/another bridge can subscribe to the same session and see events.
- Bridge survives REPL exit; user can keep texting.
- Permission requests originating from a bridge-driven turn route only to that bridge for answer (per RFC 0001 §2).
F-7 depends on F-6 (shared scaffolding); F-8 the same.
A POSIX + Windows-compatible skeleton landed under the
CHEETAHCLAWS_ENABLE_F6 feature flag (off by default; REPL is
byte-for-byte unchanged). The Phase 1 surface is "everything F-4 #2
needs to deliver runner notifications, plus a clean lifecycle"; the
Phase 2 inbound refactor (phone → session.send → SSE-subscribed
clients) is documented separately at the end of this section.
Files:
| File | LoC | Role |
|---|---|---|
| `cc_daemon/bridge_supervisor.py` | ~430 | Lifecycle (`start` / `stop` / `stop_all` / `get` / `list_all`), per-kind feature-flag gate (`CHEETAHCLAWS_ENABLE_F6/7/8`), outbound `notify()` mailbox consumed by F-4 #2 + `bridge.send` RPC, `bridges` table upsert/finalize, redacted config snapshots in event payloads. |
| `cc_daemon/bridge_methods.py` | ~135 | `bridge.start` / `bridge.stop` / `bridge.list` / `bridge.send` / `bridge.status` RPCs. Registered from `cc_daemon/server.py:DaemonState.__init__` next to `agent_methods`. |
| `cc_daemon/server.py` | +6 | `DaemonState.__init__` adds `bridge_methods.register`. The methods are exposed unconditionally so `bridge.list` always answers, but `bridge.start` itself enforces the per-kind flag. |
| `cc_daemon/cli.py` | +6 | `_watch_shutdown` calls `bridge_supervisor.stop_all` before triggering the HTTP listener shutdown, so a SIGTERM cleanly tears down bridge worker threads. |
| `tests/test_cc_daemon_bridge_supervisor.py` | ~290 | 17 cases across feature flag, lifecycle (start/stop/double-start/dependency-on-F6), outbound notify (single + broadcast + empty drop), SQLite persistence (`list_persisted`, DB-failure tolerance), config redaction. |
| `tests/test_cc_daemon_bridge_methods.py` | ~210 | 10 RPC cases: registration, param validation across all five methods, start-list-stop round trip with redacted config in response, `bridge.send` outbound dispatch. |
Per-bridge flag matrix (per the "Bridge flag" decision):
| Env var | Effect |
|---|---|
| CHEETAHCLAWS_ENABLE_F6 | Telegram-in-daemon allowed. |
| CHEETAHCLAWS_ENABLE_F7 | Slack-in-daemon (requires F-6). |
| CHEETAHCLAWS_ENABLE_F8 | WeChat-in-daemon (requires F-6). |
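The dependency rule in the matrix (F-7 and F-8 each also need F-6) can be sketched as a small gate. This is an illustration only, not the code in cc_daemon/bridge_supervisor.py: the names `flag_on` and `check_bridge_allowed`, the accepted truthy values, and the error wording are assumptions. Only the env var names and the two distinct failure cases come from this document.

```python
import os

# Env var per bridge kind, as listed in the flag matrix.
FLAGS = {
    "telegram": "CHEETAHCLAWS_ENABLE_F6",
    "slack": "CHEETAHCLAWS_ENABLE_F7",
    "wechat": "CHEETAHCLAWS_ENABLE_F8",
}
SHARED_FLAG = "CHEETAHCLAWS_ENABLE_F6"  # gate for the shared scaffolding


def flag_on(name, env=None):
    """Assumption: '1', 'true', 'yes', 'on' (any case) enable a flag."""
    env = os.environ if env is None else env
    return env.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def check_bridge_allowed(kind, env=None):
    """Return None when `kind` may start, else a human-readable error."""
    if kind not in FLAGS:
        return f"unknown bridge kind: {kind!r}"
    own = FLAGS[kind]
    if not flag_on(own, env):
        return f"{kind} bridge is disabled; set {own}=1"
    if not flag_on(SHARED_FLAG, env):
        return f"{kind} bridge requires {SHARED_FLAG}=1 (shared scaffolding)"
    return None
```

Returning a message instead of raising keeps the sketch easy to call from both an RPC handler and a CLI check; the real bridge.start is described as raising a clear error.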
Acceptance status (Phase 1):
- ✅ Bridge survives REPL exit. The worker thread is daemon-owned; the REPL never owns its lifetime. cheetahclaws daemon stop shuts it down via _watch_shutdown → bridge_supervisor.stop_all.
- ✅ Connection state persisted. A row lands in the bridges table on every start/stop. bridge.list merges live handles + persisted rows so the caller sees disabled bridges from previous daemon runs. Tokens are redacted to last 4 chars before they hit the row / the wire / event payloads.
- ✅ Outbound mailbox for F-4 #2. bridge_supervisor.notify(kind, text) dispatches to the running bridge's send function (lazy-imported from bridges/<kind>.py, so the daemon and REPL share network code). "*" broadcasts to every live bridge.
- ✅ REPL behaviour unchanged. Default-off flag; the existing /telegram slash command still uses today's in-process supervisor.
- ⚠ Phone → session.send (inbound API path). Deferred to Phase 2; see below.
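The "redacted to last 4 chars" rule for persisted and published config can be sketched as below. The helper names, the `***` mask, and the name-based choice of which keys count as secrets are assumptions made for illustration; the document only states that tokens keep their last four characters.

```python
def redact_secret(value, keep=4):
    """Mask a secret, keeping only its last `keep` characters."""
    if not isinstance(value, str) or not value:
        return value
    if len(value) <= keep:
        return "*" * len(value)  # too short to reveal any suffix safely
    return "***" + value[-keep:]


# Assumption: secret-bearing keys are recognised by substring.
SECRET_HINTS = ("token", "secret", "password")


def redacted_snapshot(config):
    """Return a copy of `config` that is safe to persist or publish."""
    return {
        key: redact_secret(val)
        if any(hint in key.lower() for hint in SECRET_HINTS)
        else val
        for key, val in config.items()
    }
```

Building the snapshot once and reusing it for the bridges row, the RPC response, and the event payload is one way to make sure no path sees the raw value.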
Bus events:
- bridge_started: payload includes redacted config.
- bridge_stopped: terminal state, with last_error for crash classification.
- bridge_crash: uncaught exception inside the worker.
Phase 2 replaces the legacy bridges/<kind>.py supervisor (which
expects a REPL session_ctx.run_query callback) with a slim
daemon-driven loop that talks to the rest of the system via two events
on the bus:
phone ──── poll ────► bridge worker ───── publish session_inbound ────► event bus
                                                                            │
                                                                 subscribers (REPL/Web)
                                                                            │
event bus ◄────── publish session_outbound ────── agent driver ◄──── consume inbound
    │
    └── subscribed by every Phase 2 bridge that matches session_id + target_bridges
            │
            └── handle.sender(config, text) ────► chat
New files / sections:
| File | Role |
|---|---|
| cc_daemon/session_methods.py | session.send(session_id, text, origin?, message_id?) publishes session_inbound. session.reply(session_id, text, target_bridges?, message_id?) publishes session_outbound. session.list_recent(limit=20) reads the in-memory LRU. Permission-routing originator defaults to the RPC caller's client_id when no explicit origin is supplied. |
| cc_daemon/bridge_supervisor.py | New BridgeHandle.daemon_phase2 flag + session_id() helper (tg:<chat_id>, sl:<channel>, wc:<user_id>). When daemon_phase2=True, the worker bypasses the legacy supervisor and runs _phase2_worker, which: (a) subscribes to the bus, filters session_outbound by session_id + target_bridges, forwards to handle.sender; (b) runs a per-kind inbound poller (_phase2_telegram_inbound, _phase2_slack_inbound, _phase2_wechat_inbound) that re-uses the existing HTTP helpers in bridges/<kind>.py but publishes session_inbound on every new message instead of invoking session_ctx.run_query. |
| cc_daemon/bridge_methods.py | bridge.start now accepts daemon_phase2: bool (default False). The bridge handle response surfaces daemon_phase2 + session_id so the caller can confirm what mode the worker is in. |
| cc_daemon/server.py | Registers session_methods on DaemonState.__init__. No feature flag: the methods are pure message-passing primitives and are safe on any daemon. |
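For clients that want to call the two session methods, the request bodies might be built as follows. This is a sketch under assumptions: it assumes the daemon speaks a standard JSON-RPC 2.0 envelope on POST /rpc, and the helper names are hypothetical. The method names and parameter names come from the table above; transport, auth headers, and the daemon address are deliberately left out.

```python
import json
import uuid


def rpc_request(method, params, request_id=None):
    """Build a JSON-RPC 2.0 request envelope (assumed wire format)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id if request_id is not None else str(uuid.uuid4()),
        "method": method,
        # Optional params left as None are omitted from the request.
        "params": {k: v for k, v in params.items() if v is not None},
    }


def session_send(session_id, text, origin=None, message_id=None, request_id=None):
    """Inbound message: the daemon is described as publishing session_inbound."""
    params = {"session_id": session_id, "text": text,
              "origin": origin, "message_id": message_id}
    return rpc_request("session.send", params, request_id)


def session_reply(session_id, text, target_bridges=None, message_id=None,
                  request_id=None):
    """Outbound reply; omitting target_bridges leaves it as broadcast."""
    params = {"session_id": session_id, "text": text,
              "target_bridges": target_bridges, "message_id": message_id}
    return rpc_request("session.reply", params, request_id)


# Serialise for the HTTP body of POST /rpc.
body = json.dumps(session_send("tg:12345", "hello", request_id=1))
```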
Acceptance criteria revisited:
| Criterion (from the RFC's per-bridge "Acceptance" block) | Status |
|---|---|
| Phone message → daemon session.send → REPL/Web/another bridge can subscribe to the same session and see events | ✅ via session_inbound events on the SSE feed |
| Bridge survives REPL exit; user can keep texting | ✅ (already from Phase 1; the daemon owns the worker thread) |
| Permission requests originating from a bridge-driven turn route only to that bridge for answer | ✅ via originator stamping — session.send writes origin=<kind>:<session_id> (or the explicit origin param) onto the event. The agent driver (REPL/Web) uses that string as the originator when minting a PermissionRequest; the existing cc_daemon/permission.py PermissionStore already enforces "only this originator can answer." |
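The "only this originator can answer" rule in the last row can be illustrated with a toy store. This is not the PermissionStore in cc_daemon/permission.py; the class, method names, and exception are hypothetical, and the only behaviour taken from this document is that the originator stamped on the request must match the one answering it.

```python
import itertools


class OriginatorMismatch(Exception):
    """Raised when a client other than the originator tries to answer."""


class PermissionStoreSketch:
    def __init__(self):
        self._pending = {}
        self._ids = itertools.count(1)

    def request(self, originator, description):
        """Record a pending request stamped with its originator."""
        req_id = next(self._ids)
        self._pending[req_id] = {
            "originator": originator,
            "description": description,
            "answer": None,
        }
        return req_id

    def answer(self, req_id, originator, allow):
        """Accept an answer only from the originator that was stamped."""
        entry = self._pending[req_id]
        if entry["originator"] != originator:
            raise OriginatorMismatch(
                f"request {req_id} belongs to {entry['originator']!r}"
            )
        entry["answer"] = bool(allow)
        return entry["answer"]
```

With an origin of the form `<kind>:<session_id>`, a request minted for `telegram:tg:12345` cannot be answered by a Slack bridge or a Web client.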
Bus events:
- session_inbound: {session_id, text, origin, message_id, ts}. Published by session.send or directly by a Phase 2 bridge worker on a new phone message. Identical shape either way so subscribers don't need to branch on source.
- session_outbound: {session_id, text, target_bridges, message_id, ts}. Published by session.reply (the agent driver's outbound surface). target_bridges=null is broadcast; a list of kinds restricts delivery. Phase 2 workers filter on session_id == handle.session_id() and (target_bridges is None or handle.kind in target_bridges).
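The session_id() formats and the outbound filter stated above translate directly into a predicate. A minimal sketch: `HandleSketch` and its `peer` field are stand-ins for the real BridgeHandle, whose fields this document does not list.

```python
from dataclasses import dataclass

# Session id prefixes per kind: tg:<chat_id>, sl:<channel>, wc:<user_id>.
PREFIX = {"telegram": "tg", "slack": "sl", "wechat": "wc"}


@dataclass
class HandleSketch:
    kind: str
    peer: str  # chat_id, channel, or user_id depending on kind

    def session_id(self):
        return f"{PREFIX[self.kind]}:{self.peer}"


def should_deliver(handle, event):
    """True when a session_outbound event is addressed to this bridge."""
    targets = event.get("target_bridges")  # None means broadcast
    return event["session_id"] == handle.session_id() and (
        targets is None or handle.kind in targets
    )
```

A worker would call `should_deliver` for each session_outbound event it sees on the bus and forward matching text to its sender.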
Tests:
- tests/test_cc_daemon_session_methods.py: 13 cases (publish, LRU, param validation across session.send / session.reply / session.list_recent).
- tests/test_cc_daemon_bridge_phase2.py: 7 cases: session_id() formatting (3, all three kinds), outbound delivery via session_outbound event matching session_id + target_bridges (2), inbound poller publishes session_inbound for a new Telegram message (1), bridge.start RPC passes daemon_phase2 through and surfaces it on the response (1).
Phase 1 still works unchanged — daemon_phase2=False (the default) keeps the legacy bridges/<kind>.py supervisor as the worker, preserving the REPL-shaped behaviour for callers that haven't migrated.
What's intentionally still REPL-driven after Phase 2:
- The agent loop itself. The daemon publishes session_inbound and forwards session_outbound; it does not start agent turns. A REPL/Web/automation client consumes session_inbound events, runs the agent loop, and calls session.reply for each output. F-4's subprocess runner gives the cleanest deployment path: an originator that wants fully-headless operation runs agent.start for a template that subscribes to session_inbound and drives the conversation.
- WeChat QR-login. The daemon's inbound poller assumes (token, base_url, user_id) are already in the config; the QR handshake to mint them is still REPL-driven (/wechat login). Migrating QR-login into the daemon is a separate change, called out in §F-8 above.
F-7 is mostly configuration: the F-6 bridge supervisor already knows
how to dispatch a kind="slack" worker, and the existing
bridges/slack.py _slack_supervisor(token, channel, config) plugs
in alongside Telegram's. What's new for F-7:
- Feature flag CHEETAHCLAWS_ENABLE_F7 (default off). bridge_supervisor.enabled("slack") reads this; bridge.start kind="slack" raises a clear error when it's missing, and a separate clear error when F-7 is on but F-6 isn't (the shared scaffolding has to be enabled for the daemon-side bridges feature surface to exist at all).
- Outbound sender resolution. _resolve_sender("slack") returns a thin wrapper over bridges/slack.py:_slack_send(token, channel, text), so bridge.send and the F-4 notify IPC route through the same HTTP code the REPL uses.
- bridges SQLite row. Same schema as Telegram's; the bridge.list RPC merges Slack rows in.
- Tests in tests/test_cc_daemon_bridge_supervisor.py::TestSlackWorker cover: F-6 dependency error, supervisor invocation with the expected (token, channel, config) shape, outbound sender wiring.
Acceptance status (Phase 1) — identical to F-6:
- ✅ Bridge survives REPL exit (daemon-owned thread).
- ✅ bridges row persisted on start/stop.
- ✅ Outbound notify mailbox accessible from F-4 runners.
- ✅ REPL /slack behaviour unchanged.
- ✅ Phone → session.send inbound path: landed via F-6 Phase 2 (this kind reuses _phase2_worker with its own _phase2_<kind>_inbound).
Same shape as F-7, with two WeChat-specific wrinkles called out by
the existing bridges/wechat.py:
- QR-login prerequisite. WeChat's transport requires an authed (token, base_url) pair that today's _wx_start_bridge mints via a QR-code login. The daemon worker doesn't drive the QR flow itself; instead, it checks that wechat_token and wechat_base_url are already set in the config dict and exits cleanly with a clear last_error if either is missing. Operators are expected to run /wechat login (REPL) once to populate the config, after which the daemon can take over.
- Per-user send. WeChat doesn't have Telegram's chat_id / Slack's channel; outbound goes to a specific contact identified by wechat_user_id in the bridge config. _resolve_sender("wechat") threads this through bridges/wechat.py:_wx_send(user_id, text, cfg).
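The missing-config clean-exit path from the QR-login wrinkle amounts to a preflight check before the worker starts polling. A minimal sketch, assuming a hypothetical `wechat_preflight` helper and message wording; the two config key names and the /wechat login hint come from this document.

```python
# Keys the daemon worker expects the REPL QR-login to have populated.
REQUIRED_WECHAT_KEYS = ("wechat_token", "wechat_base_url")


def wechat_preflight(config):
    """Return None when the worker may start, else text for last_error."""
    missing = [key for key in REQUIRED_WECHAT_KEYS if not config.get(key)]
    if missing:
        return (
            "wechat bridge not configured: missing "
            + ", ".join(missing)
            + "; run /wechat login in the REPL first"
        )
    return None
```

A worker would record the returned string as last_error and return without raising, so the failure shows up as a stopped bridge rather than a crash.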
Files / tests:
- Feature flag CHEETAHCLAWS_ENABLE_F8 (default off; depends on F-6 enabled too).
- Tests in tests/test_cc_daemon_bridge_supervisor.py::TestWechatWorker: F-6 dependency error, supervisor invocation with (token, base_url, config), missing-config clean-exit path, outbound sender wiring.
Acceptance status (Phase 1) — identical to F-6:
- ✅ Bridge survives REPL exit (daemon-owned thread).
- ✅ bridges row persisted; secrets redacted before storage.
- ✅ Outbound notify mailbox accessible from F-4 runners.
- ✅ REPL /wechat behaviour unchanged.
- ✅ Phone → session.send inbound path: landed via F-6 Phase 2 (this kind reuses _phase2_worker with its own _phase2_<kind>_inbound).
- ⚠ QR-login in-daemon not yet supported (today the REPL still drives the auth handshake before the daemon can take ownership).
Scope. When running under cheetahclaws serve, the four budget keys default to non-None:
{
"session_token_budget": 200000,
"session_cost_budget": 2.0,
"daily_token_budget": 2000000,
"daily_cost_budget": 20.0
}

REPL --in-process mode keeps None defaults (no surprise for existing users).
Acceptance.
- cheetahclaws serve started without overrides → cheetahclaws daemon status reports the four defaults.
- Agent runner exceeds per-session budget → status moves to paused_budget, quota_warn event emitted, runner pauses.
- agent.resume RPC with a new budget argument unpauses the runner.
- REPL without daemon: budgets still default to None.
What landed:
| File | Role |
|---|---|
| cc_daemon/cli.py | New module-level F9_SERVE_BUDGET_DEFAULTS dict (200k tokens / $2 / 2M tokens / $20) plus _apply_serve_defaults(config), a pure function that flips any None budget key to its conservative default. Called from cmd_serve after load_config() and before _bootstrap, so the quota module sees the final values on first init. |
| cc_daemon/system_methods.py | New system.status RPC returning {budgets: {…four keys…}, runners: int, bridges: int}. The four keys are surfaced verbatim from daemon_state.config so agent.resume's mutations are visible the next time someone polls. |
| cc_daemon/agent_methods.py | New agent.resume RPC accepting `budget_overrides: {key: value |
| commands/daemon_cmd.py | _status now calls system.status after system.ping and prints a budgets: block plus live runners / bridges counts. Backward-compatible: an older daemon that doesn't speak system.status falls through silently (the system.ping line still appears). |
| tests/test_cc_daemon_f9_budgets.py | 12 cases: _apply_serve_defaults (3, pure-function), system.status (3, returns budgets + counts, handles unlimited), agent.resume (6, merge, null=unlimited, unknown key, non-numeric, non-dict, noop empty). |
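The six agent.resume cases listed in the test row (merge, null=unlimited, unknown key, non-numeric, non-dict, noop empty) suggest a validate-then-merge step roughly like the one below. This is a sketch, not the handler in cc_daemon/agent_methods.py: `RpcError`, `merge_budget_overrides`, and the use of JSON-RPC code -32602 for every rejection are assumptions.

```python
BUDGET_KEYS = (
    "session_token_budget",
    "session_cost_budget",
    "daily_token_budget",
    "daily_cost_budget",
)
INVALID_PARAMS = -32602  # JSON-RPC 2.0 "Invalid params"


class RpcError(Exception):
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def merge_budget_overrides(config, overrides):
    """Validate every override first, then return a merged copy of config."""
    if not isinstance(overrides, dict):
        raise RpcError(INVALID_PARAMS, "budget_overrides must be an object")
    for key, value in overrides.items():
        if key not in BUDGET_KEYS:
            raise RpcError(INVALID_PARAMS, f"unknown budget key: {key}")
        # None means "unlimited"; bool is an int subclass, so reject it explicitly.
        numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if value is not None and not numeric:
            raise RpcError(INVALID_PARAMS, f"{key} must be a number or null")
    merged = dict(config)
    merged.update(overrides)
    return merged
```

Validating the whole object before touching the config means a bad key never leaves the budgets half-updated.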
Per-runner quota-pause hook (landed in second pass):
| Stage | Where | Behaviour |
|---|---|---|
| Pre-iter check | AgentRunner._run_loop (top of every iteration body) | Calls quota.check_quota(_session_id, _config). If raises QuotaExceeded, hands the exception to _on_quota_exceeded. |
| Base hook | AgentRunner._on_quota_exceeded | No-op. REPL path keeps today's behaviour: agent.run itself catches QuotaExceeded internally and yields a [Quota exceeded …] text chunk. |
| F-4 override | _PipeAgentRunner._on_quota_exceeded | Sends {"op":"paused_budget", "reason": …} IPC, sets self.status='paused_budget', then blocks on self._resume_event.wait(). On wake sends {"op":"resumed"} and returns. |
| Supervisor inbound | runner_supervisor._reader_loop | New paused_budget branch: flips handle.status='paused_budget', calls _db_update_run_status (updates agent_runs.status + error), publishes quota_warn on the bus. New resumed branch mirrors that back to running + publishes agent_runner_resumed. |
| Supervisor outbound | runner_supervisor.resume(name) | Public function. Sends {"op":"resume"} IPC frame to the named runner. Idempotent: a runner that wasn't paused absorbs the frame in its control loop. |
| RPC | agent.resume | Now accepts optional name. When supplied, calls runner_supervisor.resume(name) after merging budget overrides. Returns `{"budgets": {…}, "resumed": bool |
| Control loop | _pipe_main._control_loop | New resume handler sets runner._resume_event. The stop handler also sets the event so a stop arriving while paused unblocks the runner cleanly instead of waiting up to 30 minutes for an IPC frame that's never coming. |
Events on the bus:
- quota_warn: {name, run_id, reason}, fired on the supervisor's first sighting of paused_budget.
- agent_runner_resumed: {name, run_id}, fired when the runner re-enters running.
The pre-iter check is read-only — it doesn't write to the quota file or consume tokens. The actual budget enforcement still happens inside agent.run on every API call (record_usage after each turn, check_quota before the next). The runner-side hook just adds a fast-fail check at iteration boundaries so a paused runner can sit cheaply on a wait_event instead of repeatedly bouncing off the quota inside agent.run.
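The pause/resume handshake between the runner's quota hook and its control loop can be sketched with a `threading.Event`. The class below is illustrative, not `_PipeAgentRunner`: the names, the `send_ipc` callback, and what happens after a stop wakes the runner (the sketch returns False without sending a resumed frame) are assumptions. The frame shapes and the rule that both resume and stop set the event come from the table above.

```python
import threading


class PausableRunnerSketch:
    def __init__(self, send_ipc):
        self._send_ipc = send_ipc  # callable taking one dict frame
        self._resume_event = threading.Event()
        self._stop_requested = False
        self.status = "running"

    def on_quota_exceeded(self, reason):
        """Block until resumed or stopped; True means keep running."""
        # Clear before announcing the pause so a stale resume cannot skip it.
        self._resume_event.clear()
        self._send_ipc({"op": "paused_budget", "reason": reason})
        self.status = "paused_budget"
        self._resume_event.wait()
        if self._stop_requested:
            self.status = "stopped"
            return False
        self._send_ipc({"op": "resumed"})
        self.status = "running"
        return True

    def handle_control(self, frame):
        """Control-loop side: resume and stop both wake a paused runner."""
        op = frame.get("op")
        if op == "resume":
            self._resume_event.set()
        elif op == "stop":
            self._stop_requested = True
            self._resume_event.set()
```

Waiting on the event costs nothing while paused, which matches the stated goal of letting a paused runner sit cheaply instead of repeatedly hitting the quota check.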
Tests for the quota-pause hook in tests/test_cc_daemon_quota_pause.py (2 cases): full IPC roundtrip (paused_budget → supervisor quota_warn → resume → resumed → agent_runner_resumed), and runner_supervisor.resume("no-such-runner") returns False. Plus 2 new cases in tests/test_cc_daemon_f9_budgets.py: agent.resume(name=…) calls runner_supervisor.resume, and an empty name field is rejected with -32602.
Cost-default knobs operators can override:
// ~/.cheetahclaws/config.json (overrides win over F-9 defaults)
{
"session_token_budget": 500000, // 500k tokens per session
"session_cost_budget": 5.0, // $5 per session
"daily_token_budget": null, // explicit "unlimited" survives F-9
"daily_cost_budget": 100.0
}

REPL invariant: cheetahclaws (no serve) still imports cc_config
directly, so the four budget keys remain None (unlimited); F-9 only
fires inside cmd_serve. Verified by the existing
tests/test_cc_daemon_cli.py round-trip plus the new _apply_serve_defaults
unit tests (which don't depend on a daemon being up).
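The default-filling step can be sketched as a pure function. How the real _apply_serve_defaults tells an explicit null in config.json apart from a key the user never set is not shown in this document; the sketch assumes the caller passes the set of keys the user actually wrote (the hypothetical `explicit_keys` parameter), so an explicit "unlimited" is left alone while untouched keys get the conservative value.

```python
# Values from the F-9 scope: 200k tokens / $2 / 2M tokens / $20.
F9_SERVE_BUDGET_DEFAULTS = {
    "session_token_budget": 200_000,
    "session_cost_budget": 2.0,
    "daily_token_budget": 2_000_000,
    "daily_cost_budget": 20.0,
}


def apply_serve_defaults(config, explicit_keys=()):
    """Return a copy of config with unset budget keys filled in.

    `explicit_keys` (an assumption of this sketch) names the keys the
    user wrote in their config file, including ones set to null.
    """
    out = dict(config)
    for key, default in F9_SERVE_BUDGET_DEFAULTS.items():
        if out.get(key) is None and key not in explicit_keys:
            out[key] = default
    return out
```

Because the function returns a new dict and reads nothing global, it can be unit-tested without a running daemon, which is the property the tests above rely on.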
- Tests. Every PR ships unit tests; F-1, F-3, F-4, F-6/7/8 also ship tests/e2e_daemon_<area>.py.
- Docs. Every PR updates the relevant section in docs/architecture.md. The "Daemon" header is created by F-1; subsequent PRs append.
- Config keys. New keys go in cc_config.DEFAULTS; documented in docs/architecture.md.
- Backwards compatibility. Users who never run cheetahclaws serve see no behavior change until the eventual default flip. That flip is out of scope here and tracked in #68 as the "Phase D" item.
When a PR lands, change its Status in the index from TODO to MERGED #<pr>. If acceptance criteria evolve during a PR, update the per-PR section in the same PR — do not let this doc drift from the implementation.
{
  "interval_s": 300,             // configured idle threshold
  "last_tick_at": 1715520012.3,  // when the user was last active
  "fired_at": 1715520312.8       // current time the tick was emitted
}
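The three fields in this payload are consistent with a simple idle check: the tick fires once the time since last_tick_at reaches interval_s. A minimal sketch under that assumption; the function names and the use of `>=` are not taken from the scheduler's source.

```python
def should_fire(last_tick_at, interval_s, now):
    """True once the user has been idle for at least interval_s seconds."""
    return (now - last_tick_at) >= interval_s


def build_payload(last_tick_at, interval_s, now):
    """Assemble the event payload in the shape shown above."""
    return {
        "interval_s": interval_s,
        "last_tick_at": last_tick_at,
        "fired_at": now,
    }
```

With the sample values, fired_at minus last_tick_at is about 300.5 seconds, just past the 300-second threshold.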