fix(cli): preserve sessions across ao stop → ao update → ao start (closes #1743)#1780
fix(cli): preserve sessions across ao stop → ao update → ao start (closes #1743)#1780suraj-markup wants to merge 3 commits into
Conversation
) The ao stop → ao update → ao start flow was silently dropping the restore prompt because last-stop.json could be missing on disk by the time `ao start` looked for it. Three layered fixes: 1. ao stop and the SIGTERM shutdown handler now PRE-WRITE last-stop.json before the kill loop runs, then reconcile after. The previous order wrote only after the loop completed — a SIGKILL or crash mid-loop silently lost the record. The pre-write also calls fsyncSync on the temp file before rename, so the data survives a hard kill that lands immediately after the call returns (renameSync is atomic but the data blocks aren't durable until fsync). 2. ao start now falls back to scanning recently `manually_killed` sessions (terminated within 10 minutes, reason=manually_killed) when last-stop.json is missing or empty. The restore prompt surfaces them with the same UX as the existing record so a single regression in the write pipeline cannot silently drop the user's in-flight work again. 3. Added a regression test that exercises the stop → simulated update → start flow against a temp HOME and asserts the file survives. Tests added: - packages/cli/__tests__/lib/last-stop-fallback.test.ts (5) - packages/cli/__tests__/lib/stop-update-start-flow.test.ts (2) - writeLastStop fsync test in running-state.test.ts (1) - ao update preserves last-stop.json regression in update.test.ts (1) - pre-write/clear ordering tests in start.test.ts (2) - ao start fallback path tests in start.test.ts (2) Closes #1743 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test Coverage Report
Per-file breakdown
Uncovered lines
|
Greptile SummaryThis PR adds three layers of hardening for the
Confidence Score: 4/5Safe to merge with awareness that the write-after-kill-loop ordering remains unaddressed — the fsync durability and fallback scan together mitigate the worst consequences, but the narrow race window between the kill loop and the writeLastStop call still exists. The fsync durability fix and the fallback mechanism are both well-implemented and thoroughly tested. The cross-project config fix (loading global config in the fallback) is a genuine improvement over the previous revision. What keeps this from a clean score is the unresolved timing issue called out in a prior review: packages/cli/src/commands/start.ts around the
|
| Filename | Overview |
|---|---|
| packages/cli/src/lib/running-state.ts | Adds atomicWriteFileSyncDurable which fsyncs the temp fd before renaming; writeLastStop now uses this instead of the previous plain atomic write. Temp-file cleanup on write failure is handled in the finally block. The parent directory entry after renameSync is not itself fsync'd, which is a minor durability gap on strict crash-consistency filesystems, but a clear improvement over the prior state. |
| packages/cli/src/commands/start.ts | Adds the fallback path: when readLastStop() returns null/empty, loads the global config and calls findRecentlyKilledSessions so cross-project sessions are surfaced. The fallback is enclosed in the existing non-fatal try/catch. The writeLastStop in the ao stop subcommand is still called after the kill loop, preserving the original race window. |
| packages/cli/src/lib/last-stop-fallback.ts | New module: buildLastStopFallback filters sessions by manually_killed reason and terminatedAt within a configurable window, groups by project, and synthesizes a LastStopState. findRecentlyKilledSessions wraps this in an async call with silent error suppression. Logic is clean and well-tested. |
| packages/cli/tests/lib/last-stop-fallback.test.ts | New unit test file covering all key branches of buildLastStopFallback: empty result, primary-project sessions, other-project sessions, malformed/missing timestamps, and custom window. Good use of a fixed NOW anchor for deterministic time-window assertions. |
| packages/cli/tests/lib/running-state.test.ts | Adds three new tests: fsync is called before rename, temp file is removed when writeFileSync throws, and otherProjects round-trips through write/read. Uses a module-level node:fs mock that delegates to the real implementation while recording calls — a clean pattern for observing sync I/O without breaking ESM. |
| packages/cli/tests/commands/start.test.ts | Adds three regression tests: fallback fires for recently killed sessions, fallback respects the 10-minute window, and fallback uses global config so cross-project sessions appear in otherProjects. Tests use AO_GLOBAL_CONFIG env override and restore it in finally blocks to avoid test pollution. |
| packages/cli/tests/commands/update.test.ts | Adds a regression test that stages a real last-stop.json in a temp HOME directory and asserts byte-for-byte equality after ao update runs. Correctly overrides both HOME and USERPROFILE for cross-platform coverage and restores them in finally. |
| packages/cli/tests/lib/stop-update-start-flow.test.ts | New hermetic integration test for the full stop→update→start flow: verifies writeLastStop/readLastStop round-trip and that the fallback (findRecentlyKilledSessions) surfaces killed sessions when the file is absent. Uses a dedicated testHome backed by node:os mock to stay isolated from the host filesystem. |
Sequence Diagram
sequenceDiagram
participant User
participant aoStop as ao stop
participant aoUpdate as ao update
participant aoStart as ao start
participant SM as SessionManager
participant FS as Filesystem
User->>aoStop: ao stop
aoStop->>SM: "list() -> activeSessions"
aoStop->>SM: kill(session) x N
Note over aoStop,FS: Fix #2: atomicWriteFileSyncDurable
aoStop->>FS: openSync(last-stop.tmp)
aoStop->>FS: writeFileSync(fd, data)
aoStop->>FS: fsyncSync(fd) NEW
aoStop->>FS: closeSync(fd)
aoStop->>FS: renameSync(tmp to last-stop.json)
User->>aoUpdate: ao update
Note over aoUpdate,FS: Does NOT touch ~/.agent-orchestrator/
aoUpdate->>aoUpdate: run ao-update.sh
User->>aoStart: ao start
aoStart->>FS: readLastStop()
alt last-stop.json exists and non-empty
FS-->>aoStart: LastStopState
else "file missing or empty (Fix #3: fallback)"
aoStart->>FS: existsSync(globalConfigPath)
aoStart->>SM: getSessionManager(globalConfig)
aoStart->>SM: "list() -> all sessions"
Note over aoStart: Filter: manually_killed, within 10 min
SM-->>aoStart: synthesized LastStopState
end
aoStart->>User: Restore N sessions stopped at X?
User->>aoStart: yes
aoStart->>SM: restore(sessionId) x N
aoStart->>FS: clearLastStop()
Reviews (3): Last reviewed commit: "fix(cli): fallback scans the global conf..." | Re-trigger Greptile
|
Hey @suraj-markup, thanks for the layered fix. Before approving I'd like a deterministic repro on fresh Why I'm asking:
If the race is hard to trigger by hand, an instrumented test that exercises the SIGKILL / mid-shutdown window would work just as well. Happy to approve the fallback piece on its own in the meantime if that helps unblock. |
) PR #1780 review (Harshit): the `ao stop` CLI is a separate process from the `ao start` parent and `await writeLastStop(...)` completes before the CLI sends SIGTERM to the parent — so the original race the pre-write+reconcile branch was guarding against is not reproducible by hand on main. The fallback in `ao start` is doing the user-visible work. Drop the higher-surface pre-write+reconcile changes; keep what is unambiguously safer or independently useful: - Keep `last-stop-fallback.ts` + its integration in runStartup so a missing/malformed last-stop.json still surfaces a restore prompt. - Keep the fsync'd `atomicWriteFileSyncDurable` so the write survives a hard kill that lands immediately after rename. - Also fix Greptile P2: `atomicWriteFileSyncDurable` now unlinks the temp file on `writeFileSync`/`fsyncSync` throw (previously leaked). - Revert `registerStop` and the SIGTERM shutdown handler to their pre-PR post-kill write shape — same as origin/main. - Drop the two pre-write ordering tests; this removes the Greptile P2 unreachable `clearLastStop` branch naturally. Tests retained / added: - last-stop-fallback unit tests (5) - stop → simulated update → start integration test (2) - writeLastStop fsync test - NEW: writeLastStop leaves no temp file on write-throw - ao update no-touch regression test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @harshitsinghbhandari — all three concerns are fair and I agree with the downsize. Pushed 72beb03. On the repro (your #1 and #2). You're right. The What's in the new commit:
Kept (unambiguously safer or independently useful):
On your #3 ( Net diff vs your last review: −238 / +83 across 5 files. |
…surface (#1743) Greptile P1 on PR #1780. `getSessionManager(config)` in the fallback path was built from the current project's config, so `sm.list()` only saw that project's sessions. When `ao stop` ran globally and killed sessions across multiple projects, the synthesized `LastStopState` would have an empty `otherProjects` — defeating the cross-project restore that the pre-existing `readLastStop` path already supports (because `ao stop` writes the cross-project rows at stop time). Mirror the global-config load that the restore step on line ~1002 already does: if the global config exists, prefer it when constructing the fallback session manager. The downstream restore code already loads the global config when `otherProjects` is non-empty, so this just lets the fallback populate that array in the first place. Regression test in start.test.ts asserts both the in-project and a cross-project session get routed to `sm.restore()` after the fallback synthesizes a record from a global-scope `sm.list()`. The two existing fallback tests now pin AO_GLOBAL_CONFIG to a non-existent path so they don't accidentally read the host's real ~/.agent-orchestrator/config.yaml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Greptile P1 (fallback session manager using project-scoped config) → fixed in 1339d9c. The fallback now loads the global config (mirroring the restore step at ~start.ts:1002) before constructing the session manager, so cross-project sessions are surfaced into |
Summary
Closes #1743.
The
ao stop→ao update→ao startflow was silently dropping the restore prompt becauselast-stop.jsoncould be missing from disk by the timeao startchecked for it. Three layered fixes harden the path:Root cause
ao stop(and the SIGTERM shutdown handler inlib/shutdown.ts) only wrotelast-stop.jsonAFTER finishing the kill loop. If the CLI was killed mid-shutdown — SIGKILL, hard crash, or an impatient re-run — the record was never written andao starthad nothing to restore.The atomic write also went through
writeFileSync+renameSyncwithout an explicitfsync, so a hard kill that landed immediately after the call could lose the file's data even though the rename had committed the dirent.Fix
ao stopand the shutdown handler now writelast-stop.jsonwith every active session id BEFORE running the kill loop, then reconcile (rewrite orclearLastStop) once the loop's actual results are known. The newatomicWriteFileSyncDurablecallsfsyncSyncon the temp file beforerenameSync, so the bytes are durable even if the process is killed milliseconds later.ao start. Whenlast-stop.jsonis missing or empty, scan recentlymanually_killedsessions (terminated within the last 10 minutes) viafindRecentlyKilledSessionsand synthesize aLastStopState. The restore prompt fires with the same wording as the existing record. This protects against future regressions in the write pipeline.ao updatedoes NOT touchlast-stop.json. The currentao updatecommand never reads or writes anything under~/.agent-orchestrator/, but adding state-clearing logic in the future would re-introduce the bug. The new test stages alast-stop.jsonin a tempHOMEand asserts it is byte-for-byte identical afterao updateruns.Before vs. after of the file-write timing
If the CLI is killed between
kill(s)andwrite last-stop.jsonin the old order, the record vanishes. The new order moves the durable write before any teardown work.Tests added
packages/cli/__tests__/lib/last-stop-fallback.test.ts(5) — unit tests forbuildLastStopFallback: window respect, primary vs other-project routing, missing/unparseable timestamps, custom windowpackages/cli/__tests__/lib/stop-update-start-flow.test.ts(2) — hermetic stop → simulated update → start flow against a temp HOME, plus fallback pathwriteLastStop fsyncs the temp file before renaminginrunning-state.test.tsdoes not touch ~/.agent-orchestrator/last-stop.jsoninupdate.test.tspre-writes last-stop.json before the kill loop runsinstart.test.tsclears last-stop.json when every kill failsinstart.test.tsfalls back to recently manually-killed sessions when last-stop.json is missinginstart.test.tsdoes not surface fallback candidates older than the recent windowinstart.test.tspnpm typecheckandpnpm lintpass.pnpm testshows the same 10 pre-existing failures on this branch as onmain; my 7 new test cases all pass.Test plan
pnpm typecheckcleanpnpm lintno errors (warnings only, all pre-existing)pnpm exec vitest run lib/last-stop-fallback lib/stop-update-start-flow lib/running-state— all greenpnpm exec vitest run commands/start.test.ts -t "issue #1743|every kill fails"— 4 passpnpm exec vitest run commands/update.test.ts— 26 pass (including new regression)ao stop→ao update→ao starton a checkout with active sessions reproduces the restore prompt🤖 Generated with Claude Code