
fix(cli): preserve sessions across ao stop → ao update → ao start (closes #1743) #1780

Open
suraj-markup wants to merge 3 commits into main from fix/1743-session-restore-after-update


@suraj-markup
Collaborator

Summary

Closes #1743.

The ao stop → ao update → ao start flow was silently dropping the restore prompt because last-stop.json could be missing from disk by the time ao start checked for it. Three layered fixes harden the path:

Root cause

ao stop (and the SIGTERM shutdown handler in lib/shutdown.ts) only wrote last-stop.json AFTER finishing the kill loop. If the CLI was killed mid-shutdown — SIGKILL, hard crash, or an impatient re-run — the record was never written and ao start had nothing to restore.

The atomic write also went through writeFileSync + renameSync without an explicit fsync, so a hard kill that landed immediately after the call could lose the file's data even though the rename had committed the dirent.
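A minimal sketch of the fsync-before-rename pattern described above. The function name `writeFileDurably` and the temp-file naming are assumptions for illustration; the PR's `atomicWriteFileSyncDurable` may differ in detail. Only standard `node:fs` calls are used.

```typescript
import {
  closeSync, existsSync, fsyncSync, mkdtempSync, openSync,
  readFileSync, renameSync, unlinkSync, writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical stand-in for the PR's atomicWriteFileSyncDurable.
function writeFileDurably(targetPath: string, data: string): void {
  const tmpPath = `${targetPath}.${process.pid}.tmp`;
  let fd: number | undefined;
  let renamed = false;
  try {
    fd = openSync(tmpPath, "w");
    writeFileSync(fd, data);
    fsyncSync(fd); // flush the temp file's data before the rename publishes it
    closeSync(fd);
    fd = undefined;
    renameSync(tmpPath, targetPath);
    renamed = true;
  } finally {
    if (fd !== undefined) {
      try { closeSync(fd); } catch { /* keep the original error */ }
    }
    if (!renamed) {
      try { unlinkSync(tmpPath); } catch { /* temp file may never have been created */ }
    }
  }
}

// Usage: write into a scratch directory and read the record back.
const scratchDir = mkdtempSync(join(tmpdir(), "last-stop-sketch-"));
const recordPath = join(scratchDir, "last-stop.json");
writeFileDurably(recordPath, JSON.stringify({ sessionIds: ["s1", "s2"] }));
const roundTripped = readFileSync(recordPath, "utf8");
const tempLeftBehind = existsSync(`${recordPath}.${process.pid}.tmp`);
```

The sketch does not fsync the parent directory after the rename, so the directory entry itself can still be lost on some filesystems.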

Fix

  1. Pre-write before kill, with fsync. ao stop and the shutdown handler now write last-stop.json with every active session id BEFORE running the kill loop, then reconcile (rewrite or clearLastStop) once the loop's actual results are known. The new atomicWriteFileSyncDurable calls fsyncSync on the temp file before renameSync, so the bytes are durable even if the process is killed milliseconds later.
  2. Fallback in ao start. When last-stop.json is missing or empty, scan recently manually_killed sessions (terminated within the last 10 minutes) via findRecentlyKilledSessions and synthesize a LastStopState. The restore prompt fires with the same wording as the existing record. This protects against future regressions in the write pipeline.
  3. Regression test that ao update does NOT touch last-stop.json. The current ao update command never reads or writes anything under ~/.agent-orchestrator/, but adding state-clearing logic in the future would re-introduce the bug. The new test stages a last-stop.json in a temp HOME and asserts it is byte-for-byte identical after ao update runs.
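The fallback in item 2 can be sketched roughly as follows. The `SessionRecord` and `SyntheticLastStop` shapes, their field names, and `buildFallbackSketch` are illustrative assumptions; the real `Session` and `LastStopState` types are not shown in this thread.

```typescript
// Illustrative shapes only, not the PR's actual types.
interface SessionRecord {
  id: string;
  projectId: string;
  terminationReason?: string;
  terminatedAt?: string; // ISO-8601 timestamp
}

interface SyntheticLastStop {
  stoppedAt: string;
  projectId: string;
  sessionIds: string[];
  otherProjects: { projectId: string; sessionIds: string[] }[];
}

const RECENT_WINDOW_MS = 10 * 60 * 1000; // the 10-minute window from the description

function buildFallbackSketch(
  sessions: SessionRecord[],
  primaryProjectId: string,
  nowMs: number,
  windowMs: number = RECENT_WINDOW_MS,
): SyntheticLastStop | null {
  const idsByProject = new Map<string, string[]>();
  let latestMs = 0;
  for (const s of sessions) {
    if (s.terminationReason !== "manually_killed" || !s.terminatedAt) continue;
    const killedMs = Date.parse(s.terminatedAt);
    // Skip unparseable timestamps and anything older than the window.
    if (Number.isNaN(killedMs) || nowMs - killedMs > windowMs) continue;
    const ids = idsByProject.get(s.projectId) ?? [];
    ids.push(s.id);
    idsByProject.set(s.projectId, ids);
    latestMs = Math.max(latestMs, killedMs);
  }
  if (idsByProject.size === 0) return null;
  const otherProjects = Array.from(idsByProject.entries())
    .filter(([projectId]) => projectId !== primaryProjectId)
    .map(([projectId, sessionIds]) => ({ projectId, sessionIds }));
  return {
    stoppedAt: new Date(latestMs).toISOString(),
    projectId: primaryProjectId,
    sessionIds: idsByProject.get(primaryProjectId) ?? [],
    otherProjects,
  };
}

const NOW = Date.parse("2026-05-10T12:00:00.000Z");
const fallback = buildFallbackSketch([
  { id: "a1", projectId: "app", terminationReason: "manually_killed", terminatedAt: "2026-05-10T11:58:00.000Z" },
  { id: "b1", projectId: "lib", terminationReason: "manually_killed", terminatedAt: "2026-05-10T11:55:00.000Z" },
  { id: "a2", projectId: "app", terminationReason: "manually_killed", terminatedAt: "2026-05-10T11:00:00.000Z" }, // outside window
  { id: "a3", projectId: "app", terminationReason: "manually_killed", terminatedAt: "not-a-date" },
  { id: "a4", projectId: "app", terminationReason: "completed", terminatedAt: "2026-05-10T11:59:00.000Z" },
], "app", NOW);
```

Passing `nowMs` in rather than calling `Date.now()` keeps the window check deterministic under test.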

Before vs. after of the file-write timing

Before:                                     After:
sm.list()                                   sm.list()
filter active sessions                      filter active sessions
for s in active: kill(s) ─┐                 writeLastStop(active)        ◄── pre-write, fsync'd
write last-stop.json      │                 for s in active: kill(s) ─┐
killProcessTree(parent) ◄─┘ races with      reconcile last-stop.json  │
                            shutdown.ts     killProcessTree(parent) ◄─┘

If the CLI is killed between kill(s) and write last-stop.json in the old order, the record vanishes. The new order moves the durable write before any teardown work.
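A minimal sketch of the "After" ordering, using synchronous stand-ins injected as dependencies. `StopDeps` and `stopWithPreWrite` are hypothetical names, and the reconcile rule (rewrite with the sessions that were actually killed, clear when none were) is an assumption based on the description above.

```typescript
interface StopDeps {
  kill: (sessionId: string) => boolean; // true when the kill succeeded
  writeLastStop: (sessionIds: string[]) => void;
  clearLastStop: () => void;
}

function stopWithPreWrite(active: string[], deps: StopDeps): string[] {
  // Pre-write: the record exists before any teardown starts.
  deps.writeLastStop(active);
  const killed = active.filter((id) => deps.kill(id));
  // Reconcile once the real results are known.
  if (killed.length === 0) deps.clearLastStop();
  else if (killed.length < active.length) deps.writeLastStop(killed);
  return killed;
}

// Usage: record the call order with one kill failing.
const events: string[] = [];
const killedIds = stopWithPreWrite(["s1", "s2", "s3"], {
  kill: (id) => { events.push(`kill:${id}`); return id !== "s2"; },
  writeLastStop: (ids) => { events.push(`write:${ids.join(",")}`); },
  clearLastStop: () => { events.push("clear"); },
});
```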

Tests added

  • packages/cli/__tests__/lib/last-stop-fallback.test.ts (5) — unit tests for buildLastStopFallback: window respect, primary vs other-project routing, missing/unparseable timestamps, custom window
  • packages/cli/__tests__/lib/stop-update-start-flow.test.ts (2) — hermetic stop → simulated update → start flow against a temp HOME, plus fallback path
  • writeLastStop fsyncs the temp file before renaming in running-state.test.ts
  • does not touch ~/.agent-orchestrator/last-stop.json in update.test.ts
  • pre-writes last-stop.json before the kill loop runs in start.test.ts
  • clears last-stop.json when every kill fails in start.test.ts
  • falls back to recently manually-killed sessions when last-stop.json is missing in start.test.ts
  • does not surface fallback candidates older than the recent window in start.test.ts

pnpm typecheck and pnpm lint pass. pnpm test shows the same 10 pre-existing failures on this branch as on main; my 7 new test cases all pass.

Test plan

  • pnpm typecheck clean
  • pnpm lint no errors (warnings only, all pre-existing)
  • pnpm exec vitest run lib/last-stop-fallback lib/stop-update-start-flow lib/running-state — all green
  • pnpm exec vitest run commands/start.test.ts -t "issue #1743|every kill fails" — 4 pass
  • pnpm exec vitest run commands/update.test.ts — 26 pass (including new regression)
  • Manual: ao stop → ao update → ao start on a checkout with active sessions reproduces the restore prompt

🤖 Generated with Claude Code

)

The ao stop → ao update → ao start flow was silently dropping the
restore prompt because last-stop.json could be missing on disk by the
time `ao start` looked for it. Three layered fixes:

1. ao stop and the SIGTERM shutdown handler now PRE-WRITE last-stop.json
   before the kill loop runs, then reconcile after. The previous order
   wrote only after the loop completed — a SIGKILL or crash mid-loop
   silently lost the record. The pre-write also calls fsyncSync on the
   temp file before rename, so the data survives a hard kill that lands
   immediately after the call returns (renameSync is atomic but the
   data blocks aren't durable until fsync).

2. ao start now falls back to scanning recently `manually_killed`
   sessions (terminated within 10 minutes, reason=manually_killed)
   when last-stop.json is missing or empty. The restore prompt
   surfaces them with the same UX as the existing record so a single
   regression in the write pipeline cannot silently drop the user's
   in-flight work again.

3. Added a regression test that exercises the stop → simulated update →
   start flow against a temp HOME and asserts the file survives.

Tests added:
- packages/cli/__tests__/lib/last-stop-fallback.test.ts (5)
- packages/cli/__tests__/lib/stop-update-start-flow.test.ts (2)
- writeLastStop fsync test in running-state.test.ts (1)
- ao update preserves last-stop.json regression in update.test.ts (1)
- pre-write/clear ordering tests in start.test.ts (2)
- ao start fallback path tests in start.test.ts (2)

Closes #1743

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 10, 2026

Test Coverage Report

| Metric | Value |
| --- | --- |
| Lines covered | 1098/1536 |
| Lines not covered | 438/1536 |
| Overall coverage | 71.5% |
Per-file breakdown
| File | Coverage |
| --- | --- |
| packages/cli/src/commands/start.ts | 884/1242 (71.2%) |
| packages/cli/src/lib/last-stop-fallback.ts | 60/62 (96.8%) |
| packages/cli/src/lib/running-state.ts | 154/232 (66.4%) |

Uncovered lines

  • packages/cli/src/commands/start.ts: L116-L117, L128-L130, L133-L135, L137, L139-L143, L146-L147, L149, L151-L155, L157-L159, L199-L200, L205-L223, L225-L232, L234-L235, L241-L247, L249, L280-L281, L295-L297, L306-L307, L310-L334, L379-L393, L395-L398, L400-L414, L442, L453, L477-L478, L481-L482, L490-L491, L498-L503, L541-L542, L566-L569, L583-L584, L591-L601, L604-L605, L629-L630, L633-L637, L645-L652, L659-L660, L697-L700, L715-L720, L737-L739, L741-L748, L750-L752, L754-L759, L761-L762, L856-L859, L906, L928-L936, L1022-L1024, L1057-L1058, L1074-L1075, L1081-L1082, L1084-L1085, L1089, L1144-L1145, L1163-L1168, L1201-L1202, L1208-L1209, L1280, L1300-L1301, L1385-L1386, L1438-L1455, L1512, L1546-L1548, L1571-L1578, L1605-L1613, L1624-L1625, L1659-L1660, L1668-L1679, L1682, L1698, L1704, L1707-L1715, L1725-L1726, L1767-L1770, L1773, L1775-L1777, L1782-L1783, L1792-L1796, L1798-L1799, L1808-L1809, L1819-L1824, L1860-L1866
  • packages/cli/src/lib/last-stop-fallback.ts: L120-L121
  • packages/cli/src/lib/running-state.ts: L59-L60, L74-L75, L78-L85, L98-L99, L133-L136, L157-L158, L164, L189-L196, L204-L215, L226-L236, L250-L252, L264-L266, L280-L287, L317-L319, L350, L359-L366

@greptile-apps
Contributor

greptile-apps Bot commented May 10, 2026

Greptile Summary

This PR adds three layers of hardening for the ao stop → ao update → ao start restore flow: (1) replaces the plain writeFileSync+renameSync pair with an fsync-before-rename variant (atomicWriteFileSyncDurable) to make last-stop.json durable against SIGKILL; (2) adds a fallback in ao start that scans recently manually_killed sessions when the file is missing, using the global config so cross-project sessions are included; and (3) adds regression tests that assert ao update never touches last-stop.json.

  • atomicWriteFileSyncDurable (running-state.ts): calls fsyncSync on the temp fd before renameSync so data survives a kill signal arriving milliseconds after the write returns; cleanup on write failure is handled in the finally block.
  • Fallback scan (start.ts, last-stop-fallback.ts): when readLastStop() returns null or empty, findRecentlyKilledSessions queries the session manager for manually_killed sessions within a 10-minute window and synthesizes a LastStopState so the restore prompt still fires.
  • Test coverage includes unit tests for buildLastStopFallback (window, routing, malformed timestamps), a hermetic stop→update→start integration test, and fsync/temp-cleanup regression tests for writeLastStop.

Confidence Score: 4/5

Safe to merge with awareness that the write-after-kill-loop ordering remains unaddressed — the fsync durability and fallback scan together mitigate the worst consequences, but the narrow race window between the kill loop and the writeLastStop call still exists.

The fsync durability fix and the fallback mechanism are both well-implemented and thoroughly tested. The cross-project config fix (loading global config in the fallback) is a genuine improvement over the previous revision. What keeps this from a clean score is the unresolved timing issue called out in a prior review: writeLastStop in ao stop is still invoked after the kill loop finishes, meaning a SIGKILL that lands inside the kill loop can still produce a missing last-stop.json — exactly the scenario the PR description's Before vs. after diagram was meant to fix. The fallback partially compensates via the 10-minute window, but it is a recovery mechanism, not a prevention.

packages/cli/src/commands/start.ts around the ao stop subcommand kill loop (lines 1760-1817) — the pre-write ordering claimed by the PR description was not implemented here.

Important Files Changed

| Filename | Overview |
| --- | --- |
| `packages/cli/src/lib/running-state.ts` | Adds atomicWriteFileSyncDurable which fsyncs the temp fd before renaming; writeLastStop now uses this instead of the previous plain atomic write. Temp-file cleanup on write failure is handled in the finally block. The parent directory entry after renameSync is not itself fsync'd, which is a minor durability gap on strict crash-consistency filesystems, but a clear improvement over the prior state. |
| `packages/cli/src/commands/start.ts` | Adds the fallback path: when readLastStop() returns null/empty, loads the global config and calls findRecentlyKilledSessions so cross-project sessions are surfaced. The fallback is enclosed in the existing non-fatal try/catch. The writeLastStop in the ao stop subcommand is still called after the kill loop, preserving the original race window. |
| `packages/cli/src/lib/last-stop-fallback.ts` | New module: buildLastStopFallback filters sessions by manually_killed reason and terminatedAt within a configurable window, groups by project, and synthesizes a LastStopState. findRecentlyKilledSessions wraps this in an async call with silent error suppression. Logic is clean and well-tested. |
| `packages/cli/__tests__/lib/last-stop-fallback.test.ts` | New unit test file covering all key branches of buildLastStopFallback: empty result, primary-project sessions, other-project sessions, malformed/missing timestamps, and custom window. Good use of a fixed NOW anchor for deterministic time-window assertions. |
| `packages/cli/__tests__/lib/running-state.test.ts` | Adds three new tests: fsync is called before rename, temp file is removed when writeFileSync throws, and otherProjects round-trips through write/read. Uses a module-level node:fs mock that delegates to the real implementation while recording calls, a clean pattern for observing sync I/O without breaking ESM. |
| `packages/cli/__tests__/commands/start.test.ts` | Adds three regression tests: fallback fires for recently killed sessions, fallback respects the 10-minute window, and fallback uses global config so cross-project sessions appear in otherProjects. Tests use AO_GLOBAL_CONFIG env override and restore it in finally blocks to avoid test pollution. |
| `packages/cli/__tests__/commands/update.test.ts` | Adds a regression test that stages a real last-stop.json in a temp HOME directory and asserts byte-for-byte equality after ao update runs. Correctly overrides both HOME and USERPROFILE for cross-platform coverage and restores them in finally. |
| `packages/cli/__tests__/lib/stop-update-start-flow.test.ts` | New hermetic integration test for the full stop→update→start flow: verifies writeLastStop/readLastStop round-trip and that the fallback (findRecentlyKilledSessions) surfaces killed sessions when the file is absent. Uses a dedicated testHome backed by node:os mock to stay isolated from the host filesystem. |
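The byte-for-byte check described for update.test.ts can be sketched as below. `fakeUpdate` is a hypothetical stand-in for invoking `ao update` (the real test runs the actual command), and the staged JSON content is a placeholder.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical stand-in for `ao update`; it deliberately touches no state files.
function fakeUpdate(_home: string): void {
  /* no reads or writes under ~/.agent-orchestrator/ */
}

// Stage a last-stop.json inside a throwaway HOME.
const home = mkdtempSync(join(tmpdir(), "ao-home-"));
const stateDir = join(home, ".agent-orchestrator");
mkdirSync(stateDir, { recursive: true });
const lastStopPath = join(stateDir, "last-stop.json");
writeFileSync(lastStopPath, JSON.stringify({ sessionIds: ["s1"] }));

// Compare raw bytes before and after, not parsed JSON.
const before = readFileSync(lastStopPath);
fakeUpdate(home);
const after = readFileSync(lastStopPath);
const unchanged = before.equals(after);
```

Comparing `Buffer` contents rather than parsed objects also catches rewrites that only change whitespace or key order.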

Sequence Diagram

sequenceDiagram
    participant User
    participant aoStop as ao stop
    participant aoUpdate as ao update
    participant aoStart as ao start
    participant SM as SessionManager
    participant FS as Filesystem

    User->>aoStop: ao stop
    aoStop->>SM: "list() -> activeSessions"
    aoStop->>SM: kill(session) x N
    Note over aoStop,FS: Fix #2: atomicWriteFileSyncDurable
    aoStop->>FS: openSync(last-stop.tmp)
    aoStop->>FS: writeFileSync(fd, data)
    aoStop->>FS: fsyncSync(fd) NEW
    aoStop->>FS: closeSync(fd)
    aoStop->>FS: renameSync(tmp to last-stop.json)

    User->>aoUpdate: ao update
    Note over aoUpdate,FS: Does NOT touch ~/.agent-orchestrator/
    aoUpdate->>aoUpdate: run ao-update.sh

    User->>aoStart: ao start
    aoStart->>FS: readLastStop()
    alt last-stop.json exists and non-empty
        FS-->>aoStart: LastStopState
    else "file missing or empty (Fix #3: fallback)"
        aoStart->>FS: existsSync(globalConfigPath)
        aoStart->>SM: getSessionManager(globalConfig)
        aoStart->>SM: "list() -> all sessions"
        Note over aoStart: Filter: manually_killed, within 10 min
        SM-->>aoStart: synthesized LastStopState
    end
    aoStart->>User: Restore N sessions stopped at X?
    User->>aoStart: yes
    aoStart->>SM: restore(sessionId) x N
    aoStart->>FS: clearLastStop()

Reviews (3): Last reviewed commit: "fix(cli): fallback scans the global conf..." | Re-trigger Greptile

Comment thread packages/cli/src/lib/running-state.ts
Comment thread packages/cli/src/commands/start.ts Outdated
@harshitsinghbhandari
Collaborator

Hey @suraj-markup, thanks for the layered fix.

Before approving I'd like a deterministic repro on fresh main. I ran the issue's exact flow (ao stop → ao update → ao start) on latest upstream/main (macOS) and the restore prompt fires for me: await writeLastStop(...) in registerStop (packages/cli/src/commands/start.ts:1781) completes before the SIGTERM to the parent, so the file is on disk.

Why I'm asking:

  1. The pre-write before kill change in registerStop sits inside the inner try/catch — if writeLastStop throws (lock contention, disk full, etc.), the outer catch fires and the kill loop gets skipped. That's a behavior change vs main (where a write failure didn't block kills), and worth confirming it's solving a real path before we ship it.
  2. Without a hard repro I can't tell whether the pre-write + reconcile state machine actually closes the gap, or whether the fallback in ao start is doing all the work on its own. The fallback is unambiguously safe; the pre-write + reconcile is a much bigger surface for a scenario nobody can hand-trigger.
  3. The original report's running.json showed "projects": [] at the time of the curl — which hints the new ao start hadn't finished registering yet. Wondering if some of what looked like "missing restore prompt" was actually a separate timing symptom.
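Point 1 can be illustrated with a small sketch that contrasts the two shapes. Both functions are hypothetical simplifications, not code from this PR: `stopCoupled` mirrors a pre-write sharing a try block with the kill loop, and `stopIsolated` shows one way a write failure could be kept from blocking kills.

```typescript
type Kill = (sessionId: string) => void;
type Write = (sessionIds: string[]) => void;

// Shape the review warns about: one try block, so a write failure skips the kills.
function stopCoupled(ids: string[], write: Write, kill: Kill): void {
  try {
    write(ids);
    ids.forEach((id) => kill(id));
  } catch {
    // error swallowed by the outer handler; sessions are left running
  }
}

// Alternative shape: the record is best-effort and cannot block teardown.
function stopIsolated(ids: string[], write: Write, kill: Kill): void {
  try {
    write(ids);
  } catch {
    // a failed record is tolerated; the kill loop below still runs
  }
  ids.forEach((id) => kill(id));
}

// Usage: simulate a disk-full write failure against both shapes.
const failingWrite: Write = () => { throw new Error("ENOSPC: disk full"); };
const killedCoupled: string[] = [];
const killedIsolated: string[] = [];
stopCoupled(["s1", "s2"], failingWrite, (id) => { killedCoupled.push(id); });
stopIsolated(["s1", "s2"], failingWrite, (id) => { killedIsolated.push(id); });
```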

If the race is hard to trigger by hand, an instrumented test that exercises the SIGKILL / mid-shutdown window would work just as well. Happy to approve the fallback piece on its own in the meantime if that helps unblock.

)

PR #1780 review (Harshit): the `ao stop` CLI is a separate process from
the `ao start` parent and `await writeLastStop(...)` completes before
the CLI sends SIGTERM to the parent — so the original race the
pre-write+reconcile branch was guarding against is not reproducible by
hand on main. The fallback in `ao start` is doing the user-visible work.

Drop the higher-surface pre-write+reconcile changes; keep what is
unambiguously safer or independently useful:

- Keep `last-stop-fallback.ts` + its integration in runStartup so a
  missing/malformed last-stop.json still surfaces a restore prompt.
- Keep the fsync'd `atomicWriteFileSyncDurable` so the write survives a
  hard kill that lands immediately after rename.
- Also fix Greptile P2: `atomicWriteFileSyncDurable` now unlinks the
  temp file on `writeFileSync`/`fsyncSync` throw (previously leaked).
- Revert `registerStop` and the SIGTERM shutdown handler to their
  pre-PR post-kill write shape — same as origin/main.
- Drop the two pre-write ordering tests; this removes the Greptile P2
  unreachable `clearLastStop` branch naturally.

Tests retained / added:
- last-stop-fallback unit tests (5)
- stop → simulated update → start integration test (2)
- writeLastStop fsync test
- NEW: writeLastStop leaves no temp file on write-throw
- ao update no-touch regression test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@suraj-markup
Collaborator Author

Thanks @harshitsinghbhandari — all three concerns are fair and I agree with the downsize. Pushed 72beb03.

On the repro (your #1 and #2). You're right. The ao stop CLI is a separate process from ao start, so await writeLastStop(...) returns before this CLI sends SIGTERM to the parent — the file is on disk before the parent dies. The only ways the pre-write+reconcile branch closed a real gap were (a) the ao stop CLI itself getting SIGKILL'd between the kill loop and the writeLastStop call, or (b) an fsync race where the parent dies before the kernel flushes data blocks. Neither is hand-triggerable, and I couldn't write an instrumented test that wasn't just exercising the mock. So per your suggestion in #3b, I dropped that surface.

What's in the new commit:

  • registerStop and the SIGTERM shutdown handler — reverted to the original post-kill write shape (same as origin/main).
  • Pre-write ordering tests in start.test.ts — dropped (with them, your behavior-change concern in #1 also goes away naturally: a writeLastStop throw can no longer block the kill loop, because the pre-write doesn't exist).
  • Greptile P2 unreachable clearLastStop branch — gone with the reconciliation code.

Kept (unambiguously safer or independently useful):

  • last-stop-fallback.ts + the runStartup integration — your defense-in-depth point.
  • atomicWriteFileSyncDurable with fsync in running-state.ts — durability is strictly safer; not behavior-changing in any code path either of us cares about.
  • Greptile P2 #1 fix: temp file is now unlinked on writeFileSync/fsyncSync throw (previously leaked). Test added in running-state.test.ts.
  • ao update no-touch regression test (update.test.ts).
  • stop → simulated update → start integration test (stop-update-start-flow.test.ts).

On your #3 (projects: [] in running.json). Strong hint; not investigating in this PR since it's a separate timing symptom and the fallback covers the user-visible regression either way. Happy to file a follow-up issue if you'd like.

Net diff vs your last review: −238 / +83 across 5 files. pnpm typecheck clean; same 4 pre-existing failures on this branch as on the prior commit (none in the files touched here).

Comment thread packages/cli/src/commands/start.ts
…surface (#1743)

Greptile P1 on PR #1780. `getSessionManager(config)` in the fallback
path was built from the current project's config, so `sm.list()` only
saw that project's sessions. When `ao stop` ran globally and killed
sessions across multiple projects, the synthesized `LastStopState`
would have an empty `otherProjects` — defeating the cross-project
restore that the pre-existing `readLastStop` path already supports
(because `ao stop` writes the cross-project rows at stop time).

Mirror the global-config load that the restore step on line ~1002
already does: if the global config exists, prefer it when constructing
the fallback session manager. The downstream restore code already
loads the global config when `otherProjects` is non-empty, so this
just lets the fallback populate that array in the first place.

Regression test in start.test.ts asserts both the in-project and a
cross-project session get routed to `sm.restore()` after the fallback
synthesizes a record from a global-scope `sm.list()`. The two existing
fallback tests now pin AO_GLOBAL_CONFIG to a non-existent path so they
don't accidentally read the host's real ~/.agent-orchestrator/config.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@suraj-markup
Collaborator Author

Greptile P1 (fallback session manager using project-scoped config) → fixed in 1339d9c. The fallback now loads the global config (mirroring the restore step at ~start.ts:1002) before constructing the session manager, so cross-project sessions are surfaced into otherProjects and routed through the normal restore path. Regression test in start.test.ts: fallback uses the global config so cross-project sessions appear in otherProjects (PR #1780) — asserts both an in-project and a cross-project session get sm.restore() after a global-scope sm.list(). The two existing fallback tests are now pinned to a non-existent AO_GLOBAL_CONFIG so they don't read the host's real ~/.agent-orchestrator/config.yaml. Thread resolved.
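A rough sketch of the two ideas in this fix: prefer the global config when it exists, and pin `AO_GLOBAL_CONFIG` in tests so the host's real config is never read. The helper names (`globalConfigPath`, `configPathForFallback`, `withEnv`) and the placeholder YAML content are assumptions for illustration; only `AO_GLOBAL_CONFIG` and the default path come from the thread.

```typescript
import { existsSync, mkdtempSync, writeFileSync } from "node:fs";
import { homedir, tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical resolver: AO_GLOBAL_CONFIG overrides the default location.
function globalConfigPath(): string {
  return process.env.AO_GLOBAL_CONFIG ?? join(homedir(), ".agent-orchestrator", "config.yaml");
}

// Prefer the global config when present so the fallback can see every project.
function configPathForFallback(projectConfigPath: string): string {
  const candidate = globalConfigPath();
  return existsSync(candidate) ? candidate : projectConfigPath;
}

// Set an environment variable for the duration of fn, restoring it in finally.
function withEnv<T>(name: string, value: string, fn: () => T): T {
  const previous = process.env[name];
  process.env[name] = value;
  try {
    return fn();
  } finally {
    if (previous === undefined) delete process.env[name];
    else process.env[name] = previous;
  }
}

// Usage: one run with a real global config, one pinned to a missing path.
const configDir = mkdtempSync(join(tmpdir(), "ao-config-"));
const realGlobal = join(configDir, "config.yaml");
writeFileSync(realGlobal, "projects: {}\n"); // placeholder content
const missingGlobal = join(configDir, "does-not-exist.yaml");

const withGlobal = withEnv("AO_GLOBAL_CONFIG", realGlobal, () => configPathForFallback("project.yaml"));
const withoutGlobal = withEnv("AO_GLOBAL_CONFIG", missingGlobal, () => configPathForFallback("project.yaml"));
```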


Development

Successfully merging this pull request may close these issues.

bug(cli): ao stop → ao update → ao start leaves sessions terminated, no restore prompt

2 participants