fix(web): defer idle PTY kill to curb macOS ptmx slot exhaustion (#1718)#1719
fix(web): defer idle PTY kill to curb macOS ptmx slot exhaustion (#1718)#1719harshitsinghbhandari wants to merge 2 commits into
Conversation
…posioHQ#1718) When the last WebSocket subscriber disconnects from a terminal, TerminalManager used to call pty.kill() and drop the entry immediately. On every reconnect (tab reload, sleep/wake, network blip) the next attach allocated a fresh PTY with a new ptmx slot number. macOS never recycles ptmx slot numbers within a boot, so this churn slowly burned through kern.tty.ptmx_max=511 — a dashboard with ~30 sessions hit the cap after ~28 days of uptime, after which every new PTY allocation on the host (terminal tabs, tmux attach, ao spawn) fails until reboot. Add a 30s idle grace window before killing the PTY: - ManagedTerminal now tracks an idleGraceTimer. - The unsubscribe path schedules pty.kill() via setTimeout instead of calling it inline. The timer's callback re-checks subscribers.size to handle a reconnect that lands during the window. - The subscribe path clears any pending timer when a new subscriber arrives, so a quick reconnect transparently reuses the existing PTY (and its ring buffer) without allocating a new slot. - The timer is .unref()'d so it never holds the event loop alive on shutdown. Together with ComposioHQ#1640 (re-attach loop bound), this should push slot exhaustion from ~28 days out to several months under the same load. Adds unit tests covering: deferred kill, expiry without reconnect, reconnect-within-grace preservation, multi-subscriber semantics, and buffer continuity across an unsubscribe/resubscribe cycle. Fixes: ComposioHQ#1718 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Greptile SummaryThis PR addresses PTY slot exhaustion on macOS by deferring the
Confidence Score: 4/5The change is well-scoped and well-tested; the new grace-timer path does not affect any hot path when subscribers are present. The implementation is correct and the five new unit tests cover the critical scenarios. Two minor latent issues were flagged: packages/web/server/mux-websocket.ts — specifically the
|
| Filename | Overview |
|---|---|
| packages/web/server/mux-websocket.ts | Core change: defers PTY kill on last unsubscribe by 30 s via idleGraceTimer; clears the timer on reconnect; includes re-check inside timer callback. Logic is sound; one latent ordering note around open() not clearing the grace timer directly. |
| packages/web/server/tests/mux-websocket.test.ts | Adds 5 focused unit tests for the grace-period behaviour (no-immediate-kill, deferred kill, reconnect-within-grace, multi-subscriber semantics, buffer preservation). All use fake timers correctly. |
| packages/web/server/tests/direct-terminal-ws.integration.test.ts | Fixes "receives terminal data after open" to use a unique terminal id so it reliably exercises the fresh-attach path rather than accidentally hitting the new grace-reuse path. |
| .changeset/pty-idle-grace-period.md | Changeset entry correctly categorises the change as a patch on @aoagents/ao-web. |
Sequence Diagram
sequenceDiagram
participant WS as WebSocket Client
participant TM as TerminalManager
participant Timer as idleGraceTimer
participant PTY as PTY Process
WS->>TM: subscribe("ao-177")
TM->>PTY: spawn (if not already alive)
TM-->>WS: unsub()
Note over WS,TM: Client disconnects (tab reload / network blip)
WS->>TM: unsub()
TM->>Timer: setTimeout(30 s).unref()
Note over Timer: Grace window open
alt Reconnect within 30 s
WS->>TM: subscribe("ao-177")
TM->>Timer: clearTimeout — PTY reused, no new ptmx slot
TM-->>WS: buffered history replayed
else No reconnect — grace expires
Timer->>TM: callback fires
TM->>TM: "re-check subscribers.size === 0"
TM->>PTY: pty.kill()
TM->>TM: terminals.delete(key)
end
Comments Outside Diff (1)
-
packages/web/server/mux-websocket.ts, line 568-597 (link)open()doesn't cancel a pending grace timerIn the WS message handler,
terminalManager.open()(line 570) is called beforeterminalManager.subscribe()(line 597). The grace timer is only cancelled insidesubscribe(), so there is a narrow window where the client has already received the"opened"confirmation and the buffer has been sent, but the 30 s kill is still ticking. In the current sync call path this is harmless, but ifopen()is ever called without a subsequentsubscribe()— or if async work is inserted between the two — the grace timer will fire and kill the PTY that the client was just told is ready. ClearingidleGraceTimerat the top ofopen()(alongside the early-return PTY-alive check) would make the invariant hold regardless of call order.Prompt To Fix With AI
This is a comment left during a code review. Path: packages/web/server/mux-websocket.ts Line: 568-597 Comment: **`open()` doesn't cancel a pending grace timer** In the WS message handler, `terminalManager.open()` (line 570) is called before `terminalManager.subscribe()` (line 597). The grace timer is only cancelled inside `subscribe()`, so there is a narrow window where the client has already received the `"opened"` confirmation and the buffer has been sent, but the 30 s kill is still ticking. In the current sync call path this is harmless, but if `open()` is ever called without a subsequent `subscribe()` — or if async work is inserted between the two — the grace timer will fire and kill the PTY that the client was just told is ready. Clearing `idleGraceTimer` at the top of `open()` (alongside the early-return PTY-alive check) would make the invariant hold regardless of call order. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/web/server/mux-websocket.ts:568-597
**`open()` doesn't cancel a pending grace timer**
In the WS message handler, `terminalManager.open()` (line 570) is called before `terminalManager.subscribe()` (line 597). The grace timer is only cancelled inside `subscribe()`, so there is a narrow window where the client has already received the `"opened"` confirmation and the buffer has been sent, but the 30 s kill is still ticking. In the current sync call path this is harmless, but if `open()` is ever called without a subsequent `subscribe()` — or if async work is inserted between the two — the grace timer will fire and kill the PTY that the client was just told is ready. Clearing `idleGraceTimer` at the top of `open()` (alongside the early-return PTY-alive check) would make the invariant hold regardless of call order.
### Issue 2 of 2
packages/web/server/mux-websocket.ts:467-493
**Double-`unsub()` call re-arms a timer on an evicted terminal**
Once the grace timer fires it sets `terminal.idleGraceTimer = null` and removes the entry from `this.terminals`. If the captured `unsub` closure is then called a second time (defensive teardown, React strict-mode double-invoke, etc.), the conditions `subscribers.size === 0 && idleGraceTimer === null` are both true and a new 30 s timer is scheduled on the dead `terminal` object. The timer fires harmlessly (PTY is already null, `delete` on the map is a no-op), but it creates an unnecessary lingering reference. A simple guard like checking `!this.terminals.has(key)` before scheduling would make `unsub` idempotent after eviction.
Reviews (1): Last reviewed commit: "fix(web): defer idle PTY kill to curb ma..." | Re-trigger Greptile
Address PR ComposioHQ#1719 review (greptile P2): once the grace timer fires it clears idleGraceTimer and evicts the terminal from this.terminals. A double-unsub() (React strict-mode double-invoke, defensive teardown) would then satisfy `subscribers.size === 0 && idleGraceTimer === null` and arm a fresh 30 s timer holding a closure to the dead terminal. Add a `this.terminals.has(key)` guard so unsub is a no-op once the entry has been evicted. New regression test verifies a second unsub after grace expiry doesn't fire a second kill or schedule a new timer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Fixes #1718.
When the last WebSocket subscriber disconnects from a terminal,
TerminalManagerused to callpty.kill()and drop the entry inline. Every reconnect — tab reload, sleep/wake, network blip — then allocated a fresh PTY with a new ptmx slot number.macOS never recycles ptmx slot numbers within a boot, so this churn slowly burned through
kern.tty.ptmx_max=511. A dashboard with ~30 sessions hit the cap after ~28 days of uptime; afterwards every new PTY allocation on the host (terminal tabs, tmux attach, ao spawn) failed until reboot.This PR addresses the dominant churn source called out in the issue: reconnect-driven re-allocation. It is orthogonal to and stacks on top of #1640 (re-attach loop bound).
What changed
ManagedTerminalnow tracks anidleGraceTimer.PTY_IDLE_GRACE_MS = 30_000instead of running inline. The timer's callback re-checkssubscribers.sizeso a reconnect that lands inside the window transparently reuses the existing PTY and its ring buffer.resetTimerintroduced in fix(web): bound PTY re-attach loop with grace-period counter reset #1640 so its closure doesn't keep the evicted terminal reachable..unref()d so it doesn't hold the event loop alive on shutdown.TerminalManageris now exported so it can be unit-tested.Together with #1640, this should push slot exhaustion from ~28 days out to several months under the same load.
Test plan
pnpm --filter @aoagents/ao-web test -- mux-websocket— 17/17 pass (5 new + 3 fix: scroll wheel broken in tmux terminal - PR #1608 regression #1714 regression + 9 SessionBroadcaster)pnpm --filter @aoagents/ao-web test -- direct-terminal-ws— 21/21 pass (includes PTY file-descriptor leak in dashboard mux server exhausts macOS system PTY pool after 10–20h #1639/fix(web): bound PTY re-attach loop with grace-period counter reset #1640 runaway + recovery integration tests)pnpm typecheck— cleanpnpm lint— 0 errors (warnings are pre-existing in unrelated files)pnpm build— cleanNew unit tests
Integration test touch-up
direct-terminal-ws.integration.test.ts"receives terminal data after open" was relying on the implicit pre-#1718 behavior that every WS open triggered a fresh tmux attach (which in turn produced fresh init data). With the new grace path the previous test's PTY is reused and no fresh init data flows. Switched the test to a unique terminal id so it explicitly exercises the fresh-attach path it was meant to test. Other tests in that file drive output via stdin and were not affected.Notes for reviewer
idleGraceTimer.unref()avoids holding the event loop alive at process shutdown — without it, anao stopwhile sessions are in grace could delay exit by up to 30s.🤖 Generated with Claude Code