Skip to content

fix(web): defer idle PTY kill to curb macOS ptmx slot exhaustion (#1718)#1719

Open
harshitsinghbhandari wants to merge 2 commits into
ComposioHQ:mainfrom
harshitsinghbhandari:feat/1718
Open

fix(web): defer idle PTY kill to curb macOS ptmx slot exhaustion (#1718)#1719
harshitsinghbhandari wants to merge 2 commits into
ComposioHQ:mainfrom
harshitsinghbhandari:feat/1718

Conversation

@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator

Summary

Fixes #1718.

When the last WebSocket subscriber disconnects from a terminal, TerminalManager used to call pty.kill() and drop the entry inline. Every reconnect — tab reload, sleep/wake, network blip — then allocated a fresh PTY with a new ptmx slot number.

macOS never recycles ptmx slot numbers within a boot, so this churn slowly burned through kern.tty.ptmx_max=511. A dashboard with ~30 sessions hit the cap after ~28 days of uptime; afterwards every new PTY allocation on the host (terminal tabs, tmux attach, ao spawn) failed until reboot.

This PR addresses the dominant churn source called out in the issue: reconnect-driven re-allocation. It is orthogonal to and stacks on top of #1640 (re-attach loop bound).

What changed

  • ManagedTerminal now tracks an idleGraceTimer.
  • On last unsubscribe, the kill is deferred by PTY_IDLE_GRACE_MS = 30_000 instead of running inline. The timer's callback re-checks subscribers.size so a reconnect that lands inside the window transparently reuses the existing PTY and its ring buffer.
  • On new subscribe, any pending grace timer is cleared.
  • When the grace timer actually fires the kill, it also clears the resetTimer introduced in fix(web): bound PTY re-attach loop with grace-period counter reset #1640 so its closure doesn't keep the evicted terminal reachable.
  • The grace timer is .unref()d so it doesn't hold the event loop alive on shutdown.
  • TerminalManager is now exported so it can be unit-tested.

Together with #1640, this should push slot exhaustion from ~28 days out to several months under the same load.

Test plan

New unit tests

  • Kill is deferred on last unsub
  • Kill fires after 30s with no reconnect
  • Reconnect within grace preserves PTY (no new spawn, no kill)
  • Kill fires only when truly the last subscriber leaves (multi-subscriber semantics)
  • Ring buffer survives an unsub/resub cycle within grace

Integration test touch-up

direct-terminal-ws.integration.test.ts "receives terminal data after open" was relying on the implicit pre-#1718 behavior that every WS open triggered a fresh tmux attach (which in turn produced fresh init data). With the new grace path the previous test's PTY is reused and no fresh init data flows. Switched the test to a unique terminal id so it explicitly exercises the fresh-attach path it was meant to test. Other tests in that file drive output via stdin and were not affected.

Notes for reviewer

  • The 30s window is tunable. It needs to be long enough to absorb a tab reload (<1s) and brief sleep/wake (typically <10s), but short enough that abandoned sessions don't sit on PTY slots forever. 30s mirrors the issue's proposed value.
  • idleGraceTimer.unref() avoids holding the event loop alive at process shutdown — without it, an ao stop while sessions are in grace could delay exit by up to 30s.

🤖 Generated with Claude Code

…posioHQ#1718)

When the last WebSocket subscriber disconnects from a terminal,
TerminalManager used to call pty.kill() and drop the entry immediately.
On every reconnect (tab reload, sleep/wake, network blip) the next
attach allocated a fresh PTY with a new ptmx slot number.

macOS never recycles ptmx slot numbers within a boot, so this churn
slowly burned through kern.tty.ptmx_max=511 — a dashboard with ~30
sessions hit the cap after ~28 days of uptime, after which every new
PTY allocation on the host (terminal tabs, tmux attach, ao spawn)
fails until reboot.

Add a 30s idle grace window before killing the PTY:
- ManagedTerminal now tracks an idleGraceTimer.
- The unsubscribe path schedules pty.kill() via setTimeout instead of
  calling it inline. The timer's callback re-checks subscribers.size
  to handle a reconnect that lands during the window.
- The subscribe path clears any pending timer when a new subscriber
  arrives, so a quick reconnect transparently reuses the existing PTY
  (and its ring buffer) without allocating a new slot.
- The timer is .unref()'d so it never holds the event loop alive on
  shutdown.

Together with ComposioHQ#1640 (re-attach loop bound), this should push slot
exhaustion from ~28 days out to several months under the same load.

Adds unit tests covering: deferred kill, expiry without reconnect,
reconnect-within-grace preservation, multi-subscriber semantics, and
buffer continuity across an unsubscribe/resubscribe cycle.

Fixes: ComposioHQ#1718

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR addresses PTY slot exhaustion on macOS by deferring the pty.kill() call when the last WebSocket subscriber disconnects. Instead of killing immediately, a 30 s grace timer is scheduled; a reconnect arriving within that window cancels the timer and reuses the existing PTY and ring buffer, eliminating the ptmx-slot churn that was draining kern.tty.ptmx_max=511 over weeks.

  • Core change (mux-websocket.ts): ManagedTerminal gains an idleGraceTimer field; the unsubscribe path schedules the kill via setTimeout (.unref()'d); subscribe() cancels any pending timer on reconnect; the timer callback re-checks subscribers.size before killing.
  • New unit tests: Five tests covering no-immediate-kill, deferred kill at 30 s, reconnect-preserves-PTY, multi-subscriber semantics, and buffer-survival across an unsub/resub cycle — all using fake timers correctly.
  • Integration test fix: "receives terminal data after open" now uses a unique terminal id to avoid landing in the grace-reuse path, which would silently deliver an empty ring buffer and time out.

Confidence Score: 4/5

The change is well-scoped and well-tested; the new grace-timer path does not affect any hot path when subscribers are present.

The implementation is correct and the five new unit tests cover the critical scenarios. Two minor latent issues were flagged: open() not clearing the grace timer (currently safe due to synchronous call order, but a maintenance hazard) and a double-unsub() call re-arming a timer on an already-evicted terminal object (harmless in practice). Neither affects current behaviour.

packages/web/server/mux-websocket.ts — specifically the open() / subscribe() call ordering in the WS message handler and the unsubscribe closure.

Important Files Changed

Filename Overview
packages/web/server/mux-websocket.ts Core change: defers PTY kill on last unsubscribe by 30 s via idleGraceTimer; clears the timer on reconnect; includes re-check inside timer callback. Logic is sound; one latent ordering note around open() not clearing the grace timer directly.
packages/web/server/tests/mux-websocket.test.ts Adds 5 focused unit tests for the grace-period behaviour (no-immediate-kill, deferred kill, reconnect-within-grace, multi-subscriber semantics, buffer preservation). All use fake timers correctly.
packages/web/server/tests/direct-terminal-ws.integration.test.ts Fixes "receives terminal data after open" to use a unique terminal id so it reliably exercises the fresh-attach path rather than accidentally hitting the new grace-reuse path.
.changeset/pty-idle-grace-period.md Changeset entry correctly categorises the change as a patch on @aoagents/ao-web.

Sequence Diagram

sequenceDiagram
    participant WS as WebSocket Client
    participant TM as TerminalManager
    participant Timer as idleGraceTimer
    participant PTY as PTY Process

    WS->>TM: subscribe("ao-177")
    TM->>PTY: spawn (if not already alive)
    TM-->>WS: unsub()

    Note over WS,TM: Client disconnects (tab reload / network blip)
    WS->>TM: unsub()
    TM->>Timer: setTimeout(30 s).unref()
    Note over Timer: Grace window open

    alt Reconnect within 30 s
        WS->>TM: subscribe("ao-177")
        TM->>Timer: clearTimeout — PTY reused, no new ptmx slot
        TM-->>WS: buffered history replayed
    else No reconnect — grace expires
        Timer->>TM: callback fires
        TM->>TM: "re-check subscribers.size === 0"
        TM->>PTY: pty.kill()
        TM->>TM: terminals.delete(key)
    end
Loading

Comments Outside Diff (1)

  1. packages/web/server/mux-websocket.ts, line 568-597 (link)

    P2 open() doesn't cancel a pending grace timer

    In the WS message handler, terminalManager.open() (line 570) is called before terminalManager.subscribe() (line 597). The grace timer is only cancelled inside subscribe(), so there is a narrow window where the client has already received the "opened" confirmation and the buffer has been sent, but the 30 s kill is still ticking. In the current sync call path this is harmless, but if open() is ever called without a subsequent subscribe() — or if async work is inserted between the two — the grace timer will fire and kill the PTY that the client was just told is ready. Clearing idleGraceTimer at the top of open() (alongside the early-return PTY-alive check) would make the invariant hold regardless of call order.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: packages/web/server/mux-websocket.ts
    Line: 568-597
    
    Comment:
    **`open()` doesn't cancel a pending grace timer**
    
    In the WS message handler, `terminalManager.open()` (line 570) is called before `terminalManager.subscribe()` (line 597). The grace timer is only cancelled inside `subscribe()`, so there is a narrow window where the client has already received the `"opened"` confirmation and the buffer has been sent, but the 30 s kill is still ticking. In the current sync call path this is harmless, but if `open()` is ever called without a subsequent `subscribe()` — or if async work is inserted between the two — the grace timer will fire and kill the PTY that the client was just told is ready. Clearing `idleGraceTimer` at the top of `open()` (alongside the early-return PTY-alive check) would make the invariant hold regardless of call order.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
packages/web/server/mux-websocket.ts:568-597
**`open()` doesn't cancel a pending grace timer**

In the WS message handler, `terminalManager.open()` (line 570) is called before `terminalManager.subscribe()` (line 597). The grace timer is only cancelled inside `subscribe()`, so there is a narrow window where the client has already received the `"opened"` confirmation and the buffer has been sent, but the 30 s kill is still ticking. In the current sync call path this is harmless, but if `open()` is ever called without a subsequent `subscribe()` — or if async work is inserted between the two — the grace timer will fire and kill the PTY that the client was just told is ready. Clearing `idleGraceTimer` at the top of `open()` (alongside the early-return PTY-alive check) would make the invariant hold regardless of call order.

### Issue 2 of 2
packages/web/server/mux-websocket.ts:467-493
**Double-`unsub()` call re-arms a timer on an evicted terminal**

Once the grace timer fires it sets `terminal.idleGraceTimer = null` and removes the entry from `this.terminals`. If the captured `unsub` closure is then called a second time (defensive teardown, React strict-mode double-invoke, etc.), the conditions `subscribers.size === 0 && idleGraceTimer === null` are both true and a new 30 s timer is scheduled on the dead `terminal` object. The timer fires harmlessly (PTY is already null, `delete` on the map is a no-op), but it creates an unnecessary lingering reference. A simple guard like checking `!this.terminals.has(key)` before scheduling would make `unsub` idempotent after eviction.

Reviews (1): Last reviewed commit: "fix(web): defer idle PTY kill to curb ma..." | Re-trigger Greptile

Comment thread packages/web/server/mux-websocket.ts
Address PR ComposioHQ#1719 review (greptile P2): once the grace timer fires it
clears idleGraceTimer and evicts the terminal from this.terminals. A
double-unsub() (React strict-mode double-invoke, defensive teardown)
would then satisfy `subscribers.size === 0 && idleGraceTimer === null`
and arm a fresh 30 s timer holding a closure to the dead terminal.

Add a `this.terminals.has(key)` guard so unsub is a no-op once the
entry has been evicted. New regression test verifies a second unsub
after grace expiry doesn't fire a second kill or schedule a new timer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PTY slot exhaustion on macOS during extended dashboard uptime

1 participant