fix: process group kill + session suspend/resume via session/load #310
Fixes openabdev#309: session pool leaks memory due to orphaned grandchild processes and no session resume capability.
Changes:
- Replace kill_on_drop with process groups (setpgid + kill(-pgid)) so the entire process tree is killed on session cleanup
- 3-stage graceful shutdown: stdin close → SIGTERM → SIGKILL
- Store agentCapabilities.loadSession from initialize response
- Add session/load method for resuming suspended sessions
- Suspend sessions on eviction (save sessionId) instead of discarding
- Resume via session/load on reconnect, fallback to session/new
- LRU eviction when pool is full (evict oldest idle session)
- Lower default session_ttl_hours from 24 to 4
Memory impact on 3.6 GB host:
Before: 10 x 300 MB = 3 GB (idle sessions kept alive + orphaned grandchildren)
After: 1-2 x 300 MB = 300-600 MB (idle sessions suspended, reloaded on demand)
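The process group change listed above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: it uses std's CommandExt::process_group(0) where the PR describes calling setpgid directly, it declares kill(2) by hand instead of pulling in a crate, it assumes the Linux/macOS signal numbers and a toolchain that accepts `unsafe extern` blocks, and the names to_pgid, spawn_in_own_group and signal_group are hypothetical.

```rust
use std::convert::TryFrom;
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command, Stdio};

// Declared by hand so the sketch needs no external crate (std already links libc on Unix).
unsafe extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const SIGTERM: i32 = 15; // value on Linux and macOS (assumption for other platforms)
const SIGKILL: i32 = 9;

/// Convert a child PID into a process group ID that is safe to negate.
/// Rejects 0 (kill(0, ..) would signal the caller's own group) and values above i32::MAX.
fn to_pgid(pid: u32) -> Option<i32> {
    i32::try_from(pid).ok().filter(|&p| p > 0)
}

/// Spawn `program` as the leader of a new process group (pgid == child pid).
fn spawn_in_own_group(program: &str, args: &[&str]) -> io::Result<(Child, Option<i32>)> {
    let child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .process_group(0) // 0 = new group whose ID equals the child's PID
        .spawn()?;
    let pgid = to_pgid(child.id());
    Ok((child, pgid))
}

/// Send `sig` to every process in the group; a negative PID addresses the whole group.
fn signal_group(pgid: i32, sig: i32) -> io::Result<()> {
    if pgid <= 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "pgid must be positive"));
    }
    // SAFETY: kill(2) takes two plain integers and has no memory-safety preconditions.
    if unsafe { kill(-pgid, sig) } == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```

A staged shutdown along these lines would call signal_group(pgid, SIGTERM), wait for a grace period, and then call signal_group(pgid, SIGKILL) if the group is still alive.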
OrbStack Validation Results ✅
Tested PR #310 on OrbStack k8s (v1.33.5) with
Process Group Kill
Simulated
LRU Eviction + Session Resume
Created 4 Discord threads with
Key log lines:
Zero Orphaned Processes
After eviction, only 3 active session trees (no orphaned
RSS with 3 Active Sessions
With old behavior (10 sessions, no eviction): ~3.4 GB → OOM on 3.6 GB host.
Build & Tests
Minor Nit (non-blocking)
Verdict
Ship it. 🚀
The drop(self.stdin.clone()) only drops a cloned Arc, not the actual ChildStdin. SIGTERM on the next line handles shutdown. Removed the misleading comment and simplified to 2-stage: SIGTERM → SIGKILL.
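The point about the cloned Arc can be shown in isolation. The sketch below is illustrative only: FakeStdin is a hypothetical stand-in for ChildStdin that records when it is really dropped, and demo is a made-up helper name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for ChildStdin: sets a flag when the value is really dropped.
struct FakeStdin {
    closed: Arc<AtomicBool>,
}

impl Drop for FakeStdin {
    fn drop(&mut self) {
        self.closed.store(true, Ordering::SeqCst);
    }
}

/// Returns (closed after dropping a clone, closed after dropping the last handle).
fn demo() -> (bool, bool) {
    let closed = Arc::new(AtomicBool::new(false));
    let stdin = Arc::new(FakeStdin { closed: Arc::clone(&closed) });

    // Dropping a clone only decrements the reference count; the inner value survives.
    drop(Arc::clone(&stdin));
    let after_clone_drop = closed.load(Ordering::SeqCst);

    // Dropping the last handle runs FakeStdin::drop; a real pipe would close here.
    drop(stdin);
    let after_last_drop = closed.load(Ordering::SeqCst);

    (after_clone_drop, after_last_drop)
}
```

Calling demo() yields (false, true): the clone drop leaves the handle open, and only releasing the last Arc closes it.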
Good catch on the |
…iability
Addresses triage review on openabdev#310:
🔴 SUGGESTED CHANGES:
- Merge connections + suspended into single PoolState struct under one RwLock to eliminate nested lock acquisition and deadlock risk
- suspend_entry() is now a plain fn operating on &mut PoolState (no async, no separate lock)
- cleanup_idle() collects stale keys and suspends under one lock hold
- child_pid changed to child_pgid: Option<i32> using i32::try_from() to prevent kill(0, SIGTERM) on PID 0 and overflow on PID > i32::MAX
🟡 NITS:
- setpgid return value now checked: returns Err on failure so spawn fails instead of silently creating a process without its own group
- SIGKILL escalation uses std::thread::spawn instead of tokio::spawn so it fires even during runtime shutdown or panic unwinding
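The single-lock layout described in this commit might look roughly like the following. It is a sketch under assumptions: it uses std::sync::RwLock so it runs without an async runtime (the PR's own lock type is not shown here), the Entry fields are hypothetical, and cleanup_idle takes the current time as a parameter purely to make the behaviour easy to check.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Hypothetical pool entry; the real one would also own the agent connection.
struct Entry {
    session_id: String,
    last_active: Instant,
}

/// Live connections and suspended session IDs live behind one lock.
#[derive(Default)]
struct PoolState {
    connections: HashMap<String, Entry>,
    suspended: HashMap<String, String>, // thread_key -> sessionId
}

/// Plain function: the caller already holds the write guard, so no second lock is taken.
fn suspend_entry(state: &mut PoolState, thread_key: &str) -> bool {
    match state.connections.remove(thread_key) {
        Some(entry) => {
            // The real pool would tear down the process group when the entry goes away.
            state.suspended.insert(thread_key.to_string(), entry.session_id);
            true
        }
        None => false,
    }
}

/// Collect stale keys and suspend them under a single write-lock hold.
fn cleanup_idle(pool: &RwLock<PoolState>, ttl: Duration, now: Instant) -> usize {
    let mut state = pool.write().unwrap();
    let stale: Vec<String> = state
        .connections
        .iter()
        .filter(|(_, e)| now.duration_since(e.last_active) >= ttl)
        .map(|(k, _)| k.clone())
        .collect();
    for key in &stale {
        suspend_entry(&mut state, key);
    }
    stale.len()
}
```

Because suspend_entry never locks anything itself, there is only one acquisition order to reason about, which is the property the review asked for.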
Addressed all review feedback in 1866a11: 🔴 SUGGESTED CHANGES — all fixed:
🟡 NITS — all fixed:
Not in this PR (tracked separately):
…rocess-groups-and-resume fix: process group kill + session suspend/resume via session/load
What problem does this solve?
When running openab with kiro-cli on a constrained host (3.6 GB RAM on Zeabur), the session pool fills up with idle agent processes that are never properly reclaimed. Each kiro-cli acp spawns a grandchild kiro-cli-chat acp (~300 MB). kill_on_drop only kills the direct child, so the grandchild gets orphaned. 10 sessions × 300 MB = 3 GB leaked → OOM.
Closes #309
At a Glance
Prior Art & Industry Research
OpenClaw (acpx):
acpx uses self-terminating queue-owner processes with a 3-stage graceful shutdown in AcpClient.terminateAgentProcess(): stdin.end() → SIGTERM (1.5s) → SIGKILL (1s) → detach all handles. Each queue-owner exits on idle TTL, taking its agent process with it. Session resume is handled in reconnect.ts: it checks supportsLoadSession(), tries session/load, and falls back to session/new. This PR adopts the 3-stage shutdown pattern and the session/load fallback logic.
Hermes Agent (NousResearch/hermes-agent):
Hermes Agent runs in-process (no child process spawning) with a thread-safe SessionManager backed by SQLite (~/.hermes/state.db). Sessions are persisted via _persist() and restored via _restore(), which recreates the AIAgent instance from stored conversation history. There is no orphan problem since everything is in-process, but the persist/restore pattern validates our suspend/resume approach.
Other references:
Map task.dispose() + Map delete kill_on_drop (broken for grandchildren) docker rm -f kills everything
Survey source: Picrew/awesome-agent-harness
Proposed Solution
1. Process group kill (connection.rs)
Replace kill_on_drop(true) with setpgid(0, 0) at spawn to create a process group per session. On Drop, kill the entire group via kill(-pgid, SIGTERM) with a 3-stage escalation (inspired by acpx).
2. Session suspend/resume (pool.rs)
Add a suspended: HashMap<thread_key, sessionId> map. When a session is evicted (TTL, LRU, or stale), save its sessionId before killing. On reconnect, try session/load if the agent supports it (checked via agentCapabilities.loadSession from initialize), and fall back to session/new.
3. LRU eviction (pool.rs)
When the pool is full, evict the oldest idle session (by last_active) instead of rejecting with "pool exhausted".
4. Lower default TTL (config.rs)
session_ttl_hours: 24 → 4. Safe because suspended sessions can be reloaded on demand.
Why this approach?
kiro-cli-chat acp (the agent is the direct child). openab uses kiro-cli acp, which forks kiro-cli-chat as a grandchild, so acpx's approach would leave the grandchild orphaned in our case. Process groups (setpgid + kill(-pgid)) handle any depth of process tree, making our solution more robust for openab's specific spawn chain.
agentCapabilities.loadSession: true). The fallback logic mirrors acpx's reconnect.ts: check capability → try load → fall back to new. This makes aggressive TTL safe: we can kill idle processes and reload them without losing conversation history.
~/.hermes/state.db). Suspended sessionIds are lost on openab restart, which is acceptable: the agent starts a fresh session, same as today. If persistence becomes needed, we can add it later without changing the pool interface.
Alternatives Considered
prctl(PR_SET_PDEATHSIG)
Validation
cargo check passes
cargo test passes (including new tests)
loadSession: true via ACP initialize
kiro-cli-chat processes after session eviction, memory stays under 1 GB with 10 threads
Full investigation thread: #309
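For reference, the resume path and the LRU choice from the Proposed Solution (try session/load when the agent advertises loadSession, otherwise session/new; evict the entry with the oldest last_active) can be sketched as below. Everything here is hypothetical scaffolding rather than the PR's code: the Agent trait, MockAgent, the Started enum, and the use of plain u64 timestamps are assumptions made so the example is self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the ACP agent connection.
trait Agent {
    fn supports_load_session(&self) -> bool;
    fn load_session(&mut self, session_id: &str) -> Result<(), String>;
    fn new_session(&mut self) -> String;
}

#[derive(Debug, PartialEq)]
enum Started {
    Resumed(String),
    Fresh(String),
}

/// Try session/load for a suspended ID when supported; otherwise, or on failure, use session/new.
fn resume_or_new<A: Agent>(
    agent: &mut A,
    suspended: &mut HashMap<String, String>,
    thread_key: &str,
) -> Started {
    if let Some(id) = suspended.remove(thread_key) {
        if agent.supports_load_session() && agent.load_session(&id).is_ok() {
            return Started::Resumed(id);
        }
    }
    Started::Fresh(agent.new_session())
}

/// Pick the eviction victim among idle entries: the key with the smallest last_active.
fn lru_victim(last_active: &HashMap<String, u64>) -> Option<String> {
    last_active
        .iter()
        .min_by_key(|(_, t)| **t)
        .map(|(k, _)| k.clone())
}

/// Hypothetical agent used only to exercise the sketch.
struct MockAgent {
    can_load: bool,
    known: Vec<String>,
    created: u32,
}

impl Agent for MockAgent {
    fn supports_load_session(&self) -> bool {
        self.can_load
    }
    fn load_session(&mut self, session_id: &str) -> Result<(), String> {
        if self.known.iter().any(|s| s == session_id) {
            Ok(())
        } else {
            Err(format!("unknown session {session_id}"))
        }
    }
    fn new_session(&mut self) -> String {
        self.created += 1;
        format!("new-{}", self.created)
    }
}
```

In this sketch the suspended ID is removed before the load attempt, so a failed session/load is not retried on the next reconnect; whether the pool should behave that way is a design choice this example does not settle.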