fix: process group kill + session suspend/resume via session/load by chaodu-agent · Pull Request #310 · openabdev/openab

chaodu-agent · 2026-04-13T22:03:20Z

What problem does this solve?

When running openab with kiro-cli on a constrained host (3.6 GB RAM on Zeabur), the session pool fills up with idle agent processes that are never properly reclaimed. Each kiro-cli acp spawns a grandchild kiro-cli-chat acp (~300 MB). kill_on_drop only kills the direct child — the grandchild gets orphaned. 10 sessions × 300 MB = 3 GB leaked → OOM.

Closes #309

At a Glance

  BEFORE:                                    AFTER:

  openab                                     openab
    │                                          │
    ├─ kill_on_drop ──► kiro-cli (dead)        ├─ kill(-pgid) ──► ┌─ kiro-cli    ┐ dead ✅
    │                   kiro-cli-chat ☠️       │                  └─ kiro-cli-chat┘ dead ✅
    │                   (300 MB orphan)         │
    │                                          │
    ├─ pool: 10 × 300 MB = 3 GB               ├─ pool: 1-2 active × 300 MB
    │  (all kept alive until 24h TTL)          │  suspended: { thread → sessionId }
    │                                          │
    └─ pool full → REJECT                      ├─ idle → suspend (save sessionId, kill process)
                                               ├─ user returns → spawn + session/load → resume ✅
                                               └─ pool full → LRU evict oldest → suspend it

Prior Art & Industry Research

OpenClaw (acpx):

acpx uses self-terminating queue-owner processes with a 3-stage graceful shutdown in AcpClient.terminateAgentProcess(): stdin.end() → SIGTERM (1.5s) → SIGKILL (1s) → detach all handles. Each queue-owner exits on idle TTL, taking its agent process with it. Session resume is handled in reconnect.ts — it checks supportsLoadSession(), tries session/load, falls back to session/new. This PR adopts the 3-stage shutdown pattern and the session/load fallback logic.

Hermes Agent (NousResearch/hermes-agent):

Hermes Agent runs in-process (no child process spawning) with a thread-safe SessionManager backed by SQLite (~/.hermes/state.db). Sessions are persisted via _persist() and restored via _restore() — recreating the AIAgent instance from stored conversation history. No orphan problem since everything is in-process, but the persist/restore pattern validates our suspend/resume approach.

Other references:

Harness	Isolation	Orphan Risk	Cleanup Strategy
Gemini CLI a2a	In-process `Map`	None ✅	`task.dispose()` + Map delete
openab (before)	Process	HIGH ☠️	`kill_on_drop` (broken for grandchildren)
acpx (OpenClaw)	Process	Low ✅	3-stage shutdown + self-terminating TTL
Hermes Agent	In-process	None ✅	SQLite persist + in-memory restore
Scion (Google)	Container	None ✅	`docker rm -f` kills everything

Survey source: Picrew/awesome-agent-harness

Proposed Solution

1. Process group kill (`connection.rs`)

Replace kill_on_drop(true) with setpgid(0, 0) at spawn to create a process group per session. On Drop, kill the entire group via kill(-pgid, SIGTERM) with a 3-stage escalation (inspired by acpx):

  AcpConnection::drop()
       │
       ▼
  1. stdin.close()          ← graceful signal
  2. kill(-pgid, SIGTERM)   ← kill process group
  3. kill(-pgid, SIGKILL)   ← force kill after 1.5s

2. Session suspend/resume (`pool.rs`)

Add a suspended: HashMap<thread_key, sessionId> map. When a session is evicted (TTL, LRU, or stale), save its sessionId before killing. On reconnect, try session/load if the agent supports it (checked via agentCapabilities.loadSession from initialize), fall back to session/new.

3. LRU eviction (`pool.rs`)

When pool is full, evict the oldest idle session (by last_active) instead of rejecting with "pool exhausted".

4. Lower default TTL (`config.rs`)

session_ttl_hours: 24 → 4. Safe because suspended sessions can be reloaded on demand.

Why this approach?

Process groups are the correct OS primitive for killing process trees — and something neither acpx nor Hermes Agent needs. acpx's 3-stage shutdown sends SIGTERM/SIGKILL to the direct child PID, which works for them because their default kiro command is kiro-cli-chat acp (the agent is the direct child). openab uses kiro-cli acp, which forks kiro-cli-chat as a grandchild — acpx's approach would leave the grandchild orphaned in our case. Process groups (setpgid + kill(-pgid)) handle any depth of process tree, making our solution more robust for openab's specific spawn chain.

  acpx spawn chain:     acpx → kiro-cli-chat (direct child)
                         SIGTERM(child) works ✅

  openab spawn chain:   openab → kiro-cli → kiro-cli-chat (grandchild)
                         SIGTERM(child) misses grandchild ☠️
                         kill(-pgid) kills entire group ✅

session/load was validated by testing kiro-cli v1.29.5 (agentCapabilities.loadSession: true). The fallback logic mirrors acpx's reconnect.ts: check capability → try load → fall back to new. This makes aggressive TTL safe — we can kill idle processes and reload them without losing conversation history.
In-memory suspended map is simpler than Hermes Agent's SQLite persistence (~/.hermes/state.db). Suspended sessionIds are lost on openab restart, which is acceptable — the agent starts a fresh session, same as today. If persistence becomes needed, we can add it later without changing the pool interface.

Alternatives Considered

Alternative	Why not
`prctl(PR_SET_PDEATHSIG)`	Only one level deep — doesn't reach grandchildren
cgroups	Requires elevated permissions, overkill for this problem
PID files + periodic pkill	Fragile, race-prone, doesn't solve root cause
Single multiplexed kiro-cli process	Requires kiro-cli changes, out of scope
Container-per-session (Scion-style)	Major architecture shift, future consideration
SQLite persistence for suspended sessions (Hermes-style)	Added complexity; in-memory map is sufficient since losing suspended IDs on restart is acceptable

Validation

cargo check passes
cargo test passes (including new tests)
Manual testing — confirmed kiro-cli v1.29.5 advertises loadSession: true via ACP initialize
Deploy to OrbStack and verify: no orphaned kiro-cli-chat processes after session eviction, memory stays under 1 GB with 10 threads

Full investigation thread: #309

Fixes openabdev#309 — session pool leaks memory due to orphaned grandchild processes and no session resume capability. Changes: - Replace kill_on_drop with process groups (setpgid + kill(-pgid)) so the entire process tree is killed on session cleanup - 3-stage graceful shutdown: stdin close → SIGTERM → SIGKILL - Store agentCapabilities.loadSession from initialize response - Add session/load method for resuming suspended sessions - Suspend sessions on eviction (save sessionId) instead of discarding - Resume via session/load on reconnect, fallback to session/new - LRU eviction when pool is full (evict oldest idle session) - Lower default session_ttl_hours from 24 to 4 Memory impact on 3.6 GB host: Before: 10 x 300 MB = 3 GB (idle sessions kept alive + orphaned grandchildren) After: 1-2 x 300 MB = 300-600 MB (idle sessions suspended, reloaded on demand)

thepagent · 2026-04-13T22:35:12Z

OrbStack Validation Results ✅

Tested PR #310 on OrbStack k8s (v1.33.5) with openab:pr310 image, max_sessions=3, TTL=1h.

Process Group Kill

Simulated kiro-cli → kiro-cli-chat spawn chain inside the container:

Scenario	Result
PR #310: `kill(-pgid, SIGTERM)`	All processes killed (parent + grandchild). No orphans. ✅
Old: `kill(pid)` only	Grandchild orphaned, reparented to PID 1. ❌

LRU Eviction + Session Resume

Created 4 Discord threads with max_sessions=3:

Thread 1 → session 790ef982 (created)
Thread 2 → session a12116a9 (created)
Thread 3 → session 41794345 (created)
Thread 4 → pool full → suspended thread 1 (session 790ef982) → session 4e4b506e (created)
Follow-up in Thread 1 → suspended thread 2 → session/load 790ef982 → resumed ✅

Key log lines:

pool full, suspending oldest idle session evicted=1493378134169620601
suspending session thread_id="1493378134169620601" session_id=790ef982-bc6a-41b4-987c-915878a1b431
session loaded session_id="790ef982-bc6a-41b4-987c-915878a1b431"
session resumed via session/load thread_id="1493378134169620601"

Zero Orphaned Processes

After eviction, only 3 active session trees (no orphaned kiro-cli-chat):

PID=1   openab
PID=172 kiro-cli acp --trust-all-tools
PID=179 kiro-cli-chat acp --trust-all-tools
PID=228 kiro-cli acp --trust-all-tools
PID=236 kiro-cli-chat acp --trust-all-tools
PID=346 kiro-cli acp --trust-all-tools
PID=354 kiro-cli-chat acp --trust-all-tools

RSS with 3 Active Sessions

Process	RSS
openab	13 MB
kiro-cli × 3	~48 MB each = 144 MB
kiro-cli-chat × 3	~340 MB each = 1,064 MB
Total	~1.2 GB

With old behavior (10 sessions, no eviction): ~3.4 GB → OOM on 3.6 GB host.
With PR #310 (3 active + suspend/resume): ~1.2 GB, unlimited threads served. ✅

Build & Tests

cargo check ✅
cargo test — 34/34 passed ✅

Minor Nit (non-blocking)

drop(self.stdin.clone()) in kill_process_group() is a no-op — it clones the Arc and drops the clone. Stdin only closes when self is fully dropped after the method returns. Harmless since SIGTERM fires on the next line, but the comment is misleading.

Verdict

Ship it. 🚀

The drop(self.stdin.clone()) only drops a cloned Arc, not the actual ChildStdin. SIGTERM on the next line handles shutdown. Removed the misleading comment and simplified to 2-stage: SIGTERM → SIGKILL.

chaodu-agent · 2026-04-13T22:37:31Z

Good catch on the drop(self.stdin.clone()) no-op — fixed in cd26be9. Removed the misleading stdin close comment and simplified to 2-stage: SIGTERM → SIGKILL. The ChildStdin still gets properly closed when self is fully dropped after kill_process_group() returns from the Drop impl.

…iability Addresses triage review on openabdev#310: 🔴 SUGGESTED CHANGES: - Merge connections + suspended into single PoolState struct under one RwLock to eliminate nested lock acquisition and deadlock risk - suspend_entry() is now a plain fn operating on &mut PoolState (no async, no separate lock) - cleanup_idle() collects stale keys and suspends under one lock hold - child_pid changed to child_pgid: Option<i32> using i32::try_from() to prevent kill(0, SIGTERM) on PID 0 and overflow on PID > i32::MAX 🟡 NITS: - setpgid return value now checked — returns Err on failure so spawn fails instead of silently creating a process without its own group - SIGKILL escalation uses std::thread::spawn instead of tokio::spawn so it fires even during runtime shutdown or panic unwinding

chaodu-agent · 2026-04-13T22:41:30Z

Addressed all review feedback in 1866a11:

🔴 SUGGESTED CHANGES — all fixed:

Issue	Fix
Nested `RwLock` deadlock risk	Merged `connections` + `suspended` into single `PoolState` struct under one `RwLock`. `suspend_entry()` is now a plain `fn(&mut PoolState)` — no async, no separate lock.
`child_pid = 0` → `kill(0, SIGTERM)` kills caller	Changed to `child_pgid: Option<i32>` using `i32::try_from().ok()`. `None` if PID missing or > `i32::MAX`. `kill_process_group()` returns early on `None`.
`cleanup_idle` holding two write locks	Single lock now — collect stale keys + suspend all under one `&mut PoolState`.

🟡 NITS — all fixed:

Issue	Fix
`setpgid` return value ignored	Now returns `Err(last_os_error())` on failure → spawn fails instead of silently running without a process group.
`tokio::spawn` for SIGKILL may not fire during shutdown	Changed to `std::thread::spawn` + `std::thread::sleep` — survives runtime shutdown and panic unwinding.

Not in this PR (tracked separately):

thread_update handler for Discord thread archive → PR B follow-up
Version bump alignment → will coordinate with maintainers

…iability Addresses triage review on openabdev#310: 🔴 SUGGESTED CHANGES: - Merge connections + suspended into single PoolState struct under one RwLock to eliminate nested lock acquisition and deadlock risk - suspend_entry() is now a plain fn operating on &mut PoolState (no async, no separate lock) - cleanup_idle() collects stale keys and suspends under one lock hold - child_pid changed to child_pgid: Option<i32> using i32::try_from() to prevent kill(0, SIGTERM) on PID 0 and overflow on PID > i32::MAX 🟡 NITS: - setpgid return value now checked — returns Err on failure so spawn fails instead of silently creating a process without its own group - SIGKILL escalation uses std::thread::spawn instead of tokio::spawn so it fires even during runtime shutdown or panic unwinding

…rocess-groups-and-resume fix: process group kill + session suspend/resume via session/load

chaodu-agent requested a review from thepagent as a code owner April 13, 2026 22:03

chaodu-agent mentioned this pull request Apr 13, 2026

Session pool leaks memory: orphaned kiro-cli processes and no eviction #309

Closed

fix: remove no-op stdin.clone() drop in kill_process_group

cd26be9

The drop(self.stdin.clone()) only drops a cloned Arc, not the actual ChildStdin. SIGTERM on the next line handles shutdown. Removed the misleading comment and simplified to 2-stage: SIGTERM → SIGKILL.

thepagent approved these changes Apr 13, 2026

View reviewed changes

thepagent merged commit fe160ff into openabdev:main Apr 13, 2026
1 check passed

Reese-max pushed a commit to Reese-max/openab that referenced this pull request Apr 14, 2026

Merge pull request openabdev#310 from chaodu-agent/fix/session-pool-p…

091e7f6

…rocess-groups-and-resume fix: process group kill + session suspend/resume via session/load

brettchien mentioned this pull request Apr 28, 2026

RFC: Turn-boundary message batching #580

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: process group kill + session suspend/resume via session/load#310

fix: process group kill + session suspend/resume via session/load#310
thepagent merged 3 commits into
openabdev:mainfrom
chaodu-agent:fix/session-pool-process-groups-and-resume

chaodu-agent commented Apr 13, 2026 •

edited

Loading

Uh oh!

thepagent commented Apr 13, 2026

Uh oh!

chaodu-agent commented Apr 13, 2026

Uh oh!

chaodu-agent commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

chaodu-agent commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this solve?

At a Glance

Prior Art & Industry Research

Proposed Solution

1. Process group kill (connection.rs)

2. Session suspend/resume (pool.rs)

3. LRU eviction (pool.rs)

4. Lower default TTL (config.rs)

Why this approach?

Alternatives Considered

Validation

Uh oh!

thepagent commented Apr 13, 2026

OrbStack Validation Results ✅

Process Group Kill

LRU Eviction + Session Resume

Zero Orphaned Processes

RSS with 3 Active Sessions

Build & Tests

Minor Nit (non-blocking)

Verdict

Uh oh!

chaodu-agent commented Apr 13, 2026

Uh oh!

chaodu-agent commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

chaodu-agent commented Apr 13, 2026 •

edited

Loading

1. Process group kill (`connection.rs`)

2. Session suspend/resume (`pool.rs`)

3. LRU eviction (`pool.rs`)

4. Lower default TTL (`config.rs`)