Skip to content

fix: process group kill + session suspend/resume via session/load#310

Merged
thepagent merged 3 commits into
openabdev:mainfrom
chaodu-agent:fix/session-pool-process-groups-and-resume
Apr 13, 2026
Merged

fix: process group kill + session suspend/resume via session/load#310
thepagent merged 3 commits into
openabdev:mainfrom
chaodu-agent:fix/session-pool-process-groups-and-resume

Conversation

@chaodu-agent
Copy link
Copy Markdown
Collaborator

@chaodu-agent chaodu-agent commented Apr 13, 2026

What problem does this solve?

When running openab with kiro-cli on a constrained host (3.6 GB RAM on Zeabur), the session pool fills up with idle agent processes that are never properly reclaimed. Each kiro-cli acp spawns a grandchild kiro-cli-chat acp (~300 MB). kill_on_drop only kills the direct child — the grandchild gets orphaned. 10 sessions × 300 MB = 3 GB leaked → OOM.

Closes #309

At a Glance

  BEFORE:                                    AFTER:

  openab                                     openab
    │                                          │
    ├─ kill_on_drop ──► kiro-cli (dead)        ├─ kill(-pgid) ──► ┌─ kiro-cli    ┐ dead ✅
    │                   kiro-cli-chat ☠️       │                  └─ kiro-cli-chat┘ dead ✅
    │                   (300 MB orphan)         │
    │                                          │
    ├─ pool: 10 × 300 MB = 3 GB               ├─ pool: 1-2 active × 300 MB
    │  (all kept alive until 24h TTL)          │  suspended: { thread → sessionId }
    │                                          │
    └─ pool full → REJECT                      ├─ idle → suspend (save sessionId, kill process)
                                               ├─ user returns → spawn + session/load → resume ✅
                                               └─ pool full → LRU evict oldest → suspend it

Prior Art & Industry Research

OpenClaw (acpx):

acpx uses self-terminating queue-owner processes with a 3-stage graceful shutdown in AcpClient.terminateAgentProcess(): stdin.end() → SIGTERM (1.5s) → SIGKILL (1s) → detach all handles. Each queue-owner exits on idle TTL, taking its agent process with it. Session resume is handled in reconnect.ts — it checks supportsLoadSession(), tries session/load, falls back to session/new. This PR adopts the 3-stage shutdown pattern and the session/load fallback logic.

Hermes Agent (NousResearch/hermes-agent):

Hermes Agent runs in-process (no child process spawning) with a thread-safe SessionManager backed by SQLite (~/.hermes/state.db). Sessions are persisted via _persist() and restored via _restore() — recreating the AIAgent instance from stored conversation history. No orphan problem since everything is in-process, but the persist/restore pattern validates our suspend/resume approach.

Other references:

Harness Isolation Orphan Risk Cleanup Strategy
Gemini CLI a2a In-process Map None ✅ task.dispose() + Map delete
openab (before) Process HIGH ☠️ kill_on_drop (broken for grandchildren)
acpx (OpenClaw) Process Low ✅ 3-stage shutdown + self-terminating TTL
Hermes Agent In-process None ✅ SQLite persist + in-memory restore
Scion (Google) Container None ✅ docker rm -f kills everything

Survey source: Picrew/awesome-agent-harness

Proposed Solution

1. Process group kill (connection.rs)

Replace kill_on_drop(true) with setpgid(0, 0) at spawn to create a process group per session. On Drop, kill the entire group via kill(-pgid, SIGTERM) with a 3-stage escalation (inspired by acpx):

  AcpConnection::drop()
       │
       ▼
  1. stdin.close()          ← graceful signal
  2. kill(-pgid, SIGTERM)   ← kill process group
  3. kill(-pgid, SIGKILL)   ← force kill after 1.5s

2. Session suspend/resume (pool.rs)

Add a suspended: HashMap<thread_key, sessionId> map. When a session is evicted (TTL, LRU, or stale), save its sessionId before killing. On reconnect, try session/load if the agent supports it (checked via agentCapabilities.loadSession from initialize), fall back to session/new.

3. LRU eviction (pool.rs)

When pool is full, evict the oldest idle session (by last_active) instead of rejecting with "pool exhausted".

4. Lower default TTL (config.rs)

session_ttl_hours: 24 → 4. Safe because suspended sessions can be reloaded on demand.

Why this approach?

  • Process groups are the correct OS primitive for killing process trees — and something neither acpx nor Hermes Agent needs. acpx's 3-stage shutdown sends SIGTERM/SIGKILL to the direct child PID, which works for them because their default kiro command is kiro-cli-chat acp (the agent is the direct child). openab uses kiro-cli acp, which forks kiro-cli-chat as a grandchild — acpx's approach would leave the grandchild orphaned in our case. Process groups (setpgid + kill(-pgid)) handle any depth of process tree, making our solution more robust for openab's specific spawn chain.
  acpx spawn chain:     acpx → kiro-cli-chat (direct child)
                         SIGTERM(child) works ✅

  openab spawn chain:   openab → kiro-cli → kiro-cli-chat (grandchild)
                         SIGTERM(child) misses grandchild ☠️
                         kill(-pgid) kills entire group ✅
  • session/load was validated by testing kiro-cli v1.29.5 (agentCapabilities.loadSession: true). The fallback logic mirrors acpx's reconnect.ts: check capability → try load → fall back to new. This makes aggressive TTL safe — we can kill idle processes and reload them without losing conversation history.
  • In-memory suspended map is simpler than Hermes Agent's SQLite persistence (~/.hermes/state.db). Suspended sessionIds are lost on openab restart, which is acceptable — the agent starts a fresh session, same as today. If persistence becomes needed, we can add it later without changing the pool interface.

Alternatives Considered

Alternative Why not
prctl(PR_SET_PDEATHSIG) Only one level deep — doesn't reach grandchildren
cgroups Requires elevated permissions, overkill for this problem
PID files + periodic pkill Fragile, race-prone, doesn't solve root cause
Single multiplexed kiro-cli process Requires kiro-cli changes, out of scope
Container-per-session (Scion-style) Major architecture shift, future consideration
SQLite persistence for suspended sessions (Hermes-style) Added complexity; in-memory map is sufficient since losing suspended IDs on restart is acceptable

Validation

  • cargo check passes
  • cargo test passes (including new tests)
  • Manual testing — confirmed kiro-cli v1.29.5 advertises loadSession: true via ACP initialize
  • Deploy to OrbStack and verify: no orphaned kiro-cli-chat processes after session eviction, memory stays under 1 GB with 10 threads

Full investigation thread: #309

Fixes openabdev#309 — session pool leaks memory due to orphaned grandchild
processes and no session resume capability.

Changes:
- Replace kill_on_drop with process groups (setpgid + kill(-pgid))
  so the entire process tree is killed on session cleanup
- 3-stage graceful shutdown: stdin close → SIGTERM → SIGKILL
- Store agentCapabilities.loadSession from initialize response
- Add session/load method for resuming suspended sessions
- Suspend sessions on eviction (save sessionId) instead of discarding
- Resume via session/load on reconnect, fallback to session/new
- LRU eviction when pool is full (evict oldest idle session)
- Lower default session_ttl_hours from 24 to 4

Memory impact on 3.6 GB host:
  Before: 10 x 300 MB = 3 GB (idle sessions kept alive + orphaned grandchildren)
  After:  1-2 x 300 MB = 300-600 MB (idle sessions suspended, reloaded on demand)
@thepagent
Copy link
Copy Markdown
Collaborator

OrbStack Validation Results ✅

Tested PR #310 on OrbStack k8s (v1.33.5) with openab:pr310 image, max_sessions=3, TTL=1h.

Process Group Kill

Simulated kiro-cli → kiro-cli-chat spawn chain inside the container:

Scenario Result
PR #310: kill(-pgid, SIGTERM) All processes killed (parent + grandchild). No orphans. ✅
Old: kill(pid) only Grandchild orphaned, reparented to PID 1. ❌

LRU Eviction + Session Resume

Created 4 Discord threads with max_sessions=3:

Thread 1 → session 790ef982 (created)
Thread 2 → session a12116a9 (created)
Thread 3 → session 41794345 (created)
Thread 4 → pool full → suspended thread 1 (session 790ef982) → session 4e4b506e (created)
Follow-up in Thread 1 → suspended thread 2 → session/load 790ef982 → resumed ✅

Key log lines:

pool full, suspending oldest idle session evicted=1493378134169620601
suspending session thread_id="1493378134169620601" session_id=790ef982-bc6a-41b4-987c-915878a1b431
session loaded session_id="790ef982-bc6a-41b4-987c-915878a1b431"
session resumed via session/load thread_id="1493378134169620601"

Zero Orphaned Processes

After eviction, only 3 active session trees (no orphaned kiro-cli-chat):

PID=1   openab
PID=172 kiro-cli acp --trust-all-tools
PID=179 kiro-cli-chat acp --trust-all-tools
PID=228 kiro-cli acp --trust-all-tools
PID=236 kiro-cli-chat acp --trust-all-tools
PID=346 kiro-cli acp --trust-all-tools
PID=354 kiro-cli-chat acp --trust-all-tools

RSS with 3 Active Sessions

Process RSS
openab 13 MB
kiro-cli × 3 ~48 MB each = 144 MB
kiro-cli-chat × 3 ~340 MB each = 1,064 MB
Total ~1.2 GB

With old behavior (10 sessions, no eviction): ~3.4 GB → OOM on 3.6 GB host.
With PR #310 (3 active + suspend/resume): ~1.2 GB, unlimited threads served. ✅

Build & Tests

  • cargo check
  • cargo test — 34/34 passed ✅

Minor Nit (non-blocking)

drop(self.stdin.clone()) in kill_process_group() is a no-op — it clones the Arc and drops the clone. Stdin only closes when self is fully dropped after the method returns. Harmless since SIGTERM fires on the next line, but the comment is misleading.

Verdict

Ship it. 🚀

The drop(self.stdin.clone()) only drops a cloned Arc, not the actual
ChildStdin. SIGTERM on the next line handles shutdown. Removed the
misleading comment and simplified to 2-stage: SIGTERM → SIGKILL.
@chaodu-agent
Copy link
Copy Markdown
Collaborator Author

Good catch on the drop(self.stdin.clone()) no-op — fixed in cd26be9. Removed the misleading stdin close comment and simplified to 2-stage: SIGTERM → SIGKILL. The ChildStdin still gets properly closed when self is fully dropped after kill_process_group() returns from the Drop impl.

…iability

Addresses triage review on openabdev#310:

🔴 SUGGESTED CHANGES:
- Merge connections + suspended into single PoolState struct under one
  RwLock to eliminate nested lock acquisition and deadlock risk
- suspend_entry() is now a plain fn operating on &mut PoolState (no async,
  no separate lock)
- cleanup_idle() collects stale keys and suspends under one lock hold
- child_pid changed to child_pgid: Option<i32> using i32::try_from()
  to prevent kill(0, SIGTERM) on PID 0 and overflow on PID > i32::MAX

🟡 NITS:
- setpgid return value now checked — returns Err on failure so spawn
  fails instead of silently creating a process without its own group
- SIGKILL escalation uses std::thread::spawn instead of tokio::spawn
  so it fires even during runtime shutdown or panic unwinding
@chaodu-agent
Copy link
Copy Markdown
Collaborator Author

Addressed all review feedback in 1866a11:

🔴 SUGGESTED CHANGES — all fixed:

Issue Fix
Nested RwLock deadlock risk Merged connections + suspended into single PoolState struct under one RwLock. suspend_entry() is now a plain fn(&mut PoolState) — no async, no separate lock.
child_pid = 0kill(0, SIGTERM) kills caller Changed to child_pgid: Option<i32> using i32::try_from().ok(). None if PID missing or > i32::MAX. kill_process_group() returns early on None.
cleanup_idle holding two write locks Single lock now — collect stale keys + suspend all under one &mut PoolState.

🟡 NITS — all fixed:

Issue Fix
setpgid return value ignored Now returns Err(last_os_error()) on failure → spawn fails instead of silently running without a process group.
tokio::spawn for SIGKILL may not fire during shutdown Changed to std::thread::spawn + std::thread::sleep — survives runtime shutdown and panic unwinding.

Not in this PR (tracked separately):

  • thread_update handler for Discord thread archive → PR B follow-up
  • Version bump alignment → will coordinate with maintainers

@thepagent thepagent merged commit fe160ff into openabdev:main Apr 13, 2026
1 check passed
Reese-max pushed a commit to Reese-max/openab that referenced this pull request Apr 14, 2026
…iability

Addresses triage review on openabdev#310:

🔴 SUGGESTED CHANGES:
- Merge connections + suspended into single PoolState struct under one
  RwLock to eliminate nested lock acquisition and deadlock risk
- suspend_entry() is now a plain fn operating on &mut PoolState (no async,
  no separate lock)
- cleanup_idle() collects stale keys and suspends under one lock hold
- child_pid changed to child_pgid: Option<i32> using i32::try_from()
  to prevent kill(0, SIGTERM) on PID 0 and overflow on PID > i32::MAX

🟡 NITS:
- setpgid return value now checked — returns Err on failure so spawn
  fails instead of silently creating a process without its own group
- SIGKILL escalation uses std::thread::spawn instead of tokio::spawn
  so it fires even during runtime shutdown or panic unwinding
Reese-max pushed a commit to Reese-max/openab that referenced this pull request Apr 14, 2026
…rocess-groups-and-resume

fix: process group kill + session suspend/resume via session/load
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session pool leaks memory: orphaned kiro-cli processes and no eviction

2 participants