Session pool leaks memory: orphaned kiro-cli processes and no eviction #309

@chaodu-agent

Description

When running openab with kiro-cli on a constrained host (e.g. 3.6 GB RAM on Zeabur), the session pool fills up with idle sessions that are never reclaimed in time. Once max_sessions is reached, new requests are rejected and the host eventually OOMs.

Each kiro-cli acp process spawns a child kiro-cli-chat acp process (~230-390 MB RSS). When the pool drops a session, kill_on_drop kills only the direct child (kiro-cli acp), so the grandchild kiro-cli-chat process is orphaned and keeps consuming memory.

Observed on a live deployment, where 10 stale kiro-cli-chat acp processes hold about 2.8 GB of RSS in total:

PID Started RSS
459820 Apr12 290 MB
581161 Apr12 306 MB
625730 Apr12 300 MB
633382 Apr12 282 MB
673360 Apr12 388 MB
724688 00:43 273 MB
872305 08:48 274 MB
872764 08:50 236 MB
907784 10:39 227 MB
913618 11:00 230 MB

Four root causes identified:

  1. Orphaned grandchild processes: kill_on_drop(true) only SIGKILLs the direct child PID. The grandchild kiro-cli-chat survives and leaks memory. Fix: use process groups (setsid/setpgid) and kill the entire group on cleanup.

  2. No cleanup on Discord thread archive: EventHandler only implements message and ready. Archiving a thread leaves the session alive until the TTL expires. Fix: implement a thread_update handler.

  3. No LRU eviction: when the pool is full, get_or_create() rejects with "pool exhausted" instead of evicting the oldest idle session. Fix: evict the session with the oldest last_active when at capacity.

  4. Default TTL too long: session_ttl_hours defaults to 24. On a 3.6 GB host, 10 sessions × ~300 MB is about 3 GB of idle processes that can linger for up to 24 hours. Fix: lower the default or document the memory implications.
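
The process-group fix from root cause 1 can be sketched as follows. This is a minimal illustration using only the Rust standard library on Unix, not openab's actual code: `spawn_in_own_group` and `kill_group` are hypothetical helper names, and `kill(2)` is declared by hand here, where a real project would more likely take the binding from the `libc` or `nix` crate.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Child, Command, ExitStatus, Stdio};

// Hand-written binding to POSIX kill(2), to keep the sketch dependency-free.
extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const SIGKILL: i32 = 9;

/// Spawn `cmd` as the leader of a new process group, so that every
/// descendant it forks inherits the same process group id.
fn spawn_in_own_group(cmd: &mut Command) -> io::Result<Child> {
    cmd.process_group(0) // 0 means "use the child's own pid as the pgid"
        .stdin(Stdio::null())
        .spawn()
}

/// Send SIGKILL to the whole group led by `child`, then reap the leader.
fn kill_group(child: &mut Child) -> io::Result<ExitStatus> {
    let pgid = child.id() as i32;
    // A negative pid addresses the process group with id `pgid`.
    let rc = unsafe { kill(-pgid, SIGKILL) };
    if rc != 0 {
        return Err(io::Error::last_os_error());
    }
    child.wait()
}
```

Calling `kill_group` from the session's cleanup path (for example a `Drop` impl) would signal the grandchild together with the direct child, assuming the grandchild has not moved itself into a different process group or session.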

Industry Comparison

A survey of agent harnesses from Picrew/awesome-agent-harness shows openab sits in the riskiest position — process-level isolation without proper process group management:

| Harness | Isolation | Orphan Risk | Cleanup Strategy |
|---|---|---|---|
| Gemini CLI a2a | In-process Map | None ✅ | task.dispose() + Map delete |
| openab | Process | HIGH ☠️ | kill_on_drop (broken for grandchildren) |
| acpx | Process | Low ✅ | 3-stage shutdown (stdin.end() → SIGTERM → SIGKILL) + self-terminating TTL |
| Scion | Container | None ✅ | docker rm -f kills everything |
| Daytona / E2B | VM/microVM | None ✅ | Destroy sandbox API |

Key insight from acpx: they use a 3-stage graceful shutdown (stdin.end() → SIGTERM 1.5s → SIGKILL 1s → detach all handles) and self-terminating queue-owner processes that exit when idle. This eliminates both the orphan problem and the need for a central cleanup task.
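
A staged shutdown in the spirit of the acpx sequence might look like the sketch below. It is not acpx's implementation (which is not shown in this issue): `shutdown_gracefully` and `wait_with_timeout` are hypothetical names, the grace periods are left as parameters, and `kill(2)` is again declared by hand rather than taken from the `libc` crate.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hand-written binding to POSIX kill(2), to keep the sketch dependency-free.
extern "C" {
    fn kill(pid: i32, sig: i32) -> i32;
}

const SIGTERM: i32 = 15;

/// Poll `child` until it exits or `grace` elapses.
fn wait_with_timeout(child: &mut Child, grace: Duration) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + grace;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            return Ok(None);
        }
        sleep(Duration::from_millis(20));
    }
}

/// Escalate from closing stdin, to SIGTERM, to SIGKILL.
fn shutdown_gracefully(
    child: &mut Child,
    term_grace: Duration,
    kill_grace: Duration,
) -> io::Result<ExitStatus> {
    // Stage 1: closing stdin lets a well-behaved agent see EOF and exit.
    drop(child.stdin.take());
    if let Some(status) = wait_with_timeout(child, term_grace)? {
        return Ok(status);
    }
    // Stage 2: ask politely with SIGTERM.
    unsafe { kill(child.id() as i32, SIGTERM) };
    if let Some(status) = wait_with_timeout(child, kill_grace)? {
        return Ok(status);
    }
    // Stage 3: force termination with SIGKILL and reap the process.
    child.kill()?;
    child.wait()
}
```

With the timings quoted above, a caller would pass something like `Duration::from_millis(1500)` and `Duration::from_secs(1)`. The sketch signals only the direct child, so it would still need to be combined with process groups to reach grandchildren.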

Key insight from Scion: container-per-agent makes orphans impossible by design (docker rm -f kills the entire process tree). This is the most robust long-term architecture but requires more infrastructure.

Steps to Reproduce

  1. Deploy openab with kiro-cli on a host with limited RAM (e.g. 3.6 GB)
  2. Send messages from Discord that create multiple threads (up to max_sessions)
  3. Archive/close the Discord threads
  4. Observe that kiro-cli and kiro-cli-chat processes remain running
  5. Run ps aux | grep kiro-cli — orphaned processes accumulate
  6. Eventually the host runs out of memory and the pod/container is killed

Expected Behavior

  • When a Discord thread is archived, the associated session and all its child processes should be terminated
  • When the pool is full, the oldest idle session should be evicted to make room
  • When a session is dropped, all descendant processes (including grandchildren) should be killed via process group signal
  • Default TTL should be reasonable for small hosts, or clearly documented
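
The eviction and TTL expectations above can be illustrated with a small in-memory pool. This is a sketch under assumptions, not openab's real pool: `Session`, `SessionPool`, and `evict_expired` are hypothetical, the `get_or_create` signature is invented for the example (the current time is passed in so the behavior is deterministic), and every session is treated as idle, whereas a real pool would skip sessions that are busy.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a pooled session; a real one would own the
/// agent process handle and kill its process group when dropped.
struct Session {
    last_active: Instant,
}

struct SessionPool {
    max_sessions: usize,
    sessions: HashMap<String, Session>,
}

impl SessionPool {
    fn new(max_sessions: usize) -> Self {
        assert!(max_sessions > 0, "pool needs room for at least one session");
        SessionPool { max_sessions, sessions: HashMap::new() }
    }

    /// Touch or create the session for `key`. When the pool is full, the
    /// session with the oldest `last_active` is evicted instead of rejecting
    /// the request. Returns the evicted key, if any.
    fn get_or_create(&mut self, key: &str, now: Instant) -> Option<String> {
        if let Some(session) = self.sessions.get_mut(key) {
            session.last_active = now;
            return None;
        }
        let mut evicted = None;
        if self.sessions.len() >= self.max_sessions {
            let oldest = self
                .sessions
                .iter()
                .min_by_key(|(_, s)| s.last_active)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.sessions.remove(&k);
                evicted = Some(k);
            }
        }
        self.sessions.insert(key.to_string(), Session { last_active: now });
        evicted
    }

    /// Remove every session idle for at least `ttl`; returns the removed keys.
    fn evict_expired(&mut self, now: Instant, ttl: Duration) -> Vec<String> {
        let mut expired = Vec::new();
        self.sessions.retain(|key, session| {
            let keep = now.duration_since(session.last_active) < ttl;
            if !keep {
                expired.push(key.clone());
            }
            keep
        });
        expired
    }
}
```

The linear scan in `get_or_create` is fine for a pool capped at a handful of sessions; a larger pool might keep an ordered index instead.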
