Skip to content

fix: memory leak by using process groups to kill orphaned grandchildren#270

Closed
Joseph19820124 wants to merge 1 commit into
openabdev:mainfrom
Joseph19820124:fix/memory-leak-process-group
Closed

fix: memory leak by using process groups to kill orphaned grandchildren#270
Joseph19820124 wants to merge 1 commit into
openabdev:mainfrom
Joseph19820124:fix/memory-leak-process-group

Conversation

@Joseph19820124
Copy link
Copy Markdown
Contributor

Fixes #269

Problem Summary

When openab terminates an agent process, only the direct child (kiro-cli) was being killed. Its sub-processes (kiro-cli-chat) were becoming orphaned and consuming significant memory (~300MB per session).

Changes

  1. Added libc dependency.
  2. Configured tokio::process::Command to use a new process group (process_group(0)) for each agent.
  3. Implemented Drop for AcpConnection to send SIGKILL to the entire process group (-PGID) on cleanup.
  4. Added a regression test case in src/acp/connection.rs that verifies orphaned grandchild cleanup.

Verified with cargo test test_process_group_cleanup.

@Joseph19820124 Joseph19820124 force-pushed the fix/memory-leak-process-group branch 5 times, most recently from f6b5882 to 1875c72 Compare April 13, 2026 08:43
@Joseph19820124 Joseph19820124 force-pushed the fix/memory-leak-process-group branch from 1875c72 to 3bc9762 Compare April 13, 2026 08:46
Copy link
Copy Markdown
Collaborator

@chaodu-agent chaodu-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🙏

核心改動正確:process_group(0) + Drop 裡 SIGTERM 整個 group,乾淨解決孤兒程序 memory leak。

✅ kill_on_drop 已移除,不再重疊
✅ SIGKILL → SIGTERM,更 graceful
✅ regression test 覆蓋
✅ scope 乾淨,只動核心邏輯

Minor follow-ups (not blocking):

  • Consider SIGTERM → wait → SIGKILL fallback for processes that trap SIGTERM
  • Version bump in Cargo.lock (0.6.6 → 0.7.2) ideally should be in a separate commit

@Joseph19820124
Copy link
Copy Markdown
Contributor Author

Technical Note: SIGTERM vs. SIGKILL and Graceful Shutdown

I've updated the PR to use SIGTERM for a more graceful shutdown of the process group. Here's a brief breakdown of why this was chosen and potential future improvements:

  • SIGTERM vs. SIGKILL: SIGTERM is a request for the process to shut down voluntarily, allowing for cleanup. SIGKILL is an immediate, forced termination by the kernel that cannot be ignored or trapped.
  • The Potential Issue: Processes can ignore or trap SIGTERM. If an agent's subprocess (the grandchild) traps SIGTERM, it might stay alive, and the memory leak would persist.
  • Proposed Follow-up Strategy: A more robust approach (similar to docker stop) would be:
    1. Send SIGTERM.
    2. Wait for a few seconds (e.g., 3-5s).
    3. If the process group is still active, send SIGKILL.
  • Architectural Constraint: Since the Drop trait in Rust is synchronous and cannot await, implementing a 'wait-and-kill' strategy would require refactoring to include an async fn shutdown() before the AcpConnection is dropped.
  • Current Risk Assessment: The risk is currently low. Most AI agent CLIs (like kiro-cli, gemini, or qwen) do not trap SIGTERM in a way that prevents termination. The current implementation should be sufficient for the immediate memory leak issue while being safer than an immediate SIGKILL.

This can be tracked as a separate follow-up task to further improve the reliability of the agent lifecycle management.

@Joseph19820124
Copy link
Copy Markdown
Contributor Author

Comparative Analysis: OpenClaw (Node.js/PTY) vs. openab (Rust/Process Groups)

I've looked into how other tools like OpenClaw handle the 'grandchild leak' issue to ensure our approach in openab is robust. Here’s a breakdown of the architectural differences:

1. OpenClaw's Approach: Pseudo-Terminals (PTY)

OpenClaw (Node.js) uses node-pty instead of standard pipes.

  • The Shortcut: In Linux, when a PTY master is closed, the kernel automatically sends a SIGHUP (Hangup) signal to the entire process group associated with that terminal.
  • Why it works: This 'terminal hangup' mechanism ensures that the grandchild (kiro-cli-chat) is cleaned up by the OS kernel automatically.
  • The Catch: PTYs are heavyweight and introduce terminal emulation overhead (e.g., handling ANSI escape codes), which is overkill for the JSON-RPC based ACP protocol.

2. openab's Approach: Process Groups (PGID)

Since openab is built in Rust and prioritizes performance, we use standard pipes via tokio::process.

  • The Pipe Constraint: Unlike PTYs, standard pipes do not propagate signals to grandchildren. When openab drops the connection, the grandchild becomes an orphan (PPID 1) by default.
  • The Rust Drop Limitation: Rust's Drop trait is synchronous. We cannot easily await an asynchronous shutdown sequence inside a drop.
  • Our Fix: By using process_group(0) and sending signals to the entire group (-PGID), we achieve the same cleanup reliability as a PTY but without the unnecessary overhead of terminal emulation.

Conclusion

Using Process Groups is the most 'idiomatic' and lightweight way to solve this in Rust. It provides precise control over the agent's process tree while maintaining the performance benefits of standard I/O pipes. Our current implementation of Drop using SIGTERM on the PGID is the first step toward a robust, leak-free session pool.

@chenjian-agent
Copy link
Copy Markdown
Contributor

@chaodu-agent Thanks for the review and approval!

Regarding your follow-up suggestions:

  1. SIGTERM -> SIGKILL fallback: Agreed that this is a more robust pattern. As noted in the discussion, this will require refactoring the current synchronous into an to support the wait window. I've noted this as a follow-up task.
  2. Version Bump: I'll ensure any future version bumps are clearly separated into dedicated commits/PRs to maintain history cleanliness.

@pahud Thanks for taking a look as well!

@chenjian-agent
Copy link
Copy Markdown
Contributor

@chaodu-agent Thanks for the review and approval!

Regarding your follow-up suggestions:

  1. SIGTERM -> SIGKILL fallback: Agreed that this is a more robust pattern. As noted in the discussion, this will require refactoring the current synchronous Drop into an async fn shutdown() to support the wait window. I've noted this as a follow-up task.
  2. Version Bump: I'll ensure any future version bumps are clearly separated into dedicated commits/PRs to maintain history cleanliness.

@pahud Thanks for taking a look as well!

Copy link
Copy Markdown
Contributor

@masami-agent masami-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this — the process group approach is the right direction for cleaning up orphaned grandchildren. A few things I'd like to flag before this merges:

1. Version bump should be in a separate PR
The Cargo.lock change from 0.6.60.7.2 is unrelated to this fix. Please revert it and handle version bumps in a dedicated commit/PR to keep history clean.

2. kill_on_drop(true) should be preserved as a fallback
Removing kill_on_drop means if Drop doesn't run (e.g. panic abort, or future refactoring that skips drop), the direct child process will leak. kill_on_drop and the process group SIGTERM are complementary — keeping both gives defense in depth. I'd recommend adding it back.

3. Guard against pid == 0 in Drop
If pid is 0, kill(-0, SIGTERM) sends SIGTERM to the caller's own process group, which would kill openab itself. A simple guard would prevent this:

if let Some(pid) = self._proc.id() {
    let pgid = pid as i32;
    if pgid > 0 {
        unsafe { libc::kill(-pgid, libc::SIGTERM); }
    }
}

4. Windows fallback
The Drop impl is #[cfg(unix)] only, which is fine — but since kill_on_drop(true) was removed unconditionally, Windows builds now have no cleanup at all. At minimum, kill_on_drop(true) should remain for non-unix targets.

The SIGTERM vs SIGKILL discussion and the test coverage look good. Happy to re-review once the above are addressed.

@Joseph19820124
Copy link
Copy Markdown
Contributor Author

截屏2026-04-14 20 59 56

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Memory Leak: Stale 'kiro-cli-chat' processes orphaned after session termination

6 participants