Bug: Claude child processes not reaped after query completion (single-user mode) #269

@akuehner

Summary

In Clay v2.23.0 single-user mode (no osUsers), each sdk.query() call spawns a claude child process via the Claude Agent SDK. After the query stream completes, the process is never terminated — it remains as a child of the daemon indefinitely. Over normal usage this leads to dozens of orphaned claude processes consuming gigabytes of RAM.

Environment

  • Clay v2.23.0, Agent SDK v0.2.92
  • Linux (systemd service), single-user mode, no osUsers
  • Daemon running as root

Observed Behavior

After ~24 hours of normal usage:

  • 35 orphaned claude processes (all children of daemon PID)
  • ~7.8 GB RSS consumed
  • All orphans had exactly 2 file descriptors (base stdio sockets only — no active client connections)
  • Only the 3 sessions with active WebSocket clients had 3-4 FDs
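The counts above can be gathered from Node.js by reading /proc directly. The following is a minimal sketch, assuming Linux with /proc mounted; countFds and listChildPids are hypothetical helper names for illustration and are not part of Clay or the Agent SDK.

```javascript
const fs = require('fs');

// Hypothetical helper: count the open file descriptors of a process by
// listing /proc/<pid>/fd. Returns null when the directory cannot be read
// (process gone, insufficient permissions, or no /proc on this platform).
function countFds(pid) {
  try {
    return fs.readdirSync(`/proc/${pid}/fd`).length;
  } catch (err) {
    return null;
  }
}

// Hypothetical helper: list the direct children of parentPid by scanning
// /proc/<pid>/stat. The command name sits in parentheses and may contain
// spaces, so parsing starts after the last ')': the fields that follow
// are the process state and then the parent PID.
function listChildPids(parentPid) {
  let entries;
  try {
    entries = fs.readdirSync('/proc');
  } catch (err) {
    return []; // no /proc available
  }
  const children = [];
  for (const name of entries) {
    if (!/^\d+$/.test(name)) continue;
    try {
      const stat = fs.readFileSync(`/proc/${name}/stat`, 'utf8');
      const rest = stat.slice(stat.lastIndexOf(')') + 2).split(' ');
      if (Number(rest[1]) === parentPid) children.push(Number(name));
    } catch (err) {
      // The process exited between readdir and read; skip it.
    }
  }
  return children;
}
```

Calling listChildPids(daemonPid) and then countFds(pid) for each result would give a per-child FD count comparable to the figures listed above.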

Root Cause Analysis

In sdk-bridge.js, the in-process query path (startQuery(), used when linuxUser is null):

  1. sdk.query() is called at line ~1767, spawning a claude child process via the Agent SDK
  2. processQueryStream() iterates the async iterable (for await ... of session.queryInstance)
  3. When the stream ends, the finally block (line ~1594) sets session.queryInstance = null but never calls .close() or [Symbol.dispose]() on the query object

Compare this to the worker path (startQueryViaWorker()), which cleans up properly: cleanupSessionWorker() → worker.kill() → sends "shutdown", then SIGKILL after 3s. The in-process path has no equivalent cleanup.
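For reference, the "shutdown, then SIGKILL after a deadline" pattern described for the worker path can be sketched roughly as follows. This is not Clay's actual worker.kill() implementation: the function name shutdownWithDeadline and the { type: 'shutdown' } message shape are assumptions for illustration. The worker argument is expected to look like a Node.js ChildProcess with an IPC channel (send, kill, and the 'exit' event).

```javascript
// Hypothetical helper: ask a worker to shut down cooperatively, then
// force-kill it with SIGKILL if it has not exited within graceMs.
function shutdownWithDeadline(worker, graceMs = 3000) {
  let exited = false;

  // Fallback: force-kill once the grace period has elapsed.
  const timer = setTimeout(() => {
    if (!exited) worker.kill('SIGKILL');
  }, graceMs);

  // If the worker exits on its own, cancel the fallback.
  worker.once('exit', () => {
    exited = true;
    clearTimeout(timer);
  });

  try {
    // Assumed message shape; the issue only says that "shutdown" is sent.
    worker.send({ type: 'shutdown' });
  } catch (err) {
    // IPC channel already closed; the timer above still force-kills.
  }
}
```

The point of the comparison is that the worker path has a hard deadline after which the process is guaranteed to be gone, while the in-process path relies entirely on the SDK query being closed.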

Suggested Fix

In processQueryStream()'s finally block, before nulling queryInstance, explicitly close/dispose it:

// In the finally block of processQueryStream():
if (session.queryInstance) {
  // Close the SDK query to terminate the underlying claude process
  if (typeof session.queryInstance.close === 'function') {
    try { session.queryInstance.close(); } catch (e) {}
  } else if (typeof session.queryInstance[Symbol.dispose] === 'function') {
    try { session.queryInstance[Symbol.dispose](); } catch (e) {}
  }
}
session.queryInstance = null;
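To check the pattern in isolation, here is a self-contained sketch that runs without the SDK. makeFakeQuery is a hypothetical stand-in for the object returned by sdk.query() (an async iterable with a close() method, which is an assumption about its shape), and this processQueryStream is a simplified illustration, not the function from sdk-bridge.js.

```javascript
// Fake stand-in for an SDK query: an async iterable with close().
// failAt lets the stream throw partway through to exercise the error path.
function makeFakeQuery(messages, failAt = -1) {
  const state = { closed: false };
  return {
    state,
    close() { state.closed = true; },
    async *[Symbol.asyncIterator]() {
      for (let i = 0; i < messages.length; i++) {
        if (i === failAt) throw new Error('stream failed');
        yield messages[i];
      }
    },
  };
}

// Best-effort close: prefer close(), fall back to Symbol.dispose if present.
function closeQuery(query) {
  if (!query) return;
  try {
    if (typeof query.close === 'function') {
      query.close();
    } else if (typeof query[Symbol.dispose] === 'function') {
      query[Symbol.dispose]();
    }
  } catch (err) {
    // Cleanup must not mask the original stream error.
  }
}

// Simplified illustration: the finally block runs on normal completion
// and on error, so the query is closed in both cases.
async function processQueryStream(session, onMessage) {
  try {
    for await (const msg of session.queryInstance) {
      onMessage(msg);
    }
  } finally {
    closeQuery(session.queryInstance);
    session.queryInstance = null;
  }
}
```

Putting the close in finally rather than after the loop matters because a stream that throws would otherwise skip the cleanup and leave the child process behind.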

Additionally, deleteSession() and deleteSessionQuiet() in sessions.js should also ensure any active query's process is killed (they currently only call abortController.abort() and messageQueue.end(), which don't terminate the child process).
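A rough sketch of what that teardown could look like follows. The session fields (abortController, messageQueue, queryInstance) are taken from the description above, but their exact shapes are assumptions, and teardownSession is a hypothetical name rather than an existing function in sessions.js.

```javascript
// Hypothetical helper that deleteSession() / deleteSessionQuiet() could
// share: run the existing abort + end steps, then also close the SDK
// query so the underlying child process is terminated.
function teardownSession(session) {
  if (session.abortController) {
    session.abortController.abort();
  }
  if (session.messageQueue && typeof session.messageQueue.end === 'function') {
    session.messageQueue.end();
  }

  // Detach the query first so a second call becomes a no-op for it.
  const query = session.queryInstance;
  session.queryInstance = null;
  if (query && typeof query.close === 'function') {
    try {
      query.close();
    } catch (err) {
      // Best effort: deleting the session should not fail on cleanup.
    }
  }
}
```

Clearing queryInstance before calling close() keeps the helper safe to call twice, for example when a delete races with the stream's own finally block.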

Workaround

Manually kill orphaned processes:

# Find claude processes with at most 2 open sockets (orphaned, no active client).
# Caution: this matches every process whose command line ends in "claude",
# not only children of the daemon.
for pid in $(ps -eo pid,cmd --no-headers | awk '/claude$/ {print $1}'); do
  sockets=$(ls -l "/proc/$pid/fd" 2>/dev/null | grep -c socket)
  if [ "$sockets" -le 2 ]; then
    kill -TERM "$pid"
  fi
done

Impact

On a machine with 8 GB RAM, this exhausted available memory within ~24 hours of normal usage, likely degrading performance of all sessions.
