Activity monitor should detect and recover from Claude Code whirlpool hang #498

@Nanyan

Problem

Claude Code's Whirlpool (context compaction) can hang indefinitely, leaving the agent completely unresponsive. The current activity monitor does not detect this state: the Claude process is still running and the tmux session still exists, so all liveness checks pass, yet the agent processes zero messages.

Incident: zylos305 (2026-04-14 ~13:16 UTC)

Timeline:

  • Agent was actively working (code edits, test runs, GitHub issue creation on coco-dashboard)
  • Context reached 82% (control_queue warnings at 71% and 82%)
  • Claude entered Whirlpool compaction at ~13:16 UTC
  • Whirlpool hung — over 14 minutes with no progress (Whirlpooling… (13m 57s · ↓ 2.6k tokens · thought for 2s))
  • 3 messages queued (2 scheduled tasks + 1 Lark group message), none processed
  • Agent remained in this state until manual intervention

Environment snapshot:

  • Claude Code 2.1.107
  • Claude process PID 2298112, RSS 634MB, running for 8h16m
  • VM: cocoai-zylos305-1b (7.8Gi RAM, 73% disk)
  • Last clean exit was 2026-04-04 (10 days of continuous operation with /clear-based session resets)
  • 14 IN vs 5 OUT messages in last hour at time of detection
  • state.md was large with extensive test environment context

Tmux pane at time of detection:

● Bash(cd ~/zylos/workspace/coco-dashboard && gh issue create ...)
  ⎿  https://github.com/coco-xyz/coco-dashboard/issues/1329

✶ Whirlpooling… (13m 57s · ↓ 2.6k tokens · thought for 2s)
  ⎿  Tip: Use /btw to ask a quick side question...

  ❯ [Scheduled Task: task-mnv2xy40-p20fiz] ...
  ❯ [Lark GROUP:coco-dashboard研发] 3deca688 said: ...
  ❯ [Scheduled Task: task-mn31ohyb-hiq20f] ...

Expected Behavior

The activity monitor should detect that Claude is stuck (not producing output despite having queued input) and automatically recover — either by aborting the compaction or by killing and restarting the Claude process.

Proposed Detection Approaches

Option A: Tmux screen scrape (targeted)

  • Periodically capture tmux pane text
  • Detect Whirlpooling or Compacting keywords
  • If the text persists for >5 minutes, consider it hung
  • Recovery: kill Claude process → activity monitor auto-restarts
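A minimal Python sketch of Option A, under stated assumptions: `HangTracker` and `capture_pane` are illustrative names, not existing activity-monitor code, and the pane target passed to `capture_pane` is whatever the deployment uses. The only external call is `tmux capture-pane -p -t <target>`, which prints the pane's visible text to stdout. The tracker records when a hang keyword first appears and reports a hang once the keyword has stayed on screen past the threshold.

```python
import re
import subprocess

# Keywords named in Option A; the 5-minute threshold is also from Option A.
HANG_PATTERN = re.compile(r"Whirlpooling|Compacting")
HANG_THRESHOLD_S = 5 * 60


def capture_pane(target: str) -> str:
    """Return the visible text of a tmux pane (target name is deployment-specific)."""
    result = subprocess.run(
        ["tmux", "capture-pane", "-p", "-t", target],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


class HangTracker:
    """Tracks how long a hang keyword has been continuously visible."""

    def __init__(self, threshold_s: float = HANG_THRESHOLD_S):
        self.threshold_s = threshold_s
        self.first_seen = None  # time the keyword first appeared, or None

    def observe(self, pane_text: str, now: float) -> bool:
        """Feed one pane snapshot; return True once the keyword outlives the threshold."""
        if HANG_PATTERN.search(pane_text):
            if self.first_seen is None:
                self.first_seen = now
            return now - self.first_seen > self.threshold_s
        # Keyword gone: compaction finished, so reset the timer.
        self.first_seen = None
        return False
```

A monitor loop would call something like `tracker.observe(capture_pane(target), time.monotonic())` on each poll. Matching the keyword anywhere in the pane can misfire if the agent merely prints the word in ordinary output, so a real implementation may want to match only the spinner line.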

Option B: Message flow monitoring (general)

  • Monitor c4.db conversations table
  • If last IN is newer than last OUT by >N minutes while Claude process is running, flag as stuck
  • This would catch whirlpool hangs AND other hang scenarios (API timeouts, etc.)
  • More robust but may need tuning to avoid false positives during legitimate long-running tasks
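A minimal Python sketch of Option B. Everything about the schema here is an assumption: that `c4.db` is a SQLite file, and that `conversations` has a `direction` column holding `'in'`/`'out'` and a numeric `timestamp` column in epoch seconds. The real table may differ. The sketch interprets "stuck" as: the oldest IN message that arrived after the last OUT has been waiting longer than the threshold while the Claude process is running.

```python
import sqlite3
from typing import Optional


def oldest_unanswered_in(conn: sqlite3.Connection) -> Optional[float]:
    """Timestamp of the earliest IN message newer than the last OUT, or None.

    Assumed schema (hypothetical): conversations(direction TEXT, timestamp REAL).
    """
    row = conn.execute(
        """
        SELECT MIN(timestamp) FROM conversations
        WHERE direction = 'in'
          AND timestamp > COALESCE(
              (SELECT MAX(timestamp) FROM conversations WHERE direction = 'out'),
              -1)
        """
    ).fetchone()
    return row[0]


def is_stuck(
    conn: sqlite3.Connection,
    now: float,
    threshold_s: float,
    process_running: bool = True,
) -> bool:
    """True if an IN message has waited longer than threshold_s with no OUT since."""
    if not process_running:
        # A dead process is a different failure; existing liveness checks cover it.
        return False
    waiting_since = oldest_unanswered_in(conn)
    return waiting_since is not None and now - waiting_since > threshold_s
```

Measuring from the oldest unanswered IN, rather than from the gap between the last IN and the last OUT, avoids flagging an agent that was simply idle for a long time before a fresh message arrived.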

Option C: Combine both

  • Use Option B as the general detector with a longer threshold (e.g., 15-20 min)
  • Use Option A as a fast-path for known hang patterns (e.g., 5 min for whirlpool)
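One way the combined policy could be expressed, as a sketch rather than a definitive design: `Signals` and `decide` are hypothetical names, and the two inputs are assumed to be produced elsewhere by a screen-scrape detector and a message-flow detector. The default thresholds take 5 minutes for the fast path and the upper end (20 minutes) of the suggested general range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    # Seconds a known hang keyword has been on screen; None if not visible.
    pattern_age_s: Optional[float]
    # Seconds the oldest unanswered IN message has waited; None if none pending.
    unanswered_age_s: Optional[float]


def decide(
    signals: Signals,
    fast_s: float = 5 * 60,
    general_s: float = 20 * 60,
) -> Optional[str]:
    """Return the reason to trigger recovery, or None if the agent looks healthy."""
    # Fast path: a recognised hang pattern needs less evidence.
    if signals.pattern_age_s is not None and signals.pattern_age_s > fast_s:
        return "known-pattern"
    # General path: any long silence with queued input, whatever the cause.
    if signals.unanswered_age_s is not None and signals.unanswered_age_s > general_s:
        return "message-flow"
    return None
```

Returning the reason rather than a bare boolean lets the monitor log which detector fired, which helps when tuning thresholds against false positives.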

Impact

  • Agent becomes completely unresponsive during a whirlpool hang
  • All queued messages (user messages + scheduled tasks) pile up
  • No automatic recovery — requires manual SSH intervention
  • Users perceive the bot as "dead" with no explanation

Workaround

Manual: SSH into VM → kill Claude PID → activity monitor auto-restarts → queued messages redelivered in new session.
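The kill step of this workaround could be automated roughly as follows. This is a sketch under assumptions: `stop_process` and `pid_alive` are hypothetical helpers, the grace period is an arbitrary choice, and it relies on the activity monitor restarting Claude afterwards as described above. It sends SIGTERM first and escalates to SIGKILL only if the process is still present after the grace period.

```python
import os
import signal
import time


def pid_alive(pid: int) -> bool:
    """Check whether a PID exists by sending signal 0 (no signal is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def stop_process(pid: int, grace_s: float = 10.0, poll_s: float = 0.5) -> str:
    """Ask the process to exit, then force-kill it if it outlives the grace period."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "terminated"
        time.sleep(poll_s)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        # Exited between the last poll and the kill.
        return "terminated"
    return "killed"
```

Note that an exited child that has not yet been reaped by its parent still counts as alive to the signal-0 check, so the function may report "killed" for a process that had in fact already exited on SIGTERM.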
