Tool-call hang detection: add timeout for stuck WebFetch/WebSearch #492

@Nanyan

Problem

Claude Code's built-in tools (WebFetch, WebSearch) have no timeout mechanism. When a request hangs, the entire Claude Code turn is blocked: no messages can be processed until the hung tool call resolves or is manually interrupted.

Incidents

  1. the9bit_cocobot (2026-04-08): A WebSearch call waited indefinitely, leaving the bot unresponsive for hours. Heartbeat was disabled, the guardian only monitored process death (not functional liveness), and ProcSampler was blind to Codex internals. [Post-mortem R5]
  2. wlnhxb (2026-04-08~09): Fetch(https://webcache.googleusercontent.com/...) hung with no completion status (Fetching… shown indefinitely). The Ideating phase showed 4h 23m of wall time, and 6 messages were queued but unprocessed. Resolved by manually pressing Escape in tmux.

Root Cause

  • WebFetch and WebSearch are Claude Code built-in tools — we cannot add timeouts to them directly.
  • The Claude Code process remains alive (it has not crashed), so process-death monitoring (activity-monitor) does not trigger a restart.
  • Activity-monitor detects IDLE/BUSY state transitions, but a stuck tool call keeps the state frozen at BUSY, so there are no transitions to detect.

Proposed Solution

Add a max-BUSY-duration timeout (hook or activity-monitor enhancement):

  1. Monitor continuous BUSY duration. If Claude Code stays BUSY longer than a configurable threshold (e.g., 20-30 minutes), trigger recovery.
  2. Recovery action: Send Escape to the tmux session to interrupt the hung tool call, followed by /clear if needed.
  3. Configuration: The timeout should be configurable (default 30 min). Some tasks legitimately take a long time, so the threshold should be generous.
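
The three steps above can be sketched roughly as follows. This is a minimal illustration and not the activity-monitor's actual code: `BusyWatchdog`, `recovery_commands`, `recover`, and the `"BUSY"` state string are hypothetical names, and the sketch assumes the monitor can poll the current state periodically. The only external command used is `tmux send-keys`.

```python
import subprocess
import time
from typing import Callable, List, Optional

DEFAULT_TIMEOUT_S = 30 * 60  # proposed default threshold: 30 minutes


class BusyWatchdog:
    """Tracks how long the session has been continuously BUSY (hypothetical helper)."""

    def __init__(self, timeout_s: float = DEFAULT_TIMEOUT_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.busy_since: Optional[float] = None

    def observe(self, state: str) -> bool:
        """Record one polled state; return True once BUSY has lasted >= timeout_s."""
        if state != "BUSY":
            self.busy_since = None  # any non-BUSY observation resets the timer
            return False
        now = self.clock()
        if self.busy_since is None:
            self.busy_since = now  # first BUSY observation starts the timer
        return now - self.busy_since >= self.timeout_s

    def reset(self) -> None:
        """Call after recovery so the watchdog does not fire again immediately."""
        self.busy_since = None


def recovery_commands(tmux_target: str, clear: bool = False) -> List[List[str]]:
    """Build the tmux commands for recovery: Escape first, then /clear if requested."""
    cmds = [["tmux", "send-keys", "-t", tmux_target, "Escape"]]
    if clear:
        cmds.append(["tmux", "send-keys", "-t", tmux_target, "/clear", "Enter"])
    return cmds


def recover(tmux_target: str, clear: bool = False) -> None:
    """Run the recovery commands against a real tmux session."""
    for cmd in recovery_commands(tmux_target, clear):
        subprocess.run(cmd, check=True)
```

A polling loop would call `observe()` with each sampled state and, when it returns `True`, call `recover()` followed by `reset()`. Using a monotonic clock keeps the measured duration unaffected by wall-clock adjustments on the VM.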

Implementation Options

  • Option A (hook-based — recommended by voya.luo): Add a hook that monitors tool-call duration. If a single tool call exceeds the timeout, send Escape to tmux.
  • Option B (activity-monitor enhancement): Extend BUSY-state tracking to detect prolonged BUSY without state transitions and trigger Escape.
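
For comparison, Option A could track each tool call individually instead of the overall BUSY state. The sketch below is an assumption-heavy illustration: `ToolCallTracker`, `on_start`, `on_finish`, and the call identifiers are hypothetical, and it presumes the hooks system can notify us when a tool call starts and finishes, which this issue does not confirm.

```python
import time
from typing import Callable, Dict, List


class ToolCallTracker:
    """Records start times of in-flight tool calls (hypothetical helper)."""

    def __init__(self, timeout_s: float = 30 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self._started: Dict[str, float] = {}

    def on_start(self, call_id: str) -> None:
        """Invoke from a (hypothetical) pre-tool-call hook."""
        self._started[call_id] = self.clock()

    def on_finish(self, call_id: str) -> None:
        """Invoke from a (hypothetical) post-tool-call hook."""
        self._started.pop(call_id, None)

    def overdue(self) -> List[str]:
        """Return ids of calls that have been in flight for >= timeout_s."""
        now = self.clock()
        return sorted(cid for cid, t0 in self._started.items()
                      if now - t0 >= self.timeout_s)
```

One design point worth noting: a hung call never reaches `on_finish`, so `overdue()` has to be polled by a separate timer or process. A hook that only runs at tool-call boundaries cannot detect the hang by itself, which narrows the practical gap between Option A and Option B.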

Affected Components

  • activity-monitor — needs max-BUSY-duration detection
  • Potentially hooks system — if implementing as a hook

Priority

Medium-High — affects all Claude-runtime VMs. Two incidents in one day across different customers.

References

  • the9bit_cocobot post-mortem: R2 (max-BUSY-duration kill), R5 (web search timeout)
  • wlnhxb tmux evidence: * Ideating… (4h 23m 7s · ↓ 138 tokens · thought for 2s)
