Skip to content

fix(acp): clean up pending + cancel agent on abandoned prompts#760

Open
brettchien wants to merge 2 commits intoopenabdev:mainfrom
brettchien:fix/issue-732-recv-timeout-cleanup
Open

fix(acp): clean up pending + cancel agent on abandoned prompts#760
brettchien wants to merge 2 commits intoopenabdev:mainfrom
brettchien:fix/issue-732-recv-timeout-cleanup

Conversation

@brettchien
Copy link
Copy Markdown
Contributor

@brettchien brettchien commented May 6, 2026

Closes #732.

Problem

Originally reported on Discord: https://discord.com/channels/1491295327620169908/1491365158868619404/1500930306620920040

Prompts running longer than 600 s triggered recv_timeout in adapter.rs:386. The broker printed "Agent stopped responding" and broke out of the recv loop, but:

  • pending[request_id] was not removed
  • No session/cancel was sent — the agent kept running
  • When the agent eventually finished, it emitted a response with the original id. connection.rs:286 found the stale pending entry and forwarded it to the current prompt's notify_tx — causing that prompt to break immediately with empty text_buf(no response)
  • The cascade repeated for every subsequent prompt until the backlog drained

Fix (A + B + C)

(A) Replace flat 600 s timeout with a tokio::select! loop:

  • recv arm — normal path
  • 30 s liveness arm — checks conn.alive() + configurable hard ceiling
  • Long-running tools no longer trip the timeout; only a dead reader or the hard ceiling abandons the prompt

(B) AcpConnection::abandon_request(request_id) — drops pending[request_id] and best-effort sends session/cancel on every abandon path

(C) Capture request_id from session_prompt() (was discarded as _); skip notifications whose id doesn't match — defense-in-depth if any future abandon path misses abandon_request

Config

New pool.prompt_hard_timeout_secs (default 1800 s = 30 min). Acts as a safety net against runaway sessions while the 30 s liveness check covers the typical case.

Testing

  • 238 existing tests pass (cargo test --bin openab)
  • cargo clippy -- -D warnings clean
  • Manual repro: bash -c 'sleep 700 && echo done', wait past 600 s, send follow-up → real reply (no (no response) cascade)
  • Manual: kill agent mid-prompt → recv loop bails within ~30 s with Agent process died, next prompt works clean

No unit test for abandon_request — no test seam without a real subprocess; covered end-to-end via real ACP backends. session/cancel shape matches existing cancel_session / reset_session in pool.rs (prod-verified).

Refs

🤖 Generated with Claude Code

…bdev#732)

The flat 600s recv_timeout in adapter.rs:386 fires "Agent stopped responding"
without removing pending[id] or sending session/cancel. The agent keeps
running the abandoned prompt and eventually emits its final response with
the original id. The reader at connection.rs:284 looks up pending[id], sees
the now-stale entry, and forwards the message to the *current* notify_tx
subscriber — which belongs to the next prompt. The next prompt's loop sees
notification.id.is_some() and breaks immediately with empty text_buf,
returning "(no response)". Each new prompt sent before the agent drains its
backlog inherits the previous prompt's stale id and the cascade persists.

Fix follows the issue's recommended A+B+C:

(A) Replace flat 600s timeout with a tokio::select! loop in stream_prompt_blocks.
    Recv arm + 30s liveness arm. Liveness arm checks conn.alive() (cheap,
    just !reader_handle.is_finished()) and a configurable hard ceiling.
    Default ceiling is 30 min via pool.prompt_hard_timeout_secs. Long-running
    tools no longer trip the timeout — only a dead reader task or the hard
    ceiling abandon the prompt.

(B) Add AcpConnection::abandon_request(request_id) called on every abandon
    path: drops pending[request_id] so a late response cannot route to a
    future subscriber, and best-effort writes session/cancel so the agent
    stops working on a request the broker has given up on.

(C) Capture request_id from session_prompt() (was discarded as `_`) and
    skip notifications whose id doesn't match. Defense-in-depth at the
    routing layer; complements (B)'s cleanup if any future abandon path
    forgets to call abandon_request.

No unit test for abandon_request — the connection has no test seam without
spawning a real subprocess. Behavior is exercised end-to-end via the
adapter loop on real ACP backends.

Refs:
- openabdev#76 (Assumption 2: prompts always complete)
- openabdev#307 (sibling: same family, different visible symptom)
- openabdev#470 (added the 600s recv timeout this issue exposes)

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@brettchien brettchien requested a review from thepagent as a code owner May 6, 2026 14:55
@github-actions github-actions Bot added closing-soon PR missing Discord Discussion URL — will auto-close in 3 days pending-screening PR awaiting automated screening and removed closing-soon PR missing Discord Discussion URL — will auto-close in 3 days labels May 6, 2026
@shaun-agent
Copy link
Copy Markdown
Contributor

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report ## Intent

PR #760 is trying to fix an ACP prompt lifecycle bug where long-running prompts were incorrectly treated as dead after a flat 600-second receive timeout.

The operator-visible problem is that OpenAB reports Agent stopped responding, but does not fully clean up the abandoned request. The stale pending request can later receive an old agent response and incorrectly deliver it into a newer prompt, causing (no response) cascades for subsequent prompts.

Feat

This is a bug fix for ACP request/session handling.

It changes prompt execution behavior so long-running tools are not abandoned just because they exceed 600 seconds. Instead, OpenAB uses a liveness-check loop, abandons requests explicitly when needed, removes stale pending entries, best-effort sends session/cancel, and ignores mismatched response IDs as defense in depth.

It also adds a configurable hard ceiling: pool.prompt_hard_timeout_secs, defaulting to 1800 seconds.

Who It Serves

Primary beneficiaries:

  • Agent runtime operators
  • Discord and Slack end users whose prompts depend on ACP agents
  • Maintainers debugging long-running or stuck ACP sessions
  • Reviewers responsible for reliability of prompt dispatch and session cleanup

Rewritten Prompt

Fix ACP prompt abandonment so stale responses from timed-out or dead prompts cannot corrupt later prompts.

Replace the flat 600-second receive timeout with a loop that separately handles normal responses, periodic agent liveness checks, and a configurable hard timeout. When a prompt is abandoned for timeout, dead process, dropped channel, or similar failure, remove its pending request entry and best-effort send session/cancel. Preserve the request ID returned by session_prompt() and ensure notifications are only accepted when their response ID matches the active request.

Add configuration for the hard timeout with a safe default, update example config, and verify existing tests and clippy. Add targeted tests if practical; otherwise document why subprocess-backed ACP behavior requires manual or integration coverage.

Merge Pitch

This is worth advancing because it fixes a concrete reliability failure: one abandoned ACP prompt can poison later prompts until the stale backlog clears. The fix addresses both the immediate timeout behavior and the deeper cleanup issue around pending request ownership.

Risk profile is moderate. The change touches prompt dispatch, ACP connection state, config parsing, and process/session cancellation behavior. The main reviewer concern should be whether all abandon paths consistently call the new cleanup path, and whether session/cancel is safe to send for every abandoned request.

Best-Practice Comparison

Relevant OpenClaw principles:

  • Gateway-owned scheduling: partially relevant. This PR improves gateway-side ownership of prompt lifecycle decisions.
  • Durable job persistence: not directly relevant. The bug is in live ACP request routing, not persisted jobs.
  • Isolated executions: relevant in spirit. Stale responses should not affect later prompts.
  • Explicit delivery routing: highly relevant. Matching response IDs to active request IDs prevents cross-prompt delivery.
  • Retry/backoff and run logs: not directly addressed. Better logs around abandon/cancel paths may be useful follow-up.

Relevant Hermes Agent principles:

  • Gateway daemon tick model: partially relevant. The new 30-second liveness check resembles periodic supervision.
  • File locking to prevent overlap: not relevant unless ACP sessions are persisted or shared across processes.
  • Atomic writes for persisted state: not relevant.
  • Fresh session per scheduled run: conceptually relevant. The fix prevents stale session output from contaminating later work, though it does not create fresh sessions per prompt.
  • Self-contained prompts for scheduled tasks: not relevant to this specific PR.

Overall, the PR aligns most strongly with explicit delivery routing, isolated execution behavior, and gateway-owned lifecycle supervision.

Implementation Options

Conservative option: minimal cleanup fix
Keep the existing timeout shape but ensure every timeout path removes pending[request_id], sends best-effort session/cancel, and drops mismatched response IDs. This is smaller but still leaves long-running valid tools vulnerable to the flat timeout.

Balanced option: accept this PR’s direction
Replace the flat timeout with recv/liveness/hard-timeout supervision, add abandon_request, preserve request IDs, and reject mismatched notifications. This fixes the known cascade while keeping the design local to ACP prompt handling.

Ambitious option: formal ACP request lifecycle manager
Introduce an explicit request state machine with states like pending, active, abandoned, cancel_sent, completed, and failed. Add structured run logs, metrics, integration tests with a fake ACP subprocess, and centralized cancellation semantics across prompt, reset, and session shutdown paths.

Comparison Table

Option Speed to ship Complexity Reliability Maintainability User impact Fit for OpenAB right now
Conservative cleanup only High Low Medium Medium Medium Decent, but incomplete
Balanced PR approach Medium Medium High Good High Best fit
Full lifecycle manager Low High Very high High if done well High Better as follow-up

Recommendation

Advance the balanced approach from this PR, with reviewer focus on abandonment coverage, cancellation safety, and response-ID routing.

This is the right next step because it fixes the observed production failure without forcing a larger ACP lifecycle redesign into the same merge. A good follow-up would be a separate PR adding a fake ACP backend or subprocess test harness so timeout, cancel, stale-response, and dead-agent cases can be covered automatically.

@chaodu-agent
Copy link
Copy Markdown
Collaborator

🟢 Four-monk review — no blocking issues

Verdict: Approve. The 3-layer fix (select loop + abandon_request + stale id filtering) is well-structured and addresses the root cause of #732 cleanly.

Highlights

  • tokio::select! replaces the flat 600s timeout — long-running tools no longer get killed; only a dead reader or the hard ceiling triggers abandon
  • abandon_request() removes pending[request_id] and best-effort sends session/cancel (confirmed: ACP spec defines this as a client notification, no id, no response expected)
  • Stale id filtering provides defense-in-depth against any future missed abandon path
  • None => break path is safe — reader loop already drains all pending entries on EOF

Non-blocking NITs

# Note Suggestion
1 LIVENESS_CHECK_INTERVAL hardcoded at 30s Could be configurable for operators; fine as follow-up
2 session/cancel response silently dropped if agent replies Add trace! log when id-bearing response has no pending entry — improves observability
3 Hard timeout format shows 0m if < 60s Edge case; consider as_secs() display or min-value guard in config
Reviewer breakdown
Reviewer Verdict Extra observations
超渡法師 (Claude) 🟢 Baseline check confirmed leak path; verified tokio::select! race is safe for UnboundedReceiver
普渡法師 (Claude) 🟢 Verified None => break assumption (reader drains pending on EOF); flagged format edge case
擺渡法師 (Codex) 🟢 Confirmed session/cancel is spec-compliant notification; GitHub checks passing
覺渡法師 (Gemini) 🟢 Praised select loop design; agreed on trace log suggestion

- pool.liveness_check_secs: hoist the recv-loop poll cadence out of a
  hard-coded const onto PoolConfig so deployments can tune it. Default
  remains 30s.
- adapter: change hard-timeout error message from ({}m) to ({}s) so
  non-multiple-of-60 ceilings render correctly (e.g. 90s → "(90s)").
- acp/connection: emit a tracing::trace! line when an id-bearing message
  arrives whose pending entry was already abandoned. Behaviour is
  unchanged — the adapter recv loop still filters by request_id; this
  just makes the stale-response path observable at trace level.

cargo check + cargo clippy -- -D warnings + cargo test --bin openab all
clean (238 passed).
@brettchien
Copy link
Copy Markdown
Contributor Author

Follow-up — applied all 3 NITs

Thanks for the review. All three NITs addressed in f323bb0.

# NIT Resolution
1 LIVENESS_CHECK_INTERVAL hardcoded at 30s Hoisted onto PoolConfig as pool.liveness_check_secs (default 30, documented in config.toml.example). Wired through AdapterRouter::new.
2 id-bearing response has no pending entry (stale-id observability) Added trace!(request_id = id, "stale id-bearing message after abandon") at the fall-through path in acp/connection.rs. Behaviour unchanged — the message still falls through to subscriber forwarding; the adapter recv loop's request_id filter remains the actual safety net. The trace just makes the stale path observable.
3 Hard timeout format shows 0m if < 60s Format changed from ({}m) / as_secs() / 60 to ({}s) / as_secs(). Renders correctly for any positive ceiling (e.g. 90s instead of 1m).

cargo check && cargo clippy -- -D warnings && cargo test --bin openab clean (238 passed).

Copy link
Copy Markdown
Collaborator

@chaodu-agent chaodu-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approve — all NITs addressed

Verified commit f323bb0 resolves the three NITs from my earlier review:

  1. liveness_check_secs now configurable via PoolConfig (was hardcoded 30s)
  2. trace!(request_id = id, "stale id-bearing message after abandon") added for observability
  3. ✅ Hard timeout format uses as_secs() — no more 0m edge case

The 3-layer fix (select loop + abandon_request + stale id filtering) is solid. cargo test + clippy clean. Ready to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pending-screening PR awaiting automated screening

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(acp): recv timeout on busy agent + stale-id routing → next prompt returns (no response) cascade

3 participants