fix(app): recover hidden question blockers#430
Conversation
Question.ask used to silently delete its pending entry when the fiber was interrupted (e.g. session cancel) without telling subscribers. The frontend question store would then keep an orphan entry forever and the dock could end up hidden while the assistant still appeared blocked. See issue #419. Add an Effect.onInterrupt that removes the pending entry FIRST and then publishes question.rejected, so any subscriber that races on the event and calls question.list() can never see a ghost entry. The reply / reject / instance-dispose paths fail the deferred normally and skip this hook, so their existing event publishes are unaffected. The interrupt log line carries reason: "interrupted" so post-mortems can tell user-rejection from system cancellation.
When the processor cleans up an in-flight question tool after the run was cancelled, it writes part.state.error which the LLM reads as the tool result on the next turn. The generic "Tool execution aborted" string was ambiguous between "user dismissed your question" and "the run was cancelled before they answered" — the latter is what actually happened here, and the wrong reading made models assume the user had refused. See issue #419. Rewrite to "Question cancelled before the user answered it." for question tools only; other tools keep the existing message. This states the certain fact (cancelled before answered) without claiming whether the user saw the question, since they may have.
The question fallback used to bail whenever sync already held any question entry, so a model emitting parallel question tool calls with one or more asked events lost would never recover the missed entries. See issue #419. Replace hasQuestionRequest with a per-(messageID, callID) identity check: trigger recovery whenever a running question tool part on this session has no matching sync entry. Fall back to a count check for the rare entries that lack tool identity (seeded test fixtures). Counts and identities stay scoped to this session so a parent walking the parent/child tree can't mask a local loss.
When a session is cancelled while a question tool is awaiting an answer the tool part transitions to error and the message stream renders a generic ToolErrorCard. That card shows the raw backend error string, which non-technical users cannot act on. See issue #419. Recognize the cancelled-question case via metadata.interrupted (already written by processor cleanup, so this stays decoupled from the exact backend error string) and render a short, non-blaming hint that states the certain fact (cancelled, no answer received) and points the user at the prompt input. Add the i18n key in both packages/ui/src/i18n locales.
Drive a real question tool through the cancel path: seed a question via the LLM mock, abort the session, and assert the dock disappears and the message stream surfaces the friendly cancelled-question hint. This is the user-path E2E coverage required by AGENTS.md for the #419 fix.
Adds a `cancelled` flag to Question.RejectedError so the processor only sets metadata.interrupted when the rejection came from session cancel (signal abort, fiber interrupt, or instance dispose), not from explicit user dismiss. Without this, an intentional dismissal renders with the same "session was interrupted" hint as a cancel. Also wraps the abort-signal callback with InstanceState.bind so the fork'd bus.publish reliably has Instance ALS context when fired from the JS event loop, and adds an explicit signal-path test (the prior test exercised only the fiber-interrupt defence-in-depth arm).
A sync question entry that lacks tool identity (legacy or seeded data) should still cover any one running part, regardless of whether the running part has identity. Previously the identity check returned the session immediately on any uncovered running-with-identity part, so a mixed old/new state could trigger fallback recovery unnecessarily. Pool both running-with-identity and running-without-identity into one uncovered count and only fire when the total exceeds entries-without- identity.
Without this, a developer with e.g. GEMINI_API_KEY exported on their host machine inherits that env into the spawned worker backend, and the auto-picked default model becomes a real provider — bypassing the in-process OPENCODE_E2E_LLM_URL fixture and silently making real API calls (or failing with auth errors). Strip *_API_KEY / *_API_TOKEN plus a small explicit list for the long tail (GITHUB_TOKEN for Copilot, HF_TOKEN, AWS_BEARER_TOKEN_BEDROCK, GOOGLE_APPLICATION_CREDENTIALS, etc).
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (10)
📝 WalkthroughWalkthroughAdds AbortSignal-driven cancellation to Question.ask, marks cancelled question tool parts as interrupted with a friendly error, introduces a recovery snapshot/clock/reverify flow to auto-halt stuck sessions, updates composer wiring and UI/i18n for interrupted hints, extends tests, and scrubs host AI credentials in the e2e backend fixture. ChangesQuestion cancellation, recovery clock, UI, and tests
Sequence DiagramsequenceDiagram
participant User
participant UI as UI/MessagePart
participant Proc as Session Processor
participant Tool as Question Tool
participant Q as Question Service
participant Abort as AbortController
User->>Proc: request session.abort / halt
Proc->>Abort: propagate abort via ctx.abort
Abort->>Tool: ctx.abort fires
Tool->>Q: call question.ask(..., signal: ctx.abort)
Q->>Q: attach abort listener to signal
Abort->>Q: signal aborts
Q->>Q: delete pending, publish Question.Event.Rejected
Q-->>Tool: reject ask promise with RejectedError(cancelled:true)
Tool-->>Proc: error propagates to processor
Proc->>Proc: detect cancelled question → set metadata.interrupted = true
Proc->>UI: update part state (error + interrupted metadata)
UI->>UI: detect metadata.interrupted === true
UI-->>User: render interrupted hint (i18n key)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 3 minutes and 21 seconds.Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses issue #419 by improving the handling of cancelled question tools, ensuring that pending UI elements are cleared and the user is provided with a friendly hint. Key changes include integrating AbortSignal into the Question service for reliable cancellation, refactoring the fallback logic to use unique tool identities (messageID and callID), and updating the UI and i18n strings to handle interrupted states. Additionally, the E2E test setup was modified to strip host-provided AI credentials. A critical issue was identified in the Question service regarding the incorrect usage of Effect.runFork within a callback, which lacks the necessary fiber context and requires the use of a captured runtime.
Edge-triggered clock that arms a HEAL_DELAY_MS timer when the snapshot transitions into missingRunning and clears it when the snapshot leaves. Map entry is deleted before any await to guarantee at-most-once fire per arm. Reverify is consumer-supplied so the 4-guard re-check lives at the call site. tick() is exposed for tests because the SSR build of solid-js under bun does not propagate signal updates through createEffect.
Hoist the halt helper above createSessionComposerState so the auto-heal clock inside createSessionBlockers can call it. Threading the dependency keeps the SDK + sync wiring at the page level and lets the blocker hook stay free of Page-only context.
The clock arms only when the snapshot reducer reports missingRunning; its reverify runs four guards before halting: 1. snapshot still missingRunning, 2. active session + directory unchanged since arm, 3. session still busy, 4. server question.list confirms the running part is still uncovered (delegates to findRunningQuestionFallbackSession so auto-heal and the recovery dock cannot disagree). When the server already covers the question we write back into sync and abort the halt. When the server call itself fails we proceed to halt; the user has been hung for HEAL_DELAY_MS so surfacing the interrupted card is safer than continuing to wait.
When the recovery clock fires and aborts the session, the queued followup must auto-send on the next busy=false tick. The new test walks busy=true→false with blocked=false (matching the auto-heal flow where the dock never surfaces) and checks the predicate.
…st errors Five crosscheck-driven fixes on the auto-heal clock: - Pass a non-swallowing halt variant to the clock so its 'halt failed' warn actually fires when sdk.session.abort rejects (session.tsx kept a swallowing variant for sessionRevert which already chains its own catch). - Recovery clock now forgets the previous active session's pending timer and lastSeen entry on navigation, so coming back to a still-stuck session re-arms cleanly. Also bounds lastSeen to one entry at a time. - Reverify returns proceed:false on question.list() failures instead of halting blindly: a transient blip should not kill a possibly-healthy session, navigation cleanup will retry the next time the user returns. - Re-check guards 1-3 after the question.list() await so a snapshot or busy transition during the round-trip cannot lead to a stale halt. - Remove the duplicate clock.dispose() onCleanup at the call site; the clock's own onCleanup is the single owner.
Both crosscheck reviewers in round 2 converged on the same dead-end: when reverify returned proceed:false on a question.list() blip, the pending entry was already deleted and lastSeen[sid] still read missingRunning, so a sticky stuck session would never re-arm without user navigation. ReverifyOutcome now carries an optional retry flag; the clock arms one follow-up timer when reverify asks for it. The use-session-blockers list() failure path uses this so a single transient error costs another HEAL_DELAY_MS rather than disabling auto-heal entirely. Snapshot-edge proceed:false cases (state moved away, server confirmed covered) keep the clean dead-end.
Two consecutive crosscheck rounds independently flagged this branch as a "dead end" because the warn line reads as if it just returns proceed:false. The retry:true contract is in question-recovery-clock.ts, one file away. A pointer comment keeps the next reviewer on the rails.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/app/src/pages/session/blockers/question-recovery-clock.test.ts (1)
2-2: ⚡ Quick winUse a single
createStorefor the harness state.These two Solid signals are just holding one small, related state object. Folding them into a store matches the repo convention and keeps the harness updates less split.
Proposed refactor
-import { createRoot, createSignal } from "solid-js" +import { createRoot } from "solid-js" +import { createStore } from "solid-js/store" ... - const [snap, setS] = createSignal<QuestionRecoverySnapshot>(none) - const [sid, setSidSignal] = createSignal<string | undefined>(overrides?.initialSid ?? "s") + const [state, setState] = createStore({ + snap: none as QuestionRecoverySnapshot, + sid: overrides?.initialSid ?? "s" as string | undefined, + }) setSnap = (s) => { - setS(s) + setState("snap", s) clock.tick() } setSid = (s) => { - setSidSignal(s) + setState("sid", s) clock.tick() } clock = createQuestionRecoveryClock({ - snapshot: snap, - activeSessionID: sid, + snapshot: () => state.snap, + activeSessionID: () => state.sid,As per coding guidelines,
packages/app/**/*.{ts,tsx,js,jsx}: Always prefercreateStoreover multiplecreateSignalcalls in SolidJS.Also applies to: 72-73
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/app/src/pages/session/blockers/question-recovery-clock.test.ts` at line 2, Replace the two separate Solid signals used in the test harness with a single createStore to hold the combined state: locate the createSignal usages (createSignal) inside the createRoot block in this test and create a store via createStore that contains both pieces of state, update all reads/writes to use the store accessor/mutator, and remove the extra createSignal imports; also update imports to include createStore and remove the now-unused createSignal references and apply the same change for the other occurrences referenced around lines 72-73.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/app/src/pages/session/blockers/question-recovery-clock.test.ts`:
- Line 2: Replace the two separate Solid signals used in the test harness with a
single createStore to hold the combined state: locate the createSignal usages
(createSignal) inside the createRoot block in this test and create a store via
createStore that contains both pieces of state, update all reads/writes to use
the store accessor/mutator, and remove the extra createSignal imports; also
update imports to include createStore and remove the now-unused createSignal
references and apply the same change for the other occurrences referenced around
lines 72-73.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 17910edc-d9f8-42d9-be9e-03bc1104b05e
📒 Files selected for processing (8)
packages/app/src/pages/session.tsxpackages/app/src/pages/session/blockers/question-recovery-clock.test.tspackages/app/src/pages/session/blockers/question-recovery-clock.tspackages/app/src/pages/session/blockers/question-recovery-snapshot.test.tspackages/app/src/pages/session/blockers/question-recovery-snapshot.tspackages/app/src/pages/session/blockers/use-session-blockers.tspackages/app/src/pages/session/composer/session-composer-state.tspackages/app/src/pages/session/use-session-followups.test.ts
The retry path could loop forever if question.list() stayed broken — every 3 s the clock would re-arm and pound the failing endpoint. Per arm we now allow at most one follow-up attempt; a second consecutive transient failure stops and waits for a fresh snapshot edge. The new test locks both the bound and the snapshot-edge-revives-it semantics.
The abort listener fires from the JS event loop, outside any fiber, so Effect.runFork(...) was using the empty default runtime — bus.publish worked only because InstanceState.bind restored ALS as a fallback. Capture the parent's Effect context once and Effect.provide it to the publish + Deferred.fail forks, so the InstanceRef + service layer flow through explicitly. The bind wrap is kept for log.info and any consumer still on the ALS path. The existing "ask - publishes question.rejected on input.signal abort" integration test already exercises this path; 40/40 question tests pass.
Extract the 4-guard reverify wiring from createSessionBlockers into a pure module so it can be unit-tested without standing up the full sdk + sync + permission + language provider tree. Locks each guard (snapshot / session+directory / busy / server-still-uncovered), the post-await re-check, and both branches of the server response (covered → hydrate sync; still uncovered → license halt).
The hint must reappear when metadata.interrupted flips from undefined to true *without* a page reload. In Solid that requires reading partMetadata as an accessor over part().state, not a setup-time snapshot. Add a structural assertion locking the accessor pattern plus a unit test on the metadata extractor that covers the shape variations the live message stream actually emits (initial undefined, gained-on-update, fresh reference for downstream equality checks). A full render harness (@solidjs/testing-library + happydom for the ui package) was considered but is infrastructure work outside this PR's scope; the structural + extractor coverage is enough to trip a future "let me memo this once" refactor before it reaches users.
R5: collapse the duplicated AbortSignal vs onInterrupt explainers in Question.ask() into one short note — the lengthier history belongs in #419 and the test names. R6: replace the hand-rolled wait-for-pending loop with a waitForPending() helper that asserts the pending question actually appeared. Without the assertion the abort tests would pass even if Question.ask never reached the publish path (timeout silently → controller.abort() is a no-op → events stay empty → toHaveLength(1) fires only because we expected it).
Replace colloquial phrasing with concise written form: "尚未收到回答" / "如需继续,请在下方重新说明" reads more like a system notice and matches the surrounding zh strings. Behaviour unchanged.
1c3fc14 to
1cfec29
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/app/src/pages/session/blockers/question-recovery-reverify.ts`:
- Around line 5-14: The ReverifyDeps<Q> generic is too loose and forces callers
to cast to never, causing questionRecoveryReverify() to pass incorrectly shaped
objects into findRunningQuestionFallbackSession(); tighten the contract by
changing ReverifyDeps<Q> to require the actual synced question/message/part
shapes used by findRunningQuestionFallbackSession() (i.e., questions with
tool.messageID/callID, parts with top-level messageID/callID, and the
messagesFor return type matching the message shape), remove the unsafe "as
never" workarounds, and update listQuestions, partsByMessageID, and messagesFor
signatures to reflect those precise types so callers and tests must provide
correctly shaped data.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0e01a28f-d733-49ea-b4ab-28c94b5a21ee
📒 Files selected for processing (9)
packages/app/src/pages/session/blockers/question-recovery-clock.test.tspackages/app/src/pages/session/blockers/question-recovery-clock.tspackages/app/src/pages/session/blockers/question-recovery-reverify.test.tspackages/app/src/pages/session/blockers/question-recovery-reverify.tspackages/app/src/pages/session/blockers/use-session-blockers.tspackages/opencode/src/question/index.tspackages/opencode/test/question/question.test.tspackages/ui/src/components/message-part-stale.test.tspackages/ui/src/i18n/zh.ts
✅ Files skipped from review due to trivial changes (1)
- packages/ui/src/i18n/zh.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- packages/opencode/test/question/question.test.ts
- packages/app/src/pages/session/blockers/question-recovery-clock.ts
- packages/opencode/src/question/index.ts
Q generic now extends `{ sessionID; tool?: { messageID; callID }; id? }`
matching the real SDK shape, instead of top-level messageID/callID. To
let reverify pass `Q[]` straight into findRunningQuestionFallbackSession
without `as never`, the fallback's syncQuestions input is widened to a
structural QuestionFallbackEntry shape (`{ tool? }`); QuestionRequest[]
callers still satisfy it via subtyping.
Also fixes the reverify tests that previously passed by coincidence:
fake questions used top-level `messageID/callID` (which `q.tool` reads
as undefined → legacy bucket), and running parts put `callID` inside
`state` instead of at part level. With both shapes corrected, identity
matching is now actually exercised. Adds an identity-mismatch test:
server returns same-session question with a different `tool.callID` →
proceed:true (running call remains uncovered, halt is licensed).
Trim overlapping explanations down to three load-bearing facts: input.signal is the production cancel channel, Effect.onInterrupt is defence for direct fiber kill, and the abort callback fires from the JS event loop so it needs the captured Effect context. No behavioural change.
ReverifyDeps.partsByMessageID and messagesFor previously returned `unknown`, forcing two `as never` casts at the fallback call site that masked any caller wiring up a wrong shape. Tighten to ReadonlyArray<Part> / ReadonlyArray<Message> (the SDK shapes that fallback actually reads), and widen findRunningQuestionFallbackSession's input to the same readonly shapes. QuestionRequest[] callers in snapshot.ts and use-session-blockers .ts still satisfy the contract via covariant subtyping; the test harness casts terse fake fixtures to the SDK shapes at the deps boundary so unit tests stay short while production callers must wire up the real types.
Round R15–R18 follow-upLatest review batch (Codex output) processed: Accepted + landed in this PR
Deferred to #433 (commented as checklist items)
Already filed
|
Round R20–R22 follow-upLatest review batch processed: Accepted + landed
P3 deferred / acknowledged
|
processor.failToolCall writes errorMessage(error) into part.state.error on the abort-signal path. With cancelled === true the message getter now returns the same friendly copy the legacy fiber-cleanup path uses, so consumers (state.error, logs, telemetry) read consistent text. Refs #419.
Replace retried boolean with retries counter capped at MAX_RETRIES (3). Persistent transient list() failures now log a structured warn before the clock stops, instead of dead-ending after a single follow-up. A fresh snapshot edge or session navigation still resets the budget. Refs #419.
Persistent transient list() failures previously stopped silently at MAX_RETRIES, leaving the user stuck on a hidden blocker. Now the clock warns and falls through to halt, which is the same conservative action the user could trigger manually anyway. Refs #419.
Replace two independent runFork calls with a single Effect.gen pipeline so subscribers see Rejected before any awaiter unblocks and any internal failure surfaces through one error log instead of disappearing into a detached fork. Refs #419.
End-to-end test wires snapshot reducer + clock + reverify against the same harness state to lock the recovery contract as a whole: edge into missingRunning arms a timer, server hydration on fire writes back and skips halt, transient list() failure recovers on the bounded follow-up, and snapshot flipping out of missingRunning before fire cancels cleanly. Refs #419.
2dd8905 to
31b256a
Compare
Summary
Recovers hidden question blockers across two paths: (1) backend reliably tears down the pending question and publishes
question.rejectedwhen a session is cancelled mid-question, and (2) frontend now has a snapshot + auto-heal clock that detects a running question part with no sync coverage and halts the stuck session as a last resort. The cancelled question's tool part renders an interrupted hint in the message stream.Why
When the user cancels mid-stream while the LLM has invoked the
questiontool, the prior code left the entry inpendingforever (becauseEffectBridge.run.promiseruns the tool viaEffect.runPromise, which does not propagate parent fiber interrupts). The dock kept rendering with no way to dismiss, the tool part stayed inrunningstate, and the user was stuck staring at a dead UI. Even after the backend fix, sync drops or worker race conditions could still leave a question part visible without a matching pending entry; the auto-heal clock guarantees recovery in that residual case. See #419.Related Issue
Refs #419.
Human Review Status
Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.
Review Focus
packages/opencode/src/question/index.ts— the cancellation channel: tool'sAbortSignalis now the only reliable cancel path;Effect.onInterruptremains as defence-in-depth for direct fiber kill (layer shutdown, supervisor).failFromAbortmutates pending + publishes Rejected synchronously, capturedEffect.context<never>()provides Instance ALS to the fork'dbus.publishwhen fired from the JS event loop.RejectedError.cancelledflag — distinguishes session-cancel (signal/interrupt/dispose, setsmetadata.interrupted) from explicit user dismissal (no hint).packages/app/src/pages/session/blockers/question-fallback.ts— multi-pending recovery: identity matching by(messageID, callID), with pooled buckets so legacy entries without identity can absorb running parts with identity.packages/app/src/pages/session/blockers/question-recovery-{snapshot,clock,reverify}.ts— auto-heal: snapshot reducer classifies (none/ready/missingRunning), clock arms an edge-triggered timer onmissingRunning, reverify re-checks four guards and re-pullsquestion.list()before halting. The clock is single-session, active-only by design: it trackslastActiveSidand forgets the previous session's pending timer + edge state on navigation. Background sessions are NOT auto-healed — this matches the original symptom (user stares at a stuck active session with no way out) and avoids cross-session false positives. Bounded retry: up toMAX_RETRIES(3) follow-up attempts per arm; if the budget exhausts, the clock logs a structured warn and escalates to halt rather than leave the user stuck on a hidden blocker. Fresh snapshot edge or session navigation still resets the budget for a new arm.packages/app/e2e/backend.ts— host AI provider env vars are scrubbed from the spawned worker backend so the e2e fixture'sOPENCODE_E2E_LLM_URLrouting always wins.Risk Notes
session.abort) — same effect as the user pressing stop. Guarded by four pre-fire checks + post-await re-check + server reverify, so halt should only trigger on genuinely stuck sessions.Question.RejectedErrorgains an optionalcancelledboolean. All existingnew RejectedError()callers (reject(), finalizer, etc.) keep default behavior.How To Verify
Screenshots or Recordings
N/A — UI change is conditional rendering of an existing tool-error hint variant; covered by the e2e test which asserts the hint copy and dock dismissal. Auto-heal clock is non-visual (it halts the session, no new UI).
Checklist
dev, and my PR title and commit messages use Conventional Commits in EnglishSummary by CodeRabbit
New Features
Bug Fixes
Tests