fix(app): recover hidden question blockers by Astro-Han · Pull Request #430 · Astro-Han/pawwork

Astro-Han · 2026-05-04T08:04:25Z

Summary

Recovers hidden question blockers across two paths: (1) backend reliably tears down the pending question and publishes question.rejected when a session is cancelled mid-question, and (2) frontend now has a snapshot + auto-heal clock that detects a running question part with no sync coverage and halts the stuck session as a last resort. The cancelled question's tool part renders an interrupted hint in the message stream.

Why

When the user cancels mid-stream while the LLM has invoked the question tool, the prior code left the entry in pending forever (because EffectBridge.run.promise runs the tool via Effect.runPromise, which does not propagate parent fiber interrupts). The dock kept rendering with no way to dismiss, the tool part stayed in running state, and the user was stuck staring at a dead UI. Even after the backend fix, sync drops or worker race conditions could still leave a question part visible without a matching pending entry; the auto-heal clock guarantees recovery in that residual case. See #419.

Related Issue

Refs #419.

Human Review Status

Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.

Review Focus

packages/opencode/src/question/index.ts — the cancellation channel: tool's AbortSignal is now the only reliable cancel path; Effect.onInterrupt remains as defence-in-depth for direct fiber kill (layer shutdown, supervisor). failFromAbort mutates pending + publishes Rejected synchronously, captured Effect.context<never>() provides Instance ALS to the fork'd bus.publish when fired from the JS event loop.
RejectedError.cancelled flag — distinguishes session-cancel (signal/interrupt/dispose, sets metadata.interrupted) from explicit user dismissal (no hint).
packages/app/src/pages/session/blockers/question-fallback.ts — multi-pending recovery: identity matching by (messageID, callID), with pooled buckets so legacy entries without identity can absorb running parts with identity.
packages/app/src/pages/session/blockers/question-recovery-{snapshot,clock,reverify}.ts — auto-heal: snapshot reducer classifies (none / ready / missingRunning), clock arms an edge-triggered timer on missingRunning, reverify re-checks four guards and re-pulls question.list() before halting. The clock is single-session, active-only by design: it tracks lastActiveSid and forgets the previous session's pending timer + edge state on navigation. Background sessions are NOT auto-healed — this matches the original symptom (user stares at a stuck active session with no way out) and avoids cross-session false positives. Bounded retry: up to MAX_RETRIES (3) follow-up attempts per arm; if the budget exhausts, the clock logs a structured warn and escalates to halt rather than leave the user stuck on a hidden blocker. Fresh snapshot edge or session navigation still resets the budget for a new arm.
packages/app/e2e/backend.ts — host AI provider env vars are scrubbed from the spawned worker backend so the e2e fixture's OPENCODE_E2E_LLM_URL routing always wins.

Risk Notes

Behavioral: question tool error rendering changes. A user-dismissed question now does NOT show the "interrupted" hint (previously it did, conflated with session cancel).
Behavioral: when the auto-heal clock fires, the session is halted (session.abort) — same effect as the user pressing stop. Guarded by four pre-fire checks + post-await re-check + server reverify, so halt should only trigger on genuinely stuck sessions.
Schema: Question.RejectedError gains an optional cancelled boolean. All existing new RejectedError() callers (reject(), finalizer, etc.) keep default behavior.
No data migration. No new dependencies. No platform-specific code.

How To Verify

opencode typecheck: clean
app typecheck: clean
opencode unit tests (test/question test/session test/permission): 675 pass / 0 fail
question-fallback.test.ts: 9 pass / 0 fail
question-recovery-snapshot.test.ts + question-recovery-clock.test.ts + question-recovery-reverify.test.ts: pass
message-part-stale.test.ts: 6 pass / 0 fail
e2e cancelled-question test (with GEMINI_API_KEY=fake set): 1 pass / 6.0s
e2e cancelled-question test (env clean): 1 pass / 6.4s

Screenshots or Recordings

N/A — UI change is conditional rendering of an existing tool-error hint variant; covered by the e2e test which asserts the hint copy and dock dismissal. Auto-heal clock is non-visual (it halts the session, no new UI).

Summary by CodeRabbit

New Features
- Sessions can auto-recover stuck question flows and optionally halt a running session to heal pending questions.
- Interrupted questions now show a clear, localized hint in the message stream prompting users to ask again.
Bug Fixes
- Better heuristics to detect and reconcile pending questions across session sync state.
- Cleaner, friendlier error text when questions are cancelled.
Tests
- Expanded end-to-end and unit tests covering cancellation, recovery clock behavior, i18n, and environment isolation.

Question.ask used to silently delete its pending entry when the fiber was interrupted (e.g. session cancel) without telling subscribers. The frontend question store would then keep an orphan entry forever and the dock could end up hidden while the assistant still appeared blocked. See issue #419. Add an Effect.onInterrupt that removes the pending entry FIRST and then publishes question.rejected, so any subscriber that races on the event and calls question.list() can never see a ghost entry. The reply / reject / instance-dispose paths fail the deferred normally and skip this hook, so their existing event publishes are unaffected. The interrupt log line carries reason: "interrupted" so post-mortems can tell user-rejection from system cancellation.
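A minimal sketch of that ordering, using plain promises and an AbortSignal instead of Effect so it stays self-contained. `PendingQuestions`, `onRejected`, and the error shape are hypothetical stand-ins, not the real Question service API; the only point is that the pending entry is gone before any subscriber is notified.

```typescript
// Hypothetical stand-in for the question service; the real implementation uses Effect.
type RejectedEvent = { id: string; reason: "interrupted" }

class RejectedError extends Error {
  cancelled: boolean
  constructor(cancelled: boolean) {
    super(cancelled ? "question cancelled" : "question rejected")
    this.cancelled = cancelled
  }
}

class PendingQuestions {
  private pending = new Map<string, (error: Error) => void>()
  private subscribers: Array<(event: RejectedEvent) => void> = []

  onRejected(fn: (event: RejectedEvent) => void): void {
    this.subscribers.push(fn)
  }

  list(): string[] {
    return Array.from(this.pending.keys())
  }

  // The reply path (resolving with an answer) is omitted from this sketch.
  ask(id: string, signal?: AbortSignal): Promise<string> {
    return new Promise<string>((_resolve, reject) => {
      this.pending.set(id, reject)
      const onAbort = () => {
        // 1. Remove the pending entry first, so list() can never show a ghost.
        if (!this.pending.delete(id)) return
        // 2. Then tell subscribers the question was rejected.
        for (const fn of this.subscribers) fn({ id, reason: "interrupted" })
        // 3. Finally fail the caller that is still awaiting an answer.
        reject(new RejectedError(true))
      }
      if (signal?.aborted) onAbort()
      else signal?.addEventListener("abort", onAbort, { once: true })
    })
  }
}
```

A subscriber that calls `list()` from inside its `onRejected` handler sees an empty list, which is the race the commit is guarding against.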

When the processor cleans up an in-flight question tool after the run was cancelled, it writes part.state.error which the LLM reads as the tool result on the next turn. The generic "Tool execution aborted" string was ambiguous between "user dismissed your question" and "the run was cancelled before they answered" — the latter is what actually happened here, and the wrong reading made models assume the user had refused. See issue #419. Rewrite to "Question cancelled before the user answered it." for question tools only; other tools keep the existing message. This states the certain fact (cancelled before answered) without claiming whether the user saw the question, since they may have.

The question fallback used to bail whenever sync already held any question entry, so a model emitting parallel question tool calls with one or more asked events lost would never recover the missed entries. See issue #419. Replace hasQuestionRequest with a per-(messageID, callID) identity check: trigger recovery whenever a running question tool part on this session has no matching sync entry. Fall back to a count check for the rare entries that lack tool identity (seeded test fixtures). Counts and identities stay scoped to this session so a parent walking the parent/child tree can't mask a local loss.

When a session is cancelled while a question tool is awaiting an answer the tool part transitions to error and the message stream renders a generic ToolErrorCard. That card shows the raw backend error string, which non-technical users cannot act on. See issue #419. Recognize the cancelled-question case via metadata.interrupted (already written by processor cleanup, so this stays decoupled from the exact backend error string) and render a short, non-blaming hint that states the certain fact (cancelled, no answer received) and points the user at the prompt input. Add the i18n key in both packages/ui/src/i18n locales.

Drive a real question tool through the cancel path: seed a question via the LLM mock, abort the session, and assert the dock disappears and the message stream surfaces the friendly cancelled-question hint. This is the user-path E2E coverage required by AGENTS.md for the #419 fix.

Adds a `cancelled` flag to Question.RejectedError so the processor only sets metadata.interrupted when the rejection came from session cancel (signal abort, fiber interrupt, or instance dispose), not from explicit user dismiss. Without this, an intentional dismissal renders with the same "session was interrupted" hint as a cancel. Also wraps the abort-signal callback with InstanceState.bind so the fork'd bus.publish reliably has Instance ALS context when fired from the JS event loop, and adds an explicit signal-path test (the prior test exercised only the fiber-interrupt defence-in-depth arm).
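A rough sketch of how the flag can drive the metadata, not the actual processor code. Only the `cancelled` flag, `metadata.interrupted`, and the cancelled error text come from this PR; `toToolErrorState`, the `ToolErrorState` shape, and the dismissal message are placeholders for illustration.

```typescript
// Hypothetical shapes; the dismissal message below is a placeholder.
class RejectedError extends Error {
  cancelled: boolean
  constructor(options: { cancelled?: boolean } = {}) {
    super(
      options.cancelled
        ? "Question cancelled before the user answered it."
        : "Question dismissed by the user.",
    )
    this.name = "RejectedError"
    this.cancelled = options.cancelled === true
  }
}

type ToolErrorState = { error: string; metadata: { interrupted?: true } }

// Session cancel sets metadata.interrupted; explicit user dismissal does not.
function toToolErrorState(error: unknown): ToolErrorState {
  const cancelled =
    error instanceof Error && (error as Error & { cancelled?: unknown }).cancelled === true
  return {
    error: error instanceof Error ? error.message : String(error),
    metadata: cancelled ? { interrupted: true } : {},
  }
}
```

Because `cancelled` defaults to false, existing `new RejectedError()` callers keep producing a state without the interrupted hint.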

A sync question entry that lacks tool identity (legacy or seeded data) should still cover any one running part, regardless of whether the running part has identity. Previously the identity check returned the session immediately on any uncovered running-with-identity part, so a mixed old/new state could trigger fallback recovery unnecessarily. Pool both running-with-identity and running-without-identity into one uncovered count and only fire when the total exceeds entries-without-identity.
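The pooled check can be sketched as below. The type names and `needsFallbackRecovery` are illustrative, not the signature of findRunningQuestionFallbackSession; the sketch also assumes the inputs are already scoped to one session.

```typescript
// Illustrative shapes: entries carry identity under `tool`, parts at top level.
type SyncEntry = { tool?: { messageID: string; callID: string } }
type RunningPart = { messageID?: string; callID?: string }

const identityKey = (messageID: string, callID: string) => `${messageID}\u0000${callID}`

function needsFallbackRecovery(
  running: ReadonlyArray<RunningPart>,
  entries: ReadonlyArray<SyncEntry>,
): boolean {
  const covered = new Set<string>()
  let entriesWithoutTool = 0
  for (const entry of entries) {
    if (entry.tool) covered.add(identityKey(entry.tool.messageID, entry.tool.callID))
    else entriesWithoutTool += 1
  }

  // Pool every running part that is not matched by identity, with or without
  // identity of its own, into a single uncovered count.
  let uncoveredRunning = 0
  for (const part of running) {
    if (part.messageID !== undefined && part.callID !== undefined) {
      if (covered.has(identityKey(part.messageID, part.callID))) continue
    }
    uncoveredRunning += 1
  }

  // Each identity-less entry can absorb one uncovered running part.
  return uncoveredRunning > entriesWithoutTool
}
```

With one legacy entry and one running part that has identity, the function returns false, which is the mixed old/new case that used to trigger recovery unnecessarily.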

Without this, a developer with e.g. GEMINI_API_KEY exported on their host machine inherits that env into the spawned worker backend, and the auto-picked default model becomes a real provider — bypassing the in-process OPENCODE_E2E_LLM_URL fixture and silently making real API calls (or failing with auth errors). Strip *_API_KEY / *_API_TOKEN plus a small explicit list for the long tail (GITHUB_TOKEN for Copilot, HF_TOKEN, AWS_BEARER_TOKEN_BEDROCK, GOOGLE_APPLICATION_CREDENTIALS, etc).
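A minimal sketch of the scrub, assuming the rule is a suffix match plus an explicit set. `scrubProviderEnv` is a hypothetical name, and the explicit set lists only the four variables named above; the real list is longer.

```typescript
// Only the names mentioned in this commit message; the actual list has more entries.
const EXPLICIT_PROVIDER_VARS = new Set([
  "GITHUB_TOKEN",
  "HF_TOKEN",
  "AWS_BEARER_TOKEN_BEDROCK",
  "GOOGLE_APPLICATION_CREDENTIALS",
])

// Returns a copy of `env` without host AI provider credentials.
function scrubProviderEnv(env: Record<string, string | undefined>): Record<string, string> {
  const clean: Record<string, string> = {}
  for (const [name, value] of Object.entries(env)) {
    if (value === undefined) continue
    if (/_API_(KEY|TOKEN)$/.test(name)) continue // *_API_KEY and *_API_TOKEN
    if (EXPLICIT_PROVIDER_VARS.has(name)) continue
    clean[name] = value
  }
  return clean
}
```

The result would be passed as the environment of the spawned worker backend, so OPENCODE_E2E_LLM_URL survives while GEMINI_API_KEY and friends do not.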

coderabbitai · 2026-05-04T08:04:42Z

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 1c3fc14 and 31b256a.

📒 Files selected for processing (10)

packages/app/src/pages/session/blockers/question-fallback.ts
packages/app/src/pages/session/blockers/question-recovery-chain.test.ts
packages/app/src/pages/session/blockers/question-recovery-clock.test.ts
packages/app/src/pages/session/blockers/question-recovery-clock.ts
packages/app/src/pages/session/blockers/question-recovery-reverify.test.ts
packages/app/src/pages/session/blockers/question-recovery-reverify.ts
packages/app/src/pages/session/blockers/question-recovery-snapshot.ts
packages/opencode/src/question/index.ts
packages/opencode/test/question/question.test.ts
packages/ui/src/i18n/zh.ts

📝 Walkthrough

Adds AbortSignal-driven cancellation to Question.ask, marks cancelled question tool parts as interrupted with a friendly error, introduces a recovery snapshot/clock/reverify flow to auto-halt stuck sessions, updates composer wiring and UI/i18n for interrupted hints, extends tests, and scrubs host AI credentials in the e2e backend fixture.

Changes

Question cancellation, recovery clock, UI, and tests

Layer / File(s)	Summary
API / Data Shape `packages/opencode/src/question/index.ts`	`ask(...)` gains optional `signal?: AbortSignal`; `RejectedError` payload adds optional `cancelled?: boolean`.
Core Question Cancellation `packages/opencode/src/question/index.ts`	`ask` registers abort listener that deletes pending entry, publishes `Event.Rejected`, and rejects deferred with `RejectedError({ cancelled: true })`; ensures cleanup removes listener.
Tool Wiring `packages/opencode/src/tool/question.ts`	`QuestionTool.execute` passes `signal: ctx.abort` into `question.ask`.
Session Processor `packages/opencode/src/session/processor.ts`	`failToolCall` and cleanup detect cancelled Question.RejectedError and set `metadata.interrupted: true`; cancelled question error text becomes `"Question cancelled before the user answered it."`.
Fallback Heuristic `packages/app/src/pages/session/blockers/question-fallback.ts`, `packages/app/src/pages/session/blockers/question-fallback.test.ts`	`findRunningQuestionFallbackSession` now accepts `syncQuestions`, builds covered (messageID,callID) set, counts uncovered running question parts, and triggers fallback only when uncoveredRunning > entriesWithoutTool; tests updated/expanded for identity, count fallback, and regression `#419`.
Recovery Snapshot & Reverify `packages/app/src/pages/session/blockers/question-recovery-snapshot.ts`, `packages/app/src/pages/session/blockers/question-recovery-snapshot.test.ts`, `packages/app/src/pages/session/blockers/question-recovery-reverify.ts`, `packages/app/src/pages/session/blockers/question-recovery-reverify.test.ts`	Adds `QuestionRecoverySnapshot` (`none
Recovery Clock `packages/app/src/pages/session/blockers/question-recovery-clock.ts`, `packages/app/src/pages/session/blockers/question-recovery-clock.test.ts`	New `createQuestionRecoveryClock` arms timers on `missingRunning` edges, calls `reverify` before halting, supports a single bounded retry, per-session pending state, disposal, and comprehensive unit tests using a fake clock.
Session Blockers / Composer Wiring `packages/app/src/pages/session/blockers/use-session-blockers.ts`, `packages/app/src/pages/session/composer/session-composer-state.ts`, `packages/app/src/pages/session.tsx`	`createSessionBlockers` and `createSessionComposerState` accept optional `halt(sessionID)`; `use-session-blockers` wires `createQuestionRecoveryClock` when `halt` is provided; `session.tsx` introduces `haltAbort` and forwards it.
UI, i18n & rendering `packages/ui/src/components/message-part.tsx`, `packages/ui/src/i18n/en.ts`, `packages/ui/src/i18n/zh.ts`, `packages/ui/src/components/message-part-stale.test.ts`	Renderer special-cases `partMetadata()?.interrupted === true` to show interrupted hint; adds `ui.messagePart.questions.interrupted` (EN/ZH) and tests verifying metadata-driven rendering and reactivity.
Tests & E2E harness `packages/opencode/test/`, `packages/app/src/pages/session/blockers/`, `packages/app/e2e/backend.ts`, `packages/app/e2e/session/session-composer-dock.spec.ts`	Extensive test additions/updates: question cancellation event and await rejection tests, processor effect test for friendly cancelled message, recovery-clock/reverify/snapshot unit tests, fallback tests, UI tests, e2e backend env scrubbing to remove host provider creds, and an e2e test asserting cancelled-question hint surfaces in message stream.

Sequence Diagram

sequenceDiagram
    participant User
    participant UI as UI/MessagePart
    participant Proc as Session Processor
    participant Tool as Question Tool
    participant Q as Question Service
    participant Abort as AbortController

    User->>Proc: request session.abort / halt
    Proc->>Abort: propagate abort via ctx.abort
    Abort->>Tool: ctx.abort fires
    Tool->>Q: call question.ask(..., signal: ctx.abort)
    Q->>Q: attach abort listener to signal
    Abort->>Q: signal aborts
    Q->>Q: delete pending, publish Question.Event.Rejected
    Q-->>Tool: reject ask promise with RejectedError(cancelled:true)
    Tool-->>Proc: error propagates to processor
    Proc->>Proc: detect cancelled question → set metadata.interrupted = true
    Proc->>UI: update part state (error + interrupted metadata)
    UI->>UI: detect metadata.interrupted === true
    UI-->>User: render interrupted hint (i18n key)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

[Task] Backend SessionBlocker ledger replaces frontend snapshot/clock auto-heal #433 — overlaps the frontend auto-heal snapshot/clock/reverify plan; this PR implements those modules.

Possibly related PRs

fix(app): stabilize shell navigation state #424 — touches the same question-fallback logic and tests.
fix(app): add fallback for lost question.asked SSE events #379 — prior recovery/refetch changes around session composer and question recovery.
feat(opencode): add SSE replay for missed session events #401 — related changes to e2e backend environment setup / credential handling.

Suggested labels

bug, P1, app, ui, harness

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(app): recover hidden question blockers' accurately describes the main change—recovering stuck/hidden question tool sessions through server-side teardown and client-side auto-healing.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description is comprehensive and well-structured, covering all required sections with sufficient detail about changes, rationale, risks, and verification.

gemini-code-assist

Code Review

This pull request addresses issue #419 by improving the handling of cancelled question tools, ensuring that pending UI elements are cleared and the user is provided with a friendly hint. Key changes include integrating AbortSignal into the Question service for reliable cancellation, refactoring the fallback logic to use unique tool identities (messageID and callID), and updating the UI and i18n strings to handle interrupted states. Additionally, the E2E test setup was modified to strip host-provided AI credentials. A critical issue was identified in the Question service regarding the incorrect usage of Effect.runFork within a callback, which lacks the necessary fiber context and requires the use of a captured runtime.

Edge-triggered clock that arms a HEAL_DELAY_MS timer when the snapshot transitions into missingRunning and clears it when the snapshot leaves. Map entry is deleted before any await to guarantee at-most-once fire per arm. Reverify is consumer-supplied so the 4-guard re-check lives at the call site. tick() is exposed for tests because the SSR build of solid-js under bun does not propagate signal updates through createEffect.
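A compact sketch of the edge-triggered arming, with injectable timers so it can be driven synchronously. `createClock`, `observe`, and the `Timers` interface are hypothetical, the 3000 ms delay is an assumed value, and error handling around halt is left out.

```typescript
type Snapshot = "none" | "ready" | "missingRunning"
type Timers = { set(fn: () => void, ms: number): number; clear(handle: number): void }

const HEAL_DELAY_MS = 3000 // assumed value for illustration

function createClock(opts: {
  timers: Timers
  reverify: (sessionID: string) => Promise<boolean>
  halt: (sessionID: string) => Promise<void>
}) {
  const pending = new Map<string, number>() // sessionID -> timer handle
  const lastSeen = new Map<string, Snapshot>()

  async function fire(sessionID: string): Promise<void> {
    // Delete before any await so a given arm fires at most once.
    if (!pending.delete(sessionID)) return
    if (await opts.reverify(sessionID)) await opts.halt(sessionID)
  }

  return {
    // Call whenever the snapshot for a session may have changed.
    observe(sessionID: string, snapshot: Snapshot): void {
      const previous = lastSeen.get(sessionID)
      lastSeen.set(sessionID, snapshot)
      if (snapshot === "missingRunning") {
        // Arm only on the transition into missingRunning, not on every tick.
        if (previous === "missingRunning") return
        pending.set(sessionID, opts.timers.set(() => void fire(sessionID), HEAL_DELAY_MS))
        return
      }
      // Leaving missingRunning clears any armed timer.
      const handle = pending.get(sessionID)
      if (handle !== undefined) {
        opts.timers.clear(handle)
        pending.delete(sessionID)
      }
    },
    armed: (sessionID: string): boolean => pending.has(sessionID),
  }
}
```

Passing `{ set: setTimeout, clear: clearTimeout }` style wrappers would give real timers; a test can pass a fake that records callbacks and fires them by hand.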

Hoist the halt helper above createSessionComposerState so the auto-heal clock inside createSessionBlockers can call it. Threading the dependency keeps the SDK + sync wiring at the page level and lets the blocker hook stay free of Page-only context.

The clock arms only when the snapshot reducer reports missingRunning; its reverify runs four guards before halting:
1. snapshot still missingRunning,
2. active session + directory unchanged since arm,
3. session still busy,
4. server question.list confirms the running part is still uncovered (delegates to findRunningQuestionFallbackSession so auto-heal and the recovery dock cannot disagree).

When the server already covers the question we write back into sync and abort the halt. When the server call itself fails we proceed to halt; the user has been hung for HEAL_DELAY_MS so surfacing the interrupted card is safer than continuing to wait.

When the recovery clock fires and aborts the session, the queued followup must auto-send on the next busy=false tick. The new test walks busy=true→false with blocked=false (matching the auto-heal flow where the dock never surfaces) and checks the predicate.

…st errors

Five crosscheck-driven fixes on the auto-heal clock:
- Pass a non-swallowing halt variant to the clock so its 'halt failed' warn actually fires when sdk.session.abort rejects (session.tsx kept a swallowing variant for sessionRevert which already chains its own catch).
- Recovery clock now forgets the previous active session's pending timer and lastSeen entry on navigation, so coming back to a still-stuck session re-arms cleanly. Also bounds lastSeen to one entry at a time.
- Reverify returns proceed:false on question.list() failures instead of halting blindly: a transient blip should not kill a possibly-healthy session, navigation cleanup will retry the next time the user returns.
- Re-check guards 1-3 after the question.list() await so a snapshot or busy transition during the round-trip cannot lead to a stale halt.
- Remove the duplicate clock.dispose() onCleanup at the call site; the clock's own onCleanup is the single owner.

Both crosscheck reviewers in round 2 converged on the same dead-end: when reverify returned proceed:false on a question.list() blip, the pending entry was already deleted and lastSeen[sid] still read missingRunning, so a sticky stuck session would never re-arm without user navigation. ReverifyOutcome now carries an optional retry flag; the clock arms one follow-up timer when reverify asks for it. The use-session-blockers list() failure path uses this so a single transient error costs another HEAL_DELAY_MS rather than disabling auto-heal entirely. Snapshot-edge proceed:false cases (state moved away, server confirmed covered) keep the clean dead-end.

Two consecutive crosscheck rounds independently flagged this branch as a "dead end" because the warn line reads as if it just returns proceed:false. The retry:true contract is in question-recovery-clock.ts, one file away. A pointer comment keeps the next reviewer on the rails.

coderabbitai

🧹 Nitpick comments (1)

packages/app/src/pages/session/blockers/question-recovery-clock.test.ts (1)

2-2: ⚡ Quick win

Use a single createStore for the harness state.

These two Solid signals are just holding one small, related state object. Folding them into a store matches the repo convention and keeps the harness updates less split.

Proposed refactor

-import { createRoot, createSignal } from "solid-js"
+import { createRoot } from "solid-js"
+import { createStore } from "solid-js/store"
...
-    const [snap, setS] = createSignal<QuestionRecoverySnapshot>(none)
-    const [sid, setSidSignal] = createSignal<string | undefined>(overrides?.initialSid ?? "s")
+    const [state, setState] = createStore({
+      snap: none as QuestionRecoverySnapshot,
+      sid: (overrides?.initialSid ?? "s") as string | undefined,
+    })
     setSnap = (s) => {
-      setS(s)
+      setState("snap", s)
       clock.tick()
     }
     setSid = (s) => {
-      setSidSignal(s)
+      setState("sid", s)
       clock.tick()
     }
     clock = createQuestionRecoveryClock({
-      snapshot: snap,
-      activeSessionID: sid,
+      snapshot: () => state.snap,
+      activeSessionID: () => state.sid,

As per coding guidelines, packages/app/**/*.{ts,tsx,js,jsx}: Always prefer createStore over multiple createSignal calls in SolidJS.

Also applies to: 72-73

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@packages/app/src/pages/session/blockers/question-recovery-clock.test.ts` at
line 2, Replace the two separate Solid signals used in the test harness with a
single createStore to hold the combined state: locate the createSignal usages
(createSignal) inside the createRoot block in this test and create a store via
createStore that contains both pieces of state, update all reads/writes to use
the store accessor/mutator, and remove the extra createSignal imports; also
update imports to include createStore and remove the now-unused createSignal
references and apply the same change for the other occurrences referenced around
lines 72-73.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between f5adf51 and 39d22d3.

📒 Files selected for processing (8)

packages/app/src/pages/session.tsx
packages/app/src/pages/session/blockers/question-recovery-clock.test.ts
packages/app/src/pages/session/blockers/question-recovery-clock.ts
packages/app/src/pages/session/blockers/question-recovery-snapshot.test.ts
packages/app/src/pages/session/blockers/question-recovery-snapshot.ts
packages/app/src/pages/session/blockers/use-session-blockers.ts
packages/app/src/pages/session/composer/session-composer-state.ts
packages/app/src/pages/session/use-session-followups.test.ts

The retry path could loop forever if question.list() stayed broken — every 3 s the clock would re-arm and pound the failing endpoint. Per arm we now allow at most one follow-up attempt; a second consecutive transient failure stops and waits for a fresh snapshot edge. The new test locks both the bound and the snapshot-edge-revives-it semantics.

The abort listener fires from the JS event loop, outside any fiber, so Effect.runFork(...) was using the empty default runtime — bus.publish worked only because InstanceState.bind restored ALS as a fallback. Capture the parent's Effect context once and Effect.provide it to the publish + Deferred.fail forks, so the InstanceRef + service layer flow through explicitly. The bind wrap is kept for log.info and any consumer still on the ALS path. The existing "ask - publishes question.rejected on input.signal abort" integration test already exercises this path; 40/40 question tests pass.

Extract the 4-guard reverify wiring from createSessionBlockers into a pure module so it can be unit-tested without standing up the full sdk + sync + permission + language provider tree. Locks each guard (snapshot / session+directory / busy / server-still-uncovered), the post-await re-check, and both branches of the server response (covered → hydrate sync; still uncovered → license halt).
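A sketch of what such a pure reverify can look like, under the assumption that all collaborators are injected. The `ReverifyDeps` fields here are hypothetical and simpler than the real module; `stillUncovered` stands in for the fallback check and `listQuestions` for question.list().

```typescript
type Snapshot = "none" | "ready" | "missingRunning"
type Question = { sessionID: string; tool?: { messageID: string; callID: string } }
type ReverifyOutcome = { proceed: boolean; retry?: boolean }

type ReverifyDeps = {
  snapshot: () => Snapshot
  activeSessionID: () => string | undefined
  directory: () => string
  isBusy: (sessionID: string) => boolean
  listQuestions: () => Promise<ReadonlyArray<Question>>
  stillUncovered: (questions: ReadonlyArray<Question>) => boolean
  hydrate: (questions: ReadonlyArray<Question>) => void
}

async function reverify(
  deps: ReverifyDeps,
  armed: { sessionID: string; directory: string },
): Promise<ReverifyOutcome> {
  // Guards 1-3 are local and cheap, so they can run before and after the await.
  const localGuardsHold = () =>
    deps.snapshot() === "missingRunning" &&
    deps.activeSessionID() === armed.sessionID &&
    deps.directory() === armed.directory &&
    deps.isBusy(armed.sessionID)

  if (!localGuardsHold()) return { proceed: false }

  let questions: ReadonlyArray<Question>
  try {
    questions = await deps.listQuestions()
  } catch {
    // Transient server failure: do not halt, ask the clock for a follow-up attempt.
    return { proceed: false, retry: true }
  }

  // State may have moved during the round trip, so re-check guards 1-3.
  if (!localGuardsHold()) return { proceed: false }

  // Guard 4: the server must agree that the running part is still uncovered.
  if (!deps.stillUncovered(questions)) {
    deps.hydrate(questions) // covered on the server: write it back into sync
    return { proceed: false }
  }
  return { proceed: true } // halt is licensed
}
```

Since every dependency is a plain function, each guard can be flipped independently in a unit test without any provider tree.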

The hint must reappear when metadata.interrupted flips from undefined to true *without* a page reload. In Solid that requires reading partMetadata as an accessor over part().state, not a setup-time snapshot. Add a structural assertion locking the accessor pattern plus a unit test on the metadata extractor that covers the shape variations the live message stream actually emits (initial undefined, gained-on-update, fresh reference for downstream equality checks). A full render harness (@solidjs/testing-library + happydom for the ui package) was considered but is infrastructure work outside this PR's scope; the structural + extractor coverage is enough to trip a future "let me memo this once" refactor before it reaches users.
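The accessor versus snapshot distinction can be shown without Solid at all. The sketch below uses plain functions and a hypothetical `Part` shape; it is not the message-part.tsx code, only an illustration of why a setup-time read goes stale.

```typescript
// Hypothetical minimal part shape for illustration.
type Part = { state: { status: "running" | "error"; metadata?: { interrupted?: boolean } } }

// Pure extractor: true only for an explicit `interrupted: true`.
function isInterrupted(part: Part): boolean {
  return part.state.metadata?.interrupted === true
}

// Accessor form: re-reads the current part on every call.
function interruptedAccessor(part: () => Part): () => boolean {
  return () => isInterrupted(part())
}

// Anti-pattern for comparison: evaluates once at setup and never updates.
function interruptedSnapshot(part: () => Part): () => boolean {
  const value = isInterrupted(part())
  return () => value
}
```

When the part later gains `metadata.interrupted`, the accessor form reports true while the snapshot form keeps returning its setup-time false.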

R5: collapse the duplicated AbortSignal vs onInterrupt explainers in Question.ask() into one short note — the lengthier history belongs in #419 and the test names. R6: replace the hand-rolled wait-for-pending loop with a waitForPending() helper that asserts the pending question actually appeared. Without the assertion the abort tests would pass even if Question.ask never reached the publish path (timeout silently → controller.abort() is a no-op → events stay empty → toHaveLength(1) fires only because we expected it).
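For R6, a helper along these lines would make the precondition explicit. This is a sketch with an assumed signature (a `list` callback, a timeout, and a poll interval), not the helper from question.test.ts.

```typescript
// Polls `list` until at least one pending item appears, and throws on timeout
// so a test cannot silently continue without a pending question.
async function waitForPending<T>(
  list: () => Promise<ReadonlyArray<T>>,
  timeoutMs: number = 1000,
  intervalMs: number = 10,
): Promise<T> {
  const deadline = Date.now() + timeoutMs
  for (;;) {
    const pending = await list()
    const first = pending[0]
    if (first !== undefined) return first
    if (Date.now() >= deadline) {
      throw new Error(`no pending question appeared within ${timeoutMs}ms`)
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs))
  }
}
```

An abort test would await this helper before calling `controller.abort()`, so a run where the question never reached the pending state fails at the helper instead of passing by accident.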

Replace colloquial phrasing with concise written form: "尚未收到回答" / "如需继续，请在下方重新说明" reads more like a system notice and matches the surrounding zh strings. Behaviour unchanged.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/app/src/pages/session/blockers/question-recovery-reverify.ts`:
- Around line 5-14: The ReverifyDeps<Q> generic is too loose and forces callers
to cast to never, causing questionRecoveryReverify() to pass incorrectly shaped
objects into findRunningQuestionFallbackSession(); tighten the contract by
changing ReverifyDeps<Q> to require the actual synced question/message/part
shapes used by findRunningQuestionFallbackSession() (i.e., questions with
tool.messageID/callID, parts with top-level messageID/callID, and the
messagesFor return type matching the message shape), remove the unsafe "as
never" workarounds, and update listQuestions, partsByMessageID, and messagesFor
signatures to reflect those precise types so callers and tests must provide
correctly shaped data.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 39d22d3 and 1c3fc14.

📒 Files selected for processing (9)

packages/app/src/pages/session/blockers/question-recovery-clock.test.ts
packages/app/src/pages/session/blockers/question-recovery-clock.ts
packages/app/src/pages/session/blockers/question-recovery-reverify.test.ts
packages/app/src/pages/session/blockers/question-recovery-reverify.ts
packages/app/src/pages/session/blockers/use-session-blockers.ts
packages/opencode/src/question/index.ts
packages/opencode/test/question/question.test.ts
packages/ui/src/components/message-part-stale.test.ts
packages/ui/src/i18n/zh.ts

✅ Files skipped from review due to trivial changes (1)

packages/ui/src/i18n/zh.ts

🚧 Files skipped from review as they are similar to previous changes (3)

packages/opencode/test/question/question.test.ts
packages/app/src/pages/session/blockers/question-recovery-clock.ts
packages/opencode/src/question/index.ts

Q generic now extends `{ sessionID; tool?: { messageID; callID }; id? }` matching the real SDK shape, instead of top-level messageID/callID. To let reverify pass `Q[]` straight into findRunningQuestionFallbackSession without `as never`, the fallback's syncQuestions input is widened to a structural QuestionFallbackEntry shape (`{ tool? }`); QuestionRequest[] callers still satisfy it via subtyping. Also fixes the reverify tests that previously passed by coincidence: fake questions used top-level `messageID/callID` (which `q.tool` reads as undefined → legacy bucket), and running parts put `callID` inside `state` instead of at part level. With both shapes corrected, identity matching is now actually exercised. Adds an identity-mismatch test: server returns same-session question with a different `tool.callID` → proceed:true (running call remains uncovered, halt is licensed).

Trim overlapping explanations down to three load-bearing facts: input.signal is the production cancel channel, Effect.onInterrupt is defence for direct fiber kill, and the abort callback fires from the JS event loop so it needs the captured Effect context. No behavioural change.
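The first and third facts can be illustrated with a plain TypeScript sketch that leaves Effect out entirely. The `pending` map, the `published` array (a stand-in for the bus) and the `ask`/`reply` helpers are assumptions made for illustration; the name `failFromAbort` is borrowed from the PR, but the body below is not the real implementation and does not model the captured Effect context.

```typescript
type PendingEntry = {
  resolve: (answer: string) => void
  reject: (reason: Error) => void
}

const pending = new Map<string, PendingEntry>()
const published: string[] = [] // stand-in for a bus; records event names

// Tear down the pending entry and publish the rejection in one synchronous step.
function failFromAbort(id: string): void {
  const entry = pending.get(id)
  if (entry === undefined) return // already answered or rejected
  pending.delete(id)
  published.push(`question.rejected:${id}`)
  entry.reject(new Error("session cancelled"))
}

function ask(id: string, signal: AbortSignal): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("cancelled before the question was asked"))
      return
    }
    pending.set(id, { resolve, reject })
    // The listener runs synchronously inside controller.abort(), called from
    // ordinary JS rather than from inside any fiber.
    signal.addEventListener("abort", () => failFromAbort(id), { once: true })
  })
}

function reply(id: string, answer: string): void {
  const entry = pending.get(id)
  if (entry === undefined) return
  pending.delete(id)
  entry.resolve(answer)
}

const controller = new AbortController()
ask("q1", controller.signal).catch(() => {
  // the caller would surface the cancellation here
})
controller.abort() // pending no longer holds "q1" after this line
```

Because the abort listener runs outside any Effect fiber, the real handler has to carry whatever context it needs with it, which is the reason the commit message gives for capturing the Effect context.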

ReverifyDeps.partsByMessageID and messagesFor previously returned `unknown`, forcing two `as never` casts at the fallback call site that masked any caller wiring up a wrong shape. Tighten to ReadonlyArray<Part> / ReadonlyArray<Message> (the SDK shapes that fallback actually reads), and widen findRunningQuestionFallbackSession's input to the same readonly shapes. QuestionRequest[] callers in snapshot.ts and use-session-blockers.ts still satisfy the contract via covariant subtyping; the test harness casts terse fake fixtures to the SDK shapes at the deps boundary so unit tests stay short while production callers must wire up the real types.
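A small sketch of why a readonly input keeps existing callers compiling without casts. The `Part` and `ToolPart` shapes and the `toolCallIDs` function are simplified, hypothetical stand-ins, not the SDK types.

```typescript
// Hypothetical, simplified shapes; the real SDK types carry more fields.
type Part = { type: string; callID?: string }
type ToolPart = { type: "tool"; callID: string; state: { status: string } }

// Accepting ReadonlyArray<Part> promises not to mutate the input, so a
// mutable array of a more specific element type can be passed as-is.
function toolCallIDs(parts: ReadonlyArray<Part>): string[] {
  const ids: string[] = []
  for (const part of parts) {
    if (part.type === "tool" && part.callID !== undefined) ids.push(part.callID)
  }
  return ids
}

const toolParts: ToolPart[] = [
  { type: "tool", callID: "c1", state: { status: "running" } },
]
const callIDs = toolCallIDs(toolParts) // no `as never` needed at the call site
```

With an `unknown` return type the same call would need a cast, and the cast would also accept a wrongly shaped value, which is the masking problem the commit removes.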

Astro-Han · 2026-05-04T14:24:57Z

Round R15–R18 follow-up

Latest review batch (Codex output) processed:

Accepted + landed in this PR

P1.1 — RejectedError.message branches on cancelled (e295dc2). Closes the gap where processor.failToolCall writes errorMessage(error) for the abort-signal path and the LLM read "The user dismissed this question" even when the run was cancelled mid-question. Now both paths (failToolCall and the legacy fiber-cleanup path at processor.ts:783) produce identical user-facing text. Also closes P3.2 at the same level.
P2.1 — extend retry budget (6435462). Replaced retried: boolean with retries: number capped at MAX_RETRIES (3). Persistent transient question.list() failures now log a structured warn (question-recovery: retry budget exhausted) before the clock stops, instead of dead-ending after a single follow-up. Fresh snapshot edge or session navigation still resets the budget — locked by the updated unit test.
P2.2 — PR body active-only clarification. Body now states the clock is single-session, active-only by design (tracks lastActiveSid, forgets the previous session's pending timer + edge state on navigation). Background sessions are not auto-healed — matches the original symptom and avoids cross-session false positives.
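The active-only behaviour in P2.2 can be pictured with the sketch below. Only `lastActiveSid` comes from the PR text; the class name `ActiveSessionClock`, its methods and the delay value are assumptions made for illustration.

```typescript
class ActiveSessionClock {
  private lastActiveSid: string | undefined = undefined
  private armedSid: string | undefined = undefined
  private timer: ReturnType<typeof setTimeout> | undefined = undefined
  private readonly delayMs: number
  private readonly onFire: (sid: string) => void

  constructor(delayMs: number, onFire: (sid: string) => void) {
    this.delayMs = delayMs
    this.onFire = onFire
  }

  // Navigation: forget the previous session's pending timer and edge state.
  setActive(sid: string): void {
    if (sid === this.lastActiveSid) return
    this.disarm()
    this.lastActiveSid = sid
  }

  // Only the active session may arm the clock; background sessions are ignored.
  arm(sid: string): boolean {
    if (sid !== this.lastActiveSid) return false
    if (this.timer !== undefined) return true // already armed
    this.armedSid = sid
    this.timer = setTimeout(() => {
      this.timer = undefined
      this.armedSid = undefined
      this.onFire(sid)
    }, this.delayMs)
    return true
  }

  disarm(): void {
    if (this.timer !== undefined) clearTimeout(this.timer)
    this.timer = undefined
    this.armedSid = undefined
  }

  get armedSession(): string | undefined {
    return this.armedSid
  }
}

const clock = new ActiveSessionClock(5_000, (sid) => console.log("reverify", sid))
clock.setActive("session-a")
const armedBackground = clock.arm("session-b") // false: not the active session
const armedActive = clock.arm("session-a") // true
clock.setActive("session-b") // navigation clears session-a's timer
```

Keeping a single timer tied to the active session is what avoids cross-session false positives, at the cost of never healing a background session.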

Deferred to #433 (commented as checklist items)

P2.3 — E2E lost-event coverage. Sync-drop simulation infra isn't present today; it is cleaner to validate against the ledger contract once #433 ([Task] Backend SessionBlocker ledger replaces frontend snapshot/clock auto-heal) lands than against the snapshot/clock fallback we plan to collapse.
P3.3 — auto-heal counters / observability. Now that R16 logs structured warns on give-up, counters/dashboards make more sense after the ledger lands and the snapshot/clock module either shrinks or goes away.

Already filed

P3.1 — SessionBlocker ledger is tracked as #433 ([Task] Backend SessionBlocker ledger replaces frontend snapshot/clock auto-heal).

Astro-Han · 2026-05-04T14:40:41Z

Round R20–R22 follow-up

Latest review batch processed:

Accepted + landed

P2.1 — escalate to halt on retry exhaust (172610b). Persistent transient question.list() failures previously stopped silently after MAX_RETRIES, leaving the user stuck. Now the clock warns (question-recovery: retry budget exhausted, escalating to halt) and falls through to halt — same conservative action the user could trigger manually. Trade-off vs the earlier "warn-only" stance: a multi-blip server outage may now produce an unwanted halt, but the prior behavior could leave hidden blockers indefinitely. Halt-as-recoverable was judged the safer default.
P2.2 — combine publish + fail in abort handler (99b87fb). Replaced two independent Effect.runFork calls with a single Effect.gen pipeline so subscribers see Rejected before any awaiter unblocks, and any internal failure surfaces through one log.error("failFromAbort failed", ...) instead of disappearing into a detached fork.
P2.3 — full chain integration test (2dd8905). New question-recovery-chain.test.ts wires snapshot reducer + clock + reverify against the same harness state and locks the recovery contract end-to-end: missingRunning edge → arm → reverify → halt; server hydration on fire writes back and skips halt; transient list() failure recovers on the bounded follow-up; snapshot flipping out of missingRunning before fire cancels cleanly.

P3 deferred / acknowledged

P3.1 SessionBlocker ledger is tracked as #433 ([Task] Backend SessionBlocker ledger replaces frontend snapshot/clock auto-heal).
P3.2 telemetry counters (armed / refetchRecovered / haltSucceeded / haltFailed / listFailed / retryExhausted) — added to the #433 checklist (R18). Wiring counters now would instrument a soon-to-shrink module.
P3.3 Refs vs Closes — already Refs #419 in the PR body. Issue stays open until production data confirms hidden-blocker reports stop.

processor.failToolCall writes errorMessage(error) into part.state.error on the abort-signal path. With cancelled === true the message getter now returns the same friendly copy the legacy fiber-cleanup path uses, so consumers (state.error, logs, telemetry) read consistent text. Refs #419.
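A sketch of an error whose message getter branches on the `cancelled` flag. The dismissal text is the string quoted in this PR; the cancellation text below is a placeholder, and both the class body and the `errorMessage` helper are assumptions rather than the real code.

```typescript
class RejectedError extends Error {
  readonly cancelled: boolean

  constructor(cancelled: boolean = false) {
    super() // no message argument, so the getter below supplies the text
    this.name = "RejectedError"
    this.cancelled = cancelled
  }

  get message(): string {
    return this.cancelled
      ? "The session was cancelled before this question was answered" // placeholder copy
      : "The user dismissed this question"
  }
}

// Hypothetical stand-in for the helper that writes into part.state.error.
function errorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error)
}

const dismissedText = errorMessage(new RejectedError())
const cancelledText = errorMessage(new RejectedError(true))
```

Because every consumer reads `message`, branching inside the getter keeps state.error, logs and telemetry consistent without touching each call site.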

Replace retried boolean with retries counter capped at MAX_RETRIES (3). Persistent transient list() failures now log a structured warn before the clock stops, instead of dead-ending after a single follow-up. A fresh snapshot edge or session navigation still resets the budget. Refs #419.

Persistent transient list() failures previously stopped silently at MAX_RETRIES, leaving the user stuck on a hidden blocker. Now the clock warns and falls through to halt, which is the same conservative action the user could trigger manually anyway. Refs #419.
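The retry budget and the escalation can be summarised as one small decision function. `MAX_RETRIES` and the warning text come from the PR; the outcome names, the `decide` function and the assumption that the budget allows three follow-ups before escalating are illustrative only.

```typescript
const MAX_RETRIES = 3

type FireOutcome = "covered" | "uncovered" | "list-failed"
type Decision = { action: "stop" | "retry" | "halt"; retries: number }

// Decide what the clock does after a timer fires and reverify has run.
function decide(
  outcome: FireOutcome,
  retries: number,
  warn: (message: string) => void,
): Decision {
  if (outcome === "covered") return { action: "stop", retries: 0 } // server knows the question
  if (outcome === "uncovered") return { action: "halt", retries: 0 } // hidden blocker confirmed
  if (retries < MAX_RETRIES) return { action: "retry", retries: retries + 1 }
  warn("question-recovery: retry budget exhausted, escalating to halt")
  return { action: "halt", retries }
}

// Simulate a server whose list() call keeps failing.
const warnings: string[] = []
const actions: string[] = []
let budget = 0
for (let attempt = 0; attempt < 4; attempt++) {
  const decision = decide("list-failed", budget, (m) => warnings.push(m))
  actions.push(decision.action)
  budget = decision.retries
}
// actions is ["retry", "retry", "retry", "halt"] and one warning was logged
```

A fresh snapshot edge or session navigation would reset `budget` to zero in this model, matching the reset behaviour described above.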

Replace two independent runFork calls with a single Effect.gen pipeline so subscribers see Rejected before any awaiter unblocks and any internal failure surfaces through one error log instead of disappearing into a detached fork. Refs #419.
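The ordering guarantee can be shown without Effect. The sketch below is a plain synchronous stand-in, not the Effect.gen pipeline itself; `AbortDeps` and `runAbortPipeline` are hypothetical names, and only the log message is taken from the PR.

```typescript
type AbortDeps = {
  publishRejected: () => void
  failAwaiter: () => void
  logError: (message: string, cause: unknown) => void
}

// One pipeline: publish first, fail second, one place where errors surface.
function runAbortPipeline(deps: AbortDeps): void {
  try {
    deps.publishRejected() // subscribers observe the rejection first
    deps.failAwaiter() // only then is the waiting caller unblocked
  } catch (cause) {
    deps.logError("failFromAbort failed", cause)
  }
}

const order: string[] = []
const errors: string[] = []
runAbortPipeline({
  publishRejected: () => order.push("published"),
  failAwaiter: () => order.push("failed"),
  logError: (message) => errors.push(message),
})
// order is ["published", "failed"] and errors is empty
```

With two detached forks neither the relative order nor the error path is fixed, which is the problem the single pipeline addresses. In this sketch a publish failure also skips the fail step; whether the real pipeline behaves that way is not stated in the commit message.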

End-to-end test wires snapshot reducer + clock + reverify against the same harness state to lock the recovery contract as a whole: edge into missingRunning arms a timer, server hydration on fire writes back and skips halt, transient list() failure recovers on the bounded follow-up, and snapshot flipping out of missingRunning before fire cancels cleanly. Refs #419.

Astro-Han added 8 commits May 4, 2026 16:03

gemini-code-assist Bot reviewed May 4, 2026

Comment thread packages/opencode/src/question/index.ts

Astro-Han mentioned this pull request May 4, 2026

[Bug] Question tool can block a session without showing the question UI #419

Closed

Astro-Han added 8 commits May 4, 2026 19:42

feat(app): add question recovery snapshot reducer

c8064c3

coderabbitai Bot reviewed May 4, 2026

Astro-Han added 5 commits May 4, 2026 21:25

Astro-Han changed the title from "fix(app): prevent stuck question dock when session is cancelled" to "fix(app): recover hidden question blockers" May 4, 2026

This was referenced May 4, 2026

[Task] Backend SessionBlocker ledger replaces frontend snapshot/clock auto-heal #433

Open

[Task] question.rejected carries explicit reason field #434

Closed

chore(ui): tighten cancelled question hint zh copy

1cfec29

Replace colloquial phrasing with concise written form: "尚未收到回答" / "如需继续，请在下方重新说明" reads more like a system notice and matches the surrounding zh strings. Behaviour unchanged.

Astro-Han force-pushed the claude/fix-question-hidden-blocker branch from 1c3fc14 to 1cfec29 May 4, 2026 13:51

coderabbitai Bot reviewed May 4, 2026

Comment thread packages/app/src/pages/session/blockers/question-recovery-reverify.ts

Astro-Han added 3 commits May 4, 2026 21:58

Astro-Han added 5 commits May 4, 2026 22:41

Astro-Han force-pushed the claude/fix-question-hidden-blocker branch from 2dd8905 to 31b256a May 4, 2026 14:42

Astro-Han merged commit 563d689 into dev May 4, 2026
20 checks passed

Astro-Han deleted the claude/fix-question-hidden-blocker branch May 4, 2026 14:53

Astro-Han merged 30 commits into dev from claude/fix-question-hidden-blocker

1 participant
