
fix(app): recover hidden question blockers#430

Merged
Astro-Han merged 30 commits into dev from claude/fix-question-hidden-blocker
May 4, 2026

Conversation

@Astro-Han
Owner

@Astro-Han Astro-Han commented May 4, 2026

Summary

Recovers hidden question blockers across two paths: (1) backend reliably tears down the pending question and publishes question.rejected when a session is cancelled mid-question, and (2) frontend now has a snapshot + auto-heal clock that detects a running question part with no sync coverage and halts the stuck session as a last resort. The cancelled question's tool part renders an interrupted hint in the message stream.

Why

When the user cancels mid-stream while the LLM has invoked the question tool, the prior code left the entry in pending forever (because EffectBridge.run.promise runs the tool via Effect.runPromise, which does not propagate parent fiber interrupts). The dock kept rendering with no way to dismiss, the tool part stayed in running state, and the user was stuck staring at a dead UI. Even after the backend fix, sync drops or worker race conditions could still leave a question part visible without a matching pending entry; the auto-heal clock guarantees recovery in that residual case. See #419.

Related Issue

Refs #419.

Human Review Status

Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.

Review Focus

  • packages/opencode/src/question/index.ts — the cancellation channel: tool's AbortSignal is now the only reliable cancel path; Effect.onInterrupt remains as defence-in-depth for direct fiber kill (layer shutdown, supervisor). failFromAbort mutates pending + publishes Rejected synchronously, captured Effect.context<never>() provides Instance ALS to the fork'd bus.publish when fired from the JS event loop.
  • RejectedError.cancelled flag — distinguishes session-cancel (signal/interrupt/dispose, sets metadata.interrupted) from explicit user dismissal (no hint).
  • packages/app/src/pages/session/blockers/question-fallback.ts — multi-pending recovery: identity matching by (messageID, callID), with pooled buckets so legacy entries without identity can absorb running parts with identity.
  • packages/app/src/pages/session/blockers/question-recovery-{snapshot,clock,reverify}.ts — auto-heal: snapshot reducer classifies (none / ready / missingRunning), clock arms an edge-triggered timer on missingRunning, reverify re-checks four guards and re-pulls question.list() before halting. The clock is single-session, active-only by design: it tracks lastActiveSid and forgets the previous session's pending timer + edge state on navigation. Background sessions are NOT auto-healed — this matches the original symptom (user stares at a stuck active session with no way out) and avoids cross-session false positives. Bounded retry: up to MAX_RETRIES (3) follow-up attempts per arm; if the budget exhausts, the clock logs a structured warn and escalates to halt rather than leave the user stuck on a hidden blocker. Fresh snapshot edge or session navigation still resets the budget for a new arm.
  • packages/app/e2e/backend.ts — host AI provider env vars are scrubbed from the spawned worker backend so the e2e fixture's OPENCODE_E2E_LLM_URL routing always wins.
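The cancellation channel described in the first bullet can be sketched roughly as follows. This is a minimal sketch in plain TypeScript, not the Effect-based code in packages/opencode/src/question/index.ts: `QuestionRegistry`, `subscribe`, `list`, `reply`, and `fail` are hypothetical names, and only the ordering (remove the pending entry, then publish the rejection, then reject the waiter with `cancelled: true`) is taken from the description above.

```typescript
// Hypothetical stand-in for Question.RejectedError with the optional `cancelled` flag.
class RejectedError extends Error {
  readonly cancelled: boolean
  constructor(options: { cancelled?: boolean } = {}) {
    super("question rejected")
    this.name = "RejectedError"
    this.cancelled = options.cancelled === true
  }
}

type RejectedListener = (event: { type: "question.rejected"; id: string }) => void
type Waiter = { resolve: (answer: string) => void; reject: (err: Error) => void }

// Illustrative registry; the real service is built on Effect and a bus.
class QuestionRegistry {
  private pending = new Map<string, Waiter>()
  private listeners: RejectedListener[] = []

  subscribe(fn: RejectedListener): void {
    this.listeners.push(fn)
  }

  list(): string[] {
    return [...this.pending.keys()]
  }

  ask(id: string, signal?: AbortSignal): Promise<string> {
    return new Promise<string>((resolve, reject) => {
      const onAbort = () => this.fail(id, new RejectedError({ cancelled: true }))
      // Always detach the abort listener once the question settles.
      const cleanup = () => signal?.removeEventListener("abort", onAbort)
      this.pending.set(id, {
        resolve: (answer) => { cleanup(); resolve(answer) },
        reject: (err) => { cleanup(); reject(err) },
      })
      if (signal?.aborted) onAbort()
      else signal?.addEventListener("abort", onAbort, { once: true })
    })
  }

  reply(id: string, answer: string): void {
    const entry = this.pending.get(id)
    if (!entry) return
    this.pending.delete(id)
    entry.resolve(answer)
  }

  // Delete FIRST, then publish, so a subscriber that calls list() from the
  // event handler can never observe a ghost entry.
  private fail(id: string, err: RejectedError): void {
    const entry = this.pending.get(id)
    if (!entry) return
    this.pending.delete(id)
    for (const fn of this.listeners) fn({ type: "question.rejected", id })
    entry.reject(err)
  }
}
```

Because the whole teardown runs synchronously inside the abort listener, a caller that invokes `controller.abort()` sees an empty `list()` as soon as the call returns.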

Risk Notes

  • Behavioral: question tool error rendering changes. A user-dismissed question now does NOT show the "interrupted" hint (previously it did, conflated with session cancel).
  • Behavioral: when the auto-heal clock fires, the session is halted (session.abort) — same effect as the user pressing stop. Guarded by four pre-fire checks + post-await re-check + server reverify, so halt should only trigger on genuinely stuck sessions.
  • Schema: Question.RejectedError gains an optional cancelled boolean. All existing new RejectedError() callers (reject(), finalizer, etc.) keep default behavior.
  • No data migration. No new dependencies. No platform-specific code.

How To Verify

opencode typecheck: clean
app typecheck: clean
opencode unit tests (test/question test/session test/permission): 675 pass / 0 fail
question-fallback.test.ts: 9 pass / 0 fail
question-recovery-snapshot.test.ts + question-recovery-clock.test.ts + question-recovery-reverify.test.ts: pass
message-part-stale.test.ts: 6 pass / 0 fail
e2e cancelled-question test (with GEMINI_API_KEY=fake set): 1 pass / 6.0s
e2e cancelled-question test (env clean): 1 pass / 6.4s

Screenshots or Recordings

N/A — UI change is conditional rendering of an existing tool-error hint variant; covered by the e2e test which asserts the hint copy and dock dismissal. Auto-heal clock is non-visual (it halts the session, no new UI).

Checklist

  • Human review status is stated above as pending, approved, or not required
  • I linked the related issue, or stated why there is no issue
  • This PR has type, scope, and priority labels, or I requested maintainer labeling
  • I described the review focus and any meaningful risks
  • I listed the relevant verification steps and the key result for each
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope
  • I manually checked visible UI or copy changes when needed, with screenshots or recordings
  • I considered macOS and Windows impact for desktop, packaging, updater, signing, paths, shell, or permissions changes
  • I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant
  • I reviewed the final diff for unrelated changes and suspicious dependency changes
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English

Summary by CodeRabbit

  • New Features

    • Sessions can auto-recover stuck question flows and optionally halt a running session to heal pending questions.
    • Interrupted questions now show a clear, localized hint in the message stream prompting users to ask again.
  • Bug Fixes

    • Better heuristics to detect and reconcile pending questions across session sync state.
    • Cleaner, friendlier error text when questions are cancelled.
  • Tests

    • Expanded end-to-end and unit tests covering cancellation, recovery clock behavior, i18n, and environment isolation.

Astro-Han added 8 commits May 4, 2026 16:03
Question.ask used to silently delete its pending entry when the fiber was
interrupted (e.g. session cancel) without telling subscribers. The frontend
question store would then keep an orphan entry forever and the dock could
end up hidden while the assistant still appeared blocked. See issue #419.

Add an Effect.onInterrupt that removes the pending entry FIRST and then
publishes question.rejected, so any subscriber that races on the event
and calls question.list() can never see a ghost entry. The reply / reject
/ instance-dispose paths fail the deferred normally and skip this hook,
so their existing event publishes are unaffected. The interrupt log line
carries reason: "interrupted" so post-mortems can tell user-rejection from
system cancellation.
When the processor cleans up an in-flight question tool after the run was
cancelled, it writes part.state.error which the LLM reads as the tool
result on the next turn. The generic "Tool execution aborted" string was
ambiguous between "user dismissed your question" and "the run was
cancelled before they answered" — the latter is what actually happened
here, and the wrong reading made models assume the user had refused. See
issue #419.

Rewrite to "Question cancelled before the user answered it." for question
tools only; other tools keep the existing message. This states the
certain fact (cancelled before answered) without claiming whether the
user saw the question, since they may have.
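The message selection described in this commit amounts to a small branch. The sketch below is illustrative only: `abortedToolError` and the `"question"` tool identifier are assumed names, while the two strings are the ones quoted in the commit message.

```typescript
const GENERIC_ABORT = "Tool execution aborted"
const QUESTION_CANCELLED = "Question cancelled before the user answered it."

// `toolName === "question"` is an assumption about how the tool is identified.
function abortedToolError(toolName: string): string {
  return toolName === "question" ? QUESTION_CANCELLED : GENERIC_ABORT
}
```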
The question fallback used to bail whenever sync already held any
question entry, so a model emitting parallel question tool calls with
one or more asked events lost would never recover the missed entries.
See issue #419.

Replace hasQuestionRequest with a per-(messageID, callID) identity check:
trigger recovery whenever a running question tool part on this session
has no matching sync entry. Fall back to a count check for the rare
entries that lack tool identity (seeded test fixtures). Counts and
identities stay scoped to this session so a parent walking the
parent/child tree can't mask a local loss.
When a session is cancelled while a question tool is awaiting an answer
the tool part transitions to error and the message stream renders a
generic ToolErrorCard. That card shows the raw backend error string,
which non-technical users cannot act on. See issue #419.

Recognize the cancelled-question case via metadata.interrupted (already
written by processor cleanup, so this stays decoupled from the exact
backend error string) and render a short, non-blaming hint that states
the certain fact (cancelled, no answer received) and points the user at
the prompt input. Add the i18n key in both packages/ui/src/i18n
locales.
Drive a real question tool through the cancel path: seed a question via
the LLM mock, abort the session, and assert the dock disappears and the
message stream surfaces the friendly cancelled-question hint. This is
the user-path E2E coverage required by AGENTS.md for the #419 fix.
Adds a `cancelled` flag to Question.RejectedError so the processor only
sets metadata.interrupted when the rejection came from session cancel
(signal abort, fiber interrupt, or instance dispose), not from explicit
user dismiss. Without this, an intentional dismissal renders with the
same "session was interrupted" hint as a cancel.

Also wraps the abort-signal callback with InstanceState.bind so the
fork'd bus.publish reliably has Instance ALS context when fired from
the JS event loop, and adds an explicit signal-path test (the prior
test exercised only the fiber-interrupt defence-in-depth arm).
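A sketch of how a processor might map the new flag onto part state, under the assumption that the error part carries an `error` string and a `metadata` object. `toErrorState`, `isCancelledQuestion`, and the `ToolErrorState` shape are hypothetical; only the `cancelled` flag, `metadata.interrupted`, and the error text come from this PR.

```typescript
// Hypothetical stand-in for Question.RejectedError.
class RejectedError extends Error {
  readonly cancelled: boolean
  constructor(options: { cancelled?: boolean } = {}) {
    super("question rejected")
    this.name = "RejectedError"
    this.cancelled = options.cancelled === true
  }
}

interface ToolErrorState {
  status: "error"
  error: string
  metadata: { interrupted?: true }
}

// Checks name + flag rather than instanceof so the guard also works
// when the class is compiled to an older target.
function isCancelledQuestion(err: unknown): err is RejectedError {
  return err instanceof Error && err.name === "RejectedError" && (err as RejectedError).cancelled === true
}

function toErrorState(err: unknown): ToolErrorState {
  if (isCancelledQuestion(err)) {
    // Session cancel (signal abort, fiber interrupt, instance dispose): show the hint.
    return { status: "error", error: "Question cancelled before the user answered it.", metadata: { interrupted: true } }
  }
  // Explicit user dismissal and every other failure: no interrupted hint.
  const message = err instanceof Error ? err.message : String(err)
  return { status: "error", error: message, metadata: {} }
}
```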
A sync question entry that lacks tool identity (legacy or seeded data)
should still cover any one running part, regardless of whether the
running part has identity. Previously the identity check returned the
session immediately on any uncovered running-with-identity part, so a
mixed old/new state could trigger fallback recovery unnecessarily.

Pool both running-with-identity and running-without-identity into one
uncovered count and only fire when the total exceeds entries-without-
identity.
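The pooled check described above can be sketched as a pure function. This is a simplified illustration, not the body of findRunningQuestionFallbackSession: `needsFallback`, `RunningPart`, and `SyncEntry` are hypothetical names, and session scoping is left out.

```typescript
type ToolIdentity = { messageID: string; callID: string }
type SyncEntry = { tool?: ToolIdentity }
type RunningPart = { messageID?: string; callID?: string }

// Collision-safe key for a (messageID, callID) pair.
const identityKey = (messageID: string, callID: string) => JSON.stringify([messageID, callID])

function needsFallback(running: RunningPart[], entries: SyncEntry[]): boolean {
  const covered = new Set<string>()
  let entriesWithoutTool = 0
  for (const entry of entries) {
    if (entry.tool) covered.add(identityKey(entry.tool.messageID, entry.tool.callID))
    else entriesWithoutTool++
  }

  // Pool running parts with and without identity into one uncovered count.
  let uncoveredRunning = 0
  for (const part of running) {
    if (
      part.messageID !== undefined &&
      part.callID !== undefined &&
      covered.has(identityKey(part.messageID, part.callID))
    ) {
      continue
    }
    uncoveredRunning++
  }

  // Each identity-less entry can absorb any one uncovered running part.
  return uncoveredRunning > entriesWithoutTool
}
```

With one legacy entry and one running part that has identity, the counts are 1 and 1, so recovery does not fire. That is the mixed old/new case this commit fixes.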
Without this, a developer with e.g. GEMINI_API_KEY exported on their
host machine inherits that env into the spawned worker backend, and
the auto-picked default model becomes a real provider — bypassing the
in-process OPENCODE_E2E_LLM_URL fixture and silently making real API
calls (or failing with auth errors).

Strip *_API_KEY / *_API_TOKEN plus a small explicit list for the long
tail (GITHUB_TOKEN for Copilot, HF_TOKEN, AWS_BEARER_TOKEN_BEDROCK,
GOOGLE_APPLICATION_CREDENTIALS, etc).
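A minimal sketch of such an env scrub, assuming the suffix patterns and the four variable names listed above. `scrubProviderEnv` is a hypothetical name, and the explicit list in packages/app/e2e/backend.ts may contain more entries than the ones named in this commit message.

```typescript
// Only the names spelled out in the commit message; the real list may be longer.
const EXPLICIT_KEYS = new Set([
  "GITHUB_TOKEN",
  "HF_TOKEN",
  "AWS_BEARER_TOKEN_BEDROCK",
  "GOOGLE_APPLICATION_CREDENTIALS",
])

// Matches *_API_KEY and *_API_TOKEN.
const SUFFIX_PATTERN = /_API_(KEY|TOKEN)$/

function scrubProviderEnv(env: Record<string, string | undefined>): Record<string, string | undefined> {
  const clean: Record<string, string | undefined> = {}
  for (const [name, value] of Object.entries(env)) {
    if (SUFFIX_PATTERN.test(name) || EXPLICIT_KEYS.has(name)) continue
    clean[name] = value
  }
  return clean
}
```

The result would then be passed as the environment of the spawned worker, so variables such as OPENCODE_E2E_LLM_URL survive while host credentials do not.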
@coderabbitai
Contributor

coderabbitai Bot commented May 4, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6acb3fae-ed49-4673-b0c1-401893fdcfe0

📥 Commits

Reviewing files that changed from the base of the PR and between 1c3fc14 and 31b256a.

📒 Files selected for processing (10)
  • packages/app/src/pages/session/blockers/question-fallback.ts
  • packages/app/src/pages/session/blockers/question-recovery-chain.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-clock.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-clock.ts
  • packages/app/src/pages/session/blockers/question-recovery-reverify.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-reverify.ts
  • packages/app/src/pages/session/blockers/question-recovery-snapshot.ts
  • packages/opencode/src/question/index.ts
  • packages/opencode/test/question/question.test.ts
  • packages/ui/src/i18n/zh.ts
📝 Walkthrough

Adds AbortSignal-driven cancellation to Question.ask, marks cancelled question tool parts as interrupted with a friendly error, introduces a recovery snapshot/clock/reverify flow to auto-halt stuck sessions, updates composer wiring and UI/i18n for interrupted hints, extends tests, and scrubs host AI credentials in the e2e backend fixture.

Changes

Question cancellation, recovery clock, UI, and tests

| Layer | File(s) | Summary |
| --- | --- | --- |
| API / Data Shape | packages/opencode/src/question/index.ts | ask(...) gains optional signal?: AbortSignal; RejectedError payload adds optional cancelled?: boolean. |
| Core Question Cancellation | packages/opencode/src/question/index.ts | ask registers abort listener that deletes pending entry, publishes Event.Rejected, and rejects deferred with RejectedError({ cancelled: true }); ensures cleanup removes listener. |
| Tool Wiring | packages/opencode/src/tool/question.ts | QuestionTool.execute passes signal: ctx.abort into question.ask. |
| Session Processor | packages/opencode/src/session/processor.ts | failToolCall and cleanup detect cancelled Question.RejectedError and set metadata.interrupted: true; cancelled question error text becomes "Question cancelled before the user answered it.". |
| Fallback Heuristic | packages/app/src/pages/session/blockers/question-fallback.ts, packages/app/src/pages/session/blockers/question-fallback.test.ts | findRunningQuestionFallbackSession now accepts syncQuestions, builds covered (messageID,callID) set, counts uncovered running question parts, and triggers fallback only when uncoveredRunning > entriesWithoutTool; tests updated/expanded for identity, count fallback, and regression #419. |
| Recovery Snapshot & Reverify | packages/app/src/pages/session/blockers/question-recovery-snapshot.ts, packages/app/src/pages/session/blockers/question-recovery-snapshot.test.ts, packages/app/src/pages/session/blockers/question-recovery-reverify.ts, packages/app/src/pages/session/blockers/question-recovery-reverify.test.ts | Adds QuestionRecoverySnapshot (`none |
| Recovery Clock | packages/app/src/pages/session/blockers/question-recovery-clock.ts, packages/app/src/pages/session/blockers/question-recovery-clock.test.ts | New createQuestionRecoveryClock arms timers on missingRunning edges, calls reverify before halting, supports a single bounded retry, per-session pending state, disposal, and comprehensive unit tests using a fake clock. |
| Session Blockers / Composer Wiring | packages/app/src/pages/session/blockers/use-session-blockers.ts, packages/app/src/pages/session/composer/session-composer-state.ts, packages/app/src/pages/session.tsx | createSessionBlockers and createSessionComposerState accept optional halt(sessionID); use-session-blockers wires createQuestionRecoveryClock when halt is provided; session.tsx introduces haltAbort and forwards it. |
| UI, i18n & rendering | packages/ui/src/components/message-part.tsx, packages/ui/src/i18n/en.ts, packages/ui/src/i18n/zh.ts, packages/ui/src/components/message-part-stale.test.ts | Renderer special-cases partMetadata()?.interrupted === true to show interrupted hint; adds ui.messagePart.questions.interrupted (EN/ZH) and tests verifying metadata-driven rendering and reactivity. |
| Tests & E2E harness | packages/opencode/test/*, packages/app/src/pages/session/blockers/*, packages/app/e2e/backend.ts, packages/app/e2e/session/session-composer-dock.spec.ts | Extensive test additions/updates: question cancellation event and await rejection tests, processor effect test for friendly cancelled message, recovery-clock/reverify/snapshot unit tests, fallback tests, UI tests, e2e backend env scrubbing to remove host provider creds, and an e2e test asserting cancelled-question hint surfaces in message stream. |

Sequence Diagram

sequenceDiagram
    participant User
    participant UI as UI/MessagePart
    participant Proc as Session Processor
    participant Tool as Question Tool
    participant Q as Question Service
    participant Abort as AbortController

    User->>Proc: request session.abort / halt
    Proc->>Abort: propagate abort via ctx.abort
    Abort->>Tool: ctx.abort fires
    Tool->>Q: call question.ask(..., signal: ctx.abort)
    Q->>Q: attach abort listener to signal
    Abort->>Q: signal aborts
    Q->>Q: delete pending, publish Question.Event.Rejected
    Q-->>Tool: reject ask promise with RejectedError(cancelled:true)
    Tool-->>Proc: error propagates to processor
    Proc->>Proc: detect cancelled question → set metadata.interrupted = true
    Proc->>UI: update part state (error + interrupted metadata)
    UI->>UI: detect metadata.interrupted === true
    UI-->>User: render interrupted hint (i18n key)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

Possibly related PRs

Suggested labels

bug, P1, app, ui, harness

"I hopped through code with a tiny heart drum,
A cancel came in — the pending ones numb.
I scrubbed the env, rang timers to mend,
Marked parts 'interrupted' and chased each loose end.
Now vanished questions show their hint again — yum!" 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.85% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(app): recover hidden question blockers' accurately describes the main change: recovering stuck/hidden question tool sessions through server-side teardown and client-side auto-healing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, covering all required sections with sufficient detail about changes, rationale, risks, and verification. |




@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses issue #419 by improving the handling of cancelled question tools, ensuring that pending UI elements are cleared and the user is provided with a friendly hint. Key changes include integrating AbortSignal into the Question service for reliable cancellation, refactoring the fallback logic to use unique tool identities (messageID and callID), and updating the UI and i18n strings to handle interrupted states. Additionally, the E2E test setup was modified to strip host-provided AI credentials. A critical issue was identified in the Question service regarding the incorrect usage of Effect.runFork within a callback, which lacks the necessary fiber context and requires the use of a captured runtime.

Comment thread packages/opencode/src/question/index.ts
Astro-Han added 8 commits May 4, 2026 19:42
Edge-triggered clock that arms a HEAL_DELAY_MS timer when the snapshot
transitions into missingRunning and clears it when the snapshot leaves.
Map entry is deleted before any await to guarantee at-most-once fire per
arm. Reverify is consumer-supplied so the 4-guard re-check lives at the
call site. tick() is exposed for tests because the SSR build of solid-js
under bun does not propagate signal updates through createEffect.
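The edge-triggered arming and the delete-before-fire rule can be sketched as below. This is a simplified illustration, not createQuestionRecoveryClock itself: `createRecoveryClock`, `observe`, `armed`, and the injected `Scheduler` are hypothetical, the delay is a parameter rather than the real HEAL_DELAY_MS, and the reverify step is reduced to a plain `onFire` callback.

```typescript
type Snapshot = "none" | "ready" | "missingRunning"

// Injected so tests can drive time by hand; in an app this would wrap setTimeout/clearTimeout.
interface Scheduler {
  set(fn: () => void, ms: number): unknown
  clear(handle: unknown): void
}

function createRecoveryClock(opts: {
  delayMs: number
  scheduler: Scheduler
  onFire: (sessionID: string) => void
}) {
  const pending = new Map<string, unknown>()
  const lastSeen = new Map<string, Snapshot>()

  return {
    observe(sessionID: string, snapshot: Snapshot): void {
      const previous = lastSeen.get(sessionID) ?? "none"
      lastSeen.set(sessionID, snapshot)

      if (snapshot === "missingRunning" && previous !== "missingRunning") {
        // Rising edge: arm exactly one timer for this session.
        const handle = opts.scheduler.set(() => {
          // Delete the entry before doing any work, so one arm fires at most once.
          pending.delete(sessionID)
          opts.onFire(sessionID)
        }, opts.delayMs)
        pending.set(sessionID, handle)
        return
      }

      if (snapshot !== "missingRunning" && pending.has(sessionID)) {
        // Snapshot left missingRunning: disarm.
        opts.scheduler.clear(pending.get(sessionID))
        pending.delete(sessionID)
      }
    },
    armed: (sessionID: string): boolean => pending.has(sessionID),
  }
}
```

Repeated `missingRunning` observations do not re-arm; only a transition away and back produces a new timer.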
Hoist the halt helper above createSessionComposerState so the auto-heal
clock inside createSessionBlockers can call it. Threading the dependency
keeps the SDK + sync wiring at the page level and lets the blocker hook
stay free of Page-only context.
The clock arms only when the snapshot reducer reports missingRunning;
its reverify runs four guards before halting:
1. snapshot still missingRunning,
2. active session + directory unchanged since arm,
3. session still busy,
4. server question.list confirms the running part is still uncovered
   (delegates to findRunningQuestionFallbackSession so auto-heal and
   the recovery dock cannot disagree).

When the server already covers the question we write back into sync and
abort the halt. When the server call itself fails we proceed to halt; the
user has been hung for HEAL_DELAY_MS so surfacing the interrupted card is
safer than continuing to wait.
When the recovery clock fires and aborts the session, the queued
followup must auto-send on the next busy=false tick. The new test
walks busy=true→false with blocked=false (matching the auto-heal
flow where the dock never surfaces) and checks the predicate.
…st errors

Five crosscheck-driven fixes on the auto-heal clock:

- Pass a non-swallowing halt variant to the clock so its 'halt failed'
  warn actually fires when sdk.session.abort rejects (session.tsx kept a
  swallowing variant for sessionRevert which already chains its own catch).
- Recovery clock now forgets the previous active session's pending timer
  and lastSeen entry on navigation, so coming back to a still-stuck
  session re-arms cleanly. Also bounds lastSeen to one entry at a time.
- Reverify returns proceed:false on question.list() failures instead of
  halting blindly: a transient blip should not kill a possibly-healthy
  session, navigation cleanup will retry the next time the user returns.
- Re-check guards 1-3 after the question.list() await so a snapshot or
  busy transition during the round-trip cannot lead to a stale halt.
- Remove the duplicate clock.dispose() onCleanup at the call site; the
  clock's own onCleanup is the single owner.
Both crosscheck reviewers in round 2 converged on the same dead-end:
when reverify returned proceed:false on a question.list() blip, the
pending entry was already deleted and lastSeen[sid] still read
missingRunning, so a sticky stuck session would never re-arm without
user navigation.

ReverifyOutcome now carries an optional retry flag; the clock arms one
follow-up timer when reverify asks for it. The use-session-blockers
list() failure path uses this so a single transient error costs another
HEAL_DELAY_MS rather than disabling auto-heal entirely. Snapshot-edge
proceed:false cases (state moved away, server confirmed covered) keep
the clean dead-end.
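Taken together, the guard order, the post-await re-check, and the retry flag from the commits above might look like the sketch below. `ReverifyDeps` here is a hypothetical simplification with one boolean probe per guard; the real module takes sync and SDK accessors and also hydrates sync when the server already covers the question, which this sketch omits.

```typescript
interface ReverifyOutcome {
  proceed: boolean
  retry?: boolean
}

// Hypothetical probes, one per guard.
interface ReverifyDeps {
  isMissingRunning(): boolean // guard 1: snapshot still missingRunning
  isSameSessionAndDirectory(): boolean // guard 2: unchanged since arm
  isBusy(): boolean // guard 3: session still busy
  serverStillUncovered(): Promise<boolean> // guard 4: question.list() confirms the gap
}

async function reverify(deps: ReverifyDeps): Promise<ReverifyOutcome> {
  const localGuardsHold = () =>
    deps.isMissingRunning() && deps.isSameSessionAndDirectory() && deps.isBusy()

  if (!localGuardsHold()) return { proceed: false }

  let uncovered: boolean
  try {
    uncovered = await deps.serverStillUncovered()
  } catch {
    // Transient list() failure: do not halt, but ask the clock for a follow-up timer.
    return { proceed: false, retry: true }
  }

  // State may have moved during the round-trip; re-check guards 1-3.
  if (!localGuardsHold()) return { proceed: false }

  return { proceed: uncovered }
}
```

Only the list() failure branch sets `retry`, so the cases where state moved away or the server confirmed coverage stay clean dead-ends.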
Two consecutive crosscheck rounds independently flagged this branch as a
"dead end" because the warn line reads as if it just returns proceed:false.
The retry:true contract is in question-recovery-clock.ts, one file away.
A pointer comment keeps the next reviewer on the rails.
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
packages/app/src/pages/session/blockers/question-recovery-clock.test.ts (1)

2-2: ⚡ Quick win

Use a single createStore for the harness state.

These two Solid signals are just holding one small, related state object. Folding them into a store matches the repo convention and keeps the harness updates less split.

Proposed refactor
-import { createRoot, createSignal } from "solid-js"
+import { createRoot } from "solid-js"
+import { createStore } from "solid-js/store"
...
-    const [snap, setS] = createSignal<QuestionRecoverySnapshot>(none)
-    const [sid, setSidSignal] = createSignal<string | undefined>(overrides?.initialSid ?? "s")
+    const [state, setState] = createStore({
+      snap: none as QuestionRecoverySnapshot,
+      sid: overrides?.initialSid ?? "s" as string | undefined,
+    })
     setSnap = (s) => {
-      setS(s)
+      setState("snap", s)
       clock.tick()
     }
     setSid = (s) => {
-      setSidSignal(s)
+      setState("sid", s)
       clock.tick()
     }
     clock = createQuestionRecoveryClock({
-      snapshot: snap,
-      activeSessionID: sid,
+      snapshot: () => state.snap,
+      activeSessionID: () => state.sid,

As per coding guidelines, packages/app/**/*.{ts,tsx,js,jsx}: Always prefer createStore over multiple createSignal calls in SolidJS.

Also applies to: 72-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/app/src/pages/session/blockers/question-recovery-clock.test.ts` at
line 2, Replace the two separate Solid signals used in the test harness with a
single createStore to hold the combined state: locate the createSignal usages
(createSignal) inside the createRoot block in this test and create a store via
createStore that contains both pieces of state, update all reads/writes to use
the store accessor/mutator, and remove the extra createSignal imports; also
update imports to include createStore and remove the now-unused createSignal
references and apply the same change for the other occurrences referenced around
lines 72-73.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 17910edc-d9f8-42d9-be9e-03bc1104b05e

📥 Commits

Reviewing files that changed from the base of the PR and between f5adf51 and 39d22d3.

📒 Files selected for processing (8)
  • packages/app/src/pages/session.tsx
  • packages/app/src/pages/session/blockers/question-recovery-clock.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-clock.ts
  • packages/app/src/pages/session/blockers/question-recovery-snapshot.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-snapshot.ts
  • packages/app/src/pages/session/blockers/use-session-blockers.ts
  • packages/app/src/pages/session/composer/session-composer-state.ts
  • packages/app/src/pages/session/use-session-followups.test.ts

Astro-Han added 5 commits May 4, 2026 21:25
The retry path could loop forever if question.list() stayed broken — every
3 s the clock would re-arm and pound the failing endpoint. Per arm we now
allow at most one follow-up attempt; a second consecutive transient
failure stops and waits for a fresh snapshot edge. The new test locks
both the bound and the snapshot-edge-revives-it semantics.
The abort listener fires from the JS event loop, outside any fiber, so
Effect.runFork(...) was using the empty default runtime — bus.publish
worked only because InstanceState.bind restored ALS as a fallback. Capture
the parent's Effect context once and Effect.provide it to the publish +
Deferred.fail forks, so the InstanceRef + service layer flow through
explicitly. The bind wrap is kept for log.info and any consumer still on
the ALS path.

The existing "ask - publishes question.rejected on input.signal abort"
integration test already exercises this path; 40/40 question tests pass.
Extract the 4-guard reverify wiring from createSessionBlockers into a
pure module so it can be unit-tested without standing up the full
sdk + sync + permission + language provider tree. Locks each guard
(snapshot / session+directory / busy / server-still-uncovered), the
post-await re-check, and both branches of the server response (covered
→ hydrate sync; still uncovered → license halt).
The hint must reappear when metadata.interrupted flips from undefined to
true *without* a page reload. In Solid that requires reading partMetadata
as an accessor over part().state, not a setup-time snapshot. Add a
structural assertion locking the accessor pattern plus a unit test on
the metadata extractor that covers the shape variations the live message
stream actually emits (initial undefined, gained-on-update, fresh
reference for downstream equality checks).

A full render harness (@solidjs/testing-library + happydom for the ui
package) was considered but is infrastructure work outside this PR's
scope; the structural + extractor coverage is enough to trip a future
"let me memo this once" refactor before it reaches users.
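The difference between an accessor and a setup-time snapshot can be shown without Solid at all. In the sketch below, `createPartSource` is a hypothetical stand-in for a reactive source (it is not Solid's signal API), and the `Part` shape is an assumption that keeps only the `state.metadata.interrupted` field discussed above.

```typescript
type Part = { state: { metadata?: { interrupted?: boolean } } }

// Minimal stand-in for a reactive source; Solid's real signals are not used here.
function createPartSource(initial: Part) {
  let current = initial
  return {
    part: () => current,
    set: (next: Part) => { current = next },
  }
}

const source = createPartSource({ state: {} })

// Setup-time snapshot: read once, never sees later updates.
const snapshotMetadata = source.part().state.metadata

// Accessor: re-reads part().state on every call.
const partMetadata = () => source.part().state.metadata

// The stream later delivers an update that gains metadata.interrupted.
source.set({ state: { metadata: { interrupted: true } } })

const staleHint = snapshotMetadata?.interrupted === true // stays false
const liveHint = partMetadata()?.interrupted === true // becomes true
```

A memoized or destructured read at component setup behaves like `snapshotMetadata`, which is the regression the structural assertion is meant to catch.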
R5: collapse the duplicated AbortSignal vs onInterrupt explainers in
Question.ask() into one short note — the lengthier history belongs in #419
and the test names.

R6: replace the hand-rolled wait-for-pending loop with a waitForPending()
helper that asserts the pending question actually appeared. Without the
assertion the abort tests would pass even if Question.ask never reached
the publish path (timeout silently → controller.abort() is a no-op →
events stay empty → toHaveLength(1) fires only because we expected it).
@Astro-Han Astro-Han changed the title from "fix(app): prevent stuck question dock when session is cancelled" to "fix(app): recover hidden question blockers" May 4, 2026
Replace colloquial phrasing with concise written form: "尚未收到回答" /
"如需继续,请在下方重新说明" reads more like a system notice and matches
the surrounding zh strings. Behaviour unchanged.
@Astro-Han Astro-Han force-pushed the claude/fix-question-hidden-blocker branch from 1c3fc14 to 1cfec29 Compare May 4, 2026 13:51
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/app/src/pages/session/blockers/question-recovery-reverify.ts`:
- Around line 5-14: The ReverifyDeps<Q> generic is too loose and forces callers
to cast to never, causing questionRecoveryReverify() to pass incorrectly shaped
objects into findRunningQuestionFallbackSession(); tighten the contract by
changing ReverifyDeps<Q> to require the actual synced question/message/part
shapes used by findRunningQuestionFallbackSession() (i.e., questions with
tool.messageID/callID, parts with top-level messageID/callID, and the
messagesFor return type matching the message shape), remove the unsafe "as
never" workarounds, and update listQuestions, partsByMessageID, and messagesFor
signatures to reflect those precise types so callers and tests must provide
correctly shaped data.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 0e01a28f-d733-49ea-b4ab-28c94b5a21ee

📥 Commits

Reviewing files that changed from the base of the PR and between 39d22d3 and 1c3fc14.

📒 Files selected for processing (9)
  • packages/app/src/pages/session/blockers/question-recovery-clock.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-clock.ts
  • packages/app/src/pages/session/blockers/question-recovery-reverify.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-reverify.ts
  • packages/app/src/pages/session/blockers/use-session-blockers.ts
  • packages/opencode/src/question/index.ts
  • packages/opencode/test/question/question.test.ts
  • packages/ui/src/components/message-part-stale.test.ts
  • packages/ui/src/i18n/zh.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/ui/src/i18n/zh.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/opencode/test/question/question.test.ts
  • packages/app/src/pages/session/blockers/question-recovery-clock.ts
  • packages/opencode/src/question/index.ts

Astro-Han added 3 commits May 4, 2026 21:58
Q generic now extends `{ sessionID; tool?: { messageID; callID }; id? }`
matching the real SDK shape, instead of top-level messageID/callID. To
let reverify pass `Q[]` straight into findRunningQuestionFallbackSession
without `as never`, the fallback's syncQuestions input is widened to a
structural QuestionFallbackEntry shape (`{ tool? }`); QuestionRequest[]
callers still satisfy it via subtyping.

Also fixes the reverify tests that previously passed by coincidence:
fake questions used top-level `messageID/callID` (which `q.tool` reads
as undefined → legacy bucket), and running parts put `callID` inside
`state` instead of at part level. With both shapes corrected, identity
matching is now actually exercised. Adds an identity-mismatch test:
server returns same-session question with a different `tool.callID` →
proceed:true (running call remains uncovered, halt is licensed).

Trim overlapping explanations down to three load-bearing facts:
input.signal is the production cancel channel, Effect.onInterrupt is
defence for direct fiber kill, and the abort callback fires from the
JS event loop so it needs the captured Effect context. No behavioural
change.

ReverifyDeps.partsByMessageID and messagesFor previously returned
`unknown`, forcing two `as never` casts at the fallback call site that
masked any caller wiring up a wrong shape. Tighten to ReadonlyArray<Part>
/ ReadonlyArray<Message> (the SDK shapes that fallback actually reads),
and widen findRunningQuestionFallbackSession's input to the same readonly
shapes. QuestionRequest[] callers in snapshot.ts and
use-session-blockers.ts still satisfy the contract via covariant
subtyping; the test harness
casts terse fake fixtures to the SDK shapes at the deps boundary so unit
tests stay short while production callers must wire up the real types.
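
The shape change in the three commits above can be illustrated roughly as follows. `QuestionFallbackEntry` is the name used in the commit message; the field types, `QuestionLike`, `RunningPart`, and the `coversRunningCall` helper are assumptions for this sketch, and the legacy bucket for questions without `tool` is deliberately left out.

```typescript
// Shapes paraphrased from the commit messages; field types are assumptions.
type ToolRef = { messageID: string; callID: string }
type QuestionFallbackEntry = { tool?: ToolRef }
type QuestionLike = { sessionID: string; tool?: ToolRef; id?: string }
type RunningPart = { messageID: string; callID: string } // IDs at part level, not inside state

// Hypothetical helper: true when a synced question points at this exact tool call.
function coversRunningCall(
  part: RunningPart,
  questions: ReadonlyArray<QuestionFallbackEntry>,
): boolean {
  return questions.some(
    (q) =>
      q.tool !== undefined &&
      q.tool.messageID === part.messageID &&
      q.tool.callID === part.callID,
  )
}

const part: RunningPart = { messageID: "m1", callID: "c1" }
const matching: QuestionLike[] = [
  { sessionID: "s1", tool: { messageID: "m1", callID: "c1" } },
]
const mismatched: QuestionLike[] = [
  { sessionID: "s1", tool: { messageID: "m1", callID: "c2" } },
]

// QuestionLike[] is accepted where ReadonlyArray<QuestionFallbackEntry> is
// expected through structural subtyping, so no cast is needed at the call site.
const covered = coversRunningCall(part, matching)
const uncovered = !coversRunningCall(part, mismatched)
```

The mismatched case mirrors the new identity-mismatch test: a question in the same session with a different `tool.callID` does not cover the running call.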
@Astro-Han

Round R15–R18 follow-up

Latest review batch (Codex output) processed:

Accepted + landed in this PR

  • P1.1 — RejectedError.message branches on cancelled (e295dc2). Closes the gap where processor.failToolCall writes errorMessage(error) for the abort-signal path and the LLM read "The user dismissed this question" even when the run was cancelled mid-question. Now both paths (failToolCall and the legacy fiber-cleanup path at processor.ts:783) produce identical user-facing text. Also closes P3.2 at the same level.
  • P2.1 — extend retry budget (6435462). Replaced retried: boolean with retries: number capped at MAX_RETRIES (3). Persistent transient question.list() failures now log a structured warn (question-recovery: retry budget exhausted) before the clock stops, instead of dead-ending after a single follow-up. Fresh snapshot edge or session navigation still resets the budget — locked by the updated unit test.
  • P2.2 — PR body active-only clarification. Body now states the clock is single-session, active-only by design (tracks lastActiveSid, forgets the previous session's pending timer + edge state on navigation). Background sessions are not auto-healed — matches the original symptom and avoids cross-session false positives.
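
For P1.1, a rough sketch of an error whose message depends on `cancelled`. This is not the PR's class: the PR describes a message getter, while this sketch picks the text in the constructor to stay short. The cancelled copy and the `errorMessage` helper body are placeholders; only the dismissed text is quoted from the comment above.

```typescript
// Quoted in the review comment as the text the LLM previously read.
const DISMISSED_COPY = "The user dismissed this question"
// Placeholder: the PR's actual cancelled copy is not shown on this page.
const CANCELLED_COPY = "The run was cancelled before this question was answered"

// Hypothetical stand-in for the PR's RejectedError.
class RejectedError extends Error {
  readonly cancelled: boolean
  constructor(cancelled = false) {
    super(cancelled ? CANCELLED_COPY : DISMISSED_COPY)
    this.name = "RejectedError"
    this.cancelled = cancelled
  }
}

// Assumed shape of an errorMessage-style helper: read .message when available.
function errorMessage(error: unknown): string {
  return error instanceof Error ? error.message : String(error)
}

const dismissedText = errorMessage(new RejectedError())
const cancelledText = errorMessage(new RejectedError(true))
```

Because both code paths go through the same error type, whichever one writes the tool part's error ends up with the same text for a cancelled run.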

Deferred to #433 (commented as checklist items)

  • P2.3 — E2E lost-event coverage. Sync-drop simulation infra isn't present today; cleaner to validate against the ledger contract once #433 ([Task] Backend SessionBlocker ledger replaces frontend snapshot/clock auto-heal) lands rather than against the snapshot/clock fallback we plan to collapse.
  • P3.3 — auto-heal counters / observability. Now that R16 logs structured warns on give-up, counters/dashboards make more sense after the ledger lands and the snapshot/clock module either shrinks or goes away.

Already filed

@Astro-Han

Round R20–R22 follow-up

Latest review batch processed:

Accepted + landed

  • P2.1 — escalate to halt on retry exhaust (172610b). Persistent transient question.list() failures previously stopped silently after MAX_RETRIES, leaving the user stuck. Now the clock warns (question-recovery: retry budget exhausted, escalating to halt) and falls through to halt — same conservative action the user could trigger manually. Trade-off vs the earlier "warn-only" stance: a multi-blip server outage may now produce an unwanted halt, but the prior behavior could leave hidden blockers indefinitely. Halt-as-recoverable was judged the safer default.
  • P2.2 — combine publish + fail in abort handler (99b87fb). Replaced two independent Effect.runFork calls with a single Effect.gen pipeline so subscribers see Rejected before any awaiter unblocks, and any internal failure surfaces through one log.error("failFromAbort failed", ...) instead of disappearing into a detached fork.
  • P2.3 — full chain integration test (2dd8905). New question-recovery-chain.test.ts wires snapshot reducer + clock + reverify against the same harness state and locks the recovery contract end-to-end: missingRunning edge → arm → reverify → halt; server hydration on fire writes back and skips halt; transient list() failure recovers on the bounded follow-up; snapshot flipping out of missingRunning before fire cancels cleanly.
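
The P2.1 behaviour in this round (bounded retries, then escalate to halt) can be sketched as a small state holder. `MAX_RETRIES = 3` and the warn text come from the comment above; the class, the result variants, and the exact point at which the budget counts as exhausted are assumptions.

```typescript
const MAX_RETRIES = 3 // cap stated in the PR

// Assumed outcome of one reverify pass against question.list().
type ReverifyResult = "covered" | "uncovered" | "transient-failure"
type Action = "skip" | "halt" | "retry"

// Hypothetical sketch of the retry budget; not the PR's clock implementation.
class RetryBudget {
  private retries = 0

  next(result: ReverifyResult, warn: (msg: string) => void): Action {
    if (result === "covered") return "skip" // server knows the question: no halt
    if (result === "uncovered") return "halt" // confirmed hidden blocker
    if (this.retries < MAX_RETRIES) {
      this.retries += 1
      return "retry" // bounded follow-up after a transient failure
    }
    warn("question-recovery: retry budget exhausted, escalating to halt")
    return "halt"
  }

  // A fresh snapshot edge or session navigation resets the budget.
  reset(): void {
    this.retries = 0
  }
}

const budget = new RetryBudget()
const warnings: string[] = []
const actions = [1, 2, 3, 4].map(() =>
  budget.next("transient-failure", (m) => warnings.push(m)),
)
```

Under these assumptions, four consecutive transient failures yield three retries followed by one halt and a single warning.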

P3 deferred / acknowledged

Astro-Han added 5 commits May 4, 2026 22:41
processor.failToolCall writes errorMessage(error) into part.state.error
on the abort-signal path. With cancelled === true the message getter
now returns the same friendly copy the legacy fiber-cleanup path uses,
so consumers (state.error, logs, telemetry) read consistent text.

Refs #419.

Replace retried boolean with retries counter capped at MAX_RETRIES (3).
Persistent transient list() failures now log a structured warn before
the clock stops, instead of dead-ending after a single follow-up. A
fresh snapshot edge or session navigation still resets the budget.
Refs #419.

Persistent transient list() failures previously stopped silently at
MAX_RETRIES, leaving the user stuck on a hidden blocker. Now the clock
warns and falls through to halt, which is the same conservative action
the user could trigger manually anyway. Refs #419.

Replace two independent runFork calls with a single Effect.gen pipeline
so subscribers see Rejected before any awaiter unblocks and any internal
failure surfaces through one error log instead of disappearing into a
detached fork. Refs #419.

End-to-end test wires snapshot reducer + clock + reverify against the
same harness state to lock the recovery contract as a whole: edge into
missingRunning arms a timer, server hydration on fire writes back and
skips halt, transient list() failure recovers on the bounded follow-up,
and snapshot flipping out of missingRunning before fire cancels cleanly.
Refs #419.
@Astro-Han Astro-Han force-pushed the claude/fix-question-hidden-blocker branch from 2dd8905 to 31b256a Compare May 4, 2026 14:42
@Astro-Han Astro-Han merged commit 563d689 into dev May 4, 2026
20 checks passed
@Astro-Han Astro-Han deleted the claude/fix-question-hidden-blocker branch May 4, 2026 14:53