Summary
When an in-flight chat turn is interrupted (the client WebSocket drops mid-stream) while the last streamed assistant part is a tool call still in the input-streaming state, the turn-recovery path cannot make progress. In that state the model had begun emitting a tool-use block but never finished streaming its input, so the call was never finalized or dispatched. Recovery uses a "continue"/resume strategy that tries to resume the in-flight generation, but a non-finalized tool call has no resumption point. Every attempt therefore yields zero new tokens and is judged "no progress" (stable). After a fixed number of attempts the turn terminates with a stable-timeout and surfaces a "Session interrupted — send a new message to continue" error to the user.
This state is logically recoverable — the transcript before the partial tool call is fully valid — so recovery should succeed automatically rather than dead-ending.
Production evidence
Observed once in production (a long, multi-step app-build turn):
- Interruption was a client WebSocket close mid-stream; the persisted partial ended on a tool call in input-streaming (input never completed; tool never executed).
- Recovery made 7 resume attempts over ~88s, every one producing no new tokens; the persisted stream status stayed partial the entire time, ending in a stable-timeout.
- The agent/Durable Object itself was healthy — sibling turns in the same session completed successfully both shortly before and ~8s after the exhausted turn. The failure was isolated to recovering this one mid-tool-input interruption, not a crashed or evicted agent.
- A platform deploy was rolling out concurrently, but this agent instance was never reset by it (it kept running and completed other turns) — the deploy only contributed by nudging the client to reconnect. The root cause is the recovery strategy, not the interruption source.
Net user impact: one turn's output is permanently lost and the user must manually re-send, despite the state being recoverable.
Why "continue" can't work here
A tool call that is still input-streaming has no continuation token — there is nothing to resume. The resume strategy implicitly assumes the persisted stream ends at a resumable boundary (completed text, completed tool call, or a tool result). When it ends mid-tool-input, "continue" is a no-op that loops until the attempt budget is exhausted.
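The check implied by this paragraph can be sketched in a few lines. This is a minimal illustration, not the actual recovery code: the part shapes, the state names other than input-streaming, and the function name `endsMidToolInput` are all assumptions made for the example.

```typescript
// Hypothetical, simplified part shapes; the real persisted schema may differ.
type ToolState = "input-streaming" | "input-available" | "output-available";

type AssistantPart =
  | { type: "text"; text: string }
  | { type: "step-start" }
  | { type: "tool"; toolCallId: string; state: ToolState };

// Returns true when the persisted stream ends on a tool call whose input
// never finished streaming, i.e. the one boundary "continue" cannot resume.
function endsMidToolInput(parts: AssistantPart[]): boolean {
  const last = parts[parts.length - 1];
  if (last === undefined) return false; // nothing persisted, so no partial tool call
  return last.type === "tool" && last.state === "input-streaming";
}

// A stream cut off while the model was still emitting tool input.
const interrupted: AssistantPart[] = [
  { type: "text", text: "Creating the project files." },
  { type: "tool", toolCallId: "call-1", state: "input-streaming" },
];

// A stream that ended on a tool result, which is a resumable boundary.
const resumable: AssistantPart[] = [
  { type: "text", text: "Creating the project files." },
  { type: "tool", toolCallId: "call-1", state: "output-available" },
];

const interruptedIsStuck = endsMidToolInput(interrupted); // true
const resumableIsStuck = endsMidToolInput(resumable); // false
```

Only the trailing part is inspected, since the failure described here depends solely on where the persisted stream ends.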
Proposed resolution
Add a fallback so recovery routes a mid-tool-input interruption to regenerate-from-last-valid-step instead of resume:
- Detect the non-continuable boundary. On recovery, if the trailing persisted assistant part is an unfinished tool call (input-streaming, no finalized input, no result), classify it as not resumable rather than attempting to continue it.
- Truncate to the last clean boundary. Drop that partial tool part (and any orphaned step-start), cutting the history back to the last complete part so that it is a valid message history.
- Re-run inference (regenerate the step) from the truncated history, rather than resuming a stream that has no continuation point.
- No reconciliation needed in this case. An input-streaming tool call never executed, so there are no side effects to undo; this is a pure regenerate. (The harder case, recovery landing after a tool has already executed, is out of scope here and would need idempotency handling.)
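The steps above could be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the real recovery implementation: the `Part` shapes, the `RecoveryPlan` type, and `planRecovery` are hypothetical names, and the actual re-run of inference is left to the caller.

```typescript
// Hypothetical, simplified part shapes; the real persisted schema may differ.
type Part =
  | { type: "text"; text: string }
  | { type: "step-start" }
  | { type: "tool"; state: "input-streaming" | "input-available" | "output-available" };

// What recovery should do next, plus the history it should do it with.
type RecoveryPlan = { strategy: "resume" | "regenerate"; parts: Part[] };

function planRecovery(parts: Part[]): RecoveryPlan {
  const last = parts[parts.length - 1];
  const midToolInput =
    last !== undefined && last.type === "tool" && last.state === "input-streaming";

  // Any other boundary keeps the existing "continue" behaviour.
  if (!midToolInput) return { strategy: "resume", parts };

  // Drop the partial tool part, then any step-start left dangling before it.
  const truncated = parts.slice(0, -1);
  for (;;) {
    const tail = truncated[truncated.length - 1];
    if (tail === undefined || tail.type !== "step-start") break;
    truncated.pop();
  }

  // The tool never ran, so there is nothing to reconcile: regenerate the step.
  return { strategy: "regenerate", parts: truncated };
}

const plan = planRecovery([
  { type: "text", text: "Scaffolding the app." },
  { type: "step-start" },
  { type: "tool", state: "input-streaming" },
]);
// plan.strategy is "regenerate" and plan.parts contains only the text part.
```

Returning a plan instead of invoking the model directly keeps the classification and truncation logic testable without a live inference call. The input array is not mutated, so the persisted partial stays available for diagnostics.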
In short: when the interruption lands mid-tool-input, fall back from "continue" to "regenerate from the last valid step" instead of spinning to a stable-timeout.