A normal (non-recovery) in-flight chat turn is not resumed to a client that reconnects or re-mounts during the window between message accepted and first streamed chunk. The reconnecting/re-mounting client gets no "a turn is in flight" signal, so useAgentChat stays at status: "submitted" with no first token — even though the server produces the turn's tokens and completes it normally. Any client-side "no first token" timeout then fires a false "broken chat" on a turn the backend handled successfully.
Server: a Think-based DurableObject (the same resume core exists in AIChatAgent, so it reproduces there too)
Client: useAgentChat (@cloudflare/ai-chat/react)
Root cause
Three gaps in the resume/replay core combine so a churning/re-mounting client has no path to learn about — or receive — an in-flight normal turn:
1. Resume is gated on an already-active stream. onConnect does if (this._resumableStream.hasActiveStream()) this._notifyStreamResuming(connection). But a stream only becomes "active" at _startStream, i.e. at/after the first chunk. During the accept→first-chunk window hasActiveStream() is false, so the proactive notify is a no-op.
2. The recovering-state replay covers only durable recovery. getInitialMessages replays the cf:chat:recovering record on (re)connect, but that record is only set during a durable recovery turn. A normal turn that simply hasn't produced its first chunk yet is not "recovering," so nothing is replayed for it.
3. Tool-continuation connection-affinity refuses resume to a new connection. In the cf_agent_stream_resume_request handler, when the active stream is a tool-continuation whose _continuation.activeConnectionId !== connection.id, the server returns cf_agent_stream_resume_none. After a reconnect, connection.id is new, so an in-flight continuation is refused resume to the reconnected client.
Net: across the pre-first-token window (and for tool continuations after a reconnect), the client neither receives the live broadcast (it wasn't connected at the moment the chunk went out) nor a resume/replay (none of the three paths fire), and it has no client-side trigger to re-request once the stream does start.
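The three gaps can be modeled as three independent gates. The sketch below is a simplified illustration, not the library's code: ServerState, ResumeSignal, and every function name are hypothetical and only mirror the behavior described above.

```typescript
// Hypothetical model of the three server-side resume paths described above.
// None of these names are real library internals; they only mirror the prose.
type TurnPhase = "idle" | "accepted" | "streaming" | "recovering";

interface ServerState {
  phase: TurnPhase; // "accepted" = message accepted, no first chunk yet
  isToolContinuation: boolean;
  continuationConnectionId: string | null;
}

type ResumeSignal = "resuming" | "recovering-replay" | "none";

// Gap 1: the proactive notify on connect only fires for an already-active stream.
function onConnectSignal(s: ServerState): ResumeSignal {
  return s.phase === "streaming" ? "resuming" : "none";
}

// Gap 2: the initial-message replay only covers the durable recovering record.
function initialMessagesSignal(s: ServerState): ResumeSignal {
  return s.phase === "recovering" ? "recovering-replay" : "none";
}

// Gap 3: an explicit resume request is refused when the active stream is a
// tool continuation bound to a different connection id.
function resumeRequestSignal(s: ServerState, connectionId: string): ResumeSignal {
  if (s.phase !== "streaming") return "none";
  if (s.isToolContinuation && s.continuationConnectionId !== connectionId) {
    return "none";
  }
  return "resuming";
}

// Collects what a (re)connecting client would learn from each path.
function signalsFor(s: ServerState, connectionId: string): ResumeSignal[] {
  return [
    onConnectSignal(s),
    initialMessagesSignal(s),
    resumeRequestSignal(s, connectionId),
  ];
}
```

In this model, a state with phase "accepted" yields "none" from all three paths regardless of the connection id, which matches the stall described above.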
Production evidence
One interactive build session (production), reconstructed from server logs + product analytics:
11 user messages → 11 server turns, all completed (73 generations), 9 preview reloads, 8 in-app interactions, ending in a successful publish.
2 of the 11 turns: the server produced a first token within ~4.5 s and completed the turn, but the client never received the stream → useAgentChat stayed submitted for 180 s on each. Both coincided with a connection event — one with a WebSocket 1006 immediately before the turn, one with a client re-mount (user navigated back to the chat view).
The other 9 turns streamed normally (the client happened to be connected when the chunks broadcast).
Both affected turns were tool-heavy build continuations, so gap 3 (tool-continuation connection affinity) applies.
Tell-tale on the client: it emitted far fewer turn-start signals than the server ran (the broadcast agentState.turnId goes stale across the reconnect), and agent_status/turn_id read idle/null at the stall — while the server clearly logged first_token + finished: completed for the same turns.
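One way to surface such turns is to join server turn logs against client observations. This is a minimal sketch: the record shapes and field names are assumptions for illustration, not an actual log or analytics schema.

```typescript
// Hypothetical record shapes; the field names are assumptions, not a real schema.
interface ServerTurnLog {
  turnId: string;
  firstTokenMs: number | null; // latency to first token, null if none was produced
  finished: "completed" | "error" | null;
}

interface ClientTurnObservation {
  turnId: string;
  sawFirstToken: boolean;
}

// Returns the ids of turns that the server completed with a first token
// but for which the client never observed a first token.
function findFalseStalls(
  server: ServerTurnLog[],
  client: ClientTurnObservation[],
): string[] {
  const seen = new Set(
    client.filter((c) => c.sawFirstToken).map((c) => c.turnId),
  );
  return server
    .filter((s) => s.firstTokenMs !== null && s.finished === "completed")
    .filter((s) => !seen.has(s.turnId))
    .map((s) => s.turnId);
}
```

Turns with no client observation at all are also reported, since a client that missed the turn-start signal would record nothing for them.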
Expected behavior / suggested resolution
Resume/notify an accepted-but-pre-stream turn on (re)connect and re-mount, not only an actively-buffering stream or a durable-recovery state:
Treat "a turn has been accepted and is in flight" (request id known, no terminal yet) as resumable. On onConnect, on cf_agent_stream_resume_request, and in getInitialMessages, notify the connection that request <id> is in flight, so the client keeps its expectation and observes the stream once it starts (the existing handshake already supports an empty-then-live replay — cf_agent_stream_resuming → ACK → chunks as they arrive).
Relax the tool-continuation affinity on reconnect: allow the new connection to observe/resume the active continuation (or re-bind _continuation.activeConnectionId to the reconnected connection) instead of returning cf_agent_stream_resume_none.
This is the symmetric counterpart to the terminal-replay (#1645) and recovering-replay work, extended to the normal pre-first-token case.
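The two suggested changes could be combined into a single resume decision. The sketch below is one possible shape under stated assumptions, not a patch against the real handlers: InFlightTurn, ResumeReply, and resolveResume are hypothetical names, and re-binding is only one of the two affinity options mentioned above.

```typescript
// Hypothetical sketch of the proposed behavior; not the library's actual code.
interface InFlightTurn {
  requestId: string;
  streamStarted: boolean; // true once the first chunk has been produced
  terminal: boolean; // true once the turn has finished or errored
  isToolContinuation: boolean;
  activeConnectionId: string | null;
}

type ResumeReply =
  | { type: "resuming"; requestId: string }
  | { type: "none" };

// Decides what to tell a connection that connects, re-mounts, or asks to resume.
function resolveResume(
  turn: InFlightTurn | null,
  connectionId: string,
): ResumeReply {
  // No accepted turn, or the turn already reached a terminal state.
  if (turn === null || turn.terminal) return { type: "none" };

  // Suggested change 2: re-bind a tool continuation to the reconnected
  // connection instead of refusing the resume.
  if (turn.isToolContinuation && turn.activeConnectionId !== connectionId) {
    turn.activeConnectionId = connectionId;
  }

  // Suggested change 1: an accepted turn is resumable even when
  // streamStarted is still false; the replay is empty until chunks arrive.
  return { type: "resuming", requestId: turn.requestId };
}
```

The same decision would apply on connect, on an explicit resume request, and when building initial messages, so the client keeps its expectation for the request id in all three cases.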
Scope notes
Distinct from the recovering-replay convergence in design/rfc-chat-recovery-foundation.md — that targets the durable cf:chat:recovering state; the case here is a normal, healthy turn that is not in durable recovery, so that work would not cover it.
See also #1781 (mid-tool-input interruption can't be recovered).
Affected versions
@cloudflare/ai-chat 0.8.6 (latest), @cloudflare/think 0.10.0, agents 0.16.2