ai-chat/think: an in-flight (non-recovery) turn isn't resumed to a client that reconnects or re-mounts during the pre-first-token window #1784

@rwdaigle

Summary

A normal (non-recovery) in-flight chat turn is not resumed to a client that reconnects or re-mounts during the window between message accepted and first streamed chunk. The reconnecting/re-mounting client gets no "a turn is in flight" signal, so useAgentChat stays at status: "submitted" with no first token — even though the server produces the turn's tokens and completes it normally. Any client-side "no first token" timeout then fires a false "broken chat" on a turn the backend handled successfully.

Affected versions

  • @cloudflare/ai-chat 0.8.6 (latest), @cloudflare/think 0.10.0, agents 0.16.2
  • Server: a Think-based DurableObject (the same resume core exists in AIChatAgent, so it reproduces there too)
  • Client: useAgentChat (@cloudflare/ai-chat/react)

Root cause

Three gaps in the resume/replay core combine so a churning/re-mounting client has no path to learn about — or receive — an in-flight normal turn:

  1. Resume is gated on an already-active stream. onConnect does if (this._resumableStream.hasActiveStream()) this._notifyStreamResuming(connection). But a stream only becomes "active" at _startStream, i.e. at/after the first chunk. During the accept→first-chunk window hasActiveStream() is false, so the proactive notify is a no-op.
  2. The recovering-state replay covers only durable recovery. getInitialMessages replays the cf:chat:recovering record on (re)connect, but that record is only set during a durable recovery turn. A normal turn that simply hasn't produced its first chunk yet is not "recovering," so nothing is replayed for it.
  3. Tool-continuation connection-affinity refuses resume to a new connection. In the cf_agent_stream_resume_request handler, when the active stream is a tool-continuation whose _continuation.activeConnectionId !== connection.id, the server returns cf_agent_stream_resume_none. After a reconnect, connection.id is new, so an in-flight continuation is refused resume to the reconnected client.
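The three gates can be read as a single decision the server makes for a (re)connecting client. The following is a minimal sketch of that decision, not the library's code: `TurnState`, `ResumeSignal`, and `currentResumeSignal` are hypothetical names, and the order of the checks is an assumption. Only the conditions themselves are taken from the list above.

```typescript
// Hypothetical model of the server-side facts named in the three gaps.
// These types do not exist in @cloudflare/ai-chat; they are illustration only.
interface TurnState {
  accepted: boolean;                // message accepted, request id known
  streamStarted: boolean;           // first chunk produced, so the stream counts as "active"
  durableRecovery: boolean;         // a recovering record exists for this turn
  continuationOwner: string | null; // connection id bound to a tool continuation, if any
}

type ResumeSignal = "resuming" | "recovering-replay" | "none";

function currentResumeSignal(turn: TurnState, connectionId: string): ResumeSignal {
  // Gap 2: the replay path covers durable recovery only.
  if (turn.durableRecovery) return "recovering-replay";
  // Gap 1: nothing is announced until a stream is already active.
  if (!turn.streamStarted) return "none";
  // Gap 3: a tool continuation bound to another connection id is refused.
  if (turn.continuationOwner !== null && turn.continuationOwner !== connectionId) {
    return "none";
  }
  return "resuming";
}

// An accepted normal turn that has not produced its first chunk yet.
const preFirstToken: TurnState = {
  accepted: true,
  streamStarted: false,
  durableRecovery: false,
  continuationOwner: null,
};
console.log(currentResumeSignal(preFirstToken, "conn-new")); // "none"
```

In this model the `accepted` flag is never consulted, which is the core of the report: an accepted turn with no first chunk is indistinguishable from no turn at all.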

Net: across the pre-first-token window (and for tool continuations after a reconnect), the client neither receives the live broadcast (it wasn't connected at the moment the chunk went out) nor a resume/replay (none of the three paths fire), and it has no client-side trigger to re-request once the stream does start.

Production evidence

One interactive build session (production), reconstructed from server logs + product analytics:

  • 11 user messages → 11 server turns, all completed (73 generations), 9 preview reloads, 8 in-app interactions, ending in a successful publish.
  • 2 of the 11 turns: the server produced a first token within ~4.5 s and completed the turn, but the client never received the stream → useAgentChat stayed submitted for 180 s on each. Both coincided with a connection event — one with a WebSocket 1006 immediately before the turn, one with a client re-mount (user navigated back to the chat view).
  • The other 9 turns streamed normally (the client happened to be connected when the chunks broadcast).
  • Both affected turns were tool-heavy build continuations, so gap 3 under Root cause applies.

Tell-tale on the client: it emitted far fewer turn-start signals than the server ran (the broadcast agentState.turnId goes stale across the reconnect), and agent_status/turn_id read idle/null at the stall — while the server clearly logged first_token + finished: completed for the same turns.
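A check along these lines can surface the mismatch from logs. It is a sketch under assumptions: the `ServerTurnLog` shape and the `findUndeliveredTurns` helper are hypothetical and do not correspond to any library API, and it assumes the server log and the client analytics share a turn identifier.

```typescript
// Hypothetical log record; field names are assumptions made for this sketch.
interface ServerTurnLog {
  turnId: string;
  firstToken: boolean;              // server logged a first token for the turn
  finished: "completed" | "failed"; // terminal status logged by the server
}

// Returns the ids of turns the server streamed and completed
// but for which the client never recorded a turn-start signal.
function findUndeliveredTurns(
  serverTurns: ServerTurnLog[],
  clientTurnStarts: string[],
): string[] {
  const seenByClient = new Set(clientTurnStarts);
  return serverTurns
    .filter((t) => t.firstToken && t.finished === "completed" && !seenByClient.has(t.turnId))
    .map((t) => t.turnId);
}
```

Any id this returns is a turn where a client-side "no first token" timeout would have reported a failure that the backend did not have.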

Expected behavior / suggested resolution

Resume/notify an accepted-but-pre-stream turn on (re)connect and re-mount, not only an actively-buffering stream or a durable-recovery state:

  • Treat "a turn has been accepted and is in flight" (request id known, no terminal yet) as resumable. On onConnect, on cf_agent_stream_resume_request, and in getInitialMessages, notify the connection that request <id> is in flight, so the client keeps its expectation and observes the stream once it starts (the existing handshake already supports an empty-then-live replay — cf_agent_stream_resuming → ACK → chunks as they arrive).
  • Relax the tool-continuation affinity on reconnect: allow the new connection to observe/resume the active continuation (or re-bind _continuation.activeConnectionId to the reconnected connection) instead of returning cf_agent_stream_resume_none.
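The two bullets above can be sketched as a small in-memory model of the empty-then-live replay. This is a hedged illustration, not a proposed patch: `InFlightTurnModel`, `Send`, and the `chunk` message shape are hypothetical. Only the `cf_agent_stream_resuming` and `cf_agent_stream_resume_none` names come from this issue. Terminal replay (#1645) is out of scope here, so the model reports `cf_agent_stream_resume_none` once the turn has finished.

```typescript
// Hypothetical wire messages. The two cf_agent_* names come from the issue text;
// the "chunk" shape is invented for this sketch.
type Outbound =
  | { type: "cf_agent_stream_resuming"; requestId: string }
  | { type: "cf_agent_stream_resume_none" }
  | { type: "chunk"; requestId: string; body: string };

type Send = (msg: Outbound) => void;

class InFlightTurnModel {
  readonly requestId: string;
  private readonly buffered: string[] = [];
  private readonly observers = new Map<string, Send>();
  private terminal = false;

  // Constructing the model stands for "message accepted, request id known".
  constructor(requestId: string) {
    this.requestId = requestId;
  }

  // Would run on connect and on a resume request.
  // The gate is "accepted and not terminal", not "stream already active".
  notify(send: Send): void {
    if (this.terminal) {
      send({ type: "cf_agent_stream_resume_none" });
      return;
    }
    send({ type: "cf_agent_stream_resuming", requestId: this.requestId });
  }

  // Client ACK: replay whatever is buffered (possibly nothing), then go live.
  // Any connection id may observe, so a reconnect is not refused by affinity.
  acknowledge(connectionId: string, send: Send): void {
    for (const body of this.buffered) {
      send({ type: "chunk", requestId: this.requestId, body });
    }
    if (!this.terminal) this.observers.set(connectionId, send);
  }

  // Buffers the chunk and forwards it to every acknowledged observer.
  pushChunk(body: string): void {
    this.buffered.push(body);
    this.observers.forEach((send) => {
      send({ type: "chunk", requestId: this.requestId, body });
    });
  }

  finish(): void {
    this.terminal = true;
    this.observers.clear();
  }
}
```

A client that reconnects before the first chunk calls `notify`, acknowledges, receives an empty replay, and then sees each chunk as `pushChunk` runs. A client that reconnects mid-stream gets the buffered chunks first and the live ones after.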

This is the symmetric counterpart to the terminal-replay (#1645) and recovering-replay work, extended to the normal pre-first-token case.

Scope notes
