Skip to content

Session rollback fails with "no assistant message found" after timeout corrupts conversation state #955

@DavidBellamy

Description

@DavidBellamy

Summary

When an httpx.ReadTimeout occurs mid-conversation, the session state on the SGLang engine becomes inconsistent with what the client expects. On the next request, the session server's rollback mechanism tries to restore to the last valid checkpoint but fails because the expected assistant message is missing from the stored history.

Error

litellm.BadRequestError: OpenAIException - Error code: 400
{'error': 'rollback failed: no assistant message found in the first 1 matched messages
 (stored has 2 messages, request has 1 messages)'}

Reproduction

This occurs reliably during timeout storms (see #920). In a 1h25m test run with 33 nodes:

  • 127 rollback failures observed
  • Every failure results in the trial dying with exit_status=AgentError, reward=0.0
  • All affected trials were battery-charging-optimization and similar multi-turn agent tasks

Mechanism

  1. Client sends a multi-turn conversation to the session server
  2. Engine processes the request but the router times out before the response arrives (300s miles-router-timeout)
  3. The engine has already appended the assistant response to the session's stored messages
  4. Client retries with the original message list (without the assistant response it never received)
  5. Session server detects mismatch: stored has 2 messages (including the phantom assistant response), request has 1
  6. Rollback tries to find an assistant message at the expected position but the message indices don't align
  7. Rollback fails with 400 error

Impact

  • Each failure wastes all GPU compute spent on that rollout (often 5-15 minutes of agent interaction)
  • The error is not recoverable within the current session; the trial must be abandoned
  • In the test run, 127 failures contributed to the training loop being stuck on step 1 for the entire run

Suggested fix

When a rollback fails due to message count mismatch, the session server should:

  1. Detect the timeout-induced state corruption pattern (stored > request message count)
  2. Reset the session to the last valid checkpoint before the mismatch
  3. Return a retryable error code (e.g., 409 Conflict) instead of 400 Bad Request so the client can retry cleanly

Related: #920 (root cause of the timeouts that trigger this), #936 (timeout measurement)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions