Summary
When an httpx.ReadTimeout occurs mid-conversation, the session state on the SGLang engine becomes inconsistent with what the client expects. On the next request, the session server's rollback mechanism tries to restore to the last valid checkpoint but fails because the expected assistant message is missing from the stored history.
Error
litellm.BadRequestError: OpenAIException - Error code: 400
{'error': 'rollback failed: no assistant message found in the first 1 matched messages
(stored has 2 messages, request has 1 messages)'}
Reproduction
This occurs reliably during timeout storms (see #920). In a 1h25m test run with 33 nodes:
- 127 rollback failures observed
- Every failure results in the trial dying with
exit_status=AgentError, reward=0.0
- All affected trials were
battery-charging-optimization and similar multi-turn agent tasks
Mechanism
- Client sends a multi-turn conversation to the session server
- Engine processes the request but the router times out before the response arrives (300s
miles-router-timeout)
- The engine has already appended the assistant response to the session's stored messages
- Client retries with the original message list (without the assistant response it never received)
- Session server detects mismatch: stored has 2 messages (including the phantom assistant response), request has 1
- Rollback tries to find an assistant message at the expected position but the message indices don't align
- Rollback fails with 400 error
Impact
- Each failure wastes all GPU compute spent on that rollout (often 5-15 minutes of agent interaction)
- The error is not recoverable within the current session; the trial must be abandoned
- In the test run, 127 failures contributed to the training loop being stuck on step 1 for the entire run
Suggested fix
When a rollback fails due to message count mismatch, the session server should:
- Detect the timeout-induced state corruption pattern (stored > request message count)
- Reset the session to the last valid checkpoint before the mismatch
- Return a retryable error code (e.g., 409 Conflict) instead of 400 Bad Request so the client can retry cleanly
Related: #920 (root cause of the timeouts that trigger this), #936 (timeout measurement)
Summary
When an
httpx.ReadTimeoutoccurs mid-conversation, the session state on the SGLang engine becomes inconsistent with what the client expects. On the next request, the session server's rollback mechanism tries to restore to the last valid checkpoint but fails because the expected assistant message is missing from the stored history.Error
Reproduction
This occurs reliably during timeout storms (see #920). In a 1h25m test run with 33 nodes:
exit_status=AgentError, reward=0.0battery-charging-optimizationand similar multi-turn agent tasksMechanism
miles-router-timeout)Impact
Suggested fix
When a rollback fails due to message count mismatch, the session server should:
Related: #920 (root cause of the timeouts that trigger this), #936 (timeout measurement)