Session rollback fails with "no assistant message found" after timeout corrupts conversation state

## Summary

When an `httpx.ReadTimeout` occurs mid-conversation, the session state on the SGLang engine becomes inconsistent with what the client expects. On the next request, the session server's rollback mechanism tries to restore to the last valid checkpoint but fails because the expected assistant message is missing from the stored history.

## Error

```
litellm.BadRequestError: OpenAIException - Error code: 400
{'error': 'rollback failed: no assistant message found in the first 1 matched messages
 (stored has 2 messages, request has 1 messages)'}
```

## Reproduction

This occurs reliably during timeout storms (see #920). In a 1h25m test run with 33 nodes:
- **127 rollback failures** observed
- Every failure results in the trial dying with `exit_status=AgentError, reward=0.0`
- All affected trials were `battery-charging-optimization` and similar multi-turn agent tasks

## Mechanism

1. Client sends a multi-turn conversation to the session server
2. Engine processes the request but the router times out before the response arrives (300s `miles-router-timeout`)
3. The engine has already appended the assistant response to the session's stored messages
4. Client retries with the original message list (without the assistant response it never received)
5. Session server detects mismatch: stored has 2 messages (including the phantom assistant response), request has 1
6. Rollback tries to find an assistant message at the expected position but the message indices don't align
7. Rollback fails with 400 error

## Impact

- Each failure wastes all GPU compute spent on that rollout (often 5-15 minutes of agent interaction)
- The error is not recoverable within the current session; the trial must be abandoned
- In the test run, 127 failures contributed to the training loop being stuck on step 1 for the entire run

## Suggested fix

When a rollback fails due to message count mismatch, the session server should:
1. Detect the timeout-induced state corruption pattern (stored > request message count)
2. Reset the session to the last valid checkpoint before the mismatch
3. Return a retryable error code (e.g., 409 Conflict) instead of 400 Bad Request so the client can retry cleanly

Related: #920 (root cause of the timeouts that trigger this), #936 (timeout measurement)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Session rollback fails with "no assistant message found" after timeout corrupts conversation state #955

Summary

Error

Reproduction

Mechanism

Impact

Suggested fix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Session rollback fails with "no assistant message found" after timeout corrupts conversation state #955

Description

Summary

Error

Reproduction

Mechanism

Impact

Suggested fix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions