TITO prefill storms cause GPU contention that starves concurrent decode requests #920

@DavidBellamy

Description

Context

Analysis of Harbor RL training jobs revealed httpx.ReadTimeout errors in the session server proxy. The initial hypothesis was a thundering herd at rollout batch boundaries, but further investigation pointed to a different root cause.

Analysis

We reproduced the timeout with a minimal job: 2 tblite tasks, 2 rollouts, single node, 8 DP shards. With only 4 concurrent requests, a thundering herd is impossible, yet the timeouts still occurred.

We believe the root cause is GPU contention between prefill and decode within a shared engine, not request flooding:

  1. When a TITO session accumulates a long conversation (50K-100K+ tokens) and starts a new turn on a different DP shard, the full context must be prefilled from scratch
  2. This prefill monopolizes the GPU for 30-40 seconds (visible in logs as 80+ consecutive 512-token prefill batches on one DP shard)
  3. During the prefill storm, all other DP shards' decode throughput drops from ~68 tok/s to 3-8 tok/s
  4. Already-long-running decode requests accumulate enough stall time to exceed the proxy timeout
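The arithmetic behind steps 1-2 can be sanity-checked against the figures quoted above. This is an illustrative back-of-envelope sketch, not a measurement; the chunk size and per-chunk latency are inferred from the log excerpts in this issue:

```python
import math

# Figures taken from the report; per-chunk latency is an inference from the
# logs ("80+ consecutive 512-token prefill batches" over ~35 s), not a
# documented engine parameter.
context_tokens = 50_000          # low end of the observed TITO context length
chunk_size = 512                 # prefill batch size seen in the logs
observed_storm_seconds = 35      # approximate duration of the prefill storm
observed_chunks = 80             # lower bound on consecutive prefill batches

chunks_needed = math.ceil(context_tokens / chunk_size)
seconds_per_chunk = observed_storm_seconds / observed_chunks

print(chunks_needed)                # → 98 chunks just for a 50K-token context
print(round(seconds_per_chunk, 2))  # → 0.44 s of GPU time per chunk
```

At ~0.44 s per 512-token chunk, even the low end of the context range occupies the GPU for tens of seconds, which matches the 30-40 second monopolization described in step 2.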

The burst pattern observed in larger jobs (300+ errors at batch boundaries) is the same mechanism amplified: more concurrent requests mean more TITO sessions starting simultaneously, more cross-shard prefills, and more GPU contention.

Evidence from our job logs

Single engine, DP5 decoding at 68 tok/s. DP7 starts a massive TITO prefill:

  • 19:05:19: DP7 begins 512-token prefill batches (80+ consecutive)
  • 19:05:24: DP5 throughput drops to gen throughput (token/s): 4.41
  • 19:05:33: DP5 at gen throughput (token/s): 3.64
  • 19:05:44: DP5 at gen throughput (token/s): 3.45
  • Prefill continues for ~35 seconds

This directly causes the in-flight decode request to stall, pushing it past the 600s timeout.
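Stall windows like the one above can be flagged mechanically. A minimal sketch, assuming a log-line shape matching the excerpts quoted in this issue (timestamp, DP shard, then the engine's "gen throughput (token/s):" figure); the format is an assumption, not a documented logging contract:

```python
import re

# Regex shape is assumed from the log excerpts above, not from any official
# logging format.
LINE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}).*?(?P<shard>DP\d+).*?"
    r"gen throughput \(token/s\): (?P<tps>[\d.]+)"
)

def stalled_lines(log_text, healthy_tps=68.0):
    """Return (timestamp, shard, tok/s) tuples where decode throughput has
    collapsed to under a quarter of the healthy baseline (~68 tok/s here)."""
    stalls = []
    for line in log_text.splitlines():
        m = LINE.search(line)
        if m and float(m.group("tps")) < healthy_tps * 0.25:
            stalls.append((m.group("ts"), m.group("shard"), float(m.group("tps"))))
    return stalls

sample = """\
19:05:24 DP5 gen throughput (token/s): 4.41
19:05:33 DP5 gen throughput (token/s): 3.64
19:05:44 DP5 gen throughput (token/s): 3.45
"""
print(stalled_lines(sample))
# → [('19:05:24', 'DP5', 4.41), ('19:05:33', 'DP5', 3.64), ('19:05:44', 'DP5', 3.45)]
```

Correlating these stall windows with the start of large prefill runs on a sibling shard is how we attributed the DP5 collapse to DP7's prefill.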

Recommendation

  1. Primary fix: Restructure the timeout to measure engine processing time, not wall-clock time from submission (filed separately as #936, "Session server timeout should measure engine processing time, not wall-clock time from submission")
  2. Secondary fix: Investigate whether chunked prefill can interleave with decode batches to prevent the GPU monopolization during TITO session prefills
  3. Batch staggering is useful as a mitigation at scale (reduces the number of simultaneous TITO prefills) but is not the root cause fix
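To make recommendation 2 concrete, here is a toy sketch of a token-budget scheduler that interleaves chunked prefill with decode. All names (Request, TOKEN_BUDGET, plan_step) are hypothetical illustrations, not the engine's actual scheduler API:

```python
from dataclasses import dataclass

# Hypothetical knobs: total tokens per engine step, and a slice of that
# budget reserved for decode even while a large prefill is in flight.
TOKEN_BUDGET = 512
DECODE_RESERVE = 64

@dataclass
class Request:
    rid: str
    remaining_prefill: int  # 0 for pure decode requests

def plan_step(requests):
    """Assign this step's token budget: decodes first (1 token each, up to
    the reserve), then chunked prefill consumes whatever remains."""
    plan = {}
    budget = TOKEN_BUDGET
    decodes = [r for r in requests if r.remaining_prefill == 0]
    prefills = [r for r in requests if r.remaining_prefill > 0]
    for r in decodes[:DECODE_RESERVE]:   # each decode emits one token
        plan[r.rid] = 1
        budget -= 1
    for r in prefills:                   # chunk the prefill into what's left
        take = min(r.remaining_prefill, budget)
        if take:
            plan[r.rid] = take
            budget -= take
    return plan

# A 100K-token TITO prefill now shares every step with the in-flight decode,
# so decode throughput degrades gracefully instead of collapsing to ~3 tok/s.
print(plan_step([Request("dp5-decode", 0), Request("dp7-tito", 100_000)]))
# → {'dp5-decode': 1, 'dp7-tito': 511}
```

The design point is that prefill latency grows slightly (it yields a few tokens of budget per step) in exchange for bounding the worst-case decode stall, which is exactly the trade-off needed to keep long-running decode requests under the proxy timeout.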

/cc @mingshanhee
