TITO prefill storms cause GPU contention that starves concurrent decode requests #920

@DavidBellamy

Description

Context

Analysis of Harbor RL training jobs revealed httpx.ReadTimeout errors in the session server proxy. The initial hypothesis was a thundering herd at rollout batch boundaries, but further investigation pointed to a different root cause.

Analysis

We reproduced the timeout with a minimal job: 2 tblite tasks, 2 rollouts, single node, 8 DP shards. With only 4 concurrent requests, a thundering herd is impossible, yet the timeouts still occurred.

We believe the root cause is GPU contention between prefill and decode within a shared engine, not request flooding:

  1. When a TITO session accumulates a long conversation (50K-100K+ tokens) and starts a new turn on a different DP shard, the full context must be prefilled from scratch
  2. This prefill monopolizes the GPU for 30-40 seconds (visible in logs as 80+ consecutive 512-token prefill batches on one DP shard)
  3. During the prefill storm, all other DP shards' decode throughput drops from ~68 tok/s to 3-8 tok/s
  4. Already-long-running decode requests accumulate enough stall time to exceed the proxy timeout
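The arithmetic behind steps 1-2 can be sanity-checked against the figures quoted above. This is an illustrative back-of-envelope sketch, not a measurement; the chunk size and per-chunk latency are inferred from the log excerpts in this issue:

```python
import math

# Figures taken from the report; per-chunk latency is an inference from the
# logs ("80+ consecutive 512-token prefill batches" over ~35 s), not a
# documented engine parameter.
context_tokens = 50_000          # low end of the observed TITO context length
chunk_size = 512                 # prefill batch size seen in the logs
observed_storm_seconds = 35      # approximate duration of the prefill storm
observed_chunks = 80             # lower bound on consecutive prefill batches

chunks_needed = math.ceil(context_tokens / chunk_size)
seconds_per_chunk = observed_storm_seconds / observed_chunks

print(chunks_needed)                # → 98 chunks just for a 50K-token context
print(round(seconds_per_chunk, 2))  # → 0.44 s of GPU time per chunk
```

At ~0.44 s per 512-token chunk, even the low end of the context range occupies the GPU for tens of seconds, which matches the 30-40 second monopolization described in step 2.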

The burst pattern observed in larger jobs (300+ errors at batch boundaries) is the same mechanism amplified: more concurrent requests mean more TITO sessions starting simultaneously, more cross-shard prefills, and more GPU contention.

Evidence from our job logs

Single engine, DP5 decoding at 68 tok/s. DP7 starts a massive TITO prefill:

  • 19:05:19: DP7 begins 512-token prefill batches (80+ consecutive)
  • 19:05:24: DP5 throughput drops to gen throughput (token/s): 4.41
  • 19:05:33: DP5 at gen throughput (token/s): 3.64
  • 19:05:44: DP5 at gen throughput (token/s): 3.45
  • Prefill continues for ~35 seconds

This directly causes the in-flight decode request to stall, pushing it past the 600s timeout.
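Stall windows like the one above can be flagged mechanically. A minimal sketch, assuming a log-line shape matching the excerpts quoted in this issue (timestamp, DP shard, then the engine's "gen throughput (token/s):" figure); the format is an assumption, not a documented logging contract:

```python
import re

# Regex shape is assumed from the log excerpts above, not from any official
# logging format.
LINE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}).*?(?P<shard>DP\d+).*?"
    r"gen throughput \(token/s\): (?P<tps>[\d.]+)"
)

def stalled_lines(log_text, healthy_tps=68.0):
    """Return (timestamp, shard, tok/s) tuples where decode throughput has
    collapsed to under a quarter of the healthy baseline (~68 tok/s here)."""
    stalls = []
    for line in log_text.splitlines():
        m = LINE.search(line)
        if m and float(m.group("tps")) < healthy_tps * 0.25:
            stalls.append((m.group("ts"), m.group("shard"), float(m.group("tps"))))
    return stalls

sample = """\
19:05:24 DP5 gen throughput (token/s): 4.41
19:05:33 DP5 gen throughput (token/s): 3.64
19:05:44 DP5 gen throughput (token/s): 3.45
"""
print(stalled_lines(sample))
# → [('19:05:24', 'DP5', 4.41), ('19:05:33', 'DP5', 3.64), ('19:05:44', 'DP5', 3.45)]
```

Correlating these stall windows with the start of large prefill runs on a sibling shard is how we attributed the DP5 collapse to DP7's prefill.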

Recommendation

  1. Primary fix: Restructure the timeout to measure engine processing time, not wall-clock time from submission (filed separately as #936, "Session server timeout should measure engine processing time, not wall-clock time from submission")
  2. Secondary fix: Investigate whether chunked prefill can interleave with decode batches to prevent the GPU monopolization during TITO session prefills
  3. Batch staggering is useful as a mitigation at scale (reduces the number of simultaneous TITO prefills) but is not the root cause fix
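To make recommendation 2 concrete, here is a toy sketch of a token-budget scheduler that interleaves chunked prefill with decode. All names (Request, TOKEN_BUDGET, plan_step) are hypothetical illustrations, not the engine's actual scheduler API:

```python
from dataclasses import dataclass

# Hypothetical knobs: total tokens per engine step, and a slice of that
# budget reserved for decode even while a large prefill is in flight.
TOKEN_BUDGET = 512
DECODE_RESERVE = 64

@dataclass
class Request:
    rid: str
    remaining_prefill: int  # 0 for pure decode requests

def plan_step(requests):
    """Assign this step's token budget: decodes first (1 token each, up to
    the reserve), then chunked prefill consumes whatever remains."""
    plan = {}
    budget = TOKEN_BUDGET
    decodes = [r for r in requests if r.remaining_prefill == 0]
    prefills = [r for r in requests if r.remaining_prefill > 0]
    for r in decodes[:DECODE_RESERVE]:   # each decode emits one token
        plan[r.rid] = 1
        budget -= 1
    for r in prefills:                   # chunk the prefill into what's left
        take = min(r.remaining_prefill, budget)
        if take:
            plan[r.rid] = take
            budget -= take
    return plan

# A 100K-token TITO prefill now shares every step with the in-flight decode,
# so decode throughput degrades gracefully instead of collapsing to ~3 tok/s.
print(plan_step([Request("dp5-decode", 0), Request("dp7-tito", 100_000)]))
# → {'dp5-decode': 1, 'dp7-tito': 511}
```

The design point is that prefill latency grows slightly (it yields a few tokens of budget per step) in exchange for bounding the worst-case decode stall, which is exactly the trade-off needed to keep long-running decode requests under the proxy timeout.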

/cc @mingshanhee
