Context

Analysis of Harbor RL training jobs revealed httpx.ReadTimeout errors in the session server proxy. The initial hypothesis was a thundering herd at rollout batch boundaries, but further investigation pointed to a different root cause.
Analysis
We reproduced the timeout with a minimal job: 2 tblite tasks, 2 rollouts, single node, 8 DP shards. With only 4 concurrent requests, no thundering herd is possible, yet the timeouts still occurred.
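The concurrency arithmetic that rules out a thundering herd can be spelled out (a trivial Python sketch using the repro's numbers):

```python
# Sanity check on the minimal repro (numbers from the job above):
tasks, rollouts, dp_shards = 2, 2, 8
concurrent_requests = tasks * rollouts  # 2 tasks x 2 rollouts = 4 in-flight requests

# With 8 DP shards and only 4 requests, every request can land on its
# own shard, so request queueing cannot explain the timeouts.
assert concurrent_requests < dp_shards
```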
We believe the root cause is GPU contention between prefill and decode within a shared engine, not request flooding:
- When a TITO session accumulates a long conversation (50K-100K+ tokens) and starts a new turn on a different DP shard, the full context must be prefilled from scratch.
- This prefill monopolizes the GPU for 30-40 seconds (visible in logs as 80+ consecutive 512-token prefill batches on one DP shard).
- During the prefill storm, decode throughput on all other DP shards drops from ~68 tok/s to 3-8 tok/s.
- Already-long-running decode requests accumulate enough stall time to exceed the proxy timeout.
The burst pattern observed in larger jobs (300+ errors at batch boundaries) is the same mechanism amplified: more concurrent requests mean more TITO sessions starting simultaneously, more cross-shard prefills, and more GPU contention.
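The stall arithmetic can be sketched in Python. Only the context length, batch size, and throughput figures come from the logs; the per-batch latency and remaining-token count below are illustrative assumptions chosen to be consistent with them:

```python
# Back-of-envelope estimate of a prefill-storm stall.
context_tokens = 100_000   # long TITO conversation (upper end observed)
prefill_batch = 512        # prefill batch size seen in the logs
batches = -(-context_tokens // prefill_batch)  # ceiling division -> 196 batches

per_batch_s = 0.4                # assumed, consistent with 80+ batches in 30-40 s
storm_s = batches * per_batch_s  # ~78 s of prefill monopolizing the GPU

remaining_tokens = 2_000           # assumed tokens left on an in-flight decode
healthy_s = remaining_tokens / 68  # ~29 s at normal throughput
stalled_s = remaining_tokens / 5   # 400 s at storm throughput (3-8 tok/s)
```

On these assumed numbers, one or two overlapping prefill storms are enough to push an already-long decode request past a 600 s proxy timeout.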
Evidence from our job logs
Single engine, DP5 decoding at 68 tok/s. DP7 starts a massive TITO prefill:
```
gen throughput (token/s): 4.41
gen throughput (token/s): 3.64
gen throughput (token/s): 3.45
```

This directly causes the in-flight decode request to stall, pushing it past the 600s timeout.
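For reference, one way to express a more tolerant read timeout in httpx (a hypothetical configuration sketch; the proxy's actual client setup is not shown in this issue, and 900 s is an arbitrary example value):

```python
import httpx

# Hypothetical mitigation sketch: give the proxy's httpx client a read
# timeout that tolerates decode stalls during cross-shard prefill storms.
# The 600 s figure above is the limit being exceeded today.
timeout = httpx.Timeout(
    connect=10.0,  # fail fast if the engine is unreachable
    read=900.0,    # tolerate long decode stalls (assumption, not the real config)
    write=10.0,
    pool=10.0,
)
client = httpx.Client(timeout=timeout)
```

Raising the timeout only masks the contention; it does not address the underlying prefill/decode interference.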
Recommendation
/cc @mingshanhee