[Bug] _get_capped_partitions crashes when a single sample exceeds max_tokens_per_gpu #1839

@leofan-lab

Bug Description

PR: #1823 (Add fallback for get_seqlen_balanced_partitions)

Issue: When rollout-max-response-len > max-tokens-per-gpu, a single sample's total length (prompt + response) can exceed max_tokens_per_gpu. get_minimum_num_micro_batch_size handles this correctly by isolating the oversized sample in its own micro-batch.

However, the fallback _get_capped_partitions enforces sums[i] + length <= max_tokens strictly, so it can't place the sample in any partition and hits raise AssertionError("This should never happen.").
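The failure mode can be illustrated with a simplified stand-in. This is not the actual slime code: `capped_partitions_strict` and its greedy longest-first placement are assumptions made for illustration. Only the strict `sums[i] + length <= max_tokens` check and the assertion message come from the report above.

```python
def capped_partitions_strict(lengths, num_partitions, max_tokens):
    """Hypothetical stand-in for the fallback: greedy placement under a strict cap."""
    sums = [0] * num_partitions
    partitions = [[] for _ in range(num_partitions)]
    # Place the longest samples first, each into the least-loaded partition that fits.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        length = lengths[idx]
        candidates = [i for i in range(num_partitions) if sums[i] + length <= max_tokens]
        if not candidates:
            # An oversized sample fits nowhere, not even in an empty partition.
            raise AssertionError("This should never happen.")
        target = min(candidates, key=lambda i: sums[i])
        partitions[target].append(idx)
        sums[target] += length
    return partitions
```

With `max_tokens=4096`, a single sample of length 4500 makes the candidate list empty on the very first iteration, so the assertion fires regardless of how many partitions are available.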

Steps to Reproduce

Repro config:

--rollout-max-response-len 8192
--max-tokens-per-gpu 4096

Any sample with prompt (~400 tokens) + response (>3696 tokens) triggers the crash.

Expected Behavior

Expected: _get_capped_partitions should match get_minimum_num_micro_batch_size's behavior — when a sample can't fit in any existing partition, place it alone in an empty partition (even if it exceeds max_tokens).
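A minimal sketch of the expected behavior, under the same assumptions as the report (a per-partition running sum and a token cap). The function name `capped_partitions_relaxed`, the longest-first ordering, and the `ValueError` are hypothetical and not taken from the slime codebase.

```python
def capped_partitions_relaxed(lengths, num_partitions, max_tokens):
    """Hypothetical variant: an oversized sample is isolated in an empty partition."""
    sums = [0] * num_partitions
    partitions = [[] for _ in range(num_partitions)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        length = lengths[idx]
        candidates = [i for i in range(num_partitions) if sums[i] + length <= max_tokens]
        if not candidates:
            # Fallback: only an empty partition may take a sample that breaks the cap.
            candidates = [i for i in range(num_partitions) if not partitions[i]]
            if not candidates:
                raise ValueError(
                    f"sample {idx} ({length} tokens) exceeds max_tokens={max_tokens} "
                    "and no empty partition is left to isolate it"
                )
        target = min(candidates, key=lambda i: sums[i])
        partitions[target].append(idx)
        sums[target] += length
    return partitions
```

Because the isolated partition's sum already exceeds `max_tokens`, the normal capacity check keeps every later sample out of it, so the oversized sample stays alone. For example, `capped_partitions_relaxed([4500, 1000, 1000], 2, 4096)` yields `[[0], [1, 2]]`.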

If this is not desired, e.g. because we want to strictly enforce max-tokens-per-gpu, we should raise a more meaningful error message and document the limitation, or reject the configuration when parsing the config.
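If the strict limit is preferred, the check at config-parsing time could look roughly like the sketch below. `validate_token_limits` is a hypothetical helper, not an existing slime function; its parameter names simply mirror the two flags from the repro config.

```python
def validate_token_limits(rollout_max_response_len: int, max_tokens_per_gpu: int) -> None:
    """Hypothetical config check: reject settings where one response alone can break the cap."""
    if rollout_max_response_len > max_tokens_per_gpu:
        raise ValueError(
            f"--rollout-max-response-len ({rollout_max_response_len}) exceeds "
            f"--max-tokens-per-gpu ({max_tokens_per_gpu}); a single sample may not "
            "fit in any micro-batch."
        )
```

This check is only a lower bound: a sample can still exceed the cap once the prompt length is added, so a runtime error with a clear message would still be needed.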

Actual Behavior

_get_capped_partitions raises AssertionError("This should never happen.").

Environment

  • slime version: v0.2.4 (commit 286750a)
  • Python version: 3.12.3
  • PyTorch version: 2.9.1+cu129
  • CUDA version: 12.9
  • GPU type and count: NVIDIA H200, 8 per node (4 nodes, 32 total)
  • OS: Linux (Amazon Linux 2023, kernel 6.1.141)
  • SGLang version: 0.5.9
  • Megatron-LM version: 0.16.0

Logs

Additional Context

No response

Pre-submission Checklist

  • I have read the CONTRIBUTING.md and understand the collaboration scope.
  • I have read the documentation and my issue is not addressed there.
  • I have searched for existing issues and this is not a duplicate.
  • I have provided a minimal, reproducible example.

Labels: bug
