[trainer, perf] feat: wire use_mask_nesting end-to-end, nest prompts/responses #6009
Draft
tongyx361 wants to merge 13 commits into verl-project:main from
Conversation
- MaskNestingSpec-based nest/unnest for TensorDict (mask-driven RLE)
- Field registry (KNOWN_FIELD_TO_MASK_AND_PAD) with DynamicPadValue for token IDs
- Dtype compression (KNOWN_FIELD_DTYPE_COMPRESSIONS) and shape permutation (KNOWN_FIELD_PERMUTATIONS)
- Two-layer response extraction: UnnestContext + ResponseSliceContext (library vs PPO)
- extract_response() dispatching over legacy/new path for A/B experiments
- data_io timing metrics: compress, decompress, wg_dispatch/execute/collect per method
- Tests: nest/unnest roundtrip, dispatch compat, response extraction, legacy equivalence

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
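The mask-driven RLE idea can be sketched with plain dense ops. This is a minimal illustration, not the `MaskNestingSpec` implementation (which additionally handles dtype compression, shape permutations, and jagged NestedTensor layout); the helper names are hypothetical.

```python
import torch

def nest_by_mask(values: torch.Tensor, mask: torch.Tensor):
    """Pack a padded (bsz, max_len) tensor into flat values + per-row lengths."""
    lengths = mask.sum(dim=-1)      # valid tokens per row (the RLE run lengths)
    flat = values[mask]             # drop every padding position
    return flat, lengths

def unnest_by_mask(flat, lengths, max_len, pad_value=0):
    """Rehydrate a dense right-padded tensor from flat values + lengths."""
    bsz = lengths.numel()
    out = torch.full((bsz, max_len), pad_value, dtype=flat.dtype)
    out[torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)] = flat
    return out

# Roundtrip on a toy right-padded batch: 3 rows, width 5.
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 0, 0, 0, 0],
                     [1, 1, 1, 1, 1]], dtype=torch.bool)
vals = torch.arange(15).reshape(3, 5)
flat, lens = nest_by_mask(vals, mask)
dense = unnest_by_mask(flat, lens, max_len=5)
assert torch.equal(dense[mask], vals[mask])  # valid tokens survive the roundtrip
```

At an 8.4% fill ratio (as in the run below), dropping padding positions this way is what sets the ~12× compression ceiling on wire bytes.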
`nest_batch_by_mask` pops response_mask, so the post-nest
`if "response_mask" in batch_td` check never fired and loss_mask was
never aliased in the nesting path. Workers then hit
KeyError('loss_mask') in forward_backward_batch.
Alias before nesting and register loss_mask in the nesting spec so it
gets nested alongside response_mask-paired fields as an all-ones
jagged tensor with per-row length equal to the response token count.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`old_log_probs`, `ref_log_prob`, `rollout_log_probs`, `entropys` are stored back into the batch as (bsz, max_response_len) after `_decompress_model_outputs` -> `extract_response`. Pairing them with `attention_mask` (full prompt+response axis) in KNOWN_FIELD_TO_MASK_AND_PAD tripped the shape-prefix assertion in `make_mask_nesting_specs` the next time `_compress_batch` was called (from `_update_actor`):

    AssertionError: field 'old_log_probs' has shape (8192, 14336); expected
    leading (*batch_dims, *sample_mask_dims) to match mask 'attention_mask'
    shape (8192, 16384).

Move them to `response_mask` alongside the other response-only quantities (advantages / returns / ...). The `teacher_*` distillation pair and `routed_experts` remain on `attention_mask` since they are full-length.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
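The tripped assertion reduces to a shape-prefix check, sketched here with a toy batch size standing in for 8192 (hypothetical helper name; the real check lives in `make_mask_nesting_specs`):

```python
import torch

def check_field_mask_pairing(field_name, field, mask_name, mask):
    """Shape-prefix check: a field's leading dims must equal its paired mask's shape."""
    assert field.shape[: mask.dim()] == mask.shape, (
        f"field '{field_name}' has shape {tuple(field.shape)}; expected leading "
        f"dims to match mask '{mask_name}' shape {tuple(mask.shape)}"
    )

# Toy batch size 8 standing in for 8192.
attention_mask = torch.ones(8, 16384, dtype=torch.bool)  # prompt+response axis
response_mask = torch.ones(8, 14336, dtype=torch.bool)   # response axis only
old_log_probs = torch.zeros(8, 14336)                    # stored response-only

# Pairing with the response-axis mask passes; the full-axis mask fails.
check_field_mask_pairing("old_log_probs", old_log_probs, "response_mask", response_mask)
try:
    check_field_mask_pairing("old_log_probs", old_log_probs, "attention_mask", attention_mask)
    mismatch_caught = False
except AssertionError:
    mismatch_caught = True  # (8, 14336) vs (8, 16384): the assertion from the commit
assert mismatch_caught
```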
Worker-side loss fns (ppo_loss, value_loss) still select "response_mask" via `data.select(...)`. In the nested path, nest_in_td pops response_mask, so train_batch hit:

    KeyError: 'key "response_mask" not found in TensorDict ...'
    (in losses.py ppo_loss: data = data.select(*fields).to_padded_tensor())

Alias `response_mask = loss_mask` post-nest; both refer to the same all-ones nested tensor in this mode. Drop the alias before unnest_batch_by_mask in _decompress_batch so the real 2D mask rehydrates correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`prepare_response_slice` pinned ``max_response_len`` to the spec-stashed full-batch ``orig_response_len`` (e.g. 14336), but worker-side code paths such as ``ppo_loss`` unpack nested response-axis fields via ``TensorDict.to_padded_tensor()``, which pads to the *per-micro-batch* local max (e.g. 301). The two shapes then refused to broadcast:

    RuntimeError: The size of tensor a (14336) must match the size of
    tensor b (301) at non-singleton dimension 1
    (in compute_policy_loss_vanilla: log_prob - old_log_prob)

Use ``response_lens.max()`` so extract_response's output dense tensor has the same local-max width as the other response-axis fields after ``.to_padded_tensor()``. The mask still carries exact per-row lengths, so nothing downstream is lost by the tighter padding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
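The width mismatch can be reproduced with toy sizes (10 standing in for 14336, and a chunk-local max of 5 standing in for 301); this is an illustration of the failure mode, not the verl code:

```python
import torch

# Toy widths standing in for 14336 (full-batch max) and the chunk-local max.
full_batch_max = 10
lengths = torch.tensor([3, 5])      # per-row response lengths in this micro-batch
local_max = int(lengths.max())      # what to_padded_tensor() pads to (5 here)

a = torch.zeros(2, full_batch_max)  # extract_response output before the fix
b = torch.zeros(2, local_max)       # loss-fn side after to_padded_tensor()
try:
    _ = a - b                       # the broadcast failure from the traceback
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True         # sizes 10 vs 5 at dim 1 refuse to broadcast

# The fix: size extract_response's output by the same local max.
a_fixed = torch.zeros(2, local_max)
assert (a_fixed - b).shape == (2, local_max)
```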
This reverts commit 568e93e.
…cal width
Response-axis shapes were derived from tensor shapes (MaskNestingSpec
``mask_shape``, or per-micro-batch jagged max) in two independent places
and drifted apart after worker chunking:
* ``extract_response``/``slice_response`` pad to the nesting spec's
stashed ``mask_shape[-1]`` — the *original full-batch* max (e.g.
14336).
* ``TensorDict.to_padded_tensor()`` in worker loss code pads to the
*per-micro-batch local* max (e.g. 301).
When the two widths collide downstream (``log_prob - old_log_prob``
inside ``compute_policy_loss_vanilla``) PyTorch refuses to broadcast:
RuntimeError: The size of tensor a (14336) must match the size of
tensor b (301) at non-singleton dimension 1
Record the task-level ``max_prompt_length`` / ``max_response_length``
straight from ``config.data`` onto the batch (via ``tu.assign_non_tensor``)
in ``_compress_batch``. Read them back as the single source of truth:
* ``prepare_response_slice`` prefers the stashed config width over
shape inference.
* New ``select_and_pad_to_response`` helper replaces
``.select(...).to_padded_tensor()`` in ``ppo_loss`` / ``value_loss``
and right-pads the response dim of every selected field up to the
same canonical width.
Both sides of every response-axis arithmetic now produce
``(bs, max_response_length)``, independent of whichever chunk landed on
a worker. Reverts 568e93e's narrower local-max workaround.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
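A minimal sketch of the canonical-width padding described above (hypothetical standalone helper; the real `select_and_pad_to_response` also performs the TensorDict field selection):

```python
import torch
import torch.nn.functional as F

def pad_response_dim(t: torch.Tensor, canonical_len: int, pad_value=0.0):
    """Right-pad the last (response) dim up to a canonical width."""
    deficit = canonical_len - t.shape[-1]
    assert deficit >= 0, "field wider than the canonical response length"
    return F.pad(t, (0, deficit), value=pad_value) if deficit else t

MAX_RESPONSE_LENGTH = 8              # stands in for the config-stashed width
log_prob = torch.randn(4, 5)         # one chunk-local width
old_log_prob = torch.randn(4, 3)     # a different chunk-local width

lp = pad_response_dim(log_prob, MAX_RESPONSE_LENGTH)
olp = pad_response_dim(old_log_prob, MAX_RESPONSE_LENGTH)
assert lp.shape == olp.shape == (4, MAX_RESPONSE_LENGTH)
ratio = lp - olp                     # now broadcast-safe regardless of chunking
```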
Snapshots the real pre-compress TensorDict, runs the legacy (left_right_2_no_padding) and nesting (nest_batch_by_mask) paths side-by-side on identical clones, and prints per-field dtype/shape/nested flag/bytes plus total raw bytes and cloudpickle size. Fires exactly once per trainer instance (guarded by ``_bench_compress_done``), so later steps are untouched. Temporary — revert after collecting the numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s_batch" This reverts commit fa5691f.
Payload bench on a real training batch (8192 samples, fill ratio 8.4%)
showed that ``nest_batch_by_mask`` was achieving only ~3.4× compression
over the legacy dense path, far below the ~12× ceiling implied by the
actual token occupancy. The culprit was that ``prompts`` (134 MB) and
``responses`` (940 MB) were missing from ``KNOWN_FIELD_TO_MASK_AND_PAD``
and silently stayed dense — together accounting for ~78% of the
post-nest bytes on the wire.
Fix it in three parts:
1. Registry entries for both:
* ``"responses": ("response_mask", PAD_TOKEN_ID)`` — right-padded
response token IDs, already response-axis aligned.
* ``"prompts": ("prompt_mask", PAD_TOKEN_ID)`` — left-padded prompt
token IDs, paired with a new ``prompt_mask`` field derived from
``attention_mask[:, :max_prompt_length]`` in ``_compress_batch``.
``_decompress_batch`` drops ``prompt_mask`` post-unnest since the
trainer never had it to begin with.
2. Coverage check inside ``nest_batch_by_mask``: after all specs run,
scan the batch for any tensor field that is still dense (ignoring
RLE offsets/lengths and a small default-ignore set). Report each
offender with ``shape`` / ``dtype`` / ``bytes`` so the reader can
either add a registry entry or opt out via ``ignore_dense_fields``.
3. New ``strict`` kwarg on ``nest_batch_by_mask`` (wired to
``trainer.strict_mask_nesting``, default False). On ``True`` the
coverage check raises ``RuntimeError`` instead of warning — useful
in CI / dev setups to catch regressions when new fields are added
to the batch without a matching registry entry.
Expected post-change compression on the same batch: ~11.9× vs legacy
(close to the 12× theoretical upper bound set by the fill ratio).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
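The three parts above can be sketched together on a toy batch. This is a simplified illustration with hypothetical field names (`uid_tensor`) and a dict standing in for the TensorDict, not the verl implementation:

```python
import torch

MAX_PROMPT_LEN = 4
batch = {
    "attention_mask": torch.tensor([[0, 1, 1, 1, 1, 1, 0, 0, 0, 0]], dtype=torch.bool),
    "prompts": torch.tensor([[0, 11, 12, 13]]),         # left-padded prompt ids
    "responses": torch.tensor([[21, 22, 0, 0, 0, 0]]),  # right-padded response ids
    "uid_tensor": torch.zeros(1, 3),                    # unregistered: stays dense
}

# Derive prompt_mask from the prompt slice of attention_mask, as the commit
# describes; response_mask is the remainder.
batch["prompt_mask"] = batch["attention_mask"][:, :MAX_PROMPT_LEN]
batch["response_mask"] = batch["attention_mask"][:, MAX_PROMPT_LEN:]

# Nest each registered field by its paired mask (sketched as flat values).
registry = {"prompts": "prompt_mask", "responses": "response_mask"}
nested = {field: batch[field][batch[mask]] for field, mask in registry.items()}

# Coverage check: report any tensor field that stayed dense after nesting.
ignore = {"attention_mask", "prompt_mask", "response_mask"}
still_dense = [k for k in batch if k not in registry and k not in ignore]
print("dense leftovers:", still_dense)  # prints: dense leftovers: ['uid_tensor']
```

In strict mode the real coverage check would raise a `RuntimeError` for `uid_tensor` instead of just reporting it.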
Snapshots the real pre-compress TensorDict, runs the legacy (left_right_2_no_padding) and nesting (nest_batch_by_mask) paths side-by-side on identical clones, and prints per-field dtype/shape/nested flag/bytes plus total raw bytes and cloudpickle size. Fires exactly once per trainer instance (guarded by ``_bench_compress_done``), so later steps are untouched. Temporary — revert after collecting the numbers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s_batch" This reverts commit de349a9.
Code Review
This pull request introduces a mask-driven Run-Length Encoding (RLE) nesting mechanism for variable-length sequences, providing an alternative to the legacy flash_attn-dependent padding logic. Key additions include RLE utilities in verl/utils/nested_tensor.py, integration into the RayPPOTrainer for batch compression/decompression, and updated loss functions. Review feedback points out a potential IndexError in the response slicing logic when handling empty rows and advises using public PyTorch APIs instead of internal NestedTensor attributes to maintain compatibility with future versions.
Addresses gemini-code-assist review on verl-project#6009:

* ``_rle_scatter_indices`` in ``verl/utils/nested_tensor.py`` was reaching into the private ``NestedTensor._offsets`` attribute. Switch to the public ``offsets()`` method, which returns the same jagged-layout offsets and is stable across PyTorch releases.
* ``prepare_response_slice`` in ``verl/workers/utils/padding.py`` did the same, and additionally would pick up the next non-empty row's first segment when a row's mask had zero True positions (pre-LLM edge cases, defence-in-depth for reuse outside RL). Switch to the public API and gate the scatter on ``has_segments`` so empty rows keep ``abs_start = 0`` and don't corrupt the slice bounds for rows that follow them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
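The empty-row guard can be sketched as follows. The helper is hypothetical (the real fix gates a scatter on ``has_segments`` inside `prepare_response_slice` and reads offsets via the public ``offsets()`` method), but it shows the invariant being protected:

```python
import torch

def first_true_index(mask: torch.Tensor) -> torch.Tensor:
    """Per-row index of the first True in a 2D mask; all-False rows return 0."""
    has_segments = mask.any(dim=-1)
    starts = mask.to(torch.int64).argmax(dim=-1)  # first True per row
    # Gate on has_segments so empty rows keep the safe default of 0 instead of
    # inheriting a neighbour's segment in a scatter-style implementation.
    return torch.where(has_segments, starts, torch.zeros_like(starts))

mask = torch.tensor([[0, 0, 1, 1],
                     [0, 0, 0, 0],   # empty row: abs_start must stay 0
                     [1, 1, 0, 0]], dtype=torch.bool)
assert first_true_index(mask).tolist() == [2, 0, 0]
```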
What does this PR do?
Make the `use_mask_nesting=True` path usable end-to-end and push its dispatch payload compression close to the theoretical ceiling.

The base commit b6e6562e (`feat: nested tensor utilities`) landed the library-level utilities (`nest_batch_by_mask` / `MaskNestingSpec` / `extract_response`), but running with `trainer.use_mask_nesting=True` crashed at the first `_compress_batch` call. This PR:

- Fixes the bugs blocking that path (`loss_mask` / `response_mask` plumbing, response-axis shape mismatches).
- Stashes canonical `max_prompt_length` / `max_response_length` on the batch so `extract_response` and `TensorDict.to_padded_tensor` agree on the response-axis width.
- Extends `KNOWN_FIELD_TO_MASK_AND_PAD` with `prompts` / `responses` (which were silently staying dense and accounting for ~78% of post-nest wire bytes) by deriving a `prompt_mask` from `attention_mask[:, :max_prompt]` at compress time.
- Adds a coverage check to `nest_batch_by_mask` that reports (or, with `trainer.strict_mask_nesting=True`, raises on) any tensor field still dense after nesting.
No related issue; the Slack / GH discussion on the feature commit b6e6562e is the context. Duplicate search: `gh pr list --repo verl-project/verl --state open --search "use_mask_nesting"` / `"nest_batch_by_mask"` returned none relevant.

Checklist Before Starting

- Searched for similar open PRs: `gh pr list --repo verl-project/verl --state open --search "use_mask_nesting"`.
- Title follows `[{modules}] {type}: {description}`: `[trainer, perf] feat: make use_mask_nesting usable end-to-end, nest prompts/responses` (no `[BREAKING]`; defaults unchanged, and the new `trainer.strict_mask_nesting` defaults to `False`).

Test
Not CI-testable in its full form (requires Ray + 16×A100 + vLLM rollout). Validated on a live RL run: Qwen2.5-Math-7B, 2 nodes × 8×A100, DAPO-Math-17k, `max_prompt_length=2048`, `max_response_length=14336`, `train_prompt_bsz=512`, `rollout.n=16` (flat batch 8192), `adv=rloo`, Ulysses SP=4, FSDP=8, vLLM TP=1.
Dispatch timings, steady-state averages over steps 2-4, for `wg_dispatch/actor_rollout_compute_log_prob` and `wg_dispatch/actor_rollout_update_actor`, comparing the baseline (`use_mask_nesting=False`), the pre-registry-fix nesting path, and the final path (the per-metric table did not survive extraction; net dispatch speedup 2.63×).

Modelling dispatch as T = α + β · bytes solves to α ≈ 6 s (Ray RPC / serialize fixed cost) + β ≈ 14 s (payload-proportional). The observed 2.63× corresponds to a back-solved payload compression ≈ 11.2×, against a ~11.9× theoretical ceiling from the batch's 8.4% valid-token fill ratio. The coverage check reports no remaining dense tensor fields on this task.
Training loss / grad-norm / MFU remain stable across observed steps (step 4 throughput 1216 tok/s, actor MFU 39.5%, actor_infer MFU 60.6%).

Validation runs:

- Baseline (`use_mask_nesting=False`): job `fd2a521f2e3e80d0`, 100+ steps completed; dispatch reference numbers.
- Job `b09b7e4e0efee766`, `run_times=0`: crashes at `forward_backward_batch` with `KeyError('loss_mask')`.
- Job `b09b7e4e0efee766`, `run_times=13`: training proceeds, dispatch 1.96×.
- Job `23498c720a9d1e9c`: 4 training steps complete, dispatch 2.63×, no warning, loss stable.
Numerical equivalence vs baseline (first 4 steps, same config fork, same model init; the per-step `critic/score/mean` comparison table did not survive extraction): `score/{max,min} = ±1` match exactly on all four steps; `entropy`, `grad_norm`, `pg_loss`, `response_length/mean` all land in the same distribution. The millidecimal-level drift on `score/mean` corresponds to a few dozen of the 8192 samples flipping binary correctness, well within vLLM rollout non-determinism (PagedAttention scheduling order perturbs logits at ULP scale, which under `temperature=1.0` sampling diverges trajectories even with identical prompts + model weights). No evidence of systematic bias; consistent with nesting being purely a transport-layer change.
Local checks:

    pre-commit install
    pre-commit run --all-files --show-diff-on-failure --color=always  # passes

CI unit tests still cover the utilities via the existing `tests/utils/test_nested_tensor_on_cpu.py` / `tests/utils/test_padding_on_cpu.py` from the base feature commit.

API and Usage Example
No required user-facing changes. Defaults preserve existing behaviour. New surface:

- Two new `trainer.*` knobs: `use_mask_nesting` and `strict_mask_nesting`.
- Registry extension (library-side, no user action required).
- New `nest_batch_by_mask` kwargs (both with backward-compatible defaults).
- New library helper used by `ppo_loss` / `value_loss`: `select_and_pad_to_response`.

Design & Code Changes
Bug fixes required to execute `use_mask_nesting=True`

Each surfaced as the previous one was unblocked; cumulative in the final diff.

- `_compress_batch`: alias `loss_mask = response_mask` before `nest_batch_by_mask` (and list `loss_mask` in `field_to_mask_and_pad`) so it becomes a nested all-ones jagged field. Workers' `forward_backward_batch` was hitting `KeyError('loss_mask')` because `nest_in_td` pops `response_mask`, making the old post-nest alias branch dead.
- `KNOWN_FIELD_TO_MASK_AND_PAD`: re-pair `old_log_probs` / `ref_log_prob` / `rollout_log_probs` / `entropys` with `response_mask`. `_decompress_model_outputs` stores them as `(bsz, max_response_len)`, so the shape-prefix assertion against `attention_mask` (max_total) was firing in `_update_actor`.
- `_compress_batch`: restore `response_mask` as a nested alias of `loss_mask` after nesting so worker-side `ppo_loss` / `value_loss` `data.select("response_mask", ...)` keeps working; `_decompress_batch` drops the alias before `unnest_batch_by_mask`.

Canonical max-length carried on the batch

Two independent code paths were deriving the response-axis output width from tensor shapes and drifting after worker chunking:

- `extract_response` / `slice_response` padded to the nesting spec's stashed `mask_shape[-1]` (original full-batch max).
- `TensorDict.to_padded_tensor()` in loss fns padded to the per-micro-batch local max.

The two collided in `compute_policy_loss_vanilla` (`log_prob - old_log_prob`: `(bs, 14336)` vs `(bs, 301)`). Fix:

- `_compress_batch` writes `config.data.{max_prompt_length, max_response_length}` onto the batch via `tu.assign_non_tensor`.
- `prepare_response_slice` reads the stashed value first, falls back to shape inference.
- `select_and_pad_to_response` in `verl/workers/utils/padding.py` replaces `data.select(...).to_padded_tensor()` in `ppo_loss` / `value_loss` and right-pads each selected field's response dim up to the canonical width, so every response-axis arithmetic sees matching shapes regardless of chunking.

Coverage / strict mode

- `prompts` paired with a new `prompt_mask` (derived from `attention_mask[:, :max_prompt_length]` in `_compress_batch`, popped by `_decompress_batch` after unnest) and `responses` paired with `response_mask`.
- `nest_batch_by_mask` walks the TD once after nesting and reports any tensor field that stayed dense, excluding `_DEFAULT_IGNORE_DENSE = {"dummy_tensor"}` and spec-owned RLE offsets/lengths. Offenders are printed with `shape` / `dtype` / `bytes`.
- `trainer.strict_mask_nesting` flag (default `False`): when `True`, the coverage check raises `RuntimeError` instead of warning.

Files touched
Checklist Before Submitting

- `pre-commit run --all-files` passes.
- Docs: `b6e6562e` did not add user docs for `use_mask_nesting`; a follow-up doc PR will describe the switch and the new `strict_mask_nesting` flag once this lands.
- Tests: `tests/utils/test_nested_tensor_on_cpu.py` and `tests/utils/test_padding_on_cpu.py` continue to cover the library utilities. The end-to-end behaviour (training-step convergence with `use_mask_nesting=True`) requires a GPU-backed e2e test; I can extend `.github/workflows/e2e_ppo_trainer.yml` with a `use_mask_nesting=True` variant in a follow-up once reviewers are okay with the test-hour budget.
- `ci-request` Slack ping: will do after addressing review.
- No `recipe` submodule change.
- AI-assisted: written with Claude (Claude Opus 4.6 1M context, via Claude Code). Every changed line was reviewed by the submitter and the rationale was validated end-to-end on the running RL workload before being committed. Commits carry `Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>` trailers.