
[data, trainer] fix: batch padding for multi-trajectory#5969

Open
ZhentaoFan wants to merge 2 commits into verl-project:main from ZhentaoFan:pr/batch_padding

Conversation

ZhentaoFan (Contributor) commented Apr 11, 2026

What does this PR do?

Background

In the current tq_trainer, AgentLoopWorkerTQ already supports the multi-trajectory feature. During actual training, however, each batch still requires sample (trajectory)-level padding so that the number of samples is divisible by both dp_size and mini_batch_size; otherwise an error is thrown. This PR fixes the bug with the following design considerations:

Upsampling:

  1. LCM alignment: pad the sample count so that it satisfies % dp_size == 0 and % mini_batch_size == 0 (and % critic_mini_batch_size == 0 when training the critic).

Padding:

  1. Padded samples use independent UIDs to avoid interfering with GRPO advantage computation.
  2. Padded samples are constructed with the shortest possible sequence — one prompt token + one response token — to minimize redundant computation.
  3. An is_padding flag is added to the tags of padded samples to avoid impacting the accuracy metrics such as score, reward, and response length (while performance metrics still include padded samples).
  4. Maintains the position_ids shape and adds a multi-modal inputs placeholder in the VLM case when necessary.

Verification Experiments with Multi-Trajectory Agent:

The three smallest primes — 2, 3, and 5 — are chosen to form the relevant hyperparameters:

[dp=2, batch_size=45, mini_batch_size=15, rollout.n=8].

(Figure: critic_score_mean curve from the verification run.)


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements batch upsampling using synthetic padding sequences to ensure global batch sizes align with PPO mini-batch requirements for actors and critics. Key additions include methods for calculating the required batch multiple, constructing minimal padding templates, and filtering these samples during metric computation. Review feedback highlights potential compatibility issues with VLM and MoE models, specifically regarding the shape of response masks, the dimensionality of position IDs, and the removal of multi-modal or expert-related fields in padding samples.

prompts = torch.full((1,), token_id, dtype=torch.int64)
input_ids = prompts.repeat(2)
attention_mask = torch.ones_like(input_ids, dtype=torch.int64)
response_mask = torch.zeros_like(prompts)


Severity: high

The response_mask should have the same shape as input_ids (length 2) to ensure consistency with real samples. In verl, sequence-level masks like response_mask and loss_mask typically match the full sequence length (input_ids). A shape mismatch here will cause list_of_dict_to_tensordict to fail when stacking padding samples into a batch, or lead to errors in the model's forward pass.

Suggested change
response_mask = torch.zeros_like(prompts)
response_mask = torch.zeros_like(input_ids)
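The shape constraint behind this suggestion can be checked in isolation. This is a standalone sketch of the padding sample's tensors (the pad token id is a placeholder), showing that with the fix all sequence-level tensors share the full sequence length:

```python
import torch

token_id = 0  # hypothetical pad token id
prompts = torch.full((1,), token_id, dtype=torch.int64)   # one prompt token
input_ids = prompts.repeat(2)                             # prompt + one response token
attention_mask = torch.ones_like(input_ids, dtype=torch.int64)
# Sequence-level masks must match the full sequence length; otherwise
# stacking padding samples alongside real samples fails.
response_mask = torch.zeros_like(input_ids)
```

Here input_ids, attention_mask, and response_mask are all length 2, so the padding sample stacks cleanly with real samples of the same (padded) length.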

responses=prompts.clone(),
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=compute_position_id_with_mask(attention_mask.unsqueeze(0)).squeeze(0),


Severity: high

Manually computing position_ids as a 1D tensor may cause crashes with models that expect multi-dimensional position IDs (e.g., VLMs with 2D RoPE or specific multimodal architectures). If the padding samples have 1D position_ids while the real samples in the batch have 2D position_ids, batch collation or the model's forward pass will fail. It is safer to derive the structure from the source_td or use the model's specific position ID computation logic.
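One way to follow this suggestion is to mirror the dimensionality of a real sample's position_ids rather than hard-coding a 1D tensor. The helper below is a hypothetical sketch, assuming the trailing dimension of position_ids is the sequence dimension (1D for standard LMs, 2D or higher for e.g. VLM RoPE variants):

```python
import torch

def make_padding_position_ids(source_position_ids: torch.Tensor,
                              seq_len: int) -> torch.Tensor:
    """Build position_ids for a padding sample with the same leading
    dimensions as a real sample's position_ids, replacing only the
    trailing (sequence) dimension with the padding length."""
    leading = source_position_ids.shape[:-1]
    arange = torch.arange(seq_len, dtype=source_position_ids.dtype)
    return arange.expand(*leading, seq_len).clone()
```

For a real sample with (3, L)-shaped position_ids (e.g. 2D RoPE with three planes), this yields a (3, seq_len) tensor; for a plain 1D sample it yields a 1D tensor, so batch collation sees a consistent shape either way.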

Comment on lines +1052 to +1053
template_sample.pop("multi_modal_inputs", None)
template_sample.pop("routed_experts", None)


Severity: high

Popping multi_modal_inputs and routed_experts makes the padding samples incompatible with VLM and MoE models. These fields are typically required by the model during the forward pass (e.g., in compute_log_prob). If these keys are present in real samples but missing in padding samples, tq.kv_batch_get will fail to retrieve them for the padding keys, or the model will crash due to missing inputs. Instead of popping them, consider providing dummy values (e.g., zeros or empty structures) that match the padding sequence length.
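The dummy-value alternative suggested here might look like the sketch below. The helper name is hypothetical, and it assumes the sequence dimension is the last tensor dimension; non-tensor entries (e.g. nested multi-modal dicts) are passed through unchanged rather than reconstructed:

```python
import torch

def fill_padding_extras(template_sample: dict, real_sample: dict,
                        seq_len: int) -> dict:
    """Instead of popping model-required keys, mirror each one from a real
    sample as a zero-valued placeholder sized to the padding sequence length."""
    for key in ("multi_modal_inputs", "routed_experts"):
        if key not in real_sample:
            continue
        value = real_sample[key]
        if isinstance(value, torch.Tensor):
            # Keep leading dims, replace the (assumed last) sequence dim.
            shape = (*value.shape[:-1], seq_len)
            template_sample[key] = torch.zeros(shape, dtype=value.dtype)
        else:
            template_sample[key] = value  # non-tensor metadata: pass through
    return template_sample
```

This keeps the key set identical between real and padding samples, so batched key-value retrieval and the model forward pass see no missing inputs.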

@ZhentaoFan ZhentaoFan marked this pull request as draft April 11, 2026 05:05
@ZhentaoFan ZhentaoFan marked this pull request as ready for review April 11, 2026 06:08
@ZhentaoFan (Contributor, Author) commented:

@eric-haibin-lin @vermouth1992 @tongyx361 @PeterSH6 Ready for review.
