[rollout] feat: chunk large tensors in bucketed weight transfer #5980
nathon-lee wants to merge 3 commits into verl-project:main from
Conversation
…project#5836) Signed-off-by: nathon <[email protected]>
Signed-off-by: nathon <[email protected]>
Code Review
This pull request implements chunked weight transfer for vLLM rollout, allowing tensors that exceed the bucket size to be split and transmitted across multiple buckets. It adds a new test suite for large-tensor chunking and refactors the weight receiver to handle partial tensor reassembly, while also improving memory management by ensuring garbage collection runs before shared memory is closed. One performance issue was identified in the weight receiver: an inefficient double copy is performed when moving chunks to the GPU; copying directly into the target slice is recommended instead.
t = chunk_tensor.to(self.device, non_blocking=True)
partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(t)
When use_shm is enabled and the target device is a GPU, this implementation performs an inefficient double copy. First, chunk_tensor.to(self.device) allocates a temporary GPU tensor and copies data from CPU to GPU. Then, partial_1d.copy_(t) copies data from that temporary GPU tensor to the final destination.
You can achieve the same result with a single Host-to-Device copy by directly using copy_ on the target tensor slice.
else:
    partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(chunk_tensor, non_blocking=True)

References
- In PyTorch, gradients can flow back through in-place assignments to slices of a tensor, making this optimization safe for autograd.
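The single-copy pattern the reviewer recommends can be illustrated without a GPU: each incoming chunk is written straight into a slice of the pre-allocated destination buffer, with no staging in a temporary tensor. This is a minimal torch-free sketch using a `bytearray` in place of `partial_1d`; the function name `receive_chunk` is illustrative, not from the PR:

```python
def receive_chunk(dest: bytearray, tensor_offset: int, chunk: bytes) -> None:
    """Analogue of partial_1d[off : off + n].copy_(chunk_tensor).

    The chunk lands directly in the destination slice; there is no
    intermediate allocation, mirroring the reviewer's suggested fix.
    """
    dest[tensor_offset : tensor_offset + len(chunk)] = chunk

# Reassemble a 10-byte payload that arrived as three chunks.
dest = bytearray(10)
for offset, chunk in [(0, b"abcd"), (4, b"efgh"), (8, b"ij")]:
    receive_chunk(dest, offset, chunk)
print(bytes(dest))  # b'abcdefghij'
```

In the real PyTorch code the same shape applies, with `copy_(chunk_tensor, non_blocking=True)` performing a single host-to-device transfer into the target slice.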
Thanks for the suggestion!
I have updated the code to remove the inefficient double copy. The chunk is now copied directly into the pre-allocated slice on the target GPU, bypassing the temporary .to(device) allocation, which looks like this:
partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(chunk_tensor, non_blocking=True)
This reduces intermediate memory allocation and improves throughput when reassembling large tensors. The changes have been pushed!
Hi @wuxibin89, gentle ping for review when you have a moment. This PR adds support for chunking large tensors during transfer. Please let me know if any changes are needed. Thanks!
What does this PR do?
This PR resolves an issue in the vLLM Rollout where the process crashes if a single model weight (tensor) exceeds the pre-configured shared memory bucket limit (update_weights_bucket_megabytes). By introducing a Chunked Weight Transfer mechanism, the sender now automatically slices oversized tensors into multiple chunks that fit the bucket size, and the receiver caches and reassembles these chunks. This removes the restriction that forced users to manually increase the bucket size for large models.
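The sender-side slicing described above can be sketched in plain Python. This is a hedged illustration, not the PR's implementation; the helper name `chunk_ranges` is hypothetical:

```python
def chunk_ranges(tensor_nbytes: int, bucket_bytes: int) -> list[tuple[int, int]]:
    """Split a tensor's byte range into chunks no larger than bucket_bytes.

    Returns (offset, length) pairs. A tensor that fits in one bucket yields
    a single chunk; an oversized tensor yields several, so it can span
    multiple buckets and be reassembled on the receiver side.
    """
    if bucket_bytes <= 0:
        raise ValueError("bucket size must be positive")
    ranges = []
    offset = 0
    while offset < tensor_nbytes:
        length = min(bucket_bytes, tensor_nbytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 10 MiB tensor with a 4 MiB bucket is sent as 4 + 4 + 2 MiB chunks.
MiB = 1024 * 1024
print(chunk_ranges(10 * MiB, 4 * MiB))
```

The receiver can use the same (offset, length) bookkeeping to place each chunk into its pre-allocated buffer until the full tensor has arrived.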
Related Issue: Fixes #5836
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)

[vllm, rollout] fix: support chunked weight transfer for large tensors

Test
Validated the correct chunking and reassembly of both small tensors and large tensors that span multiple buckets via a new unit test. All tests passed successfully:
root@f228dd876cae:/workspace/verl_woo# python tests/workers/rollout/rollout_vllm/test_bucketed_weight_transfer.py
Tensor small_1: shape=torch.Size([128, 512]), Match = True
Tensor large_1: shape=torch.Size([640, 512]), Match = True
Tensor large_2: shape=torch.Size([1280, 512]), Match = True
Tensor small_2: shape=torch.Size([64, 256]), Match = True
All tests passed! Chunked weight transfer is successful.