[rollout] feat: chunk large tensors in bucketed weight transfer #5980

Open
nathon-lee wants to merge 3 commits into verl-project:main from nathon-lee:fix/issue-5836-bucket-size-limit

Conversation

@nathon-lee

What does this PR do?

This PR resolves an issue in the vLLM rollout where the process crashes when a single model weight tensor exceeds the pre-configured shared-memory bucket limit (update_weights_bucket_megabytes).
By introducing a chunked weight transfer mechanism, the sender now automatically slices oversized tensors into multiple chunks that fit the bucket size, and the receiver caches and reassembles these chunks. This removes the restriction that forced users to manually increase the bucket size for large models.
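The sender-side slicing described above can be sketched roughly as follows. This is an illustrative toy, not verl's actual API: iter_tensor_chunks and all names here are hypothetical, and the bucket size is expressed in bytes.

```python
import torch

def iter_tensor_chunks(name, tensor, bucket_bytes):
    """Yield (name, chunk_index, total_chunks, offset, chunk) so that every
    chunk's byte size fits within bucket_bytes. Small tensors yield one chunk."""
    flat = tensor.contiguous().view(-1)
    elems_per_chunk = max(1, bucket_bytes // flat.element_size())
    total_chunks = (flat.numel() + elems_per_chunk - 1) // elems_per_chunk
    for i in range(total_chunks):
        start = i * elems_per_chunk
        yield name, i, total_chunks, start, flat[start : start + elems_per_chunk]

# 10 float32 values (40 B) against a 16 B bucket -> chunks of 4, 4, and 2 elements.
w = torch.arange(10, dtype=torch.float32)
chunks = list(iter_tensor_chunks("w", w, bucket_bytes=16))
```

Concatenating the chunk views in order recovers the original flat tensor, which is the invariant the receiver relies on when reassembling.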

Related Issue: Fixes #5836

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: [is:pr is:open weight transfer bucket] No similar PRs found.
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • Suggested PR title: [vllm, rollout] fix: support chunked weight transfer for large tensors

Test

Validated correct chunking and reassembly of both small tensors and large tensors that span multiple buckets via a new unit test. All tests passed:

root@f228dd876cae:/workspace/verl_woo# python tests/workers/rollout/rollout_vllm/test_bucketed_weight_transfer.py 
Tensor small_1: shape=torch.Size([128, 512]), Match = True
Tensor large_1: shape=torch.Size([640, 512]), Match = True
Tensor large_2: shape=torch.Size([1280, 512]), Match = True
Tensor small_2: shape=torch.Size([64, 256]), Match = True
All tests passed! Chunked weight transfer is successful.
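For illustration, the receiver-side caching that the test exercises could look roughly like the sketch below. ChunkReassembler and its method names are hypothetical stand-ins, not the code in this PR; offsets are in elements.

```python
import torch

class ChunkReassembler:
    """Cache incoming chunks per tensor name and return the full flat tensor
    once every chunk has arrived; return None while the tensor is partial."""

    def __init__(self):
        self._partial = {}  # name -> [flat buffer, chunks received so far]

    def add_chunk(self, name, offset, total_elems, total_chunks, chunk):
        if name not in self._partial:
            self._partial[name] = [torch.empty(total_elems, dtype=chunk.dtype), 0]
        entry = self._partial[name]
        entry[0][offset : offset + chunk.numel()].copy_(chunk)  # write in place
        entry[1] += 1
        if entry[1] == total_chunks:
            del self._partial[name]  # drop the cache entry once complete
            return entry[0]
        return None

# A tensor split into two chunks: the first call caches, the second completes.
r = ChunkReassembler()
w = torch.arange(6, dtype=torch.float32)
first = r.add_chunk("w", 0, 6, 2, w[:4])
full = r.add_chunk("w", 4, 6, 2, w[4:])
```

Deleting the cache entry as soon as the last chunk lands keeps the receiver's peak memory bounded by the tensors currently in flight.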

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements chunked weight transfer for the vLLM rollout, allowing tensors that exceed the bucket size to be split and transmitted across multiple buckets. It adds a test suite for large-tensor chunking and refactors the weight receiver to handle partial tensor reassembly, ensuring garbage collection runs before shared memory is closed. One performance issue was identified: the weight receiver performs an inefficient double copy when moving chunks to the GPU; copying directly into the target slice is recommended instead.

Comment on lines +300 to +301
t = chunk_tensor.to(self.device, non_blocking=True)
partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(t)
Contributor

Severity: high

When use_shm is enabled and the target device is a GPU, this implementation performs an inefficient double copy. First, chunk_tensor.to(self.device) allocates a temporary GPU tensor and copies the data from CPU to GPU. Then, the slice copy_ moves the data from that temporary GPU tensor to the final destination.

You can achieve the same result with a single Host-to-Device copy by directly using copy_ on the target tensor slice.

            else:
                partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(chunk_tensor, non_blocking=True)
References
  1. In PyTorch, gradients can flow back through in-place assignments to slices of a tensor, making this optimization safe for autograd.
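To make the suggestion concrete, here is a toy CPU-only demonstration of copying a chunk straight into a slice of the destination buffer. Variable names mirror the snippet above, but the values are illustrative and the offsets here are in elements, not bytes; when the destination lives on a GPU, the same copy_ call performs a single host-to-device transfer with no temporary GPU tensor.

```python
import torch

# Destination buffer standing in for the receiver's partial_1d tensor.
partial_1d = torch.zeros(8, dtype=torch.float32)
chunk_tensor = torch.arange(4, dtype=torch.float32)  # stands in for a shm chunk
tensor_offset = 2
chunk_len = chunk_tensor.numel()

# One copy_ into the slice: dtype/device conversion happens inside copy_,
# so no intermediate tensor is allocated on the destination device.
partial_1d[tensor_offset : tensor_offset + chunk_len].copy_(
    chunk_tensor, non_blocking=True
)
```

The non_blocking=True flag only takes effect for pinned-host-to-GPU copies; on CPU it is a harmless no-op.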

Author

Thanks for the suggestion!

I have updated the code to remove the double copy. The chunk is now copied directly into the pre-allocated slice on the target GPU, bypassing the temporary .to(device) allocation:

partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(chunk_tensor, non_blocking=True)

This reduces the intermediate memory allocation and improves the throughput when reassembling large tensors. The changes have been pushed!

@nathon-lee
Author

Hi @wuxibin89, gentle ping for review when you have a moment. This PR adds support for chunking large tensors during transfer. Please let me know if any changes are needed. Thanks!
