[rollout] feat: chunk large tensors in bucketed weight transfer #5980
nathon-lee wants to merge 3 commits into verl-project:main from
Conversation
…project#5836) Signed-off-by: nathon <[email protected]>
Signed-off-by: nathon <[email protected]>
Code Review
This pull request implements chunked weight transfer for vLLM rollout, allowing tensors that exceed the bucket size to be split and transmitted across multiple buckets. It adds a new test suite for large-tensor chunking and refactors the weight receiver to handle partial tensor reassembly, while also improving memory management by ensuring garbage collection runs before shared memory is closed. One performance issue was identified in the weight receiver: an inefficient double copy is performed when moving chunks to the GPU; copying directly into the target slice is recommended instead.
t = chunk_tensor.to(self.device, non_blocking=True)
partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(t)
When use_shm is enabled and the target device is a GPU, this implementation performs an inefficient double copy. First, chunk_tensor.to(self.device) allocates a temporary GPU tensor and copies data from CPU to GPU. Then, partial_1d.copy_(t) copies data from that temporary GPU tensor to the final destination.
You can achieve the same result with a single Host-to-Device copy by directly using copy_ on the target tensor slice.
else:
    partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(chunk_tensor, non_blocking=True)

References
- In PyTorch, gradients can flow back through in-place assignments to slices of a tensor, making this optimization safe for autograd.
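The single-copy pattern the reviewer recommends can be illustrated without a GPU: each incoming chunk is written straight into a slice of the pre-allocated destination buffer, with no staging in a temporary tensor. This is a minimal torch-free sketch using a `bytearray` in place of `partial_1d`; the function name `receive_chunk` is illustrative, not from the PR:

```python
def receive_chunk(dest: bytearray, tensor_offset: int, chunk: bytes) -> None:
    """Analogue of partial_1d[off : off + n].copy_(chunk_tensor).

    The chunk lands directly in the destination slice; there is no
    intermediate allocation, mirroring the reviewer's suggested fix.
    """
    dest[tensor_offset : tensor_offset + len(chunk)] = chunk

# Reassemble a 10-byte payload that arrived as three chunks.
dest = bytearray(10)
for offset, chunk in [(0, b"abcd"), (4, b"efgh"), (8, b"ij")]:
    receive_chunk(dest, offset, chunk)
print(bytes(dest))  # b'abcdefghij'
```

In the real PyTorch code the same shape applies, with `copy_(chunk_tensor, non_blocking=True)` performing a single host-to-device transfer into the target slice.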
Thanks for the suggestion!
I have updated the code to remove the inefficient double copy. The chunk is now copied directly into the pre-allocated slice on the target GPU, bypassing the temporary .to(device) allocation, which looks like this:
partial_1d[tensor_offset : tensor_offset + chunk_bytes].copy_(chunk_tensor, non_blocking=True)
This reduces intermediate memory allocation and improves throughput when reassembling large tensors. The changes have been pushed!
Hi @wuxibin89, gentle ping for review when you have a moment. This PR adds support for chunking large tensors during transfer. Please let me know if any changes are needed. Thanks!
What does this PR do?
This PR resolves an issue in the vLLM Rollout where the process crashes if a single model weight (tensor) exceeds the pre-configured shared memory bucket limit (update_weights_bucket_megabytes). By introducing a Chunked Weight Transfer mechanism, the sender now automatically slices oversized tensors into multiple chunks that fit the bucket size, and the receiver caches and reassembles these chunks. This removes the restriction that forced users to manually increase the bucket size for large models.
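The sender-side slicing described above can be sketched in plain Python. This is a hedged illustration, not the PR's implementation; the helper name `chunk_ranges` is hypothetical:

```python
def chunk_ranges(tensor_nbytes: int, bucket_bytes: int) -> list[tuple[int, int]]:
    """Split a tensor's byte range into chunks no larger than bucket_bytes.

    Returns (offset, length) pairs. A tensor that fits in one bucket yields
    a single chunk; an oversized tensor yields several, so it can span
    multiple buckets and be reassembled on the receiver side.
    """
    if bucket_bytes <= 0:
        raise ValueError("bucket size must be positive")
    ranges = []
    offset = 0
    while offset < tensor_nbytes:
        length = min(bucket_bytes, tensor_nbytes - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 10 MiB tensor with a 4 MiB bucket is sent as 4 + 4 + 2 MiB chunks.
MiB = 1024 * 1024
print(chunk_ranges(10 * MiB, 4 * MiB))
```

The receiver can use the same (offset, length) bookkeeping to place each chunk into its pre-allocated buffer until the full tensor has arrived.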
Related Issue: Fixes #5836
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)

[vllm, rollout] fix: support chunked weight transfer for large tensors

Test
Validated the correct chunking and reassembly of both small tensors and large tensors that span multiple buckets via a new unit test. All tests passed successfully:
root@f228dd876cae:/workspace/verl_woo# python tests/workers/rollout/rollout_vllm/test_bucketed_weight_transfer.py
Tensor small_1: shape=torch.Size([128, 512]), Match = True
Tensor large_1: shape=torch.Size([640, 512]), Match = True
Tensor large_2: shape=torch.Size([1280, 512]), Match = True
Tensor small_2: shape=torch.Size([64, 256]), Match = True
All tests passed! Chunked weight transfer is successful.