[rollout, vllm, ckpt] fix: auto-adjust update_weights_bucket_megabytes based on embedding weight size#5963
Silas-11 wants to merge 4 commits into verl-project:main
Conversation
Code Review
This pull request introduces an automatic adjustment for the weight transfer bucket size based on the model's embedding dimensions to prevent transfer errors. The changes include a new utility function get_minimum_bucket_size_mb and its integration into the CheckpointEngineWorker and vLLMRollout classes. Feedback suggests extending this logic to trainer-side workers to ensure consistency and avoid potential hangs. Additionally, it is recommended to use the model's actual data type for size calculations instead of hardcoding a float32 assumption to prevent unnecessary memory over-allocation.
verl/checkpoint_engine/base.py
Outdated
```python
bucket_size_mb = get_minimum_bucket_size_mb(
    hf_config=self.model_config.hf_config,
    current_bucket_size_mb=self.rollout_config.checkpoint_engine.update_weights_bucket_megabytes,
)
```
This auto-adjustment logic is currently only applied to CheckpointEngineWorker, which is used on the rollout side. However, the trainer-side workers (such as FSDPWorker or MegatronWorker) also instantiate a CheckpointEngine (e.g., NCCLCheckpointEngine) using the same update_weights_bucket_megabytes configuration.
If the trainer side does not also auto-adjust its bucket size, it will encounter the same AssertionError when attempting to send large embedding weights, or it may cause a hang/crash during NCCL initialization if the bucket sizes between the trainer and rollout ranks are inconsistent. You should ensure this logic is applied to the trainer-side worker initialization as well.
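The consistency requirement described above can be sketched as follows: derive the bucket size from the same model dimensions on both sides, so trainer and rollout ranks can never disagree. The helper body here is a minimal stand-in for the PR's `get_minimum_bucket_size_mb` (taking explicit shapes rather than an `hf_config`), not the actual implementation:

```python
import math

def get_minimum_bucket_size_mb(vocab_size: int, hidden_size: int,
                               current_bucket_size_mb: int) -> int:
    """Minimal stand-in for the PR's helper: never return a bucket smaller
    than the (float32) embedding weight."""
    embed_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
    return max(current_bucket_size_mb, embed_mb)

# Both trainer-side and rollout-side workers derive the value from the same
# model config, so the NCCL bucket sizes cannot diverge between ranks.
trainer_bucket_mb = get_minimum_bucket_size_mb(151936, 4096, 2048)
rollout_bucket_mb = get_minimum_bucket_size_mb(151936, 4096, 2048)
assert trainer_bucket_mb == rollout_bucket_mb == 2374
```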
verl/checkpoint_engine/base.py
Outdated
```python
    return current_bucket_size_mb

# embed_tokens: [vocab_size, hidden_size] in float32 = 4 bytes
embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
```
The calculation hardcodes 4 bytes per element, assuming float32. While this is a safe upper bound, it can lead to significant memory over-allocation (2x) if the model is loaded in bfloat16 or float16. For very large models (e.g., a 16GB embedding in bf16), this logic would allocate a 32GB bucket, which could lead to unnecessary Out-Of-Memory (OOM) errors on memory-constrained devices. Consider using the actual torch_dtype from hf_config to calculate a more precise minimum size.
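A dtype-aware variant along the lines the reviewer suggests might look like this (a sketch: the `_DTYPE_BYTES` table, the function name, and the `torch_dtype` handling are assumptions, not the PR's code):

```python
import math

# Hypothetical element sizes; hf_config.torch_dtype may be a plain string
# like "bfloat16" or a torch.dtype that stringifies to "torch.bfloat16".
_DTYPE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}

def embed_size_mb(vocab_size: int, hidden_size: int, torch_dtype="float32") -> int:
    key = str(torch_dtype).replace("torch.", "")
    bytes_per_elem = _DTYPE_BYTES.get(key, 4)  # fall back to the safe float32 bound
    return math.ceil(vocab_size * hidden_size * bytes_per_elem / 1024 / 1024)
```

For the Qwen3-VL-8B embedding this gives 2374 MB in float32 but only 1187 MB in bfloat16, so the dtype-aware bound halves the allocated bucket for bf16 models.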
…mbedding weight size
…ig is not HFModelConfig instance
…ig is not HFModelConfig instance and replace rollout_config of config
…odelConfig model_config
What does this PR do?
Auto-adjust `update_weights_bucket_megabytes` at runtime based on the model's embedding weight size, so that training no longer fails with an `AssertionError` when the embedding weight exceeds the default bucket size (2048 MB). This fix supports both standard LLMs (top-level `vocab_size`/`hidden_size`) and multimodal VLMs (nested under `text_config`).

Fixes #5952
Checklist Before Starting
- PR title follows the `[{modules}] {type}: {description}` format: `[rollout, ckpt] fix: auto-adjust update_weights_bucket_megabytes based on embedding weight size` (modules: `rollout`, `ckpt`; type: `fix`)
- No `[BREAKING]` tag is needed: this is an internal runtime adjustment; users do not need to change their launch scripts

Test
This fix targets weight transfer initialization and requires a full training environment
to validate end-to-end, which cannot be covered by the existing CI.
Manual verification was conducted with the following setup:
- Script: `examples/grpo_trainer/run_qwen3_vl-8b_npu.sh`
- Error before the fix: `AssertionError: Weight model.language_model.embed_tokens.weight is too large to fit in the bucket`
- The following warning log is emitted when the auto-adjustment is triggered:
CI coverage is not feasible for this change because validating it end-to-end requires a full training environment, as noted above.
API and Usage Example
No API changes. This fix is fully transparent to users — no modifications to launch
scripts or configs are required.
verl will automatically emit a warning and adjust the bucket size internally when the
embedding weight exceeds the current bucket size:
Users can still manually override the value if needed:
In this case, if the manually specified value is already larger than the embedding weight size, no adjustment will be made and no warning will be emitted.
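The warn-only-when-adjusting behavior can be sketched as follows (a minimal sketch; the function name and log wording are illustrative, not the PR's exact code):

```python
import logging
import math

logger = logging.getLogger("verl.rollout.utils")

def adjust_bucket_size_mb(vocab_size: int, hidden_size: int,
                          current_bucket_size_mb: int) -> int:
    # Minimum bucket that can hold the float32 embedding weight.
    minimum_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
    if current_bucket_size_mb >= minimum_mb:
        # User-specified value already fits the embedding: no change, no warning.
        return current_bucket_size_mb
    logger.warning(
        "update_weights_bucket_megabytes=%d is smaller than the embedding "
        "weight (%d MB); auto-adjusting to %d MB.",
        current_bucket_size_mb, minimum_mb, minimum_mb,
    )
    return minimum_mb
```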
Design & Code Changes
Root cause: The default bucket size (2048 MB) is insufficient for large VL models whose `embed_tokens.weight` exceeds the limit (e.g. Qwen3-VL-8B: shape `[151936, 4096]` in float32 ≈ 2374 MB), causing an `AssertionError` in `bucketed_weight_transfer.py` at training startup. The assert exists because the current implementation does not support splitting a single tensor across multiple buckets (noted as a TODO in the code).

Changes:
- Added `get_minimum_bucket_size_mb` in `verl/workers/rollout/utils.py`:
  - reads `vocab_size` and `hidden_size` from `hf_config`
  - falls back to `text_config` if the top-level fields are absent
- `ServerAdapter.__init__` in `verl/workers/rollout/vllm_rollout/vllm_rollout.py` calls the helper and stores the adjusted value as an instance variable. `update_weights` then reads `self.bucket_size_mb` directly when constructing `BucketedWeightSender`.

Key design decisions:

- The helper lives in `verl/workers/rollout/utils.py`, alongside the existing bucket logic, keeping it cohesive
- Shapes are read from `hf_config`; no extra file I/O
- VLMs (with nested `text_config`) are handled correctly

Checklist Before Submitting
- Ran `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- CI requested in the `ci-request` channel
- No changes to the `recipe` submodule; no submodule update needed