[rollout, vllm, ckpt] fix: auto-adjust update_weights_bucket_megabytes based on embedding weight size#5963

Open
Silas-11 wants to merge 4 commits into verl-project:main from Silas-11:fix-5952-bug

Conversation

Contributor

@Silas-11 Silas-11 commented Apr 10, 2026


What does this PR do?

Auto-adjust update_weights_bucket_megabytes at runtime based on the model's embedding
weight size, so that training no longer fails with an AssertionError when the embedding
weight exceeds the default bucket size (2048 MB). This fix supports both standard LLMs
(top-level vocab_size/hidden_size) and multimodal VLMs (nested under text_config).

Fixes #5952

Test

This fix targets weight transfer initialization and requires a full training environment
to validate end-to-end, which cannot be covered by the existing CI.

Manual verification was conducted with the following setup:

  • Model: Qwen3-VL-8B-Instruct
  • Script: examples/grpo_trainer/run_qwen3_vl-8b_npu.sh
  • Hardware: Ascend NPU, 4 nodes × 8 devices
|        | Before fix | After fix |
| ------ | ---------- | --------- |
| Result | `AssertionError: Weight model.language_model.embed_tokens.weight is too large to fit in the bucket` | ✅ Training starts normally, bucket size auto-adjusted to 2560 MB |

The following warning log is emitted when the auto-adjustment is triggered:

[Rank 14 | Local Rank 0] WARNING [verl.checkpoint_engine.base:71] => Embedding weight size (2374 MB) exceeds update_weights_bucket_megabytes (2048 MB), automatically increasing to 2560 MB. [repeated 15x across cluster]

CI coverage is not feasible for this change because:

  • It requires a multimodal model and dataset environment not available in CI
  • The fix is a simple runtime size check with no algorithmic change

API and Usage Example

No API changes. This fix is fully transparent to users — no modifications to launch
scripts or configs are required.

verl will automatically emit a warning and adjust the bucket size internally when the
embedding weight exceeds the current bucket size:

[Rank 14 | Local Rank 0] WARNING [verl.checkpoint_engine.base:71] => Embedding weight size (2374 MB) exceeds update_weights_bucket_megabytes (2048 MB), automatically increasing to 2560 MB.

Users can still manually override the value if needed:

actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096

In this case, if the manually specified value is already larger than the embedding weight size, no adjustment will be made and no warning will be emitted.

Design & Code Changes

Root cause: The default bucket size (2048 MB) is insufficient for large VL models whose embed_tokens.weight (e.g. Qwen3-VL-8B: shape [151936, 4096] in float32 ≈ 2374 MB) exceeds the limit, causing an AssertionError in bucketed_weight_transfer.py at training startup. The assert exists because the current implementation does not support splitting a single tensor across multiple buckets (noted as a TODO in the code).
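The arithmetic behind the root cause can be checked in a few lines; the Qwen3-VL-8B shape and the 512 MB round-up rule are taken directly from the description and helper below:

```python
import math

# Qwen3-VL-8B embedding: [vocab_size, hidden_size] in float32 (4 bytes/element)
vocab_size, hidden_size = 151936, 4096
embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
print(embed_size_mb)  # 2374 -> larger than the 2048 MB default bucket

# Round up to the next 512 MB boundary, as the helper does
recommended_mb = (embed_size_mb // 512 + 1) * 512
print(recommended_mb)  # 2560 -> matches the warning log above
```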

Changes:

  1. New helper function get_minimum_bucket_size_mb in verl/workers/rollout/utils.py:
    • Reads vocab_size and hidden_size from hf_config
    • For multimodal models (e.g. Qwen3-VL), falls back to text_config if top-level fields are absent
    • Only adjusts when necessary, rounds up to the next 512 MB boundary for safety margin
import logging
import math

logger = logging.getLogger(__name__)


def get_minimum_bucket_size_mb(hf_config, current_bucket_size_mb: int) -> int:
    text_config = getattr(hf_config, "text_config", None)
    if text_config is not None:
        vocab_size = getattr(text_config, "vocab_size", 0)
        hidden_size = getattr(text_config, "hidden_size", 0)
    else:
        vocab_size = getattr(hf_config, "vocab_size", 0)
        hidden_size = getattr(hf_config, "hidden_size", 0)

    if not (vocab_size and hidden_size):
        return current_bucket_size_mb

    embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
    if embed_size_mb <= current_bucket_size_mb:
        return current_bucket_size_mb

    recommended_mb = (embed_size_mb // 512 + 1) * 512
    logger.warning(
        f"Embedding weight size ({embed_size_mb} MB) exceeds "
        f"update_weights_bucket_megabytes ({current_bucket_size_mb} MB), "
        f"automatically increasing to {recommended_mb} MB."
    )
    return recommended_mb
  2. ServerAdapter.__init__ in verl/workers/rollout/vllm_rollout/vllm_rollout.py calls the helper and stores the adjusted value as an instance variable:
# Auto-adjust bucket size based on embedding weight size
from verl.checkpoint_engine.base import get_minimum_bucket_size_mb
self.bucket_size_mb = get_minimum_bucket_size_mb(
    hf_config=self.model_config.hf_config,
    current_bucket_size_mb=self.config.checkpoint_engine.update_weights_bucket_megabytes,
)

Then update_weights reads self.bucket_size_mb directly when constructing BucketedWeightSender:

sender = BucketedWeightSender(
    zmq_handle=self.zmq_handle,
    bucket_size_mb=self.bucket_size_mb,
    use_shm=self.use_shm,
)

Key design decisions:

  • Function defined in verl/workers/rollout/utils.py where bucket logic lives, keeping it cohesive
  • Reads from already-loaded hf_config, no extra file I/O
  • Handles both LLM (top-level config) and VLM (text_config nested) correctly
  • Only adjusts when necessary, respects user-specified values larger than the embedding size
  • No changes to example scripts or user launch configs
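As a sanity check on the design decisions above (text_config fallback, and respecting user values that are already large enough), the helper's branching can be exercised with mock configs. The sketch below inlines a minimal, logging-free copy of the function so it runs standalone; SimpleNamespace stands in for a real hf_config:

```python
import math
from types import SimpleNamespace

def get_minimum_bucket_size_mb(hf_config, current_bucket_size_mb):
    # Minimal copy of the PR's helper for demonstration; logging omitted.
    text_config = getattr(hf_config, "text_config", None)
    src = text_config if text_config is not None else hf_config
    vocab_size = getattr(src, "vocab_size", 0)
    hidden_size = getattr(src, "hidden_size", 0)
    if not (vocab_size and hidden_size):
        return current_bucket_size_mb
    embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
    if embed_size_mb <= current_bucket_size_mb:
        return current_bucket_size_mb
    return (embed_size_mb // 512 + 1) * 512

# VLM-style config: sizes nested under text_config (Qwen3-VL-8B numbers)
vlm = SimpleNamespace(text_config=SimpleNamespace(vocab_size=151936, hidden_size=4096))
print(get_minimum_bucket_size_mb(vlm, 2048))  # 2560: default too small, bumped

# Same model with a user override already large enough: left untouched
print(get_minimum_bucket_size_mb(vlm, 4096))  # 4096

# LLM-style config with top-level fields that fit in the default bucket
llm = SimpleNamespace(vocab_size=32000, hidden_size=4096)
print(get_minimum_bucket_size_mb(llm, 2048))  # 2048
```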

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: This fix requires a full multimodal training environment (large VL model + dataset) which is not available in CI. Manual end-to-end verification has been conducted as described in the Test section above.
  • Once your PR is ready for CI, send a message in the ci-request channel.
  • Not related to the recipe submodule, no submodule update needed.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an automatic adjustment for the weight transfer bucket size based on the model's embedding dimensions to prevent transfer errors. The changes include a new utility function get_minimum_bucket_size_mb and its integration into the CheckpointEngineWorker and vLLMRollout classes. Feedback suggests extending this logic to trainer-side workers to ensure consistency and avoid potential hangs. Additionally, it is recommended to use the model's actual data type for size calculations instead of hardcoding a float32 assumption to prevent unnecessary memory over-allocation.

Comment on lines +320 to +323
bucket_size_mb = get_minimum_bucket_size_mb(
hf_config=self.model_config.hf_config,
current_bucket_size_mb=self.rollout_config.checkpoint_engine.update_weights_bucket_megabytes,
)

Severity: high

This auto-adjustment logic is currently only applied to CheckpointEngineWorker, which is used on the rollout side. However, the trainer-side workers (such as FSDPWorker or MegatronWorker) also instantiate a CheckpointEngine (e.g., NCCLCheckpointEngine) using the same update_weights_bucket_megabytes configuration.

If the trainer side does not also auto-adjust its bucket size, it will encounter the same AssertionError when attempting to send large embedding weights, or it may cause a hang/crash during NCCL initialization if the bucket sizes between the trainer and rollout ranks are inconsistent. You should ensure this logic is applied to the trainer-side worker initialization as well.

return current_bucket_size_mb

# embed_tokens: [vocab_size, hidden_size] in float32 = 4 bytes
embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)

Severity: high

The calculation hardcodes 4 bytes per element, assuming float32. While this is a safe upper bound, it can lead to significant memory over-allocation (2x) if the model is loaded in bfloat16 or float16. For very large models (e.g., a 16GB embedding in bf16), this logic would allocate a 32GB bucket, which could lead to unnecessary Out-Of-Memory (OOM) errors on memory-constrained devices. Consider using the actual torch_dtype from hf_config to calculate a more precise minimum size.
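A dtype-aware variant along the lines of this review comment could look roughly like the following sketch. Reading a torch_dtype attribute off hf_config follows Hugging Face conventions, but treating it as a plain string and the fallback behavior here are simplifying assumptions, not the PR's implementation:

```python
import math
from types import SimpleNamespace

# Bytes per element for common dtypes; unknown dtypes fall back to the
# safe float32 upper bound, preserving the PR's original behavior.
_DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2}

def embed_size_mb(hf_config) -> int:
    src = getattr(hf_config, "text_config", None) or hf_config
    vocab_size = getattr(src, "vocab_size", 0)
    hidden_size = getattr(src, "hidden_size", 0)
    dtype = str(getattr(hf_config, "torch_dtype", "float32")).replace("torch.", "")
    bytes_per_elem = _DTYPE_BYTES.get(dtype, 4)
    return math.ceil(vocab_size * hidden_size * bytes_per_elem / 1024 / 1024)

cfg = SimpleNamespace(vocab_size=151936, hidden_size=4096, torch_dtype="bfloat16")
print(embed_size_mb(cfg))  # 1187 MB instead of the 2374 MB float32 estimate
```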

wuxibin89 previously approved these changes Apr 13, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] AssertionError: Weight too large to fit in bucket during update_weights in vllm_rollout (Qwen3-VL-8B)
