[rollout, vllm, ckpt] fix: auto-adjust update_weights_bucket_megabytes based on embedding weight size#5963

Open
Silas-11 wants to merge 4 commits into verl-project:main from Silas-11:fix-5952-bug

Conversation

Contributor

@Silas-11 Silas-11 commented Apr 10, 2026


What does this PR do?

Auto-adjust update_weights_bucket_megabytes at runtime based on the model's embedding
weight size, so that training no longer fails with an AssertionError when the embedding
weight exceeds the default bucket size (2048 MB). This fix supports both standard LLMs
(top-level vocab_size/hidden_size) and multimodal VLMs (nested under text_config).

Fixes #5952

Test

This fix targets weight transfer initialization and requires a full training environment
to validate end-to-end, which cannot be covered by the existing CI.

Manual verification was conducted with the following setup:

  • Model: Qwen3-VL-8B-Instruct
  • Script: examples/grpo_trainer/run_qwen3_vl-8b_npu.sh
  • Hardware: Ascend NPU, 4 nodes × 8 devices
|        | Before fix | After fix |
| ------ | ---------- | --------- |
| Result | `AssertionError: Weight model.language_model.embed_tokens.weight is too large to fit in the bucket` | ✅ Training starts normally, bucket size auto-adjusted to 2560 MB |

The following warning log is emitted when the auto-adjustment is triggered:

[Rank 14 | Local Rank 0] WARNING [verl.checkpoint_engine.base:71] => Embedding weight size (2374 MB) exceeds update_weights_bucket_megabytes (2048 MB), automatically increasing to 2560 MB. [repeated 15x across cluster]

CI coverage is not feasible for this change because:

  • It requires a multimodal model and dataset environment not available in CI
  • The fix is a simple runtime size check with no algorithmic change

API and Usage Example

No API changes. This fix is fully transparent to users — no modifications to launch
scripts or configs are required.

verl will automatically emit a warning and adjust the bucket size internally when the
embedding weight exceeds the current bucket size:

[Rank 14 | Local Rank 0] WARNING [verl.checkpoint_engine.base:71] => Embedding weight size (2374 MB) exceeds update_weights_bucket_megabytes (2048 MB), automatically increasing to 2560 MB.

Users can still manually override the value if needed:

actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096

In this case, if the manually specified value is already larger than the embedding weight size, no adjustment will be made and no warning will be emitted.

Design & Code Changes

Root cause: The default bucket size (2048 MB) is insufficient for large VL models whose embed_tokens.weight (e.g. Qwen3-VL-8B: shape [151936, 4096] in float32 ≈ 2374 MB) exceeds the limit, causing an AssertionError in bucketed_weight_transfer.py at training startup. The assert exists because the current implementation does not support splitting a single tensor across multiple buckets (noted as a TODO in the code).
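The arithmetic behind the root cause can be checked in a few lines; the Qwen3-VL-8B shape and the 512 MB round-up rule are taken directly from the description and helper below:

```python
import math

# Qwen3-VL-8B embedding: [vocab_size, hidden_size] in float32 (4 bytes/element)
vocab_size, hidden_size = 151936, 4096
embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
print(embed_size_mb)  # 2374 -> larger than the 2048 MB default bucket

# Round up to the next 512 MB boundary, as the helper does
recommended_mb = (embed_size_mb // 512 + 1) * 512
print(recommended_mb)  # 2560 -> matches the warning log above
```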

Changes:

  1. New helper function get_minimum_bucket_size_mb in verl/workers/rollout/utils.py:
    • Reads vocab_size and hidden_size from hf_config
    • For multimodal models (e.g. Qwen3-VL), falls back to text_config if top-level fields are absent
    • Only adjusts when necessary, rounds up to the next 512 MB boundary for safety margin
import logging
import math

logger = logging.getLogger(__name__)


def get_minimum_bucket_size_mb(hf_config, current_bucket_size_mb: int) -> int:
    text_config = getattr(hf_config, "text_config", None)
    if text_config is not None:
        vocab_size = getattr(text_config, "vocab_size", 0)
        hidden_size = getattr(text_config, "hidden_size", 0)
    else:
        vocab_size = getattr(hf_config, "vocab_size", 0)
        hidden_size = getattr(hf_config, "hidden_size", 0)

    if not (vocab_size and hidden_size):
        return current_bucket_size_mb

    embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
    if embed_size_mb <= current_bucket_size_mb:
        return current_bucket_size_mb

    recommended_mb = (embed_size_mb // 512 + 1) * 512
    logger.warning(
        f"Embedding weight size ({embed_size_mb} MB) exceeds "
        f"update_weights_bucket_megabytes ({current_bucket_size_mb} MB), "
        f"automatically increasing to {recommended_mb} MB."
    )
    return recommended_mb
  2. ServerAdapter.__init__ in verl/workers/rollout/vllm_rollout/vllm_rollout.py calls the helper and stores the adjusted value as an instance variable:
# Auto-adjust bucket size based on embedding weight size
from verl.checkpoint_engine.base import get_minimum_bucket_size_mb
self.bucket_size_mb = get_minimum_bucket_size_mb(
    hf_config=self.model_config.hf_config,
    current_bucket_size_mb=self.config.checkpoint_engine.update_weights_bucket_megabytes,
)

Then update_weights reads self.bucket_size_mb directly when constructing BucketedWeightSender:

sender = BucketedWeightSender(
    zmq_handle=self.zmq_handle,
    bucket_size_mb=self.bucket_size_mb,
    use_shm=self.use_shm,
)

Key design decisions:

  • Function defined in verl/workers/rollout/utils.py where bucket logic lives, keeping it cohesive
  • Reads from already-loaded hf_config, no extra file I/O
  • Handles both LLM (top-level config) and VLM (text_config nested) correctly
  • Only adjusts when necessary, respects user-specified values larger than the embedding size
  • No changes to example scripts or user launch configs
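As a sanity check on the design decisions above (text_config fallback, and respecting user values that are already large enough), the helper's branching can be exercised with mock configs. The sketch below inlines a minimal, logging-free copy of the function so it runs standalone; SimpleNamespace stands in for a real hf_config:

```python
import math
from types import SimpleNamespace

def get_minimum_bucket_size_mb(hf_config, current_bucket_size_mb):
    # Minimal copy of the PR's helper for demonstration; logging omitted.
    text_config = getattr(hf_config, "text_config", None)
    src = text_config if text_config is not None else hf_config
    vocab_size = getattr(src, "vocab_size", 0)
    hidden_size = getattr(src, "hidden_size", 0)
    if not (vocab_size and hidden_size):
        return current_bucket_size_mb
    embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)
    if embed_size_mb <= current_bucket_size_mb:
        return current_bucket_size_mb
    return (embed_size_mb // 512 + 1) * 512

# VLM-style config: sizes nested under text_config (Qwen3-VL-8B numbers)
vlm = SimpleNamespace(text_config=SimpleNamespace(vocab_size=151936, hidden_size=4096))
print(get_minimum_bucket_size_mb(vlm, 2048))  # 2560: default too small, bumped

# Same model with a user override already large enough: left untouched
print(get_minimum_bucket_size_mb(vlm, 4096))  # 4096

# LLM-style config with top-level fields that fit in the default bucket
llm = SimpleNamespace(vocab_size=32000, hidden_size=4096)
print(get_minimum_bucket_size_mb(llm, 2048))  # 2048
```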

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: This fix requires a full multimodal training environment (large VL model + dataset) which is not available in CI. Manual end-to-end verification has been conducted as described in the Test section above.
  • Once your PR is ready for CI, send a message in the ci-request channel.
  • Not related to the recipe submodule, no submodule update needed.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an automatic adjustment for the weight transfer bucket size based on the model's embedding dimensions to prevent transfer errors. The changes include a new utility function get_minimum_bucket_size_mb and its integration into the CheckpointEngineWorker and vLLMRollout classes. Feedback suggests extending this logic to trainer-side workers to ensure consistency and avoid potential hangs. Additionally, it is recommended to use the model's actual data type for size calculations instead of hardcoding a float32 assumption to prevent unnecessary memory over-allocation.

Comment on lines +320 to +323
bucket_size_mb = get_minimum_bucket_size_mb(
hf_config=self.model_config.hf_config,
current_bucket_size_mb=self.rollout_config.checkpoint_engine.update_weights_bucket_megabytes,
)

Severity: high

This auto-adjustment logic is currently only applied to CheckpointEngineWorker, which is used on the rollout side. However, the trainer-side workers (such as FSDPWorker or MegatronWorker) also instantiate a CheckpointEngine (e.g., NCCLCheckpointEngine) using the same update_weights_bucket_megabytes configuration.

If the trainer side does not also auto-adjust its bucket size, it will encounter the same AssertionError when attempting to send large embedding weights, or it may cause a hang/crash during NCCL initialization if the bucket sizes between the trainer and rollout ranks are inconsistent. You should ensure this logic is applied to the trainer-side worker initialization as well.

return current_bucket_size_mb

# embed_tokens: [vocab_size, hidden_size] in float32 = 4 bytes
embed_size_mb = math.ceil(vocab_size * hidden_size * 4 / 1024 / 1024)

Severity: high

The calculation hardcodes 4 bytes per element, assuming float32. While this is a safe upper bound, it can lead to significant memory over-allocation (2x) if the model is loaded in bfloat16 or float16. For very large models (e.g., a 16GB embedding in bf16), this logic would allocate a 32GB bucket, which could lead to unnecessary Out-Of-Memory (OOM) errors on memory-constrained devices. Consider using the actual torch_dtype from hf_config to calculate a more precise minimum size.
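A dtype-aware variant along the lines of this review comment could look roughly like the following sketch. Reading a torch_dtype attribute off hf_config follows Hugging Face conventions, but treating it as a plain string and the fallback behavior here are simplifying assumptions, not the PR's implementation:

```python
import math
from types import SimpleNamespace

# Bytes per element for common dtypes; unknown dtypes fall back to the
# safe float32 upper bound, preserving the PR's original behavior.
_DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2}

def embed_size_mb(hf_config) -> int:
    src = getattr(hf_config, "text_config", None) or hf_config
    vocab_size = getattr(src, "vocab_size", 0)
    hidden_size = getattr(src, "hidden_size", 0)
    dtype = str(getattr(hf_config, "torch_dtype", "float32")).replace("torch.", "")
    bytes_per_elem = _DTYPE_BYTES.get(dtype, 4)
    return math.ceil(vocab_size * hidden_size * bytes_per_elem / 1024 / 1024)

cfg = SimpleNamespace(vocab_size=151936, hidden_size=4096, torch_dtype="bfloat16")
print(embed_size_mb(cfg))  # 1187 MB instead of the 2374 MB float32 estimate
```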

wuxibin89 previously approved these changes Apr 13, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] AssertionError: Weight too large to fit in bucket during update_weights in vllm_rollout (Qwen3-VL-8B)
