[main] mlp weight prefetch in Qwen Dense Models #2762
Conversation
Signed-off-by: rjg-lyh <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces MLP weight prefetching for Qwen Dense Models to optimize performance, primarily in the decode phase. This is achieved by adding new flashcomm and dense_optimize features, controlled by environment variables. The changes include new custom operators for communication and specialized linear layers that use these operators.
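As an aside, an environment-variable feature gate like the ones described above is usually just a small parser over `os.environ`. The sketch below is illustrative only — the PR's actual parsing lives in `vllm_ascend/envs.py` and may differ; the helper name `env_flag` is hypothetical.

```python
import os

# Hypothetical helper: read a boolean feature flag such as
# VLLM_ASCEND_ENABLE_FLASHCOMM from the environment. The real
# project's envs module may parse values differently.
def env_flag(name: str, default: bool = False) -> bool:
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes")
```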
While the overall approach seems promising for performance, I've found several critical issues that must be addressed. There are broken imports in vllm_ascend/ops/linear.py and vllm_ascend/worker/model_runner_v1.py that will prevent the code from running. Additionally, the logic in AscendDenseQKVParallelLinear for identifying the layer number is brittle. I've also pointed out a magic number that should be refactored into a named constant for better maintainability. Please review the comments for details.
vllm_ascend/ops/linear.py
from vllm_ascend.utils import (all_gather_and_maybe_unpad,
                               maybe_pad_and_reduce_scatter)
# Matrix multiply.
assert self.quant_method is not None

layer_num = self.prefix.split('.')[2]
The logic to determine the layer number by splitting the prefix string (self.prefix.split('.')[2]) is very brittle and assumes a fixed model-architecture naming scheme. This can easily break if a model with a different naming convention is used (e.g., model.decoder.layers.0...), leading to incorrect behavior or crashes. It would be more robust to pass the layer index or an is_first_layer flag explicitly during the layer's initialization.
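If the prefix must still be parsed, scanning it for the first numeric component is one way to avoid hard-coding a position. The helper below is a hypothetical sketch, not code from the PR:

```python
# Hypothetical helper: recover the layer index from a dotted module
# prefix without assuming it sits at a fixed position, so both
# "model.layers.0.self_attn.qkv_proj" and
# "model.decoder.layers.0.self_attn.qkv_proj" resolve correctly.
def extract_layer_index(prefix: str) -> int:
    for part in prefix.split('.'):
        if part.isdigit():
            return int(part)
    raise ValueError(f"no layer index found in prefix {prefix!r}")
```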
if get_forward_context().flashcomm_v1_enabled:
    from vllm_ascend.utils import all_gather_and_maybe_unpad
    hidden_states = all_gather_and_maybe_unpad(
        hidden_states, get_forward_context().pad_size, dim=0)
The function all_gather_and_maybe_unpad is imported from vllm_ascend.utils, but it is not defined there, which will cause an ImportError. Furthermore, the call all_gather_and_maybe_unpad(hidden_states, get_forward_context().pad_size, dim=0) does not match the signature of the related custom op maybe_all_gather_and_maybe_unpad(x: Tensor, label: bool). You should probably be calling torch.ops.vllm.maybe_all_gather_and_maybe_unpad with the correct arguments:
if get_forward_context().flashcomm_v1_enabled:
    hidden_states = torch.ops.vllm.maybe_all_gather_and_maybe_unpad(
        hidden_states, True)
flashcomm_v1_enabled = envs_ascend.VLLM_ASCEND_ENABLE_FLASHCOMM and \
    num_tokens is not None and num_tokens > 1000
The value 1000 is a magic number used as the threshold to enable the flashcomm_v1 optimization. This makes the code harder to understand and maintain. It should be defined as a named constant with a comment explaining its purpose and how the value was determined, which would improve readability and make the threshold easier to tune in the future.
# e.g. FLASHCOMM_V1_TOKEN_THRESHOLD = 1000 at the top of the file
flashcomm_v1_enabled = envs_ascend.VLLM_ASCEND_ENABLE_FLASHCOMM and \
    num_tokens is not None and num_tokens > FLASHCOMM_V1_TOKEN_THRESHOLD
Force-pushed from cdedbf9 to c19031d.
Signed-off-by: rjg-lyh <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
This PR prefetches the weights of MLP layers in Qwen Dense Models, mainly to optimize performance in the decode phase.
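The general pattern behind weight prefetching is to fetch the next layer's weights while the current layer computes, hiding the transfer latency. The sketch below is conceptual only and is not the PR's Ascend implementation: on an NPU the prefetch would be an asynchronous device copy on a side stream, whereas here a single worker thread stands in for it, and run_layers, fetch_weights, and compute are all illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch of compute/prefetch overlap: while layer i runs,
# the weights for layer i + 1 are fetched in the background.
def run_layers(layers, fetch_weights, compute):
    if not layers:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_weights, layers[0])
        for i, layer in enumerate(layers):
            weights = future.result()  # wait for this layer's weights
            if i + 1 < len(layers):
                # kick off the prefetch of the next layer's weights
                future = pool.submit(fetch_weights, layers[i + 1])
            results.append(compute(layer, weights))
    return results
```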
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed with newly added and existing tests.