
Conversation

@rjg-lyh (Contributor) commented Sep 4, 2025

What this PR does / why we need it?

This PR prefetches the weights of the MLP layers in Qwen dense models, mainly to optimize performance in the decode phase.
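
For context, the idea is roughly the following (a minimal sketch; the helper name, the prefetch size, and the use of torch_npu.npu_prefetch as the underlying API are assumptions for illustration, not this PR's exact code):

    import torch
    import torch_npu  # Ascend adapter for PyTorch

    # Hypothetical upper bound on how many bytes of weight to stage ahead
    # of time; the real value would be tuned per model and hardware.
    MLP_WEIGHT_PREFETCH_BYTES = 18 * 1024 * 1024

    def prefetch_mlp_weight(weight: torch.Tensor,
                            dependency: torch.Tensor) -> None:
        """Start copying an MLP weight toward the compute core's cache.

        `dependency` is a tensor the prefetch waits on (e.g. the attention
        output), so the copy overlaps with attention compute instead of
        blocking it. In the decode phase, where batches are small and
        weight loads dominate, this hides part of the memory latency.
        """
        torch_npu.npu_prefetch(weight, dependency, MLP_WEIGHT_PREFETCH_BYTES)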

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI passed with newly added and existing tests.

github-actions bot commented Sep 4, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces MLP weight prefetching for Qwen dense models to optimize performance, primarily in the decode phase. This is achieved by adding new flashcomm and dense_optimize features, controlled by environment variables. The changes include new custom operators for communication and specialized linear layers that use these operators.

While the overall approach seems promising for performance, I've found several critical issues that must be addressed. There are broken imports in vllm_ascend/ops/linear.py and vllm_ascend/worker/model_runner_v1.py that will prevent the code from running, and the logic in AscendDenseQKVParallelLinear for identifying the layer number is brittle. I've also pointed out a magic number that should be refactored into a named constant for better maintainability. Please review the comments for details.
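
For reference, flags of this kind typically follow the lazy-reader pattern used for environment variables in vLLM projects (a sketch of the pattern, assuming the flag name from the diff below; not this PR's exact code):

    import os
    from typing import Any, Callable, Dict

    # Each flag maps to a callable so the environment is read lazily.
    env_variables: Dict[str, Callable[[], Any]] = {
        # Enables the flashcomm v1 communication path; off by default.
        "VLLM_ASCEND_ENABLE_FLASHCOMM":
        lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_FLASHCOMM", "0"))),
    }

    def __getattr__(name: str) -> Any:
        # Supports `envs_ascend.VLLM_ASCEND_ENABLE_FLASHCOMM` access.
        if name in env_variables:
            return env_variables[name]()
        raise AttributeError(f"module has no attribute {name!r}")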

Comment on lines 36 to 37
from vllm_ascend.utils import (all_gather_and_maybe_unpad,
                               maybe_pad_and_reduce_scatter)

critical

These functions all_gather_and_maybe_unpad and maybe_pad_and_reduce_scatter are imported from vllm_ascend.utils, but they are not defined in that module. This will cause an ImportError when this module is loaded. It seems these imports are not used in this file anyway and should be removed.

# Matrix multiply.
assert self.quant_method is not None

layer_num = self.prefix.split('.')[2]

critical

The logic to determine the layer number by splitting the prefix string (self.prefix.split('.')[2]) is very brittle and assumes a fixed model architecture naming scheme. This can easily break if a model with a different naming convention is used (e.g., model.decoder.layers.0...), leading to incorrect behavior or crashes. It would be more robust to pass the layer index or an is_first_layer flag explicitly during the layer's initialization.
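
A minimal sketch of that refactor (the `layer_idx` keyword and its plumbing are illustrative; only the class name comes from this PR):

    from vllm.model_executor.layers.linear import QKVParallelLinear

    class AscendDenseQKVParallelLinear(QKVParallelLinear):
        """QKV projection that is told its layer index at construction."""

        def __init__(self, *args, layer_idx: int, **kwargs):
            super().__init__(*args, **kwargs)
            # Explicit index instead of self.prefix.split('.')[2], which
            # breaks on layouts like "model.decoder.layers.0.self_attn...".
            self.layer_idx = layer_idx
            self.is_first_layer = layer_idx == 0

The model's layer-construction loop already has the index at hand, so it can pass `layer_idx=i` directly and no string parsing is needed at forward time.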

Comment on lines 1146 to 1149
if get_forward_context().flashcomm_v1_enabled:
    from vllm_ascend.utils import all_gather_and_maybe_unpad
    hidden_states = all_gather_and_maybe_unpad(
        hidden_states, get_forward_context().pad_size, dim=0)

critical

The function all_gather_and_maybe_unpad is imported from vllm_ascend.utils, but it is not defined there, which will cause an ImportError. Furthermore, the way it's called all_gather_and_maybe_unpad(hidden_states, get_forward_context().pad_size, dim=0) does not match the signature of the related custom op maybe_all_gather_and_maybe_unpad(x: Tensor, label: bool). You should probably be calling torch.ops.vllm.maybe_all_gather_and_maybe_unpad with the correct arguments.

        if get_forward_context().flashcomm_v1_enabled:
            hidden_states = torch.ops.vllm.maybe_all_gather_and_maybe_unpad(
                hidden_states, True)

Comment on lines +112 to +113
flashcomm_v1_enabled = envs_ascend.VLLM_ASCEND_ENABLE_FLASHCOMM and \
    num_tokens is not None and num_tokens > 1000

high

The value 1000 is a magic number. It's used as a threshold to enable the flashcomm_v1 optimization. This makes the code harder to understand and maintain. It should be defined as a named constant with a comment explaining its purpose and how this value was determined. This would improve readability and make it easier to tune this threshold in the future.

Suggested change

    - flashcomm_v1_enabled = envs_ascend.VLLM_ASCEND_ENABLE_FLASHCOMM and \
    -     num_tokens is not None and num_tokens > 1000
    + # e.g. FLASHCOMM_V1_TOKEN_THRESHOLD = 1000 at the top of the file
    + flashcomm_v1_enabled = envs_ascend.VLLM_ASCEND_ENABLE_FLASHCOMM and \
    +     num_tokens is not None and num_tokens > FLASHCOMM_V1_TOKEN_THRESHOLD
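
Concretely, the constant could live near the top of the module (the rationale comment here is a suggestion, not the PR's code):

    # flashcomm v1 replaces per-layer communication with an all-gather /
    # reduce-scatter pattern that presumably only pays off once the batch
    # is large enough; 1000 tokens is an empirical starting point and
    # should be re-tuned per hardware generation.
    FLASHCOMM_V1_TOKEN_THRESHOLD = 1000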

@rjg-lyh force-pushed the pr-prefetch branch 4 times, most recently from cdedbf9 to c19031d on September 5, 2025 04:44
github-actions bot commented Sep 7, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@rjg-lyh closed this Sep 9, 2025