replace npu_incre_flash_attention with npu_fused_infer_attention_score #2792
base: main
Conversation
Code Review
This pull request replaces the deprecated npu_incre_flash_attention with the newer npu_fused_infer_attention_score function. This is a good maintenance update. The new function call appears to be mostly correct, but I've identified a potential issue with how the atten_mask parameter is being passed, which could lead to incorrect behavior. My review includes a specific suggestion to address this.
```python
num_key_value_heads=self.num_kv_heads,
input_layout='BSH',
block_size=block_size)
atten_mask=attn_metadata.attn_mask,
```
For consistency and correctness, atten_mask should be sourced from decode_meta like the other parameters in this function call (e.g., block_table and seq_lens). In the DecodeOnly attention state, an attention mask is typically not required, and the previous API (npu_incre_flash_attention) did not accept one. Using decode_meta.attn_mask ensures that None is passed, preventing an unintended mask from being applied, which could happen if attn_metadata.attn_mask is not None.
Suggested change:

```diff
- atten_mask=attn_metadata.attn_mask,
+ atten_mask=decode_meta.attn_mask,
```
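For illustration only, here is a minimal sketch of the decode-path call with the suggestion applied. The surrounding names (decode_meta with block_table, seq_lens, and attn_mask; the head counts, scale, and block size) follow the diff context or are assumed, and the exact keyword names of npu_fused_infer_attention_score should be verified against the installed torch_npu version:

```python
import torch_npu


def fused_decode_attention(query, key_cache, value_cache, decode_meta,
                           num_heads, num_kv_heads, scale, block_size):
    """Sketch of the decode path with the reviewer's suggestion applied.

    decode_meta is assumed to expose block_table, seq_lens and attn_mask;
    in the DecodeOnly state attn_mask is expected to be None.
    """
    # npu_fused_infer_attention_score returns (attention_out, softmax_lse);
    # only the attention output is needed here.
    attn_output, _ = torch_npu.npu_fused_infer_attention_score(
        query,
        key_cache,
        value_cache,
        num_heads=num_heads,
        num_key_value_heads=num_kv_heads,
        input_layout='BSH',
        scale=scale,
        block_table=decode_meta.block_table,
        block_size=block_size,
        actual_seq_lengths_kv=decode_meta.seq_lens,
        atten_mask=decode_meta.attn_mask,
    )
    return attn_output
```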
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Please note that FIA might not support the 300I Duo platform, so we need a branch here.
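A possible shape for that branch, sketched under assumptions: the platform check is_310p stands in for whatever SoC check the project actually uses, the legacy keyword names (scale_value, actual_seq_lengths) are per the public torch_npu docs and worth double-checking, and the FIA path reuses the fused_decode_attention sketch above. None of this is taken verbatim from the PR:

```python
import torch_npu

from vllm_ascend.utils import is_310p  # assumed helper name for the 300I Duo / 310P check


def decode_attention(query, key_cache, value_cache, decode_meta,
                     num_heads, num_kv_heads, scale, block_size):
    if is_310p():
        # 300I Duo (310P): keep the legacy kernel, which returns the output tensor directly.
        return torch_npu.npu_incre_flash_attention(
            query, key_cache, value_cache,
            num_heads=num_heads,
            num_key_value_heads=num_kv_heads,
            input_layout='BSH',
            scale_value=scale,
            block_table=decode_meta.block_table,
            block_size=block_size,
            actual_seq_lengths=decode_meta.seq_lens)
    # Other platforms: use the FIA path sketched above.
    return fused_decode_attention(query, key_cache, value_cache, decode_meta,
                                  num_heads, num_kv_heads, scale, block_size)
```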
Signed-off-by: p00465316 <[email protected]>
Codecov Report

❌ Patch coverage is 0.00%.
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2792      +/-   ##
==========================================
+ Coverage   72.99%   75.39%   +2.39%
==========================================
  Files         153      155       +2
  Lines       21331    21123     -208
==========================================
+ Hits        15571    15925     +354
+ Misses       5760     5198     -562
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
The npu_incre_flash_attention interface is no longer maintained and has been replaced by npu_fused_infer_attention_score.
Does this PR introduce any user-facing change?
No
How was this patch tested?