fix qwen torchair attention PrefillCacheHit #2787

Conversation
Signed-off-by: zhaozixin <[email protected]>
Code Review

This pull request fixes an issue with the PrefillCacheHit attention state in the torchair attention backend. The changes update the instance's key/value cache references and slice block_tables to remove padding before passing it to the attention kernel (a sketch of that slicing step follows below). While the changes appear to fix the immediate issue, I have a suggestion to make the cache update logic more robust.
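As a minimal illustration of the padding slice described above (the tensor shapes and the `num_actual_seqs` name are assumptions for this sketch, not taken from the PR's diff):

```python
import torch

# Hypothetical sketch: torchair graph mode pads block_tables up to a fixed
# batch size, so the fix trims it back to the real sequence count before
# the attention kernel sees it. Names and shapes here are illustrative only.
padded_block_tables = torch.zeros(8, 16, dtype=torch.int32)  # padded to batch 8
num_actual_seqs = 3                                          # live requests this step
block_tables = padded_block_tables[:num_actual_seqs]         # drop the padding rows
assert block_tables.shape == (3, 16)
```

The cache-update half of the change guards the assignment on the attention state: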
```python
if attn_metadata.attn_state == AscendAttentionState.PrefillCacheHit:
    self.key_cache = key_cache
    self.value_cache = value_cache
```
The self.key_cache and self.value_cache attributes are used by the PrefillCacheHit attention state. This change correctly updates them when the state is PrefillCacheHit. However, the cache update in lines 375-376 happens for any state that has a kv_cache (e.g., DecodeOnly as well). To make the logic more robust and prevent potential issues if other states start using self.key_cache in the future, it would be better to update self.key_cache and self.value_cache unconditionally whenever the cache is modified. This ensures that self.key_cache and self.value_cache always reflect the latest state of the cache tensors passed into this forward pass.
```python
self.key_cache = key_cache
self.value_cache = value_cache
```
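For context, a pared-down sketch of how the reviewer's unconditional update would sit in the attention backend (the class and method shape here are assumptions for illustration, not the actual vllm-ascend source):

```python
from typing import Optional

import torch


class TorchairAttentionSketch:
    """Hypothetical stand-in for the real torchair attention backend."""

    def __init__(self) -> None:
        self.key_cache: Optional[torch.Tensor] = None
        self.value_cache: Optional[torch.Tensor] = None

    def forward(self, key_cache: torch.Tensor, value_cache: torch.Tensor) -> None:
        # Reviewer's suggestion: mirror the incoming cache tensors onto the
        # instance unconditionally, so any attention state (PrefillCacheHit,
        # DecodeOnly, ...) that later reads self.key_cache / self.value_cache
        # sees the tensors from this forward pass rather than a stale pair.
        self.key_cache = key_cache
        self.value_cache = value_cache
```

The trade-off is favorable: the assignments are cheap reference updates, not copies, and making them unconditional removes a hidden coupling between the update site and the set of states that read these attributes.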
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Codecov Report

❌ Patch coverage is 0.00%.
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

```
@@            Coverage Diff             @@
##             main    #2787      +/-   ##
==========================================
- Coverage   74.76%   74.75%   -0.02%
==========================================
  Files         150      150
  Lines       20891    20896       +5
==========================================
  Hits        15620    15620
- Misses       5271     5276       +5
==========================================
```

Flags with carried forward coverage won't be shown. View the full report in Codecov by Sentry.
What this PR does / why we need it?
Fix qwen torchair attention PrefillCacheHit
Does this PR introduce any user-facing change?
How was this patch tested?
vLLM version: v0.10.1.1
vLLM main: vllm-project/vllm@e599e2c