Remove Vision FA warning #18522
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀 […]
BTW, it would be best for this PR to have some benchmark to see the performance improvement as well.
@@ -329,6 +329,24 @@ def forward(
        q, k, v = (rearrange(x, "b s ... -> (b s) ...") for x in [q, k, v])
        output = flash_attn_varlen_func(q, […]
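For context on what this call consumes: varlen flash-attention kernels take a packed (total_tokens, num_heads, head_dim) layout plus cumulative sequence lengths, which is why the diff flattens the batch and sequence dims first. A minimal NumPy sketch of that bookkeeping (shapes and helper names here are illustrative assumptions, not the PR's actual code):

```python
import numpy as np

def pack_bs(x):
    # Equivalent of rearrange(x, "b s ... -> (b s) ...") for equal-length
    # sequences: merge the batch and sequence dims into one token dim.
    b, s = x.shape[0], x.shape[1]
    return x.reshape((b * s,) + x.shape[2:])

def make_cu_seqlens(seq_lens):
    # Prefix sums starting at 0, the layout varlen kernels expect:
    # tokens for sequence i live in [cu_seqlens[i], cu_seqlens[i + 1]).
    return np.concatenate(([0], np.cumsum(seq_lens))).astype(np.int32)

q = np.zeros((2, 5, 8, 64))              # (batch, seq, num_heads, head_dim)
print(pack_bs(q).shape)                  # (10, 8, 64)
print(make_cu_seqlens([5, 5]).tolist())  # [0, 5, 10]
```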
If flash_attn_varlen_func from vllm-flash-attn can work well, we can remove the original FA implementation as well.
Should we keep the original FA implementation for other platforms, like ROCm, which do not support FA through vllm-flash-attn?
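One hedged way to express the fallback this question asks about: probe for vllm-flash-attn first and fall back to upstream flash-attn when it is absent. This is an illustrative sketch of the idea only, not the PR's actual dispatch logic (the module names are assumptions):

```python
import importlib.util

def pick_flash_attn_module():
    # Prefer the vLLM-vendored kernels; fall back to upstream flash-attn.
    # Returns None when neither package is installed (e.g. a CPU-only env).
    for name in ("vllm_flash_attn", "flash_attn"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None

print(pick_flash_attn_module())
```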
        context_layer = rearrange(output,
                                  "(b s) ... -> b s ...",
                                  b=batch_size)
    elif self.attn_backend == _Backend.FLASH_ATTN_VLLM_V1:
Suggested change:
-    elif self.attn_backend == _Backend.FLASH_ATTN_VLLM_V1:
+    elif self.attn_backend in (_Backend.FLASH_ATTN_VLLM_V1, _Backend.FLASH_ATTN):
Signed-off-by: Vinay Damodaran <[email protected]>
Force-pushed from 0486de7 to 37e7446 (Compare)
Looks like […]
The check seems to be happening here […]
Signed-off-by: Vinay Damodaran <[email protected]>
It loads fine with the normal […]
Okay. So I did a little digging, and it looks like this PR (#8910) was opened to fix that issue; it relied on this PR (vllm-project/flash-attention#21) to actually disable that flag and build with the new head sizes. It seems the conversation went stale after @njhill and @WoosukKwon discussed it for a bit. Is there any reason why we can't go ahead with this and make sure we build for all head sizes? That seems more pragmatic than forcing users who rely on a single Docker image to install both […]
Just wanted to bring this back up in case anyone can provide direction on the path forward. If we feel that enabling this in […]
Unfortunately, we don't have much wheel size to spare after the recent Blackwell addition. I am going OOO for the next 2 weeks, but I can investigate the wheel size impact when I get back. It looks like this is mostly an FA2 problem, so we can potentially use […]
So, if I understood this issue correctly, I can use the flash_attn backend with a multimodal model using the original FlashAttention 2, correct?
FIX #18324