[Core] Support all head sizes up to 256 with FlashAttention backend #8910
Conversation
We were previously restricting the FlashAttention backend to specific head sizes, but the native FA kernels pad and support arbitrary head sizes up to 256.
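For context, a minimal sketch of the kind of change described above, assuming a hypothetical head-size check — the names (`_SUPPORTED_HEAD_SIZES`, `head_size_is_supported_*`) and the old allow-list values are illustrative, not the actual vLLM code paths:

```python
# Sketch only: moving from an allow-list of head sizes to a simple upper
# bound, on the assumption that the native FlashAttention kernels pad the
# head dimension internally. Names and values here are illustrative.

# Old-style check: only an explicit set of head sizes is accepted.
_SUPPORTED_HEAD_SIZES = [64, 80, 96, 112, 120, 128, 192, 256]

def head_size_is_supported_old(head_size: int) -> bool:
    return head_size in _SUPPORTED_HEAD_SIZES

# New-style check: any positive head size up to 256 is accepted, since the
# kernels pad to a supported size under the hood.
_MAX_HEAD_SIZE = 256

def head_size_is_supported_new(head_size: int) -> bool:
    return 0 < head_size <= _MAX_HEAD_SIZE

if __name__ == "__main__":
    # e.g. head_size=200 is rejected by the allow-list but fine with padding.
    for hs in (96, 200, 256, 320):
        print(hs, head_size_is_supported_old(hs), head_size_is_supported_new(hs))
```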
Could you add some unit tests? Looks like we may be able to just extend this list here🤞
vllm/tests/kernels/test_attention.py, lines 32 to 34 in c2ec430:
# FlashAttention forward only supports head dimension at most 128
# https://github.com/ROCmSoftwarePlatform/flash-attention/blob/3d2b6f5d037782cc2c906909a46fb7e2e1b48b25/csrc/flash_attn_rocm/flash_api.cpp#L62
HEAD_SIZES = [64, 80, 96, 112, 120, 128, 192, 256]
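A minimal sketch of extending that parametrization to also cover sizes that rely on FA's padding — the extra values and the test body are illustrative placeholders, not the actual vLLM test:

```python
import pytest

# Illustrative extension of the existing list: the original entries plus a
# few sizes that would only be supported through FlashAttention's padding.
HEAD_SIZES = [64, 80, 96, 112, 120, 128, 192, 256, 40, 72, 144, 200]

@pytest.mark.parametrize("head_size", HEAD_SIZES)
def test_head_size_within_fa_limit(head_size: int) -> None:
    # Placeholder assertion standing in for the real attention kernel test:
    # every parametrized head size must stay within the 256 padding limit.
    assert 0 < head_size <= 256
```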
Looks like we need to build flash without the
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!