update test_attention_perf.py #922
Conversation
/gemini review
Code Review
This pull request introduces a set of valuable improvements to the flash attention implementation. The key changes include updating the performance benchmarks with more realistic data from a Qwen model, which will lead to more accurate performance evaluation. A critical illegal memory access bug in the Triton kernel has been fixed by adding necessary boundary checks, enhancing the operator's stability. Additionally, the heuristics for selecting MHA kernel configurations have been significantly improved to be more aware of hardware and workload characteristics, which should yield better performance. Overall, these are excellent changes that improve correctness, performance, and testing. I have one minor suggestion to improve code clarity in the Triton kernel.
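The boundary-check fix described above follows a common Triton pattern: mask loads and stores so the last block in a grid cannot touch memory past the end of the tensor. The sketch below is hypothetical and not the kernel from this PR; it only illustrates why the mask prevents the kind of illegal memory access the review refers to.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def copy_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    # Boundary check: without this mask, the final block reads and writes
    # past the end of the buffer whenever n_elements is not a multiple of
    # BLOCK, which surfaces as an illegal memory access at runtime.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr + offsets, x, mask=mask)


x = torch.randn(1000, device="cuda")
out = torch.empty_like(x)
# 1000 is not a multiple of BLOCK=256, so the last block relies on the mask.
grid = (triton.cdiv(x.numel(), 256),)
copy_kernel[grid](x, out, x.numel(), BLOCK=256)
```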
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Tongxin Bai <[email protected]>
PR Category
Operator & benchmark refactor
Type of Change
Bug Fix, Refactor
Description
This PR introduces several changes:
- Updates the benchmark shapes used by `test_perf_flash_attn_varlen_func` with more realistic data from a Qwen model, reflecting different decoding phases (an illustrative shape sketch follows this list).
- Fixes an illegal memory access in the Triton flash attention kernel by adding boundary checks.
- Improves the heuristics for selecting MHA kernel configurations so they account for hardware and workload characteristics.
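As a hedged illustration of "shapes reflecting different decoding phases" (the concrete values below are assumptions, not taken from the PR), a varlen-attention benchmark typically distinguishes prefill cases, where the query length equals the key/value length, from decode cases, where a single-token query attends over a long KV cache. The helper name and shape list here are hypothetical.

```python
import torch

# (batch, seqlen_q, seqlen_kv, num_q_heads, num_kv_heads, head_dim) -- illustrative only
PREFILL_AND_DECODE_SHAPES = [
    (1, 4096, 4096, 32, 8, 128),   # prefill: query length == kv length
    (8, 1, 4096, 32, 8, 128),      # decode: single-token query over a long kv cache
    (8, 1, 32768, 32, 8, 128),     # late decode: much longer kv cache
]


def make_cu_seqlens(seqlens):
    """Build the cumulative-sequence-length tensor used by varlen attention APIs."""
    lengths = torch.tensor(seqlens, dtype=torch.int32)
    return torch.nn.functional.pad(lengths.cumsum(0, dtype=torch.int32), (1, 0))


# e.g. three decode queries of length 1 -> tensor([0, 1, 2, 3], dtype=torch.int32)
print(make_cu_seqlens([1, 1, 1]))
```

Issue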
Progress
Performance
Kernel benchmark results on H800