
Conversation

@tongxin (Contributor) commented Aug 26, 2025

PR Category

Operator & benchmark refactor

Type of Change

Bug Fix, Refactor

Description

This PR introduces several changes:

  • Updates the input sequence lengths in test_perf_flash_attn_varlen_func to reflect different decoding phases.
  • Fixes an illegal memory access beyond the boundary of the block table, observed when running the new benchmark cases (a hedged sketch of this kind of guard follows the list).
  • Introduces new varlen_fwd configs and updates the config heuristics for better wave efficiency.
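
A minimal, hypothetical sketch of the kind of boundary check that keeps reads inside a paged-KV block table; this is not the kernel code merged in this PR, and all names (gather_kv_blocks_kernel, seq_used_blocks_ptr, BLOCK_N, etc.) are made up for illustration:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def gather_kv_blocks_kernel(
    block_table_ptr,       # (num_seqs, max_blocks_per_seq) int32 physical page ids
    out_ptr,               # (num_seqs, BLOCK_N) gathered page ids, left at -1 when out of range
    seq_used_blocks_ptr,   # (num_seqs,) number of valid block-table entries per sequence
    max_blocks_per_seq,
    BLOCK_N: tl.constexpr,
):
    seq_id = tl.program_id(0)
    offs = tl.arange(0, BLOCK_N)
    used = tl.load(seq_used_blocks_ptr + seq_id)
    # Guard against both the logical length of this row and the physical row
    # width, so the load can never walk past the end of the block table.
    mask = (offs < used) & (offs < max_blocks_per_seq)
    row_ptr = block_table_ptr + seq_id * max_blocks_per_seq
    pages = tl.load(row_ptr + offs, mask=mask, other=-1)
    tl.store(out_ptr + seq_id * BLOCK_N + offs, pages, mask=mask)


# Tiny smoke test (requires a CUDA device); BLOCK_N > max_blocks to exercise the mask.
num_seqs, max_blocks, BLOCK_N = 3, 5, 8
block_table = torch.arange(num_seqs * max_blocks, dtype=torch.int32, device="cuda").reshape(num_seqs, max_blocks)
used = torch.tensor([5, 2, 4], dtype=torch.int32, device="cuda")
out = torch.full((num_seqs, BLOCK_N), -1, dtype=torch.int32, device="cuda")
gather_kv_blocks_kernel[(num_seqs,)](block_table, out, used, max_blocks, BLOCK_N=BLOCK_N)
```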

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

Kernel benchmark results on H800

test_attention_perf.py 
Operator: flash_attn_varlen_func  Performance Test (dtype=torch.float16, mode=kernel,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.019840            0.021984               0.902          [torch.Size([512, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 512, torch.Size([2]), 512, None, torch.Size([1]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([1, 32]), False, torch.Size([512, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.010832            0.018688               0.580          [torch.Size([72, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 70, torch.Size([4]), 70, None, torch.Size([3]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([3, 5]), False, torch.Size([72, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.081408            0.078784               1.033          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 61, torch.Size([56]), 515, None, torch.Size([55]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([55, 33]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.869696            0.710128               1.225          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 16, torch.Size([201]), 2333, None, torch.Size([200]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([200, 146]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]


Operator: flash_attn_varlen_func  Performance Test (dtype=torch.bfloat16, mode=kernel,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.019136            0.022240               0.860          [torch.Size([512, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 512, torch.Size([2]), 512, None, torch.Size([1]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([1, 32]), False, torch.Size([512, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.011008            0.017056               0.645          [torch.Size([72, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 70, torch.Size([4]), 70, None, torch.Size([3]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([3, 5]), False, torch.Size([72, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.081472            0.078784               1.034          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 61, torch.Size([56]), 515, None, torch.Size([55]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([55, 33]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.873664            0.699280               1.249          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 16, torch.Size([201]), 2333, None, torch.Size([200]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([200, 146]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]

@tongxin tongxin closed this Aug 26, 2025
@tongxin tongxin reopened this Aug 26, 2025
@tongxin tongxin marked this pull request as draft August 26, 2025 02:47
@tongxin tongxin requested a review from meinie0826 September 12, 2025 12:37
@tongxin tongxin marked this pull request as ready for review September 12, 2025 12:37
@tongxin (Contributor, Author) commented Sep 12, 2025

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a set of valuable improvements to the flash attention implementation. The key changes include updating the performance benchmarks with more realistic data from a Qwen model, which will lead to more accurate performance evaluation. A critical illegal memory access bug in the Triton kernel has been fixed by adding necessary boundary checks, enhancing the operator's stability. Additionally, the heuristics for selecting MHA kernel configurations have been significantly improved to be more aware of hardware and workload characteristics, which should yield better performance. Overall, these are excellent changes that improve correctness, performance, and testing. I have one minor suggestion to improve code clarity in the Triton kernel.
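
Hedged illustration only, not the heuristic merged here (the function and parameter names are assumptions): one common way to make config selection "wave aware" is to prefer the candidate whose launch grid fills whole waves of SMs most completely.

```python
import torch

def pick_num_splits(batch: int, num_heads: int, candidates=(1, 2, 4, 8)) -> int:
    """Return the candidate split count whose launch grid fills whole waves best."""
    num_sms = torch.cuda.get_device_properties(0).multi_processor_count

    def wave_efficiency(splits: int) -> float:
        ctas = batch * num_heads * splits   # one CTA per (batch, head, split)
        waves = -(-ctas // num_sms)         # ceil division: waves needed on this GPU
        return ctas / (waves * num_sms)     # how full the final wave is

    return max(candidates, key=wave_efficiency)
```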

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Tongxin Bai <[email protected]>
@tongxin tongxin merged commit 402c53f into master Sep 23, 2025
11 of 14 checks passed
@tongxin tongxin deleted the update-varlen-bench branch September 23, 2025 07:37