
Conversation

@tongxin (Contributor) commented Aug 26, 2025

PR Category

Operator & benchmark refactor

Type of Change

Bug Fix, Refactor

Description

This PR introduces several changes:

  • Updates the input sequence lengths in test_perf_flash_attn_varlen_func to reflect different decoding phases.
  • Fixes an illegal memory access beyond the boundary of the block table, observed when running the new benchmark cases (a hedged sketch of this kind of guard follows the list).
  • Introduces new varlen_fwd configs and updates the config heuristics for better wave efficiency.
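
A minimal, hypothetical sketch of the kind of boundary check that keeps reads inside a paged-KV block table; this is not the kernel code merged in this PR, and all names (gather_kv_blocks_kernel, seq_used_blocks_ptr, BLOCK_N, etc.) are made up for illustration:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def gather_kv_blocks_kernel(
    block_table_ptr,       # (num_seqs, max_blocks_per_seq) int32 physical page ids
    out_ptr,               # (num_seqs, BLOCK_N) gathered page ids, left at -1 when out of range
    seq_used_blocks_ptr,   # (num_seqs,) number of valid block-table entries per sequence
    max_blocks_per_seq,
    BLOCK_N: tl.constexpr,
):
    seq_id = tl.program_id(0)
    offs = tl.arange(0, BLOCK_N)
    used = tl.load(seq_used_blocks_ptr + seq_id)
    # Guard against both the logical length of this row and the physical row
    # width, so the load can never walk past the end of the block table.
    mask = (offs < used) & (offs < max_blocks_per_seq)
    row_ptr = block_table_ptr + seq_id * max_blocks_per_seq
    pages = tl.load(row_ptr + offs, mask=mask, other=-1)
    tl.store(out_ptr + seq_id * BLOCK_N + offs, pages, mask=mask)


# Tiny smoke test (requires a CUDA device); BLOCK_N > max_blocks to exercise the mask.
num_seqs, max_blocks, BLOCK_N = 3, 5, 8
block_table = torch.arange(num_seqs * max_blocks, dtype=torch.int32, device="cuda").reshape(num_seqs, max_blocks)
used = torch.tensor([5, 2, 4], dtype=torch.int32, device="cuda")
out = torch.full((num_seqs, BLOCK_N), -1, dtype=torch.int32, device="cuda")
gather_kv_blocks_kernel[(num_seqs,)](block_table, out, used, max_blocks, BLOCK_N=BLOCK_N)
```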

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

Kernel benchmark results on H800

test_attention_perf.py 
Operator: flash_attn_varlen_func  Performance Test (dtype=torch.float16, mode=kernel,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.019840            0.021984               0.902          [torch.Size([512, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 512, torch.Size([2]), 512, None, torch.Size([1]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([1, 32]), False, torch.Size([512, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.010832            0.018688               0.580          [torch.Size([72, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 70, torch.Size([4]), 70, None, torch.Size([3]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([3, 5]), False, torch.Size([72, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.081408            0.078784               1.033          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 61, torch.Size([56]), 515, None, torch.Size([55]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([55, 33]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.869696            0.710128               1.225          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 16, torch.Size([201]), 2333, None, torch.Size([200]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([200, 146]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]


Operator: flash_attn_varlen_func  Performance Test (dtype=torch.bfloat16, mode=kernel,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.019136            0.022240               0.860          [torch.Size([512, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 512, torch.Size([2]), 512, None, torch.Size([1]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([1, 32]), False, torch.Size([512, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.011008            0.017056               0.645          [torch.Size([72, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 70, torch.Size([4]), 70, None, torch.Size([3]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([3, 5]), False, torch.Size([72, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.081472            0.078784               1.034          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 61, torch.Size([56]), 515, None, torch.Size([55]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([55, 33]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]
SUCCESS               0.873664            0.699280               1.249          [torch.Size([265, 16, 128]), torch.Size([2000, 16, 8, 128]), torch.Size([2000, 16, 8, 128]), 16, torch.Size([201]), 2333, None, torch.Size([200]), None, 0.0, 0.08838834764831845, True, [-1, -1], 0, None, False, False, torch.Size([200, 146]), False, torch.Size([265, 16, 128]), None, None, None, None, 0, 2]

@tongxin tongxin closed this Aug 26, 2025
@tongxin tongxin reopened this Aug 26, 2025
@tongxin tongxin marked this pull request as draft August 26, 2025 02:47
@tongxin tongxin requested a review from meinie0826 September 12, 2025 12:37
@tongxin tongxin marked this pull request as ready for review September 12, 2025 12:37
@tongxin (Contributor, Author) commented Sep 12, 2025

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a set of valuable improvements to the flash attention implementation. The key changes include updating the performance benchmarks with more realistic data from a Qwen model, which will lead to more accurate performance evaluation. A critical illegal memory access bug in the Triton kernel has been fixed by adding necessary boundary checks, enhancing the operator's stability. Additionally, the heuristics for selecting MHA kernel configurations have been significantly improved to be more aware of hardware and workload characteristics, which should yield better performance. Overall, these are excellent changes that improve correctness, performance, and testing. I have one minor suggestion to improve code clarity in the Triton kernel.
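
Hedged illustration only, not the heuristic merged here (the function and parameter names are assumptions): one common way to make config selection "wave aware" is to prefer the candidate whose launch grid fills whole waves of SMs most completely.

```python
import torch

def pick_num_splits(batch: int, num_heads: int, candidates=(1, 2, 4, 8)) -> int:
    """Return the candidate split count whose launch grid fills whole waves best."""
    num_sms = torch.cuda.get_device_properties(0).multi_processor_count

    def wave_efficiency(splits: int) -> float:
        ctas = batch * num_heads * splits   # one CTA per (batch, head, split)
        waves = -(-ctas // num_sms)         # ceil division: waves needed on this GPU
        return ctas / (waves * num_sms)     # how full the final wave is

    return max(candidates, key=wave_efficiency)
```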

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Tongxin Bai <[email protected]>
@tongxin tongxin merged commit 402c53f into master Sep 23, 2025
11 of 14 checks passed
@tongxin tongxin deleted the update-varlen-bench branch September 23, 2025 07:37