
Conversation


@Iamleos commented Sep 30, 2025

No description provided.

@gemini-code-assist

Summary of Changes

Hello @Iamleos, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant feature to the Flash Attention mechanism: the ability to apply custom attention masks. This enhancement provides greater flexibility for advanced attention patterns, such as those required for speculative decoding verification. The changes involve modifying the underlying Pallas kernel, updating the reference implementation, integrating the new parameters into the attention backend, and adding comprehensive test coverage to ensure correctness.

Highlights

  • Custom Mask Support in Flash Attention Kernel: The core Flash Attention kernel (_ragged_paged_attention_kernel) has been enhanced to accept and utilize a custom_mask array and a causal boolean flag. This allows for flexible attention masking beyond standard causal patterns.
  • Reference Implementation and Validation: The reference implementation (ref_ragged_paged_attention) now supports custom_mask and causal parameters, including validation logic to prevent conflicting usage (e.g., using a custom mask when causal masking is enabled).
  • Integration with Pallas Call: The ragged_paged_attention function, which orchestrates the Pallas kernel call, has been updated to pass the new causal and custom_mask parameters, modifying static_argnames, input specifications (in_specs), and scratch memory allocations accordingly.
  • Attention Backend Updates: The FlashAttentionMetadata class and the FlashAttentionBackend now include support for custom_mask, enabling its propagation through the attention pipeline, particularly when ForwardMode.TARGET_VERIFY is active.
  • New Test Coverage for Custom Masks: New test cases, including a create_custom_mask helper function and test_mha_prefill_with_custom_mask, have been added to verify the correctness and functionality of the custom mask feature.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for custom attention masks in the flash attention kernel, which is a valuable feature for non-causal attention patterns like those in speculative decoding. The changes are comprehensive, affecting the reference implementation, the optimized Pallas kernel, and the testing suite. While the core logic for the custom mask in the kernel seems sound, I've identified several issues, including critical bugs in the reference implementation and test setup that will prevent the code from running as intended. There are also some inconsistencies in mask handling between the reference and kernel implementations that should be reconciled. My review provides specific suggestions to address these points and improve the overall quality and correctness of the implementation.

mask_start = cu_kv_lens[i]
mask = custom_mask[mask_start : mask_start + kv_len]
if sliding_window is not None:
    mask = jnp.logical_or(mask, q_span - sliding_window >= kv_span)


critical

This line will cause a NameError when causal=False and sliding_window is not None, because q_span and kv_span are only defined within the if causal: block.

Since sliding window attention is a form of causal masking, it probably doesn't make sense to use it with a custom non-causal mask. You should consider raising a ValueError if causal=False and sliding_window is provided, or define q_span and kv_span for the non-causal case if this combination is intended to be supported.
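
A minimal sketch of the first option (the guard placement and variable names are assumptions based on the snippet above, not the PR's actual code):

if not causal and sliding_window is not None:
    raise ValueError(
        "sliding_window assumes causal attention and cannot be combined "
        "with a custom non-causal mask"
    )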

Comment on lines 118 to 121
if custom_mask == None or custom_mask.size() < jnp.cumsum(kv_lens)[-1]:
    raise ValueError(
        f"use custom_mask, custom_mask length must larger than total kv length"
    )


high

There's a bug here: .size is an attribute of a JAX array, not a method, so calling custom_mask.size() will raise a TypeError. It should be custom_mask.size.

Additionally, for style and correctness, use is None and is not None when checking against None rather than == None and != None. The error message could also be tightened: the f prefix is unnecessary since the string has no interpolation, and the grammar should be fixed.

Suggested change
if custom_mask == None or custom_mask.size() < jnp.cumsum(kv_lens)[-1]:
    raise ValueError(
        f"use custom_mask, custom_mask length must larger than total kv length"
    )
if custom_mask is None or custom_mask.size < jnp.cumsum(kv_lens)[-1]:
    raise ValueError(
        "when using custom_mask, its length must be at least the total kv length"
    )
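
For reference, a quick REPL check of the attribute-vs-method distinction (illustration only, not part of the PR):

import jax.numpy as jnp

x = jnp.ones((4, 8))
print(x.size)  # 32: .size is a plain int attribute on JAX arrays
x.size()       # TypeError: 'int' object is not callable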

Comment on lines 255 to 257
spec_info = EagleVerifyInput(
    custom_mask=custom_mask,
)


high

Instantiating EagleVerifyInput with only custom_mask will raise a TypeError because the EagleVerifyInput dataclass has other required fields that do not have default values. The test will fail to run. You need to provide all required arguments to instantiate EagleVerifyInput correctly.
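
A minimal illustration of why this fails, using a stand-in dataclass (the required fields shown here are hypothetical, not EagleVerifyInput's actual ones):

from dataclasses import dataclass
import jax
import jax.numpy as jnp

@dataclass
class VerifyInput:  # stand-in for EagleVerifyInput with hypothetical fields
    draft_tokens: jax.Array  # required: no default value
    positions: jax.Array     # required: no default value
    custom_mask: jax.Array

# Raises TypeError: __init__() missing 2 required positional arguments:
# 'draft_tokens' and 'positions'
VerifyInput(custom_mask=jnp.zeros(0, dtype=bool))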

    cu_q_lens: jax.Array,  # i32[padded_batch_size + 1]
    cu_kv_lens: jax.Array,  # i32[padded_batch_size + 1]
    distribution: jax.Array,  # i32[3]
    custom_mask: jax.Array,  # if causal is True, custom_mask shape is [patten_total_kv_len], else [0]


medium

The comment for custom_mask is confusing and seems to have the logic reversed. The custom mask is used when causal is False (i.e., causal=0), not when it's True. Please correct the comment to avoid confusion.

Suggested change
    custom_mask: jax.Array,  # if causal is True, custom_mask shape is [patten_total_kv_len], else [0]
    custom_mask: jax.Array,  # if causal is False, custom_mask shape is [patten_total_kv_len], else [0]

Comment on lines 1409 to 1411
print(
    f"{cu_q_lens.shape=} {cu_kv_lens.shape=} {bq_sz=} {bkv_p=} {distribution=} {chunk_prefill_size=}"
)


medium

This print statement appears to be for debugging purposes. It should be removed before merging.
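
If the diagnostic is still useful during development, a hedged alternative is to route it through the logging module so it can be silenced outside of debugging (a sketch; the variable names come from the print above):

import logging

logger = logging.getLogger(__name__)
logger.debug(
    "cu_q_lens.shape=%s cu_kv_lens.shape=%s bq_sz=%s bkv_p=%s "
    "distribution=%s chunk_prefill_size=%s",
    cu_q_lens.shape, cu_kv_lens.shape, bq_sz, bkv_p,
    distribution, chunk_prefill_size,
)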

Comment on lines +318 to +322
if len(mode_args) == 5:
    num_heads, head_dim, num_kv_heads, page_size, dtype = mode_args
    causal = True
else:
    num_heads, head_dim, num_kv_heads, page_size, dtype, causal = mode_args


medium

Unpacking mode_args based on its length is fragile. This can easily break if more arguments are added in the future. Consider refactoring run_test to accept causal as a keyword argument, or use a more structured approach for passing these test parameters.
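
A sketch of the keyword-argument approach (the run_test signature shown is an assumption; only the causal handling is the point here):

# Make causal an explicit keyword with a default instead of inferring it
# from the tuple length; existing call sites remain valid.
def run_test(self, num_heads, head_dim, num_kv_heads, page_size, dtype,
             causal=True):
    ...

# Call sites:
self.run_test(8, 128, 8, 16, jnp.float32)                 # causal by default
self.run_test(8, 128, 8, 16, jnp.float32, causal=False)   # custom-mask path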

Comment on lines +658 to +665
def test_mha_decode_with_custom_mask(self):
    pass

def test_gqa_prefill_with_custom_mask(self):
    pass

def test_gqa_decode_with_custom_mask(self):
    pass


medium

These test cases are currently empty placeholders. To ensure complete test coverage for the new custom mask functionality, these tests should be implemented. They should cover both MHA/GQA attention and prefill/decode modes with custom masks.
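
As a hedged starting point, one placeholder might mirror the existing prefill test (the helper name, parameters, and values here are assumptions based on test_mha_prefill_with_custom_mask described above):

def test_gqa_prefill_with_custom_mask(self):
    # GQA: fewer KV heads than query heads; causal=False exercises the
    # custom-mask path. All parameter values are illustrative.
    self.run_test(
        num_heads=8,
        head_dim=128,
        num_kv_heads=2,
        page_size=16,
        dtype=jnp.float32,
        causal=False,
    )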

@Iamleos force-pushed the fix/eagle-kkx branch 7 times, most recently from 6054f6b to f5b4347 on September 30, 2025 at 10:14
@Iamleos force-pushed the fix/eagle-kkx branch 2 times, most recently from 3bd665d to 4d18b12 on September 30, 2025 at 10:22
@Iamleos force-pushed the fix/eagle-kkx branch 2 times, most recently from 115b2da to f79dd69 on September 30, 2025 at 12:44