
Conversation

@jayhshah (Collaborator) commented Jan 8, 2026

We add varlen support to the sm100 backward pass and expose it through the flash_attn_varlen_func API.
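
For context, a minimal sketch of how the new path would be exercised from Python, assuming the FA2/FA3-style flash_attn_varlen_func signature (packed (total_tokens, nheads, headdim) tensors plus cu_seqlens/max_seqlen arguments); the exact import path for the sm100 CuTe-DSL build may differ:

# Sketch only: assumes the FA2/FA3-style flash_attn_varlen_func signature and
# a standard import path; the sm100 (CuTe DSL) build may expose it differently.
import itertools
import torch
from flash_attn import flash_attn_varlen_func

nheads, headdim = 16, 128
seqlens = [512, 2048, 8192, 1024]  # ragged batch
cu_seqlens = torch.tensor([0] + list(itertools.accumulate(seqlens)),
                          dtype=torch.int32, device="cuda")
total = sum(seqlens)

q = torch.randn(total, nheads, headdim, device="cuda", dtype=torch.bfloat16,
                requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
out.sum().backward()  # exercises the new sm100 varlen backward on Blackwell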

Head-to-head benchmarks against the non-varlen sm100 backward at equal sequence lengths show minimal overhead:

### headdim = 128, causal = False, seqlen = 8192, batch_size = 4, nheads = 16, nheads_kv = 16 ###
FA Python fwd: 1.445ms, 1522.0 TFLOPS
FA Python varlen fwd: 1.485ms, 1481.1 TFLOPS
FA Python bwd: 4.461ms, 1232.2 TFLOPS
FA Python varlen bwd: 4.577ms, 1201.0 TFLOPS
FA Python bwd (deterministic): 6.007ms, 915.1 TFLOPS
FA Python varlen bwd (deterministic): 6.020ms, 913.3 TFLOPS

### headdim = 128, causal = True, seqlen = 8192, batch_size = 4, nheads = 16, nheads_kv = 16 ###
FA Python fwd: 0.775ms, 1419.4 TFLOPS
FA Python varlen fwd: 0.792ms, 1388.1 TFLOPS
FA Python bwd: 2.447ms, 1123.4 TFLOPS
FA Python varlen bwd: 2.542ms, 1081.4 TFLOPS
FA Python bwd (deterministic): 2.923ms, 940.4 TFLOPS
FA Python varlen bwd (deterministic): 3.012ms, 912.6 TFLOPS

To fix an alignment issue when loading the padded LSE in the backward kernel, we change the padded offsets to FA3 style, e.g.:

# Each batch index adds m_block_size of slack on top of the varlen offset.
padded_offset_q = seqlen.offset_q + batch_idx * self.m_block_size
if cutlass.const_expr(self.arch >= 90):
    # FA3 style: round down to a block multiple so the per-batch base stays aligned.
    padded_offset_q = padded_offset_q // self.m_block_size * self.m_block_size
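
As a purely illustrative check of why the rounding matters (hypothetical m_block_size and cu_seqlens values, not taken from the PR): each batch's base into the padded LSE/accum buffer ends up block-aligned, and consecutive batches' slices never overlap.

# Illustrative only: hypothetical block size and offsets, not kernel code.
m_block_size = 128
cu_seqlens_q = [0, 200, 500, 1300]  # ragged offsets for a 3-batch input

def padded_offset(batch_idx):
    padded = cu_seqlens_q[batch_idx] + batch_idx * m_block_size
    # arch >= 90 path from the snippet above: round down to a block multiple
    return padded // m_block_size * m_block_size

for b in range(3):
    start, nxt = padded_offset(b), padded_offset(b + 1)
    seqlen_b = cu_seqlens_q[b + 1] - cu_seqlens_q[b]
    assert start % m_block_size == 0   # block-aligned base for the padded LSE loads
    assert nxt - start >= seqlen_b     # slices for consecutive batches never overlap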

@jayhshah force-pushed the jshah/sm100-varlen-bwd branch from 3b798d3 to 3d5f721 on January 8, 2026 08:17
@kiddyboots216

varlen_fwd (and bwd) training matches FA2 on Blackwell

@jayhshah force-pushed the jshah/sm100-varlen-bwd branch from 10ccba2 to 63147ed on January 8, 2026 18:51
@jayhshah requested a review from tridao on January 8, 2026 19:07
@jayhshah force-pushed the jshah/sm100-varlen-bwd branch from 63147ed to 7b7c045 on January 8, 2026 19:17
@v0i0 self-requested a review on January 9, 2026 19:26
if const_expr(self.qhead_per_kvhead > 1):
self.is_varlen_k = mCuSeqlensK is not None or mSeqUsedK is not None
self.is_varlen_q = mCuSeqlensQ is not None or mSeqUsedQ is not None
self.use_tma_store = not (self.qhead_per_kvhead == 1 and self.is_varlen_k)
Collaborator:
is that meant to be an or? what is the logic here?

@jayhshah (Collaborator, Author):
You're right to call this out; I only need to skip the TMA store of dK/dV for cu_seqlens_k. I'll change this.

In general:

  1. varlen k is the condition for using the varlen scheduler
  2. varlen q is a condition for the process-tile check, since the number of m blocks processed may be 0 when the query length for that batch is 0 (see the sketch below)
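
To make point 2 concrete, a hypothetical sketch (made-up names, not the kernel's actual variables) of the kind of check it implies:

# Hypothetical illustration of point 2: with varlen q, a batch may have length
# 0, so the per-batch work must tolerate num_m_blocks == 0 and skip the tile.
def num_m_blocks(seqlen_q, m_block_size):
    return (seqlen_q + m_block_size - 1) // m_block_size  # ceil division

def process_batch(seqlen_q, m_block_size=128):
    if num_m_blocks(seqlen_q, m_block_size) == 0:
        return  # zero-length batch: no m blocks to process for this tile
    for m_block in range(num_m_blocks(seqlen_q, m_block_size)):
        ...  # accumulate dK/dV contributions from this m block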

@jayhshah (Collaborator, Author):
Though we also disable the TMA store for seqused_q only in the forward kernel, so there will be an inconsistency here (albeit with rarely used settings).

@jayhshah (Collaborator, Author) commented Jan 9, 2026:

To address your other implicit question: since we use a special padded intermediate tensor for the TMA reduce-add of the dK/dV accum when GQA is used, we are free to use the TMA store there without worrying about the usual problem of overwriting other batches' outputs. So it should be 'and' and not 'or'.

For the same reason we can use TMA reduce add for dQ accum.
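
To picture the padded-intermediate idea, here is a rough host-side sketch; all names, shapes, and the Python framing are illustrative assumptions, not the kernel's actual buffers:

# Illustrative sketch only: a per-batch, block-padded fp32 accumulator for
# dK/dV so that TMA reduce-add for one batch can never touch a neighbor's
# rows; a postprocess then copies the valid rows back into the packed output.
import torch

def round_up(x, m):
    return (x + m - 1) // m * m

def alloc_padded_dkv_accum(seqlens_k, nheads_kv, headdim, n_block_size=128):
    padded_offsets = [0]
    for s in seqlens_k:
        padded_offsets.append(padded_offsets[-1] + round_up(s, n_block_size))
    accum = torch.zeros(padded_offsets[-1], nheads_kv, headdim,
                        dtype=torch.float32, device="cuda")
    return accum, padded_offsets

def postprocess_dkv(accum, padded_offsets, seqlens_k, dk_packed):
    # Copy only the valid (unpadded) rows of each batch into the packed gradient.
    start = 0
    for b, s in enumerate(seqlens_k):
        valid = accum[padded_offsets[b]:padded_offsets[b] + s]
        dk_packed[start:start + s] = valid.to(dk_packed.dtype)
        start += s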

Collaborator:
so the idea is: if we have gqa we post-process, so we can always use tma, even with seqlen_k. and we can use it without seqlen_k. so the only case where we need to not use tma is not seqlen_k and not gqa.

@jayhshah (Collaborator, Author) commented Jan 9, 2026:

Yes (assuming you mean no TMA = cu_seqlens_k and MHA). I also tried using the postprocess with MHA and cu_seqlens_k to allow for the TMA store (hence the separate dKV_postprocess boolean), but that was slightly slower.
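
For readers following along, a minimal sketch of the condition as resolved in this thread, mirroring the names in the quoted diff; it is illustrative, not necessarily the final code in the PR:

# Resolved logic (sketch): only the MHA + cu_seqlens_k combination has to fall
# back from the TMA store of dK/dV; with GQA the padded intermediate
# accumulator makes the TMA store / TMA reduce-add safe, and seqused_k alone
# no longer disables it.
is_mha = self.qhead_per_kvhead == 1
self.use_tma_store = not (is_mha and mCuSeqlensK is not None)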

@jayhshah merged commit ed6a82f into main on Jan 9, 2026
1 check passed