
Conversation

@rupengliu-meta (Contributor) commented Dec 1, 2025

Description

In the bo and bq paths, we can save the computed size (sz) in scalar scratch (SMEM) instead of recalculating it, which removes unnecessary scalar computation.
The throughput improvement appears minimal, but it is consistently around 1%-2% even before tuning!
Tests have passed for both kernels.
[Screenshots: kernel benchmark results, 2025-12-01]
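As a rough illustration of the pattern (a minimal Pallas sketch with made-up shapes and names, not this kernel's actual code): a scalar computed on the first grid step is cached in SMEM scalar scratch and reused on later steps instead of being recomputed.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl
from jax.experimental.pallas import tpu as pltpu

def kernel(x_ref, o_ref, sz_ref):
    i = pl.program_id(0)

    @pl.when(i == 0)
    def _():
        # First grid step: derive the scalar once and cache it in SMEM
        # scratch; later steps reuse it instead of recomputing.
        sz_ref[0] = x_ref[0, 0].astype(jnp.int32)

    # All steps read the cached scalar (TPU grid steps run sequentially,
    # so scratch written at step 0 persists for later steps).
    o_ref[...] = x_ref[...] + sz_ref[0].astype(x_ref.dtype)

x = jnp.ones((8, 128), jnp.float32)
out = pl.pallas_call(
    kernel,
    grid=(4,),
    in_specs=[pl.BlockSpec((8, 128), lambda i: (0, 0))],
    out_specs=pl.BlockSpec((8, 128), lambda i: (0, 0)),
    out_shape=jax.ShapeDtypeStruct((8, 128), jnp.float32),
    scratch_shapes=[pltpu.SMEM((1,), jnp.int32)],  # the scalar cache
)(x)
```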


Tests

Ran unit tests and did local e2e testing.

Checklist

Before submitting this PR, please make sure:

  • I have performed a self-review of my code.
  • I have added necessary comments in my code, particularly in hard-to-understand areas.
  • I have made or will make corresponding changes to any relevant documentation.

@rupengliu-meta rupengliu-meta marked this pull request as ready for review December 1, 2025 18:46
@rupengliu-meta rupengliu-meta changed the title from "Save size in scalar for bo and bq" to "Save size in scalar scratch for bo and bq" on Dec 1, 2025
@yaochengji (Collaborator) left a comment


Thanks for the contribution! I think the trade-off is between scalar computation and scalar load/store. Do you have any performance numbers after the modification?

@rupengliu-meta (Contributor, Author) commented

> Thanks for the contribution! I think the trade-off is between scalar computation and scalar load/store. Do you have any performance numbers after the modification?

yes, I will update the perf numbers later

@rupengliu-meta (Contributor, Author) commented Dec 2, 2025

The throughput improvement seems minimal, but it is consistently around 1%-2%. Tested through the kernel benchmarking script (not e2e).

@kyuyeunk (Collaborator) left a comment


Isn't this change applicable to bkv as well, i.e., saving bkv sz to a scalar scratch?

@rupeng-liu (Contributor) commented Dec 3, 2025

@kyuyeunk yep, good idea. I just checked bkv sz: it is computed as offset + bkv_sz_frm_new. During the wait=False phase there is no existing value to reuse, so we would still have to do the extra calculation there even if we cached it. So this might not be applicable to bkv?

@kyuyeunk (Collaborator) commented Dec 6, 2025

> Thanks for the contribution! I think the trade-off is between scalar computation and scalar load/store. Do you have any performance numbers after the modification?

> yes, I will update the perf numbers later

Ping on updating the perf numbers in the PR description.

@rupengliu-meta (Contributor, Author) commented

> Thanks for the contribution! I think the trade-off is between scalar computation and scalar load/store. Do you have any performance numbers after the modification?

> yes, I will update the perf numbers later

> Ping on updating the perf numbers in the PR description.

Updated, thanks!

@kyuyeunk (Collaborator) left a comment


lgtm but requires approval from @bythew3i

@kyuyeunk kyuyeunk added the `ready` label (ONLY add when PR is ready to merge/full CI is needed) on Dec 9, 2025
```diff
-    bo_ids_ref,  # [4] (bo_sem_0_seq_idx, bo_sem_1_seq_idx, bo_sem_0_bo_idx, bo_sem_1_bo_idx)
+    bo_ids_ref,  # [6] (bo_sem_0_seq_idx, bo_sem_1_seq_idx, bo_sem_0_bo_idx, bo_sem_1_bo_idx, bo_sem_0_sz, bo_sem_1_sz)
     bkv_update_ids_ref,  # [6] (bkv_sem_0_seq_idx, bkv_sem_1_seq_idx, bkv_sem_0_offset, bkv_sem_1_offset, bkv_sem_0_sz, bkv_sem_1_sz)
+    bq_fetch_ids_ref,  # [2] (bq_sem_0_sz, bq_sem_1_sz)
```
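For orientation, these read as double-buffered scalar ids, one slot per semaphore index (0/1), held in SMEM scalar scratch. A hypothetical allocation consistent with the layouts above; the shapes and dtype here are assumptions, not the kernel's actual declarations:

```python
import jax.numpy as jnp
from jax.experimental.pallas import tpu as pltpu

# Hypothetical scratch allocation mirroring the comment layouts above;
# each ref is double-buffered, one slot per semaphore index (0/1).
scratch_shapes = [
    pltpu.SMEM((6,), jnp.int32),  # bo_ids_ref: seq idxs, bo idxs, cached sizes
    pltpu.SMEM((6,), jnp.int32),  # bkv_update_ids_ref: seq idxs, offsets, sizes
    pltpu.SMEM((2,), jnp.int32),  # bq_fetch_ids_ref: cached bq sizes
]
```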
Collaborator left a comment

nit: just call it bq_ids_ref

```python
        )
    else:
        # Retrieve sz from scratch instead of recalculating
        sz = bq_fetch_ids_ref[bq_sem_idx]
```
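To make the two phases concrete, here is a hedged sketch of how the store and load sides probably pair up; the helper name, the `wait` flag handling, and the size formula are assumptions, and only the `bq_fetch_ids_ref[bq_sem_idx]` access comes from the diff:

```python
import jax.numpy as jnp

def get_bq_sz(wait, bq_sem_idx, bq_fetch_ids_ref, seq_len, bq_sz):
    # Hypothetical helper: on the issue phase (wait=False) the size is
    # computed once and cached in scalar scratch; on the wait phase it
    # is read back instead of being recomputed.
    if not wait:
        sz = jnp.minimum(bq_sz, seq_len)   # stand-in for the real size calc
        bq_fetch_ids_ref[bq_sem_idx] = sz  # cache for the wait phase
    else:
        sz = bq_fetch_ids_ref[bq_sem_idx]  # retrieve instead of recalculating
    return sz
```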
Collaborator left a comment

Definitely need to retune and update the tuned block sizes. I understand you may not have an autotune script, but please write a benchmarking script, even with the same block size; we want to see perf on different block sizes and different models. I am very strict about this in Google-internal kernel development as well. We don't want to check in code without really understanding how much it helps across different models (shapes) and block sizes.
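For what it's worth, a harness along the requested lines could look like this sketch; `run_kernel`, the shapes, and the block sizes are all placeholders for the real kernel entry point and tuning space:

```python
import functools
import time

import jax
import jax.numpy as jnp

def bench(run_kernel, shapes, block_sizes, iters=50):
    """Times a kernel across shapes and block sizes; prints us/iter."""
    for shape in shapes:
        x = jnp.zeros(shape, jnp.float32)
        for bs in block_sizes:
            # block_size is baked in via partial so each config compiles once.
            fn = jax.jit(functools.partial(run_kernel, block_size=bs))
            fn(x).block_until_ready()  # compile and warm up
            t0 = time.perf_counter()
            for _ in range(iters):
                out = fn(x)
            out.block_until_ready()    # flush async dispatch before timing
            dt = (time.perf_counter() - t0) / iters
            print(f"shape={shape} block_size={bs}: {dt * 1e6:.1f} us/iter")
```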

Collaborator left a comment

Even appending the throughput change on different models would be acceptable. Thanks.

