
Fix TBE v2 forward kernel for embedding dim > 1024 (#5326) (#5569) #5641

Open

q10 wants to merge 1 commit into pytorch:main from q10:export-D99746894

Conversation


q10 (Contributor) commented on Apr 15, 2026

Summary: Pull Request resolved: #5569

Test Plan:

## Test Commands

buck2 test fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:forward -- ForwardTest.test_forward_gpu_no_cache_fp16
buck2 test fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:forward -- ForwardTest.test_forward_gpu_no_cache_fp32

## What Is Tested

1. **test_forward_gpu_no_cache_fp16** - Exercises the v2 forward kernel (when use_experimental_tbe=True, generated by Hypothesis) with D in {2..256 step 16, 1024, 1280, 1536, 2048}. Validates FP16 forward correctness for both the small-L and large-L paths with the dynamic early exit fix. T, B, L are capped (T<=2, B<=16, L<=4) for D > 256 to prevent OOM.

2. **test_forward_gpu_no_cache_fp32** - Same D range as above with FP32 weights. Uses proportional max_TBL = max(1, 2048/D) scaling for large D to prevent OOM while still exercising the v2 kernel max_num_warps_per_row computation.
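The embedding-dim sweep and the two OOM-capping schemes above can be sketched as plain Python. This is an illustrative reconstruction from the description, not the actual test helpers; the names `candidate_dims` and `caps_for_dim` are invented for this sketch, and the exact step-16 endpoints are an assumption.

```python
def candidate_dims():
    # D in {2..256 step 16} (endpoint handling assumed), plus the large
    # dims that push rows past one warp in the v2 kernel.
    return list(range(2, 257, 16)) + [1024, 1280, 1536, 2048]

def caps_for_dim(D, fp32=False):
    # FP16 test: fixed caps T<=2, B<=16, L<=4 once D exceeds 256.
    # FP32 test: proportional cap max_TBL = max(1, 2048 // D), so the
    # total problem size shrinks as D grows.
    if D <= 256:
        return None  # small dims need no extra capping
    if fp32:
        m = max(1, 2048 // D)
        return {"T": m, "B": m, "L": m}
    return {"T": 2, "B": 16, "L": 4}
```

The proportional scheme keeps roughly constant memory pressure across large D (D=1024 gives a cap of 2, D=2048 gives 1), while the fixed caps are a blunter bound for the FP16 path.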

Both tests use Hypothesis-generated use_experimental_tbe in {True, False}, covering both the v1 (legacy) and v2 (experimental) kernel paths.
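The fp32 test is said to exercise the v2 kernel's max_num_warps_per_row computation, i.e. how many warps must cooperate on one embedding row once D is too wide for a single warp. A minimal sketch of that idea, with illustrative constants (warp size 32, 4 elements per thread) that are assumptions rather than the kernel's actual tile sizes:

```python
import math

def max_num_warps_per_row(D, warp_size=32, elems_per_thread=4):
    # Each warp covers warp_size * elems_per_thread embedding elements
    # (128 with these assumed constants); wider rows need multiple
    # warps, which is the D > 1024 regime the fix targets.
    elems_per_warp = warp_size * elems_per_thread
    return max(1, math.ceil(D / elems_per_warp))
```

Under these assumptions a D=2048 row needs 16 cooperating warps, which is why the test sweep includes dims well beyond 1024.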

BUCK target: fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:forward

Reviewed By: henrylhtsang

Differential Revision: D99746894

Pulled By: q10

meta-cla Bot added the `cla signed` label on Apr 15, 2026

meta-codesync Bot commented on Apr 15, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99746894.

