
Fix TBE v2 forward kernel for embedding dim > 1024 (#5326) (#5569) #5641

Open

q10 wants to merge 1 commit into pytorch:main from q10:export-D99746894

Conversation


q10 (Contributor) commented on Apr 15, 2026

Summary: Pull Request resolved: #5569

Test Plan:

## Test Commands

buck2 test fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:forward -- ForwardTest.test_forward_gpu_no_cache_fp16
buck2 test fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:forward -- ForwardTest.test_forward_gpu_no_cache_fp32

## What Is Tested

1. **test_forward_gpu_no_cache_fp16** - Exercises the v2 forward kernel (when use_experimental_tbe=True, generated by Hypothesis) with D in {2..256 step 16, 1024, 1280, 1536, 2048}. Validates FP16 forward correctness for both the small-L and large-L paths with the dynamic early exit fix. T, B, L are capped (T<=2, B<=16, L<=4) for D > 256 to prevent OOM.

2. **test_forward_gpu_no_cache_fp32** - Same D range as above with FP32 weights. Uses proportional max_TBL = max(1, 2048/D) scaling for large D to prevent OOM while still exercising the v2 kernel max_num_warps_per_row computation.
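The embedding-dim sweep and the two OOM-capping schemes above can be sketched as plain Python. This is an illustrative reconstruction from the description, not the actual test helpers; the names `candidate_dims` and `caps_for_dim` are invented for this sketch, and the exact step-16 endpoints are an assumption.

```python
def candidate_dims():
    # D in {2..256 step 16} (endpoint handling assumed), plus the large
    # dims that push rows past one warp in the v2 kernel.
    return list(range(2, 257, 16)) + [1024, 1280, 1536, 2048]

def caps_for_dim(D, fp32=False):
    # FP16 test: fixed caps T<=2, B<=16, L<=4 once D exceeds 256.
    # FP32 test: proportional cap max_TBL = max(1, 2048 // D), so the
    # total problem size shrinks as D grows.
    if D <= 256:
        return None  # small dims need no extra capping
    if fp32:
        m = max(1, 2048 // D)
        return {"T": m, "B": m, "L": m}
    return {"T": 2, "B": 16, "L": 4}
```

The proportional scheme keeps roughly constant memory pressure across large D (D=1024 gives a cap of 2, D=2048 gives 1), while the fixed caps are a blunter bound for the FP16 path.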

Both tests use Hypothesis-generated use_experimental_tbe in {True, False}, covering both the v1 (legacy) and v2 (experimental) kernel paths.
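The fp32 test is said to exercise the v2 kernel's max_num_warps_per_row computation, i.e. how many warps must cooperate on one embedding row once D is too wide for a single warp. A minimal sketch of that idea, with illustrative constants (warp size 32, 4 elements per thread) that are assumptions rather than the kernel's actual tile sizes:

```python
import math

def max_num_warps_per_row(D, warp_size=32, elems_per_thread=4):
    # Each warp covers warp_size * elems_per_thread embedding elements
    # (128 with these assumed constants); wider rows need multiple
    # warps, which is the D > 1024 regime the fix targets.
    elems_per_warp = warp_size * elems_per_thread
    return max(1, math.ceil(D / elems_per_warp))
```

Under these assumptions a D=2048 row needs 16 cooperating warps, which is why the test sweep includes dims well beyond 1024.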

BUCK target: fbcode//deeplearning/fbgemm/fbgemm_gpu/test/tbe:forward

Reviewed By: henrylhtsang

Differential Revision: D99746894

Pulled By: q10

meta-cla Bot added the `cla signed` label on Apr 15, 2026

meta-codesync Bot commented on Apr 15, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99746894.

