Conversation

@soodoshll soodoshll commented Dec 10, 2025

Still cleaning up the code and tuning performance.

Benchmark on H200

w/ warp specialization:

       m      n      k   name  latency (ms)      tflops
0   4096   4096   4096  torch      0.179568  765.386669
1   4096   4096   4096  tilus      0.217712  631.287910
2   4096   4096  14336  torch      0.611840  786.212620
3   4096   4096  14336  tilus      0.679744  707.672791
4   8192   8192   8192  torch      1.448576  759.029314
5   8192   8192   8192  tilus      1.771616  620.626388
6  10240  10240  10240  torch      2.862048  750.331101
7  10240  10240  10240  tilus      3.750112  572.645213

w/o warp specialization:

       m      n      k   name  latency (ms)      tflops
0   4096   4096   4096  torch      0.179392  766.137605
1   4096   4096   4096  tilus      0.221408  620.749733
2   4096   4096  14336  torch      0.612464  785.411598
3   4096   4096  14336  tilus      0.692160  694.978516
4   8192   8192   8192  torch      1.447856  759.406751
5   8192   8192   8192  tilus      1.840032  597.550281
6  10240  10240  10240  torch      2.987184  718.899036
7  10240  10240  10240  tilus      3.776592  568.630034

Benchmark on H100

Performance w/ warp specialization:

       m      n      k   name  latency (ms)      tflops
0   4096   4096   4096  torch      0.285840  480.824791
1   4096   4096   4096  tilus      0.342080  401.774306
2   4096   4096  14336  torch      1.050144  458.067017
3   4096   4096  14336  tilus      1.329120  361.920928
4   8192   8192   8192  torch      2.372816  463.378361
5   8192   8192   8192  tilus      3.895360  282.261878
6  10240  10240  10240  torch      5.452080  393.883370
7  10240  10240  10240  tilus      7.888288  272.236972

Performance w/o warp specialization:

       m      n      k   name  latency (ms)      tflops
0   4096   4096   4096  torch      0.286512  479.697019
1   4096   4096   4096  tilus      0.323552  424.781656
2   4096   4096  14336  torch      1.052896  456.869745
3   4096   4096  14336  tilus      1.522928  315.862823
4   8192   8192   8192  torch      2.381680  461.653800
5   8192   8192   8192  tilus      4.800304  229.050419
6  10240  10240  10240  torch      5.461200  393.225600
7  10240  10240  10240  tilus      9.372672  229.121816
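For reference, the tflops column in these tables follows from the standard GEMM flop count of 2·m·n·k (one multiply and one add per inner-product term) divided by the measured latency. A minimal sketch of that conversion, assuming latency is reported in milliseconds as above (the function name is illustrative, not from the PR):

```python
def gemm_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """TFLOPS achieved by an m x n x k GEMM that ran in latency_ms milliseconds.

    A dense GEMM performs 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    flops = 2 * m * n * k
    seconds = latency_ms / 1e3
    return flops / seconds / 1e12

# Reproduces the first H200 row (torch, 4096^3, 0.179568 ms):
print(round(gemm_tflops(4096, 4096, 4096, 0.179568), 4))  # ≈ 765.3867
```

Plugging in any row of the tables above recovers the reported tflops value to within rounding.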


copy-pr-bot bot commented Dec 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@soodoshll soodoshll marked this pull request as ready for review December 14, 2025 04:47
@soodoshll soodoshll requested a review from yaoyaoding December 14, 2025 04:47
@yaoyaoding (Member)

Thanks @soodoshll, the PR LGTM!

@yaoyaoding (Member)

/ok to run 3507ec3

@yaoyaoding (Member)

Consider using the sign-commits script to sign all commits, so that the CI will run automatically.

@soodoshll soodoshll changed the title [WIP][Example] warp specialization gemm for hopper [Example] warp specialization gemm for hopper Dec 15, 2025
@soodoshll (Collaborator, Author)

/ok to test 3507ec3

@soodoshll (Collaborator, Author)

/ok to test 27a5b82

@soodoshll soodoshll force-pushed the hopper-matmul3 branch 2 times, most recently from de8b304 to 117a67f Compare December 16, 2025 01:27
yaoyaoding and others added 11 commits December 15, 2025 17:27
Signed-off-by: Qidong Su <[email protected]>
@yaoyaoding (Member)

Thanks @soodoshll !

@yaoyaoding yaoyaoding merged commit 002eec3 into NVIDIA:main Dec 16, 2025
8 checks passed