Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add matmul with transpose #35

Merged

Conversation

junjihashimoto
Copy link
Collaborator

This PR implements matrix multiplication with transposed weight.
In case of NVIDIA, it will be 1.5 times faster.

# In case of NVIDIA GeForce RTX 3080 Laptop GPU
$ yes |  MATMUL_VERSION=8  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 8: 2D blocktiling with loop unrolling and vectorization, 30 iterations ...
113.3 milliseconds / dispatch ~ 2426.99 GFLOPS
$ yes |  MATMUL_VERSION=9  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 9: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
74.7 milliseconds / dispatch ~ 3680.25 GFLOPS

# In case of M2 pro
$ yes |  MATMUL_VERSION=8  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 8: 2D blocktiling with loop unrolling and vectorization, 30 iterations ...
164.3 milliseconds / dispatch ~ 1672.82 GFLOPS
$ yes |  MATMUL_VERSION=9  make  | grep 'Kernel version\|GFLOPS'
[info] Dispatching Kernel version 9: 2D blocktiling with loop unrolling, vectorization and transpose, 30 iterations ...
160.7 milliseconds / dispatch ~ 1710.18 GFLOPS

@junjihashimoto junjihashimoto force-pushed the feature/transposed-matmul branch 2 times, most recently from 3aee496 to cf19e1a Compare July 30, 2024 13:53
@austinvhuang
Copy link
Contributor

Nice! That's a bit surprising - most examples I've seen seem to prefer the transposed forms of weights for efficiency, with both weights and input element-wise contributions into the dot product moving from left to right columns. This is already counter to the mathematical representation, so I'm surprised transposing the weights from that helps but there's probably something I'm missing.

I would suggest just having a different shader including comments on the diffs instead of relying on the removeUnnecessaryIfStatements. This makes it more explicit. The regex transformation works of course, but if we start composing these regex manipulations too deeply I think things will start to getting painful to debug + understand what's going on.

If we feel a strong need to do more sophisticated code transformations and being explicit in writing WGSL code isn't sufficient, we could consider code generation mechanisms - can chat in the discord about what this might look like. But I'd only reach for that if we're really hitting a wall of what's feasible.

Comment on lines 306 to 310
if ({{TRANSPOSE}}) {
localN[idx] = tileB[dotIdx * {{BN}} + threadCol + idx];
} else {
localN[idx] = tileB[(threadCol + idx) * {{BK}} + dotIdx];
}
Copy link
Collaborator Author

@junjihashimoto junjihashimoto Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! That's a bit surprising - most examples I've seen seem to prefer the transposed forms of weights for efficiency, with both weights and input element-wise contributions into the dot product moving from left to right columns. This is already counter to the mathematical representation, so I'm surprised transposing the weights from that helps but there's probably something I'm missing.

The current transposed weights are suitable for cpu to prefetch data as 1D array from DRAM.
Tiled matrix multiplication needs to load data as 2D array, then the advantage is diminished.
So there will probably not be much difference in loading between transposed case and not-transposed case.

Next, this PR changes the access of shared memory like this.

          if ({{TRANSPOSE}}) {
            localN[idx] = tileB[dotIdx * {{BN}} + threadCol + idx];
          } else {
            localN[idx] = tileB[(threadCol + idx) * {{BK}} + dotIdx];
          }

This may have reduced bank collisions and therefore improved performance.

https://www.vrushankdes.ai/diffusion-policy-inference-optimization/part-ii---cuda-kernel-optimization-tips#compute-optimizations-

If memory padding improves performance, bank collisions may be the cause.

I would suggest just having a different shader including comments on the diffs instead of relying on the removeUnnecessaryIfStatements. This makes it more explicit. The regex transformation works of course, but if we start composing these regex manipulations too deeply I think things will start to getting painful to debug + understand what's going on.

I included the removeUnnecessaryIfStatements to check the difference in code, but I will create a new kernel function not to use it.

@junjihashimoto junjihashimoto force-pushed the feature/transposed-matmul branch 4 times, most recently from b558a20 to b29d14b Compare July 31, 2024 03:34
@junjihashimoto junjihashimoto force-pushed the feature/transposed-matmul branch from b29d14b to d0f5e87 Compare July 31, 2024 09:42
@austinvhuang
Copy link
Contributor

Awesome, thanks!

@austinvhuang austinvhuang merged commit 6bdd778 into AnswerDotAI:main Jul 31, 2024
1 check passed
@austinvhuang
Copy link
Contributor

I'm getting a segfault out of the box on my MBP. May roll back temporarily until we find the issue.

@austinvhuang
Copy link
Contributor

Reverted for now with a536c8f let me know if you have any issues reproducing the segfault, i can take a closer look if it's helpful.

Comment on lines +801 to +814
const char* versionToStr(int version){
switch (version) {
case 1: return "No-Op";
case 2: return "naive matmul";
case 3: return "tiling";
case 4: return "1D blocktiling";
case 5: return "2D blocktiling";
case 6: return "1D blocktiling with loop unrolling";
case 7: return "2D blocktiling with loop unrolling";
case 8: return "2D blocktiling with loop unrolling and vectorization";
case 9: return "2D blocktiling with loop unrolling, vectorization and transpose";
default: return "Not specified";
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The segmentation's cause may be here.

junjihashimoto added a commit to junjihashimoto/gpu.cpp that referenced this pull request Aug 1, 2024
Comment on lines 817 to +819
char* version_str = getenv("MATMUL_VERSION");
int version = version_str == NULL ? 7 : atoi(version_str);
// 1 == naive matmul
// 2 == tiling
// 3 == 1D blocktiling
// 4 == 2D blocktiling
// 5 == 1D blocktiling with loop unrolling
// 6 == 2D blocktiling with loop unrolling
// 7 == 2D blocktiling with loop unrolling and vectorization
// 8 == No-Op
char* kTestSize_str = getenv("MATMUL_SIZE");
int version = version_str == NULL ? 9 : atoi(version_str);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is wrong, because getenv uses a global variable to store a returned value, but getenv("MATMUL_VERSION") 's result is read after calling getenv("MATMUL_SIZE").
https://wiki.sei.cmu.edu/confluence/display/c/ENV34-C.+Do+not+store+pointers+returned+by+certain+functions.

@junjihashimoto junjihashimoto deleted the feature/transposed-matmul branch August 2, 2024 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants