-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add matmul with transpose #35
Add matmul with transpose #35
Conversation
3aee496
to
cf19e1a
Compare
Nice! That's a bit surprising - most examples I've seen seem to prefer the transposed forms of weights for efficiency, with both weights and input element-wise contributions into the dot product moving from left to right columns. This is already counter to the mathematical representation, so I'm surprised transposing the weights from that helps but there's probably something I'm missing. I would suggest just having a different shader including comments on the diffs instead of relying on the If we feel a strong need to do more sophisticated code transformations and being explicit in writing WGSL code isn't sufficient, we could consider code generation mechanisms - can chat in the discord about what this might look like. But I'd only reach for that if we're really hitting a wall of what's feasible. |
examples/matmul/run.cpp
Outdated
if ({{TRANSPOSE}}) { | ||
localN[idx] = tileB[dotIdx * {{BN}} + threadCol + idx]; | ||
} else { | ||
localN[idx] = tileB[(threadCol + idx) * {{BK}} + dotIdx]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! That's a bit surprising - most examples I've seen seem to prefer the transposed forms of weights for efficiency, with both weights and input element-wise contributions into the dot product moving from left to right columns. This is already counter to the mathematical representation, so I'm surprised transposing the weights from that helps but there's probably something I'm missing.
The current transposed weights are suitable for cpu to prefetch data as 1D array from DRAM.
Tiled matrix multiplication needs to load data as 2D array, then the advantage is diminished.
So there will probably not be much difference in loading between transposed case and not-transposed case.
Next, this PR changes the access of shared memory like this.
if ({{TRANSPOSE}}) {
localN[idx] = tileB[dotIdx * {{BN}} + threadCol + idx];
} else {
localN[idx] = tileB[(threadCol + idx) * {{BK}} + dotIdx];
}
This may have reduced bank collisions and therefore improved performance.
If memory padding improves performance, bank collisions may be the cause.
I would suggest just having a different shader including comments on the diffs instead of relying on the removeUnnecessaryIfStatements. This makes it more explicit. The regex transformation works of course, but if we start composing these regex manipulations too deeply I think things will start to getting painful to debug + understand what's going on.
I included the removeUnnecessaryIfStatements to check the difference in code, but I will create a new kernel function not to use it.
b558a20
to
b29d14b
Compare
b29d14b
to
d0f5e87
Compare
Awesome, thanks! |
I'm getting a segfault out of the box on my MBP. May roll back temporarily until we find the issue. |
Reverted for now with a536c8f let me know if you have any issues reproducing the segfault, i can take a closer look if it's helpful. |
const char* versionToStr(int version){ | ||
switch (version) { | ||
case 1: return "No-Op"; | ||
case 2: return "naive matmul"; | ||
case 3: return "tiling"; | ||
case 4: return "1D blocktiling"; | ||
case 5: return "2D blocktiling"; | ||
case 6: return "1D blocktiling with loop unrolling"; | ||
case 7: return "2D blocktiling with loop unrolling"; | ||
case 8: return "2D blocktiling with loop unrolling and vectorization"; | ||
case 9: return "2D blocktiling with loop unrolling, vectorization and transpose"; | ||
default: return "Not specified"; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The segmentation's cause may be here.
…egfault is resolved" This reverts commit a536c8f.
char* version_str = getenv("MATMUL_VERSION"); | ||
int version = version_str == NULL ? 7 : atoi(version_str); | ||
// 1 == naive matmul | ||
// 2 == tiling | ||
// 3 == 1D blocktiling | ||
// 4 == 2D blocktiling | ||
// 5 == 1D blocktiling with loop unrolling | ||
// 6 == 2D blocktiling with loop unrolling | ||
// 7 == 2D blocktiling with loop unrolling and vectorization | ||
// 8 == No-Op | ||
char* kTestSize_str = getenv("MATMUL_SIZE"); | ||
int version = version_str == NULL ? 9 : atoi(version_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is wrong, because getenv uses a global variable to store a returned value, but getenv("MATMUL_VERSION") 's result is read after calling getenv("MATMUL_SIZE").
https://wiki.sei.cmu.edu/confluence/display/c/ENV34-C.+Do+not+store+pointers+returned+by+certain+functions.
This PR implements matrix multiplication with transposed weight.
In case of NVIDIA, it will be 1.5 times faster.