Hi, I am using the following system configuration:

I am following your tutorials on YouTube and used the file `alignment_matrix_mul.cu` in three configurations:

- Transpose `a` matrix (`temp_sum += a[k * n + row] * b[col + n * k];`)
- Transpose `b` matrix (`temp_sum += a[k + n * row] * b[col * n + k];`)

We would expect the GPU to perform best when we transpose matrix `a`, because the memory accesses across threads are then coalesced, but profiling shows that it performs better when I transpose matrix `b`.

The only thing I am doing differently is that I am using Nsight Compute as a separate application to profile the binary built by Visual Studio, rather than the built-in extension. I am also attaching the performance images I got:

- Transpose `a` matrix: https://drive.google.com/file/d/1rPwMpalSwfVpZ8-jBpO3ROL1R7POAzRt/view?usp=sharing
- Transpose `b` matrix: https://drive.google.com/file/d/1WHIQBRRk1KjJk5MXVUc4AopGzqWPDwFh/view?usp=sharing

I have double-checked the transpositions and these are the results I get. Could some other bottleneck be causing them? For example, could the cost of fetching multiple elements inside the loop (index `k`) outweigh the benefit of the coalesced access?