Unexpected results with Memory Coalescing #5

khizar-anjum · 2020-12-06T04:29:01Z

Hi, I am using the following system configuration:

Windows 10
Visual Studio 2019 Community
CUDA 10.2
NVIDIA Nsight Compute 2019.5.0
NVIDIA RTX 2060 GPU (Turing architecture)

I am following your tutorials on YouTube and used the file alignment_matrix_mul.cu in three configurations:

No transpose (just as we were doing it before)
Transpose a matrix (temp_sum += a[k * n + row] * b[col + n * k];)
Transpose b matrix (temp_sum += a[k + n * row] * b[col * n + k];)

I would expect the GPU to perform best when matrix a is transposed, since the memory accesses should then be coalesced, but the profiling shows that it performs better when I transpose matrix b.

The only thing I am doing differently is that I profile the binary built by Visual Studio with Nsight Compute as a standalone application, rather than through the built-in extension. The profiling screenshots are attached:

No Transpose: https://drive.google.com/file/d/18-l8W3csIjCRRoxgASsWevjRIV9hINXp/view?usp=sharing
Transpose a matrix: https://drive.google.com/file/d/1rPwMpalSwfVpZ8-jBpO3ROL1R7POAzRt/view?usp=sharing
Transpose b matrix: https://drive.google.com/file/d/1WHIQBRRk1KjJk5MXVUc4AopGzqWPDwFh/view?usp=sharing

I have double-checked the transpositions, and these are the results I get. Could some other bottleneck be causing them? For example, could the cost of fetching multiple elements in the loop over k outweigh the benefit of coalesced access?
