
Unexpected results with Memory Coalescing #5

Open
khizar-anjum opened this issue Dec 6, 2020 · 0 comments

khizar-anjum commented Dec 6, 2020

Hi, I am using the following system configuration:

  • Windows 10
  • Visual Studio 2019 Community
  • CUDA 10.2
  • Nvidia Nsight Compute 2019.5.0
  • Nvidia RTX 2060 GPU (Turing Architecture)

I am following your tutorials on YouTube and used the file alignment_matrix_mul.cu in three configurations (sketched in the code after this list):

  • No transpose (just as we were doing it before)
  • Transpose matrix a (temp_sum += a[k * n + row] * b[col + n * k];)
  • Transpose matrix b (temp_sum += a[k + n * row] * b[col * n + k];)

We would expect the GPU to perform best when matrix a is transposed, since the loads from a are then coalesced across each warp, but the profiling shows that it performs better when I transpose matrix b.

The only thing I am doing differently here is that I am profiling the built binary with the standalone Nsight Compute application rather than the Visual Studio extension. I am also attaching the performance images I got.

I have double-checked the transpositions, and this is what I get. Could some other bottleneck be causing these results, i.e. could the cost of fetching elements across the inner loop (index k) outweigh the benefit of coalesced access?
