
GPU optimization of LOBPCG #1068

Open · abussy wants to merge 1 commit into master
Conversation

@abussy (Collaborator) commented Feb 24, 2025

This PR is the result of a detailed profiling of the LOBPCG solver with NVIDIA's Nsight Systems. It allowed for the identification of various hot spots, where code is very slow during GPU runs.

In particular, there are many instances of explicit loops over matrix columns. This access pattern is not ideal, as the massive parallelism of the GPU is not fully exploited. Array operations on the whole matrix are far more efficient.
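For illustration, here is a minimal sketch (hypothetical helper names, not code from this PR) of the two access patterns, a per-column loop versus a single whole-array operation:

using LinearAlgebra

# Hypothetical illustration: per-column loop.
# On a GPU array this launches one small kernel per column.
function normalize_columns_loop!(X)
    for i in 1:size(X, 2)
        n = norm(@view X[:, i])
        @views X[:, i] ./= n
    end
    X
end

# Whole-array operation: a single fused broadcast/reduction over the matrix.
function normalize_columns_array!(X)
    norms = sqrt.(sum(abs2, X; dims=1))  # 1×n row of column norms
    X ./= norms                          # divide all columns at once
    X
end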

I measured speed-ups of the order of 30% on the whole LOBPCG iterative solver. Excluding the cost of the H x Psi product (not modified in this PR), the speed-ups reach 50%.

Unfortunately, this comes at the cost of some code readability. I left comments describing what is calculated when necessary.

Finally, I scrapped two loops that used DFTK's custom threading (in ortho! X vs Y and in ldiv! for the preconditioner). I made sure the effect on CPU runs is negligible (tested with the default n_DFTK = n_blas thread option). It seems that simple BLAS threading on large array operations is quite efficient by itself.

end
end
# Using array operations for GPU performance
norms = vec(sqrt.(sum(abs2, X; dims=1)))
Member

norm.(eachcol(X))?

Member

Wait, does that not work? I can't test it, as I don't have a GPU here.

@abussy (Collaborator, Author) Feb 24, 2025

The array operation gives a 10-20x performance boost over norm.(eachcol(X)). Essentially, it goes from being the main bottleneck of ortho! X vs Y to being almost negligible.
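For reference, one rough way to reproduce this kind of comparison (assumes a CUDA-capable GPU; the matrix size is illustrative and timings will vary with hardware):

using CUDA, LinearAlgebra, BenchmarkTools

X = CuArray(randn(ComplexF64, 100_000, 64))

@btime CUDA.@sync norm.(eachcol($X))                   # one reduction per column
@btime CUDA.@sync vec(sqrt.(sum(abs2, $X; dims=1)))    # single batched reduction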

Member

Is this an issue with GPUArrays (that it doesn't implement the right overloads) or something more intrinsic?

@views X[:,i] ./= n
end
# using array operations for GPU efficiency
norms = sqrt.(sum(abs2, X; dims=1))
Member

same as above

num = sum(conj(X) .* AX, dims=1)
den = sum(conj(X) .* BX, dims=1)
vec(real.(num ./ den))
end
Member

Is it clear that this way of doing it is faster? E.g. on the CPU you're possibly allocating big temporary arrays and losing out on cache locality?

Collaborator Author

It did not show up as an issue when I was testing the CPU performance of this PR, although it might be in some cases. As for the norms above, the performance gain on the GPU is massive.

For code that could be problematic on the CPU, one option would be to have a GPU-specific definition in DFTKCUDAExt.jl and dispatch to it when needed. The main drawback of that approach is code inflation, though.
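For concreteness, a sketch of what such a dispatch split could look like (the compute_λ name follows the diff below; the CUDA-extension method itself is hypothetical):

using LinearAlgebra

# Generic fallback (CPU-friendly): column-wise loop, no large temporaries.
function compute_λ(X::AbstractMatrix, AX, BX)
    @views [real((X[:, n]' * AX[:, n]) / (X[:, n]' * BX[:, n])) for n in 1:size(X, 2)]
end

# GPU-specialised method, e.g. living in DFTKCUDAExt.jl (requires CUDA.jl):
function compute_λ(X::CUDA.CuMatrix, AX, BX)
    num = sum(conj(X) .* AX; dims=1)
    den = sum(conj(X) .* BX; dims=1)
    vec(real.(num ./ den))  # result stays on the device
end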

λ = @views [real((X[:, n]'*AX[:, n]) / (X[:, n]'BX[:, n])) for n=1:size(X, 2)]
λ_device = oftype(X[:, 1], λ) # Offload to GPU if needed
λ_device = compute_λ(X, AX, BX)
λ = oftype(ones(eltype(λ_device), 1), λ_device)
Member

This will need to be merged with master

(; λ=λ_device, X, AX, BX,
residual_norms=norm.(eachcol(residuals)),
residual_norms=oftype(ones(eltype(norms), 1), norms),
Member

Add a to_cpu method?

Collaborator Author

Valid point, it would make a lot of sense here
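A minimal sketch of what such a helper could look like (name and dispatch are suggestions, not part of this PR):

to_cpu(x::AbstractArray) = Array(x)  # copy device data back to the host
to_cpu(x::Array)         = x         # no-op for arrays already on the CPU

# The residual_norms line above could then read, roughly:
# residual_norms = to_cpu(norms)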

@antoine-levitt (Member)
Hm, I'm kind of split on this.

  • Looks like GPU style clashes with threading in a number of places. Threading has mostly been an annoyance, with very limited benefit. I think I'm fine with just scrapping threading from DFTK altogether and pointing people to the GPU if they want to run large-scale calculations.
  • It's pretty unfortunate if GPU forces us towards a matlab/numpy style. Those dot products are not very nice... I'd rather use things like eachcol, map, etc.

Also don't add comments at each place we use vector-style operations: we should converge on one uniform style and use it without comments.

@mfherbst (Member)

Yeah same here. It's a shame this has such an impact on readability. From my point of view the key points are:

  • Avoid code duplication: I think it's better to have a bit of ugly code in selected places (with good comments and a common convention, of course) than separate GPU and CPU versions with different code. Maybe for some primitives we use often (e.g. column-wise norms or outer products) we can introduce functions that are easier to understand and hide the array syntax a little; see the sketch after this list.

  • Avoid performance regressions on CPU: for small arrays this may not matter, but we should test on non-trivial calculations whether the extra temporaries end up hurting us.
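One possible shape for such primitives (hypothetical names, not part of this PR), keeping call sites readable while hiding the array syntax:

using LinearAlgebra

columnwise_norms(X::AbstractMatrix)                   = vec(sqrt.(sum(abs2, X; dims=1)))
columnwise_dots(X::AbstractMatrix, Y::AbstractMatrix) = vec(sum(conj(X) .* Y; dims=1))

# Call sites then stay close to the current CPU style, e.g.
#   norms = columnwise_norms(X)
#   λ     = real.(columnwise_dots(X, AX) ./ columnwise_dots(X, BX))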
