GPU optimization of LOBPCG #1068
base: master
Conversation
end
end
# Using array operations for GPU performance
norms = vec(sqrt.(sum(abs2, X; dims=1)))
norm.(eachcol(X))?
Wait, does that not work? I can't test it; I don't have a GPU here.
There is a 10-20x performance boost with the array operation over norm.(eachcol(X)). Essentially, it goes from being the main bottleneck of ortho! X vs Y to being almost negligible.
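For reference, a minimal benchmark sketch of the two variants (assuming CUDA.jl and BenchmarkTools.jl are available; the matrix size is made up for illustration and is not the one profiled in this PR):

```julia
using CUDA, LinearAlgebra, BenchmarkTools

X = CuArray(randn(ComplexF64, 10_000, 64))  # illustrative size only

# One norm call per column: many tiny kernel launches on the GPU.
colnorms_loop(X)  = norm.(eachcol(X))

# One fused reduction over the whole matrix.
colnorms_array(X) = vec(sqrt.(sum(abs2, X; dims=1)))

@btime CUDA.@sync colnorms_loop($X)
@btime CUDA.@sync colnorms_array($X)
```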
Is this an issue with GPUArrays (that it doesn't implement the right overloads) or something more intrinsic?
@views X[:,i] ./= n
end
# using array operations for GPU efficiency
norms = sqrt.(sum(abs2, X; dims=1))
same as above
num = sum(conj(X) .* AX, dims=1)
den = sum(conj(X) .* BX, dims=1)
vec(real.(num ./ den))
end
Is it clear that this way of doing it is faster? E.g. on CPU you're possibly allocating big arrays and losing out on cache locality?
It did not show up as an issue when I was testing the CPU performance of this PR, though it might matter in some cases. As for the norms above, the performance gain on the GPU is massive.
For code that could be problematic on the CPU, one option would be to have a GPU definition of it in DFTKCUDAExt.jl and dispatch to it when needed. The main issue with that is code inflation, though.
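A rough sketch of that dispatch idea (the function name mirrors the PR, but the extension layout and method placement are illustrative, not what this PR actually does):

```julia
using CUDA, LinearAlgebra

# Generic fallback: column-wise loop, cache-friendly and allocation-light on CPU.
function compute_λ(X, AX, BX)
    @views [real((X[:, n]' * AX[:, n]) / (X[:, n]' * BX[:, n])) for n = 1:size(X, 2)]
end

# GPU method, which would live in an extension such as ext/DFTKCUDAExt.jl:
# a couple of whole-array reductions instead of one small kernel launch per column.
function compute_λ(X::CuArray, AX::CuArray, BX::CuArray)
    num = sum(conj(X) .* AX; dims=1)
    den = sum(conj(X) .* BX; dims=1)
    vec(real.(num ./ den))
end
```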
λ = @views [real((X[:, n]'*AX[:, n]) / (X[:, n]'BX[:, n])) for n=1:size(X, 2)]
λ_device = oftype(X[:, 1], λ) # Offload to GPU if needed
λ_device = compute_λ(X, AX, BX)
λ = oftype(ones(eltype(λ_device), 1), λ_device)
This will need to be merged with master
(; λ=λ_device, X, AX, BX,
residual_norms=norm.(eachcol(residuals)),
residual_norms=oftype(ones(eltype(norms), 1), norms),
Add a to_cpu method?
Valid point, it would make a lot of sense here.
Hm, I'm kind of split on this.
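For instance, something along these lines (a minimal sketch; to_cpu as written here is hypothetical, not an existing function):

```julia
# Hypothetical helper: bring a possibly-GPU array back to host memory.
to_cpu(x::AbstractArray) = Array(x)  # e.g. CuArray -> Array (device-to-host copy)
to_cpu(x::Array) = x                 # already on the CPU: no copy

# The result assembly could then read
#   residual_norms = to_cpu(norms)
# instead of the oftype(ones(eltype(norms), 1), norms) construction.
```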
Also, don't add comments at each place we use vector-style operations: we should converge on one uniform style and use it without comments.
Yeah, same here. It's a shame this has such an impact on readability. From my point of view the key points are:
This PR is the result of detailed profiling of the LOBPCG solver with NVIDIA's Nsight Systems, which allowed the identification of various hot spots where the code is very slow during GPU runs.
In particular, there are many instances of explicit loops over matrix columns. This access pattern is not ideal, as the massive parallelism of the GPU is not fully exploited. Array operations on the whole matrix are far more efficient.
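As a schematic illustration of the pattern (not a literal excerpt from the PR):

```julia
using LinearAlgebra

# Explicit loop over columns: on the GPU every iteration launches its own
# small kernels, leaving most of the device idle.
function normalize_columns_loop!(X)
    for i in 1:size(X, 2)
        @views X[:, i] ./= norm(X[:, i])
    end
    X
end

# Whole-array formulation: one reduction plus one broadcast over all columns at once.
function normalize_columns_array!(X)
    norms = sqrt.(sum(abs2, X; dims=1))
    X ./= norms
    X
end
```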
I measured speed-ups of the order of 30% on the whole LOBPCG iterative solver. Excluding the cost of the H x Psi product (not modified in this PR), the speed-ups reach 50%.
Unfortunately, this comes at the cost of some code readability. I left comments describing what is calculated when necessary.
Finally, I scrapped two loops using DFTK's custom threading (in ortho! X vs Y and in ldiv! for the preconditioner). I made sure the effect is negligible on CPU runs (tested with the default n_DFTK = n_blas thread option). It seems that simple BLAS threading on large array operations is quite efficient by itself.