
GPU optimization of LOBPCG #1068

Open · abussy wants to merge 1 commit into master
Conversation

@abussy (Collaborator) commented Feb 24, 2025

This PR is the result of a detailed profiling of the LOBPCG solver with NVIDIA's Nsight Systems. It allowed for the identification of various hot spots, where code is very slow during GPU runs.

In particular, there are many instances of explicit loops over matrix columns. This access pattern is not ideal, as the massive parallelism of the GPU is not fully exploited. Array operations on the whole matrix are far more efficient.
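For illustration, here is a minimal sketch (hypothetical helper names, not code from this PR) of the two access patterns, a per-column loop versus a single whole-array operation:

using LinearAlgebra

# Hypothetical illustration: per-column loop.
# On a GPU array this launches one small kernel per column.
function normalize_columns_loop!(X)
    for i in 1:size(X, 2)
        n = norm(@view X[:, i])
        @views X[:, i] ./= n
    end
    X
end

# Whole-array operation: a single fused broadcast/reduction over the matrix.
function normalize_columns_array!(X)
    norms = sqrt.(sum(abs2, X; dims=1))  # 1×n row of column norms
    X ./= norms                          # divide all columns at once
    X
end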

I measured speed-ups of the order of 30% on the whole LOBPCG iterative solver. Excluding the cost of the H x Psi product (not modified in this PR), the speed-ups reach 50%.

Unfortunately, this comes at the cost of some code readability. I left comments describing what is calculated when necessary.

Finally, I scrapped two loops that used DFTK's custom threading (in ortho! X vs Y and in ldiv! for the preconditioner). I made sure the effect on CPU runs is negligible (tested with the default n_DFTK = n_blas thread option). It seems that simple BLAS threading on large array operations is quite efficient by itself.

end
end
# Using array operations for GPU performance
norms = vec(sqrt.(sum(abs2, X; dims=1)))
Member

norm.(eachcol(X))?

Member

Wait, does that not work? I can't test it, as I don't have a GPU here.

@abussy (Collaborator, Author) Feb 24, 2025

The array operation gives a 10-20x performance boost over norm.(eachcol(X)). Essentially, it goes from being the main bottleneck of ortho! X vs Y to being almost negligible.
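For reference, one rough way to reproduce this kind of comparison (assumes a CUDA-capable GPU; the matrix size is illustrative and timings will vary with hardware):

using CUDA, LinearAlgebra, BenchmarkTools

X = CuArray(randn(ComplexF64, 100_000, 64))

@btime CUDA.@sync norm.(eachcol($X))                   # one reduction per column
@btime CUDA.@sync vec(sqrt.(sum(abs2, $X; dims=1)))    # single batched reduction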

Member

Is this an issue with GPUArrays (that it doesn't implement the right overloads) or something more intrinsic?

@views X[:,i] ./= n
end
# using array operations for GPU efficiency
norms = sqrt.(sum(abs2, X; dims=1))
Member

same as above

num = sum(conj(X) .* AX, dims=1)
den = sum(conj(X) .* BX, dims=1)
vec(real.(num ./ den))
end
Member

Is it clear that this way of doing it is faster? E.g. on the CPU you're possibly allocating big temporary arrays and losing out on cache locality?

Collaborator Author

It did not show up as an issue when I was testing the CPU performance of this PR, although it might be in some cases. As for the norms above, the performance gain on the GPU is massive.

For code that could be problematic on the CPU, one option would be to have a GPU-specific definition in DFTKCUDAExt.jl and dispatch to it when needed. The main drawback of that approach is code inflation, though.
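For concreteness, a sketch of what such a dispatch split could look like (the compute_λ name follows the diff below; the CUDA-extension method itself is hypothetical):

using LinearAlgebra

# Generic fallback (CPU-friendly): column-wise loop, no large temporaries.
function compute_λ(X::AbstractMatrix, AX, BX)
    @views [real((X[:, n]' * AX[:, n]) / (X[:, n]' * BX[:, n])) for n in 1:size(X, 2)]
end

# GPU-specialised method, e.g. living in DFTKCUDAExt.jl (requires CUDA.jl):
function compute_λ(X::CUDA.CuMatrix, AX, BX)
    num = sum(conj(X) .* AX; dims=1)
    den = sum(conj(X) .* BX; dims=1)
    vec(real.(num ./ den))  # result stays on the device
end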

λ = @views [real((X[:, n]'*AX[:, n]) / (X[:, n]'BX[:, n])) for n=1:size(X, 2)]
λ_device = oftype(X[:, 1], λ) # Offload to GPU if needed
λ_device = compute_λ(X, AX, BX)
λ = oftype(ones(eltype(λ_device), 1), λ_device)
Member

This will need to be merged with master

(; λ=λ_device, X, AX, BX,
residual_norms=norm.(eachcol(residuals)),
residual_norms=oftype(ones(eltype(norms), 1), norms),
Member

Add a to_cpu method?

Collaborator Author

Valid point, it would make a lot of sense here
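A minimal sketch of what such a helper could look like (name and dispatch are suggestions, not part of this PR):

to_cpu(x::AbstractArray) = Array(x)  # copy device data back to the host
to_cpu(x::Array)         = x         # no-op for arrays already on the CPU

# The residual_norms line above could then read, roughly:
# residual_norms = to_cpu(norms)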

@antoine-levitt (Member)
Hm, I'm kind of split on this.

  • Looks like GPU style clashes with threading in a number of places. Threading has mostly been an annoyance, with very limited benefit. I think I'm fine with just scrapping threading from DFTK altogether and pointing people to the GPU if they want to run large-scale calculations.
  • It's pretty unfortunate if GPU forces us towards a matlab/numpy style. Those dot products are not very nice... I'd rather use things like eachcol, map, etc.

Also don't add comments at each place we use vector-style operations: we should converge on one uniform style and use it without comments.

@mfherbst (Member)

Yeah same here. It's a shame this has such an impact on readability. From my point of view the key points are:

  • Avoid code duplication: I think it's better to have a bit of ugly code in selected places (with good comments and a common convention, of course) than separate GPU and CPU versions with different code. Maybe for some primitives we use often (e.g. column-wise norms or outer products) we can introduce functions that are easier to understand and hide the array syntax a little; see the sketch after this list.

  • Avoid performance regressions on CPU: for small arrays this may not matter, but we should test on non-trivial calculations whether the extra temporaries end up hurting us.
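One possible shape for such primitives (hypothetical names, not part of this PR), keeping call sites readable while hiding the array syntax:

using LinearAlgebra

columnwise_norms(X::AbstractMatrix)                   = vec(sqrt.(sum(abs2, X; dims=1)))
columnwise_dots(X::AbstractMatrix, Y::AbstractMatrix) = vec(sum(conj(X) .* Y; dims=1))

# Call sites then stay close to the current CPU style, e.g.
#   norms = columnwise_norms(X)
#   λ     = real.(columnwise_dots(X, AX) ./ columnwise_dots(X, BX))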
