Parallel periodic coupling #71

Open · pjaap wants to merge 1 commit into master from feature/parallel-coupling

Conversation

@pjaap (Member) commented Jul 4, 2025

This seems to work well now.

I timed the parts of the function in `Example312_PeriodicBoundary3D.main(h = 5e-6, order = 2, threads = XX)`
and get:

| threads | prepare dof-maps | box search | main loop | merging |
|---------|------------------|------------|-----------|---------|
| 1       | 6.5              | 0.6        | 26        | 0.08    |
| 2       | 6.5              | 0.4        | 14        | 0.18    |
| 4       | 6.3              | 0.26       | 9.6       | 0.4     |
| 8       | 6.4              | 0.16       | 9.6       | 0.6     |

My computer has 6 physical cores, so it saturates after 4 threads.
On our clusters the parallel efficiency is even better.

```julia
chunks = Iterators.partition(bfaces_of_interest, chunk_length)

# loop over boundary face indices in a chunk: we need this index for dofs_on_boundary
function compute_chunk_result(chunk)
```

Maybe the need for `local` comes from the fact that you define this function inside another one.
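
For illustration, a minimal sketch of the chunk-per-task pattern, assuming `compute_chunk_result` builds and returns one local result per chunk (as in the snippet above). Passing it and the face list in as arguments, instead of closing over variables of the enclosing function, is one way to sidestep the `local` annotations:

```julia
using Base.Threads

# Sketch only: `compute_chunk_result` stands in for the chunk worker from the PR.
# Taking it and the face list as arguments avoids capturing (and possibly boxing)
# variables from the enclosing function, i.e. the reason for `local`.
function run_chunks(compute_chunk_result, bfaces_of_interest, chunk_length)
    chunks = Iterators.partition(bfaces_of_interest, chunk_length)

    # one task per chunk of boundary faces
    tasks = [Threads.@spawn compute_chunk_result(chunk) for chunk in chunks]

    # wait for all tasks and collect the per-chunk results for a later merge
    return fetch.(tasks)
end
```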

@j-fu (Member) commented Jul 4, 2025

For the local matrices, you could assemble into instances of `SparseMatrixLNK`
and then use the overloaded `+` operator:

https://github.com/WIAS-PDELib/ExtendableSparse.jl/blob/a07808002f0fe2c875d2533291edd497d01e4157/src/matrix/sparsematrixlnk.jl#L297

This avoids lots of intermediate transformations in ExtendableSparse and the `sparse` call during the merge, and should be faster.
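
If I read this right, the pattern would look roughly like the sketch below. The `SparseMatrixLNK{Tv, Ti}(m, n)` constructor and the assumption that the linked `+` method merges an LNK matrix into a `SparseMatrixCSC` are taken from my reading of that file, so the signatures may need adjusting:

```julia
using ExtendableSparse, SparseArrays

n = 100

# each thread/chunk assembles into its own LNK matrix (constructor signature assumed)
local_matrices = [ExtendableSparse.SparseMatrixLNK{Float64, Int64}(n, n) for _ in 1:4]
for (t, A) in enumerate(local_matrices)
    A[t, t] = 1.0        # dummy values standing in for the coupling entries
    A[t, t + 1] = 0.5
end

# merge via the overloaded `+`, assumed to add an LNK matrix into a CSC matrix,
# creating missing entries and summing coinciding ones
merged = reduce((csc, lnk) -> lnk + csc, local_matrices; init = spzeros(n, n))
```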

@j-fu (Member) commented Jul 4, 2025

There is also the newer https://github.com/WIAS-PDELib/ExtendableSparse.jl/blob/master/src/matrix/sparsematrixdilnkc.jl with the corresponding `+` overload,

which avoids the need for a fully sized colind array in the sparse matrix format by replacing it with a dict.

@pjaap (Member, Author) commented Jul 4, 2025

Regarding the local matrices: the merging at the end is currently barely measurable in terms of time.
Can you elaborate on how to use the new ExtendableSparse types? I do not see how the `+` operator merges entries instead of summing them.

@j-fu (Member) commented Jul 4, 2025

But I think the main culprit is interpolate!. The number of allocations in this call grows proportionally to the number of nodes in the grid, so it seems to allocate for every node in the grid while we only work on two parts of the boundary. Somehow each interpolate! call must run over all elements; I think this should be fixed in the first place.

@pjaap (Member, Author) commented Jul 4, 2025

> But I think the main culprit is interpolate!. The number of allocations in this call grows proportionally to the number of nodes in the grid, so it seems to allocate for every node in the grid while we only work on two parts of the boundary. Somehow each interpolate! call must run over all elements; I think this should be fixed in the first place.

Yes, we are fighting on two different fronts here 😃

@j-fu (Member) commented Jul 4, 2025

> Regarding the local matrices: the merging at the end is currently barely measurable in terms of time. Can you elaborate on how to use the new ExtendableSparse types? I do not see how the `+` operator merges entries instead of summing them.

It essentially does both: it adds to existing entries and creates new ones if they are missing.

@pjaap (Member, Author) commented Jul 4, 2025

> It essentially does both: it adds to existing entries and creates new ones if they are missing.

Here we explicitly need no adding. Some entries are calculated multiple times, since the precomputed "searchareas" overlap, so adding would create a wrong result. This could be avoided with a Bool vector "allow list" that blocks inserting the same index twice... then a simple reduction with `+` should be possible.
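
A minimal sketch of that allow-list idea (all names are placeholders, not the PR's actual variables): each worker keeps a Bool vector marking the boundary dofs it has already written, so an entry produced a second time by an overlapping search area is skipped, and a later reduction with `+` cannot double-count it.

```julia
# Sketch only: `A` is a worker-local matrix, `written` marks boundary dofs that
# have already received their coupling entries, so duplicates coming from
# overlapping search areas are skipped instead of added twice.
function insert_once!(A, written::AbstractVector{Bool}, dofs, values)
    for (dof, val) in zip(dofs, values)
        written[dof] && continue   # already handled by another (overlapping) search area
        A[dof, dof] = val          # stand-in for the actual coupling entries
        written[dof] = true
    end
    return A
end

# usage sketch: written = falses(ndofs); insert_once!(A_local, written, dofs, vals)
```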

@j-fu (Member) commented Jul 4, 2025

Sitting with Christian: searchareas is empty and therefore the find loop goes over all cells...

@pjaap (Member, Author) commented Jul 4, 2025

oh what? This is definitely a regression.

@chmerdon (Member) commented Jul 4, 2025

Shouldn't we couple bregions 3 and 5 with the g function above?

@pjaap (Member, Author) commented Jul 4, 2025

Maybe, but then this should also trigger an error.

@chmerdon (Member) commented Jul 4, 2025

I thought so, too, but I didn't get an error with 3 and 5, and the gridplot also suggests these numbers.

@pjaap (Member, Author) commented Jul 4, 2025

Then you are right. I started with a 2D grid and reused the numbers since there was no error. Do we get meaningful search areas then?

I suggest throwing an error if the areas are empty.

@chmerdon (Member) commented Jul 4, 2025

I think that's why the searchareas are empty, which triggers the NodalInterpolator to evaluate at every node.

@chmerdon (Member) commented Jul 4, 2025

> I suggest throwing an error if the areas are empty.

Yes, that is a good idea.

@j-fu (Member) commented Jul 6, 2025

As for timing: in order to verify correct complexity for the single-threaded case, I would propose to have a scaling test (not necessarily in CI). I think we should have complexity O(number_of_surface_nodes_to_be_coupled). This means that execution time should increase by a factor of approximately 4 when going from h to h/2 (increasing nref by one). At the moment it seems to be much larger. Maybe things then become fast enough even without parallelization.
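
A rough sketch of such a scaling test, using the `main` keyword arguments mentioned in this thread; the concrete h values and the tolerance are placeholders, and each run is preceded by a warm-up call so compilation is not measured:

```julia
using Test

include("examples/Example312_PeriodicBoundary3D.jl")

# time one single-threaded run, after a warm-up run to exclude compilation
function coupling_time(h; order = 2)
    Example312_PeriodicBoundary3D.main(; order, h, threads = 1)
    return @elapsed Example312_PeriodicBoundary3D.main(; order, h, threads = 1)
end

# halving h roughly quadruples the number of coupled surface nodes, so an
# O(number_of_surface_nodes_to_be_coupled) implementation should take ~4x longer
h = 1e-4                      # placeholder mesh size
t_coarse = coupling_time(h)
t_fine = coupling_time(h / 2)
@test t_fine / t_coarse < 6   # generous bound around the expected factor of 4
```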

@chmerdon (Member) commented Jul 7, 2025

Right, so the number of calls of __eval_point scales with a factor of 4 (when bregions 3 and 5 are coupled), but the overall runtime does not...

@chmerdon (Member) commented Jul 7, 2025

Also, the duration and allocations of each interpolate! call stay constant when I refine, so maybe something else in the loop causes the bad scaling?

@chmerdon (Member) commented Jul 7, 2025

Aha, the loop below `# set entries` scales with a factor of 8.

@pjaap (Member, Author) commented Jul 7, 2025

The bad scaling of the interpolation loop was caused by using a dense vector for fe_vector_target. We experimented with a sparse vector implementation, and this solves the problem.

The interpolation loop now scales optimally.

The bottleneck (with a much smaller factor) is now the offline searchareas construction.

This PR needs WIAS-PDELib/ExtendableFEMBase.jl#43.
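
Not the PR's actual code, but the effect of the sparse target can be illustrated with a plain SparseVector: only the stored entries are visited afterwards, so the per-interpolation work no longer grows with the total number of dofs.

```julia
using SparseArrays

ndofs = 10^6                        # total number of dofs in the grid
target = spzeros(Float64, ndofs)    # sparse stand-in for fe_vector_target

# the interpolation writes only the handful of boundary dofs it touches ...
target[123] = 1.0
target[4567] = 0.25

# ... and the follow-up "set entries" loop visits only the stored entries
# instead of iterating over all ndofs entries of a dense vector
dofs, vals = findnz(target)
for (dof, val) in zip(dofs, vals)
    # set the corresponding coupling-matrix entries here
end
```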

@pjaap (Member, Author) commented Jul 21, 2025

I updated the test script to Example312:

```julia
using TestEnv; TestEnv.activate()
include("examples/Example312_PeriodicBoundary3D.jl")
Example312_PeriodicBoundary3D.main(order = 2, h = 1e-5, threads = xxx) # I added the kwarg 'threads'
```

and now get the following results (I have 6 cores / 12 threads):

| threads | total time | matrix assembly | matrix merging |
|---------|------------|-----------------|----------------|
| 1       | 18.6       | 15.4            | 0.0            |
| 2       | 12.2       | 9.0             | 0.05           |
| 3       | 10.4       | 7.1             | 0.1            |
| 6       | 8.4        | 5.1             | 0.2            |
| 12      | 10.0       | 6.2             | 0.6            |

But I'll open a separate PR for the sparse vector stuff; we are discussing different issues here.

pjaap force-pushed the feature/parallel-coupling branch from 7fe8c9c to 882efc1 on August 15, 2025
pjaap changed the title from "Start implementing parallel periodic coupling" to "Parallel periodic coupling" on Aug 15, 2025
@pjaap (Member, Author) commented Aug 15, 2025

All updated and rebased.

Benchmark in the description.

  • The box search is now also parallel (inner loop).
  • The matrix merging at the end is very cheap.
  • For 1 thread or parallel = false, no threads are spawned at all (see the sketch below).
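
A sketch of that last point (names are placeholders, not the PR's actual functions): the parallel path is only taken when it can actually help, otherwise the chunks are processed in a plain serial loop.

```julia
using Base.Threads

# Sketch: spawn tasks only if parallel execution is requested and more than one
# thread is available; otherwise fall back to a plain serial map.
function process_chunks(f, chunks; parallel::Bool = true)
    if !parallel || nthreads() == 1
        return map(f, chunks)                          # no tasks spawned at all
    else
        tasks = [Threads.@spawn f(chunk) for chunk in chunks]
        return fetch.(tasks)
    end
end
```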
