Parallel periodic coupling #71
Conversation
```julia
chunks = Iterators.partition(bfaces_of_interest, chunk_length)

# loop over boundary face indices in a chunk: we need this index for dofs_on_boundary
function compute_chunk_result(chunk)
```
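For context, a minimal sketch of how such chunks are typically processed with Julia tasks; `bfaces_of_interest`, `chunk_length`, and `compute_chunk_result` are taken from the diff above, while the `@spawn`/`fetch` scheduling is an assumption, not necessarily the PR's exact code:

```julia
using Base.Threads: @spawn

# Hand each chunk to its own task; compute_chunk_result is assumed to be the
# per-chunk kernel from the diff above, returning a local result per chunk.
function run_chunks(bfaces_of_interest, chunk_length)
    chunks = Iterators.partition(bfaces_of_interest, chunk_length)
    tasks = [@spawn compute_chunk_result(chunk) for chunk in chunks]
    return fetch.(tasks)   # local results, to be merged sequentially afterwards
end
```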
Maybe the need for `local` comes from the fact that you define this function inside another one.
For the local matrices you could assemble into instances of `SparseMatrixLNK`. This avoids lots of intermediate transformations in ExtendableSparse.
There is also the newer https://github.com/WIAS-PDELib/ExtendableSparse.jl/blob/master/src/matrix/sparsematrixdilnkc.jl with the corresponding `+` overload, which avoids the need for a fully sized colind array in the sparse matrix format by replacing it with a dict.
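As an aside, the per-chunk idea can be illustrated without committing to a specific ExtendableSparse type: each chunk collects COO triplets locally, and a single merge builds the global matrix at the end. This is a hedged sketch with placeholder names using plain SparseArrays instead of `SparseMatrixLNK`, not the PR's implementation:

```julia
using SparseArrays

# Each chunk fills its own triplet buffers; no shared state during the parallel part.
function local_triplets(chunk, n)
    I, J, V = Int[], Int[], Float64[]
    for bface in chunk
        # ... compute couplings for this boundary face and push! into I, J, V ...
    end
    return I, J, V
end

# Sequential merge of the per-chunk results into one global sparse matrix.
function merge_triplets(results, n)
    I = reduce(vcat, (r[1] for r in results))
    J = reduce(vcat, (r[2] for r in results))
    V = reduce(vcat, (r[3] for r in results))
    return sparse(I, J, V, n, n)   # note: sparse() combines duplicates with + by default
end
```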
Regarding the local matrices: the merging in the end is currently barely measurable in terms of time.
But I think the main culprit is in `interpolate!`. The number of allocations in this call grows proportionally to the number of nodes in the grid, so it seems to allocate for every node in the grid while we only work on two parts of the boundary. So somehow each `interpolate!` call must run over all elements. I think this should be fixed first.
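A quick way to quantify that claim (hypothetical helper; `setup_case` and the `interpolate!` arguments are placeholders for whatever the example actually constructs):

```julia
# Per-call allocations of interpolate! should stay roughly constant under
# refinement if it only touches the coupled boundary; growth proportional to
# the node count confirms it runs over the whole grid.
function allocation_scaling(setup_case; nrefs = 2:4)
    for nref in nrefs
        target, source = setup_case(nref)
        interpolate!(target, source)                      # warm-up (compilation)
        bytes = @allocated interpolate!(target, source)
        println("nref = $nref: $bytes bytes allocated")
    end
end
```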
Yes, we are fighting on two different fronts here 😃
It essentially does both: adding to existing entries, or creating new ones if they are missing.
Here, we explicitly need no adding. Some entries are calculated multiple times, since the precomputed "searchareas" overlap, and then adding creates a wrong result. This could be avoided with a Bool vector "allow list" that blocks inserting the same index twice; then a simple reduce would be correct.
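A minimal sketch of that "allow list" idea, assuming the per-chunk results come back as (row, values) pairs; all names here are illustrative, not the PR's code:

```julia
# Rows produced by overlapping search areas may appear in several chunk results.
# A Bool vector marks rows that were already taken, so each row is assigned
# exactly once instead of being accumulated.
function merge_without_adding!(A, chunk_results)
    taken = falses(size(A, 1))
    for result in chunk_results, (row, values) in result
        taken[row] && continue        # duplicate from an overlapping search area
        A[row, :] .= values           # plain assignment, never +=
        taken[row] = true
    end
    return A
end
```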
Sitting with Christian: searchareas is empty and therefore the find loop goes over all cells...
oh what? This is definitely a regression.
Shouldn't we couple bregions 3 and 5 with this `g` function above?
Maybe, but this should also trigger an error.
I thought so, too, but didn't get an error with 3 and 5. And the gridplot also suggests these numbers.
Then you are right. I started with a 2D grid and reused the numbers since there was no error. Do we get meaningful search areas then? I suggest throwing an error if the areas are empty.
I think that's why the searchareas are empty, which makes the NodalInterpolator want to evaluate at every node.
Yes, that is a good idea.
As for timing: in order to verify correct complexity for the single-threaded case, I would propose having a scaling test (not necessarily in CI). I think we should have complexity O(number_of_surface_nodes_to_be_coupled). This means that execution time should increase by a factor of approximately 4 when going from h to h/2 (increasing nref by one). At the moment it seems to be much larger. Maybe things then become fast enough even without parallelization.
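A rough sketch of such a scaling test (not CI-ready; `setup_coupling` is a placeholder for whatever builds the periodic coupling at a given refinement level):

```julia
# Time the coupling setup for successive refinements; in 3D the number of
# surface nodes grows by ~4 per refinement, so the ratio should approach 4
# if the cost is O(number_of_surface_nodes_to_be_coupled).
function scaling_test(setup_coupling; nrefs = 2:5)
    t_prev = NaN
    for nref in nrefs
        t = @elapsed setup_coupling(nref)
        println("nref = $nref: t = $(round(t; digits = 3)) s, ratio = $(round(t / t_prev; digits = 2))")
        t_prev = t
    end
end
```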
Right, so the number of calls of `__eval_point` scales with a factor of 4 (when bregions 3 and 5 are coupled), but the overall runtime does not...
Also, the duration and allocations of each `interpolate!` call stay constant when I refine, so maybe something else in the loop causes the bad scaling?
Aha, the loop below `# set entries` scales with a factor of 8.
The bad scaling of the interpolation loop was caused by using a dense vector there. The interpolation loop now scales optimally. The bottleneck (with a much smaller factor) is now the offline part. This PR needs WIAS-PDELib/ExtendableFEMBase.jl#43.
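To illustrate the dense-vector point (a generic sketch, not the PR's data structures): a dense flag/value vector over all dofs forces O(n_dofs) work per pass, while a sparse vector or dict only visits the boundary dofs that were actually set.

```julia
using SparseArrays

n_dofs = 10^6
boundary_dofs = [17, 4_002, 999_999]

dense_flags = zeros(Bool, n_dofs)
dense_flags[boundary_dofs] .= true
count(dense_flags)                 # scans all n_dofs entries

sparse_flags = sparsevec(boundary_dofs, trues(length(boundary_dofs)), n_dofs)
nnz(sparse_flags)                  # only the stored boundary entries
```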
I updated the test script to Example312:

```julia
using TestEnv; TestEnv.activate()
include("examples/Example312_PeriodicBoundary3D.jl")
Example312_PeriodicBoundary3D.main(order = 2, h = 1e-5, threads = xxx)  # I added the kwarg 'threads'
```

and now have the following results (I have 6 cores / 12 threads):
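(Presumably the Julia session has to be started with enough threads, e.g. `julia -t 12` or `JULIA_NUM_THREADS=12`, for larger values of the `threads` kwarg to actually run in parallel, since Julia's thread count is fixed at startup.)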
But I'll open a separate PR for the sparse vector stuff; we are discussing different issues here.
All updated and rebased. Benchmark in the description.
This seems to work well now.
I measured the parts of the function in `Example312_PeriodicBoundary3D.main(h = 5e-6, order = 2, threads = XX)` and get:
My computer has 6 physical CPUs, so it saturates after 4 threads.
On our clusters, parallel efficiency is even better.