Parallelize evolve_slice
#352
Conversation
The test is failing because of a change in the last digit for one of the grids:
That's probably expected. Instruct the CLI to emit one less digit!
Done! I also added a few lines to the changelog. Hi @vbertone, would you perhaps have the chance to test this (it contains the fixes I mentioned in #351)? It'd be really great to check whether this also works well for you, especially since I presume you have many CPU cores xD
Hi @Radonirinaunimi, do I understand correctly that I just need to produce a double-convolution FK table using one of the SIHP grids, and check that it now indeed takes significantly less time? If so, I will do it at some point towards the end of the week. These days I'm not in Paris and can't leave my laptop running continuously for long. I'll keep you posted.
Hi @vbertone, yes, you simply need to compile from this branch and run the same command as before. If I remember correctly, last time it took ~22 hrs to produce the FK table with very good accuracy. Now, it should take (much) less than 5 hrs.
Ok, agreed. Yes, it took O(20) hours.
Thank you very much!
That's a speedup of roughly 3, which is already good, but I think we can do even better! Do you have 8 physical or logical cores?
Hi @cschwan, on my Mac (which is pretty new) I seem to have 12 physical and 12 logical CPUs. Does that make sense?
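(For reference, the logical-CPU count that rayon sizes its default thread pool from can be queried from Rust itself; a minimal check using only the standard library:)

```rust
use std::thread;

fn main() {
    // logical CPUs visible to the process; by default rayon sizes its
    // global thread pool from this (unless RAYON_NUM_THREADS overrides it)
    match thread::available_parallelism() {
        Ok(n) => println!("logical CPUs: {n}"),
        Err(e) => eprintln!("could not query parallelism: {e}"),
    }
}
```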
I tested this branch with 32 cores, but this time using 120 nodes (which should be the same as, or slightly larger than, what Valerio ran before?). It took 1h06min to complete, which is at least a factor of ~20 faster.
Hi @Radonirinaunimi, that's excellent news! I was about to launch the code on my laptop but, to make sure I'll be using the correct version, could you confirm which branch I should use?
Hi @vbertone, please use this branch.
I will do so, thanks. I'll keep you posted.
I don't understand how you use different nodes; rayon shouldn't work across different computers, or am I missing something?
Ah, sorry! I should have been more precise. I meant to say 120 node points for the $x$-grid.
OK, that makes much more sense 😄! How many …?
The grid (…)
Hi @Radonirinaunimi and @cschwan, when I try to run the code, my laptop switches off after a while, with the fan running at full speed. I wonder whether this problem is specific to my laptop or a general one. Would you have any suggestions as to what else I could check?
Hi @vbertone, your machine probably turns off because of overheating; the fan running at full speed points in that direction. (New) Macs probably allow the system to draw the full capacity of the CPU, so there is nothing safeguarding against overheating. Allocating the entirety of the CPU resources may therefore not be a good idea. You could try using fewer CPUs (say 60-70%) by setting `export RAYON_NUM_THREADS=<NB_CPUs>` before running the program.
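(If one prefers to cap the thread count from inside the program instead of via the environment variable, rayon's global pool can be configured at startup; a minimal sketch using rayon's documented `ThreadPoolBuilder`, with the thread count chosen arbitrarily for illustration:)

```rust
use rayon::ThreadPoolBuilder;

fn main() {
    // cap rayon's global pool at, e.g., 8 worker threads; this must run
    // before any other rayon call, otherwise the default pool (one thread
    // per logical CPU) has already been built
    ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .expect("failed to build rayon's global thread pool");

    // ... run the evolution as usual afterwards ...
}
```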
Hi @Radonirinaunimi, thanks for the suggestion which indeed solved the problem.
So, the improvement in speed is substantial indeed. However, as we knew, the numerical accuracy with ~50 nodes is not ideal. Today I will try a run with ~100 nodes and check how long it takes. I'll let you know.
Hi @Radonirinaunimi, here is the result with ~100 nodes on 6 cores:
Now the accuracy is significantly better, and it took around 3 hours to complete, even though I was using my laptop for other tasks.
Thanks a lot for the checks @vbertone! In the meantime, I was also checking 100 nodes with 32 cores, and it took 44 min.
So these benchmarks show that the changes here already provide significant improvements and can scale even further with more cores.
Hi @Radonirinaunimi, excellent! So it seems that the speed scales as expected and we definitely no longer have a performance bottleneck. I have a couple of additional questions. Could you please remind me which command I need to use in C++ to compress the FK tables? I would like to include it in the code so that it spits out the optimised tables straight away. Comparing our results above, I see that, while the convolutions of the grids give identical results, the FK tables do not.
If that's the case, how come we get different results? Finally, I think it might be worth having a meeting to discuss the interface. Emanuele and I had some thoughts on that, but it would be nice to discuss the question all together.
Hi @vbertone,
Here is the function used to optimize FK tables: `pineappl/examples/cpp/evolve-grid.cpp`, line 308 (commit af687ab).
You can call it right before you write the FK tables to disk.
I was actually using slightly larger grids, with ~150 nodes rather than ~100, hence the slightly improved accuracy.
Yes, it would indeed be great to chat a bit. Would you be available next week? We can perhaps arrange the meeting via email with Emanuele.
Hi @Radonirinaunimi, ok, all clear, thanks. For the meeting, next week I'm available on Thursday and Friday. Could you please take care of writing to Emanuele and anybody else you think should be involved?
@cschwan Are there further optimisations that we need to do here? For future reference, I have updated the description with the latest benchmarks.
I'm a bit worried that the changes will slow down the usual evolutions with one/two convolutions. Before we merge, I'd like to test a bit more. |
That sounds good! Please let me know if I can also help with such a benchmark.
This took me a bit longer, but with commit d60c14e from this branch, after a runtime of roughly 2 hours and 15 minutes on a single AMD Ryzen 7 9700X @ 5.5 GHz, the profile tells me:
That confirms that indeed most of the CPU time is spent performing matrix multiplications, the operations that perform the actual evolution (these are the functions from the … crate). Note that the comparison of the generated FK-table and the original grid fails because the convolution functions aren't the right ones, and maybe the EKOs are also not accurate enough, but hopefully it's close enough for a performance benchmark.
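(To make the matrix-multiplication claim concrete, here is a minimal, hypothetical sketch of what evolving one $Q^2$ slice with two convolutions looks like as ndarray contractions; the names and shapes are illustrative assumptions, not PineAPPL's actual internals:)

```rust
use ndarray::Array2;

/// Hypothetical sketch: evolve one Q^2 slice of a grid with two
/// convolutions. `op_a` and `op_b` are the evolution operators (EKOs)
/// for the two hadrons, with shapes (x_out, x_in); `slice` has shape
/// (x_in, x_in). The result has shape (x_out, x_out).
fn evolve_slice(op_a: &Array2<f64>, op_b: &Array2<f64>, slice: &Array2<f64>) -> Array2<f64> {
    // two dense matrix products per slice -- this is where essentially
    // all of the CPU time in the profile above is spent
    op_a.dot(&slice.dot(&op_b.t()))
}
```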
Concerning the parallelization strategy, I tried something different myself, which is running each … in parallel. However, I still believe that we can simplify this a bit, and to this end I wrote commit d60c14e, which rewrites part of the evolution code so that it is hopefully much more readable and understandable (than my own code that I wrote some time ago 😄).
I've changed the code such that we don't need synchronization types anymore, so no more ….
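(A generic illustration, not the actual PineAPPL code, of the pattern that makes synchronization types unnecessary: each rayon task builds its own partial result, and the pieces are merged afterwards, instead of all tasks writing to one value behind a `Mutex`:)

```rust
use rayon::prelude::*;

// Sketch: accumulate per-slice results without a Mutex. Each parallel
// task returns its own Vec, and `reduce` combines them pairwise, so no
// shared mutable state is ever touched.
fn accumulate(partials: &[Vec<f64>]) -> Vec<f64> {
    let len = partials.first().map_or(0, Vec::len);
    partials
        .par_iter()
        .cloned()
        .reduce(
            || vec![0.0; len],
            |mut acc, part| {
                for (a, p) in acc.iter_mut().zip(&part) {
                    *a += p;
                }
                acc
            },
        )
}
```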
Using the command line in #352 (comment) with PineAPPL from …, the run takes ….
Running commit eedd0c0 from this branch with …, it takes ….
That's a speed-up of …, which is a speedup of ….
Thanks a lot @cschwan for performing these benchmarks and for making these updates. These changes look good to me. Regarding the following:
I guess the question is whether or not smaller grids are at all affected by the parallelization? I can try to run some small benchmarks so that we can decide.
I was worried that the new code would slow down fast evolutions, but I'm not worried about it anymore. Would you like to review it? |
Thanks a lot @cschwan for everything! I just reviewed this and everything looks fine to me! I simply fixed the merge conflict by re-arranging the CHANGELOG to follow the previous order (which also seems to be the natural order in https://keepachangelog.com/en/1.0.0/). So, IMO, this could be merged.
Parallelize the evolution in $x$-grid nodes in `evolve_slice` using the Rayon crate. Using the `SIHP` grid and `PineAPFEL` with ~150 nodes, here are the benchmarks: …
The memory footprint is still the same, as (a) there are no additional copies and (b) the $Q^2$ slices are iterated lazily as before.
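(A minimal sketch of the parallelization pattern the description refers to: the loop over $x$-grid nodes inside one $Q^2$ slice is distributed over rayon's thread pool, while the slices themselves are still consumed one at a time by a lazy iterator. The names, shapes, and the use of ndarray's `rayon` feature are assumptions for illustration, not PineAPPL's actual code:)

```rust
use ndarray::{Array2, Axis};
use rayon::prelude::*;

/// Sketch: evolve one Q^2 slice, parallelizing over x-grid nodes.
/// `operator` has shape (x_out, x_in), `slice` has shape (x_in, n);
/// each output row (one x-grid node) is computed independently, so
/// rayon can distribute the rows without any locking.
fn evolve_slice_par(operator: &Array2<f64>, slice: &Array2<f64>) -> Array2<f64> {
    let rows: Vec<_> = operator
        .axis_iter(Axis(0))   // one row per output x-grid node
        .into_par_iter()      // needs ndarray's `rayon` feature
        .map(|op_row| op_row.dot(slice))
        .collect();

    // stack the independently computed rows back into one dense matrix;
    // no extra copies of the inputs are made, so the caller can keep
    // iterating the Q^2 slices lazily, one at a time
    let mut out = Array2::zeros((rows.len(), slice.ncols()));
    for (i, row) in rows.iter().enumerate() {
        out.row_mut(i).assign(row);
    }
    out
}
```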
cc @vbertone