
@Radonirinaunimi (Member) commented Jun 21, 2025

Parallelize the evolution in `evolve_slice` using the Rayon crate. Using the SIHP grid and PineAPFEL with ~150 $x$-grid nodes, the benchmarks are:

|            | master  | this branch |
| ---------- | ------- | ----------- |
| Cores used | 1 core  | 32 cores    |
| Wall time  | ~20 hrs | 50 min      |

The memory footprint stays the same because (a) there are no additional copies and (b) the $Q^2$ slices are iterated over lazily as before.
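Below is an illustrative sketch of the approach (with hypothetical names and stand-in data, not the actual `evolve_slice` code): the outer loop still consumes the $Q^2$ slices lazily and sequentially, so only one evolution operator is in memory at a time, while Rayon spreads the work inside each slice over the thread pool.

```rust
// Cargo.toml: rayon = "1"
use rayon::prelude::*;

/// Stand-in for the per-row evolution: contract one grid row of length `old_x`
/// with a flattened (new_x × old_x) operator.
fn evolve_row(operator: &[f64], old_x: usize, row: &[f64]) -> Vec<f64> {
    operator
        .chunks(old_x)
        .map(|op_row| op_row.iter().zip(row).map(|(o, g)| o * g).sum())
        .collect()
}

fn main() {
    let (old_x, new_x) = (150, 150);

    // Lazily produced Q^2 slices: (operator, subgrid rows) pairs.
    let slices = (1..=6).map(|q2| {
        let operator = vec![1.0 / q2 as f64; new_x * old_x];
        let rows = vec![vec![0.5_f64; old_x]; 1_000];
        (operator, rows)
    });

    for (operator, rows) in slices {
        // The expensive part *inside* each slice runs on the Rayon thread pool.
        let evolved: Vec<Vec<f64>> = rows
            .par_iter()
            .map(|row| evolve_row(&operator, old_x, row))
            .collect();
        assert_eq!(evolved.len(), rows.len());
    }
}
```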

cc @vbertone

@Radonirinaunimi (Member Author)

The test is failing because of a change in the last digit for one of the grids:

│   -5 2.53020442272921e2 2.52959682618899e2 -2.4013733e-4
│   +5 2.53020442272921e2 2.52959682618898e2 -2.4013733e-4

@cschwan (Contributor) commented Jun 23, 2025

> The test is failing because of a change in the last digit for one of the grids:
>
> │   -5 2.53020442272921e2 2.52959682618899e2 -2.4013733e-4
> │   +5 2.53020442272921e2 2.52959682618898e2 -2.4013733e-4

That's probably expected. Instruct the CLI to emit one less digit!

@Radonirinaunimi (Member Author)

> That's probably expected. Instruct the CLI to emit one less digit!

Done! I also added a few lines to the CHANGELOG.

Hi @vbertone, would you perhaps have the chance to test this (it contains the fixes I mentioned in #351)? It'd be really great to check whether this also works well for you, especially since I presume you have many CPU cores xD

@vbertone

Hi @Radonirinaunimi, do I understand correctly that I just need to produce a double convolution FK table using one of the SIHP grids, and check that indeed it now takes significantly less time?

If so, I will do it at some point towards the end of the week. These days I'm not in Paris and can't leave my laptop running continuously for long. I'll keep you posted.

@Radonirinaunimi (Member Author)

Hi @vbertone, yes, you simply need to compile from this branch and run `evolve-apfel-double.cpp`. Thanks!!

If I remember correctly, last time it took ~22hrs to produce the FK table with very good accuracy. Now, it should take (much) less than 5hrs.

@vbertone

Ok, agreed. Yes, it took O(20) hours.
I'll let you know how long it takes now.

@Radonirinaunimi (Member Author)

> Ok, agreed. Yes, it took O(20) hours. I'll let you know how long it takes now.

Thank you very much!

@cschwan (Contributor) commented Jun 24, 2025

That's a speedup of roughly 3, which is already good, but I think we can do even better! Do you have 8 physical or logical cores?

@vbertone

Hi @cschwan, on my Mac (which is pretty new) I seem to have 12 physical and 12 logical CPUs. Does that make sense?
In case any of you has a Mac, I used the following commands:

sysctl -n hw.physicalcpu 
sysctl -n hw.logicalcpu 
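For reference, a small sketch (assuming the `rayon` crate is on hand) of how these numbers look from the Rust side; by default Rayon sizes its global pool from the logical CPU count:

```rust
// Query the logical CPU count the process sees and the number of worker
// threads Rayon actually uses (which RAYON_NUM_THREADS can override).
use std::thread;

fn main() {
    let logical = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("logical CPUs visible to the process: {logical}");
    println!("Rayon worker threads: {}", rayon::current_num_threads());
}
```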

@Radonirinaunimi (Member Author)

I tested this branch with 32 cores but this time using 120 nodes (which should be the same as or slightly larger than what Valerio ran before?). It took 1h06min to complete, which is at least ~20 times faster.

@vbertone

Hi @Radonirinaunimi, that's excellent news!

I was about to launch the code on my laptop but, to make sure I'll be using the correct version of pineappl, could you please confirm that I have to use the master branch and not the parallelize-evolve branch? Thank you.

@Radonirinaunimi (Member Author)

Hi @vbertone, please use this branch (parallelize-evolve) for the test. This one (not master) contains the parallelization that we want to check. Thanks again!!

@vbertone

I will do so, thanks. I'll keep you posted.

@cschwan (Contributor) commented Jun 26, 2025

> I tested this branch with 32 cores but this time using 120 nodes (which should be the same as or slightly larger than what Valerio ran before?). It took 1h06min to complete, which is at least ~20 times faster.

I don't understand how you use different nodes: Rayon shouldn't work across different computers, or am I missing something?

@Radonirinaunimi (Member Author)

> > I tested this branch with 32 cores but this time using 120 nodes (which should be the same as or slightly larger than what Valerio ran before?). It took 1h06min to complete, which is at least ~20 times faster.
>
> I don't understand how you use different nodes: Rayon shouldn't work across different computers, or am I missing something?

Ah, sorry! I should have been more precise. I meant to say 120 node points for the $x$-grid.

@cschwan (Contributor) commented Jun 26, 2025

OK, that makes much more sense 😄! How many $Q^2$-slices do you have for this evolution?

@Radonirinaunimi (Member Author)

> OK, that makes much more sense 😄! How many $Q^2$-slices do you have for this evolution?

The grid (`SIHP-PP-POLARIZED-STAR-NLO.pineappl.lz4`, which is also part of `test-data`) contains 6 $Q^2$ slices.

@vbertone

Hi @Radonirinaunimi and @cschwan, when I try to run the evolve-grid-double-apfel code using the parallelize-evolve branch, my laptop shuts off. More specifically: it starts running, then the fan goes full steam, and after a couple of minutes the machine turns off. This happens systematically with the parallelize-evolve branch, while it runs fine with the master branch. I guess that, for some reason, the CPU overheats and thus, as a protection, the machine gets shut down.

I wonder whether this problem is specific to my laptop or a general one. Would you have any suggestions as to what else I could check?

@Radonirinaunimi (Member Author)

Hi @vbertone, your machine probably turns off because of overheating, as evidenced by the fan running at full speed. (New) Macs probably allow the system to draw the full capacity of the CPU, with nothing safeguarding against overheating.

Allocating all of the CPU resources may therefore not be a good idea, so you could try using fewer CPUs (say 60-70% of them). You can do so by setting:

export RAYON_NUM_THREADS=<NB_CPUs>

before running the program.
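As a sketch of an alternative to the environment variable (the 70% fraction below is purely illustrative), the cap can also be set programmatically before any Rayon work starts:

```rust
use rayon::ThreadPoolBuilder;

fn main() {
    // Use roughly 70% of the logical CPUs, leaving headroom against overheating.
    let logical = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let threads = (logical * 7 / 10).max(1);

    // Must run before any Rayon work is submitted; the global pool can only be
    // initialised once.
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("failed to initialise the global Rayon thread pool");

    println!("using {threads} of {logical} logical CPUs");
    // ... run the evolution as usual; Rayon will not spawn more workers.
}
```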

@vbertone

Hi @Radonirinaunimi, thanks for the suggestion, which indeed solved the problem.
Using 6 CPUs (`RAYON_NUM_THREADS=6`) on my laptop and a grid of around 50 nodes, the code takes around 23 minutes to run:

   Bin           Grid        FkTable        reldiff
  ----  -------------  -------------  -------------
     0   1.880796e+04   1.909462e+04   1.524134e-02
     1   6.720608e+03   6.840034e+03   1.777015e-02
     2   2.562033e+03   2.615826e+03   2.099596e-02
     3   1.045092e+03   1.070742e+03   2.454281e-02
     4   4.624719e+02   4.757466e+02   2.870395e-02
     5   1.611050e+02   1.667573e+02   3.508405e-02
Time elapsed: 1391.428772 seconds

So, the improvement in speed is substantial indeed. However, as we knew, the numerical accuracy with ~50 nodes is not ideal. Today I will try a run with ~100 nodes and check how long it takes. I'll let you know.

@vbertone

Hi @Radonirinaunimi, here is the result with ~100 nodes and on 6 cores:

   Bin           Grid        FkTable        reldiff
  ----  -------------  -------------  -------------
     0   1.880796e+04   1.882355e+04   8.288497e-04
     1   6.720608e+03   6.727104e+03   9.665848e-04
     2   2.562033e+03   2.564996e+03   1.156375e-03
     3   1.045092e+03   1.046528e+03   1.373506e-03
     4   4.624719e+02   4.632332e+02   1.646304e-03
     5   1.611050e+02   1.614378e+02   2.065431e-03
Time elapsed: 11157.755553 seconds

Now the accuracy is significantly better and it took around 3 hours to complete, even though I was using my laptop for other tasks.

@Radonirinaunimi (Member Author)

Thanks a lot for the checks @vbertone! In the meantime, I was also checking 100 nodes with 32 cores and it took 44 min.

   Bin           Grid        FkTable        reldiff
  ----  -------------  -------------  -------------
     0   1.880796e+04   1.882251e+04   7.737321e-04
     1   6.720608e+03   6.726727e+03   9.105060e-04
     2   2.562033e+03   2.564830e+03   1.091661e-03
     3   1.045092e+03   1.046444e+03   1.293859e-03
     4   4.624719e+02   4.631852e+02   1.542367e-03
     5   1.611050e+02   1.614139e+02   1.917261e-03
Time elapsed: 2662.488058 seconds

So these benchmarks show that the changes here already provide significant improvements and can scale even further with more cores.

@vbertone

Hi @Radonirinaunimi, excellent! So it seems that the speed scales as expected and we definitely no longer have a performance bottleneck.

I have a couple of additional questions.

Could you please remind me of the command that I need to use in C++ to compress the FK tables? I would like to include it in the code so that it spits out the optimised tables straight away.

Comparing our results above, I see that, while the convolutions of the grids give identical results, the FK tables do not.
I assume we are using the same internal grid of APFEL, that is:

    const apfel::Grid g{{apfel::SubGrid{80, 1e-5, 3}, apfel::SubGrid{40, 1e-1, 3}}};

If that's the case, how come we get different results?

Finally, I think it might be worth having a meeting to discuss the interface. Emanuele and I had some thoughts on that, but it would be nice to discuss the question all together.

@Radonirinaunimi (Member Author)

Hi @vbertone,

> Could you please remind me of the command that I need to use in C++ to compress the FK tables? I would like to include it in the code so that it spits out the optimised tables straight away.

Here is the function used to optimize FK tables:

    pineappl_fktable_optimize(fktable, PINEAPPL_FK_ASSUMPTIONS_NF3_SYM);

You can call it right before writing the FK tables to disk.

> Comparing our results above, I see that, while the convolutions of the grids give identical results, the FK tables do not. I assume we are using the same internal grid of APFEL, that is:
>
>     const apfel::Grid g{{apfel::SubGrid{80, 1e-5, 3}, apfel::SubGrid{40, 1e-1, 3}}};
>
> If that's the case, how come we get different results?

I was actually using a slightly larger grid:

    const apfel::Grid g{{apfel::SubGrid{80, 1e-5, 3}, apfel::SubGrid{60, 1e-1, 3}}};

So it is actually ~150 nodes rather than ~100, hence the slightly better accuracy.

> Finally, I think it might be worth having a meeting to discuss the interface. Emanuele and I had some thoughts on that, but it would be nice to discuss the question all together.

Yes, it would indeed be great to chat a bit. Would you be available next week? We can perhaps arrange the meeting via email with Emanuele.

@vbertone

Hi @Radonirinaunimi, ok, all clear, thanks.

For the meeting, next week I'm available on Thursday and Friday. Could you please take care of writing to Emanuele and anybody else you think should be involved?

@Radonirinaunimi (Member Author)

@cschwan Are there further optimisations that we need to do here? For future reference, I have updated the description with the latest benchmarks.

@cschwan (Contributor) commented Jul 7, 2025

I'm a bit worried that the changes will slow down the usual evolutions with one/two convolutions. Before we merge, I'd like to test a bit more.

@Radonirinaunimi (Member Author)

> I'm a bit worried that the changes will slow down the usual evolutions with one/two convolutions. Before we merge, I'd like to test a bit more.

That sounds good! Please let me know if I can also help with such a benchmark.

@cschwan (Contributor) commented Aug 16, 2025

> I'm a bit worried that the changes will slow down the usual evolutions with one/two convolutions. Before we merge, I'd like to test a bit more.

This took me a bit longer, but with commit d60c14e from master and

perf record -g -F99 target/release/pineappl evolve \
    test-data/SIHP-PP-POLARIZED-STAR-NLO.pineappl.lz4 \
    test-data/SIHP-PP-POLARIZEDSTAR-NLO_polarized.tar,test-data/Downloads/SIHP-PP-POLARIZEDSTAR-NLO_time_like.tar \
    /tmp/deleteme.pineappl \
    NNPDF40_nlo_as_01180+p,NNFF10_PIm_lo+f \
    --accuracy 10000000.0

after a runtime of roughly 2 hours and 15 minutes on a single AMD Ryzen 7 9700X @ 5.5 GHz,

perf report --call-graph --no-children

tells me:

+   75.77%  pineappl  pineappl              [.] matrixmultiply::dgemm_kernel::kernel_target_fma
+    6.63%  pineappl  pineappl              [.] matrixmultiply::packing::pack_avx2
+    5.43%  pineappl  pineappl              [.] matrixmultiply::packing::pack_avx2
+    2.28%  pineappl  pineappl              [.] matrixmultiply::gemm::gemm_loop
+    2.03%  pineappl  pineappl              [.] matrixmultiply::gemm::masked_kernel
+    1.82%  pineappl  libc.so.6             [.] 0x00000000000a3788
+    1.71%  pineappl  libc.so.6             [.] 0x00000000000a3b7d
+    0.53%  pineappl  libc.so.6             [.] 0x00000000000a3b80
     0.52%  pineappl  pineappl              [.] pineappl::evolution::general_tensor_mul
[..]

That confirms that indeed most of the CPU time is spent performing matrix multiplication, the operations that perform the actual evolution: these are the functions from the matrixmultiply crate, and they account for 92.15 % of the CPU time (the first five functions). That's great!

Note that the comparison of the generated FK-table and the original grid fails because the convolution functions aren't the right ones, and maybe the EKOs are also not accurate enough, but hopefully it's close enough for a performance benchmark.

@cschwan (Contributor) commented Aug 16, 2025

Concerning the parallelization strategy, I tried something different myself: running each $Q^2$-slice in parallel. That is much easier, but given that in this example there are fewer slices than my desktop PC has cores, it will probably not scale well enough. Even worse, we'd have the entire operator(s) in memory, which we surely don't want. So this is to say that I'm now quite confident that @Radonirinaunimi got the parallelization strategy just right.

However, I still believe that we can simplify this a bit and to this end I wrote commit d60c14e, which rewrites part of the evolution code so that it is hopefully much more readable and understandable (than my own code that I wrote some time ago 😄).

@cschwan (Contributor) commented Aug 16, 2025

I've changed the code such that we don't need synchronization types anymore, so no more `Arc` and `Mutex`. The parallelization is a bit different, as the following loop still exists:

https://github.com/NNPDF/pineappl/pull/352/files#diff-9048db4f26502fa421577e9ff22c83d42d7497c982fcd892a6d90d2907b155aaL448

- We can possibly merge this into the parallelization again.
- Another point is that there is now a lot of allocation going on, since `.par_iter_mut().zip()` requires the zip argument to be indexable, which means we collect `tmp` into a `Vec` (see the sketch after this list).
- Finally, we may want to look at the performance of smaller grids and enable parallelization selectively.
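A minimal sketch of the allocation pattern from the second point above (hypothetical names, not the actual evolution code): Rayon's `zip` needs both sides to be indexed parallel iterators, so the lazily computed temporaries are materialised into a `Vec` first.

```rust
use rayon::prelude::*;

fn main() {
    let inputs = vec![1.0_f64, 2.0, 3.0, 4.0];
    let mut outputs = vec![0.0_f64; inputs.len()];

    // The temporaries have to be collected: a plain lazy iterator cannot be
    // zipped with `par_iter_mut()`, hence the extra allocation.
    let tmp: Vec<f64> = inputs.iter().map(|x| x * x).collect();

    outputs
        .par_iter_mut()
        .zip(tmp) // a Vec converts into an indexed parallel iterator
        .for_each(|(out, t)| *out = t + 1.0);

    assert_eq!(outputs, vec![2.0, 5.0, 10.0, 17.0]);
}
```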

@cschwan (Contributor) commented Aug 17, 2025

Using the command line in #352 (comment) with PineAPPL from master and one CPU running at 5.5 GHz takes

real    136m50.580s
user    136m39.898s
sys     0m2.840s

Running commit eedd0c0 from this branch with RAYON_NUM_THREADS=8 and 8 cores running at 4.8 GHz:

real    18m28.670s
user    145m26.534s
sys     0m2.832s

That's a speed-up of 7.4, which is pretty good considering the CPUs run at different frequencies. With RAYON_NUM_THREADS=16 (the default on my system) and CPUs running at 4.4 GHz:

real    16m41.623s
user    261m1.471s
sys     0m3.826s

which is a speedup of 8.2.
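For reference, these speedups follow directly from the wall-clock (real) times quoted above:

$$
\frac{136\,\mathrm{m}\,50.6\,\mathrm{s}}{18\,\mathrm{m}\,28.7\,\mathrm{s}} = \frac{8210.6\,\mathrm{s}}{1108.7\,\mathrm{s}} \approx 7.4\,, \qquad
\frac{136\,\mathrm{m}\,50.6\,\mathrm{s}}{16\,\mathrm{m}\,41.6\,\mathrm{s}} = \frac{8210.6\,\mathrm{s}}{1001.6\,\mathrm{s}} \approx 8.2\,.
$$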

@Radonirinaunimi (Member Author)

Thanks a lot @cschwan for performing these benchmarks and for making these updates. These changes look good to me.

Regarding the following:

> Finally, we may want to look at the performance of smaller grids and enable parallelization selectively.

I guess the question is whether or not smaller grids are at all affected by the parallelization. I can try to run some small benchmarks so that we can decide.
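As a starting point for such a benchmark, here is a minimal sketch of what enabling parallelization selectively could look like (the threshold and names are hypothetical and would need tuning against real grids):

```rust
use rayon::prelude::*;

/// Assumed cut-off below which the sequential path is used; to be tuned.
const PARALLEL_THRESHOLD: usize = 64;

fn process_row(row: &mut Vec<f64>) {
    // Stand-in for the real per-row evolution work.
    for x in row.iter_mut() {
        *x *= 2.0;
    }
}

fn evolve_rows(rows: &mut [Vec<f64>]) {
    if rows.len() < PARALLEL_THRESHOLD {
        // Small grids: a plain loop avoids Rayon's task-splitting overhead.
        rows.iter_mut().for_each(process_row);
    } else {
        rows.par_iter_mut().for_each(process_row);
    }
}

fn main() {
    let mut small = vec![vec![1.0_f64; 8]; 10];
    let mut large = vec![vec![1.0_f64; 8]; 1_000];
    evolve_rows(&mut small);
    evolve_rows(&mut large);
    assert_eq!(small[0][0], large[0][0]);
}
```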

@cschwan (Contributor) commented Aug 21, 2025

> Regarding the following:
>
> > Finally, we may want to look at the performance of smaller grids and enable parallelization selectively.
>
> I guess the question is whether or not smaller grids are at all affected by the parallelization. I can try to run some small benchmarks so that we can decide.

I was worried that the new code would slow down fast evolutions, but I'm not worried about it anymore.

Would you like to review it?

@Radonirinaunimi (Member Author)

Thanks a lot @cschwan for everything! I just reviewed this and everything looks fine to me! I simply fixed the merge conflict by re-arranging the CHANGELOG to follow the previous order (which also seems to be the natural order in https://keepachangelog.com/en/1.0.0/).

So, IMO, this could be merged.

@cschwan merged commit c636eda into master on Aug 22, 2025
10 checks passed
@cschwan deleted the parallelize-evolve branch on August 22, 2025 at 04:21