
@Radonirinaunimi (Member) commented Jun 21, 2025

Parallelize the evolution in `evolve_slice` using the Rayon crate. Using the SIHP grid and PineAPFEL with ~150 $x$-grid nodes, the benchmarks are:

|            | master  | this branch |
| ---------- | ------- | ----------- |
| Cores used | 1 core  | 32 cores    |
| Wall time  | ~20 hrs | 50 min      |

The memory footprint stays the same because (a) there are no additional copies and (b) the $Q^2$ slices are iterated over lazily as before.
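Below is an illustrative sketch of the approach (with hypothetical names and stand-in data, not the actual `evolve_slice` code): the outer loop still consumes the $Q^2$ slices lazily and sequentially, so only one evolution operator is in memory at a time, while Rayon spreads the work inside each slice over the thread pool.

```rust
// Cargo.toml: rayon = "1"
use rayon::prelude::*;

/// Stand-in for the per-row evolution: contract one grid row of length `old_x`
/// with a flattened (new_x × old_x) operator.
fn evolve_row(operator: &[f64], old_x: usize, row: &[f64]) -> Vec<f64> {
    operator
        .chunks(old_x)
        .map(|op_row| op_row.iter().zip(row).map(|(o, g)| o * g).sum())
        .collect()
}

fn main() {
    let (old_x, new_x) = (150, 150);

    // Lazily produced Q^2 slices: (operator, subgrid rows) pairs.
    let slices = (1..=6).map(|q2| {
        let operator = vec![1.0 / q2 as f64; new_x * old_x];
        let rows = vec![vec![0.5_f64; old_x]; 1_000];
        (operator, rows)
    });

    for (operator, rows) in slices {
        // The expensive part *inside* each slice runs on the Rayon thread pool.
        let evolved: Vec<Vec<f64>> = rows
            .par_iter()
            .map(|row| evolve_row(&operator, old_x, row))
            .collect();
        assert_eq!(evolved.len(), rows.len());
    }
}
```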

cc @vbertone

@Radonirinaunimi (Member Author)

The test is failing because of a change in the last digit for one of the grids:

│   -5 2.53020442272921e2 2.52959682618899e2 -2.4013733e-4
│   +5 2.53020442272921e2 2.52959682618898e2 -2.4013733e-4

@cschwan (Contributor) commented Jun 23, 2025

> The test is failing because of a change in the last digit for one of the grids:
>
> │   -5 2.53020442272921e2 2.52959682618899e2 -2.4013733e-4
> │   +5 2.53020442272921e2 2.52959682618898e2 -2.4013733e-4

That's probably expected. Instruct the CLI to emit one less digit!

@Radonirinaunimi (Member Author)

> That's probably expected. Instruct the CLI to emit one less digit!

Done! I also added a few lines to the CHANGELOG.

Hi @vbertone, would you perhaps have the chance to test this (it contains the fixes I mentioned in #351)? It'd be really great to check whether this also works well for you, especially since I presume you have many CPU cores xD

@vbertone

Hi @Radonirinaunimi, do I understand correctly that I just need to produce a double convolution FK table using one of the SIHP grids, and check that indeed it now takes significantly less time?

If so, I will do it at some point towards the end of the week. These days I'm not in Paris and can't leave my laptop running continuously for long. I'll keep you posted.

@Radonirinaunimi (Member Author)

Hi @vbertone, yes, you simply need to compile from this branch and run `evolve-apfel-double.cpp`. Thanks!!

If I remember correctly, last time it took ~22hrs to produce the FK table with very good accuracy. Now, it should take (much) less than 5hrs.

@vbertone

Ok, agreed. Yes, it took O(20) hours.
I'll let you know how long it takes now.

@Radonirinaunimi (Member Author)

> Ok, agreed. Yes, it took O(20) hours. I'll let you know how long it takes now.

Thank you very much!

@cschwan (Contributor) commented Jun 24, 2025

That's a speedup of roughly 3, which is already good, but I think we can do even better! Do you have 8 physical or logical cores?

@vbertone

Hi @cschwan, on my Mac (which is pretty new) I seem to have 12 physical and 12 logical CPUs. Does that make sense?
In case any of you has a Mac, I used the following commands:

sysctl -n hw.physicalcpu 
sysctl -n hw.logicalcpu 
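For reference, a small sketch (assuming the `rayon` crate is on hand) of how these numbers look from the Rust side; by default Rayon sizes its global pool from the logical CPU count:

```rust
// Query the logical CPU count the process sees and the number of worker
// threads Rayon actually uses (which RAYON_NUM_THREADS can override).
use std::thread;

fn main() {
    let logical = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("logical CPUs visible to the process: {logical}");
    println!("Rayon worker threads: {}", rayon::current_num_threads());
}
```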

@Radonirinaunimi (Member Author)

I tested this branch with 32 cores but this time using 120 nodes (which should be the same as or slightly larger than what Valerio ran before?). It took 1h06min to complete, which is at least ~20 times faster.

@vbertone

Hi @Radonirinaunimi, that's excellent news!

I was about to launch the code on my laptop but, to make sure I'll be using the correct version of pineappl, could you please confirm that I have to use the master branch and not the parallelize-evolve branch? Thank you.

@Radonirinaunimi (Member Author)

Hi @vbertone, please use this branch (parallelize-evolve) for the test. This one (not master) contains the parallelization that we want to check. Thanks again!!

@vbertone

I will do so, thanks. I'll keep you posted.

@cschwan (Contributor) commented Jun 26, 2025

> I tested this branch with 32 cores but this time using 120 nodes (which should be the same as or slightly larger than what Valerio ran before?). It took 1h06min to complete, which is at least ~20 times faster.

I don't understand how you use different nodes: Rayon shouldn't work across different computers, or am I missing something?

@Radonirinaunimi (Member Author)

> > I tested this branch with 32 cores but this time using 120 nodes (which should be the same as or slightly larger than what Valerio ran before?). It took 1h06min to complete, which is at least ~20 times faster.
>
> I don't understand how you use different nodes: Rayon shouldn't work across different computers, or am I missing something?

Ah, sorry! I should have been more precise. I meant to say 120 node points for the $x$-grid.

@cschwan (Contributor) commented Jun 26, 2025

OK, that makes much more sense 😄! How many $Q^2$-slices do you have for this evolution?

@Radonirinaunimi (Member Author)

> OK, that makes much more sense 😄! How many $Q^2$-slices do you have for this evolution?

The grid (`SIHP-PP-POLARIZED-STAR-NLO.pineappl.lz4`, which is also part of `test-data`) contains 6 $Q^2$ slices.

@vbertone

Hi @Radonirinaunimi and @cschwan, when I try to run the evolve-grid-double-apfel code using the parallelize-evolve branch, my laptop shuts off. More specifically: it starts running, then the fan goes full steam, and after a couple of minutes the machine turns off. This happens systematically with the parallelize-evolve branch, while it runs fine with the master branch. I guess that, for some reason, the CPU overheats and thus, as a protection, the machine gets shut down.

I wonder whether this problem is specific to my laptop or a general one. Would you have any suggestions as to what else I could check?

@Radonirinaunimi (Member Author)

Hi @vbertone, your machine probably turns off because of overheating, as evidenced by the fan running at full speed. (New) Macs probably allow the system to draw the full capacity of the CPU, with nothing safeguarding against overheating.

Allocating all of the CPU resources may therefore not be a good idea, so you could try using fewer CPUs (say 60-70% of them). You can do so by setting:

export RAYON_NUM_THREADS=<NB_CPUs>

before running the program.
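As a sketch of an alternative to the environment variable (the 70% fraction below is purely illustrative), the cap can also be set programmatically before any Rayon work starts:

```rust
use rayon::ThreadPoolBuilder;

fn main() {
    // Use roughly 70% of the logical CPUs, leaving headroom against overheating.
    let logical = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let threads = (logical * 7 / 10).max(1);

    // Must run before any Rayon work is submitted; the global pool can only be
    // initialised once.
    ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("failed to initialise the global Rayon thread pool");

    println!("using {threads} of {logical} logical CPUs");
    // ... run the evolution as usual; Rayon will not spawn more workers.
}
```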

@vbertone

Hi @Radonirinaunimi, thanks for the suggestion, which indeed solved the problem.
Using 6 CPUs (`RAYON_NUM_THREADS=6`) on my laptop and a grid of around 50 nodes, the code takes around 23 minutes to run:

   Bin           Grid        FkTable        reldiff
  ----  -------------  -------------  -------------
     0   1.880796e+04   1.909462e+04   1.524134e-02
     1   6.720608e+03   6.840034e+03   1.777015e-02
     2   2.562033e+03   2.615826e+03   2.099596e-02
     3   1.045092e+03   1.070742e+03   2.454281e-02
     4   4.624719e+02   4.757466e+02   2.870395e-02
     5   1.611050e+02   1.667573e+02   3.508405e-02
Time elapsed: 1391.428772 seconds

So, the improvement in speed is substantial indeed. However, as we knew, the numerical accuracy with ~50 nodes is not ideal. Today I will try a run with ~100 nodes and check how long it takes. I'll let you know.

@vbertone

Hi @Radonirinaunimi, here is the result with ~100 nodes and on 6 cores:

   Bin           Grid        FkTable        reldiff
  ----  -------------  -------------  -------------
     0   1.880796e+04   1.882355e+04   8.288497e-04
     1   6.720608e+03   6.727104e+03   9.665848e-04
     2   2.562033e+03   2.564996e+03   1.156375e-03
     3   1.045092e+03   1.046528e+03   1.373506e-03
     4   4.624719e+02   4.632332e+02   1.646304e-03
     5   1.611050e+02   1.614378e+02   2.065431e-03
Time elapsed: 11157.755553 seconds

Now the accuracy is significantly better and it took around 3 hours to complete, even though I was using my laptop for other tasks.

@Radonirinaunimi (Member Author)

Thanks a lot for the checks @vbertone! In the meantime, I was also checking 100 nodes with 32 cores and it took 44 min.

   Bin           Grid        FkTable        reldiff
  ----  -------------  -------------  -------------
     0   1.880796e+04   1.882251e+04   7.737321e-04
     1   6.720608e+03   6.726727e+03   9.105060e-04
     2   2.562033e+03   2.564830e+03   1.091661e-03
     3   1.045092e+03   1.046444e+03   1.293859e-03
     4   4.624719e+02   4.631852e+02   1.542367e-03
     5   1.611050e+02   1.614139e+02   1.917261e-03
Time elapsed: 2662.488058 seconds

So these benchmarks show that the changes here already provide significant improvements and can scale even further with more cores.

@vbertone

Hi @Radonirinaunimi, excellent! So it seems that the speed scales as expected and we definitely no longer have a performance bottleneck.

I have a couple of additional questions.

Could you please remind me of the command that I need to use in C++ to compress the FK tables? I would like to include it in the code so that it spits out the optimised tables straight away.

Comparing our results above, I see that, while the convolutions of the grids give identical results, the FK tables do not.
I assume we are using the same internal grid of APFEL, that is:

    const apfel::Grid g{{apfel::SubGrid{80, 1e-5, 3}, apfel::SubGrid{40, 1e-1, 3}}};

If that's the case, how come we get different results?

Finally, I think it might be worth having a meeting to discuss the interface. Emanuele and I had some thoughts on that, but it would be nice to discuss the question all together.

@Radonirinaunimi (Member Author)

Hi @vbertone,

> Could you please remind me of the command that I need to use in C++ to compress the FK tables? I would like to include it in the code so that it spits out the optimised tables straight away.

Here is the function used to optimize FK tables:

    pineappl_fktable_optimize(fktable, PINEAPPL_FK_ASSUMPTIONS_NF3_SYM);

You can call it right before writing the FK tables to disk.

> Comparing our results above, I see that, while the convolutions of the grids give identical results, the FK tables do not. I assume we are using the same internal grid of APFEL, that is:
>
>     const apfel::Grid g{{apfel::SubGrid{80, 1e-5, 3}, apfel::SubGrid{40, 1e-1, 3}}};
>
> If that's the case, how come we get different results?

I was actually using a slightly larger grid:

    const apfel::Grid g{{apfel::SubGrid{80, 1e-5, 3}, apfel::SubGrid{60, 1e-1, 3}}};

So it is actually ~150 nodes rather than ~100, hence the slightly better accuracy.

> Finally, I think it might be worth having a meeting to discuss the interface. Emanuele and I had some thoughts on that, but it would be nice to discuss the question all together.

Yes, it would indeed be great to chat a bit. Would you be available next week? We can perhaps arrange the meeting via email with Emanuele.

@vbertone

Hi @Radonirinaunimi, ok, all clear, thanks.

For the meeting, next week I'm available on Thursday and Friday. Could you please take care of writing to Emanuele and anybody else you think should be involved?

@Radonirinaunimi (Member Author)

@cschwan Are there further optimisations that we need to do here? For future reference, I have updated the description with the latest benchmarks.

@cschwan (Contributor) commented Jul 7, 2025

I'm a bit worried that the changes will slow down the usual evolutions with one/two convolutions. Before we merge, I'd like to test a bit more.

@Radonirinaunimi (Member Author)

> I'm a bit worried that the changes will slow down the usual evolutions with one/two convolutions. Before we merge, I'd like to test a bit more.

That sounds good! Please let me know if I can also help with such a benchmark.

@cschwan (Contributor) commented Aug 16, 2025

> I'm a bit worried that the changes will slow down the usual evolutions with one/two convolutions. Before we merge, I'd like to test a bit more.

This took me a bit longer, but with commit d60c14e from master and

perf record -g -F99 target/release/pineappl evolve \
    test-data/SIHP-PP-POLARIZED-STAR-NLO.pineappl.lz4 \
    test-data/SIHP-PP-POLARIZEDSTAR-NLO_polarized.tar,test-data/Downloads/SIHP-PP-POLARIZEDSTAR-NLO_time_like.tar \
    /tmp/deleteme.pineappl \
    NNPDF40_nlo_as_01180+p,NNFF10_PIm_lo+f \
    --accuracy 10000000.0

after a runtime of roughly 2 hours and 15 minutes on a single AMD Ryzen 7 9700X @ 5.5 GHz,

perf report --call-graph --no-children

tells me:

+   75.77%  pineappl  pineappl              [.] matrixmultiply::dgemm_kernel::kernel_target_fma
+    6.63%  pineappl  pineappl              [.] matrixmultiply::packing::pack_avx2
+    5.43%  pineappl  pineappl              [.] matrixmultiply::packing::pack_avx2
+    2.28%  pineappl  pineappl              [.] matrixmultiply::gemm::gemm_loop
+    2.03%  pineappl  pineappl              [.] matrixmultiply::gemm::masked_kernel
+    1.82%  pineappl  libc.so.6             [.] 0x00000000000a3788
+    1.71%  pineappl  libc.so.6             [.] 0x00000000000a3b7d
+    0.53%  pineappl  libc.so.6             [.] 0x00000000000a3b80
     0.52%  pineappl  pineappl              [.] pineappl::evolution::general_tensor_mul
[..]

That confirms that indeed most of the CPU time is spent performing matrix multiplication, the operations that perform the actual evolution: these are the functions from the matrixmultiply crate, and they account for 92.15 % of the CPU time (the first five functions). That's great!

Note that the comparison of the generated FK-table and the original grid fails because the convolution functions aren't the right ones, and maybe the EKOs are also not accurate enough, but hopefully it's close enough for a performance benchmark.

@cschwan (Contributor) commented Aug 16, 2025

Concerning the parallelization strategy, I tried something different myself: running each $Q^2$-slice in parallel. That is much easier, but given that in this example there are fewer slices than my desktop PC has cores, it will probably not scale well enough. Even worse, we'd have the entire operator(s) in memory, which we surely don't want. So this is to say that I'm now quite confident that @Radonirinaunimi got the parallelization strategy just right.

However, I still believe that we can simplify this a bit and to this end I wrote commit d60c14e, which rewrites part of the evolution code so that it is hopefully much more readable and understandable (than my own code that I wrote some time ago 😄).

@cschwan (Contributor) commented Aug 16, 2025

I've changed the code such that we don't need synchronization types anymore, so no more `Arc` and `Mutex`. The parallelization is a bit different, as the following loop still exists:

https://github.com/NNPDF/pineappl/pull/352/files#diff-9048db4f26502fa421577e9ff22c83d42d7497c982fcd892a6d90d2907b155aaL448

- We can possibly merge this into the parallelization again.
- Another point is that there is now a lot of allocation going on, since `.par_iter_mut().zip()` requires the zip argument to be indexable, which means we collect `tmp` into a `Vec` (see the sketch after this list).
- Finally, we may want to look at the performance of smaller grids and enable parallelization selectively.
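A minimal sketch of the allocation pattern from the second point above (hypothetical names, not the actual evolution code): Rayon's `zip` needs both sides to be indexed parallel iterators, so the lazily computed temporaries are materialised into a `Vec` first.

```rust
use rayon::prelude::*;

fn main() {
    let inputs = vec![1.0_f64, 2.0, 3.0, 4.0];
    let mut outputs = vec![0.0_f64; inputs.len()];

    // The temporaries have to be collected: a plain lazy iterator cannot be
    // zipped with `par_iter_mut()`, hence the extra allocation.
    let tmp: Vec<f64> = inputs.iter().map(|x| x * x).collect();

    outputs
        .par_iter_mut()
        .zip(tmp) // a Vec converts into an indexed parallel iterator
        .for_each(|(out, t)| *out = t + 1.0);

    assert_eq!(outputs, vec![2.0, 5.0, 10.0, 17.0]);
}
```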

@cschwan (Contributor) commented Aug 17, 2025

Using the command line in #352 (comment) with PineAPPL from master and one CPU running at 5.5 GHz takes

real    136m50.580s
user    136m39.898s
sys     0m2.840s

Running commit eedd0c0 from this branch with RAYON_NUM_THREADS=8 and 8 cores running at 4.8 GHz:

real    18m28.670s
user    145m26.534s
sys     0m2.832s

That's a speed-up of 7.4, which is pretty good considering the CPUs run at different frequencies. With RAYON_NUM_THREADS=16 (the default on my system) and CPUs running at 4.4 GHz:

real    16m41.623s
user    261m1.471s
sys     0m3.826s

which is a speedup of 8.2.
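For reference, these speedups follow directly from the wall-clock (real) times quoted above:

$$
\frac{136\,\mathrm{m}\,50.6\,\mathrm{s}}{18\,\mathrm{m}\,28.7\,\mathrm{s}} = \frac{8210.6\,\mathrm{s}}{1108.7\,\mathrm{s}} \approx 7.4\,, \qquad
\frac{136\,\mathrm{m}\,50.6\,\mathrm{s}}{16\,\mathrm{m}\,41.6\,\mathrm{s}} = \frac{8210.6\,\mathrm{s}}{1001.6\,\mathrm{s}} \approx 8.2\,.
$$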

@Radonirinaunimi (Member Author)

Thanks a lot @cschwan for performing these benchmarks and for making these updates. These changes look good to me.

Regarding the following:

> Finally, we may want to look at the performance of smaller grids and enable parallelization selectively.

I guess the question is whether or not smaller grids are at all affected by the parallelization. I can try to run some small benchmarks so that we can decide.
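As a starting point for such a benchmark, here is a minimal sketch of what enabling parallelization selectively could look like (the threshold and names are hypothetical and would need tuning against real grids):

```rust
use rayon::prelude::*;

/// Assumed cut-off below which the sequential path is used; to be tuned.
const PARALLEL_THRESHOLD: usize = 64;

fn process_row(row: &mut Vec<f64>) {
    // Stand-in for the real per-row evolution work.
    for x in row.iter_mut() {
        *x *= 2.0;
    }
}

fn evolve_rows(rows: &mut [Vec<f64>]) {
    if rows.len() < PARALLEL_THRESHOLD {
        // Small grids: a plain loop avoids Rayon's task-splitting overhead.
        rows.iter_mut().for_each(process_row);
    } else {
        rows.par_iter_mut().for_each(process_row);
    }
}

fn main() {
    let mut small = vec![vec![1.0_f64; 8]; 10];
    let mut large = vec![vec![1.0_f64; 8]; 1_000];
    evolve_rows(&mut small);
    evolve_rows(&mut large);
    assert_eq!(small[0][0], large[0][0]);
}
```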

@cschwan (Contributor) commented Aug 21, 2025

> Regarding the following:
>
> > Finally, we may want to look at the performance of smaller grids and enable parallelization selectively.
>
> I guess the question is whether or not smaller grids are at all affected by the parallelization. I can try to run some small benchmarks so that we can decide.

I was worried that the new code would slow down fast evolutions, but I'm not worried about it anymore.

Would you like to review it?

@Radonirinaunimi (Member Author)

Thanks a lot @cschwan for everything! I just reviewed this and everything looks fine to me! I simply fixed the merge conflict by re-arranging the CHANGELOG to follow the previous order (which also seems to be the natural order in https://keepachangelog.com/en/1.0.0/).

So, IMO, this could be merged.

@cschwan merged commit c636eda into master on Aug 22, 2025
10 checks passed
@cschwan deleted the parallelize-evolve branch on August 22, 2025 at 04:21