Optimize GCLDAModel with numba and Parallel #744
Conversation
Codecov Report
Base: 88.56% // Head: 87.68% // Decreases project coverage by -0.88%.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #744      +/-   ##
==========================================
- Coverage   88.56%   87.68%   -0.88%
==========================================
  Files          38       35       -3
  Lines        4371     3913     -458
==========================================
- Hits         3871     3431     -440
+ Misses        500      482      -18
```

☔ View full report at Codecov.
    
With the changes to the two update methods, it seems like they could be converted to functions instead. Is there any benefit to using static methods?

You're right. The two update methods can be converted to functions. I don't see any benefit in using static methods.
    
        
          
nimare/annotate/gclda.py (Outdated)

```python
# Float precision issue in Numba: https://github.com/numba/numba/issues/3426
probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)
```
This line will need to be removed in the future. It ensures the sum never goes over 1.0 so that we don't get this error in numba, but the truncation causes the maps in the example to look different from the ones produced by previous versions.
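For context, a minimal illustration of the failure mode this line guards against (constructed for this note, not taken from the PR):

```python
import numpy as np

# After normalizing, floating-point rounding can leave the probabilities
# summing to slightly more than 1.0, which trips numba's
# "binomial(): p outside of [0, 1]" check inside np.random.multinomial.
probs = np.random.rand(4000)
probs_pdf = probs / np.sum(probs)

# Truncating at 12 decimal places only ever shrinks each entry, so the
# truncated vector sums to at most 1.0, at the cost of a small bias
# that changes the sampled maps.
probs_pdf = np.trunc(probs_pdf * (10**12)) / (10**12)
assert np.sum(probs_pdf) <= 1.0
```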
I was running some tests with Neurosynth and Neuroquery, and got the error again:

```text
File "<...>/lib/python3.9/site-packages/nimare/annotate/gclda.py", line 622, in fit
    self._update(loglikely_freq=loglikely_freq)
  File "<...>/lib/python3.9/site-packages/nimare/annotate/gclda.py", line 681, in _update
    ) = _update_peak_assignments(
  File "<...>/lib/python3.9/site-packages/numba/cpython/randomimpl.py", line 833, in binomial_impl
    raise ValueError("binomial(): p outside of [0, 1]")
ValueError: binomial(): p outside of [0, 1]
```
Would np.clip help here? During normalization, you could do this:

```python
probs_pdf = (probs_pdf / np.sum(probs_pdf)).clip(0, 1)
```
I tried using clip but it still raised the same error. I think it has to do with the way numba has the sum implemented.
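If the culprit is summation order, clipping can't fix it; here is a sketch (my own construction, assuming numba accumulates the sum naively) of why bounding each element leaves the total unbounded by exactly 1.0:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random(100_000)
probs_pdf = (probs / np.sum(probs)).clip(0, 1)  # every entry is in [0, 1]

# np.sum uses pairwise summation, while a naive left-to-right loop
# (closer to how a compiled sum may accumulate) collects rounding error
# differently, so the total can still land just above 1.0 even though
# clipping bounded every individual entry.
naive_sum = 0.0
for p in probs_pdf:
    naive_sum += p
print(np.sum(probs_pdf), naive_sum)
```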
Given these limitations (the precision issue with sum, and codecov not tracking jitted functions), do you think we should try Cython for these two functions?

I think @jdkent or @adelavega would be better able to answer that.
    
        
          
nimare/annotate/gclda.py (Outdated)

```python
LGR = logging.getLogger(__name__)


@jit(nopython=True, cache=True)
```
would codecov track this function if nopython=False?
I'm not sure. Let me try this out.
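For what it's worth, one general numba option (not something tried in this thread) is to disable JIT compilation entirely during test runs so coverage sees plain Python:

```python
# NUMBA_DISABLE_JIT=1 must be set before numba compiles anything; with it,
# @jit functions run as ordinary Python, so coverage tools can trace them.
import os
os.environ["NUMBA_DISABLE_JIT"] = "1"

import numpy as np
from numba import jit

@jit(nopython=True, cache=True)
def normalize(probs):
    return probs / np.sum(probs)

print(normalize(np.array([1.0, 2.0, 3.0])))  # executes uncompiled, traceable
```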
@adelavega …
    
        
          
nimare/annotate/gclda.py (Outdated)

```python
# from the sampling distribution
# Returns a binary [1 x R*T] vector with a '1' in location that was sampled
# and zeros everywhere else
assignment_vec = np.random.multinomial(1, probs_pdf)
```
The documentation says to "use the multinomial method of a default_rng() instance instead". If we're lucky, that might also be faster?
I have tried different combinations with:

```python
rng = np.random.default_rng()
assignment_vec = rng.multinomial(1, probs_pdf)
```

and everything has failed in numba.

Using the correction proposed in this issue (https://github.com/numba/numba/issues/3426), the current assignment_vec = np.random.multinomial(1, probs_pdf) works fine without truncating probs_pdf.
Ah, that's good to hear. I'd love to know what the speedup is using numba on a large-ish dataset. When you get a chance to run some benchmarks, ping me.

We then need to decide if it's worth the "risk" of forking numba. Unfortunately, that is potentially a maintenance problem.
@adelavega Running times per iteration for 200 topics: …

Typically we have 10,000 iterations, so to train GCLDA on Neurosynth with …
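For reference, per-iteration timings like these can be collected with a simple wall-clock sketch; the constructor arguments below are placeholders, not the exact inputs used in this benchmark:

```python
import time
from nimare.annotate.gclda import GCLDAModel

# count_df and coordinates_df stand in for the Neurosynth-derived inputs.
model = GCLDAModel(count_df, coordinates_df, n_topics=200)

start = time.time()
model._update()  # one full Gibbs-sampling update over all tokens
print(f"_update() took {time.time() - start:.1f} s for this iteration")
```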
    
The latest changes showed improvement in … Creating chunks (… However, this is only scalable to a small number of CPUs. I got small linear scalability for only <= 4 CPUs on a local machine and <= 8 CPUs on an HPC.

What do you all think? Should we: …

I think option 1 is good, since using numba for …
    
@JulioAPeraza Awesome! Q: why is it only scalable to a small number of CPUs? I don't quite understand that. But given that limitation, I agree with your reasoning and think option 1 is a good choice. That's an order of magnitude, so not bad.
    
@adelavega …
    
Ah, disk I/O would certainly explain it. I wonder if using "threads" would also help: https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism
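A runnable sketch of that idea (toy work function; it assumes the real per-chunk work releases the GIL, as numba code compiled with nogil=True does):

```python
import numpy as np
from joblib import Parallel, delayed

def process_chunk(chunk):
    # Stand-in for the per-chunk update function.
    return chunk.sum()

data = np.arange(1_000_000, dtype=np.float64)
chunks = np.array_split(data, 8)

# prefer="threads" keeps workers in one process, so the large arrays are
# shared in memory instead of being pickled (and possibly memmapped to
# disk) for each worker.
results = Parallel(n_jobs=4, prefer="threads")(
    delayed(process_chunk)(chunk) for chunk in chunks
)
print(sum(results))
```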
    
I tested the scalability of the function with more CPUs on the HPC and it looks good. I also ran some tests using …

This PR is ready for review. Thanks!
    
Do you know if explicitly setting parallel=True would help? https://numba.readthedocs.io/en/stable/user/parallel.html

It looks like the main concern would be that the same matrices are being accessed by the document loop, but it looks like it would be different parts of the arrays?
    
Thanks, @adelavega.

I tried …
    
Looks good; just a minor suggestion for readability from me.
        
          
nimare/annotate/gclda.py (Outdated)

```python
[
    self.data["ptoken_doc_idx"][chunk_i : chunk_i + chunk_size],
    old_regions[chunk_i : chunk_i + chunk_size],
    old_topics[chunk_i : chunk_i + chunk_size],
    peak_probs[chunk_i : chunk_i + chunk_size, :, :],
    seeds[chunk_i : chunk_i + chunk_size],
]
```
can this be a dictionary so that the arguments being sent to _update_peak_assignments are explicit?
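For example (the key names are guesses; they would need to match the parameters of _update_peak_assignments so the call site can unpack them):

```python
# Variables here (self, chunk_i, chunk_size, ...) come from the surrounding
# method in gclda.py; only the packaging into a dict is new.
chunk = {
    "ptoken_doc_idx": self.data["ptoken_doc_idx"][chunk_i : chunk_i + chunk_size],
    "peak_regions": old_regions[chunk_i : chunk_i + chunk_size],
    "peak_topics": old_topics[chunk_i : chunk_i + chunk_size],
    "peak_probs": peak_probs[chunk_i : chunk_i + chunk_size, :, :],
    "seeds": seeds[chunk_i : chunk_i + chunk_size],
}
```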
        
          
nimare/annotate/gclda.py (Outdated)

```python
results = Parallel(n_jobs=self.n_cores, max_nbytes=None)(
    delayed(_update_peak_assignments)(
        *chunk,
```
With the above, this would now be **chunk.
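i.e., roughly:

```python
# chunks would now be the list of dicts built above.
results = Parallel(n_jobs=self.n_cores, max_nbytes=None)(
    delayed(_update_peak_assignments)(**chunk) for chunk in chunks
)
```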
        
          
nimare/annotate/gclda.py (Outdated)

```python
self.topics["n_peak_tokens_region_by_topic"],
self.topics["n_peak_tokens_doc_by_topic"],
self.topics["n_word_tokens_doc_by_topic"],
self.params["n_regions"],
self.params["n_topics"],
self.params["alpha"],
self.params["delta"],
self.params["gamma"],
```
A bit redundant, but now the arguments here should be named too (e.g., `gamma=self.params["gamma"]`).
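Something like this (keyword names assumed to match the function signature):

```python
delayed(_update_peak_assignments)(
    **chunk,
    n_regions=self.params["n_regions"],
    n_topics=self.params["n_topics"],
    alpha=self.params["alpha"],
    delta=self.params["delta"],
    gamma=self.params["gamma"],
)
```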
Just want to make note of the current gclda example and the one in this pull request:

current decoding ROI: [image]
pull request decoding ROI: [image]

I suppose I do not know how stochastic the results are on this test dataset, since it is very small.

EDIT: consistent since the first change in this pull request: https://nimare--744.org.readthedocs.build/en/744/auto_examples/03_annotation/04_plot_gclda.html#sphx-glr-auto-examples-03-annotation-04-plot-gclda-py
    
        
          
nimare/annotate/gclda.py (Outdated)

```python
len(self.vocabulary),
self.params["beta"],
self.params["gamma"],
self.seed,
```
I noticed that earlier in the code, self.seed = 0 rather than self.seed = self.params["seed_init"]. Could this be part of the issue?
That probably explains the difference you saw, because we are testing with a small number of iterations. I think self.params["seed_init"] is used to initialize the peak_topic and peak_subregion assignments, whereas self.seed is used to perform the sampling in _update().
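If the hard-coded value were the culprit, the change under discussion would be a one-liner (a sketch only; as noted above, the two seeds may intentionally serve different purposes):

```python
# Derive the sampling seed from the user-provided seed parameter
# instead of hard-coding it.
self.seed = self.params["seed_init"]  # previously: self.seed = 0
```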
        
          
nimare/annotate/gclda.py (Outdated)

```python
Updated number of word-tokens assigned to each topic (T) per document (D).
"""
# --- Seed random number generator
np.random.seed(randseed)
```
It's likely jit does something slightly different here, since the results change if I comment out @jit above.
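That would be consistent with numba keeping its own RNG state, separate from NumPy's; a small sketch of the separation:

```python
import numpy as np
from numba import jit

@jit(nopython=True)
def seed_jitted(seed):
    # In nopython mode this seeds numba's internal RNG state,
    # not NumPy's global generator.
    np.random.seed(seed)

seed_jitted(42)
# NumPy's own stream is untouched by the call above, so draws made here
# come from whatever state NumPy's global RNG already had.
print(np.random.random())
```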
Closes None.

Currently, the `GCLDAModel` is slow, mainly when the vocabulary is large (for example, for the 3228 terms from Neurosynth). `_update_word_topic_assignments` takes ~73 seconds per iteration and `_update_peak_assignments` takes ~52 seconds per iteration, for a total of ~125 seconds per iteration of `_update()`. Compiling and running these functions with `numba` reduces these times to 2, 4, and 6 seconds per iteration, respectively.

Changes proposed in this pull request:

- Convert `_update_word_topic_assignments` and `_update_peak_assignments` to static methods, and add a decorator to compile and run them with `numba` in `nopython` mode.
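A minimal sketch of that pattern (toy body; the real functions take the count matrices and hyperparameters shown in the review comments above):

```python
import numpy as np
from numba import jit

class GCLDAModel:
    @staticmethod
    @jit(nopython=True, cache=True)
    def _toy_update(probs):
        # nopython mode compiles the body to machine code with no Python
        # object calls; cache=True reuses the compiled artifact across runs.
        return probs / np.sum(probs)

print(GCLDAModel._toy_update(np.array([1.0, 2.0, 3.0])))
```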