Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reboost should be fast #27

Open
tdixon97 opened this issue Jan 30, 2025 · 4 comments
Open

reboost should be fast #27

tdixon97 opened this issue Jan 30, 2025 · 4 comments
Assignees

Comments

@tdixon97
Copy link
Collaborator

We should discuss checks on reboost performance, identify the limiting factors etc.

@tdixon97 tdixon97 self-assigned this Jan 30, 2025
@tdixon97
Copy link
Collaborator Author

tdixon97 commented Feb 6, 2025

@willquinn I made some tests regarding our sorting issues.
I just compare some different methods to sort.
The code is below, my conclusions:

  • njit and sort a list of numpy arrays probably will not help so much
  • the lexsort I used in hit_table_layout is very slow and can surely be improved,
  • sorting is probably just inherently computationally intensive for us and we need to find other ideas.
  • we need less steps.

Eg. for the time grouping I was considering to exploit that steps within a track should be naturally time sorted.

@gipert what do you think?

Here is the code:

import numpy as np
import awkward as ak
import numba

# generate some random "times" and "idxs"
idx = np.sort(np.random.randint(0, 10**3, size=10**7))
time = np.random.random(size=10**7)
arr = ak.Array({"idx":idx,"time":time})
arr = ak.unflatten(arr,ak.run_lengths(arr.idx))

tried a basic ak.sort on the ak.Array:

# try a standard ak.
%%time
sort = ak.flatten(ak.sort(arr,axis=-1)).to_numpy()
CPU times: user 971 ms, sys: 104 ms, total: 1.07 s
Wall time: 1.07 s

Then an argsort:

%%time
sort = arr[ak.argsort(arr.time,axis=-1)]
CPU times: user 795 ms, sys: 88.1 ms, total: 883 ms
Wall time: 880 ms

Then a np.lexsort joint sorting of both lists (what I used in hit_group_layout):

%%time
# sort first by idx then by time
sort  = np.lexsort((time,idx))
CPU times: user 3.35 s, sys: 16 ms, total: 3.36 s
Wall time: 3.36 s

Finally I wrote some simple numba jit function:

from numba.typed import List
@numba.njit
def sort_subarrays(arr_list):
    for i in range(len(arr_list)):
        arr_list[i] = np.sort(arr_list[i])
    return arr_list

But it requires converting the argument to a numba list which is very slow:

%%time
arr_l = List(arr.time.to_list())
CPU times: user 6.15 s, sys: 184 ms, total: 6.33 s
Wall time: 6.33 s

Finally, I compile the function (run once) then run again:

%%time
a = sort_subarrays(arr_l)
CPU times: user 777 ms, sys: 4.13 ms, total: 781 ms
Wall time: 775 ms

@gipert
Copy link
Member

gipert commented Feb 6, 2025

But it requires converting the argument to a numba list which is very slow

this is certainly not a good idea, maybe there is something useful here https://awkward-array.org/doc/main/user-guide/how-to-use-in-numba-intro.html

@ManuelHu
Copy link
Collaborator

ManuelHu commented Feb 6, 2025

probably this is more helpful: https://awkward-array.org/doc/main/user-guide/how-to-use-in-numba-features.html#casting-one-dimensional-arrays-as-numpy

this compiles and runs, for example:

@numba.njit
def sort_subarrays(arr):
    for i, a in enumerate(arr):
        np.sort(np.asarray(a.time))

(just to illustrate that we can call np functions directly on ak arrays, did not bother to implement the whole sorting right now)

@tdixon97
Copy link
Collaborator Author

tdixon97 commented Feb 6, 2025

I doubt we will get an improvement over ak.sort but I will try...
Edit I had a go, it seems easy enough to compile some ak functions, but I am not sure how to return an awkward array (issue because they are immutable so eg. Manuels code does nothing)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants