-
Notifications
You must be signed in to change notification settings - Fork 84
feat: add tree to virtual array conversion #1393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@pfackeldey - what is the plan for this PR? thanks! |
I think we were discussing with Peter to add some caching support. Uproot will deserialize each electron branch for example separately with its own offsets. All those will have the same |
Isn't there already some caching being done in Uproot? When trying to read the count_branch multiple times it should already be hitting the cache |
Yeah I need to try if it's hitting it, haven't done that yet. Will do today. Do you the best way to log that (the number of deserializations per branch)? |
I'm not sure. I've just skimmed the code since at some point I'll have to do that for RNTuples |
I'm not sure. I'm not a big fan of this implementation, but I also don't know how it can be done in a better way. I was hoping for some input here. |
updates: - [github.com/astral-sh/ruff-pre-commit: v0.11.4 → v0.11.5](astral-sh/ruff-pre-commit@v0.11.4...v0.11.5) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Updated Pyodide version * Pinned chrome version * Changed chrome version * Try using node instead of chrome * Remove chrome-specific setup * Actually use Node * Go back to chrome
updates: - [github.com/astral-sh/ruff-pre-commit: v0.11.5 → v0.11.6](astral-sh/ruff-pre-commit@v0.11.5...v0.11.6) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
fix issue with empty big_endian array Co-authored-by: Ianna Osborne <[email protected]>
* safer branch title access * empty str -> None --------- Co-authored-by: Ianna Osborne <[email protected]>
* docs: add contributing guide * style: pre-commit fixes * Update CONTRIBUTING.md Co-authored-by: Andres Rios Tascon <[email protected]> * Update CONTRIBUTING.md Co-authored-by: Andres Rios Tascon <[email protected]> * Update CONTRIBUTING.md Co-authored-by: Andres Rios Tascon <[email protected]> * use pre-commit * build local documentation howto --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Andres Rios Tascon <[email protected]>
Hi @ianna, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pfackeldey - Thanks! I would rather avoid duplicating the code. What was the reason for copying the form_with_unique_keys
utility function here? Thanks.
Needs: scikit-hep/awkward#3482 and thus a new awkward release |
Hi, I just found this while wondering how to use / demonstrate virtual arrays without |
No, nothing is blocking it. In fact we have talked about it briefly at a meeting. Now that virtual arrays are in the release, it should go ahead. Thanks for bringing it up! |
I think the only blocking thing was Pyodide, since it required an older version of Awkward. There was a Pyodide release yesterday, so I'll wrap up #1464 so that that this can proceed. |
If I may add a comment here. Merging this would mean the uproot has to require a relatively new version of awkward (which doesn't have to without this PR). I think there are two ways to go with this: Everything is up to you of course, I would just like to mention this. |
I'd prefer doing this via requirements and not parsing versions in the code. This is much more robust and we don't have to maintain/think of these |
def virtual_arrays( | ||
self, | ||
*, | ||
filter_name=no_filter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think people will complain about not having expression support like the regular arrays
method?
(I don't blame you for not wanting to do it. This is why I want to fix formulate
so that I can use it for RNTuples.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add it in the future once formulate if fixed I'd say. I'd argue it's better to restrict functionality and open it up later once it's in a working state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid API sprawl, we can introduce a single, unified API, like:
def arrays(
self,
expressions=None,
cut=None,
*,
filter_name=no_filter,
filter_typename=no_filter,
filter_branch=no_filter,
aliases=None,
language=uproot.language.python.python_language,
entry_start=None,
entry_stop=None,
decompression_executor=None,
interpretation_executor=None,
array_cache="inherit",
library="ak",
ak_add_doc=False,
how=None,
virtual=False, # <--- NEW
access_log=None # <--- NEW (only applies if virtual=True)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a great idea! I think I tried this in the beginning, but for some reason decided for a new method... can't remember why.
I'd be in favor to make it work like you proposed @ianna
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case we need to make sure that expressions work. Otherwise, it will be very confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can always just error for certain argument combinations with reasonable messages. Selecting branches is already there with filter_branch
, right? I don't see a reason why you'd want to do this with an expression as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like implicit materializations though.....hmmmm....we can think about this a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with not supporting expressions, but I think in that case having virtual_arrays
would make more sense to make it clear that the api is a bit different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with not supporting expressions, but I think in that case having
virtual_arrays
would make more sense to make it clear that the api is a bit different.
that may have been my original motivation for a separate method. there may be other args as well the make sense for eager arrays and not virtual arrays and vice-versa. Maybe we should understand how many weird/invalid argument combinations there are to see if it is valid to move it into a different method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add a bit of awkward1/uproot4 history. Uproot4 had uproot.lazy
which was not even a TTree method. This is because it behaved more like uproot.dask
though (it could read more than 1 files at the same time). Awkward1 methods like ak.from_parquet
had a lazy=True
option for virtual reading.
If indeed .virtual_arrays
is very different from .arrays
, I like the separate method. If there is a ton of overlap, it can be an option to .arrays
. I don't know the uproot codebase though so I don't have any strong takes.
Also, I think it's fine since Awkward and Uproot are co-maintained, so if one can be upgraded, so can the other one. It's not like Awkward dropped support for Python 3.9, while Uproot still supported it or something like that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, eventually the users would prefer using virtual arrays (aka ak.Array
with virtual buffers :-) by default.
def virtual_arrays( | ||
self, | ||
*, | ||
filter_name=no_filter, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid API sprawl, we can introduce a single, unified API, like:
def arrays(
self,
expressions=None,
cut=None,
*,
filter_name=no_filter,
filter_typename=no_filter,
filter_branch=no_filter,
aliases=None,
language=uproot.language.python.python_language,
entry_start=None,
entry_stop=None,
decompression_executor=None,
interpretation_executor=None,
array_cache="inherit",
library="ak",
ak_add_doc=False,
how=None,
virtual=False, # <--- NEW
access_log=None # <--- NEW (only applies if virtual=True)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pfackeldey - thanks! Looks good to me. Please merge it if you’re done with it. Thanks
Oh, it’s still in a draft mode 😅 |
This waits for scikit-hep/awkward#3364 (and a corresponding awkward release).
I may likely not have attempted the most optimal solution here. Happy for feedback & input.