pfackeldey
Collaborator

This waits for scikit-hep/awkward#3364 (and a corresponding awkward release).

I have likely not found the optimal solution here. Happy to get feedback and input.

@ianna
Collaborator

ianna commented Apr 11, 2025

@pfackeldey - what is the plan for this PR? thanks!

@ikrommyd
Contributor

@pfackeldey - what is the plan for this PR? thanks!

I think we were discussing with Peter to add some caching support. Uproot will deserialize each electron branch, for example, separately with its own offsets. All of those will have the same count_branch, however, so it's probably best not to deserialize the same offsets dozens of times. Instead, we could cache the count_branch deserialization result (length) and reuse it for the other branches that have the same count_branch.
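
The caching idea could be sketched roughly as follows. This is a minimal illustration, not Uproot's implementation: `OffsetsCache`, `load_counts`, and the toy branch names are hypothetical, and it assumes the offsets of a jagged branch can be derived from the counts stored in its count_branch. The counter also shows one simple way to log how often the count branch is actually deserialized.

```python
from itertools import accumulate
from typing import Callable, Dict, List


class OffsetsCache:
    """Hypothetical cache: compute offsets once per count branch and reuse them."""

    def __init__(self, load_counts: Callable[[str], List[int]]):
        self._load_counts = load_counts   # stand-in for the expensive deserialization
        self._offsets: Dict[str, List[int]] = {}
        self.deserializations = 0         # how often a count branch was actually read

    def offsets_for(self, count_branch: str) -> List[int]:
        if count_branch not in self._offsets:
            counts = self._load_counts(count_branch)
            self.deserializations += 1
            # offsets = [0, c0, c0 + c1, ...]
            self._offsets[count_branch] = [0, *accumulate(counts)]
        return self._offsets[count_branch]


# Toy data: three events with 2, 0, and 1 electrons.
fake_file = {"nElectron": [2, 0, 1]}
cache = OffsetsCache(lambda name: fake_file[name])

# Every Electron_* branch shares the same count branch.
for branch in ["Electron_pt", "Electron_eta", "Electron_phi"]:
    offsets = cache.offsets_for("nElectron")

print(offsets)                   # [0, 2, 2, 3]
print(cache.deserializations)    # 1
```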

@ariostas
Collaborator

I think we were discussing with Peter to add some caching support.

Isn't there already some caching being done in Uproot? When trying to read the count_branch multiple times it should already be hitting the cache

@ikrommyd
Contributor

ikrommyd commented Apr 11, 2025

I think we were discussing with Peter to add some caching support.

Isn't there already some caching being done in Uproot? When trying to read the count_branch multiple times it should already be hitting the cache

Yeah, I need to check whether it's hitting it; I haven't done that yet. Will do today. Do you know the best way to log that (the number of deserializations per branch)?

@ariostas
Collaborator

Do you know the best way to log that (the number of deserializations per branch)?

I'm not sure. I've just skimmed the code since at some point I'll have to do that for RNTuples

@pfackeldey
Collaborator Author

@pfackeldey - what is the plan for this PR? thanks!

I'm not sure. I'm not a big fan of this implementation, but I also don't know how it can be done in a better way. I was hoping for some input here.

@ianna ianna added the inactive and help wanted labels and removed the inactive label Apr 17, 2025
pre-commit-ci bot and others added 7 commits April 23, 2025 14:37
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.11.4 → v0.11.5](astral-sh/ruff-pre-commit@v0.11.4...v0.11.5)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Updated Pyodide version

* Pinned chrome version

* Changed chrome version

* Try using node instead of chrome

* Remove chrome-specific setup

* Actually use Node

* Go back to chrome
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.11.5 → v0.11.6](astral-sh/ruff-pre-commit@v0.11.5...v0.11.6)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
fix issue with empty big_endian array

Co-authored-by: Ianna Osborne <[email protected]>
* safer branch title access

* empty str -> None

---------

Co-authored-by: Ianna Osborne <[email protected]>
* docs: add contributing guide

* style: pre-commit fixes

* Update CONTRIBUTING.md

Co-authored-by: Andres Rios Tascon <[email protected]>

* Update CONTRIBUTING.md

Co-authored-by: Andres Rios Tascon <[email protected]>

* Update CONTRIBUTING.md

Co-authored-by: Andres Rios Tascon <[email protected]>

* use pre-commit

* build local documentation howto

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Andres Rios Tascon <[email protected]>
@pfackeldey pfackeldey requested a review from ianna April 23, 2025 18:37
@pfackeldey
Collaborator Author

Hi @ianna,
Finally, I found a good implementation!
This now handles every awkward Content case programmatically. I could reuse logic similar to that used for uproot.dask.
Could you have a look?

@pfackeldey pfackeldey marked this pull request as ready for review April 23, 2025 18:40
@pfackeldey pfackeldey removed the help wanted label Apr 24, 2025
Collaborator

@ianna ianna left a comment

@pfackeldey - Thanks! I would rather avoid duplicating the code. What was the reason for copying the form_with_unique_keys utility function here? Thanks.

@pfackeldey
Collaborator Author

pfackeldey commented Apr 25, 2025

Needs: scikit-hep/awkward#3482 and thus a new awkward release

@pfackeldey pfackeldey marked this pull request as draft April 25, 2025 17:24
@pfackeldey pfackeldey marked this pull request as ready for review April 27, 2025 18:36
@pfackeldey pfackeldey marked this pull request as draft April 27, 2025 20:59
@alexander-held
Member

Hi, I just found this while wondering how to use / demonstrate virtual arrays without coffea and was surprised to not see it in a high-level API in uproot yet. There have been a few awkward releases in the meantime, I am curious if there is something left blocking this?

@ianna
Collaborator

ianna commented Aug 4, 2025

Hi, I just found this while wondering how to use / demonstrate virtual arrays without coffea and was surprised to not see it in a high-level API in uproot yet. There have been a few awkward releases in the meantime, I am curious if there is something left blocking this?

No, nothing is blocking it. In fact we have talked about it briefly at a meeting. Now that virtual arrays are in the release, it should go ahead. Thanks for bringing it up!

@ariostas
Collaborator

ariostas commented Aug 5, 2025

I think the only blocking thing was Pyodide, since it required an older version of Awkward. There was a Pyodide release yesterday, so I'll wrap up #1464 so that this can proceed.

@ikrommyd
Contributor

ikrommyd commented Aug 5, 2025

If I may add a comment here: merging this would mean that uproot has to require a relatively new version of awkward (which it doesn't have to without this PR). I think there are two ways to go with this:
A) Uproot indeed pins awkward to 2.8.2 (or whatever works around that)
B) Uproot allows earlier versions of awkward, but this particular function .virtual_arrays() errors if awkward is not new enough and tells the user to install awkward >= something.

Everything is up to you of course, I would just like to mention this.
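
For illustration, option B might look roughly like the sketch below. Everything here is an assumption: `parse_version` and `require_min_version` are hypothetical helpers, 2.8.2 is only the tentative minimum mentioned above, and the naive parsing ignores pre-release ordering, which a real implementation would need to handle.

```python
import re


def parse_version(text: str) -> tuple:
    """Take the leading numeric components, e.g. '2.8.2rc1' -> (2, 8, 2)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def require_min_version(installed: str, minimum: str, feature: str) -> None:
    """Hypothetical guard: raise a helpful error if `installed` is older than `minimum`."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            f"{feature} requires awkward >= {minimum}, "
            f"but awkward {installed} is installed"
        )


# In real code the installed version would come from awkward.__version__.
require_min_version("2.8.2", "2.8.2", ".virtual_arrays()")  # passes silently

try:
    require_min_version("2.7.4", "2.8.2", ".virtual_arrays()")
except ImportError as err:
    print(err)
```

Option A avoids this code entirely by declaring the minimum version in the package requirements.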

@pfackeldey
Collaborator Author

If I may add a comment here: merging this would mean that uproot has to require a relatively new version of awkward (which it doesn't have to without this PR). I think there are two ways to go with this: A) Uproot indeed pins awkward to 2.8.2 (or whatever works around that) B) Uproot allows earlier versions of awkward, but this particular function .virtual_arrays() errors if awkward is not new enough and tells the user to install awkward >= something.

Everything is up to you of course, I would just like to mention this.

I'd prefer doing this via requirements and not parsing versions in the code. This is much more robust and we don't have to maintain/think of these if conditions in the future.

Comment on lines +731 to +734
def virtual_arrays(
    self,
    *,
    filter_name=no_filter,
Collaborator

Do you think people will complain about not having expression support like the regular arrays method?

(I don't blame you for not wanting to do it. This is why I want to fix formulate so that I can use it for RNTuples.)

Collaborator Author

We can add it in the future once formulate is fixed, I'd say. I'd argue it's better to restrict functionality now and open it up later once it's in a working state.

Collaborator

To avoid API sprawl, we can introduce a single, unified API, like:

def arrays(
    self,
    expressions=None,
    cut=None,
    *,
    filter_name=no_filter,
    filter_typename=no_filter,
    filter_branch=no_filter,
    aliases=None,
    language=uproot.language.python.python_language,
    entry_start=None,
    entry_stop=None,
    decompression_executor=None,
    interpretation_executor=None,
    array_cache="inherit",
    library="ak",
    ak_add_doc=False,
    how=None,
    virtual=False,     # <--- NEW
    access_log=None    # <--- NEW (only applies if virtual=True)
)
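
To make the proposed `virtual` and `access_log` arguments more concrete, here is a toy sketch of a lazily materialized buffer. It is not Awkward's or Uproot's implementation: `LazyBuffer` and its loader are hypothetical, and `access_log` is assumed to be a plain list that records which branches were actually read.

```python
from typing import Callable, List, Optional


class LazyBuffer:
    """Hypothetical stand-in for a virtual buffer: data is read on first access."""

    def __init__(
        self,
        name: str,
        loader: Callable[[], List[float]],
        access_log: Optional[List[str]] = None,
    ):
        self.name = name
        self._loader = loader
        self._data: Optional[List[float]] = None
        self._access_log = access_log

    @property
    def is_materialized(self) -> bool:
        return self._data is not None

    def materialize(self) -> List[float]:
        if self._data is None:
            self._data = self._loader()          # the only place data is read
            if self._access_log is not None:
                self._access_log.append(self.name)
        return self._data


access_log: List[str] = []
pt = LazyBuffer("Electron_pt", lambda: [41.2, 17.5, 63.0], access_log)
eta = LazyBuffer("Electron_eta", lambda: [0.3, -1.1, 2.0], access_log)

print(access_log)              # [] because nothing has been read yet
print(max(pt.materialize()))   # 63.0
print(access_log)              # ['Electron_pt'] because eta was never touched
```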

Collaborator Author

that's a great idea! I think I tried this in the beginning, but for some reason decided for a new method... can't remember why.

I'd be in favor of making it work as you proposed, @ianna

Collaborator

In this case we need to make sure that expressions work. Otherwise, it will be very confusing.

Collaborator Author

We can always just error for certain argument combinations with reasonable messages. Selecting branches is already there with filter_branch, right? I don't see a reason why you'd want to do this with an expression as well.
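
A rough sketch of such a guard is shown here. `check_virtual_arguments` is a hypothetical helper, and the choice of which arguments to reject (`expressions`, `cut`, `aliases`) is an assumption for illustration, not something settled in this thread.

```python
def check_virtual_arguments(virtual=False, expressions=None, cut=None, aliases=None):
    """Hypothetical guard: reject arguments that are assumed invalid for virtual reads."""
    if not virtual:
        return
    unsupported = {"expressions": expressions, "cut": cut, "aliases": aliases}
    given = [name for name, value in unsupported.items() if value is not None]
    if given:
        raise ValueError(
            f"{', '.join(given)} cannot be combined with virtual=True; "
            "select branches with filter_name or filter_branch instead"
        )


check_virtual_arguments(virtual=True)                  # fine
check_virtual_arguments(virtual=False, cut="pt > 20")  # fine: eager read

try:
    check_virtual_arguments(virtual=True, expressions=["pt"], cut="pt > 20")
except ValueError as err:
    print(err)
```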

Contributor

I don't like implicit materializations though.....hmmmm....we can think about this a bit more.

Collaborator

I agree with not supporting expressions, but I think in that case having virtual_arrays would make more sense to make it clear that the api is a bit different.

Collaborator Author

I agree with not supporting expressions, but I think in that case having virtual_arrays would make more sense to make it clear that the api is a bit different.

That may have been my original motivation for a separate method. There may be other args as well that make sense for eager arrays and not for virtual arrays, and vice versa. Maybe we should find out how many weird/invalid argument combinations there are to see whether moving it into a separate method is justified.

Contributor

To add a bit of awkward1/uproot4 history: uproot4 had uproot.lazy, which was not even a TTree method. This is because it behaved more like uproot.dask, though (it could read more than one file at the same time). Awkward1 methods like ak.from_parquet had a lazy=True option for virtual reading.

If indeed .virtual_arrays is very different from .arrays, I like the separate method. If there is a ton of overlap, it can be an option to .arrays. I don't know the uproot codebase though so I don't have any strong takes.

@ariostas
Collaborator

ariostas commented Aug 5, 2025

I'd prefer doing this via requirements and not parsing versions in the code.

Also, I think it's fine since Awkward and Uproot are co-maintained, so if one can be upgraded, so can the other one. It's not like Awkward dropped support for Python 3.9, while Uproot still supported it or something like that.

Collaborator

@ianna ianna left a comment

I think, eventually the users would prefer using virtual arrays (aka ak.Array with virtual buffers :-) by default.

@ianna ianna added the next-release label Aug 14, 2025
@ianna ianna self-requested a review August 28, 2025 15:47
Collaborator

@ianna ianna left a comment

@pfackeldey - thanks! Looks good to me. Please merge it if you’re done with it. Thanks

@ianna
Collaborator

ianna commented Sep 5, 2025

@pfackeldey - thanks! Looks good to me. Please merge it if you’re done with it. Thanks

Oh, it’s still in a draft mode 😅

@ianna ianna removed the next-release label Sep 18, 2025

5 participants