
Change internal array representation to LargeListArray#462

Draft
hombit wants to merge 1 commit into main from large-list

Conversation

@hombit
Collaborator

@hombit hombit commented Mar 3, 2026

Changes the internal NestedExtensionArray to use pa.LargeListArray (int64 offsets) instead of pa.ListArray (int32 offsets). This is motivated by the need to support dataframes with more than 2**31 nested elements, which can happen when loading large datasets with nested-pandas or returning large results from LSDB with .compute(). (I ran into it myself when working with DP2 pilots.)

This PR introduces breaking changes: by default all outputs are now large lists, including ndf.nested.to_lists(), ndf.to_parquet(), pa.array(ndf.nested), etc. However, this PR adds a new large_list: bool = True argument which, when set to False, returns "normal" lists. I'd like to hear opinions, from the perspective of hats/hats-import/lsdb usage, on whether we should keep this behavior or make False the default.

The alternative design would be better support for chunked arrays, since we re-chunk the data quite aggressively in some operations. That would be much harder to implement and test, and could also lead to "memory fragmentation" issues in some use cases (for example, the concatenation of tens of thousands of partitions that happens when running lsdb.Catalog.compute() over a large catalog).

Closes #95

@hombit hombit requested a review from dougbrn March 3, 2026 18:58
@github-actions

github-actions bot commented Mar 3, 2026

Before [a11d359] After [d40c10a] Ratio Benchmark (Parameter)
28.2±0.8ms 31.3±0.5ms ~1.11 benchmarks.AssignSingleDfToNestedSeries.time_run
48.2±0.5ms 53.5±2ms ~1.11 benchmarks.ReassignHalfOfNestedSeries.time_run
66.3±0.5ms 68.9±0.3ms 1.04 benchmarks.CountNestedBy.time_run
256M 262M 1.02 benchmarks.AssignSingleDfToNestedSeries.peakmem_run
263M 268M 1.02 benchmarks.ReassignHalfOfNestedSeries.peakmem_run
1.16G 1.18G 1.01 benchmarks.ReadFewColumnsS3.peakmem_run
137M 137M 1.00 benchmarks.CountNestedBy.peakmem_run
105M 104M 0.99 benchmarks.NestedFrameAddNested.peakmem_run
110M 109M 0.99 benchmarks.NestedFrameQuery.peakmem_run
109M 108M 0.99 benchmarks.NestedFrameReduce.peakmem_run


@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 97.29730% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.24%. Comparing base (61afc5f) to head (85eac5d).
⚠️ Report is 20 commits behind head on main.

Files with missing lines Patch % Lines
src/nested_pandas/series/utils.py 95.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #462      +/-   ##
==========================================
- Coverage   97.34%   97.24%   -0.11%     
==========================================
  Files          19       19              
  Lines        2148     2178      +30     
==========================================
+ Hits         2091     2118      +27     
- Misses         57       60       +3     


Collaborator

@dougbrn dougbrn left a comment


I think this looks like a reasonable implementation. I have a couple of thoughts/comments:

  • Expectedly, there's a performance hit from this change (~10% on two of our benchmarks). It sounds like you have use cases where you've run into the limit, but it does hurt to take a hit like this in all cases because of incompatibility at the edges.
  • Have you considered some kind of pandas.options parallel here for us to swap the backend? Probably this is a can of worms, but I didn't know if you had thought about it at all.
  • As to the default value for the large_list kwarg, I don't know; I could see arguments for both. I initially liked False for minimal disruption of potential downstream workflows, but I'm not sure whether invoking the downcast hits performance at all in these cases. Default True seems nice in that the only reason someone would move off of it would be fine-tuning performance (again, if that even provides a benefit) or downstream compatibility.

@hombit
Collaborator Author

hombit commented Mar 4, 2026

Thank you, @dougbrn!

  • Expectedly, there's a performance hit from this change (~10% on two of our benchmarks). It sounds like you have use cases where you've run into the limit, but it does hurt to take a hit like this in all cases because of incompatibility at the edges.

Oh, I missed that; it is a very good point! Let's see if I can do anything to improve the performance. I actually believe this edge case is very important from the perspective of large-catalog analysis with LSDB. We can also think about alternative designs; see the comment below and the PR description.

  • Have you considered some kind of pandas.options parallel here for us to swap the backend? Probably this is a can of worms, but I didn't know if you had thought about it at all.

I don't like pandas.options; it is too implicit. It would also be very hard to test and debug, both on our side and the user's.

  • As to the default value for the large_list kwarg, I don't know; I could see arguments for both. I initially liked False for minimal disruption of potential downstream workflows, but I'm not sure whether invoking the downcast hits performance at all in these cases. Default True seems nice in that the only reason someone would move off of it would be fine-tuning performance (again, if that even provides a benefit) or downstream compatibility.

I think I'll be fine with large_list=False by default. The only downside is that a pipeline debugged on a small dataset would unexpectedly fail on a large dataset, where large_list=True would actually be required.

Meta-comment
One more alternative design is supporting both LargeList and List at the Dtype/ExtensionArray level, but that makes the user interface much trickier. Another reason I think LargeList by default is good is that Polars switched to it after trying List for a while; I think we can trust their experience.

@hombit hombit marked this pull request as draft March 6, 2026 22:32
@hombit
Collaborator Author

hombit commented Mar 6, 2026

I'm converting this to draft and working on the "chunking" alternative.



Development

Successfully merging this pull request may close these issues.

Handle series with more than 2^31 "flat" elements
