Fix integer underflow in value_count computation for nested columns in ducklake_add_data_files #675
I'm not too familiar with how forking works, so I ended up accidentally deleting the original PR (#656) when fixing my fork. 😞
Disclaimer
I don't have much context around DuckDB + DuckLake yet (I just started using both last week), so I got Claude Opus 4.5 to do most of the heavy lifting when debugging and writing the code for this PR, and then proofread the results. I'm happy to receive feedback and work on any changes that are deemed necessary.
I ended up not opening an issue for this since I already have the fix in this PR; if an issue is a prerequisite, I can take the time to open one. Also, hi! This is my first issue/PR here :)
Summary
Fixes a bug where `ducklake_add_data_files` fails with a "bigint out of range" error when registering parquet files containing `List(Struct(...))` columns with NULL values at various nesting levels.
Issue Reproduction Details
We can use this Python script to generate a Parquet file that will trigger the failure.
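A minimal sketch of such a script (the column names and values are illustrative; any PyArrow table with a `List(Struct(...))` column containing NULLs at several nesting levels exercises the same path):

```python
import pyarrow as pa
import pyarrow.parquet as pq

struct_type = pa.struct([("a", pa.int64()), ("b", pa.string())])

# NULLs at every nesting level: the list itself, a struct element
# inside a list, and a leaf field inside a struct.
column = pa.array(
    [
        None,                         # NULL list
        [],                           # empty list
        [None],                       # NULL struct inside a list
        [{"a": None, "b": "x"}],      # NULL leaf inside a struct
        [{"a": 1, "b": "y"}, None],
    ],
    type=pa.list_(struct_type),
)

pq.write_table(pa.table({"nested": column}), "nested_nulls.parquet")
```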
Start a PostgreSQL instance using Docker:
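Something like the following, with illustrative credentials and container name:

```sh
docker run --rm -d --name ducklake-pg \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  postgres
```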
Then, in DuckDB:
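Roughly the following (the catalog name, connection string, and paths are illustrative):

```sql
INSTALL ducklake;
INSTALL postgres;

-- Use the PostgreSQL instance above as the DuckLake metadata catalog.
ATTACH 'ducklake:postgres:dbname=postgres host=localhost user=postgres password=postgres'
    AS lake (DATA_PATH 'lake_data/');

-- Create an empty table with the parquet file's schema, then register the file.
CREATE TABLE lake.nested_nulls AS FROM 'nested_nulls.parquet' LIMIT 0;
CALL ducklake_add_data_files('lake', 'nested_nulls', 'nested_nulls.parquet');
```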
The expected result is that the parquet file is registered normally. Instead, the transaction fails with a "bigint out of range" error from PostgreSQL.
Problem
When adding PyArrow-generated parquet files with complex nested structures, the `value_count` field computation could produce underflowed values that exceed PostgreSQL's signed BIGINT limit and cause the transaction to fail.
Root cause: `value_count` for nested columns is computed with unsigned 64-bit arithmetic, and NULLs at outer nesting levels can push the subtraction below zero, wrapping it around to an enormous positive number.
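For intuition, here is a standalone sketch of the failure mode (illustrative variable names, not the actual DuckLake code):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Counts gathered from parquet metadata are unsigned 64-bit.
    uint64_t num_values = 2;  // physical values at an inner nesting level
    uint64_t null_count = 5;  // NULLs attributed from outer levels

    // Unsigned subtraction cannot go negative: it wraps around.
    uint64_t value_count = num_values - null_count;

    // Prints 18446744073709551613, far above PostgreSQL's BIGINT max
    // (9223372036854775807), so inserting it into the metadata catalog
    // fails with "bigint out of range".
    std::cout << value_count << "\n";
    return 0;
}
```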
Changes
- `src/functions/ducklake_add_data_files.cpp`
- `src/storage/ducklake_transaction.cpp`
- `test/sql/add_files/add_files_nested_list_struct_nulls.test`

Testing
Added `test/sql/add_files/add_files_nested_list_struct_nulls.test`, which registers a parquet file containing nested `List(Struct(...))` columns with NULLs and verifies that `add_data_files` works.