Fix integer underflow in value_count computation for nested columns in ducklake_add_data_files #656
Disclaimer
I don't have much context around DuckDB + DuckLake yet (I just started using both last week), so I had Claude Opus 4.5 do most of the heavy lifting when debugging and writing the code for this PR, and then proofread the results. I'm happy to receive feedback and work on any changes that are deemed necessary.
I ended up not opening an issue for this since I already have the fix in the PR. If an issue is a prerequisite, I can take the time to open one. Also, hi! This is my first issue/PR here :)
Summary
Fixes a bug where `ducklake_add_data_files` fails with a "bigint out of range" error when registering Parquet files containing `List(Struct(...))` columns with NULL values at various nesting levels.
Issue Reproduction Details
We can use this Python script to generate a Parquet file that will trigger the failure.
Start a PostgreSQL instance using Docker:
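The original command is not included; something along these lines, with the container name, password, database name, and port all placeholders:

```shell
# Hypothetical setup for a throwaway PostgreSQL catalog database.
docker run --name ducklake-pg \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=ducklake_catalog \
  -p 5432:5432 \
  -d postgres:16
```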
Then, in DuckDB:
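The original SQL is not included; a sketch of what the DuckDB side could look like, assuming a DuckLake catalog backed by the PostgreSQL instance above. Connection parameters, the table name, and the Parquet path are placeholders, and the exact `ducklake_add_data_files` arguments should be checked against the DuckLake documentation.

```sql
-- Hypothetical reproduction; names and credentials are placeholders.
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost user=postgres password=postgres' AS lake;
CREATE TABLE lake.events (id BIGINT, items STRUCT(a BIGINT, b VARCHAR)[]);
CALL ducklake_add_data_files('lake', 'events', 'nested_nulls.parquet');
```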
The expected result is that the Parquet file registers normally. Instead, the transaction fails with the "bigint out of range" error.
Problem
When adding PyArrow-generated Parquet files with complex nested structures, the `value_count` computation could underflow, producing values that exceed PostgreSQL's signed BIGINT limit and cause the transaction to fail.
Root causes:
Changes
`src/functions/ducklake_add_data_files.cpp`
`src/storage/ducklake_transaction.cpp`
`test/sql/add_files/add_files_nested_list_struct_nulls.test`
Testing
`add_data_files` works.