
Conversation

@Costa-SM

Disclaimer
I don't have much context around DuckDB + DuckLake yet (I just started using both last week), so I got Claude Opus 4.5 to do most of the heavy lifting when debugging and writing the code for this PR, and then proofread the results. I'm happy to receive feedback and work on any changes that are deemed necessary.

I ended up not opening an issue for this since I already have the fix in the PR. If an issue is a prerequisite, I can take the time to open one. Also, hi! This is my first issue/PR here :)

Summary
Fixes a bug where ducklake_add_data_files fails with a "bigint out of range" error when registering Parquet files that contain List(Struct(...)) columns with NULL values at various nesting levels.

Issue Reproduction Details

  • DuckDB version: 1.4.3
  • DuckLake version: core_nightly (tested 2024-12-27)
  • PostgreSQL version: 16
  • OS: Linux
  • Python 3.11

We can use this Python script to generate a Parquet file that will trigger the failure.

#!/usr/bin/env python3
"""Minimal reproduction for DuckLake bigint overflow bug."""
import pyarrow as pa
import pyarrow.parquet as pq
import random
import os

OUTPUT_DIR = "/tmp/ducklake_repro"
PARQUET_FILE = f"{OUTPUT_DIR}/nested_data.parquet"

def create_parquet():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    n = 100000
    random.seed(42)
    
    records = []
    for i in range(n):
        if i % 8 == 0:
            records.append(None)  # NULL list
        elif i % 6 == 0:
            records.append([])  # Empty list
        else:
            item_list = []
            for j in range(random.randint(1, 3)):
                item = {
                    'id': f'id_{i}_{j}',
                    'metadata': {
                        'source': 'SOURCE_A',
                        'type': None if (i + j) % 7 == 0 else 'TYPE_X',
                        'extra': None,  # Always NULL
                    },
                    'attributes': {
                        'name': None if (i + j) % 9 == 0 else 'Name',
                        'code': None,  # Always NULL
                        'value': None,  # Always NULL (float)
                    },
                    'tags': None if (i + j) % 10 == 0 else [],
                }
                item_list.append(item)
            records.append(item_list)
    
    metadata_struct = pa.struct([
        ('source', pa.string()),
        ('type', pa.string()),
        ('extra', pa.string()),
    ])
    attributes_struct = pa.struct([
        ('name', pa.string()),
        ('code', pa.string()),
        ('value', pa.float64()),
    ])
    item_struct = pa.struct([
        ('id', pa.string()),
        ('metadata', metadata_struct),
        ('attributes', attributes_struct),
        ('tags', pa.list_(pa.string())),
    ])
    
    schema = pa.schema([('items', pa.list_(item_struct))])
    table = pa.Table.from_pydict({'items': records}, schema=schema)
    pq.write_table(table, PARQUET_FILE)
    print(f"Created: {PARQUET_FILE}")

if __name__ == '__main__':
    create_parquet()

Start a PostgreSQL instance using Docker:

# Start PostgreSQL
docker rm -f pg-ducklake 2>/dev/null
docker run -d --name pg-ducklake -e POSTGRES_PASSWORD=test -p 5432:5432 postgres:16
sleep 5
PGPASSWORD=test psql -h localhost -U postgres -c "CREATE DATABASE ducklake_test;"

Then, in DuckDB:

FORCE INSTALL ducklake FROM core_nightly;
INSTALL postgres;

ATTACH 'ducklake:postgres:dbname=ducklake_test host=localhost user=postgres password=test' AS lake
    (DATA_PATH '/tmp/ducklake_repro/');

CREATE TABLE lake.test AS SELECT * FROM read_parquet('/tmp/ducklake_repro/nested_data.parquet') LIMIT 0;
CALL ducklake_add_data_files('lake', 'test', '/tmp/ducklake_repro/nested_data.parquet');

The expected result is that the Parquet file is registered normally. The actual result is:

TransactionContext Error: Failed to commit: Failed to commit DuckLake transaction.
Failed to flush changes into DuckLake: Failed to execute query "INSERT INTO ... 
ducklake_file_column_stats VALUES (..., 18446744073709476743, ...);
...": ERROR:  bigint out of range

Problem
When adding PyArrow-generated Parquet files with complex nested structures, the value_count statistics computation can underflow, producing values that exceed PostgreSQL's signed BIGINT limit and cause the transaction to fail.

Root causes:

  1. value_count was set directly to num_values, but semantically it appears it should be the count of non-null values (num_values - null_count).
  2. There was no validation for negative num_values or null_count read from the Parquet metadata (which could occur due to signed/unsigned conversion).
  3. There was no guard against null_count > num_values, which would cause underflow in the subtraction (see the illustration after this list).
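
To make root cause 3 concrete, here is a tiny standalone C++ illustration (not DuckLake code; the counter values are invented) of how an unsigned subtraction wraps around when null_count exceeds num_values. The wrapped result matches the shape of the value in the error above: 18446744073709476743 is exactly 2^64 - 74873.

#include <cstdint>
#include <cstdio>

int main() {
    // Invented counters where null_count exceeds num_values (malformed stats).
    uint64_t num_values = 100000;
    uint64_t null_count = 174873;
    // Unsigned subtraction wraps modulo 2^64: 100000 - 174873 becomes
    // 2^64 - 74873 = 18446744073709476743, which no longer fits in a signed
    // BIGINT column on the metadata backend.
    uint64_t value_count = num_values - null_count;
    printf("%llu\n", (unsigned long long) value_count);
    return 0;
}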

Changes
src/functions/ducklake_add_data_files.cpp

  • Added validation to reject negative num_values and null_count values when reading parquet metadata

src/storage/ducklake_transaction.cpp

  • Corrected the value_count computation to num_values - null_count (consistent with the fallback path and the IS_NOT_NULL filter semantics); see the sketch at the end of this section
  • Added a guard for null_count > num_values to prevent underflow

test/sql/add_files/add_files_nested_list_struct_nulls.test

  • Added test for nested List(Struct(...)) columns with various NULL patterns
  • Validates value_count = num_values - null_count computation
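
Putting the changes above together, a minimal sketch of the intended computation could look like the following. The helper name and error messages are hypothetical; the actual implementation lives in the files listed above.

#include <cstdint>
#include <stdexcept>

// Hypothetical helper: convert the raw (signed) Parquet counters into the
// value_count statistic, rejecting malformed metadata instead of letting the
// subtraction underflow.
uint64_t ComputeValueCount(int64_t num_values, int64_t null_count) {
    // Negative counters indicate corrupted or misinterpreted metadata.
    if (num_values < 0 || null_count < 0) {
        throw std::runtime_error("negative counter in Parquet column metadata");
    }
    // Guard against underflow in the unsigned subtraction below.
    if (null_count > num_values) {
        throw std::runtime_error("null_count exceeds num_values in Parquet column metadata");
    }
    // value_count is the number of non-NULL values.
    return static_cast<uint64_t>(num_values - null_count);
}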

Testing

  • Added an automated test that checks the value_count calculation and verifies that ducklake_add_data_files works.
  • Verified that the Parquet file generated by the reproduction script above registers successfully.
  • Verified that the fix works with the PostgreSQL metadata backend.

Costa-SM (Author) commented Jan 6, 2026

Hi @pdet, happy new year! :)
Just checking in to see if there is anything I can do to help move this PR along. I noticed that the CI is waiting for an approval to run. If you can trigger that, I'd be happy to jump on any necessary changes or fixes immediately.
Thanks!

pdet (Collaborator) commented Jan 7, 2026

Hi @Costa-SM, thanks for the PR and happy NY! Sorry for my belated answer!

I think the added code makes sense. However, I believe the tests you added do not trigger the underflow checks; I'm wondering if you have specific Parquet files that trigger those.

Costa-SM (Author) commented Jan 7, 2026

No problem about the delay. Thanks for your work maintaining the repo!
About the underflow checks, I don't have a Parquet file that triggers them; they would only matter for malformed data. If you think they're excessive, I can remove them and keep just the actual fix.

Also, a quick question (unrelated to the PR at hand): would you be interested in a PR that implements bucket partitioning? This is something I'm interested in, and I saw that there's also demand for it in the Discussions tab.

Costa-SM (Author) commented Jan 7, 2026

I accidentally deleted the base branch this PR was going to merge into while fixing my fork. 😞
I have reopened the PR as #675.

