
Conversation

@Costa-SM

Disclaimer
I don't have much context around DuckDB + DuckLake yet (I just started using both last week), so I got Claude Opus 4.5 to do most of the heavy lifting when debugging and writing the code for this PR, and then proofread the results. I'm happy to receive feedback and work on any changes that are deemed necessary.

I ended up not opening an issue for this since I already have the fix in the PR. If an issue is a prerequisite, I can take the time to open one. Also, hi! This is my first issue/PR here :)

Summary
Fixes a bug where ducklake_add_data_files fails with a "bigint out of range" error when registering Parquet files that contain List(Struct(...)) columns with NULL values at various nesting levels.

Issue Reproduction Details

  • DuckDB version: 1.4.3
  • DuckLake version: core_nightly (tested 2024-12-27)
  • PostgreSQL version: 16
  • OS: Linux
  • Python 3.11

We can use this Python script to generate a Parquet file that will trigger the failure.

#!/usr/bin/env python3
"""Minimal reproduction for DuckLake bigint overflow bug."""
import pyarrow as pa
import pyarrow.parquet as pq
import random
import os

OUTPUT_DIR = "/tmp/ducklake_repro"
PARQUET_FILE = f"{OUTPUT_DIR}/nested_data.parquet"

def create_parquet():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    n = 100000
    random.seed(42)
    
    records = []
    for i in range(n):
        if i % 8 == 0:
            records.append(None)  # NULL list
        elif i % 6 == 0:
            records.append([])  # Empty list
        else:
            item_list = []
            for j in range(random.randint(1, 3)):
                item = {
                    'id': f'id_{i}_{j}',
                    'metadata': {
                        'source': 'SOURCE_A',
                        'type': None if (i + j) % 7 == 0 else 'TYPE_X',
                        'extra': None,  # Always NULL
                    },
                    'attributes': {
                        'name': None if (i + j) % 9 == 0 else 'Name',
                        'code': None,  # Always NULL
                        'value': None,  # Always NULL (float)
                    },
                    'tags': None if (i + j) % 10 == 0 else [],
                }
                item_list.append(item)
            records.append(item_list)
    
    metadata_struct = pa.struct([
        ('source', pa.string()),
        ('type', pa.string()),
        ('extra', pa.string()),
    ])
    attributes_struct = pa.struct([
        ('name', pa.string()),
        ('code', pa.string()),
        ('value', pa.float64()),
    ])
    item_struct = pa.struct([
        ('id', pa.string()),
        ('metadata', metadata_struct),
        ('attributes', attributes_struct),
        ('tags', pa.list_(pa.string())),
    ])
    
    schema = pa.schema([('items', pa.list_(item_struct))])
    table = pa.Table.from_pydict({'items': records}, schema=schema)
    pq.write_table(table, PARQUET_FILE)
    print(f"Created: {PARQUET_FILE}")

if __name__ == '__main__':
    create_parquet()

Start a PostgreSQL instance using Docker:

# Start PostgreSQL
docker rm -f pg-ducklake 2>/dev/null
docker run -d --name pg-ducklake -e POSTGRES_PASSWORD=test -p 5432:5432 postgres:16
sleep 5
PGPASSWORD=test psql -h localhost -U postgres -c "CREATE DATABASE ducklake_test;"

Then, in DuckDB:

FORCE INSTALL ducklake FROM core_nightly;
INSTALL postgres;

ATTACH 'ducklake:postgres:dbname=ducklake_test host=localhost user=postgres password=test' AS lake
    (DATA_PATH '/tmp/ducklake_repro/');

CREATE TABLE lake.test AS SELECT * FROM read_parquet('/tmp/ducklake_repro/nested_data.parquet') LIMIT 0;
CALL ducklake_add_data_files('lake', 'test', '/tmp/ducklake_repro/nested_data.parquet');

The expected result is that the Parquet file is registered normally. The actual result is:

TransactionContext Error: Failed to commit: Failed to commit DuckLake transaction.
Failed to flush changes into DuckLake: Failed to execute query "INSERT INTO ... 
ducklake_file_column_stats VALUES (..., 18446744073709476743, ...);
...": ERROR:  bigint out of range

Problem
When adding PyArrow-generated Parquet files with complex nested structures, the value_count statistics computation can underflow, producing values that exceed PostgreSQL's signed BIGINT limit and cause the transaction to fail.

Root causes:

  1. value_count was set directly to num_values, but semantically it appears it should be the count of non-null values (num_values - null_count).
  2. There was no validation for negative num_values or null_count read from the Parquet metadata (which could occur due to signed/unsigned conversion).
  3. There was no guard against null_count > num_values, which would cause underflow in the subtraction (see the illustration after this list).
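
To make root cause 3 concrete, here is a tiny standalone C++ illustration (not DuckLake code; the counter values are invented) of how an unsigned subtraction wraps around when null_count exceeds num_values. The wrapped result matches the shape of the value in the error above: 18446744073709476743 is exactly 2^64 - 74873.

#include <cstdint>
#include <cstdio>

int main() {
    // Invented counters where null_count exceeds num_values (malformed stats).
    uint64_t num_values = 100000;
    uint64_t null_count = 174873;
    // Unsigned subtraction wraps modulo 2^64: 100000 - 174873 becomes
    // 2^64 - 74873 = 18446744073709476743, which no longer fits in a signed
    // BIGINT column on the metadata backend.
    uint64_t value_count = num_values - null_count;
    printf("%llu\n", (unsigned long long) value_count);
    return 0;
}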

Changes
src/functions/ducklake_add_data_files.cpp

  • Added validation to reject negative num_values and null_count values when reading parquet metadata

src/storage/ducklake_transaction.cpp

  • Corrected the value_count computation to num_values - null_count (consistent with the fallback path and the IS_NOT_NULL filter semantics); see the sketch at the end of this section
  • Added a guard for null_count > num_values to prevent underflow

test/sql/add_files/add_files_nested_list_struct_nulls.test

  • Added test for nested List(Struct(...)) columns with various NULL patterns
  • Validates value_count = num_values - null_count computation
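
Putting the changes above together, a minimal sketch of the intended computation could look like the following. The helper name and error messages are hypothetical; the actual implementation lives in the files listed above.

#include <cstdint>
#include <stdexcept>

// Hypothetical helper: convert the raw (signed) Parquet counters into the
// value_count statistic, rejecting malformed metadata instead of letting the
// subtraction underflow.
uint64_t ComputeValueCount(int64_t num_values, int64_t null_count) {
    // Negative counters indicate corrupted or misinterpreted metadata.
    if (num_values < 0 || null_count < 0) {
        throw std::runtime_error("negative counter in Parquet column metadata");
    }
    // Guard against underflow in the unsigned subtraction below.
    if (null_count > num_values) {
        throw std::runtime_error("null_count exceeds num_values in Parquet column metadata");
    }
    // value_count is the number of non-NULL values.
    return static_cast<uint64_t>(num_values - null_count);
}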

Testing

  • Added an automated test that checks the value_count calculation and verifies that ducklake_add_data_files works.
  • Verified that the Parquet file generated by the reproduction script above registers successfully.
  • Verified that the fix works with the PostgreSQL metadata backend.

Costa-SM (Author) commented Jan 6, 2026

Hi @pdet, happy new year! :)
Just checking in to see if there is anything I can do to help move this PR along. I noticed that the CI is waiting for an approval to run. If you can trigger that, I'd be happy to jump on any necessary changes or fixes immediately.
Thanks!

pdet (Collaborator) commented Jan 7, 2026

Hi @Costa-SM, thanks for the PR and happy NY! Sorry for my belated answer!

I think the added code makes sense. However, I believe the tests you added do not trigger the underflow checks; I'm wondering if you have specific Parquet files that trigger those.

Costa-SM (Author) commented Jan 7, 2026

No problem about the delay. Thanks for your work maintaining the repo!
About the underflow checks, I don't have a Parquet file that triggers them; they would only matter for malformed data. If you think they're excessive, I can remove them and keep just the actual fix.

Also, a quick question (unrelated to the PR at hand): would you be interested in a PR that implements bucket partitioning? This is something I'm interested in, and I saw that there's also demand for it in the Discussions tab.

Costa-SM (Author) commented Jan 7, 2026

I accidentally deleted the base branch this PR was going to merge into while fixing my fork. 😞
I have reopened the PR as #675.

