Add custom MultiFileReader for reading delete files #674
base: v1.4-andium
Conversation
pdet
left a comment
Thanks, a couple more comments!
require parquet

statement ok
ATTACH 'ducklake:${DUCKLAKE_CONNECTION}' AS ducklake (DATA_PATH 's3://mybucket')
Other small nit
test-env DATA_PATH __TEST_DIR__
statement ok
ATTACH 'ducklake:${DUCKLAKE_CONNECTION}' AS ducklake (DATA_PATH '${DATA_PATH}/delete_metadata')
This test will still only be executed with minio due to the require-env, but it could eventually be executed in different scenarios in the future.
Yes actually we cannot do this because of duckdb/duckdb#20396 - once the 1.4 branch updates its dependency to a more recent duckdb 1.4 tag - we'll be able to do this.
query II
EXPLAIN ANALYZE SELECT COUNT(*) FILTER(WHERE id%2=0) FROM ducklake.test
----
analyzed_plan <REGEX>:.*#HEAD: 0.*

# we can time travel to see the state of the table before deletes
query II
EXPLAIN ANALYZE SELECT COUNT(*) FILTER(WHERE id%2=0) FROM ducklake.test AT (VERSION => 2)
----
analyzed_plan <REGEX>:.*#HEAD: 0.*
I've actually got this branch locally and did a build without your changes in src/storage/ducklake_delete_filter.cpp, and the tests seem to have the same effect? Maybe I'm missing something?
Hmm 🤔 I'll check this again on behalf of Mathieu who initially opened this MR. Thank you for this @pdet, I should have double-checked as well 🫡
@pdet Because of our local development setups we had a lot of trouble reproducing failure / success with and without the fix. That being said, it looks like the base commit we were using might have caused the flakiness you observed.
I went and created a Dockerfile (based on the Minio.yml workflow config to be as close as possible to the CI):
# Based on the MinIO.yml workflow configuration
FROM ubuntu:latest
# Prevent interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive
# Install required Ubuntu packages
RUN apt-get update -y -qq && \
apt-get install -y -qq \
software-properties-common \
curl \
git \
zip \
unzip \
tar \
pkg-config && \
add-apt-repository ppa:git-core/ppa && \
apt-get update -y -qq && \
apt-get install -y -qq \
build-essential \
cmake \
ninja-build \
ccache \
python3 \
clang \
llvm && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /workspace
# Copy the project
COPY . .
# Setup vcpkg
RUN git clone https://github.com/microsoft/vcpkg.git /workspace/vcpkg && \
cd /workspace/vcpkg && \
git checkout ce613c41372b23b1f51333815feb3edd87ef8a8b && \
./bootstrap-vcpkg.sh
# Set build environment variables
ENV VCPKG_TARGET_TRIPLET=x64-linux-release
ENV GEN=ninja
ENV VCPKG_FEATURE_FLAGS=-binarycaching
ENV VCPKG_TOOLCHAIN_PATH=/workspace/vcpkg/scripts/buildsystems/vcpkg.cmake
ENV BUILD_EXTENSION_TEST_DEPS=full
ENV S3_TEST_SERVER_AVAILABLE=1
ENV CC=clang
ENV CXX=clang++
# Show versions for verification
RUN ninja --version && \
cmake --version && \
clang --version
# Build the project
RUN make release
# Default command
CMD ["/bin/bash"]
Building the image with

docker build -t ducklake-build .

and running the test (with a minio instance running on localhost:9000) with

docker run --network host -it ducklake-build bash -c "S3_TEST_SERVER_AVAILABLE=1 ./build/release/test/unittest --test-config test/configs/minio.json test/sql/delete/delete_metadata.test"

shows that:
- Without the `ducklake_delete_filter.cpp` fix of this PR, the test fails
- With the fix, the test succeeds.
extended_info->options["etag"] = Value("");
extended_info->options["last_modified"] = Value::TIMESTAMP(timestamp_t(0));
@samansmink since you are more aware than I am of the effects of S3 files in the multifile reader, can this have any ill effects?
Closing this for now; the bug is somewhere between https://github.com/duckdb/ducklake and https://github.com/dentiny/duck-read-cache-fs - we need to figure out how to test this from here if we want to fix it here.
Add custom MultiFileReader for reading delete files
**Context**: We experience slow read performance when a table has many delete files.
**TL;DR**: We can leverage the metadata already available in DuckLake to improve load time of delete files.
**Problem & Motivation:**
DuckLake stores `file_size` metadata for both data and delete files. For data files, there is already a mechanism to forward this metadata to the MultiFileReader and the underlying filesystem. The Parquet reader requires this `file_size` to access the footer metadata. When using an `HTTPFileSystem` instance (e.g., for S3, Azure), it performs a HEAD request on the file if metadata fields (`file_size`, `etag`, `last_modified`) are not present. Since all files in DuckLake are immutable, we can apply the same optimization logic for delete files to avoid these unnecessary HEAD requests.
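Concretely, the HEAD request is only skipped when the open-file metadata is already populated before the filesystem is asked to open the file. The following is a minimal sketch of that pre-population, assuming DuckDB's `OpenFileInfo`/`ExtendedOpenFileInfo` structs; the helper name and the exact option key strings are illustrative assumptions, not the code of this PR:

```cpp
// Sketch only: build an OpenFileInfo whose extended_info already carries the
// fields HTTPFileSystem would otherwise fetch with a HEAD request.
// The helper name and option key strings are assumptions for illustration.
static OpenFileInfo PrepopulatedDeleteFile(const string &path, idx_t file_size_bytes) {
	OpenFileInfo result(path);
	result.extended_info = make_shared_ptr<ExtendedOpenFileInfo>();
	// the Parquet reader needs the file size to locate the footer without probing
	result.extended_info->options["file_size"] = Value::UBIGINT(file_size_bytes);
	// DuckLake files are immutable, so placeholder values are enough to avoid
	// a freshness check against the object store
	result.extended_info->options["etag"] = Value("");
	result.extended_info->options["last_modified"] = Value::TIMESTAMP(timestamp_t(0));
	return result;
}
```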
**Solution:**
Implements a custom multi-file reading solution that pre-populates file metadata to eliminate redundant storage HEAD requests when scanning delete files:
**Key Changes:**
1. **New `DeleteFileFunctionInfo` struct**: Extends `TableFunctionInfo` to carry `DuckLakeFileData` metadata through the table function binding process.
2. **Custom `DeleteFileMultiFileReader` class** (see the sketch after this list):
- Extends DuckDB's `MultiFileReader` to intercept file list creation
- Pre-populates `ExtendedOpenFileInfo` with metadata already available from DuckLake:
- File size (`file_size_bytes`)
- ETag (empty string as placeholder)
- Last modified timestamp (set to epoch)
- Encryption key (if present)
- Creates a `SimpleMultiFileList` with this extended info upfront
- Overrides `CreateFileList()` to return the pre-built list, bypassing DuckDB's default file discovery
3. **Modified `ScanDeleteFile()` method**:
- Changed `parquet_scan` from const reference to mutable copy to allow modification
- Attaches `DeleteFileFunctionInfo` and custom reader factory to the table function
- Passes the actual `parquet_scan` function to `TableFunctionBindInput` instead of a dummy function, ensuring proper function context
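As a rough sketch of how the struct and the reader described above could fit together (the `MultiFileReader::CreateFileList` and `SimpleMultiFileList` signatures vary between DuckDB versions, so treat this as an outline rather than the exact implementation):

```cpp
// Carries DuckLake's per-file metadata through table function binding.
// The member layout is an assumption for illustration.
struct DeleteFileFunctionInfo : public TableFunctionInfo {
	vector<DuckLakeFileData> delete_files;
};

// Returns a pre-built file list instead of letting DuckDB glob and stat paths.
class DeleteFileMultiFileReader : public MultiFileReader {
public:
	explicit DeleteFileMultiFileReader(vector<OpenFileInfo> prebuilt_files_p)
	    : prebuilt_files(std::move(prebuilt_files_p)) {
	}

	// Signature approximated; check the MultiFileReader base class of the DuckDB version in use.
	shared_ptr<MultiFileList> CreateFileList(ClientContext &context, const vector<OpenFileInfo> &paths,
	                                         FileGlobOptions options) override {
		// every entry already has extended_info filled in (see the earlier sketch),
		// so no HEAD requests are issued when the files are opened
		return make_shared_ptr<SimpleMultiFileList>(prebuilt_files);
	}

private:
	vector<OpenFileInfo> prebuilt_files;
};
```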
**Performance Impact**: Eliminates HEAD requests to object storage when opening Parquet delete files. This is particularly beneficial when working with remote storage (S3, Azure, etc.) and tables with many delete files, where HEAD requests were causing significant performance bottlenecks.
Force-pushed from 5ef6bfb to 42a586b
Aha, we actually have a very similar PR to iceberg; code-wise it looks very similar. Maybe the test we have there would be a better fit?
This follows #641, which we closed to work around a GH issue.
We noticed that read performance can degrade when a table accumulates a large number of delete files, especially when those files live on remote object storage.
TLDR; We can actually speed things up by reusing metadata that DuckLake already has, avoiding a bunch of unnecessary storage `HEAD` requests when loading delete files.
For data files, this metadata is passed down to DuckDB’s `MultiFileReader` and ultimately to the filesystem layer. This matters because the Parquet reader needs the file size to locate footer metadata efficiently. However, for delete files, this optimization wasn’t applied yet. When using an `HTTPFileSystem` (e.g. S3 or Azure), missing metadata means DuckDB issues a `HEAD` request per file to fetch it. Since DuckLake files are immutable, those extra requests are pure overhead, and they really add up when a table has many delete files.
This PR introduces a custom multi-file reading path for delete files that eagerly injects the metadata we already have, eliminating redundant storage calls.
What Changed
New `DeleteFileFunctionInfo` struct
Extends `TableFunctionInfo` to carry `DuckLakeFileData` metadata through the table function binding phase.

Custom `DeleteFileMultiFileReader`
A specialized `MultiFileReader` that:
- Pre-populates `ExtendedOpenFileInfo` with known metadata: the file size (`file_size_bytes`), placeholder ETag and last-modified values, and the encryption key if present
- Builds a `SimpleMultiFileList` upfront
- Overrides `CreateFileList()` to return this pre-built list, skipping DuckDB’s default discovery logic entirely

Updates to `ScanDeleteFile()` (a rough sketch follows below)
- Changes `parquet_scan` from a const reference to a mutable copy so it can be customized
- Attaches the `DeleteFileFunctionInfo` and the custom reader factory to the table function
- Passes the real `parquet_scan` function into `TableFunctionBindInput` (instead of a dummy), ensuring the correct execution context

The win is most noticeable on remote storage backends (S3, Azure, etc.) and for tables with many delete files, where these requests were a significant source of latency.
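For concreteness, here is a loose sketch of the `ScanDeleteFile()` wiring described above, under the assumption that `TableFunction` can be copied by value, carries a `function_info` member, and exposes a hook for supplying a custom `MultiFileReader` factory; the hook's member name, `GetParquetScan`, and the `DuckLakeFileData` fields shown are placeholders rather than the exact DuckDB/DuckLake API:

```cpp
// Factory handed to the copied parquet_scan; it builds the metadata-aware reader
// from the DeleteFileFunctionInfo attached during binding.
static unique_ptr<MultiFileReader> CreateDeleteFileReader(const TableFunction &table_function) {
	auto &info = dynamic_cast<DeleteFileFunctionInfo &>(*table_function.function_info);
	vector<OpenFileInfo> files;
	for (auto &file : info.delete_files) {
		// reuse the pre-population helper from the first sketch
		files.push_back(PrepopulatedDeleteFile(file.path, file.file_size_bytes));
	}
	return make_uniq<DeleteFileMultiFileReader>(std::move(files));
}

// Simplified stand-in for the body of ScanDeleteFile(): take a mutable copy of
// parquet_scan, attach the info and the factory, then bind with the real function.
void ScanDeleteFileSketch(ClientContext &context, const DuckLakeFileData &delete_file) {
	TableFunction parquet_scan = GetParquetScan(context);        // placeholder helper
	auto info = make_shared_ptr<DeleteFileFunctionInfo>();
	info->delete_files.push_back(delete_file);                   // metadata from the DuckLake catalog
	parquet_scan.function_info = std::move(info);
	parquet_scan.get_multi_file_reader = CreateDeleteFileReader; // hook name assumed
	// ... a TableFunctionBindInput is then constructed with this parquet_scan,
	// not a dummy function, so binding runs in the correct context ...
}
```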
Net result: faster reads, fewer network round-trips, and better scalability ✨