Add custom MultiFileReader for reading delete files #641
Conversation
Force-pushed from 8dfed69 to c45e07f
pdet left a comment
Hi @mchataigner, thanks for the PR!
Could you add a MinIO test that demonstrates fewer requests are made?
Could you also retarget it to v1.4?
@pdet you're welcome. I will make the changes.
Force-pushed from c45e07f to b625d74
Add custom MultiFileReader for reading delete files
**Context**: We experience slow read performance when a table has many delete files.
**TL;DR**: We can leverage the metadata already available in DuckLake to improve load time of delete files.
**Problem & Motivation:**
DuckLake stores `file_size` metadata for both data and delete files. For data files, there is already a mechanism to forward this metadata to the MultiFileReader and the underlying filesystem. The Parquet reader requires this `file_size` to access the footer metadata. When using an `HTTPFileSystem` instance (e.g., for S3, Azure), it performs a HEAD request on the file if metadata fields (`file_size`, `etag`, `last_modified`) are not present. Since all files in DuckLake are immutable, we can apply the same optimization logic for delete files to avoid these unnecessary HEAD requests.
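To make the idea concrete, here is a minimal sketch (not the PR's code) of what pre-populating the open-file metadata looks like, assuming DuckDB's `OpenFileInfo`/`ExtendedOpenFileInfo` API from recent versions. The helper name `MakeOpenFileInfo`, the header paths, and the exact option keys consulted by `HTTPFileSystem` are assumptions for illustration:

```cpp
// Hedged sketch; header paths, option keys, and the helper name are approximations.
#include "duckdb/common/file_system.hpp"   // brings in OpenFileInfo / ExtendedOpenFileInfo
#include "duckdb/common/types/value.hpp"

namespace duckdb {

// Hypothetical helper: attach the metadata DuckLake already stores so that
// HTTPFileSystem finds file_size/etag/last_modified and can skip the HEAD request.
static OpenFileInfo MakeOpenFileInfo(const string &path, idx_t file_size_bytes) {
	OpenFileInfo info(path);
	auto extended = make_shared_ptr<ExtendedOpenFileInfo>();
	extended->options["file_size"] = Value::UBIGINT(file_size_bytes);      // known from the catalog
	extended->options["etag"] = Value("");                                 // placeholder: files are immutable
	extended->options["last_modified"] = Value::TIMESTAMP(timestamp_t(0)); // epoch placeholder
	info.extended_info = std::move(extended);
	return info;
}

} // namespace duckdb
```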
**Solution:**
Implements a custom multi-file reading solution that pre-populates file metadata to eliminate redundant storage HEAD requests when scanning delete files:
**Key Changes** (a rough sketch of how these fit together follows the list):
1. **New `DeleteFileFunctionInfo` struct**: Extends `TableFunctionInfo` to carry `DuckLakeFileData` metadata through the table function binding process.
2. **Custom `DeleteFileMultiFileReader` class**:
- Extends DuckDB's `MultiFileReader` to intercept file list creation
- Pre-populates `ExtendedOpenFileInfo` with metadata already available from DuckLake:
- File size (`file_size_bytes`)
- ETag (empty string as placeholder)
- Last modified timestamp (set to epoch)
- Encryption key (if present)
- Creates a `SimpleMultiFileList` with this extended info upfront
- Overrides `CreateFileList()` to return the pre-built list, bypassing DuckDB's default file discovery
3. **Modified `ScanDeleteFile()` method**:
- Changed `parquet_scan` from const reference to mutable copy to allow modification
- Attaches `DeleteFileFunctionInfo` and custom reader factory to the table function
- Passes the actual `parquet_scan` function to `TableFunctionBindInput` instead of a dummy function, ensuring proper function context
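A rough, hedged sketch of how the pieces above could fit together (not the actual PR code): the class and method names come from the list above, while the `CreateFileList` override signature, the header paths, and the `DuckLakeFileData` field names are assumptions that may differ by DuckDB version:

```cpp
// Hedged sketch only; signatures and field names are assumptions.
#include "duckdb/common/multi_file/multi_file_reader.hpp" // header path varies across DuckDB versions
#include "duckdb/function/table_function.hpp"

namespace duckdb {

// (1) Carries DuckLake's delete-file metadata through table function binding.
struct DeleteFileFunctionInfo : public TableFunctionInfo {
	explicit DeleteFileFunctionInfo(vector<DuckLakeFileData> files_p) : files(std::move(files_p)) {
	}
	vector<DuckLakeFileData> files; // assumed fields: path, file_size_bytes, optional encryption key
};

// (2) Hands DuckDB a pre-built file list instead of letting it glob/stat the paths.
class DeleteFileMultiFileReader : public MultiFileReader {
public:
	explicit DeleteFileMultiFileReader(vector<OpenFileInfo> files_p) : files(std::move(files_p)) {
	}

	// Build one pre-populated OpenFileInfo per delete file, using the same idea as
	// the earlier sketch: forward size/etag/last_modified metadata up front.
	static vector<OpenFileInfo> FromDuckLakeMetadata(const vector<DuckLakeFileData> &delete_files) {
		vector<OpenFileInfo> result;
		for (auto &file : delete_files) {
			OpenFileInfo info(file.path);
			auto extended = make_shared_ptr<ExtendedOpenFileInfo>();
			extended->options["file_size"] = Value::UBIGINT(file.file_size_bytes);
			extended->options["etag"] = Value("");                                 // placeholder
			extended->options["last_modified"] = Value::TIMESTAMP(timestamp_t(0)); // epoch
			// encryption key, if present, would be forwarded here as well
			info.extended_info = std::move(extended);
			result.push_back(std::move(info));
		}
		return result;
	}

	// Signature approximated; the point is to return the pre-built SimpleMultiFileList
	// so no globbing or HEAD requests happen during file discovery.
	shared_ptr<MultiFileList> CreateFileList(ClientContext &context, const vector<OpenFileInfo> &paths,
	                                         FileGlobOptions options) override {
		return make_shared_ptr<SimpleMultiFileList>(files);
	}

private:
	vector<OpenFileInfo> files;
};

// (3) In ScanDeleteFile() the pieces are attached to a *copy* of parquet_scan
//     (pseudocode; member names assumed):
//
//   auto parquet_scan = original_parquet_scan;                        // mutable copy
//   parquet_scan.function_info = make_shared_ptr<DeleteFileFunctionInfo>(delete_files);
//   // plus a factory hook so the scan uses DeleteFileMultiFileReader
//   TableFunctionBindInput bind_input(/*...,*/ parquet_scan /*, ...*/); // real function, not a dummy

} // namespace duckdb
```

Because the file list is built entirely from catalog metadata, the reader never has to stat or glob the delete files, which is where the HEAD requests were coming from.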
**Performance Impact**: Eliminates HEAD requests to object storage when opening Parquet delete files. This is particularly beneficial when working with remote storage (S3, Azure, etc.) and tables with many delete files, where HEAD requests were causing significant performance bottlenecks.
Force-pushed from b625d74 to db066de
@pdet sorry for the delay, I updated the PR with a format fix and added a test with MinIO.
pdet left a comment
Hi, thanks again for the changes, just a small comment wrt the test:
statement ok
SET autoinstall_known_extensions=1;

statement ok
SET autoload_known_extensions=1;
I think we can ensure the test only runs in the MinIO CI by having
require httpfs
require-env S3_TEST_SERVER_AVAILABLE 1
and S3_TEST_SERVER_AVAILABLE: 1 in the .github/workflows/MinIO.yml env.
I've done it @pdet 🫡
Then I think you don't need this:
statement ok
SET autoinstall_known_extensions=1;
statement ok
SET autoload_known_extensions=1;
Or am I wrong?
of course, done!
@pdet I believe I should re-open a new PR because it seems broken. The branch got wrongly deleted on our end, then I added a new commit to it and repushed it - I think GH is now lost. I cannot close this PR myself, but feel free to close it so I can reopen it.
Sure!
