Implement Top-N Dynamic Filter File Pruning from Min/Max Column Statistics #668
base: v1.4-andium
Conversation
This PR adds support for Top-N query optimization by leveraging min/max column statistics to prune data files during dynamic filter pushdown.
I happen to query our append-only dataset with `SELECT * FROM my_ducklake.main.events ORDER BY timestamp DESC LIMIT 10` and quickly ran into significant performance issues because no file pruning was done: the query ends up sorting all rows from all files just to get the top 10. If I'm not mistaken, in such a case we should be able to skip many data files that don't contain relevant data, based on their min/max values.
Here is what I've done:
- Updated `DuckLakeMetadataManager::GetFilesForTable()` to detect Top-N dynamic filters and retrieve their associated `min_value` & `max_value`, so that candidate files can be sorted accordingly:
  - For `ORDER BY col DESC LIMIT N`: files with the highest max values are processed first
  - For `ORDER BY col ASC LIMIT N`: files with the lowest min values are processed first
- Updated `DuckLakeMultiFileReader::InitializeReader()` to actually prune files whose min/max ranges indicate they cannot contain qualifying rows (see the sketch after this list)
- Added a test validating this with `EXPLAIN ANALYZE` diffs
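
To make the intended behaviour concrete, here is a minimal SQL sketch (not taken from the PR itself): the attach path, table, and timestamps are made up, and it assumes each `INSERT` transaction lands in its own data file so that every file has a disjoint `ts` range in its min/max statistics.

```sql
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake;

CREATE TABLE my_ducklake.main.events (id INTEGER, ts TIMESTAMP);
-- Assumption: each INSERT below produces its own data file, so the catalog
-- records a separate, disjoint min/max range for ts per file.
INSERT INTO my_ducklake.main.events VALUES (1, TIMESTAMP '2024-01-01'), (2, TIMESTAMP '2024-01-15');
INSERT INTO my_ducklake.main.events VALUES (3, TIMESTAMP '2024-02-01'), (4, TIMESTAMP '2024-02-15');
INSERT INTO my_ducklake.main.events VALUES (5, TIMESTAMP '2024-03-01'), (6, TIMESTAMP '2024-03-15');

-- ORDER BY ts DESC LIMIT 2: the file with the highest max(ts) is read first;
-- once the Top-N heap is full, the dynamic filter's threshold lets the reader
-- skip files whose max(ts) falls below it, purely from min/max statistics.
EXPLAIN ANALYZE
SELECT * FROM my_ducklake.main.events ORDER BY ts DESC LIMIT 2;
```

The test in the PR compares `EXPLAIN ANALYZE` output along these lines; the exact data here is illustrative only.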
In my own tests, this is a tremendous performance improvement, but I would love an expert's POV to make sure I'm not forgetting anything 😅
Let me know what y'all think!
Gosh, I need to handle the
... and realize this is still `FIXME` at the optimizer level 🆗
Oh ok, the TopN optimizer for
I dug a bit and I'm struggling with an uninitialized
@pdet I'm curious to hear your expert thoughts on this one 🙏
Hi @redox, I'm sorry, it's a bit of a busy week; I'll have a look in the next couple of days!
Take your time, I know those weeks! Just wanted to make sure it was on your radar 👌
Hi @redox, I think this is looking good. One thing I wanted to do before accepting new filter-pushdown PRs was to increase the filtering tests we have, as this would drastically reduce my anxiety about incorrectly skipping files, so maybe that's a nice thing to pick up in this PR. One nice thing is that DuckDB already has a quite complete set of filter tests that we could reuse; the difference, of course, is that for these tests to make a difference for DuckLake, we would need to force them to write many Parquet files.

My battle plan would be to add a CI run that stress-tests the DuckLake filter skipping (somewhat similar to https://github.com/duckdb/ducklake/blob/main/.github/workflows/ConfigTests.yml), but enforcing that our Parquet files are as small as possible, so we can stress-test the filter-skipping correctness of DuckLake. Hence a script to collect the filter-related tests would also be nice. I think one way of producing small Parquet files is if we set

Many tests use very small datasets, so having the vector size limited to 2048 could still leave things untested; I would probably try to do a DuckDB build with a vector size of 2, so we could get Parquet files with 2 rows. If this CI is too heavy, we can also limit the filter logic to one code/header file and only run the CI when those files change.

Let me know if this makes sense.
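
For illustration, here is a rough sketch of the kind of check such a stress test would make; the table name and values are hypothetical, and it assumes every single-row `INSERT` transaction ends up in its own tiny data file.

```sql
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:stress.ducklake' AS lake;

CREATE TABLE lake.main.t (i INTEGER);
-- Assumption: one INSERT per transaction yields one tiny data file, so each
-- file's min/max statistics for column i cover a single value.
INSERT INTO lake.main.t VALUES (1);
INSERT INTO lake.main.t VALUES (2);
INSERT INTO lake.main.t VALUES (3);
-- ... many more single-row inserts in a real stress run ...

-- A selective filter should only scan the file whose min/max bracket the
-- constant; the EXPLAIN ANALYZE output can then be checked for how many
-- files and rows were actually read.
EXPLAIN ANALYZE SELECT * FROM lake.main.t WHERE i = 2;
```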
Yes, this makes a lot of sense. I'll check how easily I could add more filtering tests, following your guidance.
@pdet without going for the CI run yet (maybe that's something you should own; it's always a bit tricky for a non-maintainer to iterate on those), I've added a test file that I believe stresses the filtering capabilities quite a bit.
Let me know how you feel about it!