
Conversation


redox commented Jan 5, 2026

This PR adds support for Top-N query optimization by leveraging min/max column statistics to prune data files during dynamic filter pushdown.

I happened to query our append-only dataset with `SELECT * FROM my_ducklake.main.events ORDER BY timestamp DESC LIMIT 10` and quickly ran into performance issues because no file pruning was done: the scan ends up sorting all rows from all files to get the top 10. If I'm not mistaken, in such a case we should be able to skip quite a few data files that don't contain relevant data, based on their min/max values (a minimal sketch of the scenario follows the list below).

Here is what I've done:
 - Updated `DuckLakeMetadataManager::GetFilesForTable()` to detect Top-N dynamic filters and retrieve their associated `min_value` & `max_value`, so the candidate files can later be sorted accordingly.
    - For `ORDER BY col DESC LIMIT N`: files with the highest max values are processed first
    - For `ORDER BY col ASC LIMIT N`: files with the lowest min values are processed first
 - Updated `DuckLakeMultiFileReader::InitializeReader()` to actually prune files whose min/max ranges indicate they cannot contain qualifying rows
 - Added a test validating this with `EXPLAIN ANALYZE` diffs
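
For concreteness, here is a minimal end-to-end sketch of the scenario; the attach path, schema, and data below are illustrative assumptions, not our actual setup:

```sql
INSTALL ducklake;
LOAD ducklake;
-- Hypothetical catalog and data paths, just for illustration.
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake (DATA_PATH 'data_files/');
CREATE TABLE my_ducklake.main.events (id BIGINT, ts TIMESTAMP);

-- Each INSERT creates a new snapshot and lands in its own data file(s),
-- so the files end up with disjoint min/max statistics on ts.
INSERT INTO my_ducklake.main.events
    SELECT r, TIMESTAMP '2026-01-01 00:00:00' + INTERVAL 1 SECOND * r
    FROM range(100000) t(r);
INSERT INTO my_ducklake.main.events
    SELECT r, TIMESTAMP '2026-01-05 00:00:00' + INTERVAL 1 SECOND * r
    FROM range(100000) t(r);

-- With the Top-N dynamic filter pushed down, files whose max(ts) is below the
-- current top-10 threshold cannot contribute rows and can be skipped instead
-- of being scanned and sorted.
EXPLAIN ANALYZE SELECT * FROM my_ducklake.main.events ORDER BY ts DESC LIMIT 10;
```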

In my own tests, this is a tremendous performance improvement, but I would love an expert's POV to make sure I'm not forgetting anything 😅

Let me know what y'all think!


redox commented Jan 5, 2026

Gosh, I need to handle the NULLS FIRST/LAST as well

redox changed the title from "Implement Top-N Dynamic Filter File Pruning from Min/Max Column Statics" to "Implement Top-N Dynamic Filter File Pruning from Min/Max Column Statistics" on Jan 5, 2026
... and realize this is still `FIXME` at the optimizer level 🆗

redox commented Jan 5, 2026

> Gosh, I need to handle the NULLS FIRST/LAST as well

Oh OK, the TopN optimizer for NULLS FIRST/LAST is disabled in 1.4 😅 but it seems to be implemented in main with the new `ConjunctionOrFilter` (I'll check what it takes to support it)


redox commented Jan 5, 2026

> Oh OK, the TopN optimizer for NULLS FIRST/LAST is disabled in 1.4 😅 but it seems to be implemented in main with the new `ConjunctionOrFilter` (I'll check what it takes to support it)

I dug into this a bit and I'm struggling with an uninitialized `DynamicFilterData`. @Mytherin, does this ring a bell? I read duckdb/duckdb@f85397a but it's not clear to me why `setValue` has this specific handling of NULL, which I believe has an impact here because I end up dealing with a non-initialized filter.

redox marked this pull request as ready for review on January 6, 2026 at 16:54

redox commented Jan 12, 2026

@pdet I'm curious to hear your expert thoughts on this one 🙏


pdet commented Jan 13, 2026

Hi @redox, I'm sorry, it's a bit of a busy week; I'll have a look in the next couple of days!


redox commented Jan 13, 2026

> Hi @redox, I'm sorry, it's a bit of a busy week; I'll have a look in the next couple of days!

Take your time, I know those weeks! Just wanted to make sure it was on your radar 👌


pdet commented Jan 14, 2026

Hi @redox, I think this is looking good. One thing I wanted to do before accepting new filter-pushdown PRs was to increase the filtering tests we have, as that would drastically reduce my anxiety about incorrectly skipping files, so maybe that's a nice thing to pick up in this PR.

One nice thing is that DuckDB already has a fairly complete test load for filters that we could reuse; the difference, of course, is that for these tests to make a difference for DuckLake, we would need to force them to write many Parquet files.

My battle plan would be to add a CI run that stress-tests the DuckLake filter skipping (somewhat similar to https://github.com/duckdb/ducklake/blob/main/.github/workflows/ConfigTests.yml), but enforcing that our Parquet files are as small as possible, so we can stress-test the filter-skipping correctness of DuckLake. Hence, a script to collect filter-related tests would also be nice.

I think one way of producing small Parquet files is to set `ROW_GROUP_SIZE 1` and `ROW_GROUPS_PER_FILE 1` in the Parquet writer; that should limit each Parquet file to one vector's worth of rows. By default the vector size is 2048. The problem is that this might cause overshooting, but I think setting `ROW_GROUPS_PER_FILE 1` should resolve that, or, worst case, setting `threads = 1`.
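
For example, something along these lines (the table name and output path are placeholders; `ROW_GROUP_SIZE 1` still gets rounded up to one vector, i.e. 2048 rows by default):

```sql
-- Write the reference data as many tiny Parquet files: one row group per file,
-- each row group capped at a single vector. Names and paths are placeholders.
SET threads = 1;  -- single writer, so we don't overshoot the row-group size
COPY events_ref TO 'events_parquet'
    (FORMAT parquet, ROW_GROUP_SIZE 1, ROW_GROUPS_PER_FILE 1);
```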

Many tests use very small datasets, so having the vector size limited to 2048 could still leave things untested; I would probably try to do a DuckDB build with a vector size of 2, so we could get Parquet files with 2 rows. If this CI is too heavy, we can also limit the filter logic to one code/header file and only run the CI when those files change.

Let me know if this makes sense.


redox commented Jan 14, 2026

> Let me know if this makes sense.

Yes, this makes a lot of sense. I'll check how easily I can add more filtering tests, following your guidance.


redox commented Jan 14, 2026

@pdet without going for the CI run yet (maybe that's something you should own; it's always a bit tricky for a non-maintainer to iterate on those), I've added a test file that I believe stresses the filtering capabilities quite a bit.

 - It builds a reference table `events_ref` in DuckDB with a mix of types
 - It then writes the reference data into many Parquet files (`ROW_GROUP_SIZE 2048`, `ROW_GROUPS_PER_FILE 1`) (I initially made `ROW_GROUP_SIZE` smaller, but I think it gives nicer results without)
 - It registers those files into a DuckLake `events` table via `ducklake_add_data_files(...)`
 - For a bunch of queries inspired by the DuckDB tests, it asserts that DuckLake's `events` returns exactly the same rows as `events_ref` by using `EXCEPT ALL` in both directions, so it catches both missing rows and extra rows (sketched below)
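
The core assertion looks roughly like this (the catalog and table names follow the description above; the test-harness boilerplate around each query is omitted):

```sql
-- The DuckLake table must not return any row the reference table lacks...
SELECT * FROM my_ducklake.main.events EXCEPT ALL SELECT * FROM events_ref;
-- ...and the reference table must not return any row the DuckLake table lacks.
SELECT * FROM events_ref EXCEPT ALL SELECT * FROM my_ducklake.main.events;
-- Both queries are expected to come back empty; the same two-way check is
-- repeated for each filtered query under test, with the predicate on both sides.
```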

Let me know how you feel about it!
