
Add late pruning of Parquet files based on file level statistics #16014


Merged: 2 commits merged into apache:main from the late-pruning-files branch on Jun 10, 2025

Conversation

@adriangb (Contributor) commented May 10, 2025

@github-actions bot added the optimizer, core, and datasource labels on May 10, 2025
@adriangb (Contributor Author) commented:

A couple of thoughts:

  1. Needs cleanup.
  2. Not sure how to construct the empty stream.
  3. It might be nice to implement pruning for Vec<Statistics> where each statistic represents an arbitrary container (e.g. partition or file).

@alamb (Contributor) commented May 11, 2025

It might be nice to implement pruning for Vec<Statistics> where each statistic represents an arbitrary container (e.g. partition or file).

Yes this would be super nice -- the more we can do to consolidate statistics / pruning the better off the code will be I think. Right now it is kind of scattered in several places

@alamb (Contributor) commented May 11, 2025

Not sure how to construct the empty stream.

You can use something like https://docs.rs/futures/latest/futures/stream/fn.iter.html perhaps -- like futures::stream::iter(vec![]) for example 🤔
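A minimal, self-contained sketch of that suggestion (the item type here is a placeholder, not DataFusion's Result<RecordBatch>): futures::stream::iter over an empty Vec produces a stream that terminates immediately, which is what a fully pruned file needs to return.

use futures::stream::{BoxStream, StreamExt};

// Placeholder item type standing in for Result<RecordBatch, DataFusionError>.
type Item = Result<u64, String>;

fn empty_stream() -> BoxStream<'static, Item> {
    // An iterator over an empty Vec yields a stream with no items.
    futures::stream::iter(Vec::<Item>::new()).boxed()
}

fn main() {
    let mut stream = empty_stream();
    // The stream ends immediately without yielding anything.
    assert!(futures::executor::block_on(stream.next()).is_none());
}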

@@ -367,7 +368,7 @@ impl Default for OnError {
pub trait FileOpener: Unpin + Send + Sync {
/// Asynchronously open the specified file and return a stream
/// of [`RecordBatch`]
fn open(&self, file_meta: FileMeta) -> Result<FileOpenFuture>;
fn open(&self, file_meta: FileMeta, file: PartitionedFile) -> Result<FileOpenFuture>;
Contributor:

Isn't it sufficient to provide only file statistics? PartitionedFile seems like an overkill to me

@adriangb (Contributor Author) commented May 11, 2025:

Maybe? But I feel like since we have the partitioned file we might as well pass it in. Maybe we use it in the future to enable optimizations that use the partition values (e.g. late pruning based on partition values, including partition values in the scan so that more filters can be evaluated, etc.)

Contributor:

I think using PartitionedFile as the "data we have at plan time" including statistics and potentially information about size, encryption, special indexes, etc makes a lot of sense

Contributor:

Maybe? But I feel like since we have the partitioned file we might as well pass it in. Maybe we use it in the future to enable optimizations that use the partition values (e.g. late pruning based on partition values, including partition values in the scan so that more filters can be evaluated, etc.)

I believe these can also be inferred from statistics in a more generalized fashion (I don't know whether partition columns exist in column_statistics now), but it's not a big deal, we can keep this 👍🏻

Contributor:

Can you please update the documentation for open() to mention that file has plan-time per-file information (such as statistics) and leave a doc link back?

@berkaysynnada (Contributor) left a comment

The idea makes a lot of sense. I have one implementation suggestion. Thanks again @adriangb

@adriangb adriangb marked this pull request as ready for review May 11, 2025 23:09
@adriangb adriangb force-pushed the late-pruning-files branch from 0e03bdc to 94726cc on May 11, 2025 23:10
@adriangb (Contributor Author) commented:

@alamb please review again, I implemented it and added a test 😄

(Some(stats), Some(predicate)) => {
let pruning_predicate = build_pruning_predicate(
Arc::clone(predicate),
&self.table_schema,
Member:

Should it use table_schema here?

Comment on lines 93 to 94
match (&file.statistics, &self.predicate) {
(Some(stats), Some(predicate)) => {
Member:

Given that there is only one branch, I suggest using if let (Some(_), Some(_)) = xxx {} here.
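A tiny sketch of that suggestion with placeholder types (not the PR's actual fields): the single match arm becomes an if let over the tuple of Options.

// Placeholder types standing in for the file statistics and the predicate.
fn maybe_prune(statistics: Option<&str>, predicate: Option<&str>) {
    if let (Some(stats), Some(predicate)) = (statistics, predicate) {
        // Build the pruning predicate from `predicate` and evaluate it
        // against `stats` here; no other match arms are needed.
        println!("pruning with {stats} and {predicate}");
    }
}

fn main() {
    maybe_prune(Some("file stats"), Some("a > 5"));
    maybe_prune(None, Some("a > 5")); // no statistics: nothing to prune on
}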

@alamb (Contributor) left a comment

Very cool -- I think this is very close

@@ -367,7 +368,7 @@ impl Default for OnError {
pub trait FileOpener: Unpin + Send + Sync {
/// Asynchronously open the specified file and return a stream
/// of [`RecordBatch`]
fn open(&self, file_meta: FileMeta) -> Result<FileOpenFuture>;
fn open(&self, file_meta: FileMeta, file: PartitionedFile) -> Result<FileOpenFuture>;
Contributor:

Can you please update the documentation for open() to mention that file has plan-time per-file information (such as statistics) and leave a doc link back?

}
}

/// Returns [`BooleanArray`] where each row represents information known
Contributor:

this comment can probably be trimmed with a link back to the original trait source

@@ -995,6 +996,184 @@ fn build_statistics_record_batch<S: PruningStatistics>(
})
}

/// Prune a set of containers represented by their statistics.
Contributor:

This is a nice structure -- I think it makes lots of sense and is 100%

Specifically, I thought there was already code that pruned individual files based on statistics but I could not find any in ListingTable (we have something like this in influxdb_iox).

My opinion is that if we are going to add this code into the DataFusion codebase we should:

  1. Ensure that it helps as many users as possible
  2. Make sure it is executed as much as possible (to ensure test coverage)

Thus, what do you think about using the PrunableStatistics to prune the FileGroup in ListingTable here:

https://github.com/apache/datafusion/blob/55ba4cadce5ea99de4361929226f1c99cfc94450/datafusion/core/src/datasource/listing/table.rs#L1117-L1116

?

Pruning on statistics during plan time would potentially be redundant with also trying to prune again during opening, but it would reduce the files earlier in the plan
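Roughly, the shape of that idea looks like the sketch below (names invented for illustration, not the ListingTable code): a pruning predicate evaluated over per-file statistics yields one keep/skip decision per container, and the file list is filtered with that mask before execution.

// Hypothetical helper: `keep` has one entry per file; true means the file
// may contain matching rows and must be kept.
fn prune_files<T>(files: Vec<T>, keep: &[bool]) -> Vec<T> {
    files
        .into_iter()
        .zip(keep.iter())
        .filter_map(|(file, keep)| keep.then_some(file))
        .collect()
}

fn main() {
    let files = vec!["a.parquet", "b.parquet", "c.parquet"];
    let keep = [true, false, true];
    assert_eq!(prune_files(files, &keep), vec!["a.parquet", "c.parquet"]);
}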

Contributor Author:

How about I bundle in the PartitionValues somehow and then we can re-use and compose that?
Specifically:

  • TableProvider's use just the partition values
  • ParquetOpener combines both
  • Something else can use just the stats

Contributor Author:

Pruning on statistics during plan time would potentially be redundant with also trying to prune again during opening, but it would reduce the files earlier in the plan

Yeah I don't think it's redundant: you either prune or you don't. If we prune earlier the files don't make it this far. If we don't we may now be able to prune them. What's redundant is if there are no changes to the filters (i.e. no dynamic filters), but that sounds both hard to track and like a possible future optimization 😄

Contributor:

kk

/// [`Self::min_values`], [`Self::max_values`], [`Self::null_counts`],
/// and [`Self::row_counts`].
fn num_containers(&self) -> usize {
1
Contributor:

this should be self.statistics.len(), right?
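An illustrative stand-in for what the reviewer is pointing at (simplified, not the PR's types): when the wrapper holds one statistics entry per container, the container count is the length of that Vec, not a constant 1.

// `FileStats` is a placeholder for DataFusion's per-file Statistics.
struct FileStats;

struct PrunableStatistics {
    // One entry per container (file or partition).
    statistics: Vec<FileStats>,
}

impl PrunableStatistics {
    fn num_containers(&self) -> usize {
        self.statistics.len()
    }
}

fn main() {
    let stats = PrunableStatistics { statistics: vec![FileStats, FileStats] };
    assert_eq!(stats.num_containers(), 2);
}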

@adriangb (Contributor Author) commented:

@alamb I pushed 4607643 which adds some nice APIs for partition values. In particular I think it's important to have a way to prune based on partition values + file level statistics (#15935).

However I can't implement it for ListingTable since the trait is defined in physical-optimizer. Can we move the trait somewhere upstream?

@alamb (Contributor) commented May 13, 2025

However I can't implement it for ListingTable since the trait is defined in physical-optimizer. Can we move the trait somewhere upstream?

Maybe it is time to make a datafusion-pruning crate that has all the PruningPredicate and related infrastructure 🤔

@alamb (Contributor) commented May 13, 2025

FYI @xudong963 I think this is relevant to your work on statistics / partition pruning as well

@adriangb (Contributor Author) commented:

However I can't implement it for ListingTable since the trait is defined in physical-optimizer. Can we move the trait somewhere upstream?

Maybe it is time to make a datafusion-pruning crate that has all the PruningPredicate and related infrastructure 🤔

Seems reasonable to me. I guess it'd be at the same level as PhysicalExpr and such.

@adriangb (Contributor Author) commented:

Moving to datafusion_common works pretty well, I think that's easier than making a new crate.

Next hurdle: at this point we've long lost information on the actual table schema / partition files. ParquetOpener::table_schema is actually the file schema and we have no way to back out the partition columns.
Given that PartitionedFile carries around partition_values: Vec<ScalarValue> I'd recommend one of:

  1. Changing PartitionedFile::partition_values to Vec<(String, ScalarValue)>.
  2. Adding PartitionedFile::partition_schema.
  3. Piping down table_schema into ParquetSource and later ParquetOpener.

I think any of these also sets us up to refactor how the partition filters actually get applied (i.e. we don't have to inject them in the FileScan). But maybe that's not desirable because every format would have to implement this on their own then. In that case we pipe them into ParquetOpener for pruning and still inject them in the scan (it should be cheapish).

@alamb any preference?

@xudong963 xudong963 self-requested a review May 14, 2025 14:24
@xudong963 (Member) left a comment

Generally LGTM, thank you

if let (Some(stats), Some(predicate)) = (&file.statistics, &self.predicate) {
let pruning_predicate = build_pruning_predicate(
Arc::clone(predicate),
&self.table_schema,
Member:

Is it reasonable to use table_schema here?

Contributor Author:

It's the only schema we have. And it's not even really the table schema, the name is misleading for historical reasons.

Member:

It'd be better to add some notes about it. (I often get confused when reading the parquet code, all those kinds of schemas, lol)

Contributor Author:

// Note about schemas: we are actually dealing with **3 different schemas** here:
// - The table schema as defined by the TableProvider. This is what the user sees, what they get when they `SELECT * FROM table`, etc.
// - The "virtual" file schema: this is the table schema minus any hive partition columns and projections. This is what the file schema is coerced to.
// - The physical file schema: this is the schema as defined by the parquet file. This is what the parquet file actually contains.
😄

@adriangb (Contributor Author) commented:

I think the next step here is to resolve #16014 (comment)

In my mind it makes sense to both push down the information and continue to have the ability to do it after the scan.
The direction DataFusion seems to be heading in is to add whatever functionality is needed to specialize readers for the most optimal performance (in this case by doing late pruning of files / partitions and being able to evaluate filters that mix partition columns and file columns during the scan), while preserving the ability to fall back to more general approaches (FilterExec, evaluating mixed filters after the scan) for sources that don't support this advanced functionality.

@alamb (Contributor) commented May 14, 2025

Moving to datafusion_common works pretty well, I think that's easier than making a new crate.

I think we should try and avoid moving everything to datafusion_common. Since the pruning stuff relies on PhysicalExpr I don't think we can directly put it in datafusion_common

Next hurdle: at this point we've long lost information on the actual table schema / partition files. ParquetOpener::table_schema is actually the file schema and we have no way to back out the partition columns. Given that PartitionedFile carries around partition_values: Vec<ScalarValue> I'd recommend one of:

  1. Changing PartitionedFile::partition_values to Vec<(String, ScalarValue)>.
  2. Adding PartitionedFile::partition_schema.
  3. Piping down table_schema into ParquetSource and later ParquetOpener.

I think any of these also sets us up to refactor how the partition filters actually get applied (i.e. we don't have to inject them in the FileScan). But maybe that's not desirable because every format would have to implement this on their own then. In that case we pipe them into ParquetOpener for pruning and still inject them in the scan (it should be cheapish).

@alamb any preference?

  1. Changing PartitionedFile::partition_values to Vec<(String, ScalarValue)>.

I think this sounds like the most straightforward thing to me and the easiest way to get the required information

Seems like FileScanConfig already has table_partition_cols.

Maybe we can do something like this (change to use a FieldRef rather than Field to avoid copies):

pub struct PartitionedFile {
...
    pub partition_values: Vec<ScalarValue>,
...
}

to

pub struct PartitionedFile {
...
    pub partition_values: Vec<(FieldRef, ScalarValue)>,
...
}
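For illustration, a stripped-down version of what that pairing buys (FieldRef is arrow's Arc<Field>; the struct and the String value below are stand-ins, not the real PartitionedFile / ScalarValue): carrying the field alongside each value lets downstream code recover the partition column's name and type without consulting the table schema.

use std::sync::Arc;
use arrow_schema::{DataType, Field, FieldRef};

struct PartitionedFileSketch {
    // (partition column, partition value); String stands in for ScalarValue.
    partition_values: Vec<(FieldRef, String)>,
}

fn main() {
    let date: FieldRef = Arc::new(Field::new("event_date", DataType::Utf8, false));
    let file = PartitionedFileSketch {
        partition_values: vec![(Arc::clone(&date), "2025-05-10".to_string())],
    };
    for (field, value) in &file.partition_values {
        // The column name and data type travel with the value.
        println!("{} ({:?}) = {}", field.name(), field.data_type(), value);
    }
}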

@alamb (Contributor) commented May 14, 2025

BTW the other thing I somewhat worry about with reapplying pruning during file opening is that it is in the critical path and will directly add to the query latency. I wonder if there is some way to ensure we have hidden it behind IO if possible (aka make sure we are applying the extra pruning while the next file is opened rather than waiting to do it before starting that IO).

@adriangb (Contributor Author) commented:

Since the pruning stuff relies on PhysicalExpr I don't think we can directly put it in datafusion_common
The stuff I'm moving doesn't 😄. It's basically just the PruningStatistics trait.

Maybe we can do something like this (change to use a FieldRef rather than Field to avoid copies):

That sounds good to me. It kinda makes sense that if you're carrying around partition values you'd carry around info on what columns they belong to. Maybe it will help resolve #13270 as well in the future.

BTW the other thing I somewhat worry about with reapplying pruning during file opening is that it is in the critical path and will directly add to the query latency. I wonder if there is some way to ensure we have hidden it behind IO if possible (aka make sure we are applying the extra pruning while the next file is opened rather than waiting to do it before starting that IO).

I think we can move it a couple lines lower into Ok(Box::pin(async move { and that will do the trick? As long as it happens before we load the Parquet metadata the overhead is minimal. There's probably other stuff we could move into there if that's a concern.
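A self-contained sketch of that ordering, with invented names (none of this is the ParquetOpener code): the statistics check runs inside the returned future, so a pruned file returns immediately and never starts its metadata fetch or decode work.

use futures::future::BoxFuture;

// Placeholder for the real FileOpenFuture; the Vec<u64> stands in for a
// stream of record batches.
fn open_file(prune_by_stats: bool) -> BoxFuture<'static, Result<Vec<u64>, String>> {
    Box::pin(async move {
        // Cheap file-level pruning happens here, before any parquet
        // metadata is fetched.
        if prune_by_stats {
            return Ok(Vec::new());
        }
        // ... otherwise fetch metadata and decode the file ...
        Ok(vec![1, 2, 3])
    })
}

fn main() {
    let pruned = futures::executor::block_on(open_file(true)).unwrap();
    assert!(pruned.is_empty());
}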

@github-actions bot added the common and proto labels on May 15, 2025
@adriangb adriangb force-pushed the late-pruning-files branch from e8eb87f to cc120d0 on May 15, 2025 03:30
@github-actions bot added the documentation label on May 15, 2025
@adriangb (Contributor Author) commented:

@alamb @xudong963 I've pushed a change that:

  1. Moves PruningStatistics into common.
  2. Adds composable helpers to prune based on Vec<Statistics> (multiple files / partitions) and Vec<Vec<ScalarValue>> (multiple containers of partition values).
  3. Adds partition_fields: Vec<FieldRef> to ParquetOpener, with slight tweaks to FileScanConfig (the latter is a bit of a PITA because of how it's both a struct and its own builder).
  4. Implements the pruning inside of the IO work so that it's deferred, as Andrew asked for.
  5. Sets us up nicely to pipe the partition values into the other stages of pruning (row group stats, page stats and row filters). Leaving this for future work though.

@adriangb (Contributor Author) commented:

My plan for this PR now is to first resolve blockers. In particular:

And then come back here and resolve the rest of the points of discussion.

@adriangb adriangb force-pushed the late-pruning-files branch from d6e974c to 7178a63 on June 5, 2025 18:35
fields.extend(conf.table_partition_cols.iter().cloned().map(Arc::new));
fields.extend(conf.table_partition_cols.iter().cloned());
Contributor Author:

I think this may have just been clippy, but it's not a bad change!

@adriangb (Contributor Author) commented Jun 5, 2025

I've rebased this and it's looking nice now.
I think the main open question is the concern about performance / overhead:

https://github.com/apache/datafusion/pull/16014/files#r2093515834

@github-actions bot removed the common label on Jun 5, 2025
@alamb (Contributor) commented Jun 8, 2025

I've rebased this and it's looking nice now. I think the main open question is the concern about performance / overhead:

I'll fire up some benchmarks and see if we can see anything concerning

@alamb (Contributor) commented Jun 8, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing late-pruning-files (e0088bc) to 25727d4 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb (Contributor) commented Jun 8, 2025

🤖: Benchmark completed


Comparing HEAD and late-pruning-files
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ late-pruning-files ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1894.50 ms │         1912.32 ms │ no change │
│ QQuery 1     │   699.43 ms │          725.99 ms │ no change │
│ QQuery 2     │  1429.19 ms │         1418.16 ms │ no change │
│ QQuery 3     │   691.52 ms │          705.27 ms │ no change │
│ QQuery 4     │  1479.88 ms │         1435.90 ms │ no change │
│ QQuery 5     │ 15451.09 ms │        15449.35 ms │ no change │
│ QQuery 6     │  1990.47 ms │         2025.21 ms │ no change │
│ QQuery 7     │  2167.33 ms │         2083.99 ms │ no change │
│ QQuery 8     │   849.71 ms │          855.75 ms │ no change │
└──────────────┴─────────────┴────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                 ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                 │ 26653.13ms │
│ Total Time (late-pruning-files)   │ 26611.94ms │
│ Average Time (HEAD)               │  2961.46ms │
│ Average Time (late-pruning-files) │  2956.88ms │
│ Queries Faster                    │          0 │
│ Queries Slower                    │          0 │
│ Queries with No Change            │          9 │
│ Queries with Failure              │          0 │
└───────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ late-pruning-files ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │    15.62 ms │           15.76 ms │    no change │
│ QQuery 1     │    33.10 ms │           33.10 ms │    no change │
│ QQuery 2     │    79.76 ms │           80.46 ms │    no change │
│ QQuery 3     │    92.47 ms │           95.67 ms │    no change │
│ QQuery 4     │   602.03 ms │          584.42 ms │    no change │
│ QQuery 5     │   849.99 ms │          819.51 ms │    no change │
│ QQuery 6     │    23.38 ms │           23.41 ms │    no change │
│ QQuery 7     │    37.12 ms │           38.68 ms │    no change │
│ QQuery 8     │   894.56 ms │          902.18 ms │    no change │
│ QQuery 9     │  1187.51 ms │         1225.09 ms │    no change │
│ QQuery 10    │   266.45 ms │          267.61 ms │    no change │
│ QQuery 11    │   296.16 ms │          295.69 ms │    no change │
│ QQuery 12    │   899.57 ms │          896.76 ms │    no change │
│ QQuery 13    │  1329.93 ms │         1349.63 ms │    no change │
│ QQuery 14    │   831.31 ms │          838.69 ms │    no change │
│ QQuery 15    │   813.18 ms │          810.15 ms │    no change │
│ QQuery 16    │  1720.13 ms │         1720.39 ms │    no change │
│ QQuery 17    │  1588.26 ms │         1595.08 ms │    no change │
│ QQuery 18    │  3045.79 ms │         3058.16 ms │    no change │
│ QQuery 19    │    85.16 ms │           83.74 ms │    no change │
│ QQuery 20    │  1101.27 ms │         1138.77 ms │    no change │
│ QQuery 21    │  1295.63 ms │         1316.05 ms │    no change │
│ QQuery 22    │  2144.62 ms │         2174.86 ms │    no change │
│ QQuery 23    │  7884.84 ms │         7964.78 ms │    no change │
│ QQuery 24    │   457.65 ms │          470.08 ms │    no change │
│ QQuery 25    │   390.04 ms │          396.00 ms │    no change │
│ QQuery 26    │   526.52 ms │          528.18 ms │    no change │
│ QQuery 27    │  1543.88 ms │         1576.38 ms │    no change │
│ QQuery 28    │ 12400.97 ms │        13551.20 ms │ 1.09x slower │
│ QQuery 29    │   530.87 ms │          520.10 ms │    no change │
│ QQuery 30    │   799.47 ms │          806.82 ms │    no change │
│ QQuery 31    │   865.16 ms │          845.13 ms │    no change │
│ QQuery 32    │  2666.18 ms │         2626.90 ms │    no change │
│ QQuery 33    │  3344.58 ms │         3348.67 ms │    no change │
│ QQuery 34    │  3334.73 ms │         3406.85 ms │    no change │
│ QQuery 35    │  1296.64 ms │         1265.78 ms │    no change │
│ QQuery 36    │   122.86 ms │          120.54 ms │    no change │
│ QQuery 37    │    54.18 ms │           61.38 ms │ 1.13x slower │
│ QQuery 38    │   118.52 ms │          123.84 ms │    no change │
│ QQuery 39    │   189.23 ms │          196.04 ms │    no change │
│ QQuery 40    │    48.29 ms │           51.29 ms │ 1.06x slower │
│ QQuery 41    │    44.33 ms │           48.16 ms │ 1.09x slower │
│ QQuery 42    │    38.13 ms │           38.17 ms │    no change │
└──────────────┴─────────────┴────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                 ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                 │ 55890.09ms │
│ Total Time (late-pruning-files)   │ 57310.16ms │
│ Average Time (HEAD)               │  1299.77ms │
│ Average Time (late-pruning-files) │  1332.79ms │
│ Queries Faster                    │          0 │
│ Queries Slower                    │          4 │
│ Queries with No Change            │         39 │
│ Queries with Failure              │          0 │
└───────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ late-pruning-files ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 114.45 ms │          119.58 ms │     no change │
│ QQuery 2     │  22.01 ms │           21.52 ms │     no change │
│ QQuery 3     │  34.25 ms │           32.87 ms │     no change │
│ QQuery 4     │  19.51 ms │           19.30 ms │     no change │
│ QQuery 5     │  52.57 ms │           51.52 ms │     no change │
│ QQuery 6     │  11.78 ms │           11.88 ms │     no change │
│ QQuery 7     │  97.17 ms │           92.22 ms │ +1.05x faster │
│ QQuery 8     │  26.28 ms │           25.89 ms │     no change │
│ QQuery 9     │  59.08 ms │           59.79 ms │     no change │
│ QQuery 10    │  48.59 ms │           47.60 ms │     no change │
│ QQuery 11    │  11.33 ms │           11.14 ms │     no change │
│ QQuery 12    │  40.14 ms │           41.04 ms │     no change │
│ QQuery 13    │  27.13 ms │           28.06 ms │     no change │
│ QQuery 14    │   9.76 ms │            9.59 ms │     no change │
│ QQuery 15    │  22.77 ms │           22.47 ms │     no change │
│ QQuery 16    │  20.86 ms │           21.34 ms │     no change │
│ QQuery 17    │  98.68 ms │           96.16 ms │     no change │
│ QQuery 18    │ 210.32 ms │          214.81 ms │     no change │
│ QQuery 19    │  25.22 ms │           25.18 ms │     no change │
│ QQuery 20    │  34.85 ms │           33.74 ms │     no change │
│ QQuery 21    │ 160.84 ms │          162.45 ms │     no change │
│ QQuery 22    │  16.62 ms │           16.53 ms │     no change │
└──────────────┴───────────┴────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                 ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                 │ 1164.21ms │
│ Total Time (late-pruning-files)   │ 1164.70ms │
│ Average Time (HEAD)               │   52.92ms │
│ Average Time (late-pruning-files) │   52.94ms │
│ Queries Faster                    │         1 │
│ Queries Slower                    │         0 │
│ Queries with No Change            │        21 │
│ Queries with Failure              │         0 │
└───────────────────────────────────┴───────────┘

@adriangb (Contributor Author) commented Jun 8, 2025

Do any of those benchmarks actually collect statistics or use partition pruning? If not I do expect this to essentially be a no-op.

@alamb (Contributor) commented Jun 8, 2025

Do any of those benchmarks actually collect statistics or use partition pruning? If not I do expect this to essentially be a no-op.

clickbench_partitioned has row group statistics so I think it should use partition pruning 🤔 but I am not sure

@adriangb (Contributor Author) commented Jun 8, 2025

It might just be that cheap 😃, I do expect it to be very cheap.

@alamb (Contributor) left a comment

Let's merge this one in -- I think it looks pretty sweet and will make dynamic filtering that much more effective.

One thing I was thinking was how to show off how good dynamic filtering is / tell people about it.

@adriangb what do you think about making a benchmark for using dynamic filtering? Perhaps we could take the clickbench dataset and rewrite it so it was partitioned by EventDate (so each file had a distinct date).

Then I bet dynamic filters / file opener pruning would show a pretty big difference

@adriangb adriangb changed the title Add late pruning of file based on file level statistics Add late pruning of Parquet files based on file level statistics Jun 10, 2025
@adriangb (Contributor Author) commented:

@adriangb what do you think about making a benchmark for using dynamic filtering? Perhaps we could take the clickbench dataset and rewrite it so it was partitioned by EventDate (so each file had a distinct date).

Then I bet dynamic filters / file opener pruning would show a pretty big difference

I think that’d be great! But don’t we need #15770 first? I guess we can prototype on the merged commit in the meantime. But once we have that in I’ll work on benchmarks, blog posts, etc!

@adriangb adriangb merged commit 7477aa6 into apache:main Jun 10, 2025
30 checks passed
@alamb (Contributor) commented Jun 10, 2025

I think that’d be great! But don’t we need #15770 first? I guess we can prototype on the merged commit in the meantime. But once we have that in I’ll work on benchmarks, blog posts, etc!

Yes I think you are right.

@adriangb (Contributor Author) commented:

@alamb should we be running these checks for every batch? Obviously that makes your concerns about overhead / performance much worse, but I think it will have an even greater impact for dynamic filters: currently, once the file is opened, if midway through the stream the TopK state becomes such that we could exclude the whole file, we still stream every row from the file and exclude it via the predicate pushdown, despite the fact that we now know from the stats that we could immediately exit.

I propose the following:

  1. Make a helper struct that encapsulates the state needed to prune the file based on the combination of filters + file statistics (a rough sketch of this idea follows below).
  2. Add a PhysicalExpr::is_dynamic method that exposes the information needed to know whether we should be doing these checks or not.
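Purely as a sketch of item 1, with invented names and toy statistics (nothing here is an existing DataFusion API): a small struct that owns the file-level statistics and re-evaluates the current filter against them, so the scan can ask "can this file still match?" before decoding the next batch.

// Hypothetical types: `FileStats` stands in for the file-level Statistics and
// `Filter` for the (possibly dynamic) physical predicate.
struct FileStats { max: i64 }
struct Filter { lower_bound: i64 } // e.g. the current TopK threshold

struct FilePruner {
    stats: FileStats,
}

impl FilePruner {
    /// Returns false when the statistics prove no row in the file can pass
    /// the filter, so the stream can stop early instead of decoding more rows.
    fn can_match(&self, filter: &Filter) -> bool {
        self.stats.max >= filter.lower_bound
    }
}

fn main() {
    let pruner = FilePruner { stats: FileStats { max: 10 } };
    assert!(pruner.can_match(&Filter { lower_bound: 5 }));
    assert!(!pruner.can_match(&Filter { lower_bound: 50 }));
}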

@alamb (Contributor) commented Jun 17, 2025

Labels: core (Core DataFusion crate), datasource (Changes to the datasource crate), documentation (Improvements or additions to documentation), optimizer (Optimizer rules), proto (Related to proto crate)

Successfully merging this pull request may close these issues.

Pass PartitionedFile into FileSource for late file stats based pruning