Add late pruning of Parquet files based on file level statistics #16014
Conversation
A couple of thoughts:
Yes this would be super nice -- the more we can do to consolidate statistics / pruning the better off the code will be I think. Right now it is kind of scattered in several places.
You can use something like https://docs.rs/futures/latest/futures/stream/fn.iter.html perhaps -- like the sketch below.
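A minimal, self-contained illustration of the idea (assuming only the `futures` crate; an empty `stream::iter` is the shape a pruned file's open could return instead of doing any IO -- the surrounding DataFusion types are omitted):

```rust
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    // `stream::iter` turns any iterator into a Stream. An empty Vec yields
    // a stream that ends immediately -- what a pruned (skipped) file could
    // return instead of performing any IO.
    let pruned: Vec<Result<u64, String>> = vec![];
    let mut s = stream::iter(pruned);
    assert!(block_on(s.next()).is_none());
}
```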
```diff
@@ -367,7 +368,7 @@ impl Default for OnError {
 pub trait FileOpener: Unpin + Send + Sync {
     /// Asynchronously open the specified file and return a stream
     /// of [`RecordBatch`]
-    fn open(&self, file_meta: FileMeta) -> Result<FileOpenFuture>;
+    fn open(&self, file_meta: FileMeta, file: PartitionedFile) -> Result<FileOpenFuture>;
```
Isn't it sufficient to provide only file statistics? `PartitionedFile` seems like overkill to me.
Maybe? But I feel like since we have the partitioned file we might as well pass it in. Maybe we can use it in the future to enable optimizations that use the partition values (e.g. late pruning based on partition values, including partition values in the scan so that more filters can be evaluated, etc.).
I think using PartitionedFile as the "data we have at plan time" including statistics and potentially information about size, encryption, special indexes, etc makes a lot of sense
> Maybe? But I feel like since we have the partitioned file we might as well pass it in. Maybe we can use it in the future to enable optimizations that use the partition values (e.g. late pruning based on partition values, including partition values in the scan so that more filters can be evaluated, etc.).
I believe these can also be inferred from statistics in a more generalized fashion (I don't know whether partition columns exist in `column_statistics` now), but it's not a big deal, we can keep this 👍🏻
Can you please update the documentation for `open()` to mention that `file` has plan-time per-file information (such as statistics) and leave a doc link back?
The idea makes a lot of sense. I have one implementation suggestion. Thanks again @adriangb
Force-pushed from 0e03bdc to 94726cc
@alamb please review again: I implemented it and added a test 😄
```rust
(Some(stats), Some(predicate)) => {
    let pruning_predicate = build_pruning_predicate(
        Arc::clone(predicate),
        &self.table_schema,
```
Should it use `table_schema` here?
```rust
match (&file.statistics, &self.predicate) {
    (Some(stats), Some(predicate)) => {
```
Given that there is only one branch, I suggest using `if let (Some(_), Some(_)) = xxx {}` here (illustrated below).
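A self-contained illustration of the suggested pattern (stand-in values; the real code matches on `file.statistics` and `self.predicate`):

```rust
fn main() {
    let stats: Option<u64> = Some(42);
    let predicate: Option<&str> = Some("x > 0");

    // A single-arm `match (a, b) { (Some(..), Some(..)) => ..., _ => {} }`
    // reads more directly as `if let` on the tuple:
    if let (Some(s), Some(p)) = (stats, predicate) {
        println!("pruning with stats={s} and predicate={p}");
    }
}
```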
Very cool -- I think this is very close
```rust
    }
}

/// Returns [`BooleanArray`] where each row represents information known
```
This comment can probably be trimmed, with a link back to the original trait source.
```diff
@@ -995,6 +996,184 @@ fn build_statistics_record_batch<S: PruningStatistics>(
     })
 }
 
+/// Prune a set of containers represented by their statistics.
```
This is a nice structure -- I think it makes lots of sense and is 100%
Specifically, I thought there was already code that pruned individual files based on statistics, but I could not find any in ListingTable (we have something like this in influxdb_iox).
My opinion is that if we are going to add this code into the DataFusion codebase we should:
- Ensure that it helps as many users as possible
- Make sure it is executed as much as possible (to ensure test coverage)

Thus, what do you think about using the PrunableStatistics to prune the FileGroup in ListingTable here?
Pruning on statistics during plan time would potentially be redundant with also trying to prune again during opening, but it would reduce the files earlier in the plan.
How about I bundle in the PartitionValues somehow and then we can re-use and compose that? Specifically (see the sketch after this list):
- TableProviders use just the partition values
- ParquetOpener combines both
- Something else can use just the stats
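A rough, self-contained sketch of that composition idea, using a simplified stand-in for DataFusion's `PruningStatistics` trait (all names here are illustrative, not the real API):

```rust
use std::collections::HashMap;

// Simplified stand-in for DataFusion's `PruningStatistics` trait.
trait Stats {
    fn min_value(&self, column: &str) -> Option<i64>;
}

struct PartitionValues(HashMap<String, i64>);
struct FileStats(HashMap<String, i64>);

impl Stats for PartitionValues {
    fn min_value(&self, column: &str) -> Option<i64> {
        self.0.get(column).copied()
    }
}

impl Stats for FileStats {
    fn min_value(&self, column: &str) -> Option<i64> {
        self.0.get(column).copied()
    }
}

// The combined view (what ParquetOpener would use): answer from partition
// values first, then fall back to file-level statistics.
struct Composite<A: Stats, B: Stats>(A, B);

impl<A: Stats, B: Stats> Stats for Composite<A, B> {
    fn min_value(&self, column: &str) -> Option<i64> {
        self.0.min_value(column).or_else(|| self.1.min_value(column))
    }
}

fn main() {
    let parts = PartitionValues(HashMap::from([("date".to_string(), 20240101)]));
    let stats = FileStats(HashMap::from([("x".to_string(), 7)]));
    let both = Composite(parts, stats);
    assert_eq!(both.min_value("date"), Some(20240101));
    assert_eq!(both.min_value("x"), Some(7));
}
```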
> Pruning on statistics during plan time would potentially be redundant with also trying to prune again during opening, but it would reduce the files earlier in the plan.
Yeah I don't think it's redundant: you either prune or you don't. If we prune earlier the files don't make it this far. If we don't we may now be able to prune them. What's redundant is if there are no changes to the filters (i.e. no dynamic filters), but that sounds both hard to track and like a possible future optimization 😄
kk
```rust
/// [`Self::min_values`], [`Self::max_values`], [`Self::null_counts`],
/// and [`Self::row_counts`].
fn num_containers(&self) -> usize {
    1
```
This should be `self.statistics.len()`, right?
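That is, a self-contained sketch of the fix (stand-in types; the point is one container per file's statistics):

```rust
struct PrunableStatistics {
    // stand-in for the per-file `Statistics` entries
    statistics: Vec<String>,
}

impl PrunableStatistics {
    /// One container per file, not a constant 1.
    fn num_containers(&self) -> usize {
        self.statistics.len()
    }
}

fn main() {
    let s = PrunableStatistics {
        statistics: vec!["file1".into(), "file2".into()],
    };
    assert_eq!(s.num_containers(), 2);
}
```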
@alamb I pushed 4607643 which adds some nice APIs for partition values. In particular I think it's important to have a way to prune based on partition values + file level statistics (#15935). However I can't implement it for …
Maybe it is time to make a …
FYI @xudong963 I think this is relevant to your work on statistics / partition pruning as well.
Seems reasonable to me. I guess it'd be at the same level as …
Moving on to the next hurdle: at this point we've long lost information on the actual table schema / partition files.
I think any of these also sets us up to refactor how the partition filters actually get applied (i.e. we don't have to inject them in the …). @alamb any preference?
Generally LGTM, thank you
```rust
if let (Some(stats), Some(predicate)) = (&file.statistics, &self.predicate) {
    let pruning_predicate = build_pruning_predicate(
        Arc::clone(predicate),
        &self.table_schema,
```
Is it reasonable to use `table_schema` here?
It's the only schema we have. And it's not even really the table schema; the name is misleading for historical reasons.
It'd be better to add some notes about it. (I often get confused when reading the parquet code -- all kinds of schemas, lol)
datafusion/datasource-parquet/src/opener.rs, lines 182 to 185 in 4607643:

```rust
// Note about schemas: we are actually dealing with **3 different schemas** here:
// - The table schema as defined by the TableProvider. This is what the user sees, what they get when they `SELECT * FROM table`, etc.
// - The "virtual" file schema: this is the table schema minus any hive partition columns and projections. This is what the file schema is coerced to.
// - The physical file schema: this is the schema as defined by the parquet file. This is what the parquet file actually contains.
```
I think the next step here is to resolve #16014 (comment). In my mind it makes sense to both push down the information and continue to have the ability to do it after the scan.
I think we should try and avoid moving everything to datafusion_common. Since the pruning stuff relies on PhysicalExpr, I don't think we can directly put it in datafusion_common.
I think this sounds like the most straightforward thing to me and the easiest way to get the required information. Maybe we can do something like this (change to use a `FieldRef` rather than `Field` to avoid copies):

```rust
pub struct PartitionedFile {
    ...
    pub partition_values: Vec<ScalarValue>,
    ...
}
```

to

```rust
pub struct PartitionedFile {
    ...
    pub partition_values: Vec<(FieldRef, ScalarValue)>,
    ...
}
```
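A small sketch of what the `(FieldRef, ScalarValue)` pairing buys consumers (assumes the `arrow` and `datafusion-common` crates; illustrative only, not the PR's actual code):

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, FieldRef};
use datafusion_common::ScalarValue;

fn main() {
    // Carrying the field alongside each partition value means a consumer
    // can recover both the column metadata and the value in one place,
    // without consulting the table schema separately.
    let partition_values: Vec<(FieldRef, ScalarValue)> = vec![(
        Arc::new(Field::new("event_date", DataType::Utf8, false)),
        ScalarValue::from("2024-01-01"),
    )];

    for (field, value) in &partition_values {
        println!("{} = {}", field.name(), value);
    }
}
```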
BTW, the other thing I somewhat worry about with reapplying pruning during file opening is that it is in the critical path and will directly add to the query latency. I wonder if there is some way to ensure we have hidden it behind IO if possible (aka make sure we are applying the extra pruning while the next file is opened, rather than waiting to do it before starting that IO).
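One way to express the "hide it behind IO" idea with plain `futures` combinators (a sketch only, not DataFusion's actual `FileStream` machinery):

```rust
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

// Stand-in for the per-file "prune + open" work.
async fn prune_and_open(file_id: u32) -> u32 {
    file_id
}

fn main() {
    // `buffered(2)` keeps the next file's prune/open future running while
    // the current file's results are consumed, so the extra pruning work
    // overlaps with IO instead of sitting on the critical path.
    let opened: Vec<u32> = block_on(
        stream::iter(0..4)
            .map(prune_and_open)
            .buffered(2)
            .collect(),
    );
    assert_eq!(opened, vec![0, 1, 2, 3]);
}
```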
That sounds good to me. It kinda makes sense that if you're carrying around partition values you'd carry around info on what columns they belong to. Maybe it will help resolve #13270 as well in the future.
I think we can move it a couple of lines lower into …
Force-pushed from e8eb87f to cc120d0
@alamb @xudong963 I've pushed a change that: …
My plan for this PR now is to first resolve blockers. In particular: …
And then come back here and resolve the rest of the points of discussion.
Force-pushed from de0590c to d6e974c
Force-pushed from d6e974c to 7178a63
```diff
-fields.extend(conf.table_partition_cols.iter().cloned().map(Arc::new));
+fields.extend(conf.table_partition_cols.iter().cloned());
```
I think this may have just been clippy, but it's not a bad change!
I've rebased this and it's looking nice now. https://github.com/apache/datafusion/pull/16014/files#r2093515834
I'll fire up some benchmarks and see if we can see anything concerning.
🤖: Benchmark completed
Do any of those benchmarks actually collect statistics or use partition pruning? If not, I do expect this to essentially be a no-op.
It might just be that cheap 😃 -- I do expect it to be very cheap.
Let's merge this one in -- I think it looks pretty sweet and will make dynamic filtering that much more effective.
One thing I was thinking about was how to show off how good dynamic filtering is / tell people about it.
@adriangb what do you think about making a benchmark for using dynamic filtering? Perhaps we could take the clickbench dataset and rewrite it so it was partitioned by `EventDate` (so each file had a distinct date).
Then I bet dynamic filters / file opener pruning would show a pretty big difference.
I think that’d be great! But don’t we need #15770 first? I guess we can prototype on the merged commit in the meantime. But once we have that in I’ll work on benchmarks, blog posts, etc!
Yes I think you are right.
@alamb should we be running these checks for every batch? Obviously that makes your concerns about overhead / performance much worse, but I think it will have an even greater impact for dynamic filters: currently, once the file is opened, if midway through the stream the TopK state becomes such that we could exclude the whole file, we still stream every row from the file and exclude it via the predicate pushdown, despite the fact that we now know from the stats that we could immediately exit. I propose the following (sketched below):
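A minimal illustration of the "stop mid-stream" idea (stand-in logic only; the real check would re-evaluate the pruning predicate against the file's stats whenever the dynamic filter / TopK state changes):

```rust
use futures::executor::block_on;
use futures::stream::{self, StreamExt};

fn main() {
    // Pretend each item is a decoded batch; once the "file can still
    // match" check starts failing after a TopK update, the stream ends
    // early instead of decoding and filtering the remaining rows.
    let still_matches_after_topk_update = 2; // stand-in threshold
    let batches = stream::iter(0..100).take_while(move |batch| {
        futures::future::ready(*batch < still_matches_after_topk_update)
    });
    let consumed: Vec<i32> = block_on(batches.collect());
    assert_eq!(consumed, vec![0, 1]);
}
```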
`PartitionedFile` into `FileSource` for late file stats based pruning #16000