Skip re-pruning based on partition values and file level stats if there are no dynamic filters #16424
Conversation
@@ -524,6 +512,91 @@ fn should_enable_page_index(
        .unwrap_or(false)
}

/// Prune based on partition values and file-level statistics.
pub struct FilePruner {
I made this `pub` as I think it could be useful for other data sources. I do think we should move this + the `PruningPredicate` stuff into a `datafusion-pruning` crate or something.
+1. It's time to put all the pruning logic in one place.
Shall I make a `datafusion-pruning` crate? I guess it will have all of the same deps as `datasource-parquet`, sans the parquet-specific bits.
let pruning_predicate = build_pruning_predicate(
    Arc::clone(&self.predicate),
    &self.pruning_schema,
    &self.predicate_creation_errors,
);
It's unfortunate we need to re-do this work every iteration. I wonder if we should call `snapshot_physical_expr` manually here and keep track of "is the new expression different from the last iteration; if not, skip the rest of the work" 🤔
Another option would be to add a `generation` to dynamic filters which gets bumped by 1 every time they get updated. Then it would be super cheap to check whether a filter has been updated. But we'd have to come up with APIs for that, put it on `PhysicalExpr` (what happens if there are multiple child dynamic filters with different generations...?), etc.

It seems to me that any perf tradeoff only affects some cases with dynamic filters, so it should be okay to proceed as is for now and treat that as a later optimization.
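The generation idea is cheap to sketch. Below is a minimal, hypothetical model (not DataFusion's actual API): the filter bumps an atomic counter on every update, so a consumer can detect changes with a single load instead of comparing expression trees.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative stand-in for a dynamic filter; not DataFusion's actual type.
struct DynamicFilter {
    // ... the current predicate would live here ...
    generation: AtomicU64,
}

impl DynamicFilter {
    /// Install a new predicate, then publish the change.
    fn update(&self /* , new_predicate: ... */) {
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// Cheap check: callers remember the last generation they saw and
    /// compare it against this value to detect updates.
    fn generation(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }
}

fn main() {
    let filter = DynamicFilter { generation: AtomicU64::new(0) };
    let seen = filter.generation();
    filter.update();
    assert_ne!(seen, filter.generation()); // an update is detectable in O(1)
}
```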
It happens once per file, right? If so, I agree that doing it as a follow-on optimization sounds good. However, I recommend we file a ticket while this is all in our heads / we have the full context, otherwise we'll forget what to do.
cc @alamb I think this resolves the concern about the perf overhead of this late pruning when there are no dynamic filters. What happens when there are dynamic filters is a tossup: in the case of a TopK with large files it's clearly a win, but there could obviously be cases where the additional checks cost more than they save if they don't result in early termination of the streams.
Thanks @adriangb -- I like the FilePruner and not pruning if there are no dynamic filters.
I am not sure about trying to prune on each batch -- it does seem like it has the potential to stop reading from certain files earlier, but I worry that the overhead is too high
I'll fire off some benchmarks and see what we can see
}

impl FilePruner {
    pub fn new_opt(
Could we document under what circumstances it returns `None`? I think it is when there are no dynamic predicates.

It would also be good to document why it returns `None` (in this case because we assume that files have already been pruned using any non-dynamic predicates, so additional pruning can only help when new dynamic predicates become available??)
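For what it's worth, here is a sketch of what that doc comment could pin down, with an illustrative (not actual) signature and a hypothetical `contains_dynamic_filters` helper:

```rust
use std::sync::Arc;
use arrow_schema::Schema;
use datafusion_physical_expr::PhysicalExpr;

/// Abbreviated version of the struct from the diff.
pub struct FilePruner {
    predicate: Arc<dyn PhysicalExpr>,
    pruning_schema: Arc<Schema>,
}

impl FilePruner {
    /// Returns `None` when the predicate contains no dynamic filters:
    /// static predicates were already applied when the file list was
    /// pruned during planning, so re-checking them per file at execution
    /// time could never change the outcome.
    pub fn new_opt(
        predicate: Arc<dyn PhysicalExpr>,
        pruning_schema: Arc<Schema>,
    ) -> Option<Self> {
        if !contains_dynamic_filters(&predicate) {
            return None;
        }
        Some(Self { predicate, pruning_schema })
    }
}

/// Hypothetical helper: in the real code this would walk the expression
/// tree looking for dynamic filter nodes; stubbed to keep the sketch small.
fn contains_dynamic_filters(_expr: &Arc<dyn PhysicalExpr>) -> bool {
    true
}
```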
/// Prune based on partition values and file-level statistics.
pub struct FilePruner {
    predicate: Arc<dyn PhysicalExpr>,
    pruning_schema: Arc<Schema>,
Could we maybe add some comments about what a `pruning_schema` is, and how it relates to `partition_fields`?
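My reading (worth confirming in the doc comment): the pruning schema is the file schema with the partition columns appended, so a single predicate can be evaluated against both file-level statistics (regular columns) and the file's partition values (partition columns). A self-contained sketch of that relationship, with illustrative names:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema};

/// Hypothetical helper: append partition columns to the file schema so the
/// predicate can reference both regular columns (backed by file statistics)
/// and partition columns (backed by the file's partition values).
fn build_pruning_schema(file_schema: &Schema, partition_fields: &[Field]) -> Arc<Schema> {
    let mut fields: Vec<Field> =
        file_schema.fields().iter().map(|f| f.as_ref().clone()).collect();
    fields.extend(partition_fields.iter().cloned());
    Arc::new(Schema::new(fields))
}

fn main() {
    let file_schema = Schema::new(vec![Field::new("a", DataType::Int64, true)]);
    let partition_fields = vec![Field::new("date", DataType::Utf8, false)];
    let pruning_schema = build_pruning_schema(&file_schema, &partition_fields);
    assert_eq!(pruning_schema.fields().len(), 2); // "a" + "date"
}
```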
})
.take_while(move |_| {
    if let Some(file_pruner) = file_pruner.as_ref() {
        match file_pruner.should_prune() {
This is basically applying the filter on each record batch, right? I think once we can actually push the filters into the parquet scan (which I realize I have been talking about for months...) this could become entirely redundant.

On the other hand, it also stops the input immediately if we find out the file can be skipped 🤔
Right, the point is the early stopping, which the parquet pruning will not do!
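To make that intent concrete, a toy model of the `take_while` hook over a plain iterator (the real code works on a `RecordBatch` stream; `should_prune` here is a stand-in for the per-file check):

```rust
/// Stand-in for the real check, which re-evaluates the pruning predicate
/// against the file's partition values and file-level statistics.
fn should_prune() -> bool {
    false
}

fn main() {
    let batches = vec![1, 2, 3]; // stand-ins for record batches
    let read: Vec<_> = batches
        .into_iter()
        // The point is not filtering rows (parquet row-group/page pruning
        // already does that); it is terminating the stream as soon as the
        // whole file is known to be prunable by an updated dynamic filter.
        .take_while(|_| !should_prune())
        .collect();
    println!("read {} batches", read.len());
}
```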
Do we expect the benchmarks to show anything? I don't think they're using dynamic filters, right? Maybe we need to merge #15770 and then we can benchmark this?
🤖: Benchmark completed (details collapsed)
🤖: Benchmark completed (details collapsed)
I want to make sure the overhead of checking the predicates on each incoming batch didn't slow things down.
If you check the code, that only happens if there are dynamic filters. And since there are none right now, it becomes just a no-op. The only way to actually verify will be to merge #15770 and then compare this PR to main.
@adriangb I'll review tomorrow, today I have some other things.
@alamb sorry for the ping, but would you mind running the benchmarks again?
LOL, I need to make a webpage (or give you access to the server to queue the jobs yourself).
I was reading that Arrow has requested AWS credits: https://lists.apache.org/thread/q33oofy2v3zpg9s9l8o0w68rmjr3ocsv. Perhaps we can utilize some of those for this use case.
🤖: Benchmark completed (details collapsed)
I tried to ask GCS for credits... they didn't seem excited and ultimately came up with nothing.
Interesting results. I'm inclined to believe that the speedups and slowdowns are both real. We'll have to think about this a bit more.
Force-pushed from 3d6a97a to ebe4196 (compare)
@Dandandan @alamb I pushed ebe4196, which adds a very cheap way to track changes to a `PhysicalExpr` if it's dynamic. I think this will be useful in several places, but immediately it gives us the ability to check whether the dynamic predicate has been updated before doing the work of re-calculating the pruning predicate, etc. I'm still not sure it will be cheap enough, but I think it's worth a shot if we can re-run the benches. It'll be a shame if we can't figure this out: if we can get this working, it mostly negates the current unfortunate situation where a query with a TopK might run faster with less parallelism / partitioning upfront. With this change you still open the files, but you can bail out quickly instead of having to stream the whole thing.
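A sketch of how such a generation counter could gate the per-file work (illustrative names; not the exact code in ebe4196):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct FilePruner {
    last_seen_generation: u64,
}

impl FilePruner {
    /// Only redo the expensive work when the dynamic filter actually changed.
    fn should_prune(&mut self, filter_generation: &AtomicU64) -> bool {
        let current = filter_generation.load(Ordering::Acquire);
        if current == self.last_seen_generation {
            // Unchanged since the last check: the previous answer stands,
            // so skip rebuilding the pruning predicate entirely.
            return false;
        }
        self.last_seen_generation = current;
        // ... rebuild the pruning predicate and evaluate it against the
        // file's partition values and statistics here ...
        false
    }
}

fn main() {
    let generation = AtomicU64::new(0);
    let mut pruner = FilePruner { last_seen_generation: 0 };
    assert!(!pruner.should_prune(&generation)); // no update: cheap early return
    generation.fetch_add(1, Ordering::Release); // a dynamic filter update
    assert!(!pruner.should_prune(&generation)); // now the full re-check runs
}
```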
I think this will require @Dandandan's suggestion of only updating the filters if the new ones are more selective: #16433. Right now we always update the filters -> it always bumps the generation -> we always re-check.
@alamb I reverted the filtering during the stream, so this should now do strictly less work 😄
#16014 (comment)