Skip re-pruning based on partition values and file level stats if there are no dynamic filters #16424

Open
wants to merge 14 commits into
base: main
from

Conversation

adriangb
Contributor

github-actions bot added the physical-expr (changes to the physical-expr crates) and datasource (changes to the datasource crate) labels Jun 16, 2025
@@ -524,6 +512,91 @@ fn should_enable_page_index(
.unwrap_or(false)
}

/// Prune based on partition values and file-level statistics.
pub struct FilePruner {
Contributor Author

I made this pub as I think it could be useful for other data sources.

I do think we should move this + PruningPredicate stuff into a datafusion-pruning crate or something.

Member

+1. It's time to put all the pruning logic in one place.

Contributor Author

Shall I make a datafusion-pruning crate? I guess it will have all of the same deps as datasource-parquet sans the parquet specific bits.

Comment on lines +559 to +575
let pruning_predicate = build_pruning_predicate(
Arc::clone(&self.predicate),
&self.pruning_schema,
&self.predicate_creation_errors,
);
Contributor Author

It's unfortunate we need to re-do this work every iteration.
I wonder if we should call snapshot_physical_expr manually here and keep track of "is the new expression different than last iteration, if not skip the rest of the work" 🤔

Contributor Author

adriangb Jun 16, 2025

Another option would be to add a generation to dynamic filters which gets bumped up by 1 every time they get updated. Then it would be super cheap to check if a filter has been updated. But we'd have to come up with APIs for that, put it on PhysicalExpr (what happens if there are multiple child dynamic filters with different generations...?), etc.

It seems to me that any perf tradeoff only affects some cases with dynamic filters, so it should be okay to proceed as is for now and treat that as a later optimization.
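
To make the generation idea concrete, here is a minimal std-only sketch. `DynamicFilter`, `GenerationTracker`, and their methods are hypothetical names used for illustration, not DataFusion APIs, and the filter expression is modeled as a plain string.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for a dynamic filter (not a DataFusion type).
pub struct DynamicFilter {
    /// Current filter, modeled as a string purely for illustration.
    expr: RwLock<String>,
    /// Incremented by 1 on every update.
    generation: AtomicU64,
}

impl DynamicFilter {
    pub fn new(expr: &str) -> Self {
        Self {
            expr: RwLock::new(expr.to_string()),
            generation: AtomicU64::new(0),
        }
    }

    /// Replace the filter and bump the generation.
    pub fn update(&self, expr: &str) {
        *self.expr.write().unwrap() = expr.to_string();
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// Cheap read of the current generation.
    pub fn generation(&self) -> u64 {
        self.generation.load(Ordering::Acquire)
    }

    pub fn current(&self) -> String {
        self.expr.read().unwrap().clone()
    }
}

/// Remembers the last generation a consumer acted on.
pub struct GenerationTracker {
    last_seen: Option<u64>,
}

impl GenerationTracker {
    pub fn new() -> Self {
        Self { last_seen: None }
    }

    /// True if the filter changed since the previous call
    /// (always true on the first call).
    pub fn changed(&mut self, filter: &DynamicFilter) -> bool {
        let generation = filter.generation();
        if self.last_seen == Some(generation) {
            false
        } else {
            self.last_seen = Some(generation);
            true
        }
    }
}
```

A consumer would call `changed` before rebuilding the pruning predicate and skip the rebuild when it returns `false`. How to combine generations across several child dynamic filters is left open in this sketch, as it is in the comment above.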

Contributor

it happens once per file, right?

If so I agree that doing it as a follow on optimization sounds good.

However, I recommend we file a ticket while this is all in our heads / we have the full context otherwise we'll forget what to do

@adriangb
Contributor Author

cc @alamb I think this resolves the concern about the perf overhead of this late pruning when there are no dynamic filters. It's a tossup what happens when there are dynamic filters: in the case of a TopK with large files it's clearly a win, but there could obviously be cases where the additional checks are just overhead because they don't result in early termination of the streams.

Contributor

alamb left a comment

Thanks @adriangb -- I like the FilePruner and not pruning if there are no dynamic filters.

I am not sure about trying to prune on each batch -- it does seem like it has the potential to stop reading from certain files earlier, but I worry that the overhead is too high

I'll fire off some benchmarks and see what we can see

}

impl FilePruner {
pub fn new_opt(
Contributor

Could we document under what circumstances it returns None? I think it is when there are no dynamic predicates

It would also be good to document why it would return None (in this case because we assume that files have already been pruned using any non-dynamic predicates so additional pruning may happen ONLY when new dynamic predicates are available??)
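
As a rough illustration of the contract being asked about, the sketch below models a constructor that returns `None` when the predicate has no dynamic parts. The `Expr` enum, `is_dynamic`, and this single-argument `new_opt` are simplified, hypothetical stand-ins and do not reflect the real signature in the PR.

```rust
/// Hypothetical, simplified predicate tree (not DataFusion's PhysicalExpr).
#[derive(Debug, Clone)]
pub enum Expr {
    /// Known at plan time; assumed to have been used for pruning already.
    Static(String),
    /// May be updated while the query runs, e.g. by a TopK operator.
    Dynamic(String),
    /// Conjunction of child predicates.
    And(Vec<Expr>),
}

impl Expr {
    /// True if any node in the tree is dynamic.
    pub fn is_dynamic(&self) -> bool {
        match self {
            Expr::Static(_) => false,
            Expr::Dynamic(_) => true,
            Expr::And(children) => children.iter().any(Expr::is_dynamic),
        }
    }
}

/// Simplified model of a file-level pruner.
#[derive(Debug)]
pub struct FilePruner {
    predicate: Expr,
}

impl FilePruner {
    /// Returns `None` when the predicate contains no dynamic parts.
    /// The assumption is that files were already pruned with the static
    /// predicates, so re-pruning can only help once a dynamic filter exists.
    pub fn new_opt(predicate: Expr) -> Option<Self> {
        if predicate.is_dynamic() {
            Some(Self { predicate })
        } else {
            None
        }
    }

    pub fn predicate(&self) -> &Expr {
        &self.predicate
    }
}
```

Callers can then treat `None` as "nothing more to prune for this file" and skip the extra work entirely.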

/// Prune based on partition values and file-level statistics.
pub struct FilePruner {
predicate: Arc<dyn PhysicalExpr>,
pruning_schema: Arc<Schema>,
Contributor

Could we maybe add some comments about what a pruning_schema is? And how it relates to partition_fields

})
.take_while(move |_| {
if let Some(file_pruner) = file_pruner.as_ref() {
match file_pruner.should_prune() {
Contributor

This is basically applying the filter on each record batch, right?

I think once we can actually push the filters into the parquet scan (which I realize I have been talking about for months...) this could become entirely redundant

On the other hand, it also stops the input immediately if we find out the file should stop 🤔

Contributor Author

Right, the point is the early stopping, which the parquet pruning will not do!
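
The early-stopping behaviour discussed in this thread can be sketched with a plain iterator standing in for the record batch stream. Everything here is a hypothetical model: batches are `Vec<i64>`, the dynamic filter is a shared threshold meaning "value >= threshold", and `should_prune` compares it with an assumed file-level maximum.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical pruner: the file can be skipped once the shared threshold
/// rises above the largest value the file can contain.
pub struct FilePruner {
    file_max: i64,
    threshold: Arc<AtomicI64>,
}

impl FilePruner {
    pub fn new(file_max: i64, threshold: Arc<AtomicI64>) -> Self {
        Self { file_max, threshold }
    }

    /// True when no row in this file can still pass the dynamic filter.
    pub fn should_prune(&self) -> bool {
        self.file_max < self.threshold.load(Ordering::Relaxed)
    }
}

/// Yield batches until the pruner says the rest of the file is useless.
/// With `None` (no dynamic filters) the check is a single cheap branch.
pub fn scan_file(
    batches: Vec<Vec<i64>>,
    file_pruner: Option<FilePruner>,
) -> impl Iterator<Item = Vec<i64>> {
    batches.into_iter().take_while(move |_| match file_pruner.as_ref() {
        Some(pruner) => !pruner.should_prune(),
        None => true,
    })
}
```

Because `take_while` ends the iteration at the first failed check, the remaining batches of the file are never pulled, which is the part that row-level filtering inside the scan would not provide.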

@alamb
Contributor

alamb commented Jun 16, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing prune-rg (936e039) to dd936cb diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@adriangb
Contributor Author

Do we expect the benchmarks to show anything? I don't think they're using dynamic filters right? Maybe we need to merge #15770 and then we can benchmark this?

@alamb
Contributor

alamb commented Jun 16, 2025

🤖: Benchmark completed

Details

Comparing HEAD and prune-rg
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃    prune-rg ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  1879.99 ms │  1939.08 ms │ no change │
│ QQuery 1     │   693.65 ms │   708.16 ms │ no change │
│ QQuery 2     │  1355.42 ms │  1393.13 ms │ no change │
│ QQuery 3     │   669.90 ms │   672.00 ms │ no change │
│ QQuery 4     │  1337.47 ms │  1363.59 ms │ no change │
│ QQuery 5     │ 15038.90 ms │ 15112.89 ms │ no change │
│ QQuery 6     │  1986.81 ms │  1965.90 ms │ no change │
│ QQuery 7     │  1929.58 ms │  1936.37 ms │ no change │
│ QQuery 8     │   799.31 ms │   798.86 ms │ no change │
└──────────────┴─────────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 25691.04ms │
│ Total Time (prune-rg)   │ 25889.98ms │
│ Average Time (HEAD)     │  2854.56ms │
│ Average Time (prune-rg) │  2876.66ms │
│ Queries Faster          │          0 │
│ Queries Slower          │          0 │
│ Queries with No Change  │          9 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃    prune-rg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    16.59 ms │    15.61 ms │ +1.06x faster │
│ QQuery 1     │    33.33 ms │    32.71 ms │     no change │
│ QQuery 2     │    80.97 ms │    79.03 ms │     no change │
│ QQuery 3     │    98.84 ms │   101.19 ms │     no change │
│ QQuery 4     │   589.88 ms │   617.40 ms │     no change │
│ QQuery 5     │   822.41 ms │   848.28 ms │     no change │
│ QQuery 6     │    23.72 ms │    23.39 ms │     no change │
│ QQuery 7     │    36.21 ms │    35.69 ms │     no change │
│ QQuery 8     │   857.78 ms │   867.88 ms │     no change │
│ QQuery 9     │  1165.86 ms │  1172.75 ms │     no change │
│ QQuery 10    │   252.83 ms │   253.69 ms │     no change │
│ QQuery 11    │   282.67 ms │   281.46 ms │     no change │
│ QQuery 12    │   854.98 ms │   846.96 ms │     no change │
│ QQuery 13    │  1257.05 ms │  1266.75 ms │     no change │
│ QQuery 14    │   801.95 ms │   785.82 ms │     no change │
│ QQuery 15    │   777.20 ms │   764.66 ms │     no change │
│ QQuery 16    │  1633.65 ms │  1595.44 ms │     no change │
│ QQuery 17    │  1595.53 ms │  1582.06 ms │     no change │
│ QQuery 18    │  2896.26 ms │  2864.05 ms │     no change │
│ QQuery 19    │    86.57 ms │    84.57 ms │     no change │
│ QQuery 20    │  1119.10 ms │  1094.90 ms │     no change │
│ QQuery 21    │  1243.84 ms │  1271.98 ms │     no change │
│ QQuery 22    │  2064.74 ms │  2070.37 ms │     no change │
│ QQuery 23    │  7537.75 ms │  7537.46 ms │     no change │
│ QQuery 24    │   446.88 ms │   445.02 ms │     no change │
│ QQuery 25    │   367.87 ms │   374.34 ms │     no change │
│ QQuery 26    │   503.34 ms │   506.30 ms │     no change │
│ QQuery 27    │  1481.94 ms │  1504.55 ms │     no change │
│ QQuery 28    │ 11763.16 ms │ 11902.61 ms │     no change │
│ QQuery 29    │   525.71 ms │   532.18 ms │     no change │
│ QQuery 30    │   752.63 ms │   753.79 ms │     no change │
│ QQuery 31    │   801.83 ms │   796.85 ms │     no change │
│ QQuery 32    │  2494.56 ms │  2480.09 ms │     no change │
│ QQuery 33    │  3143.19 ms │  3172.92 ms │     no change │
│ QQuery 34    │  3150.66 ms │  3179.61 ms │     no change │
│ QQuery 35    │  1225.93 ms │  1238.68 ms │     no change │
│ QQuery 36    │   123.81 ms │   124.66 ms │     no change │
│ QQuery 37    │    55.59 ms │    55.17 ms │     no change │
│ QQuery 38    │   121.16 ms │   124.64 ms │     no change │
│ QQuery 39    │   195.41 ms │   195.87 ms │     no change │
│ QQuery 40    │    46.69 ms │    48.51 ms │     no change │
│ QQuery 41    │    44.18 ms │    43.08 ms │     no change │
│ QQuery 42    │    39.09 ms │    38.68 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 53413.35ms │
│ Total Time (prune-rg)   │ 53611.67ms │
│ Average Time (HEAD)     │  1242.17ms │
│ Average Time (prune-rg) │  1246.78ms │
│ Queries Faster          │          1 │
│ Queries Slower          │          0 │
│ Queries with No Change  │         42 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃  prune-rg ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 100.52 ms │ 100.26 ms │    no change │
│ QQuery 2     │  21.21 ms │  21.35 ms │    no change │
│ QQuery 3     │  32.52 ms │  32.87 ms │    no change │
│ QQuery 4     │  19.08 ms │  18.72 ms │    no change │
│ QQuery 5     │  51.07 ms │  50.15 ms │    no change │
│ QQuery 6     │  11.90 ms │  12.19 ms │    no change │
│ QQuery 7     │  85.38 ms │  89.71 ms │ 1.05x slower │
│ QQuery 8     │  24.32 ms │  25.05 ms │    no change │
│ QQuery 9     │  53.59 ms │  54.12 ms │    no change │
│ QQuery 10    │  43.80 ms │  43.31 ms │    no change │
│ QQuery 11    │  11.57 ms │  11.31 ms │    no change │
│ QQuery 12    │  35.33 ms │  34.53 ms │    no change │
│ QQuery 13    │  25.59 ms │  26.29 ms │    no change │
│ QQuery 14    │   9.80 ms │   9.68 ms │    no change │
│ QQuery 15    │  18.63 ms │  19.74 ms │ 1.06x slower │
│ QQuery 16    │  19.16 ms │  18.61 ms │    no change │
│ QQuery 17    │  97.35 ms │  96.99 ms │    no change │
│ QQuery 18    │ 205.89 ms │ 200.40 ms │    no change │
│ QQuery 19    │  27.10 ms │  26.81 ms │    no change │
│ QQuery 20    │  32.14 ms │  32.06 ms │    no change │
│ QQuery 21    │ 152.12 ms │ 148.42 ms │    no change │
│ QQuery 22    │  15.22 ms │  15.37 ms │    no change │
└──────────────┴───────────┴───────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 1093.28ms │
│ Total Time (prune-rg)   │ 1087.91ms │
│ Average Time (HEAD)     │   49.69ms │
│ Average Time (prune-rg) │   49.45ms │
│ Queries Faster          │         0 │
│ Queries Slower          │         2 │
│ Queries with No Change  │        20 │
│ Queries with Failure    │         0 │
└─────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Jun 16, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing prune-rg (936e039) to dd936cb diff
Benchmarks: clickbench_1
Results will be posted here when complete

@alamb
Contributor

alamb commented Jun 16, 2025

🤖: Benchmark completed

Details

Comparing HEAD and prune-rg
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃    prune-rg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    48.55 ms │    48.75 ms │     no change │
│ QQuery 1     │    74.22 ms │    74.11 ms │     no change │
│ QQuery 2     │   109.42 ms │   109.88 ms │     no change │
│ QQuery 3     │   129.53 ms │   122.61 ms │ +1.06x faster │
│ QQuery 4     │   627.55 ms │   625.42 ms │     no change │
│ QQuery 5     │   849.86 ms │   849.16 ms │     no change │
│ QQuery 6     │    57.05 ms │    56.90 ms │     no change │
│ QQuery 7     │    80.49 ms │    82.69 ms │     no change │
│ QQuery 8     │   879.81 ms │   876.39 ms │     no change │
│ QQuery 9     │  1165.56 ms │  1167.48 ms │     no change │
│ QQuery 10    │   291.55 ms │   293.01 ms │     no change │
│ QQuery 11    │   318.77 ms │   322.61 ms │     no change │
│ QQuery 12    │   854.77 ms │   844.91 ms │     no change │
│ QQuery 13    │  1228.41 ms │  1205.05 ms │     no change │
│ QQuery 14    │   795.93 ms │   780.73 ms │     no change │
│ QQuery 15    │   809.83 ms │   797.08 ms │     no change │
│ QQuery 16    │  1624.25 ms │  1631.25 ms │     no change │
│ QQuery 17    │  1610.43 ms │  1592.79 ms │     no change │
│ QQuery 18    │  2880.16 ms │  2972.94 ms │     no change │
│ QQuery 19    │   126.17 ms │   122.33 ms │     no change │
│ QQuery 20    │  1168.29 ms │  1145.79 ms │     no change │
│ QQuery 21    │  1332.98 ms │  1326.28 ms │     no change │
│ QQuery 22    │  2301.13 ms │  2296.25 ms │     no change │
│ QQuery 23    │  7739.93 ms │  7786.37 ms │     no change │
│ QQuery 24    │   480.87 ms │   468.63 ms │     no change │
│ QQuery 25    │   407.47 ms │   407.81 ms │     no change │
│ QQuery 26    │   538.00 ms │   537.92 ms │     no change │
│ QQuery 27    │  1622.71 ms │  1634.71 ms │     no change │
│ QQuery 28    │ 12496.80 ms │ 12414.28 ms │     no change │
│ QQuery 29    │   555.02 ms │   572.82 ms │     no change │
│ QQuery 30    │   778.06 ms │   776.41 ms │     no change │
│ QQuery 31    │   851.43 ms │   834.85 ms │     no change │
│ QQuery 32    │  2531.83 ms │  2507.16 ms │     no change │
│ QQuery 33    │  3255.20 ms │  3232.13 ms │     no change │
│ QQuery 34    │  3300.27 ms │  3271.05 ms │     no change │
│ QQuery 35    │  1250.13 ms │  1217.49 ms │     no change │
│ QQuery 36    │   173.22 ms │   169.50 ms │     no change │
│ QQuery 37    │   101.32 ms │   101.13 ms │     no change │
│ QQuery 38    │   170.56 ms │   167.80 ms │     no change │
│ QQuery 39    │   251.74 ms │   251.91 ms │     no change │
│ QQuery 40    │    87.40 ms │    89.54 ms │     no change │
│ QQuery 41    │    86.72 ms │    84.18 ms │     no change │
│ QQuery 42    │    75.34 ms │    77.39 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)       │ 56118.72ms │
│ Total Time (prune-rg)   │ 55947.49ms │
│ Average Time (HEAD)     │  1305.09ms │
│ Average Time (prune-rg) │  1301.10ms │
│ Queries Faster          │          1 │
│ Queries Slower          │          0 │
│ Queries with No Change  │         42 │
│ Queries with Failure    │          0 │
└─────────────────────────┴────────────┘

@alamb
Contributor

alamb commented Jun 17, 2025

> Do we expect the benchmarks to show anything? I don't think they're using dynamic filters right? Maybe we need to merge #15770 and then we can benchmark this?

I want to make sure the overhead of checking the predicates on each incoming batch didn't slow things down

@adriangb
Contributor Author

> > Do we expect the benchmarks to show anything? I don't think they're using dynamic filters right? Maybe we need to merge #15770 and then we can benchmark this?
>
> I want to make sure the overhead of checking the predicates on each incoming batch didn't slow things down

If you check the code, that only happens if there are dynamic filters. And since there are none right now, it becomes just an `if let Some(file_pruner) = file_pruner.as_ref()` check, which is going to be too cheap to show up in benchmarks.

The only way to actually verify will be to merge #15770 and then compare this PR to main.

@xudong963
Member

@adriangb I'll review tomorrow; I have some other things to do today.

@adriangb
Contributor Author

@alamb sorry for the ping but would you mind running topk_tpch on here?

@alamb
Contributor

alamb commented Jun 17, 2025

> @alamb sorry for the ping but would you mind running topk_tpch on here?

LOL I need to make a webpage (or give you access to the server to queue the jobs yourself)

@alamb
Contributor

alamb commented Jun 17, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing prune-rg (54b3bbf) to 1429c92 diff
Benchmarks: topk_tpch
Results will be posted here when complete

@Dandandan
Contributor

> > @alamb sorry for the ping but would you mind running topk_tpch on here?
>
> LOL I need to make a webpage (or give you access to the server to queue the jobs yourself)

I was reading that Arrow has requested AWS credits https://lists.apache.org/thread/q33oofy2v3zpg9s9l8o0w68rmjr3ocsv . Perhaps we can utilize one of those for that use case.

@alamb
Contributor

alamb commented Jun 17, 2025

🤖: Benchmark completed

Details

Comparing HEAD and prune-rg
--------------------
Benchmark run_topk_tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃  prune-rg ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │  26.17 ms │  33.18 ms │  1.27x slower │
│ Q2           │  38.44 ms │  34.00 ms │ +1.13x faster │
│ Q3           │  97.05 ms │ 101.20 ms │     no change │
│ Q4           │  36.71 ms │  40.95 ms │  1.12x slower │
│ Q5           │  25.59 ms │  32.41 ms │  1.27x slower │
│ Q6           │  54.01 ms │  54.31 ms │     no change │
│ Q7           │ 146.60 ms │ 137.02 ms │ +1.07x faster │
│ Q8           │  79.27 ms │  88.55 ms │  1.12x slower │
│ Q9           │ 102.21 ms │ 112.97 ms │  1.11x slower │
│ Q10          │ 174.11 ms │ 188.49 ms │  1.08x slower │
│ Q11          │ 103.82 ms │  91.26 ms │ +1.14x faster │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Benchmark Summary       ┃          ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Total Time (HEAD)       │ 883.98ms │
│ Total Time (prune-rg)   │ 914.34ms │
│ Average Time (HEAD)     │  80.36ms │
│ Average Time (prune-rg) │  83.12ms │
│ Queries Faster          │        3 │
│ Queries Slower          │        6 │
│ Queries with No Change  │        2 │
│ Queries with Failure    │        0 │
└─────────────────────────┴──────────┘
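
For readers checking the "Change" column by hand, the sketch below reproduces the labels from the two timings. The function name and the 5% noise tolerance are assumptions; the tolerance is merely consistent with the rows shown here and is not taken from the benchmark script.

```rust
/// Describe how `branch_ms` compares with `head_ms`.
/// `tolerance` is the relative band treated as noise (assumed to be 0.05).
pub fn describe_change(head_ms: f64, branch_ms: f64, tolerance: f64) -> String {
    if branch_ms > head_ms {
        // Branch took longer: report how many times slower it is.
        let ratio = branch_ms / head_ms;
        if ratio - 1.0 <= tolerance {
            "no change".to_string()
        } else {
            format!("{:.2}x slower", ratio)
        }
    } else {
        // Branch was at least as fast: report how many times faster it is.
        let ratio = head_ms / branch_ms;
        if ratio - 1.0 <= tolerance {
            "no change".to_string()
        } else {
            format!("+{:.2}x faster", ratio)
        }
    }
}
```

For example, Q1 gives 33.18 / 26.17 ≈ 1.27 (slower) and Q2 gives 38.44 / 34.00 ≈ 1.13 (faster), while Q3 at 101.20 / 97.05 ≈ 1.04 falls inside the assumed tolerance.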

@adriangb
Contributor Author

adriangb commented Jun 17, 2025

> > > @alamb sorry for the ping but would you mind running topk_tpch on here?
> >
> > LOL I need to make a webpage (or give you access to the server to queue the jobs yourself)
>
> I was reading that Arrow has requested / received AWS credits https://lists.apache.org/thread/q33oofy2v3zpg9s9l8o0w68rmjr3ocsv . Perhaps we can utilize one of those for that use case.

I tried to ask GCS for credits... they didn't seem excited and ultimately came up with nothing.

@adriangb
Contributor Author

> 🤖: Benchmark completed

Interesting results. I'm inclined to believe that the speedups and slowdowns are both real. We'll have to think about this a bit more.

@adriangb adriangb force-pushed the prune-rg branch 2 times, most recently from 3d6a97a to ebe4196 Compare June 17, 2025 21:39
@adriangb
Contributor Author

adriangb commented Jun 17, 2025

@Dandandan @alamb I pushed ebe4196 which adds a very cheap way to track changes to a PhysicalExpr if it's dynamic. I think this will be useful in several places but immediately it gives us the ability to check if the dynamic predicate has been updated before doing the work of re-calculating the pruning predicate, etc.

I'm still not sure it will be cheap enough, but I think it's worth a shot if we can re-run the benches.

It'll be a shame if we can't figure this out. I think that if we are able to get this working, it mostly negates the unfortunate situation right now where a TopK query might be faster with less parallelism / partitioning upfront. With this change you still open the files but are able to quickly bail out as opposed to having to stream the whole thing.

@adriangb
Contributor Author

I think this will require @Dandandan 's suggestion of only updating the filters if the new ones are more selective: #16433.

Right now since we always update the filters -> it always bumps the generation -> we always re-check.
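
One way to picture the "only update if more selective" idea is the sketch below. It is a hypothetical model, not the design from #16433: the filter is reduced to a single threshold for a predicate of the form `value > threshold`, so a larger threshold is strictly more selective, and the generation is bumped only in that case.

```rust
use std::sync::Mutex;

/// Hypothetical dynamic filter of the form `value > threshold`.
pub struct ThresholdFilter {
    /// (threshold, generation), guarded together so they stay consistent.
    state: Mutex<(i64, u64)>,
}

impl ThresholdFilter {
    pub fn new(threshold: i64) -> Self {
        Self {
            state: Mutex::new((threshold, 0)),
        }
    }

    /// Apply `candidate` only if it is more selective than the current
    /// threshold. Returns true (and bumps the generation) only then.
    pub fn update_if_more_selective(&self, candidate: i64) -> bool {
        let mut state = self.state.lock().unwrap();
        if candidate > state.0 {
            state.0 = candidate;
            state.1 += 1;
            true
        } else {
            false
        }
    }

    pub fn threshold(&self) -> i64 {
        self.state.lock().unwrap().0
    }

    pub fn generation(&self) -> u64 {
        self.state.lock().unwrap().1
    }
}
```

With this shape, consumers that compare generations would re-check files only when the filter could actually prune more than before.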

@adriangb adriangb changed the title from "Prune files during streams and avoid additional pruning if there are no dynamic filters" to "Skip re-pruning based on partition values and file level stats if there are no dynamic filters" Jun 19, 2025
@adriangb
Contributor Author

@alamb I reverted the filtering during the stream so this should now do strictly less work 😄

Labels
datasource Changes to the datasource crate physical-expr Changes to the physical-expr crates
4 participants