Description
Is your feature request related to a problem or challenge?
The basic aggregate functions like COUNT
and SUM
in DataFusion are very fast (see Apache DataFusion is now the fastest single node engine for querying Apache Parquet files)
However, many of the other aggregate functions are not particularly fast, and this shows up specifically on some of the H20 benchmarks
We saw this in the results in the 2024 DataFusion SIGMOD paper
(BTW we have made median faster)
@MrPowers has also observed similar results on discord (link):
DataFusion was added to the h2o benchmarks (which are now maintained by duckdb) and DataFusion performs quite well for most of the "basic" groupby queries. It performs poorly for some of the advanced questions on the 50GB dataset. Here are the results:
https://duckdblabs.github.io/db-benchmark/
See his version of the benchmarks here
https://github.com/MrPowers/mrpowers-benchmarks
Testing
Functions
Other improvements
Describe the solution you'd like
DataFusion has two APIs ways to implement Aggregate functions like SUM
and COUNT
- Easy (but slow) way:
Accumulator
(api docs) - Fast (but complicated way):
GroupsAccumulator
(api docs)
The basic aggregates are implemented using GroupsAccumulator
and are part of DataFusions performance
This ticket tracks the effort to improve the performance of these for these "more advanced" aggregate functions, likely by implementing GroupsAccumulator
Describe alternatives you've considered
For each function listed above, ideally we would:
- Add a new benchmark. Either add a specific one for H20 benchmarks or add a query to the ClickBench extended benchmark Documentation Here in one PR
- Implement
GroupsAccumulator
for the relevant aggregate function in a second PR (along with tests for correctness). We would use the benchmark to verify the performance
Here is a pretty good example of how @eejbyfeldt did this for STDDEV
:
Additional context
No response