
[EPIC] Improved performance in H2O.ai benchmarks #13548

Open
@alamb

Description


Is your feature request related to a problem or challenge?

The basic aggregate functions like COUNT and SUM in DataFusion are very fast (see "Apache DataFusion is now the fastest single node engine for querying Apache Parquet files").

However, many of the other aggregate functions are not particularly fast, and this shows up specifically in some of the H2O benchmarks.

We saw this in the results in the 2024 DataFusion SIGMOD paper
[screenshot: benchmark results from the paper]

(Note: we have since made median faster.)

@MrPowers has also observed similar results on Discord (link):

DataFusion was added to the H2O benchmarks (which are now maintained by DuckDB) and DataFusion performs quite well for most of the "basic" groupby queries. It performs poorly on some of the advanced questions on the 50GB dataset. Here are the results:
https://duckdblabs.github.io/db-benchmark/

See his version of the benchmarks here
https://github.com/MrPowers/mrpowers-benchmarks

Testing

Functions

Other improvements

Describe the solution you'd like

DataFusion has two APIs for implementing aggregate functions like SUM and COUNT:

  • Easy (but slow) way: Accumulator (api docs)
  • Fast (but complicated) way: GroupsAccumulator (api docs)

The basic aggregates are implemented using GroupsAccumulator, which is a key part of DataFusion's aggregation performance.

This ticket tracks the effort to improve the performance of these "more advanced" aggregate functions, likely by implementing GroupsAccumulator for each of them.

Describe alternatives you've considered

For each function listed above, ideally we would:

  1. Add a new benchmark in one PR: either a specific one for the H2O benchmarks, or a query in the ClickBench extended benchmark (documentation here).
  2. Implement GroupsAccumulator for the relevant aggregate function in a second PR (along with tests for correctness), using the benchmark to verify the performance improvement.

Here is a pretty good example of how @eejbyfeldt did this for STDDEV:

Additional context

No response

Metadata

Labels: enhancement (New feature or request)