Improve M4 Query Performance #691

Merged
merged 4 commits into from
Feb 17, 2025

spren9er
Contributor

@spren9er spren9er commented Feb 16, 2025

Why are the changes necessary?

M4 query performance can be improved by a factor of up to 4 using materialized CTEs.

Benchmark Example

Consider the following benchmark example:

m4query
meta:
  title: NOAA Global Historical Climatology Network Daily (GHCN-D)
  description: >
    Time series showing the daily number of weather stations in use, based
    on a dataset of 20 million daily weather observations from 1890 to 2024.
data:
  noaa: { file: data/noaa_20m.parquet }
plot:
- mark: lineY
  data: { from: noaa }
  x: date
  y: { count:  }
  stroke: steelblue
xLabel: Date
yLabel: Total Stations
width: 800
height: 400

Executing the M4 query for the line chart above on a MacBook Pro (M1 Pro) produces the following timings:

          CTE Usage   Duration
Before    no          1,355 ms
After     yes           362 ms

What does this pull request cover?

Overview

  • Add CTEs for SetOperation
  • Update of M4 Query using CTEs
  • Update Documentation
  • Fix typo

Add CTEs for SetOperation

The Query.with method has been redesigned to support the new M4 query formulation with CTEs. It is now also supported for SetOperation and introduces a new intermediate WithClause object, from which both SelectQuery and SetOperation objects can be instantiated.

Here is an example demonstrating the use of a WITH clause together with unionAll, which was previously not possible.

Query
  .with({
    base: Query.from("table").groupby(...)
  })
  .unionAll(
    Query
      .from("base")
      .select({ x: min("x"), y: argmin("y", "x") })
      .groupby(...),
    Query
      .from("base")
      .select({ x: max("x"), y: argmax("y", "x") })
      .groupby(...)
  )
  .orderby("x")

To achieve this, the with instance method of SelectQuery has been moved to the Query class.
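The relationship between these classes can be illustrated with a toy model. This is only a sketch under stated assumptions: the class names follow the description above, but the bodies, the `attach` and `withPrefix` helpers, and the string-based SQL rendering are invented for illustration and are not Mosaic's actual implementation.

```javascript
// Toy model only: class names mirror the PR text, nothing here is Mosaic's code.
class Query {
  constructor() { this.ctes = []; }

  // Static entry point: returns an intermediate WithClause, not a query.
  static with(ctes) { return new WithClause(Object.entries(ctes)); }
  static from(table) { return new SelectQuery().from(table); }
  static unionAll(...queries) { return new SetOperation('UNION ALL', queries); }

  // Render the WITH prefix shared by both query types (hypothetical helper).
  withPrefix() {
    if (this.ctes.length === 0) return '';
    const list = this.ctes.map(([name, q]) => `"${name}" AS (${q})`);
    return `WITH ${list.join(', ')} `;
  }
}

class SelectQuery extends Query {
  from(table) { this.table = table; return this; }
  select(...cols) { this.cols = cols; return this; }
  toString() {
    const cols = this.cols?.length ? this.cols.join(', ') : '*';
    return `${this.withPrefix()}SELECT ${cols} FROM "${this.table}"`;
  }
}

class SetOperation extends Query {
  constructor(op, queries) { super(); this.op = op; this.queries = queries; }
  toString() {
    const body = this.queries.map(q => `(${q})`).join(` ${this.op} `);
    return `${this.withPrefix()}${body}`;
  }
}

// Intermediate object: holds the CTEs until a concrete query type is chosen.
class WithClause {
  constructor(ctes) { this.ctes = ctes; }
  attach(query) { query.ctes = this.ctes; return query; } // hypothetical helper
  from(table) { return this.attach(Query.from(table)); }
  unionAll(...queries) { return this.attach(Query.unionAll(...queries)); }
}

// Usage: the same WITH clause can now prefix a set operation.
const sql = Query
  .with({ base: Query.from('table') })
  .unionAll(
    Query.from('base').select('min(x) AS x'),
    Query.from('base').select('max(x) AS x')
  )
  .toString();
```

In this toy model, the point of the intermediate object is that `Query.with(...)` does not have to decide up front whether it will produce a select query or a set operation.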

Update of M4 Query using CTEs

The M4 query has been improved by employing CTEs instead of subqueries. This approach allows intermediate results to be cached and reused. DuckDB automatically materializes the input query based on heuristics, specifically when

  1. the CTE performs a grouped aggregation, and
  2. the CTE is queried more than once (here, four times).

See also here.
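To make the "queried four times" point concrete, here is a rough sketch of the SQL shape such an M4 query could take. The helper `m4Sql`, its parameters, and the pixel-binning expression are assumptions made for illustration; the SQL that Mosaic actually generates may differ. It also assumes x is numeric (for example, epoch values) and that `x0` and `x1` bound the visible domain.

```javascript
// Hypothetical helper: builds an M4-style SQL string by hand for illustration.
// Assumes x is numeric (e.g. epoch values) and [x0, x1] is the visible domain.
function m4Sql({ table, x, y, x0, x1, width }) {
  // Map each x value to a pixel column (assumed binning expression).
  const bin = `floor(${width} * (x - ${x0}) / (${x1} - ${x0}))::INTEGER`;

  // Every branch reads from the same CTE, so the CTE is referenced four times.
  const branch = (xExpr, yExpr) =>
    `SELECT ${xExpr} AS x, ${yExpr} AS y FROM base GROUP BY ${bin}`;

  return [
    // The CTE performs the grouped aggregation once.
    `WITH base AS (SELECT ${x} AS x, ${y} AS y FROM ${table} GROUP BY ${x})`,
    [
      branch('min(x)', 'arg_min(y, x)'), // first point in each pixel column
      branch('max(x)', 'arg_max(y, x)'), // last point in each pixel column
      branch('arg_min(x, y)', 'min(y)'), // lowest point in each pixel column
      branch('arg_max(x, y)', 'max(y)')  // highest point in each pixel column
    ].map(q => `(${q})`).join(' UNION ALL '),
    'ORDER BY x'
  ].join(' ');
}

// Example call; the domain bounds are placeholder values.
const m4 = m4Sql({
  table: 'noaa', x: 'date', y: 'count(*)', x0: 0, x1: 1000, width: 800
});
```

Because `base` both aggregates by group and is read by all four branches, it meets the two conditions listed above.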

Caveats

Query.subqueries

Now that SetOperation supports a WITH clause, the subqueries methods probably need refactoring. A SetOperation query can now have subqueries coming from CTEs, which are not yet considered. An existing comment in SetOperation also notes that its implementation of subqueries is not optimal and potentially incomplete. This method has been left unchanged.

Support of MATERIALIZED / NOT MATERIALIZED

Materialization helps not only with grouped aggregation queries; it can improve performance for ungrouped queries as well.

Take the Overview/Detail Chart from the Mosaic examples, scaled up to 1 million data points: materializing would speed up updates significantly (by a factor of up to 2). However, because the query has no grouped aggregation, DuckDB's heuristic does not materialize it automatically. Adding the keywords MATERIALIZED (and NOT MATERIALIZED for completeness) to WithClauseNode would therefore make sense.

Queries are currently passed to the WITH clause as (name, query) pairs in an object, so the syntax leaves no room for an extra parameter without compromising the simplicity of the API. However, an optional boolean flag could be set on the Query object before it is passed to WITH, forcing the use of MATERIALIZED or NOT MATERIALIZED. This flag could then be passed on to WithClauseNode to create the appropriate clause, allowing explicit control over materialization in M4 queries. This has not been implemented yet.
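A minimal sketch of how such a flag might be turned into SQL follows. The `cteSql` function and the `materialized` property are hypothetical names chosen for illustration; neither exists in Mosaic. Only the resulting `AS MATERIALIZED` / `AS NOT MATERIALIZED` syntax comes from DuckDB.

```javascript
// Hypothetical sketch: `materialized` is an assumed optional flag on the
// query object (true, false, or undefined), not an existing Mosaic option.
function cteSql(name, query) {
  const hint = query.materialized === true ? 'MATERIALIZED '
    : query.materialized === false ? 'NOT MATERIALIZED '
    : ''; // undefined: leave the decision to DuckDB's heuristics
  return `"${name}" AS ${hint}(${query.sql})`;
}

// The (name, query) pairs keep their current shape; only the query carries the flag.
const ctes = {
  base: { sql: 'SELECT x, y FROM points', materialized: true }
};
const withSql = 'WITH ' + Object.entries(ctes)
  .map(([name, query]) => cteSql(name, query))
  .join(', ');
```

Keeping the flag on the query object means the object passed to WITH needs no extra parameter.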

@jheer
Member

jheer commented Feb 16, 2025

This looks fantastic, thank you! I don’t think materialization was even available when we first wrote the M4 optimization in vgplot… great to take advantage of this. I’ll try to get to a full review soon.

@jheer
Member

jheer commented Feb 16, 2025

Looks good post-review. I want to revisit subqueries for SetOperation instances, but can do that via a separate PR.

Meanwhile, we can consider explicit MATERIALIZED support as a separate issue altogether -- so feel free to open a feature request issue if this is a functionality you would like to use.

@jheer
Member

jheer commented Feb 16, 2025

(Note: to avoid updating the docs prematurely I will merge this when I start prepping the next release, which should be soon.)

@spren9er
Contributor Author

Great!
I opened a feature request issue here: #693.

@jheer jheer changed the base branch from main to v0.13.0 February 17, 2025 17:07
@jheer jheer merged commit 4b673e9 into uwdata:v0.13.0 Feb 17, 2025
3 checks passed
2 participants