Why are the changes necessary?
M4 query performance can be improved by a factor of up to 4 using materialized CTEs.
Benchmark Example
Consider the following benchmark example:
Executing the M4 query for the line chart above on a MacBook Pro (M1 Pro) produces the following timings:
What does this pull request cover?
Overview
Add CTEs for `SetOperation`
The `Query.with` method has been redesigned to support the new M4 query formulation with CTEs. It is now also supported for `SetOperation` and introduces a new intermediate `WithClause` object, from which both `SelectQuery` and `SetOperation` objects can be instantiated.
Here is an example demonstrating the use of a `WITH` clause together with `unionAll`, which was previously not possible.
To achieve this, the `with` instance method of `SelectQuery` has been moved to the `Query` class.
Update of M4 Query using CTEs
The M4 query has been improved by employing CTEs instead of subqueries. This approach allows intermediate results to be cached and reused. DuckDB automatically materializes the input query based on heuristics, specifically when
See also here.
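
To make the CTE-based formulation concrete, here is a minimal sketch of an M4-style query in which all four extrema branches read from one shared CTE and are combined with `UNION ALL`. The helper `m4WithCte`, its parameters, and the CTE name `binned` are hypothetical and used only for illustration; this is not Mosaic's actual implementation, and it builds a plain SQL string rather than using the `Query` API.

```javascript
// Hypothetical helper (not part of Mosaic): builds an M4-style SQL string.
// The input is wrapped in a single CTE that every branch reads from,
// instead of repeating the input as a subquery in each branch.
function m4WithCte({ table, x, y, binExpr }) {
  // The CTE computes the bin (e.g. the pixel column) once per row.
  const cte = `SELECT ${x} AS x, ${y} AS y, ${binExpr} AS bin FROM ${table}`;

  // One branch per extremum in each bin: first x, last x, min y, max y.
  // arg_min(a, b) / arg_max(a, b) are DuckDB aggregates that return `a`
  // at the row where `b` is smallest / largest.
  const branches = [
    'SELECT min(x) AS x, arg_min(y, x) AS y FROM binned GROUP BY bin',
    'SELECT max(x) AS x, arg_max(y, x) AS y FROM binned GROUP BY bin',
    'SELECT arg_min(x, y) AS x, min(y) AS y FROM binned GROUP BY bin',
    'SELECT arg_max(x, y) AS x, max(y) AS y FROM binned GROUP BY bin',
  ];

  return `WITH binned AS (${cte}) ${branches.join(' UNION ALL ')} ORDER BY x`;
}

// Example usage with assumed table and column names.
console.log(
  m4WithCte({ table: 'series', x: 't', y: 'v', binExpr: 'floor(t / 10)' })
);
```

Because `binned` is referenced four times, a materialized CTE lets the engine compute the input once and reuse it across all branches.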
Caveats
`Query.subqueries`
By adding a `WITH` clause to `SetOperation`, the `subqueries` methods probably need refactoring. Queries from `SetOperation` can now have subqueries coming from CTEs, which are not yet considered. There is also an existing comment in `SetOperation` noting that the implementation of `subqueries` is not optimal and potentially incomplete. This method has been left unchanged.
Support of `MATERIALIZED`/`NOT MATERIALIZED`
Materializing does not only help grouped aggregation queries; it can improve performance for ungrouped queries as well.
Consider the Overview/Detail chart from the Mosaic examples, but with 1 million data points: materializing would speed up updates significantly (by a factor of up to 2). However, as there is no grouped aggregation, automatic materialization is not applied here. Thus, adding the keywords `MATERIALIZED` (and `NOT MATERIALIZED` for completeness) to `WithClauseNode` would make sense.
As the queries of a `WITH` clause are currently passed as (name, query) pairs using objects, the syntax does not allow adding an extra parameter without compromising the simplicity of the API. However, before passing a query to `WITH`, an optional boolean flag could be set on the `Query` object to force the use of `MATERIALIZED` or `NOT MATERIALIZED`. This flag could then be passed to `WithClauseNode` to create the appropriate clause, allowing explicit control over `MATERIALIZED` usage in M4 queries. This has not been implemented yet.
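
The proposal above could look roughly like the following sketch. Everything here is an assumption for illustration: `WithClauseNodeSketch` is a stand-in and not the real `WithClauseNode`, and the `materialized` property on the query object is the hypothetical flag described above, which does not exist yet. The emitted `AS MATERIALIZED` / `AS NOT MATERIALIZED` forms follow DuckDB's CTE syntax.

```javascript
// Hypothetical stand-in for WithClauseNode (not Mosaic's implementation).
// `query` is assumed to be an object with a `sql` string and an optional
// boolean `materialized` flag:
//   true      -> force MATERIALIZED
//   false     -> force NOT MATERIALIZED
//   undefined -> emit no keyword and leave the decision to the database
class WithClauseNodeSketch {
  constructor(name, query) {
    this.name = name;
    this.query = query;
  }

  toString() {
    const { sql, materialized } = this.query;
    let hint = '';
    if (materialized === true) hint = ' MATERIALIZED';
    else if (materialized === false) hint = ' NOT MATERIALIZED';
    return `"${this.name}" AS${hint} (${sql})`;
  }
}

// Example usage: the (name, query) pair stays unchanged, the flag
// travels on the query object itself.
const forced = new WithClauseNodeSketch('input', {
  sql: 'SELECT * FROM series',
  materialized: true,
});
console.log(`WITH ${forced} SELECT count(*) FROM "input"`);
```

Keeping the flag on the query object means the (name, query) object syntax of `WITH` would not need an extra parameter.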