Skip to content

Improve performance of correlated NOT EXISTS queries #21859

Open
@sopel39

Description

@sopel39

Correlated NOT EXISTS queries like

SELECT * FROM table_a a WHERE NOT EXISTS (SELECT * FROM table_b b WHERE a.x = b.y AND a.i < b.j)

are rewritten to:

Filter(not exists)
|
Projection(exists := COALESCE(subquery_or, false))
|
Aggregation([a.*], subquery_or := boolean_or(subquery_true))
|
LeftJoin[a.x = b.y AND a.i < b.j]
|
---- Probe
|
---- Project[b.*, subquery_true := TRUE] -- Build

The LeftJoin in the plan above will produce row for every match. However, only single match is required to determine that exists == true for particular probe row.

To improve this we could add a new JoinNode flag, e.g: boolean singleMatch which would stop enumerating matches after a single hit.

Perhaps we could refactor or remove JoinNode#maySkipOutputDuplicates, which doesn't help here because join operator cannot stop enumerating matches when maySkipOutputDuplicates==true since subquery_true is not part of equi-clauses (see

// Implementation of hash join operator may only take advantage of output duplicates insensitive joins when:
)

Alternatively, we could determine that subquery_true is constant expression, which would allow us to skip output duplicates when maySkipOutputDuplicates==true. That might require having "traits" as part of Reference (cc @martint )

cc @Dith3r

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions