We have found internally that definitions in query optimization have become overloaded. This document defines key names and definitions for concepts that are required in optimization.
Many of the names and definitions will be inspired by the Cascades framework. However, there are a few important differences that need to be addressed considering our memo table will be persistent.
- Memo Table
- Expression
- Group
- Query Plan
- Operator / Plan Node
- Property
- Logical Property
- Physical Property
- ? Derived Property ?
- Rule
In the Cascades framework, an expression is a tree of operators. In optd
, we are instead defining
a logical or physical Query Plan to be a tree or DAG of Operators. An expression in optd
strictly refers to the representation of an operator in the Memo Table, not in query plans.
See the section below on the kinds of expressions for more information.
Most other terms in optd
are similar to Cascades or are self-explanatory.
This section describes names and definitions of concepts related to the memo table.
The memo table is the data structure used for dynamic programming in a top-down plan enumeration search algorithm. The memo table consists of a mutually recursive data structure made up of Expressions and Groups.
An expression is the representation of a non-materialized operator inside of the Memo Table.
There are 2 types of expressions: Relational Expressions and Scalar Expressions. A Relational Expression can be either a Logical Expression or a Physical Expression.
Note that different kinds of expressions can have the same names as Operators or Plan Nodes, but expressions solely indicate non-materialized relational or scalar operators in the Memo Table.
Operators outside of the Memo Table should not be referred to as expressions, and should instead be referred to as Operators or Plan Nodes.
Notably, when we refer to an expression, we are specifically talking about the representation of operators inside the memo table. A logical operator from an incoming logical plan should not be called an Logical Expression, and similarly a physical execution operator in the final output physical plan should also not be called an Physical Expression.
Another way to think about this is that expressions are not materialized, and plan nodes and operators inside query plans are materialized. Operators inside of query plans (both logical and physical) should be referred to as either logical or physical Operators or logical or physical Plan Nodes.
Another key difference between expressions and Plan Nodes is that expressions have 0 or more Group Identifiers as children, and Plan Nodes have 0 or more other Plan Nodes as children.
A relational expression is either a Logical Expression or a Physical Expression.
When we say "relational", we mean representations of operations in the relational algebra of SQL.
Relational expressions differ from Scalar Expressions in that the result of algebraically evaluating a relational expression produces a bag of tuples instead of a single scalar value.
See the following sections for more information.
A logical expression is a version of a Relational Expression.
TODO(connor) Add more details.
Examples of logical expressions include Logical Scan, Logical Join, or Logical Sort expressions (which can just be shorthanded to Scan, Join, or Sort).
A physical expression is a version of a Relational Expression.
TODO(connor) Add more details.
Examples of physical expressions include Table Scan, Index Scan, Hash Join, or Sort Merge Join.
A scalar expression is a version of an Expression.
A scalar expression describes an operation that can be evaluated to obtain a single value. This can also be referred to as a SQL expression, a row expression, or a SQL predicate.
TODO(everyone) Figure out the semantics of what a scalar expression really is.
Examples of scalar expressions include the expressions t1.a < 42
or t1.b = t2.c
.
Two Logical Expressions are equivalent if the Logical Propertys of the two expressions are the same. In other words, the Logical Plans they represent produce the same set of rows and columns.
Two Physical Expressions are equivalent if their Logical and Physical Propertys are the same. In other words, the Physical Plans they represent produce the same set of rows and columns, in the exact same order and distribution.
TODO This next part is unclear?
A Logical Expression with a required Physical Property is equivalent to a Physical Expression if the Physical Expression has the same Logical Property and delivers the Physical Property.
A group is a set of equivalent Expressions.
We follow the definition of groups in the Volcano and Cascades frameworks. From the EQOP Microsoft article (Section 2.2, page 205):
In the memo, each class of equivalent expressions is called an equivalence class or a group, and all equivalent expressions within the class are called group expressions or simply expressions.
A relational group is a set of 1 or more equivalent Logical Expressions and 0 or more equivalent Physical Expressions.
For a given relational group, the first step of optimization is exploration, in which equivalent Logical Expressions are added to the group via Transformation Rules. Once the search space for the group has been exhausted (all possible transformation rules have been applied to all logical expressions in the group), the group can be physically optimized. At this point, the search algorithm will apply Implementation RUles to cost and find the best execution plan.
TODO Add more details.
TODO Add example.
A scalar group consists of equivalent Scalar Expressions.
TODO Add more details.
TODO Add example.
This section describes names and definitions of concepts related to the general plan enumeration and search of optimal query plans.
A query plan is a tree or DAG of relational and scalar operators. We can consider query optimization to be a function from an unoptimized query plan to an optimized query plan. More specifically, the input plan is generally a Logical Plan and the output plan is always a Physical Plan.
We generally consider query plans to either be completely logical or completely physical. However, when dealing with rule matching and rule application to enumerate different but equivalent query plans, we also deal with partially materialized query plans that can be a mix of both logical and physical operators (as well as group identifiers and other scalar operators).
TODO Add more details about partially materialized plans.
A logical plan is a tree or DAG of Logical Operators that can be evaluated to produce a bag of tuples. This can also be referred to as a logical query plan. The Operators that make up this logical plan can be considered logical plan nodes.
A physical plan is a tree or DAG of Physical Operators that can be evaluated by an execution engine to produce a table. This can also be referred to as a physical query plan. The Operators that make up this physical plan can be considered physical plan nodes.
An operator is the materialized version of an Expression. Like expressions, there are both relational operators and scalar operators.
See the following sections for more information.
A relational operator is a node in a Query Plan (which is a tree or DAG), and is the materialized version of a Relational Expression.
A logical operator is a node in a Logical Plan (which is a tree or DAG), and is the materialized version of a Logical Expression.
A physical operator is a node in a Physical Plan (which is a tree or DAG), and is the materialized version of a Physical Expression.
A scalar operator is a node in a Query Plan that describes a scalar expression, and can be considered the materialized version of a Scalar Expression.
A property is metadata computed (and sometimes stored) for a given relational expression.
Properties of an expression may be required by the original SQL query or derived from the Physical Property of one of its inputs.
TODO Add more details.
A logical property describes the structure and content of data returned by a given expression.
Examples: row count, operator type,statistics, whether relational output columns can contain nulls.
TODO Clean up and add more details.
A physical property is a characteristic of an expression that impacts its layout, presentation, or location, but not its logical content.
Examples: order and data distribution.
TODO Clean up and add more details.
A rule transforms a query plan or sub-plan into an equivalent plan.
Rules should have an interface similar to the following:
trait Rule {
/// Checks whether the rule is applicable on the input expression.
fn check_pattern(expr: Expr) -> bool;
/// Transforms the expression into one or more equivalent expressions.
fn transform(expr: Expr) -> Vec<Expr>;
}
TODO Actually figure out the interface for rules since it's probably not going to like that.
TODO Clean up and add more details.
A transformation rule transforms a part of the logical expression into logical expressions.
This is also called a logical to logical transformation in other systems.
TODO Clean up and add more details.
A implementation rule transforms a part of a logical expression to an equivalent physical expression with physical properties.
In Cascades, you don't need to materialize the entire query tree when applying rules. Instead, you can materialize expressions on demand while leaving unrelated parts of the tree as group identifiers.
In other systems, there are physical to physical expression transformation for execution engine specific optimization, physical property enforcement, or distributed planning. At the moment, we are not considering physical-to-physical transformations.
TODO Clean up and add more details.
TODO Write this section.
TODO Write this section.