-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Adaptive broadcast to partitioned #23206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Adaptive broadcast to partitioned #23206
Conversation
2296bcb
to
857c672
Compare
857c672
to
3631135
Compare
@losipiuk I'm still adding more tests. But you can take a look. |
0d9a9c0
to
5aa5d0f
Compare
* RemoteExchangeNodes that can be reused instead of adding a new one. For instance, this will be helpful in | ||
* cases where either side of the join has union nodes. | ||
*/ | ||
public class AdaptiveBroadcastToPartitionedJoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general request: for this and adaptive reordering PRs.
Can you add means to track how often the optimizers trigger.
Let's have metrics for each optimizer which is increased whenever rule triggers.
Also it would be great to list all adaptive optimizers which triggered for a query, with some context (which stage etc in query completion event, so we can do offline analysis).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, I'll create a PR for this. We already record metrics through JMX as part of the planner. However, we need to expose that through the query completion event.
...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java
Show resolved
Hide resolved
...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java
Outdated
Show resolved
Hide resolved
if (mustReplicate(node, context)) { | ||
return Result.empty(); | ||
} | ||
boolean isExtraRemoteExchangeNeededAtProbeSide = captures.getOptional(LEFT_EXCHANGE_NODE).isEmpty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does having a remote exchange on left side ensures that we do not need to repartition again.
You are checking if there is an remote exchange with FIXED_ARBITRARY_DISTRIBUTION
. Where do we guarantee that data is actually distributed according to join keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh - ok the meaning of that is:
- are we changing existing exchange or adding new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment
{ | ||
DataSize joinMaxBroadcastTableSize = getJoinMaxBroadcastTableSize(context.getSession()); | ||
|
||
PlanNodeWithCost replicatedJoinCost = getJoinNodeWithCost( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reming me where are we taking into account that some of the progress has been made alrady with current plan shape, and if we replan we need to start from scratch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now we only consider the cost of adding a new exchange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We consider the subplan
finished if it's 20% done and do not change it, and if it's less than 20%, we will always restart. There's an open issue around handling speculative execution with adaptive planning: #23180
217f09e
to
8a157c7
Compare
@@ -484,8 +486,12 @@ public PlanNode visitRemoteSource(RemoteSourceNode node, RewriteContext<Fragment | |||
else if (node.getExchangeType() == ExchangeNode.Type.REPARTITION) { | |||
for (SubPlan child : completedChildren) { | |||
PartitioningScheme partitioningScheme = child.getFragment().getOutputPartitioningScheme(); | |||
PartitioningHandle handle = partitioningScheme.getPartitioning().getHandle(); | |||
if (handle.equals(FIXED_BROADCAST_DISTRIBUTION)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@losipiuk Can you take a look at this? This seems hacky but I'm not sure what's the best way to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root cause of that is that handle not only describes how data is distributed, but also how it is consumed, which is not important if you look at fragment output.
But I do not see an easy way out of that without turning lot's of things around.
Maybe this is the best we can get.
Can you explain more the proposed change with extra PlanNode which can be used for adaptive planning in place of RemoteSourceNode
. How does that simplify things?
cc: @martint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to solve the problem by introducing a new plan node specifically for the AdaptivePlan source. Instead of using RemoteSourceNode
, we can create a new node that includes additional information like partitionHandle
, which can be used during the PlanFragmenter
. This extra information would be added through Adaptive planner rules. By doing this, we eliminate the need for an if
condition, simplifying the PlanFragmenter
code.
Additionally, currently, we are using RemoteSourceNode
during AdaptivePlanning which is not intended for that use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - that sounds fine to me.
8a157c7
to
7b7b18a
Compare
if (node.getScope() == LOCAL) { | ||
return rewriteSources(this, node, globalContext); | ||
} | ||
verify(node.getScope() == REMOTE && node.getType() == REPLICATE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should you also verify that node.partitioningColumns
were not set or that using buildSymbols
as partitioning columns is compatible with we had previously (buildSymbols
would need to be subset of previous set)
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time. |
Added performance label @martint fyi Also please reopen if you plan to continue @gaurav8297 and @losipiuk and add stale-ignore label |
Description
Issue: #23182
This rule converts a broadcast join to a partitioned join at runtime which can significantly reduce memory usage and, in some cases, improve performance.
We can fix this kind of query at runtime using this rule.
Example (TPCDS):
Before:
It will fail due to high memory usage
After:
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: