Adaptive broadcast to partitioned #23206

gaurav8297 · 2024-08-31T06:22:11Z

Description

This rule converts a broadcast join to a partitioned join at runtime which can significantly reduce memory usage and, in some cases, improve performance.

Queries like the one below, which fail when the broadcast build side does not fit in memory, can be fixed at runtime by this rule.

Example (TPCDS):

set session join_distribution_type='BROADCAST';

SELECT
sum(ss.ss_quantity), sum(ss.ss_list_price), sum(ss.ss_coupon_amt), sum(cs.cs_wholesale_cost), sum(cs.cs_list_price),
sum(cs.cs_sales_price), sum(cs.cs_ext_sales_price),
sum(cs.cs_net_paid_inc_tax), sum(cs.cs_net_paid_inc_ship),
sum(cs.cs_net_profit), sum(cs.cs_ext_tax), sum(cs.cs_coupon_amt),
sum(cs.cs_ext_ship_cost), sum(cs.cs_net_paid), sum(cs.cs_net_paid_inc_tax),
sum(cs.cs_net_paid_inc_ship), sum(cs.cs_call_center_sk),
sum(cs.cs_net_paid_inc_ship_tax), sum(cs.cs_net_profit),
sum(cs.cs_call_center_sk), sum(cs.cs_warehouse_sk), sum(cs.cs_bill_hdemo_sk)
FROM store_sales AS ss
LEFT JOIN catalog_sales AS cs ON (ss.ss_customer_sk=cs.cs_bill_customer_sk AND ss.ss_sold_date_sk = cs.cs_sold_date_sk);

Before:

The query fails due to high memory usage:

Query 20240808_203757_00002_ivgw3, FAILED, 8 nodes
Splits: 4,410 total, 4,333 done (98.25%)
1:55 [13.8B rows, 694GB] [120M rows/s, 6.03GB/s]

Query 20240808_203757_00002_ivgw3 failed: Cannot allocate enough memory for task 
20240808_203757_00002_ivgw3.1.44.0. Reported peak memory reservation: 77323190690B. Maximum possible reservation: 77309411328B.

After:

Query 20240808_204327_00005_ivgw3, FINISHED, 8 nodes
Splits: 9,818 total, 9,818 done (100.00%)
41.83 [4.32B rows, 59.7GB] [103M rows/s, 1.43GB/s]
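
As a rough illustration of why the partitioned distribution lowers per-node memory: a broadcast join keeps a full copy of the build side on every node, while a partitioned join keeps roughly one share per node. The sketch below is plain arithmetic, not Trino code. The class and method names are hypothetical, the 80 GiB build size is a made-up figure, and the partitioned estimate assumes the join keys hash evenly (no skew).

```java
// Illustrative arithmetic only; names here are hypothetical, not Trino APIs.
final class BuildSideMemoryEstimate
{
    private BuildSideMemoryEstimate() {}

    // Broadcast: every node holds a full copy of the build side.
    static long broadcastBytesPerNode(long buildSideBytes, int nodeCount)
    {
        checkArguments(buildSideBytes, nodeCount);
        return buildSideBytes;
    }

    // Partitioned: each node holds about 1/nodeCount of the build side,
    // assuming an even hash distribution of the join keys (an assumption).
    static long partitionedBytesPerNode(long buildSideBytes, int nodeCount)
    {
        checkArguments(buildSideBytes, nodeCount);
        return (buildSideBytes + nodeCount - 1) / nodeCount; // ceiling division
    }

    private static void checkArguments(long buildSideBytes, int nodeCount)
    {
        if (buildSideBytes < 0 || nodeCount <= 0) {
            throw new IllegalArgumentException("buildSideBytes must be >= 0 and nodeCount must be > 0");
        }
    }

    public static void main(String[] args)
    {
        long buildSideBytes = 80L * 1024 * 1024 * 1024; // hypothetical 80 GiB build side
        int nodes = 8; // node count taken from the example run
        System.out.println("broadcast per node:   " + broadcastBytesPerNode(buildSideBytes, nodes));
        System.out.println("partitioned per node: " + partitionedBytesPerNode(buildSideBytes, nodes));
    }
}
```

Skewed join keys would make the partitioned figure uneven across nodes, so this is a best-case estimate.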

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

gaurav8297 · 2024-09-05T07:02:32Z

@losipiuk I'm still adding more tests. But you can take a look.

losipiuk · 2024-09-09T09:07:08Z

...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java

+ *  RemoteExchangeNodes that can be reused instead of adding a new one. For instance, this will be helpful in
+ *  cases where either side of the join has union nodes.
+ */
+public class AdaptiveBroadcastToPartitionedJoin


General request, for this and the adaptive reordering PRs:
Can you add a means to track how often the optimizers trigger?
Let's have a metric for each optimizer that is incremented whenever the rule triggers.
It would also be great to list all adaptive optimizers that triggered for a query, with some context (which stage, etc.), in the query completion event, so we can do offline analysis.

sure, I'll create a PR for this. We already record metrics through JMX as part of the planner. However, we need to expose that through the query completion event.
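
A minimal sketch of the per-rule counter requested above, assuming rules are identified by name. `AdaptiveRuleTriggerStats` and its methods are hypothetical and are not the existing JMX planner metrics. Per-stage context and the wiring into the query completion event are left out.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper; not part of Trino.
final class AdaptiveRuleTriggerStats
{
    private final Map<String, LongAdder> triggers = new ConcurrentHashMap<>();

    // Called each time an adaptive rule changes the plan.
    void recordTrigger(String ruleName)
    {
        triggers.computeIfAbsent(ruleName, ignored -> new LongAdder()).increment();
    }

    // Returns 0 for rules that never triggered.
    long getTriggerCount(String ruleName)
    {
        LongAdder counter = triggers.get(ruleName);
        return counter == null ? 0 : counter.sum();
    }

    // Sorted point-in-time copy, e.g. for attaching to a completion event.
    Map<String, Long> snapshot()
    {
        Map<String, Long> result = new TreeMap<>();
        triggers.forEach((rule, counter) -> result.put(rule, counter.sum()));
        return result;
    }
}
```

`LongAdder` is used here because counters like this tend to be written far more often than they are read.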

losipiuk · 2024-09-09T09:37:03Z

...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java

+        if (mustReplicate(node, context)) {
+            return Result.empty();
+        }
+        boolean isExtraRemoteExchangeNeededAtProbeSide = captures.getOptional(LEFT_EXCHANGE_NODE).isEmpty();


How does having a remote exchange on the left side ensure that we do not need to repartition again?
You are checking whether there is a remote exchange with FIXED_ARBITRARY_DISTRIBUTION. Where do we guarantee that the data is actually distributed according to the join keys?

Oh, OK, the meaning of that is:

are we changing an existing exchange or adding a new one.

added a comment
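
The meaning settled on in this thread can be sketched as follows: the flag says nothing about how the data is currently distributed, only whether the rule rewrites a captured exchange or has to insert a new one. The types below are hypothetical stand-ins (a `String` in place of the captured exchange node), not the rule's actual code.

```java
import java.util.Optional;

// Hypothetical stand-in for the decision discussed above; not Trino code.
final class ProbeSideExchangePlan
{
    enum Action
    {
        REWRITE_EXISTING_EXCHANGE,
        ADD_NEW_EXCHANGE
    }

    private ProbeSideExchangePlan() {}

    // capturedLeftExchange plays the role of captures.getOptional(LEFT_EXCHANGE_NODE).
    static Action chooseAction(Optional<String> capturedLeftExchange)
    {
        boolean isExtraRemoteExchangeNeededAtProbeSide = capturedLeftExchange.isEmpty();
        return isExtraRemoteExchangeNeededAtProbeSide
                ? Action.ADD_NEW_EXCHANGE
                : Action.REWRITE_EXISTING_EXCHANGE;
    }
}
```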

losipiuk · 2024-09-09T09:46:09Z

...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java

+    {
+        DataSize joinMaxBroadcastTableSize = getJoinMaxBroadcastTableSize(context.getSession());
+
+        PlanNodeWithCost replicatedJoinCost = getJoinNodeWithCost(


Remind me: where are we taking into account that some progress has already been made with the current plan shape, and that if we replan we need to start from scratch?

Right now we only consider the cost of adding a new exchange.

We consider the subplan finished if it's 20% done and do not change it, and if it's less than 20%, we will always restart. There's an open issue around handling speculative execution with adaptive planning: #23180
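
The 20% cut-off described above can be sketched as a simple predicate. `ReplanThreshold` and `canReplan` are hypothetical names. Treating exactly 20% as finished follows the wording of the comment, and measuring progress as a fraction between 0 and 1 is an assumption.

```java
// Hypothetical sketch of the progress cut-off; not Trino code.
final class ReplanThreshold
{
    // A subplan that is at least this far along is treated as finished.
    static final double FINISHED_PROGRESS_THRESHOLD = 0.20;

    private ReplanThreshold() {}

    // Returns true when the subplan may still be changed (and restarted).
    static boolean canReplan(double completedFraction)
    {
        if (Double.isNaN(completedFraction) || completedFraction < 0 || completedFraction > 1) {
            throw new IllegalArgumentException("completedFraction must be in [0, 1]");
        }
        return completedFraction < FINISHED_PROGRESS_THRESHOLD;
    }
}
```

This predicate does not weigh the work already done against the cost of restarting, which is the gap the reviewer is asking about.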

gaurav8297 · 2024-09-12T05:48:04Z

core/trino-main/src/main/java/io/trino/sql/planner/PlanFragmenter.java

@@ -484,8 +486,12 @@ public PlanNode visitRemoteSource(RemoteSourceNode node, RewriteContext<Fragment
            else if (node.getExchangeType() == ExchangeNode.Type.REPARTITION) {
                for (SubPlan child : completedChildren) {
                    PartitioningScheme partitioningScheme = child.getFragment().getOutputPartitioningScheme();
+                    PartitioningHandle handle = partitioningScheme.getPartitioning().getHandle();
+                    if (handle.equals(FIXED_BROADCAST_DISTRIBUTION)) {


@losipiuk Can you take a look at this? This seems hacky but I'm not sure what's the best way to do this.

The root cause is that the handle describes not only how data is distributed but also how it is consumed, which is not important if you look at the fragment output.
But I do not see an easy way out of that without turning lots of things around.

Maybe this is the best we can get.

Can you explain more about the proposed change with an extra PlanNode that can be used for adaptive planning in place of RemoteSourceNode? How does that simplify things?

cc: @martint

Another way to solve the problem is to introduce a new plan node specifically for the AdaptivePlan source. Instead of using RemoteSourceNode, we can create a new node that includes additional information like partitionHandle, which can be used in the PlanFragmenter. This extra information would be added by the adaptive planner rules. By doing this, we eliminate the need for the if condition, simplifying the PlanFragmenter code.

Additionally, we currently use RemoteSourceNode during adaptive planning, which it is not intended for.

cc @martint @losipiuk

Yeah - that sounds fine to me.

losipiuk · 2024-09-13T09:54:49Z

...in/src/main/java/io/trino/sql/planner/iterative/rule/AdaptiveBroadcastToPartitionedJoin.java

+            if (node.getScope() == LOCAL) {
+                return rewriteSources(this, node, globalContext);
+            }
+            verify(node.getScope() == REMOTE && node.getType() == REPLICATE);


Should you also verify that node.partitioningColumns was not set, or that using buildSymbols as partitioning columns is compatible with what we had previously (buildSymbols would need to be a subset of the previous set)?
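
The check suggested here could look roughly like the sketch below. Symbols are modelled as plain strings and `PartitioningCompatibility` is a hypothetical helper, so this only illustrates the subset condition, not the rule's actual types.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the suggested verification; not Trino code.
final class PartitioningCompatibility
{
    private PartitioningCompatibility() {}

    // Compatible when the exchange had no partitioning columns set,
    // or when every build symbol was already one of them.
    static boolean isCompatible(List<String> existingPartitioningColumns, List<String> buildSymbols)
    {
        if (existingPartitioningColumns.isEmpty()) {
            return true;
        }
        Set<String> existing = new HashSet<>(existingPartitioningColumns);
        return existing.containsAll(buildSymbols);
    }
}
```

In the rule, a condition like this would sit next to the existing `verify(...)` call on the exchange scope and type.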

github-actions · 2024-10-21T17:02:58Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions · 2024-11-12T17:03:11Z

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

mosabua · 2024-11-13T19:00:16Z

Added performance label @martint fyi

Also please reopen if you plan to continue @gaurav8297 and @losipiuk and add stale-ignore label

cla-bot bot added the cla-signed label Aug 31, 2024

gaurav8297 force-pushed the gaurav8297/adaptive_broadcast_to_partitioned branch 2 times, most recently from 2296bcb to 857c672 Compare September 5, 2024 06:56

gaurav8297 changed the title ~~[WIP] Adaptive broadcast to partitioned~~ Adaptive broadcast to partitioned Sep 5, 2024

gaurav8297 marked this pull request as ready for review September 5, 2024 06:57

gaurav8297 force-pushed the gaurav8297/adaptive_broadcast_to_partitioned branch from 857c672 to 3631135 Compare September 5, 2024 06:59

gaurav8297 requested a review from losipiuk September 5, 2024 07:02

gaurav8297 force-pushed the gaurav8297/adaptive_broadcast_to_partitioned branch 4 times, most recently from 0d9a9c0 to 5aa5d0f Compare September 5, 2024 20:17

losipiuk reviewed Sep 9, 2024

Fix OrPattern to return on first match

909fca7

gaurav8297 force-pushed the gaurav8297/adaptive_broadcast_to_partitioned branch 2 times, most recently from 217f09e to 8a157c7 Compare September 12, 2024 05:42

gaurav8297 commented Sep 12, 2024

Add AdaptiveBroadcastToPartitionedJoin

7b7b18a

gaurav8297 force-pushed the gaurav8297/adaptive_broadcast_to_partitioned branch from 8a157c7 to 7b7b18a Compare September 12, 2024 06:01

losipiuk reviewed Sep 13, 2024

github-actions bot added the stale label Oct 21, 2024

github-actions bot closed this Nov 12, 2024

mosabua added the performance label Nov 13, 2024

losipiuk reopened this Nov 13, 2024

losipiuk added the stale-ignore label Nov 13, 2024