---
layout: post
title: "Planning AsOf Joins"
author: "Richard Wesley"
excerpt: "AsOf Joins are a great example of how DuckDB can choose different implementations for an expensive operation."
tags: ["deep dive"]
thumb: "/images/blog/thumbs/asof-join.svg"
image: "/images/blog/thumbs/asof-join.png"
---

<p>“I love it when a plan comes together.”<br/>
— Hannibal Smith, <cite>The A-Team</cite></p>

## Introduction

AsOf joins are a very useful operation for temporal analytics.
As the name suggests, they are a kind of lookup: you have a
table of values that change over time, and you want to look up the most recent value
at another set of times.
Put another way, they let you ask _“What was the value of the property **as of this time**?”_

DuckDB [added AsOf joins about 18 months ago]({% post_url 2023-09-15-asof-joins-fuzzy-temporal-lookups %}),
and you can read that post to learn about their semantics.
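
For a quick refresher, here is a minimal sketch in the shape of the benchmark query used later in this post (using the `prices` and `times` tables defined below): for each probe time, the join returns the most recent price at or before that time.

```sql
-- For each probe time, find the latest price at or before it
SELECT t.probe, p.price
FROM times t
ASOF JOIN prices p
    ON t.probe >= p.time;
```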

## What's the Plan?

In that [earlier post]({% post_url 2023-09-15-asof-joins-fuzzy-temporal-lookups %}), I explained why we have a custom operator and syntax for AsOf joins
even though you can implement them in traditional SQL.
The superpower of SQL is that it is _declarative:_
you tell us _what_ you want and we figure out an efficient _how._
By allowing you to say you want an AsOf join, we can think about how to get you results faster!

Still, the [AsOf operator]({% link docs/sql/query_syntax/from.md %}#as-of-joins) has to do a lot of work.
Namely:

* Read all the data in the right side (lookup) table
* Partition it on any equality conditions
* Sort it on the inequality condition
* Repeat the process for the left side (probe) table
* Do a merge join on the two tables that only returns the “most recent” value

That's a lot of data movement!
Plus, if any of the tables are large, we may end up exceeding memory and spilling to disk,
slowing the operation down even further.
Still, as we will see, it is much faster than the plain SQL implementation.

This is such a burden that many databases that support AsOf joins
require the right side table to be partitioned and ordered on any keys you might want to join on.
That doesn't fit well with DuckDB's “friendly SQL” approach,
so (for now) we have to do it every time.

### Let's Get Small

It turns out there is a very common case for AsOf joins where the left table is small.
Suppose you have a year's worth of price data, recorded at high granularity
(e.g., fractions of a second),
but you only want to look up a small number (say, 20) of values: the times you actually bought or sold.

The price table could run to hundreds of millions, if not billions, of rows,
and just sorting it will take a lot of time and memory.
Given how expensive that is, one might wonder whether there is a way to avoid all that sorting.
Happily, the answer is _yes!_

### Simple Joins

Suppose we were to swap the sides of the join, build the old left side as a small right side table,
and stream the huge table through the left side of the join.
We could use the AsOf conditions for the join, and hopefully find a way to throw out the older matches
(we only want to keep the latest match).
This would use very little memory, and the streaming could be highly parallelized.
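
As a rough sketch (with the duplicates not yet removed), the swapped join would look something like this:

```sql
-- Sketch of the swapped join: the huge prices table streams through the
-- probe side while the small times table becomes the build side.
-- Every price at or before a probe time matches, so the older matches
-- still need to be filtered out.
SELECT t.probe, p.time, p.price
FROM prices p
JOIN times t
    ON t.probe >= p.time;
```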

There are two streaming physical join operators we could use for this:

* _Nested Loop Join_ – Literally what it sounds like: loop over each left block and the right side table, checking for matches;
* _Piecewise Merge Join_ – A tricky join for one inequality condition that sorts the right side and each left block before merging to find matches;

We can try both of these once we have a way to eliminate the duplicates.
One thing to be aware of, though, is that they are both `N^2` algorithms
(every row of one side is compared against every row of the other,
so 64 lookup rows against 1B prices is up to 64 billion comparisons),
so there will be a limit on how big “small” can be.

### Grouping

If you have been around databases long enough,
you know that the phrase “eliminate the duplicates” means `GROUP BY`!
So to eliminate the duplicates, we want to add an aggregation operator onto the output.
The tricky part is that we want to keep only the matched values that have the “largest” times.
Fortunately, DuckDB has a pair of aggregate functions that do just that:
[`arg_max`]({% link docs/sql/functions/aggregates.md %}#arg_maxarg-val) and
[`arg_min`]({% link docs/sql/functions/aggregates.md %}#arg_minarg-val)
(also called `max_by` and `min_by`).

That takes care of the fields from the lookup table, but what about the fields from the small table?
Well, those values will all be the same, so we can just use the `first` aggregate function for them.
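
As a small illustration (again assuming the `prices` table from the benchmark below), `arg_max` returns the value of its first argument from the row where its second argument is greatest:

```sql
-- The price from the row with the maximum time
SELECT arg_max(price, time) AS latest_price
FROM prices;
```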

### Streaming Window

But what should we group on?
One might be tempted to group on the times that are being looked up,
but that could be problematic if there are duplicate lookup times
(only one of the rows would be returned!).
Instead, we need to have a unique identifier for each row being looked up.
The simplest way to do this is to use the
[_streaming window operator_]({% post_url 2025-02-14-window-flying %})
with the [`row_number()` window function]({% link docs/sql/functions/window_functions.md %}#row_numberorder-by-ordering).
We then group on this row number.
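
In isolation, the tagging step is just the inner query of the rewrite shown in the Roll Your Own section below:

```sql
-- Give every probe row a unique identifier so duplicate probe times
-- end up in separate groups
SELECT *, row_number() OVER () AS pk
FROM times;
```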

## Coming Together

This all sounds good, but how does it work in practice?
How big can “small” get?
To answer that, I ran a number of benchmarks joining small tables against large ones.
The tables are called `prices` and `times`, with the `{prices_size}` and `{times_size}` placeholders filled in by the benchmark script:

```sql
CREATE OR REPLACE TABLE prices_{prices_size} AS
SELECT
    r AS id,
    '2021-01-01T00:00:00'::TIMESTAMP +
        INTERVAL (random() * 365 * 24 * 60 * 60) SECOND
        AS time,
    (random() * 100000)::INTEGER AS price,
FROM range({prices_size}) tbl(r);

CREATE OR REPLACE TABLE times_{times_size} AS
SELECT
    r AS id,
    '2021-01-01'::TIMESTAMP +
        INTERVAL ((random() * 365 * 24 * 60 * 60)::INTEGER) SECONDS
        AS probe
FROM range({times_size}) tbl(r);
```

I then ran a benchmark query:

```sql
SELECT count(*)
FROM (
    SELECT
        t.probe,
        p.price
    FROM times_{times_size} t
    ASOF JOIN prices_{prices_size} p
        ON t.probe >= p.time
) t;
```

for a matrix of the following values:

* Prices – 100K to 1B rows in steps of 10x;
* Times – 1 to 2048 rows in steps of 2x (until it got too slow);
* Threads – 36, 18, and 9.

Here are the results:

<div align="center">
<img src="/images/blog/asof/asof-plans.png" alt="AsOf Plan Matrix" title="AsOf Plan Matrix" style="max-width:100%;width:100%;height:auto"/>
</div>

As you can see, the quadratic nature of the joins means that “small” means “<= 64”.
That is pretty small, but the table in the original user issue had only 21 values.

We can also see that the sorting provided by piecewise merge join does not seem to help much,
so plain old Nested Loop Join is the best choice.

It is clear that the performance of the standard operator is stable at each size,
but decreases slowly as the number of threads is reduced.
This makes sense because sorting is compute-intensive, and the fewer cores we can assign,
the longer it will take.

If you want to play with the data more, you can find
the [interactive visualization](https://public.tableau.com/app/profile/duckdb.labs/viz/AsOfLoopJoin/Tuning)
on our [Tableau Public site](https://public.tableau.com/app/profile/duckdb.labs/vizzes).

### Memory

The Loop Join plan is clearly faster at small sizes,
but how much memory do the two plans use?
These are the rough amounts of memory needed before excessive paging or allocation failures occur:

| Price Rows | AsOf Memory | Loop Join Memory |
| ---------: | ----------: | ---------------: |
| 1B | 48 GB | 64 MB |
| 100M | 6 GB | 64 MB |
| 10M | 256 MB | 64 MB |
| 1M | 32 MB | 64 MB |
| 100K | 32 MB | 64 MB |

In other words, the Loop Join plan only needs enough memory to page in the lookup table!
So if the table is large and you have limited memory, the Loop Join plan is the best option,
even if it is painfully slow.
Just remember that the Loop Join plan has to compete with the standard operator running under paging,
and past a certain size the paging operator may still be the faster of the two.
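
If you want to experiment with a constrained memory budget yourself, DuckDB's `memory_limit` setting caps how much memory the system will use (the figure here is just an example):

```sql
-- Cap DuckDB's memory budget; anything past this spills or fails
SET memory_limit = '8GB';
```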

### Backup Plans

As part of the experiment, I also measured how the old SQL implementation would perform,
and it did not fare well.
At the 1B row level I had to cut it off after one run to avoid wasting time:

<div align="center">
<img src="/images/blog/asof/asof-debug.png" alt="AsOf SQL Implementation" title="AsOf SQL Implementation" style="max-width:100%;width:100%;height:auto"/>
</div>

Note that the Y-axis is a log scale here!

### Setting

While it is nice that we provide a default value for making plan choices like this, your mileage may vary, as they say.
If you have more time than memory, it might be worthwhile to bump up the Loop Join threshold a bit.
The threshold is a new setting called `asof_loop_join_threshold` with a default value of `64`,
and you can change it using a `PRAGMA` statement:

```sql
PRAGMA asof_loop_join_threshold = 128;
```
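
You can check the current value with the `current_setting` function (assuming a build that includes the new setting):

```sql
SELECT current_setting('asof_loop_join_threshold');
```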

Remember, though, this is a quadratic operation, and pushing it up too high might take a Very Long Time
(especially if you express it in [Old Entish](https://tolkiengateway.net/wiki/Entish)!).

If you wish to disable the feature, you can just set it to zero:

```sql
PRAGMA asof_loop_join_threshold = 0;
```
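
Either way, you can confirm which plan you are getting by prefixing the query with `EXPLAIN` and checking which join operator appears in the output:

```sql
EXPLAIN
SELECT t.probe, p.price
FROM times t
ASOF JOIN prices p
    ON t.probe >= p.time;
```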

## Roll Your Own

This Loop Join plan optimization will not ship until v1.3, but if you are having problems today,
you can always write your own version like this:

```sql
SELECT
    first(t.probe) AS probe,           -- all probe values in a group are identical
    arg_max(p.price, p.time) AS price  -- keep only the most recent match
FROM prices p
INNER JOIN (
    SELECT
        *,
        row_number() OVER () AS pk     -- unique identifier per probe row
    FROM times
) t
    ON t.probe >= p.time
GROUP BY pk
ORDER BY 1;
```

If you know the probe times are unique, you can simplify this to:

```sql
SELECT
    t.probe,
    arg_max(p.price, p.time) AS price
FROM prices p
INNER JOIN times t
    ON t.probe >= p.time
GROUP BY 1
ORDER BY 1;
```

## Future Work

The new AsOf Loop Join plan feature only covers a common but very specific situation,
and the standard operator could be made a lot more efficient if it knew that the data was already sorted.
This is often the case, but we do not yet have the ability to track partitioning and ordering between operators.
Tracking that kind of metadata would be very useful for speeding up a large number of operations,
including sorting (!), partitioned aggregation, windowing, AsOf joins, and merge joins.
This is work we are very interested in, so stay tuned!

## Conclusion

With apologies to [Guido van Rossum](https://en.wikipedia.org/wiki/Guido_van_Rossum), there is usually more than one way to do something,
but each way may have radically different performance characteristics.
One of the jobs of a relational database with a declarative query language like SQL
is to make intelligent choices between the options so that you, the user, can focus on the result.
Here at DuckDB we look forward to finding more ways to plan your queries so you can focus on what you do best!

## Notes

* The tests were all run on an iMac Pro with a 2.3 GHz 18-core Intel Xeon W CPU and 128 GB of RAM.
* The [raw test data and visualizations](https://public.tableau.com/views/AsOfLoopJoin/Tuning?:language=en-US&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link) are available on our [Tableau Public repository](https://public.tableau.com/app/profile/duckdb.labs/vizzes).
* The script to generate the data is in my [public DuckDB tools repository](https://github.com/hawkfish/feathers/blob/main/joins/asof-plans.py).