Sort results on replica, merge on envd #30558

antiguru · 2024-11-19T13:06:44Z

Sort results on replica, merge on environmentd.

Previously, we'd sort data only on environmentd, which would cause it to consume more CPU than necessary. This change moves some of the sorting to clusterd, and only leaves the last merge step on environmentd.

The PR selects a minimal approach and leaves most of the code related to result finishing untouched. It introduces an invariant that peek results must always be sorted according to the finishing; anything else will lead to undefined results. However, nothing enforces that all results are sorted with the same ordering, which is potentially bad. Inside environmentd, it uses a simple heap to combine $k$ sorted runs into a single permutation map.
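To illustrate the heap-based merge described above, here is a minimal sketch. It is not the code from this PR: `merge_runs` and `run_ends` are hypothetical names, and it works on any `T: Ord` instead of encoded rows compared by a finishing. It assumes the runs are stored back to back in one slice, with `run_ends[i]` being the exclusive end index of run `i`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge sorted runs stored back to back in `rows` into a permutation
/// ("view") of indexes that visits `rows` in sorted order.
///
/// `run_ends[i]` is the exclusive end index of the i-th run (assumed layout).
pub fn merge_runs<T: Ord>(rows: &[T], run_ends: &[usize]) -> Vec<usize> {
    let mut heap = BinaryHeap::with_capacity(run_ends.len());
    let mut start = 0;
    for &end in run_ends {
        if start < end {
            // Key each entry by the row at the head of its run. `Reverse`
            // turns the max-heap into a min-heap; the index breaks ties.
            heap.push(Reverse((&rows[start], start, end)));
        }
        start = end;
    }

    let mut view = Vec::with_capacity(rows.len());
    while let Some(Reverse((_, start, end))) = heap.pop() {
        view.push(start);
        let next = start + 1;
        if next < end {
            // The run still has rows: re-insert it keyed by its new head.
            heap.push(Reverse((&rows[next], next, end)));
        }
    }
    view
}
```

Each of the $n$ rows is pushed and popped once on a heap of at most $k$ entries, which is where the $n\cdot\log k$ merge cost comes from. If any run is not actually sorted, the output is silently wrong, which is the invariant concern raised above.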

The interfaces to RowCollection (new, sorted_view) now take a &[ColumnOrder], and internally the implementation picks the right comparison function. If the column order slice is empty, it'll skip decoding the rows and directly defer to the tiebreaker.

The PR moves the RowCollection type into mz-expr, which isn't ideal. This is required because the ColumnOrder type is defined here, and we'd like to pass it to the constructor of the type. Alternatives would be to have a function here that passes the correct comparison function to RowCollection, but that seems to be strictly worse than moving the type.
I considered moving the type to compute-types, which seems a better fit, but not all uses of RowCollection depend on compute-types. If this is upsetting, I can think about alternatives.

The time complexity of sorting on the cluster is roughly $\frac{n}{k}\cdot\log \frac{n}{k}$ per worker, where $n$ is the total number of result records and $k$ the number of workers. The last merge step then has a time complexity of $n\cdot\log k$ to combine $k$ sorted runs into one.
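As a rough worked example of these estimates (assuming base-2 logarithms, rows spread evenly over the workers, and ignoring constant factors), the work of a single sort splits exactly into the distributed part and the merge part:

$$
n\log n \;=\; \underbrace{k\cdot\frac{n}{k}\log\frac{n}{k}}_{\text{all workers combined}} \;+\; \underbrace{n\log k}_{\text{merge on environmentd}}
$$

For an illustrative $n = 2^{20}$ and $k = 16$: each worker performs about $2^{16}\cdot 16 = 2^{20}$ comparisons, the merge about $2^{20}\cdot 4 = 2^{22}$, and a single sort on environmentd would need about $2^{20}\cdot 20$. Under this model, environmentd is left with roughly $4/20 = 1/5$ of the comparisons it performed before.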

Follow-up items include:

Avoid a single Bytes allocation for all rows, and instead keep the individual allocations.
Assert that all RowCollections are sorted equally.
Move the binary heap into an iterator to avoid materializing the sorted view permutation.

Tips to the reviewer

Don't look at individual commits.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

ParkMyCar

Woohoo! Love to see this happening. IMO the biggest feedback I have is enforcing the invariant that peek results must be sorted at the type level, and maybe at the same time reducing the repetitiveness of creating DatumVecs and calling .sort_by(...).

It seems like everywhere we currently sort a Vec<Row> we're immediately passing the results into RowCollection::new. What if we push the sorting into RowCollection::new? i.e.

impl RowCollection {
  pub fn new(mut rows: Vec<Row>, finishing: &RowSetFinishing) -> Self { ... }
}

At which point RowCollection is sorted so what's the point of SortedRowCollection? It kind of feels like RowCollection could naturally become a SortedRowRun and then SortedRowCollection becomes a collection of SortedRowRuns? e.g.

struct SortedRowRun {
    encoded: Bytes,
    metadata: Arc<[EncodedRowMetadata]>,
}

struct SortedRowCollection {
    runs: Vec<SortedRowRun>,
}

This is a much larger change, and I think only part one (pushing the sort into RowCollection) is enough to get this across the line because it mostly solves the invariant that a RowCollection must be sorted. But I think we can still do part 2 without having to touch the code related to result finishing since most of that should use a Box<dyn RowIterator> IIRC.
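A minimal sketch of "part one", pushing the sort into the constructor so that a run is sorted by construction. Everything here is a simplified assumption for illustration: `Row` is a plain `Vec<i64>` rather than the real encoded row, `ColumnOrder` has only two fields, and `compare_rows` is a hypothetical helper. With an empty order slice, the comparison falls straight through to the whole-row tiebreaker.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a row; the real `Row` is an encoded datum sequence.
pub type Row = Vec<i64>;

/// Hypothetical, simplified stand-in for `ColumnOrder`.
#[derive(Clone, Copy, Debug)]
pub struct ColumnOrder {
    pub column: usize,
    pub desc: bool,
}

/// Compare two rows by `order`, falling back to a whole-row tiebreaker.
pub fn compare_rows(order: &[ColumnOrder], a: &Row, b: &Row) -> Ordering {
    for o in order {
        let cmp = a[o.column].cmp(&b[o.column]);
        let cmp = if o.desc { cmp.reverse() } else { cmp };
        if cmp.is_ne() {
            return cmp;
        }
    }
    // Empty `order` (or all columns equal): defer to the tiebreaker.
    a.cmp(b)
}

/// A run of rows that is sorted by construction: the field is private and
/// the only constructor sorts, so other modules cannot build an unsorted run.
pub struct SortedRowRun {
    rows: Vec<Row>,
}

impl SortedRowRun {
    pub fn new(mut rows: Vec<Row>, order: &[ColumnOrder]) -> Self {
        rows.sort_by(|a, b| compare_rows(order, a, b));
        SortedRowRun { rows }
    }

    pub fn rows(&self) -> &[Row] {
        &self.rows
    }
}
```

The design choice is that callers hand over unsorted rows and can only read them back sorted, so the invariant no longer depends on every call site remembering to sort first.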

ParkMyCar · 2024-11-19T20:17:45Z

src/repr/src/row/collection.rs

+        while let Some(Reverse(mut finger)) = heap.pop() {
+            view.push(finger.start);
+            finger.start += 1;
+            if finger.start < finger.end {
+                heap.push(Reverse(finger));
+            }
+        }


It feels like there is a great opportunity to push this logic into SortedRowCollection or SortedRowCollectionIter maybe? i.e. as folks iterate through a row collection is when we do this streaming merge sort?

Agreed! I assume we're going to iterate through the result rows only once when we send them over the wire, so we can avoid having the extra view buffer around and decoding the rows twice.

Unfortunately, we need to iterate multiple times to determine the size of the whole result. I agree we shouldn't (and there is no deep reason we have to), but it requires more changes.
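For reference, a streaming merge of the kind discussed in this thread could look roughly like the sketch below. `MergeIter` is a hypothetical name, not a type from the codebase, and it merges plain `Vec<T>` runs with `T: Ord` rather than encoded rows. The heap lives inside the iterator, so no view buffer is materialized.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Lazily merges sorted runs instead of materializing a permutation view.
pub struct MergeIter<'a, T: Ord> {
    runs: &'a [Vec<T>],
    /// Min-heap of (head row, run index, offset within the run).
    heap: BinaryHeap<Reverse<(&'a T, usize, usize)>>,
}

impl<'a, T: Ord> MergeIter<'a, T> {
    pub fn new(runs: &'a [Vec<T>]) -> Self {
        // Seed the heap with the first row of every non-empty run.
        let heap = runs
            .iter()
            .enumerate()
            .filter_map(|(i, run)| run.first().map(|head| Reverse((head, i, 0))))
            .collect();
        MergeIter { runs, heap }
    }
}

impl<'a, T: Ord> Iterator for MergeIter<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        let Reverse((row, run, offset)) = self.heap.pop()?;
        let runs = self.runs;
        // Advance within the run we just consumed from, if rows remain.
        if let Some(next) = runs[run].get(offset + 1) {
            self.heap.push(Reverse((next, run, offset + 1)));
        }
        Some(row)
    }
}
```

Every full pass over such an iterator costs $n\log k$, so code that iterates the result several times (for example, once to compute the total size) would pay that cost on each pass.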

src/repr/src/row/collection.proto

teskje · 2024-11-20T08:43:23Z

src/repr/src/row/collection.rs

@@ -33,6 +35,8 @@ pub struct RowCollection {
    encoded: Bytes,
    /// Metadata about an individual Row in the blob.
    metadata: Vec<EncodedRowMetadata>,
+    /// Start of sorted runs of rows in rows.
+    fingers: Vec<usize>,


The documentation here confused me. This field actually stores the indexes of the ends of sorted runs, right? Is there a reason for that? It does feel like storing the start indexes would be more natural.

I know you said the PR lacks documentation, so if you still planned to adjust it here then nvm!

teskje · 2024-11-20T08:45:50Z

src/repr/src/row/collection.rs

+        while let Some(Reverse(mut finger)) = heap.pop() {
+            view.push(finger.start);
+            finger.start += 1;
+            if finger.start < finger.end {
+                heap.push(Reverse(finger));
+            }
+        }


Agreed! I assume we're going to iterate through the result rows only once when we send them over the wire, so we can avoid having the extra view buffer around and decoding the rows twice.

teskje · 2024-11-20T09:08:30Z

We discussed a bit offline already, and my understanding is that this PR is meant to stop the bleeding with an as-small-as-possible diff and do improvements as a follow-up. That plan is fine with me.

IMO the biggest feedback I have is enforcing the invariant that peek results must be sorted at the type level

I agree with that. One concern is that we add some place where we create PeekResponse::Rows but forget that the contained rows are expected to be sorted runs. Currently there is an assert in the merging, but we don't want to keep that in prod, so it'd be easy to end up with incorrect results returned to the user. So PeekResponse::Rows should contain a type that ensures the sorting invariants we need.

Another reason for wanting such a type is that we sometimes don't have to sort! Specifically, if the order_by is empty, we can return the data in any order, I think. I initially thought that wasn't the case because results from different workers can cancel out, but since their diffs are NonZeroU64, they can only add up but never cancel. Don't trust me on this, I'm probably missing something about how peek finishing works.

But if it's true that we don't need to sort if the order_by is empty, then we want a type that knows about the order_by and does the right thing (sorting or not) depending on it. For example:

struct RowRuns {
    runs: Vec<RowCollection>,
    order_by: Vec<ColumnOrder>,
}

impl RowRuns {
    fn push(&mut self, mut rows: Vec<(Row, NonZeroU64)>) {
        if !self.order_by.is_empty() {
            sort(&mut rows, &self.order_by);
        }
        self.runs.push(RowCollection::new(&rows));
    }
}

antiguru · 2024-11-20T09:47:27Z

Another reason for wanting such a type is that we sometimes don't have to sort!

I agree that strictly speaking there are cases where we don't have to sort, but I'm not comfortable changing the invariant as part of this PR. We might have downstream code that relies on a certain row order, as well as our tests, so I'd like to separate this from the current effort.

antiguru · 2024-11-20T10:30:09Z

It kind of feels like RowCollection could naturally become a SortedRowRun and then SortedRowCollection becomes a collection of SortedRowRuns?

I agree it could! It's a non-trivial departure from what we currently have: at the moment, we allow indexing into a RowCollection and the sorted variants, which we use primarily to iterate. If we want to avoid the index lookup, we could change the iterator to sit on the binary heap, but then we'd need to be careful not to clone the iterator -- the cost of iterating would be $n\log k$ instead of $n$.

teskje · 2024-11-20T12:37:49Z

I agree that strictly speaking there are cases where we don't have to sort, but I'm not comfortable changing the invariant as part of this PR. We might have downstream code that relies on a certain row order, as well as our tests, so I'd like to separate this from the current effort.

Yes, I'm very much in favor of taking small steps! Just wanted to record my thoughts for follow-ups we can/should do. Also partly to check my thinking around whether or not sorting is necessary.

shepherdlybot · 2024-11-20T13:15:01Z

Mitigations

Completing required mitigations increases Resilience Coverage.

Risk Summary:

The pull request has a high risk score of 80, driven by predictors such as the "Sum Bug Reports Of Files" and the "Delta of Executable Lines." Historically, PRs with these predictors are 116% more likely to cause a bug than the repository baseline. The observed bug trend in the repository is steady.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

antiguru · 2024-11-20T13:15:11Z

Nightly run: https://buildkite.com/materialize/nightly/builds/10466

teskje

LGTM!

src/expr/src/row/collection.rs

Signed-off-by: Moritz Hoffmann <[email protected]>

ParkMyCar

LGTM!

Sorry I didn't realize pushing the sort into RowCollection would require moving the struct to the expr crate 🙈 thanks for making that change!

antiguru · 2024-11-20T16:22:56Z

Thanks for the reviews!

antiguru requested review from a team as code owners November 19, 2024 13:06

antiguru marked this pull request as draft November 19, 2024 13:08

antiguru force-pushed the clusterd_sort branch from e27537d to d991c6c Compare November 19, 2024 17:45

ParkMyCar reviewed Nov 19, 2024


teskje reviewed Nov 20, 2024


antiguru force-pushed the clusterd_sort branch from d991c6c to 0e6ee9d Compare November 20, 2024 09:55

antiguru force-pushed the clusterd_sort branch from 568a580 to dab4bcd Compare November 20, 2024 11:03

antiguru marked this pull request as ready for review November 20, 2024 13:14

antiguru requested a review from a team as a code owner November 20, 2024 13:14

antiguru requested review from ParkMyCar and teskje November 20, 2024 13:14

teskje approved these changes Nov 20, 2024


antiguru added 9 commits November 20, 2024 16:33

Sort results on replica, merge on envd

9036727

Signed-off-by: Moritz Hoffmann <[email protected]>

Fix finger merging

2376d70

Signed-off-by: Moritz Hoffmann <[email protected]>

Fix bug!

b8cd991

Signed-off-by: Moritz Hoffmann <[email protected]>

Move RowCollection to expr, sort in ::new

693160b

Signed-off-by: Moritz Hoffmann <[email protected]>

s/fingers/runs/g

9eb60d9

Signed-off-by: Moritz Hoffmann <[email protected]>

Fix build

a90242c

Signed-off-by: Moritz Hoffmann <[email protected]>

Renaming

41efec3

Signed-off-by: Moritz Hoffmann <[email protected]>

Cleanup

537c6ea

Signed-off-by: Moritz Hoffmann <[email protected]>

Feedback

562c3d3

Signed-off-by: Moritz Hoffmann <[email protected]>

antiguru force-pushed the clusterd_sort branch from e470aac to 562c3d3 Compare November 20, 2024 15:34

ParkMyCar approved these changes Nov 20, 2024


antiguru enabled auto-merge (squash) November 20, 2024 16:22

antiguru merged commit 11593f4 into MaterializeInc:main Nov 20, 2024
81 checks passed
