design: Columnar Encodings for Persist #26175
Conversation
Force-pushed from 9b8e77f to 6c91284.
Exciting!
> Represented by encoding the `months`, `days`, and `micros` fields encoded as little endian.
How do these sort in postgres? We might be able to get meaningful sorting if we encode it in big endian instead (right? isn't that how that works?)
You're totally right, if we want them to sort by month, day, micros then big endian is how we would want to encode these, updated! I'll double check they sort in Postgres.
Checked on this, and if we encode the values as big endian, and apply an offset for negative numbers, we do retain the same order. Here's a draft PR #26799
Also worth noting that Intervals do not currently sort correctly in Materialize (https://github.com/MaterializeInc/materialize/issues/10854). I don't think this matters for this project, but it is worth calling out.
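As a rough illustration of the big-endian-plus-offset idea discussed above (a sketch only; the function names are hypothetical and this is not the actual Persist code): flipping the sign bit maps a signed integer's range onto the unsigned range while preserving order, and big-endian bytes then compare the same way the numbers do.

```rust
/// Sketch: encode an i64 so that a lexicographic comparison of the bytes
/// matches the numeric ordering of the original values.
fn encode_sortable_i64(x: i64) -> [u8; 8] {
    // Flipping the sign bit maps i64::MIN..=i64::MAX onto 0..=u64::MAX
    // while preserving order, i.e. it applies an offset of 2^63.
    let shifted = (x as u64) ^ (1 << 63);
    // Big-endian bytes compare lexicographically in the same order as the
    // unsigned integers themselves.
    shifted.to_be_bytes()
}

fn decode_sortable_i64(bytes: [u8; 8]) -> i64 {
    (u64::from_be_bytes(bytes) ^ (1 << 63)) as i64
}

fn main() {
    let mut values = vec![42i64, -7, 0, i64::MIN, i64::MAX];
    let mut encoded: Vec<[u8; 8]> = values.iter().map(|&v| encode_sortable_i64(v)).collect();
    values.sort();
    encoded.sort();
    let decoded: Vec<i64> = encoded.into_iter().map(decode_sortable_i64).collect();
    assert_eq!(values, decoded); // byte order matches numeric order
}
```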
Force-pushed from e241e83 to 66a1377.
this is great! I'd love to be able to re-use some of the datum -> arrow/parquet encoding as part of the copy-to-s3 work.
Copy-to-s3 would prefer encoding as native arrow/parquet types rather than opaque rust-type blobs, whereas for persist we are optimizing for encoding/decoding perf and the ability to support projection pushdown. If we can structure the code in a way that it's easy to swap out implementations per type that would be great.
Hmm, @rjobanp I'm inclined to have any sort of reuse with copy-to-s3 as a non-goal. This is all going to be tricky enough without additional constraints on the implementation. Let's discuss offline! (Maybe in the storage hangout next week?)
@danhhz sounds good, let's discuss next week!
Exciting!
> For all of these scenarios I believe we can handle consolidation by first decoding to the Rust type (e.g. `Row`) and consolidating based on that. It will be relatively inefficient, but none of these scenarios should happen too often.
This will leave data in an order that doesn't match either the existing consolidated output (based on bytes) or the eventual-future output (based on the structured representation).
I'd pitch just sorting based on the `Codec` representation while we have it! The two notes below seem to handle things well.
> I'd pitch just sorting based on the `Codec` representation while we have it! The two notes below seem to handle things well.

Totally agree! These are scenarios in which we don't have `Codec` data, or we're trying to consolidate `Codec` with `Arrow`. Although thinking about this and streaming consolidation made me realize the best approach might be decoding to the Rust type `K` and then encoding into the latest Arrow format. Will need to think on this some more and get back to you.
Ah, yeah this is tricky. If we're sorting and consolidating based on the decoded types anywhere, we probably need to do that everywhere to make stuff like streaming consolidation work. That said, if we're making the decode path as cheap as possible, then maybe that's fine?
If everything was always kept sorted by the decoded data, would that remove the deterministic requirement on the format? Seems initially to me like yes, and it might be pretty nice if so.
> …sorted consolidating an Arrow array is relatively easy and fast! To delete a value all we need to do is unset the bit in the [validity bitmap](https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps) for that value.
Could we say a little bit about streaming consolidation -- would we support it, and if so how would it work?
Need to think on this a bit, will get back to you!
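To make the deletion-by-validity-bit idea from the quoted section above concrete, here is a minimal sketch that doesn't depend on any particular Arrow library; it just assumes the Arrow spec's packed validity bitmap, where bits are numbered least-significant-first within each byte.

```rust
/// Sketch: clear the validity bit for `index`, marking that slot as
/// null/deleted without rewriting the values buffer. Arrow packs validity
/// bits LSB-first within each byte.
fn unset_validity_bit(validity: &mut [u8], index: usize) {
    validity[index / 8] &= !(1u8 << (index % 8));
}

fn is_valid(validity: &[u8], index: usize) -> bool {
    validity[index / 8] & (1u8 << (index % 8)) != 0
}

fn main() {
    // A bitmap covering 16 values, all initially valid.
    let mut validity = vec![0xffu8; 2];
    // "Consolidate away" value 3, e.g. because its diff summed to zero.
    unset_validity_bit(&mut validity, 3);
    assert!(!is_valid(&validity, 3));
    assert!(is_valid(&validity, 4));
}
```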
Force-pushed from 48bc0dc to cb349d0.
> …of writing there are only two accepted [canonical extension types](https://arrow.apache.org/docs/format/CanonicalExtensions.html#official-list) but we are free to create our own.
>
> As part of our mapping from `Datum` to Arrow column we could include an…
This idea is cool, but I worry about the effect on the size of the resulting blobs. Inline writes will perhaps change this math a bit (speaking of which, we should hash out how inline writes work with this columnar encoding shift), but we currently have a pretty silly overhead of data we write to s3 to encode small batches of data (e.g. 1 row). Given that an overall format version (that you describe below) could easily have a mapping in code to the versions of each column type, I'm not sure what duplicating that into the individual column metadata buys us.
That's fair! We can shorten this then to something like `mz.p.<v>.<datum type tag>`, e.g. assuming String has a type tag of 12 it would look like `mz.p.1.12`.

I like having the `Datum`-specific version because it allows us to evolve each `Datum` encoding independently, which allows us to avoid unnecessary migrations. For example, say we change the encoding of `Interval`: if we bumped the overall format version then all batches, regardless of whether they contain an `Interval`, would need to migrate during consolidation. But with a `Datum`-specific version we would only have to migrate batches that contain `Interval`s.
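A hedged sketch of what such a per-`Datum` extension-type name could look like; the `mz.p.<v>.<datum type tag>` scheme comes from the comment above, but the helper names and the type-tag value used here are hypothetical.

```rust
/// Hypothetical helper: build the extension-type name for a column,
/// embedding a per-Datum encoding version and a Datum type tag,
/// e.g. "mz.p.1.12" for version 1 of the String encoding (tag 12 assumed).
fn extension_name(encoding_version: u32, datum_type_tag: u32) -> String {
    format!("mz.p.{encoding_version}.{datum_type_tag}")
}

/// Hypothetical helper: parse the name back into (version, type tag).
fn parse_extension_name(name: &str) -> Option<(u32, u32)> {
    let rest = name.strip_prefix("mz.p.")?;
    let (version, tag) = rest.split_once('.')?;
    Some((version.parse().ok()?, tag.parse().ok()?))
}

fn main() {
    let name = extension_name(1, 12);
    assert_eq!(name, "mz.p.1.12");
    assert_eq!(parse_extension_name(&name), Some((1, 12)));
}
```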
> …impact cost, and requires `ConsolidatingIter` to handle a larger number of Parts.
>
> Neither option is great, but I am leaning towards [1] since it puts the…
This is the biggest open question for me.
Another big factor is memory pressure. I worry a bit about larger part sizes getting decoded simultaneously and OOMing where it wouldn't have before. It'd be good to collect some data here on the actual sizes we see from your prototyping (e.g. try (1) with tpch data in staging and see how big the parts are with just k v, and then also with k v k_s v_s).
Maybe we can get some of this back by enabling parquet compression on all of k, v, k_s, v_s (I don't think we currently do any 😳) but only when we're writing both (to see if we can keep sizes the ~same during the transition). That might let us "hide" some of the extra bytes. We could then be careful about lazily decoding things into only one of k, v OR k_s, v_s in the important codepaths (persist_source)
> Another big factor is memory pressure.

Very true! Added a note about Prometheus metrics to track the size of blobs we read from S3.

> Maybe we can get some of this back by enabling parquet compression on all of k, v, k_s, v_s

This would be great! Parquet supports a number of compression types, I'll need to read up but I think either LZ4 or Snappy would be good here? AFAIU they're fairly fast compression algos. Another win we could get is enabling run-length encoding on the `diff` and `time` columns: for some of our largest blobs, which are physically compacted (I think?), all of the `time` and `diff` fields should be the same(?).
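For reference, a minimal sketch of what turning on Parquet compression could look like with the Rust `parquet` crate (the column names and the choice of Snappy here are illustrative assumptions, not what Persist actually configures):

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn writer_properties() -> WriterProperties {
    WriterProperties::builder()
        // Snappy is cheap to compress and decompress; LZ4 is a similar trade-off.
        .set_compression(Compression::SNAPPY)
        // Dictionary encoding (RLE_DICTIONARY) helps columns with many repeated
        // values, like `time` and `diff` in physically compacted blobs.
        .set_column_dictionary_enabled(ColumnPath::from("time"), true)
        .set_column_dictionary_enabled(ColumnPath::from("diff"), true)
        .build()
}

fn main() {
    let props = writer_properties();
    println!("compression: {:?}", props.compression(&ColumnPath::from("k")));
}
```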
Force-pushed from bf9ea07 to eba9e09.
This PR implements "packed" encodings for `Interval` and `Time` that will be used for writing structured data in Persist. The packed encodings are designed to be as fast as possible and have the same sort order as the original types. They're based on discussion from #26175.

I added benchmarks that measure throughput and locally I get the following results:

| type     | encode              | decode              |
|----------|---------------------|---------------------|
| interval | ~1.5 billion/second | ~1.6 billion/second |
| time     | ~2.5 billion/second | ~1.3 billion/second |

I believe the implementations can be optimized further with SIMD, but right now that requires `inline_asm!` or the `Nightly` compiler.

Note: The code placement feels a bit weird since these impls should only be used by Persist, so I was thinking of putting the impl behind a trait like `mz_persist_types::ColumnarCodec`, but was curious what other folks thought.

### Motivation

Progress towards https://github.com/MaterializeInc/materialize/issues/24830

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)).
- [ ] This PR includes the following [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
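As a rough idea of what a "packed", sort-preserving `Interval` encoding could look like, here is a sketch that combines the sign-bit-flip trick from earlier in the thread with a fixed 16-byte layout. The field widths, ordering, and names are assumptions for illustration, not the actual code in that PR.

```rust
/// Sketch: pack an interval's fields into 16 bytes such that a byte-wise
/// comparison sorts by (months, days, micros). Each signed field has its
/// sign bit flipped and is written big-endian so byte order matches
/// numeric order.
fn pack_interval(months: i32, days: i32, micros: i64) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0..4].copy_from_slice(&((months as u32) ^ (1 << 31)).to_be_bytes());
    out[4..8].copy_from_slice(&((days as u32) ^ (1 << 31)).to_be_bytes());
    out[8..16].copy_from_slice(&((micros as u64) ^ (1 << 63)).to_be_bytes());
    out
}

fn unpack_interval(buf: [u8; 16]) -> (i32, i32, i64) {
    let months = (u32::from_be_bytes(buf[0..4].try_into().unwrap()) ^ (1 << 31)) as i32;
    let days = (u32::from_be_bytes(buf[4..8].try_into().unwrap()) ^ (1 << 31)) as i32;
    let micros = (u64::from_be_bytes(buf[8..16].try_into().unwrap()) ^ (1 << 63)) as i64;
    (months, days, micros)
}

fn main() {
    let a = pack_interval(1, 0, 0);
    let b = pack_interval(1, 2, -5);
    let c = pack_interval(-1, 30, 0);
    assert!(c < a && a < b); // byte order follows (months, days, micros)
    assert_eq!(unpack_interval(b), (1, 2, -5));
}
```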
Force-pushed from eba9e09 to 051c616.
Rendered

Motivation

Design for https://github.com/MaterializeInc/database-issues/issues/7411

Checklist

- If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.