
Conversation

@ianmcook
Member

@ianmcook ianmcook commented Jan 7, 2025

This adds the first in a series of posts that aim to demystify the use of Arrow as a data interchange format for databases and query engines. A Google Docs version is available here. The vector source file for the figures is here.

cc @lidavidm @zeroshade

@drin

drin commented Jan 7, 2025

The intro of the blog post points to ser/de as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and the deserialization can have an arbitrarily large time gap between them).

I realize it's a bit of scope creep, but I think a comparison of ser/de time and compressed size would be really valuable to readers (and I think some naive numbers wouldn't be very time-consuming to get?)

@ianmcook
Member Author

ianmcook commented Jan 7, 2025

> The intro of the blog post points to ser/de as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and the deserialization can have an arbitrarily large time gap between them).

Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage (mostly because higher compression ratios mean lower cost to store for long periods, which easily outweighs the tradeoff of higher ser/de overheads).

> I realize it's a bit of scope creep, but I think a comparison of ser/de time and compressed size would be really valuable to readers (and I think some naive numbers wouldn't be very time-consuming to get?)

Agreed. I'd like to include something like this in the second post too, comparing time and size using Arrow IPC vs. Parquet, ORC, Avro, CSV, and JSON. But there are so many different variables at play (network speed, CPU and memory specs, encoding and compression options, how optimized the implementation is, whether or not to aggressively downcast based on the range of values in the data, what column types to use in the example, ...) that I expect it will be impossible to claim that any results we present are representative. So the main message might end up being "YMMV" and we will probably want to provide a repo with some tools that readers can use to experiment for themselves.
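
As a starting point, a measurement along these lines could be as simple as the sketch below (pyarrow assumed; the synthetic table and file names are illustrative only, and the numbers it prints are not representative of anything):

```python
# Minimal sketch of a ser/de timing comparison, assuming pyarrow is
# installed. The table below is synthetic, the file names are arbitrary,
# and the numbers this prints are NOT representative; results depend
# heavily on data, hardware, and encoding/compression options.
import os
import time

import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "id": pa.array(range(n), type=pa.int64()),
    "value": pa.array([i * 0.5 for i in range(n)], type=pa.float64()),
})

def timed(fn):
    """Run fn() and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def write_ipc():
    # Arrow IPC file format (a.k.a. Feather V2): no encoding step to speak of.
    with ipc.new_file("example.arrow", table.schema) as writer:
        writer.write_table(table)

def write_parquet():
    # Parquet with pyarrow's default encodings and compression.
    pq.write_table(table, "example.parquet")

print(f"Arrow IPC: {timed(write_ipc):.3f} s, "
      f"{os.path.getsize('example.arrow'):,} bytes")
print(f"Parquet:   {timed(write_parquet):.3f} s, "
      f"{os.path.getsize('example.parquet'):,} bytes")
```

Reads could be timed the same way with `ipc.open_file(...).read_all()` and `pq.read_table(...)`, and the encoding and compression options varied from there.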

@drin

drin commented Jan 7, 2025

Great! Thanks Ian! Looking forward to the posts. I'll give this post a deeper look soon, and I'd be happy to help with something like a cookbook repo for examples you might build up over the course of the posts, if I can.


Member

@kou kou left a comment


Great!
I'll translate this into Japanese when this is published.

@telemenar

> The intro of the blog post points to ser/de as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and the deserialization can have an arbitrarily large time gap between them).

Another thing that feeds into this, beyond the storage benefits called out here:

> Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage (mostly because higher compression ratios mean lower cost to store for long periods, which easily outweighs the tradeoff of higher ser/de overheads).

is that for archival storage, in addition to the cost aspect, you are generally doing ser once and de many times, which changes your tradeoffs. In the pure compression algorithm space, this might be the difference between choosing lz4 (wire) and zstd (archival).
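
As a hypothetical illustration of that split with pyarrow (the codec and file name choices here are just to show the wire-vs-archival distinction, not recommendations):

```python
# Hypothetical sketch: a fast codec for transient wire data, a
# higher-ratio codec for archival data. Data and file names are illustrative.
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10_000))})

# Wire: compressed once, decompressed once, so lz4 favors speed over ratio.
wire_opts = ipc.IpcWriteOptions(compression="lz4")
with ipc.new_file("wire.arrow", table.schema, options=wire_opts) as writer:
    writer.write_table(table)

# Archival: compressed once, decompressed many times over a long lifetime,
# so zstd (used here via Parquet) favors compression ratio.
pq.write_table(table, "archive.parquet", compression="zstd")
```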

Member

@pitrou pitrou left a comment


This is looking great in general, well-explained with impressive examples.

I posted some comments below. Also, I think that maybe this should focus on the usage of Arrow for in-memory analytics, not storage. Pitting Arrow files against Parquet is a bit misleading and contentious; I think it's better to present them as complementary.

(for example, reading a Parquet file from storage might be faster than reading the equivalent Arrow file, if the storage is not super fast, because Parquet often has much better storage efficiency)
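
To put purely hypothetical numbers on that: if a dataset is 1 GB as an Arrow file but 250 MB as Parquet, then storage that delivers 100 MB/s takes about 10 seconds just to read the Arrow file, versus about 2.5 seconds for the Parquet file, so Parquet comes out ahead whenever its extra decode work costs less than roughly 7.5 seconds.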

@ianmcook
Copy link
Member Author

ianmcook commented Jan 9, 2025

> This is looking great in general, well-explained with impressive examples.

Thanks @pitrou!

> I posted some comments below. Also, I think that maybe this should focus on the usage of Arrow for in-memory analytics, not storage. Pitting Arrow files against Parquet is a bit misleading and contentious; I think it's better to present them as complementary.

> (for example, reading a Parquet file from storage might be faster than reading the equivalent Arrow file, if the storage is not super fast, because Parquet often has much better storage efficiency)

I changed the language in a couple of places and expanded footnote 3 to help prevent readers from getting this idea.

@ianmcook ianmcook merged commit 2788cf7 into apache:main Jan 10, 2025
1 check passed
@alamb
Contributor

alamb commented Jan 13, 2025

FWIW, I read the rendered version: https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/

It is nicely done. Great work 👏

