
Conversation

@ianmcook
Member

@ianmcook ianmcook commented Jan 7, 2025

This adds the first in a series of posts that aim to demystify the use of Arrow as a data interchange format for databases and query engines. A Google Docs version is available here. The vector source file for the figures is here.

cc @lidavidm @zeroshade

@drin

drin commented Jan 7, 2025

The intro of the blog post points to ser/de as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and the deserialization can have an arbitrarily large time gap between them).

I realize it's a bit of scope creep, but I think a comparison of ser/de time and compressed size would be really valuable to readers (and I think some naive numbers wouldn't be very time-consuming to get?)

@ianmcook
Member Author

ianmcook commented Jan 7, 2025

> The intro of the blog post points to ser/de as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and the deserialization can have an arbitrarily large time gap between them).

Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage (mostly because higher compression ratios mean lower cost to store for long periods, which easily outweighs the tradeoff of higher ser/de overheads).

> I realize it's a bit of scope creep, but I think a comparison of ser/de time and compressed size would be really valuable to readers (and I think some naive numbers wouldn't be very time-consuming to get?)

Agreed. I'd like to include something like this in the second post too, comparing time and size using Arrow IPC vs. Parquet, ORC, Avro, CSV, and JSON. But there are so many different variables at play (network speed, CPU and memory specs, encoding and compression options, how optimized the implementation is, whether or not to aggressively downcast based on the range of values in the data, what column types to use in the example, ...) that I expect it will be impossible to claim that any results we present are representative. So the main message might end up being "YMMV" and we will probably want to provide a repo with some tools that readers can use to experiment for themselves.
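
As a starting point, a measurement along these lines could be as simple as the sketch below (pyarrow assumed; the synthetic table and file names are illustrative only, and the numbers it prints are not representative of anything):

```python
# Minimal sketch of a ser/de timing comparison, assuming pyarrow is
# installed. The table below is synthetic, the file names are arbitrary,
# and the numbers this prints are NOT representative; results depend
# heavily on data, hardware, and encoding/compression options.
import os
import time

import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "id": pa.array(range(n), type=pa.int64()),
    "value": pa.array([i * 0.5 for i in range(n)], type=pa.float64()),
})

def timed(fn):
    """Run fn() and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def write_ipc():
    # Arrow IPC file format (a.k.a. Feather V2): no encoding step to speak of.
    with ipc.new_file("example.arrow", table.schema) as writer:
        writer.write_table(table)

def write_parquet():
    # Parquet with pyarrow's default encodings and compression.
    pq.write_table(table, "example.parquet")

print(f"Arrow IPC: {timed(write_ipc):.3f} s, "
      f"{os.path.getsize('example.arrow'):,} bytes")
print(f"Parquet:   {timed(write_parquet):.3f} s, "
      f"{os.path.getsize('example.parquet'):,} bytes")
```

Reads could be timed the same way with `ipc.open_file(...).read_all()` and `pq.read_table(...)`, and the encoding and compression options varied from there.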

@drin

drin commented Jan 7, 2025

Great! Thanks Ian! Looking forward to the posts. I'll give this post a deeper look soon, and I'd be happy to help with something like a cookbook repo for examples you might build up over the course of the posts, if I can.


Member

@kou kou left a comment


Great!
I'll translate this into Japanese when this is published.

@telemenar

> The intro of the blog post points to ser/de as a benefit of the Arrow format. I'm curious whether a reference exists (and can be, or will eventually be, added) that shows a similar comparison for Arrow vs. Parquet, mostly in the sense that storage sits in a mechanically similar spot (but the serialization and the deserialization can have an arbitrarily large time gap between them).

Another thing that feeds into this, beyond the storage benefits called out here:

> Thanks @drin. This is part of what the second post in the series will cover. It will describe why formats like Parquet and ORC are typically better than Arrow for archival storage (mostly because higher compression ratios mean lower cost to store for long periods, which easily outweighs the tradeoff of higher ser/de overheads).

is that for archival storage, in addition to the cost aspect, you are generally doing ser once and de many times, which changes your tradeoffs. In the pure compression algorithm space, this might be the difference between choosing lz4 (wire) and zstd (archival).
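
As a hypothetical illustration of that split with pyarrow (the codec and file name choices here are just to show the wire-vs-archival distinction, not recommendations):

```python
# Hypothetical sketch: a fast codec for transient wire data, a
# higher-ratio codec for archival data. Data and file names are illustrative.
import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10_000))})

# Wire: compressed once, decompressed once, so lz4 favors speed over ratio.
wire_opts = ipc.IpcWriteOptions(compression="lz4")
with ipc.new_file("wire.arrow", table.schema, options=wire_opts) as writer:
    writer.write_table(table)

# Archival: compressed once, decompressed many times over a long lifetime,
# so zstd (used here via Parquet) favors compression ratio.
pq.write_table(table, "archive.parquet", compression="zstd")
```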

Member

@pitrou pitrou left a comment


This is looking great in general, well-explained with impressive examples.

I posted some comments below. Also, I think that maybe this should focus on the usage of Arrow for in-memory analytics, not storage. Pitting Arrow files against Parquet is a bit misleading and contentious; I think it's better to present them as complementary.

(for example, reading a Parquet file from storage might be faster than reading the equivalent Arrow file, if the storage is not super fast, because Parquet often has much better storage efficiency)
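
To put purely hypothetical numbers on that: if a dataset is 1 GB as an Arrow file but 250 MB as Parquet, then storage that delivers 100 MB/s takes about 10 seconds just to read the Arrow file, versus about 2.5 seconds for the Parquet file, so Parquet comes out ahead whenever its extra decode work costs less than roughly 7.5 seconds.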

@ianmcook
Copy link
Member Author

ianmcook commented Jan 9, 2025

> This is looking great in general, well-explained with impressive examples.

Thanks @pitrou!

> I posted some comments below. Also, I think that maybe this should focus on the usage of Arrow for in-memory analytics, not storage. Pitting Arrow files against Parquet is a bit misleading and contentious; I think it's better to present them as complementary.

> (for example, reading a Parquet file from storage might be faster than reading the equivalent Arrow file, if the storage is not super fast, because Parquet often has much better storage efficiency)

I changed the language in a couple of places and expanded footnote 3 to help prevent readers from getting this idea.

@ianmcook ianmcook merged commit 2788cf7 into apache:main Jan 10, 2025
1 check passed
@alamb
Contributor

alamb commented Jan 13, 2025

FWIW, I read the rendered version: https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/

It is nicely done. Great work 👏

