GH-48216: [C++][Parquet] Fix Util Byte Stream Split Internal logic to enable Parquet DB support on s390x #48217

Vishwanatha-HD · 2025-11-21T18:34:16Z

Rationale for this change

This PR enables Parquet DB support on big-endian (s390x) systems by fixing the byte stream split logic in "util/byte_stream_split_internal".

What changes are included in this PR?

The fix changes the following file:
cpp/src/arrow/util/byte_stream_split_internal.h

Are these changes tested?

Yes. The changes were tested on s390x to confirm that they work as intended, and on x86 to confirm that they introduce no regression.

Are there any user-facing changes?

No.

GitHub main Issue link: #48151

GitHub Issue: [C++][Parquet] Fix Util Byte Stream Split Internal logic to enable Parquet DB support on Big-Endian (s390x) systems #48216

github-actions · 2025-11-21T18:34:41Z

⚠️ GitHub issue #48216 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/util/byte_stream_split_internal.h

k8ika0s · 2025-11-23T21:54:49Z

@Vishwanatha-HD

Something I’ve seen on s390x is that ByteStreamSplit behaves most predictably when the data feeding into the split is already in a well-defined byte order before the interleaving happens. When values arrive in native order on BE, the shuffling pattern can produce different byte layouts than what downstream readers or stats logic expect on LE hosts.

Looking at this patch, the swap + reversed-stream approach inside DoSplitStreams makes sense mechanically. I was wondering, though, how this interacts with callers that assume the inputs are already LE-normalized. In particular, mixed Arrow/non-Arrow inputs sometimes reveal subtle differences because Arrow arrays tend to carry scalars in canonical LE format even on BE machines.
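To make the layout difference concrete, here is a minimal, portable sketch. `SplitStreams` and `ToBytes` are hypothetical helpers written for this illustration, not the Arrow implementation, and the `width - 1 - k` mapping is an assumption about what a reversed-stream approach amounts to. The sketch simulates LE and BE memory layouts explicitly, so it behaves the same on any host.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the Arrow API): scatter byte k of every value into
// one of `width` streams, stored back to back in the output buffer.
// With reverse_streams set, byte k goes to stream (width - 1 - k), so that
// stream 0 still receives the least significant byte of big-endian input.
std::vector<uint8_t> SplitStreams(const std::vector<uint8_t>& raw,
                                  std::size_t width, bool reverse_streams) {
  const std::size_t num_values = raw.size() / width;
  std::vector<uint8_t> out(raw.size());
  for (std::size_t i = 0; i < num_values; ++i) {
    for (std::size_t k = 0; k < width; ++k) {
      const std::size_t stream = reverse_streams ? width - 1 - k : k;
      out[stream * num_values + i] = raw[i * width + k];
    }
  }
  return out;
}

// Hypothetical helper: lay out 32-bit values in memory as an LE or a BE host
// would, without depending on the byte order of the machine running this.
std::vector<uint8_t> ToBytes(const std::vector<uint32_t>& values,
                             bool big_endian) {
  std::vector<uint8_t> raw;
  for (uint32_t v : values) {
    for (int k = 0; k < 4; ++k) {
      const int shift = big_endian ? 8 * (3 - k) : 8 * k;
      raw.push_back(static_cast<uint8_t>(v >> shift));
    }
  }
  return raw;
}
```

Under these assumptions, splitting BE-ordered bytes without reversal produces streams in the opposite order from the LE result, while splitting them with reversal reproduces the LE result byte for byte.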

On the merge side, I’m also curious whether the current stream reversal covers the cases where BE decoding would otherwise lean on helpers that expect the shuffled bytes to correspond to LE-origin data.
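The merge-side question can be illustrated the same way. The sketch below assumes the encoded buffer was produced from LE-origin data (stream 0 holds the least significant bytes) and that a BE host wants values back in its native layout. `MergeStreams` and `LoadBigEndian32` are hypothetical names for this illustration, not Arrow functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the Arrow API): gather one byte per stream back
// into each value. With reverse_streams set, the byte from stream s lands at
// byte position (width - 1 - s), which yields big-endian values from streams
// that were written in little-endian order.
std::vector<uint8_t> MergeStreams(const std::vector<uint8_t>& encoded,
                                  std::size_t width, bool reverse_streams) {
  const std::size_t num_values = encoded.size() / width;
  std::vector<uint8_t> raw(encoded.size());
  for (std::size_t i = 0; i < num_values; ++i) {
    for (std::size_t s = 0; s < width; ++s) {
      const std::size_t k = reverse_streams ? width - 1 - s : s;
      raw[i * width + k] = encoded[s * num_values + i];
    }
  }
  return raw;
}

// Interpret four bytes as a big-endian 32-bit value, independent of the host.
uint32_t LoadBigEndian32(const uint8_t* p) {
  return (uint32_t{p[0]} << 24) | (uint32_t{p[1]} << 16) |
         (uint32_t{p[2]} << 8) | uint32_t{p[3]};
}
```

If a decoding helper instead expects the merged bytes to still be in LE order, it would need the non-reversed merge followed by its own swap; which of the two a given caller assumes is exactly the thing to check.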

Not raising any correctness objections here — just sharing a few behaviors I’ve run into while testing BSS more broadly on BE systems.

Vishwanatha-HD

I have addressed all the code review comments.

cpp/src/arrow/util/byte_stream_split_internal.h

kiszk · 2025-11-27T20:29:35Z

cpp/src/arrow/util/byte_stream_split_internal.h

-        dest[stream + (i + 5) * width] = static_cast<uint8_t>(v >> 40);
-        dest[stream + (i + 6) * width] = static_cast<uint8_t>(v >> 48);
-        dest[stream + (i + 7) * width] = static_cast<uint8_t>(v >> 56);
+        const int dest_stream = stream;


Would it be better to move const int ... up to line 387 so that it is declared once for both endians? A compiler may hoist it in the current code, but doing the move in the source guarantees it.
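A minimal sketch of what that hoisting could look like. Everything here is hypothetical: `ScatterEight` and the `kLittleEndianHost` template flag are invented for illustration, `v` is assumed to already hold byte j in bits 8*j, and the `width - 1 - stream` mapping for big-endian hosts is an assumption rather than the code in this PR.

```cpp
#include <cstdint>

// Hypothetical sketch: write byte j of `v` into byte position `dest_stream`
// of value (i + j), for eight consecutive values.
template <bool kLittleEndianHost>
void ScatterEight(uint64_t v, int stream, int width, int64_t i, uint8_t* dest) {
  // Declared once, ahead of the stores, and shared by both byte orders.
  const int dest_stream = kLittleEndianHost ? stream : width - 1 - stream;
  for (int j = 0; j < 8; ++j) {
    dest[dest_stream + (i + j) * width] = static_cast<uint8_t>(v >> (8 * j));
  }
}
```

With the index computed in one place, the stores themselves are identical for both byte orders and only the one-line mapping differs.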

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Fix Util Byte Stream Split Internal logic to enable Parquet DB support on Big-Endian (s390x) systems #48216

Open

github-actions bot added the Component: C++ and awaiting review labels on Nov 21, 2025

Vishwanatha-HD mentioned this pull request Nov 21, 2025

[C++][Parquet] Enable Parquet DB support on Big Endian (IBM Z) systems #48151

Open

Vishwanatha-HD force-pushed the fixUtilByteStreamSpltIntrnl branch from 189bd42 to 1c499e4 Compare November 22, 2025 05:05

kou changed the title ~~GH-48216 Fix Util Byte Stream Split Internal logic to enable Parquet …~~ GH-48216: [C++][Parquet] Fix Util Byte Stream Split Internal logic to enable Parquet DB support on s390x Nov 22, 2025

kou reviewed Nov 22, 2025

Commit 70fd9c4: GH-48216 Fix Util Byte Stream Split Internal logic to enable Parquet DB support on s390x

Vishwanatha-HD force-pushed the fixUtilByteStreamSpltIntrnl branch from 1c499e4 to 70fd9c4 Compare November 26, 2025 12:54

Vishwanatha-HD commented Nov 26, 2025

github-actions bot added the awaiting committer review label and removed the awaiting review label on Nov 26, 2025

kiszk reviewed Nov 27, 2025

github-actions bot added the awaiting changes label and removed the awaiting committer review label on Nov 27, 2025

Vishwanatha-HD commented Nov 21, 2025 (edited by kou)

4 participants
