GH-48216: [C++][Parquet] Fix Util Byte Stream Split Internal logic to enable Parquet DB support on s390x #48217
Something I’ve seen on s390x is that ByteStreamSplit behaves most predictably when the data feeding into the split is already in a well-defined byte order before the interleaving happens. When values arrive in native order on BE, the shuffling pattern can produce different byte layouts than what downstream readers or stats logic expect on LE hosts.

Looking at this patch, the swap + reversed-stream approach inside

On the merge side, I’m also curious whether the current stream reversal covers the cases where BE decoding would otherwise lean on helpers that expect the shuffled bytes to correspond to LE-origin data.

Not raising any correctness objections here; just sharing a few behaviors I’ve run into while testing BSS more broadly on BE systems.
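To make the byte-order concern concrete, here is a minimal sketch of a byte stream split that produces the same streams on LE and BE hosts. It is not the code in this patch or Arrow's API: `SplitLE` and `MergeLE` are hypothetical names, and the sketch assumes a 32-bit IEEE `float` whose byte order matches the host's integer byte order. Stream k always receives byte k of the value's little-endian representation because the bytes are extracted with shifts rather than read from memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not Arrow's API): split floats into 4 byte streams.
// Stream k holds byte k of each value's little-endian representation.
// Shifts operate on the numeric value, so the output does not depend on
// the host's byte order.
std::vector<uint8_t> SplitLE(const std::vector<float>& values) {
  constexpr size_t kWidth = sizeof(float);
  const size_t n = values.size();
  std::vector<uint8_t> out(n * kWidth);
  for (size_t i = 0; i < n; ++i) {
    uint32_t bits;
    std::memcpy(&bits, &values[i], kWidth);  // native bit pattern
    for (size_t k = 0; k < kWidth; ++k) {
      out[k * n + i] = static_cast<uint8_t>(bits >> (8 * k));
    }
  }
  return out;
}

// Hypothetical inverse of SplitLE: rebuild each value from its 4 streams.
std::vector<float> MergeLE(const std::vector<uint8_t>& streams) {
  constexpr size_t kWidth = sizeof(float);
  assert(streams.size() % kWidth == 0);
  const size_t n = streams.size() / kWidth;
  std::vector<float> values(n);
  for (size_t i = 0; i < n; ++i) {
    uint32_t bits = 0;
    for (size_t k = 0; k < kWidth; ++k) {
      bits |= static_cast<uint32_t>(streams[k * n + i]) << (8 * k);
    }
    std::memcpy(&values[i], &bits, kWidth);
  }
  return values;
}
```

A shift-based scalar path like this trades speed for portability; code that loads whole words from memory needs an explicit byte swap or a reversed stream index on BE hosts to reach the same layout.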
Vishwanatha-HD left a comment:
I have addressed all the code review comments.
dest[stream + (i + 5) * width] = static_cast<uint8_t>(v >> 40);
dest[stream + (i + 6) * width] = static_cast<uint8_t>(v >> 48);
dest[stream + (i + 7) * width] = static_cast<uint8_t>(v >> 56);
const int dest_stream = stream;
Would it be better to move const int ... for both endians up to line 387? A compiler may hoist it in the current code, but moving it in the source guarantees that it is hoisted.
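The hoisting suggested here can be sketched as follows. This is an illustration under stated assumptions, not the code in this patch: `MergeStreams` is a hypothetical function, the `native_big_endian` parameter stands in for the compile-time endianness check, and the mapping `width - 1 - stream` on BE is assumed from the stream reversal discussed above. The point is only that `dest_stream` is computed once per stream, before the inner loop, for both endians.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar merge (decode) of `width` byte streams.
// `native_big_endian` is an assumption for illustration; real code would
// use a compile-time endianness check instead of a runtime flag.
std::vector<uint8_t> MergeStreams(const std::vector<uint8_t>& src,
                                  size_t width, bool native_big_endian) {
  assert(width > 0 && src.size() % width == 0);
  const size_t n = src.size() / width;  // number of values
  std::vector<uint8_t> dest(src.size());
  for (size_t stream = 0; stream < width; ++stream) {
    // Hoisted: one declaration serves both endians and is evaluated
    // once per stream instead of once per value.
    const size_t dest_stream =
        native_big_endian ? width - 1 - stream : stream;
    const uint8_t* in = src.data() + stream * n;
    for (size_t i = 0; i < n; ++i) {
      dest[dest_stream + i * width] = in[i];
    }
  }
  return dest;
}
```

With the streams of 1.0f and 2.0f as input, the LE path yields the bytes 00 00 80 3F 00 00 00 40 and the BE path yields 3F 80 00 00 40 00 00 00, which are the native representations of the two values on each kind of host.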
Rationale for this change
This PR is intended to enable Parquet DB support on big-endian (s390x) systems. It fixes the byte stream split logic in "util/byte_stream_split_internal".
What changes are included in this PR?
The fix changes the following file:
cpp/src/arrow/util/byte_stream_split_internal.h
Are these changes tested?
Yes. The changes were tested on s390x to confirm that they work as intended, and on x86 to confirm that no regression is introduced.
Are there any user-facing changes?
No.
GitHub main Issue link: #48151