Conversation

@Vishwanatha-HD (Contributor) commented Nov 21, 2025

Rationale for this change

This PR enables Parquet DB support on big-endian (s390x) systems by fixing the Arrow reader_internal logic.

What changes are included in this PR?

The fix changes the following file:
cpp/src/parquet/arrow/reader_internal.cc

Are these changes tested?

Yes. The changes were tested on s390x to verify correct behavior, and also on x86 to confirm that no new regressions were introduced.

Are there any user-facing changes?

No.

Main GitHub issue: #48151

@github-actions commented:
⚠️ GitHub issue #48214 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-48214 Fix Arrow Reader Internal logic to enable Parquet DB support… GH-48214: [C++][Parquet] Fix Arrow Reader Internal logic to enable Parquet DB support on s390x Nov 22, 2025
@k8ika0s commented Nov 23, 2025

@Vishwanatha-HD

These little corners of the Arrow/Parquet bridge tend to hide the more “surprising” BE behaviors, so it’s always nice to see them getting attention.

My own s390x work didn’t touch reader_internal.cc, so I’m mostly reading this with the lens of “does this match the patterns I’ve seen on hardware.” The decimal min/max extraction you added looks straightforward, and from what I’ve observed, normalizing those integer-backed stats before they’re handed downstream makes a real difference on BE.
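(As a point of reference, not the PR's actual code: the statistics normalization being described reduces to byte-swapping the integer payload on BE hosts. A minimal sketch, assuming Arrow's arrow/util/endian.h helpers; DecodeInt64DecimalStat is a hypothetical name.)

#include <cstdint>
#include <cstring>

#include "arrow/util/endian.h"

// Parquet plain-encodes INT64-backed decimal min/max statistics as
// little-endian bytes. FromLittleEndian is the identity on LE hosts and a
// byte swap on BE hosts, so the value comes out native-endian either way.
int64_t DecodeInt64DecimalStat(const uint8_t* stat_bytes) {  // hypothetical helper
  int64_t value;
  std::memcpy(&value, stat_bytes, sizeof(value));
  return arrow::bit_util::FromLittleEndian(value);
}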

Same with the half-float swap: I’ve noticed that half-floats are one of the places where BE architectures drift quickest if the reader doesn’t explicitly convert them from little-endian to native order, so calling FromLittleEndian here feels like the safer side of the fence.
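For the half-float case, the pattern is the same, applied per 16-bit payload. A minimal sketch (NormalizeHalfFloats is a hypothetical helper, not the function the PR adds):

#include <cstdint>

#include "arrow/util/endian.h"

// FLOAT16 values arrive from Parquet as little-endian 16-bit payloads; on a
// big-endian host each one must be byte-swapped before Arrow interprets it.
void NormalizeHalfFloats(uint16_t* values, int64_t length) {  // hypothetical helper
  for (int64_t i = 0; i < length; ++i) {
    values[i] = arrow::bit_util::FromLittleEndian(values[i]);
  }
}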

I don’t see any conflicts with what I’ve been doing in encode/decode land — just wanted to chime in and confirm the behavior you’re targeting here lines up with what I’ve seen when running the full Parquet → Arrow → Parquet round-trip paths on BE.

#if ARROW_LITTLE_ENDIAN
ARROW_ASSIGN_OR_RAISE(*out, chunked_array->View(field->type()));
#else
// Convert little-endian bytes from Parquet to native-endian HalfFloat
@pitrou (Member) commented:
I would favor a different approach: turn TransferBinary into:

Status TransferBinary(RecordReader* reader, MemoryPool* pool,
                      const std::shared_ptr<Field>& logical_type_field,
                      std::function<Result<std::shared_ptr<Array>>(std::shared_ptr<Array>)> array_process,
                      std::shared_ptr<ChunkedArray>* out) {

such that the optional array_process is called for each chunk. If carefully coded, this will help limit memory consumption by disposing of old chunks while creating the new ones.
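A minimal sketch of how such a per-chunk callback might be applied (ProcessChunks and the exact types are illustrative assumptions, not the proposed API):

#include <functional>
#include <memory>
#include <utility>
#include <vector>

#include "arrow/array.h"
#include "arrow/chunked_array.h"
#include "arrow/result.h"
#include "arrow/status.h"

using ArrayProcess = std::function<arrow::Result<std::shared_ptr<arrow::Array>>(
    std::shared_ptr<arrow::Array>)>;

// Hypothetical helper showing the per-chunk callback shape described above.
arrow::Status ProcessChunks(std::vector<std::shared_ptr<arrow::Array>> chunks,
                            std::shared_ptr<arrow::DataType> type,
                            const ArrayProcess& array_process,
                            std::shared_ptr<arrow::ChunkedArray>* out) {
  for (auto& chunk : chunks) {
    if (array_process) {
      // Moving the chunk into the callback drops this function's reference,
      // so the original buffers can be released as soon as the converted
      // chunk exists, instead of old and new chunks coexisting in memory.
      ARROW_ASSIGN_OR_RAISE(chunk, array_process(std::move(chunk)));
    }
  }
  *out = std::make_shared<arrow::ChunkedArray>(std::move(chunks), std::move(type));
  return arrow::Status::OK();
}

The key point is that each original chunk becomes releasable immediately after its conversion, which is what bounds peak memory across the whole column read.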

@Vishwanatha-HD (Contributor, Author) replied:
@pitrou, thanks for your comments. Can I plan to implement this change in the next pass, since the present code change is working? We want to enable Apache Arrow support on s390x as soon as possible, since it's blocking many of the OCP AI tests on IBM Z. Thanks!

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Nov 24, 2025
@Vishwanatha-HD (Contributor, Author) left a comment:

I have addressed all the review comments. Thanks!
