14 changes: 14 additions & 0 deletions cpp/src/arrow/util/bit_stream_utils_internal.h
@@ -365,6 +365,9 @@ inline bool BitReader::GetVlqInt(Int* v) {
// In all case, we read a byte-aligned value, skipping remaining bits
const uint8_t* data = NULLPTR;
int max_size = 0;
#if ARROW_LITTLE_ENDIAN
// The data that we will pass to the LEB128 parser

Suggested change
// The data that we will pass to the LEB128 parser
// The data that we will pass to the LEB128 parser.

// In all case, we read a byte-aligned value, skipping remaining bits

Suggested change
// In all case, we read a byte-aligned value, skipping remaining bits
// In all case, we read a byte-aligned value, skipping remaining bits.


// Number of bytes left in the buffered values, not including the current
// byte (i.e., there may be an additional fraction of a byte).
@@ -381,6 +384,17 @@ inline bool BitReader::GetVlqInt(Int* v) {
max_size = bytes_left();
data = buffer_ + (max_bytes_ - max_size);
}
#else
// For VLQ reading, always read directly from buffer to avoid endianness issues
// with buffered_values_ on big-endian systems like s390x

Suggested change
// with buffered_values_ on big-endian systems like s390x
// with buffered_values_ on big-endian systems like s390x.

// Calculate current position in buffer accounting for bit offset

Suggested change
// Calculate current position in buffer accounting for bit offset
// Calculate current position in buffer accounting for bit offset.

const int current_byte_offset = byte_offset_ + bit_util::BytesForBits(bit_offset_);
const int bytes_left_in_buffer = max_bytes_ - current_byte_offset;

// Always read from buffer directly to avoid endianness issues
data = buffer_ + current_byte_offset;
max_size = bytes_left_in_buffer;
Comment on lines +391 to +396

Is this the same logic as

} else {
max_size = bytes_left();
data = buffer_ + (max_bytes_ - max_size);
}
?

If so, should we reuse it, with something like the following?

diff --git a/cpp/src/arrow/util/bit_stream_utils_internal.h b/cpp/src/arrow/util/bit_stream_utils_internal.h
index d8c7317fe8..7352312782 100644
--- a/cpp/src/arrow/util/bit_stream_utils_internal.h
+++ b/cpp/src/arrow/util/bit_stream_utils_internal.h
@@ -366,6 +366,9 @@ inline bool BitReader::GetVlqInt(Int* v) {
   const uint8_t* data = NULLPTR;
   int max_size = 0;
 
+#if ARROW_LITTLE_ENDIAN
+  // TODO: Describe why we need this only for little-endian.
+
   // Number of bytes left in the buffered values, not including the current
   // byte (i.e., there may be an additional fraction of a byte).
   const int bytes_left_in_cache =
@@ -377,7 +380,9 @@ inline bool BitReader::GetVlqInt(Int* v) {
     data = reinterpret_cast<const uint8_t*>(&buffered_values_) +
            bit_util::BytesForBits(bit_offset_);
     // Otherwise, we try straight from buffer (ignoring few bytes that may be cached)
-  } else {
+  } else
+#endif
+  {
     max_size = bytes_left();
     data = buffer_ + (max_bytes_ - max_size);
   }
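
As background for the TODO in the diff above, here is a minimal sketch of why a byte-wise view of a cached 64-bit word can only stand in for the stream on little-endian hosts. It assumes the cache holds the stream's bytes in little-endian value order, which is how `buffered_values_` appears to be used here; `LoadLittleEndian64`, `HostIsLittleEndian`, `CacheBytesMatchStream`, and `CheckCacheView` are hypothetical names for illustration, not Arrow APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: build a 64-bit word whose *value* has stream byte 0 in
// its least significant byte, independent of host endianness.
inline uint64_t LoadLittleEndian64(const uint8_t* p) {
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i) {
    v |= static_cast<uint64_t>(p[i]) << (8 * i);
  }
  return v;
}

// Detect host byte order at runtime by inspecting the first byte of a 16-bit 1.
inline bool HostIsLittleEndian() {
  const uint16_t probe = 1;
  uint8_t first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// True when the in-memory bytes of the cached word are in stream order, i.e.
// when viewing the word's storage as bytes would yield the original stream.
inline bool CacheBytesMatchStream(const uint8_t* stream) {
  const uint64_t cached = LoadLittleEndian64(stream);
  return std::memcmp(&cached, stream, sizeof(cached)) == 0;
}

// With distinct bytes in the stream, the byte view matches only on
// little-endian hosts; on big-endian hosts the storage order is reversed.
inline void CheckCacheView() {
  const uint8_t stream[8] = {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08};
  assert(CacheBytesMatchStream(stream) == HostIsLittleEndian());
}
```

Under that assumption, taking the address of the cached word and offsetting into it is only a valid shortcut when the host is little-endian, whereas reading from `buffer_` directly is byte-order independent.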

#endif

const auto bytes_read = bit_util::ParseLeadingLEB128(data, max_size, v);
if (ARROW_PREDICT_FALSE(bytes_read == 0)) {
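
For readers unfamiliar with the encoding, the following is a rough sketch of what parsing a leading unsigned LEB128 value from `data` with at most `max_size` bytes involves. `ParseLeb128Sketch` is a hypothetical stand-in written for illustration and is not Arrow's `bit_util::ParseLeadingLEB128`; it only mirrors the convention, suggested by the `bytes_read == 0` check above, that zero bytes read signals failure.

```cpp
#include <cstdint>

// Hypothetical sketch of unsigned LEB128 decoding into a 32-bit value.
// Returns the number of bytes consumed, or 0 if the input ends before the
// terminating byte or the encoded value does not fit in 32 bits.
inline int ParseLeb128Sketch(const uint8_t* data, int max_size, uint32_t* out) {
  constexpr int kMaxBytes = 5;  // ceil(32 / 7) bytes can carry 32 bits
  uint32_t result = 0;
  for (int i = 0; i < max_size && i < kMaxBytes; ++i) {
    const uint32_t payload = data[i] & 0x7F;  // low 7 bits carry the value
    // The fifth byte may only contribute the top 4 bits of a 32-bit value.
    if (i == kMaxBytes - 1 && payload > 0x0F) {
      return 0;
    }
    result |= payload << (7 * i);
    if ((data[i] & 0x80) == 0) {  // a clear high bit marks the last byte
      *out = result;
      return i + 1;
    }
  }
  return 0;  // truncated input or too many continuation bytes
}
```

Because the decoder walks the input one byte at a time in stream order, it needs `data` to point at bytes laid out exactly as they appear in the encoded buffer.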