Skip to content

Handle different data types #8

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Aug 19, 2025
Merged

Handle different data types #8

merged 17 commits into from
Aug 19, 2025

Conversation

Hind-M
Copy link
Member

@Hind-M Hind-M commented Jul 28, 2025

No description provided.

REQUIRE_NE(arrow_array_deserialized_ar, nullptr);
compare_arrow_arrays(*arrow_array_ar, *arrow_array_deserialized_ar);

// compare_values<T>(ar, deserialized_ar);
Copy link
Member Author

@Hind-M Hind-M Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is failing on windows, as CHECK_EQ(ar, deserialized_ar);.
cf. error here.
We should either fix this upstream in sparrow (if indeed something is missing) or at least try and use comparison functions from sparrow (to be exposed if not already).

Note that it wasn't the case before this PR, i.e CHECK_EQ(ar, deserialized_ar); was not failing on windows when building (tests are commented out there for now), but maybe this was some latent issue that was exposed when adding more tests and changing the structure?...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a compiler issue, this operator is defined ins sparrow

Copy link
Member Author

@Hind-M Hind-M Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE:
We should test again - with this PR merged - using sparrow built from source (and then directly from conda-forge after releasing next version and confirming this fixes the issue).

@Hind-M Hind-M marked this pull request as ready for review July 30, 2025 12:23
// - Correctly populating the Flatbuffer-defined metadata for both messages.

// Create a mutable copy of the input array to allow moving its internal structures
sparrow::primitive_array<T> mutable_arr = arr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also provide a destructive version of serialize to avoid this copy? I.e. something like

template <typename T>
std::vector<uint8_t> serialize_primitive_array(sparrow::primitive_array<T>&& arr)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I decided to use extract_arrow_structures before handling the metadata (which we need to get using the primitive_array input argument through arr.metadata() and not through arrow_schema to get the right format).

If we add the overload you suggested, we would need to deep copy the metadata, and that's not ideal.
Instead I'm using now get_arrow_structures to avoid copying...
But I had to remove the const from the primitive_array input argument because of errors due to get_arrow_structures not taking a const.

=> Should we provide an overload for const in sparrow for get_arrow_structures knowing that it's calling get_arrow_proxy, and that can take either a const or a non-const ARRAY?

Comment on lines +50 to +82
{
// Create a new builder for the Schema message's metadata
flatbuffers::FlatBufferBuilder schema_builder;

// Create the Field metadata, which describes a single column (or array)
flatbuffers::Offset<flatbuffers::String> fb_name_offset = 0;
if (arrow_schema.name)
{
fb_name_offset = schema_builder.CreateString(arrow_schema.name);
}

// Determine the Flatbuffer type information from the C schema's format string
auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format);

// Handle metadata
flatbuffers::Offset<flatbuffers::Vector<flatbuffers::Offset<org::apache::arrow::flatbuf::KeyValue>>>
fb_metadata_offset = 0;

if (arr.metadata())
{
sparrow::key_value_view metadata_view = *(arr.metadata());
std::vector<flatbuffers::Offset<org::apache::arrow::flatbuf::KeyValue>> kv_offsets;

auto mv_it = metadata_view.cbegin();
for (auto i = 0; i < metadata_view.size(); ++i, ++mv_it)
{
auto key_offset = schema_builder.CreateString(std::string((*mv_it).first));
auto value_offset = schema_builder.CreateString(std::string((*mv_it).second));
kv_offsets.push_back(
org::apache::arrow::flatbuf::CreateKeyValue(schema_builder, key_offset, value_offset));
}
fb_metadata_offset = schema_builder.CreateVector(kv_offsets);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this in a dedicated function ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes there is some more work on #10 which is based on this PR and where there is more refactoring.

sparrow::primitive_array<T> deserialize_primitive_array(const std::vector<uint8_t>& buffer);

template <typename T>
std::vector<uint8_t> serialize_primitive_array(sparrow::primitive_array<T>& arr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<uint8_t> serialize_primitive_array(sparrow::primitive_array<T>& arr)
std::vector<uint8_t> serialize_primitive_array(const sparrow::primitive_array<T>& arr)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not possible for now when using get_arrow_structures, cf. this comment.

Comment on lines +41 to +43
auto [arrow_arr_ptr, arrow_schema_ptr] = sparrow::get_arrow_structures(arr);
auto& arrow_arr = *arrow_arr_ptr;
auto& arrow_schema = *arrow_schema_ptr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto [arrow_arr_ptr, arrow_schema_ptr] = sparrow::get_arrow_structures(arr);
auto& arrow_arr = *arrow_arr_ptr;
auto& arrow_schema = *arrow_schema_ptr;
const auto [arrow_arr_ptr, arrow_schema_ptr] = sparrow::get_arrow_structures(arr);
const ArrowArray& arrow_arr = *arrow_arr_ptr;
const ArrowSchema& arrow_schema = *arrow_schema_ptr;

Comment on lines +45 to +47
// This will be the final buffer holding the complete IPC stream.
std::vector<uint8_t> final_buffer;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move it to l.114 please

}

// Determine the Flatbuffer type information from the C schema's format string
auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format);
const auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format);

auto fb_fields = schema_builder.CreateVector(fields_vec);

// Build the Schema object from the vector of fields
auto schema_offset = org::apache::arrow::flatbuf::CreateSchema(schema_builder, org::apache::arrow::flatbuf::Endianness::Little, fb_fields);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const?

schema_builder.Finish(schema_message_offset);

// Assemble the Schema message bytes
uint32_t schema_len = schema_builder.GetSize(); // Get the size of the serialized metadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t schema_len = schema_builder.GetSize(); // Get the size of the serialized metadata
const uint32_t schema_len = schema_builder.GetSize(); // Get the size of the serialized metadata

Comment on lines +133 to +135
int64_t validity_size = (arrow_arr.length + 7) / 8;
int64_t data_size = arrow_arr.length * sizeof(T);
int64_t body_len = validity_size + data_size; // The total size of the message body
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int64_t validity_size = (arrow_arr.length + 7) / 8;
int64_t data_size = arrow_arr.length * sizeof(T);
int64_t body_len = validity_size + data_size; // The total size of the message body
const int64_t validity_size = (arrow_arr.length + 7) / 8;
const int64_t data_size = arrow_arr.length * sizeof(T);
const int64_t body_len = validity_size + data_size; // The total size of the message body

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you improve the constness of the variables ?
Can you also split the functions: like for the schema, it's common to all the types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you improve the constness of the variables ? Can you also split the functions: like for the schema, it's common to all the types

Yes, see #10 (further improvements will be done there as well).

flatbuffers::FlatBufferBuilder builder;
SUBCASE("Null and Boolean types")
{
CHECK_EQ(utils::get_flatbuffer_type(builder, "n").first, org::apache::arrow::flatbuf::Type::Null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use sparrow::data_type_to_format to get the type in string format.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still missing data types in sparrow::data_type_to_format in sparrow actually.
I changed the ones available for now.

@Hind-M Hind-M merged commit f7dd61b into QuantStack:main Aug 19, 2025
12 checks passed
@Hind-M Hind-M deleted the format_fct branch August 19, 2025 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants