-
Notifications
You must be signed in to change notification settings - Fork 2
Handle different data types #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
REQUIRE_NE(arrow_array_deserialized_ar, nullptr); | ||
compare_arrow_arrays(*arrow_array_ar, *arrow_array_deserialized_ar); | ||
|
||
// compare_values<T>(ar, deserialized_ar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is failing on windows, as CHECK_EQ(ar, deserialized_ar);
.
cf. error here.
We should either fix this upstream in sparrow
(if indeed something is missing) or at least try and use comparison functions from sparrow (to be exposed if not already).
Note that it wasn't the case before this PR, i.e CHECK_EQ(ar, deserialized_ar);
was not failing on windows when building (tests are commented out there for now), but maybe this was some latent issue that was exposed when adding more tests and changing the structure?...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a compiler issue, this operator is defined ins sparrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE:
We should test again - with this PR merged - using sparrow
built from source (and then directly from conda-forge
after releasing next version and confirming this fixes the issue).
src/serialize.cpp
Outdated
// - Correctly populating the Flatbuffer-defined metadata for both messages. | ||
|
||
// Create a mutable copy of the input array to allow moving its internal structures | ||
sparrow::primitive_array<T> mutable_arr = arr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also provide a destructive version of serialize to avoid this copy? I.e. something like
template <typename T>
std::vector<uint8_t> serialize_primitive_array(sparrow::primitive_array<T>&& arr)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I decided to use extract_arrow_structures
before handling the metadata (which we need to get using the primitive_array
input argument through arr.metadata()
and not through arrow_schema
to get the right format).
If we add the overload you suggested, we would need to deep copy the metadata, and that's not ideal.
Instead I'm using now get_arrow_structures
to avoid copying...
But I had to remove the const
from the primitive_array
input argument because of errors due to get_arrow_structures
not taking a const
.
=> Should we provide an overload for const
in sparrow
for get_arrow_structures
knowing that it's calling get_arrow_proxy, and that can take either a const or a non-const ARRAY
?
{ | ||
// Create a new builder for the Schema message's metadata | ||
flatbuffers::FlatBufferBuilder schema_builder; | ||
|
||
// Create the Field metadata, which describes a single column (or array) | ||
flatbuffers::Offset<flatbuffers::String> fb_name_offset = 0; | ||
if (arrow_schema.name) | ||
{ | ||
fb_name_offset = schema_builder.CreateString(arrow_schema.name); | ||
} | ||
|
||
// Determine the Flatbuffer type information from the C schema's format string | ||
auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format); | ||
|
||
// Handle metadata | ||
flatbuffers::Offset<flatbuffers::Vector<flatbuffers::Offset<org::apache::arrow::flatbuf::KeyValue>>> | ||
fb_metadata_offset = 0; | ||
|
||
if (arr.metadata()) | ||
{ | ||
sparrow::key_value_view metadata_view = *(arr.metadata()); | ||
std::vector<flatbuffers::Offset<org::apache::arrow::flatbuf::KeyValue>> kv_offsets; | ||
|
||
auto mv_it = metadata_view.cbegin(); | ||
for (auto i = 0; i < metadata_view.size(); ++i, ++mv_it) | ||
{ | ||
auto key_offset = schema_builder.CreateString(std::string((*mv_it).first)); | ||
auto value_offset = schema_builder.CreateString(std::string((*mv_it).second)); | ||
kv_offsets.push_back( | ||
org::apache::arrow::flatbuf::CreateKeyValue(schema_builder, key_offset, value_offset)); | ||
} | ||
fb_metadata_offset = schema_builder.CreateVector(kv_offsets); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this in a dedicated function ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes there is some more work on #10 which is based on this PR and where there is more refactoring.
sparrow::primitive_array<T> deserialize_primitive_array(const std::vector<uint8_t>& buffer); | ||
|
||
template <typename T> | ||
std::vector<uint8_t> serialize_primitive_array(sparrow::primitive_array<T>& arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::vector<uint8_t> serialize_primitive_array(sparrow::primitive_array<T>& arr) | |
std::vector<uint8_t> serialize_primitive_array(const sparrow::primitive_array<T>& arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually not possible for now when using get_arrow_structures
, cf. this comment.
auto [arrow_arr_ptr, arrow_schema_ptr] = sparrow::get_arrow_structures(arr); | ||
auto& arrow_arr = *arrow_arr_ptr; | ||
auto& arrow_schema = *arrow_schema_ptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto [arrow_arr_ptr, arrow_schema_ptr] = sparrow::get_arrow_structures(arr); | |
auto& arrow_arr = *arrow_arr_ptr; | |
auto& arrow_schema = *arrow_schema_ptr; | |
const auto [arrow_arr_ptr, arrow_schema_ptr] = sparrow::get_arrow_structures(arr); | |
const ArrowArray& arrow_arr = *arrow_arr_ptr; | |
const ArrowSchema& arrow_schema = *arrow_schema_ptr; |
// This will be the final buffer holding the complete IPC stream. | ||
std::vector<uint8_t> final_buffer; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move it to l.114 please
} | ||
|
||
// Determine the Flatbuffer type information from the C schema's format string | ||
auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format); | |
const auto [type_enum, type_offset] = utils::get_flatbuffer_type(schema_builder, arrow_schema.format); |
auto fb_fields = schema_builder.CreateVector(fields_vec); | ||
|
||
// Build the Schema object from the vector of fields | ||
auto schema_offset = org::apache::arrow::flatbuf::CreateSchema(schema_builder, org::apache::arrow::flatbuf::Endianness::Little, fb_fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const?
schema_builder.Finish(schema_message_offset); | ||
|
||
// Assemble the Schema message bytes | ||
uint32_t schema_len = schema_builder.GetSize(); // Get the size of the serialized metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint32_t schema_len = schema_builder.GetSize(); // Get the size of the serialized metadata | |
const uint32_t schema_len = schema_builder.GetSize(); // Get the size of the serialized metadata |
int64_t validity_size = (arrow_arr.length + 7) / 8; | ||
int64_t data_size = arrow_arr.length * sizeof(T); | ||
int64_t body_len = validity_size + data_size; // The total size of the message body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int64_t validity_size = (arrow_arr.length + 7) / 8; | |
int64_t data_size = arrow_arr.length * sizeof(T); | |
int64_t body_len = validity_size + data_size; // The total size of the message body | |
const int64_t validity_size = (arrow_arr.length + 7) / 8; | |
const int64_t data_size = arrow_arr.length * sizeof(T); | |
const int64_t body_len = validity_size + data_size; // The total size of the message body |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you improve the constness of the variables ?
Can you also split the functions: like for the schema, it's common to all the types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you improve the constness of the variables ? Can you also split the functions: like for the schema, it's common to all the types
Yes, see #10 (further improvements will be done there as well).
tests/test_utils.cpp
Outdated
flatbuffers::FlatBufferBuilder builder; | ||
SUBCASE("Null and Boolean types") | ||
{ | ||
CHECK_EQ(utils::get_flatbuffer_type(builder, "n").first, org::apache::arrow::flatbuf::Type::Null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use sparrow::data_type_to_format to get the type in string format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still missing data types in sparrow::data_type_to_format
in sparrow
actually.
I changed the ones available for now.
No description provided.