Add output streams and serializers #29

Alex-PLACET · 2025-10-03T14:00:30Z

No description provided.

Copilot

Pull Request Overview

Introduces a new streaming-based serialization architecture replacing vector-returning functions with pluggable output_stream abstractions (memory, file, chunked) and adds higher-level serializer / chunk_serializer utilities. Refactors flatbuffer construction logic into flatbuffer_utils, moves type mapping out of utils, and adds size‑estimation helpers for preallocation.

Added output_stream interface with memory, file, and chunked implementations plus new (chunk_)serializer classes
Refactored flatbuffer and body serialization into modular helpers; functions now write directly to streams
Added size estimation utilities (calculate_*_message_size / calculate_total_serialized_size) and extensive new tests

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 20 comments.


File	Description
src/utils.cpp	Removes flatbuffer type logic; retains parsing & alignment helpers
src/flatbuffer_utils.cpp / include/sparrow_ipc/flatbuffer_utils.hpp	New central flatbuffer construction and buffer/node utilities
src/serialize_utils.cpp / include/sparrow_ipc/serialize_utils.hpp	Stream-oriented serialization helpers and size calculators
src/serialize.cpp / include/sparrow_ipc/serialize.hpp	Stream-based schema & record batch serialization entry points
src/serializer.cpp / include/sparrow_ipc/serializer.hpp	Adds serializer class for continuous IPC stream writing
src/chunk_memory_serializer.cpp / include/sparrow_ipc/chunk_memory_serializer.hpp	Adds chunked (per-message vector) serialization
include/sparrow_ipc/output_stream.hpp	Defines abstract output_stream interface
include/sparrow_ipc/memory_output_stream.hpp	In-memory implementation
include/sparrow_ipc/file_output_stream.hpp / src/file_output_stream.cpp	File-backed implementation
include/sparrow_ipc/chunk_memory_output_stream.hpp	Chunked multi-vector output stream (name misspelled)
tests/*	Updated & expanded tests for new streaming APIs and utilities
CMakeLists.txt / tests/CMakeLists.txt	Adds new sources & headers to build system
include/sparrow_ipc/utils.hpp	API changes: align_to_8 signature and new parse_format exposure
Other headers	Minor adjustments (e.g., added )


Copilot · 2025-10-03T14:05:09Z

include/sparrow_ipc/serializer.hpp

+        template <std::ranges::input_range R>
+            requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
+        serializer(const R& record_batches, output_stream& stream)
+            : m_pstream(&stream)
+            , m_dtypes(get_column_dtypes(record_batches[0]))
+        {
+            if (record_batches.empty())
+            {
+                throw std::invalid_argument("Record batches collection is empty");
+            }


The multi-batch constructor writes the first record batch twice: once via append(record_batches) (which iterates all elements) after already serializing record_batches[0]. This duplicates the first batch in the output. Either start the append loop from the second element or exclude record_batches[0] from append.

Copilot · 2025-10-03T14:05:10Z

include/sparrow_ipc/serializer.hpp

+            };
+            m_pstream->reserve(reserve_function);
+            serialize_schema_message(record_batches[0], *m_pstream);
+            append(record_batches);


The multi-batch constructor writes the first record batch twice: once via append(record_batches) (which iterates all elements) after already serializing record_batches[0]. This duplicates the first batch in the output. Either start the append loop from the second element or exclude record_batches[0] from append.

Suggested change

-            append(record_batches);
+            if (std::ranges::distance(record_batches) > 1) {
+                append(std::ranges::subrange(std::next(record_batches.begin()), record_batches.end()));
+            }

No, we first serialize the schema, then we serialize the arrays.
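
The distinction in this reply can be illustrated with a toy model. The sketch below assumes that the schema message is built only from the first batch's metadata and that every batch, including the first, then gets its own record batch message. `demo_batch` and `serialize_all` are hypothetical names for this example and say nothing about how the real functions are implemented.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in; the real sparrow::record_batch is far richer.
struct demo_batch
{
    std::string schema;  // e.g. column names and types
    std::string data;    // the column values
};

// Returns the sequence of messages a schema-then-batches constructor would emit.
inline std::vector<std::string> serialize_all(const std::vector<demo_batch>& batches)
{
    if (batches.empty())
    {
        throw std::invalid_argument("Record batches collection is empty");
    }
    std::vector<std::string> messages;
    // Schema message: derived from the first batch's metadata only.
    messages.push_back("schema:" + batches.front().schema);
    // Record batch messages: one per batch, including the first.
    for (const auto& batch : batches)
    {
        messages.push_back("batch:" + batch.data);
    }
    return messages;
}
```

Under these assumptions, N batches yield N + 1 messages, and the first batch's data appears exactly once.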

Copilot · 2025-10-03T14:05:10Z

include/sparrow_ipc/chunk_memory_serializer.hpp

+        template <std::ranges::input_range R>
+            requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
+        chunk_serializer(
+            const R& record_batches,
+            chuncked_memory_output_stream<std::vector<std::vector<uint8_t>>>& stream
+        )
+            : m_pstream(&stream)
+        {
+            if (record_batches.empty())
+            {
+                throw std::invalid_argument("Record batches collection is empty");
+            }
+            m_dtypes = get_column_dtypes(record_batches[0]);


Same duplication issue: the first record batch is serialized twice (once explicitly and once through append(record_batches)). Adjust append to skip the first element or iterate from the second batch here.

include/sparrow_ipc/chunk_memory_serializer.hpp

include/sparrow_ipc/serializer.hpp

include/sparrow_ipc/chunk_memory_output_stream.hpp

include/sparrow_ipc/utils.hpp

src/serialize_utils.cpp

codecov-commenter · 2025-10-03T14:22:39Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 89.48413% with 53 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@25a9ef6). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/serializer.cpp	40.00%	21 Missing ⚠️
src/flatbuffer_utils.cpp	91.74%	18 Missing ⚠️
include/sparrow_ipc/chunk_memory_output_stream.hpp	90.90%	3 Missing ⚠️
include/sparrow_ipc/output_stream.hpp	70.00%	3 Missing ⚠️
include/sparrow_ipc/serializer.hpp	90.62%	3 Missing ⚠️
src/chunk_memory_serializer.cpp	93.54%	2 Missing ⚠️
src/file_output_stream.cpp	93.54%	2 Missing ⚠️
src/serialize_utils.cpp	96.00%	1 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #29   +/-   ##
=======================================
  Coverage        ?   78.46%           
=======================================
  Files           ?       32           
  Lines           ?     1291           
  Branches        ?        0           
=======================================
  Hits            ?     1013           
  Misses          ?      278           
  Partials        ?        0

Flag	Coverage Δ
unittests	`78.46% <89.48%> (?)`



Alex-PLACET · 2025-10-07T12:42:52Z

tests/test_utils.cpp

        CHECK_EQ(utils::align_to_8(15), 16);
        CHECK_EQ(utils::align_to_8(16), 16);
    }
-


Moved to another test file
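
For readers unfamiliar with the helper exercised by these checks, here is a minimal sketch of rounding up to a multiple of 8. The real `utils::align_to_8` signature is not shown in this thread, so the parameter and return types below are assumptions.

```cpp
#include <cstddef>

// Rounds n up to the next multiple of 8; already-aligned values are unchanged.
constexpr std::size_t align_to_8(std::size_t n)
{
    return (n + 7) & ~static_cast<std::size_t>(7);
}

// The two cases from the test snippet above, checked at compile time.
static_assert(align_to_8(15) == 16);
static_assert(align_to_8(16) == 16);
```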

Alex-PLACET · 2025-10-07T12:43:30Z

tests/test_flatbuffer_utils.cpp

Moved from another test file


JohanMabille

The implementation is neat; I have just a few remarks regarding the architecture and API.

API

We want to be able to write something like:

std::ofstream out("my_file");
serializer ser(out);
ser << my_schema << my_array << my_list_of_batches << end_of_record;

Serializer

This means that:

The serializer must accept any kind of stream, including the standard ones (see the next section for more detail).
The serializer should store an internal state (waiting for a schema before accepting arrays or record batches, for instance) so that its constructor does not need to accept a record_batch.

The serializer's method names should mirror the stream's method names (i.e. append should be called write). The serializer should also contain all the sparrow-specific logic, such as adding padding.

Streams

The hierarchy of streams is actually a hierarchy of stream adaptors. I think this hierarchy can be removed, and some concepts can be provided instead, to help the implementation of the serializer:

template <class T>
concept output_stream = requires(T& t, const char* str)
{
    t.write(str, size_t(0));
    t.flush();
};

template <class T>
concept reservable_output_stream = output_stream<T> && requires(T& t)
{
    t.reserve(size_t(0));
};

If you need a layer between the streams and the serializer to adapt the signatures (because streams accept const char* while you serialize into std::span<uint8_t>), a stream_adapter class can be used to avoid repeating the cast everywhere. This layer should not contain additional logic like add_padding, which should live in the serializer.

Markers

The markers (for indicating the end of a record_batch list, for instance) should follow the same pattern as std::endl: a function accepting and returning a serializer.
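
The manipulator pattern mentioned here can be sketched as follows. `recording_serializer` is a toy class invented for this example, and the `end_of_record` name is borrowed from the snippet earlier in this review; what the marker writes here is a placeholder, not the real end-of-stream encoding.

```cpp
#include <string>
#include <vector>

// Toy serializer that records the messages it is asked to write.
class recording_serializer
{
public:
    recording_serializer& write(const std::string& msg)
    {
        m_messages.push_back(msg);
        return *this;
    }

    const std::vector<std::string>& messages() const { return m_messages; }

private:
    std::vector<std::string> m_messages;
};

inline recording_serializer& operator<<(recording_serializer& ser, const std::string& msg)
{
    return ser.write(msg);
}

// Overload that applies a manipulator, mirroring how std::ostream handles std::endl.
inline recording_serializer&
operator<<(recording_serializer& ser, recording_serializer& (*manip)(recording_serializer&))
{
    return manip(ser);
}

// Hypothetical marker: a function accepting and returning a serializer.
inline recording_serializer& end_of_record(recording_serializer& ser)
{
    return ser.write("<end-of-record>");
}
```

With these overloads, `ser << std::string("batch") << end_of_record;` records the batch message followed by the marker, and new markers can be added as free functions without touching the serializer class.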

Add output streams and serializers

2e32ebb

Alex-PLACET requested a review from Copilot October 3, 2025 14:00

fix

c4d42eb

Copilot AI reviewed Oct 3, 2025


Alex-PLACET added 3 commits October 3, 2025 16:47

fix

4785a7b

wip

b3beb9a

fix

08e0183

Alex-PLACET requested a review from JohanMabille October 6, 2025 12:27

Alex-PLACET and others added 4 commits October 6, 2025 15:56

Update include/sparrow_ipc/serializer.hpp

36fe236

Co-authored-by: Copilot <[email protected]>

Fix name

b5b3d97

wip

458a9cf

Merge branch 'add_output_stream_and_serializers' of https://github.co…

14952d0

…m/Alex-PLACET/sparrow-ipc into add_output_stream_and_serializers

Alex-PLACET marked this pull request as ready for review October 7, 2025 11:05

Update include/sparrow_ipc/serializer.hpp

75c9827

Co-authored-by: Copilot <[email protected]>

Alex-PLACET force-pushed the add_output_stream_and_serializers branch from 14952d0 to 75c9827 Compare October 7, 2025 12:41


Alex-PLACET added 3 commits October 8, 2025 14:42

Merge branch 'add_output_stream_and_serializers' of https://github.co…

7d0e68b

…m/Alex-PLACET/sparrow-ipc into add_output_stream_and_serializers

Try fix

37ec2fe

fix

c69100f

JohanMabille requested changes Oct 10, 2025
