-
Notifications
You must be signed in to change notification settings - Fork 873
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avro codec enhancements #6965
base: main
Are you sure you want to change the base?
Avro codec enhancements #6965
Conversation
1. Namespaces 2. Enums 3. Maps 4. Decimals
…al types. Signed-off-by: Connor Sanders <[email protected]>
…al types. Signed-off-by: Connor Sanders <[email protected]>
…al types. Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
* Implemented reader decoder for Avro Lists * Cleaned up reader/record.rs and added comments for readability. Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
- Fixed - Interval Signed-off-by: Connor Sanders <[email protected]>
Added Avro codec + decoder support for new types
Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
👋 thank you for working on this, it is very exciting to see arrow-avro getting some love. I think what would probably help is to break this up into smaller pieces that can be delivered separately. Whilst I accept this is more work for you, it makes reviewing the code much more practical, especially given our relatively limited review bandwidth. Many of the bullets in your PR description would warrant a separate PR IMO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jecsand838 I triggered CI for this PR
I haven't had a chance to review the code yet, but I did start to look at the test coverage. What would you think about adding tests using the existing avro testing data in https://github.com/apache/arrow-testing/tree/master/data/avro (already a submodule in this repo)
Key tests in my mind would be:
- Read the avro testing files and verify the schema and data read (and leave comments for tests that don't pass)
- For the writer implement round trip tests: create a
RecordBatch
(es), write it to an.avro
file and then read it back in and ensure the round tripped batches are equal
You might be able to find code / testing that you can reuse in the datafusion copy: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/avro_to_arrow/arrow_array_reader.rs
I also wonder how/if this code is related to the avro rust reader/decoder in https://github.com/apache/avro-rs?
FYI @Jefffrey
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some review of this PR, I think I could likely find some additional time to review / help it along if/when it has end to end tests of reading existing avro files (that would give me confidence that the code being reviewed did the right thing functionally).
To make a specific proposal for splitting up this PRs' functionality as suggested by @tustvold -- #6965 (comment), one way to do so would be:
- First PR: Reader improvements / tests showing reading existing .avro files
- Second PR: Minimal Writer support (with tests showing round tripping for a few basic data types)
- Subsequent PRs: Additional PRs to support writing additional data types
The rationale for breaking it up this way would be to have better confidence in the read path to ensure round tripping also works as needed
Let me know what you think
And thanks again
@alamb @tustvold Thank you both so much for getting to our PR so quickly! We'd be more than happy to break this PR up as advised and add those additional tests. Your recommendation makes a lot of sense.
I'm sure there's functional overlap we can look into. We just attempted to extend the patterns @tustvold put in place. Definitely interested in hearing your thoughts on this however. |
Signed-off-by: Connor Sanders <[email protected]>
Sounds good . Ideally it would be great to avoid duplication, but if the existing avro-rs crate is row oriented, the amount of reusable code might be small as this decoder will need to be row oriented Thanks for the changes -- I just started the CI checks and took a quick look through this PR. I think the only thing left now is to add some tests that read from the existing .avro files in Thank you for working on this. |
@alamb, we're improving the reader and providing more Avro tests. As I was working on this, I noticed the module signature for arrow's parquet reader and datafusion's arrow_array_reader.rs are based on a Builder pattern. To simplify adopting these arrow-avro changes into DataFusion, should we implement the module signature to something similar to what is already in DataFusion today? Or are you looking for something else? Also, do you prefer whether the public signature for the reader should be included in this PR or a PR following this PR that proves the reader changes? I'm happy to contribute the public module signature. |
* Fixed lint issues Signed-off-by: Connor Sanders <[email protected]>
…Enum with an empty list of symbols. Signed-off-by: Connor Sanders <[email protected]>
- Reduced repeated casting to improve efficiency. - Reduced cyclomatic complexity Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Added bigzip2 and xz compression Improved nullability handling Added additional avro file tests to reader/mod.rs. Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
…/mod.rs` Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
…nforcement strictness of avro schema, i.e. Impala usecase. Signed-off-by: Connor Sanders <[email protected]>
…w-avro/src/reader/mod.rs` Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
…e cleanup Signed-off-by: Connor Sanders <[email protected]>
Thank you for this, I wonder if there is some way we might break this up into smaller pieces. A single 5000 line diff is not something I can realistically review... |
Signed-off-by: Connor Sanders <[email protected]>
Signed-off-by: Connor Sanders <[email protected]>
@tustvold I completely understand, my apologies about that! Roughly 3500 lines of the diff are tests and ~1000 lines of that 3500 is test code in If I removed all the tests we added save for the ones in |
It isn't just the sheer size of the PR but also the sheer number of things it is doing. I think this needs to be split up into smaller focused PRs to make reviewing it practical. |
…coder support + .avro file tests Signed-off-by: Connor Sanders <[email protected]>
@tustvold @alamb I have removed all changes and functionality not related to Avro reader type support and
The code remaining in this PR successfully reads all If I further split the changes, the CC: @svencowart |
Thanks @jecsand838 -- I am out on vacation this week so I may not have time to review this PR for a while. I have put it on my list however |
…ctions from codec.rs
@alamb Completely understand! Also I pushed a fix for that lint error. All of the clippy tests should be passing now. |
Which issue does this PR close?
Part of #4886
Rationale for this change
The primary objective of this PR is to enhance the arrow-rs Avro implementation by introducing full support for Avro data types, support for Avro aliases, and implementing a public facing experimental Avro Reader. These enhancements are crucial for several reasons:
1. Enhanced Data Interoperability:
By supporting these additional types, the Avro reader becomes more compatible with a wider range of Avro schemas. This ensures that users can ingest and process diverse datasets without encountering type-related limitations.
What changes are included in this PR?
Avro Codec + RecordDecoder
Are there any user-facing changes?