-
Notifications
You must be signed in to change notification settings - Fork 354
feat!: Support compression codecs for Avro Files (inlcuding manifest and manifest lists) #1851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Previously these properties where not honored on tabel properties. - Adds table properties for these values. - Plumbs them through for writers.
|
looks like a clippy issue in CI |
Yeah working on a fix. |
|
@liurenjie1024 @kevinjqliu would you have time to review? |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, I left a few comments around using the same string as the java implementation
crates/iceberg/src/spec/avro_util.rs
Outdated
| /// | ||
| /// # Compression Levels | ||
| /// | ||
| /// The compression level mapping is based on miniz_oxide's CompressionLevel enum: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clarify this is for gzip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe merge this comment with the one above (L61-L66), the compression levels are explained twice
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks great, thanks!
adding the "breaking" label since we're modifying a few public functions
|
@kevinjqliu thanks for the reviews, I also added more e2e tests for manifest and manifest list writers to confirm compression. |
I'm not sure that I can access labels as a contributor, I put a breaking change note in the PR description. |
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @emkornfield for this pr. In general, I think this is a useful feature implementation. But we should not mix two things in one pr, I would suggest to split them into two prs: one for table metadata, and one for manifests.
| fn parse_optional_property<T: std::str::FromStr>( | ||
| properties: &HashMap<String, String>, | ||
| key: &str, | ||
| ) -> Result<Option<T>, anyhow::Error> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use this crate's error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this method? I think passing a None to default value in parse_property would be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm new to rust but would that work? Doesn't parse_property, always return an extension of String instead of an Option? It is useful to know if the value was actually configured (and thus the option return)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in iceberg all table properties are optional, if they have default values, then we should return a default value, otherwise we should return an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll see if I can make this work with empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in iceberg all table properties are optional, if they have default values, then we should return a default value, otherwise we should return an error.
The problem here is that the default value is dependent on the value for compression, it seems baking this into the parsing layer is coupling two concerns (business logic for parsing) and the actual values. To eliminate this method I've used a sentinel value for the old method. Let me know if that is seems OK.
|
|
||
| // Helper function to parse a property from a HashMap | ||
| // If the property is not found, use the default value | ||
| fn parse_property<T: std::str::FromStr>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is cleanup because it sounds like the crate's errors are preferred I can revert this.
|
@liurenjie1024 split of this PR to just be about Avro. Also based on feedback, I cleaned up some additional imports/Errors that weren't using the Crate's error before. |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, a couple of nit comments
Thanks for adding the additional tests for manifest and manifest list writers
I also added the "breaking" label, forgot to after i commented 😄
| avro_compression_level: { | ||
| let level = parse_property( | ||
| props, | ||
| TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL, | ||
| 255u8, | ||
| )?; | ||
| if level == 255 { None } else { Some(level) } | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not valid here?
avro_compression_level: parse_property(
props,
TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL,
TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT,
)?,
avro_compression_level and PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT are both Option<u8>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding the comment. But I should have cleaned up pub const PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT: Option<u8> = None;, I don't think it is used anymore.
In a prior revision I added a method (parse_optional_property) that could return None value directly but @liurenjie1024 questioned whether this was useful.
| fn default() -> Self { | ||
| Self { | ||
| codec: TableProperties::PROPERTY_AVRO_COMPRESSION_CODEC_DEFAULT.to_string(), | ||
| level: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| level: None, | |
| level: TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this.
crates/iceberg/src/spec/avro_util.rs
Outdated
| /// | ||
| /// # Compression Levels | ||
| /// | ||
| /// The compression level mapping is based on miniz_oxide's CompressionLevel enum: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe merge this comment with the one above (L61-L66), the compression levels are explained twice
| pub write_target_file_size_bytes: usize, | ||
| /// Compression codec for Avro files (manifests, manifest lists) | ||
| pub avro_compression_codec: String, | ||
| /// Compression level for Avro files (None uses codec-specific defaults: gzip=9, zstd=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// Compression level for Avro files (None uses codec-specific defaults: gzip=9, zstd=1) | |
| /// Compression level for Avro files (None uses codec-specific defaults) |
nit: these might change in the future, so i think its better to not specify the values here
there are a few other instances of this same comment
Which issue does this PR close?
What changes are included in this PR?
Previously these properties where not honored on tabel properties.
Are these changes tested?
Added unit tests
BREAKING CHANGE: Adds codec parameter to some public functions. By default start compressing manifests and manifest lists.