Skip to content

Conversation

@emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Nov 13, 2025

Which issue does this PR close?

What changes are included in this PR?

Previously these properties where not honored on tabel properties.

  • Adds table properties for these values.
  • Plumbs them through for writers.

Are these changes tested?

Added unit tests

BREAKING CHANGE: Adds codec parameter to some public functions. By default start compressing manifests and manifest lists.

Previously these properties where not honored on tabel properties.

- Adds table properties for these values.
- Plumbs them through for writers.
@kevinjqliu
Copy link
Contributor

looks like a clippy issue in CI

error: this function has too many arguments (8/7)
   --> crates/iceberg/src/spec/manifest_list.rs:243:5
    |
243 | /     fn new(
244 | |         format_version: FormatVersion,
245 | |         output_file: OutputFile,
246 | |         metadata: HashMap<String, String>,
...   |
251 | |         compression_level: u8,
252 | |     ) -> Self {
    | |_____________^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#too_many_arguments
    = note: `-D clippy::too-many-arguments` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(clippy::too_many_arguments)]`

@emkornfield
Copy link
Contributor Author

looks like a clippy issue in CI

Yeah working on a fix.

@emkornfield
Copy link
Contributor Author

@liurenjie1024 @kevinjqliu would you have time to review?

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, I left a few comments around using the same string as the java implementation

///
/// # Compression Levels
///
/// The compression level mapping is based on miniz_oxide's CompressionLevel enum:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarify this is for gzip.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe merge this comment with the one above (L61-L66), the compression levels are explained twice

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks great, thanks!
adding the "breaking" label since we're modifying a few public functions

@emkornfield
Copy link
Contributor Author

@kevinjqliu thanks for the reviews, I also added more e2e tests for manifest and manifest list writers to confirm compression.

@emkornfield
Copy link
Contributor Author

adding the "breaking" label since we're modifying a few public functions

I'm not sure that I can access labels as a contributor, I put a breaking change note in the PR description.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emkornfield for this pr. In general, I think this is a useful feature implementation. But we should not mix two things in one pr, I would suggest to split them into two prs: one for table metadata, and one for manifests.

fn parse_optional_property<T: std::str::FromStr>(
properties: &HashMap<String, String>,
key: &str,
) -> Result<Option<T>, anyhow::Error>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use this crate's error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this method? I think passing a None to default value in parse_property would be enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm new to rust but would that work? Doesn't parse_property, always return an extension of String instead of an Option? It is useful to know if the value was actually configured (and thus the option return)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in iceberg all table properties are optional, if they have default values, then we should return a default value, otherwise we should return an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see if I can make this work with empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in iceberg all table properties are optional, if they have default values, then we should return a default value, otherwise we should return an error.

The problem here is that the default value is dependent on the value for compression, it seems baking this into the parsing layer is coupling two concerns (business logic for parsing) and the actual values. To eliminate this method I've used a sentinel value for the old method. Let me know if that is seems OK.


// Helper function to parse a property from a HashMap
// If the property is not found, use the default value
fn parse_property<T: std::str::FromStr>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is cleanup because it sounds like the crate's errors are preferred I can revert this.

@emkornfield emkornfield changed the title feat!: Support compression codecs for JSON metadata and Avro feat!: Support compression codecs for Avro Files (inlcuding manifest and manifest lists) Nov 19, 2025
@emkornfield
Copy link
Contributor Author

@liurenjie1024 split of this PR to just be about Avro. Also based on feedback, I cleaned up some additional imports/Errors that weren't using the Crate's error before.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, a couple of nit comments

Thanks for adding the additional tests for manifest and manifest list writers
I also added the "breaking" label, forgot to after i commented 😄

Comment on lines +204 to +211
avro_compression_level: {
let level = parse_property(
props,
TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL,
255u8,
)?;
if level == 255 { None } else { Some(level) }
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not valid here?

            avro_compression_level: parse_property(
                props,
                TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL,
                TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT,
            )?,

avro_compression_level and PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT are both Option<u8>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding the comment. But I should have cleaned up pub const PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT: Option<u8> = None;, I don't think it is used anymore.

In a prior revision I added a method (parse_optional_property) that could return None value directly but @liurenjie1024 questioned whether this was useful.

fn default() -> Self {
Self {
codec: TableProperties::PROPERTY_AVRO_COMPRESSION_CODEC_DEFAULT.to_string(),
level: None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
level: None,
level: TableProperties::PROPERTY_AVRO_COMPRESSION_LEVEL_DEFAULT,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing this.

///
/// # Compression Levels
///
/// The compression level mapping is based on miniz_oxide's CompressionLevel enum:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe merge this comment with the one above (L61-L66), the compression levels are explained twice

pub write_target_file_size_bytes: usize,
/// Compression codec for Avro files (manifests, manifest lists)
pub avro_compression_codec: String,
/// Compression level for Avro files (None uses codec-specific defaults: gzip=9, zstd=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Compression level for Avro files (None uses codec-specific defaults: gzip=9, zstd=1)
/// Compression level for Avro files (None uses codec-specific defaults)

nit: these might change in the future, so i think its better to not specify the values here

there are a few other instances of this same comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] iceberg-rust does not respect compression settings for metadata & avro

3 participants