Skip to content

iceberg: add configurable Parquet string encoding for Redshift Spectrum compatibility#4401

Merged
josephwoodward merged 3 commits into
mainfrom
jw/parquetredshipcompatibility
May 6, 2026
Merged

iceberg: add configurable Parquet string encoding for Redshift Spectrum compatibility#4401
josephwoodward merged 3 commits into
mainfrom
jw/parquetredshipcompatibility

Conversation

@josephwoodward
Copy link
Copy Markdown
Contributor

@josephwoodward josephwoodward commented May 6, 2026

image

@josephwoodward josephwoodward marked this pull request as ready for review May 6, 2026 17:54
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Commits
LGTM

Review
Small, focused change adding a parquet.string_encoding config field on the Iceberg output to opt into plain encoding for Redshift Spectrum compatibility. Threads []parquet.WriterOption through NewRouterNewWriternewParquetSink, and the inline config parsing mirrors the existing parseCommitConfig pattern in the same file. The default (delta_length_byte_array) preserves prior behavior.

LGTM

@josephwoodward josephwoodward force-pushed the jw/parquetredshipcompatibility branch from b572717 to bc5417a Compare May 6, 2026 18:04
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Commits
LGTM

Review
PR adds an optional parquet.string_encoding config field to the iceberg output to allow choosing plain over delta_length_byte_array for Redshift Spectrum compatibility. Field constants follow the existing io prefix convention, the optional parquet block is correctly gated via conf.Contains(), the default matches parquet-go's library default (so no option is added unless plain is selected), and pqSchema is placed first in the options slice so user-supplied options can override schema-derived defaults. Test helpers and e2e callers are updated for the new NewRouter/NewWriter parameter.

LGTM

Comment thread internal/impl/iceberg/output_iceberg.go Outdated
@josephwoodward josephwoodward force-pushed the jw/parquetredshipcompatibility branch from 58f3a1d to 77d7709 Compare May 6, 2026 19:45
@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Commits
LGTM

Review
Adds an optional parquet.string_encoding field on the iceberg output to switch the Parquet BYTE_ARRAY encoding from delta_length_byte_array to plain for Redshift Spectrum compatibility. The new option is threaded through Router and writer into parquetSink, with the prior behavior preserved by default. Field constants, Optional()/Default() placement, Contains() guard, and error wrapping all follow project patterns.

LGTM

Comment on lines +106 to 122
// Parse parquet config
var writerOpts []parquet.WriterOption
if conf.Contains(ioFieldParquet) {
strEnc, err := conf.FieldString(ioFieldParquet, ioFieldParquetStringEncoding)
if err != nil {
return nil, fmt.Errorf("parsing %s: %w", ioFieldParquetStringEncoding, err)
}
switch strEnc {
case "plain":
writerOpts = append(writerOpts, parquet.DefaultEncodingFor(parquet.ByteArray, &parquet.Plain))
case "delta_length_byte_array":
// default - noop
default:
return nil, fmt.Errorf("unsupported %s value: %q, please consider raising an issue to request support for feature gap", ioFieldParquetStringEncoding, strEnc)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new parquet.string_encoding config field is introduced without any test coverage. Per the project's test patterns, changed code should be accompanied by tests — at minimum a config-parsing/linting test exercising both plain and delta_length_byte_array (and ideally a writer test verifying the option is actually plumbed through to the underlying parquet writer). The existing writer_test.go and router_test.go could be extended for this.

@claude
Copy link
Copy Markdown

claude Bot commented May 6, 2026

Commits

  1. Commits iceberg: use switch statement to show options explicitly (77d7709) and iceberg: address linting error (de2b6df) are fixup-style follow-ups to the first commit (bc5417a) that touch the same code paths introduced in that change. They should be squashed into the first commit so the PR lands as a single self-contained logical change, per the granularity rule.

Review

Adds a configurable Parquet string_encoding option (plain or delta_length_byte_array) to the iceberg output to support readers like AWS Redshift Spectrum. The option is plumbed through RouterwriterparquetSink via a []parquet.WriterOption slice. Wiring looks correct.

  1. New configuration field is introduced without test coverage — see inline comment on internal/impl/iceberg/output_iceberg.go.

@josephwoodward josephwoodward merged commit 2cc2cde into main May 6, 2026
6 of 7 checks passed
@josephwoodward josephwoodward deleted the jw/parquetredshipcompatibility branch May 6, 2026 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants