Support read-then-reshred for shredded variants (compaction/clustering over shredded base files) on HoodieRecordType.AVRO #18931

@voonhous

Background

PR #18065 adds write support for shredded variants on the HoodieRecordType.AVRO path. At write time, HoodieAvroWriteSupport shreds unshredded variant records ({metadata, value}) into the Parquet {metadata, value, typed_value} layout via Spark4VariantShreddingProvider.

Gap

There is no reader-side step that reconstructs the unshredded {metadata, value} form before records reach the writer. As a result, when compaction or clustering reads a base file that was written with shredding enabled and rewrites it through the Avro path, the variant records arrive already shredded (typed_value populated, value possibly null). What happens next depends on whether shredding is enabled for the rewrite:

  • Shredding enabled: Spark4VariantShreddingProvider.shredVariantRecord returns null when value is null, silently dropping the variant payload.
  • Shredding disabled: generateEffectiveSchema strips typed_value from the writer schema (value becomes REQUIRED), and super.write(record) hits a null at the REQUIRED value field.

PR #18065 adds a fail-fast guard in HoodieAvroWriteSupport.write that throws on an already-shredded input variant, so this path fails loudly instead of losing data. This issue tracks the real fix.
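
For illustration, a guard of this kind could look roughly like the sketch below. It is not the code from PR #18065 and uses no Hudi or Avro classes: the `VariantGuard` class, the `Map`-based stand-in for a variant record, and the exception type are all assumptions made for this example. Only the field name `typed_value` comes from the layout described above.

```java
import java.util.Map;

// Hypothetical helper; the real guard lives in HoodieAvroWriteSupport.write.
final class VariantGuard {
    private VariantGuard() {}

    /** True when typed_value is populated, i.e. the variant is already shredded. */
    public static boolean isAlreadyShredded(Map<String, Object> variant) {
        return variant.get("typed_value") != null;
    }

    /** Fails loudly instead of letting a shredded variant reach the writer. */
    public static void checkUnshredded(Map<String, Object> variant) {
        if (isAlreadyShredded(variant)) {
            throw new IllegalStateException(
                "Variant is already shredded (typed_value is set); "
                    + "re-shredding is not supported on this path");
        }
    }
}
```

The point of the check is that both failure modes listed above are replaced by a single, explicit error at the start of the write.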

Proposed fix

Reconstruct the unshredded variant on read (combine value + typed_value back into a single variant binary) before records reach the writer, or have the shredding provider detect already-shredded input and re-shred from the reconstructed variant. Add a round-trip test: write shredded -> compact/cluster -> read back.
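
A minimal sketch of the reconstruct-before-write control flow, under stated assumptions: `VariantNormalizer`, `VariantRow`, and `Rebuilder` are hypothetical names, not Hudi or Spark APIs, and the merge of value + typed_value into a variant binary is left behind the `Rebuilder` hook rather than implemented here.

```java
// Hypothetical sketch; none of these types exist in Hudi.
final class VariantNormalizer {
    private VariantNormalizer() {}

    /** In-memory stand-in for one variant column value. */
    static final class VariantRow {
        final byte[] metadata;
        final byte[] value;       // may be null when the row is fully shredded
        final Object typedValue;  // null when the row is unshredded

        VariantRow(byte[] metadata, byte[] value, Object typedValue) {
            this.metadata = metadata;
            this.value = value;
            this.typedValue = typedValue;
        }
    }

    /** Assumed hook that merges value + typed_value into one variant binary. */
    interface Rebuilder {
        byte[] rebuild(byte[] metadata, byte[] value, Object typedValue);
    }

    /** Returns the row in {metadata, value} form, rebuilding only if it is shredded. */
    public static VariantRow toUnshredded(VariantRow row, Rebuilder rebuilder) {
        if (row.typedValue == null) {
            return row; // nothing to reconstruct
        }
        byte[] rebuilt = rebuilder.rebuild(row.metadata, row.value, row.typedValue);
        if (rebuilt == null) {
            // Never drop the payload silently.
            throw new IllegalStateException("Variant reconstruction returned null");
        }
        return new VariantRow(row.metadata, rebuilt, null);
    }
}
```

With a step like this placed before the writer, the existing shredding logic would only ever see unshredded input, whether shredding is enabled or disabled for the rewrite.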

Labels

area:schema (Schema evolution and data types), type:devtask (Development tasks and maintenance work)
