Background
PR #18065 adds write support for shredded variants on the HoodieRecordType.AVRO path. At write time HoodieAvroWriteSupport shreds unshredded variant records ({metadata, value}) into the Parquet {metadata, value, typed_value} layout via Spark4VariantShreddingProvider.
Gap
There is no reader-side step that reconstructs the unshredded {metadata, value} form before records reach the writer. So when compaction or clustering reads a base file previously written with shredding enabled and re-writes it through the Avro path, the variant records arrive already shredded (typed_value populated, value may be null):
- Shredding enabled: Spark4VariantShreddingProvider.shredVariantRecord returns null when value is null, silently dropping the variant payload.
- Shredding disabled: generateEffectiveSchema strips typed_value from the writer schema (value becomes REQUIRED), and super.write(record) hits a null at the REQUIRED value field.
PR #18065 adds a fail-fast guard in HoodieAvroWriteSupport.write that throws on an already-shredded input variant, so this path fails loudly instead of losing data. This issue tracks the real fix.
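For illustration, the kind of check such a guard performs might look like the sketch below. This is not the code from PR #18065: `VariantGroup`, `ShreddedInputGuard`, and `requireUnshredded` are hypothetical names, and the variant is modeled as a plain Java object rather than an Avro record. The only assumption taken from the description above is that an already-shredded variant is one whose typed_value is populated.

```java
// Hypothetical stand-in for one variant group as it arrives at the writer.
// The real code operates on Avro records; this class exists only for the sketch.
final class VariantGroup {
    final byte[] metadata;
    final byte[] value;      // may be null when the payload lives in typed_value
    final Object typedValue; // null for an unshredded {metadata, value} variant

    VariantGroup(byte[] metadata, byte[] value, Object typedValue) {
        this.metadata = metadata;
        this.value = value;
        this.typedValue = typedValue;
    }
}

final class ShreddedInputGuard {
    // Assumption: a populated typed_value marks the input as already shredded.
    static boolean isAlreadyShredded(VariantGroup v) {
        return v.typedValue != null;
    }

    // Fail loudly instead of letting the writer drop or corrupt the payload.
    static void requireUnshredded(String fieldName, VariantGroup v) {
        if (isAlreadyShredded(v)) {
            throw new IllegalStateException(
                "Variant field '" + fieldName + "' is already shredded; "
                    + "it must be reconstructed before it is written");
        }
    }
}
```

A guard of this shape turns both failure modes listed above into one explicit exception at write time, but it does not make compaction or clustering of shredded files succeed.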
Proposed fix
Either reconstruct the unshredded variant on read (combine value and typed_value back into a single variant binary) before records reach the writer, or have the shredding provider detect already-shredded input, reconstruct the variant, and re-shred it. Add a round-trip test: write shredded -> compact/cluster -> read back.
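A rough sketch of the reconstruct-on-read option, under stated assumptions: `ShreddedVariant`, `UnshreddedVariant`, `VariantRebuilder`, and `VariantReadNormalizer` are hypothetical names, not existing Hudi or Spark classes. The actual merge of value and typed_value into a variant binary is hidden behind the `VariantRebuilder` interface and is not implemented here. Treating a variant with both fields null as an error is also a choice made for this sketch only.

```java
import java.util.Objects;

// Hypothetical model of a variant as read from a shredded base file.
final class ShreddedVariant {
    final byte[] metadata;
    final byte[] value;      // residual binary, may be null
    final Object typedValue; // shredded payload, null if nothing was shredded

    ShreddedVariant(byte[] metadata, byte[] value, Object typedValue) {
        this.metadata = metadata;
        this.value = value;
        this.typedValue = typedValue;
    }
}

// Hypothetical model of the {metadata, value} form the writer expects.
final class UnshreddedVariant {
    final byte[] metadata;
    final byte[] value;

    UnshreddedVariant(byte[] metadata, byte[] value) {
        this.metadata = metadata;
        this.value = Objects.requireNonNull(value, "value must be non-null");
    }
}

// Hypothetical hook for the real reconstruction logic (not implemented here).
interface VariantRebuilder {
    byte[] rebuild(byte[] metadata, byte[] residualValue, Object typedValue);
}

final class VariantReadNormalizer {
    private final VariantRebuilder rebuilder;

    VariantReadNormalizer(VariantRebuilder rebuilder) {
        this.rebuilder = Objects.requireNonNull(rebuilder, "rebuilder");
    }

    // Returns the unshredded form regardless of how the file stored the variant.
    UnshreddedVariant normalize(ShreddedVariant in) {
        if (in.typedValue == null) {
            if (in.value == null) {
                // Sketch-only choice: reject rather than guess a payload.
                throw new IllegalStateException("variant has neither value nor typed_value");
            }
            return new UnshreddedVariant(in.metadata, in.value); // nothing to rebuild
        }
        byte[] rebuilt = rebuilder.rebuild(in.metadata, in.value, in.typedValue);
        return new UnshreddedVariant(in.metadata, rebuilt);
    }
}
```

With a step like this on the read side, the writer would always see {metadata, value} and could shred or not shred according to its own configuration, which is what the round-trip test would need to verify.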
Related