Support read-then-reshred for shredded variants (compaction/clustering over shredded base files) on HoodieRecordType.AVRO #18931

### Background

PR #18065 adds write support for shredded variants on the `HoodieRecordType.AVRO` path. At write time `HoodieAvroWriteSupport` shreds unshredded variant records (`{metadata, value}`) into the Parquet `{metadata, value, typed_value}` layout via `Spark4VariantShreddingProvider`.

### Gap

There is no reader-side step that reconstructs the unshredded `{metadata, value}` form before records reach the writer. As a result, when compaction or clustering reads a base file that was written with shredding enabled and rewrites it through the Avro path, the variant records reach the writer already shredded (`typed_value` populated, `value` possibly null). Both writer configurations then misbehave:

- **Shredding enabled:** `Spark4VariantShreddingProvider.shredVariantRecord` returns null when `value` is null, silently dropping the variant payload.
- **Shredding disabled:** `generateEffectiveSchema` strips `typed_value` from the writer schema, so `value` becomes REQUIRED, and `super.write(record)` then fails on the null in the REQUIRED `value` field.

PR #18065 adds a fail-fast guard in `HoodieAvroWriteSupport.write` that throws on an already-shredded input variant, so this path fails loudly instead of losing data. This issue tracks the real fix.
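For illustration only, the shape of such a guard can be sketched in plain Java. This is not the code from PR #18065: the class `ShreddedVariantGuard`, its methods, and the map-based record representation are all hypothetical stand-ins, and the sketch assumes that a non-null `typed_value` is what marks an input variant as already shredded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from Hudi.
public class ShreddedVariantGuard {

    // Assumption: a variant group is "already shredded" when typed_value is non-null.
    static boolean isAlreadyShredded(Map<String, Object> variant) {
        return variant.get("typed_value") != null;
    }

    // Fail loudly instead of letting the writer drop or mis-write the payload.
    static void checkNotShredded(String fieldName, Map<String, Object> variant) {
        if (variant != null && isAlreadyShredded(variant)) {
            throw new IllegalStateException(
                "Variant field '" + fieldName + "' is already shredded; "
                    + "read-then-reshred is not supported on the Avro path yet");
        }
    }

    // Helper that builds a stand-in for a variant group with the three known fields.
    static Map<String, Object> variant(byte[] metadata, byte[] value, Object typedValue) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("metadata", metadata);
        fields.put("value", value);
        fields.put("typed_value", typedValue);
        return fields;
    }

    public static void main(String[] args) {
        // Unshredded input: passes the guard.
        checkNotShredded("v", variant(new byte[] {1}, new byte[] {2}, null));

        // Already-shredded input (typed_value set, value null): rejected.
        try {
            checkNotShredded("v", variant(new byte[] {1}, null, 42L));
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that the check runs before either failure mode above can occur, so both the shredding-enabled and shredding-disabled configurations surface the same explicit error.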

### Proposed fix

Reconstruct the unshredded variant on read (combine `value` + `typed_value` back into a single variant binary) before records reach the writer, or have the shredding provider detect already-shredded input and re-shred from the reconstructed variant. Add a round-trip test: write shredded -> compact/cluster -> read back.
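The control flow of the first option (reconstruct on read) might look roughly like the sketch below. Everything here is an assumption made for illustration: `VariantUnshredSketch`, `ShreddedVariant`, `UnshreddedVariant`, and the `rebuild` callback are hypothetical, and the callback is a placeholder for whatever Spark or Hudi routine actually merges `value` and `typed_value` into one variant binary. The sketch does not implement the variant encoding itself.

```java
import java.util.function.Function;

// Hypothetical sketch of a reader-side unshredding step; not Hudi code.
public class VariantUnshredSketch {

    // Stand-in for the on-disk {metadata, value, typed_value} layout.
    static final class ShreddedVariant {
        final byte[] metadata;
        final byte[] value;       // may be null when the payload was shredded
        final Object typedValue;  // null when nothing was shredded

        ShreddedVariant(byte[] metadata, byte[] value, Object typedValue) {
            this.metadata = metadata;
            this.value = value;
            this.typedValue = typedValue;
        }
    }

    // Stand-in for the {metadata, value} form the writer expects.
    static final class UnshreddedVariant {
        final byte[] metadata;
        final byte[] value;

        UnshreddedVariant(byte[] metadata, byte[] value) {
            this.metadata = metadata;
            this.value = value;
        }
    }

    // rebuild is a placeholder for the real value + typed_value merge.
    static UnshreddedVariant unshred(ShreddedVariant in,
                                     Function<ShreddedVariant, byte[]> rebuild) {
        if (in.typedValue == null) {
            if (in.value == null) {
                // This sketch treats "both null" as invalid input.
                throw new IllegalArgumentException("variant has neither value nor typed_value");
            }
            // Nothing was shredded: the stored value is already the full variant.
            return new UnshreddedVariant(in.metadata, in.value);
        }
        // Shredded: delegate to the merge routine so the writer sees a non-null value.
        return new UnshreddedVariant(in.metadata, rebuild.apply(in));
    }

    public static void main(String[] args) {
        // Placeholder bytes only; a real rebuild would emit a valid variant binary.
        Function<ShreddedVariant, byte[]> fakeRebuild = v -> new byte[] {9};
        UnshreddedVariant out =
            unshred(new ShreddedVariant(new byte[] {1}, null, 42L), fakeRebuild);
        System.out.println("value is non-null after unshredding: " + (out.value != null));
    }
}
```

With a step like this in front of the writer, the record always arrives with a non-null `value`, so the existing shredding path can re-shred it (or write it unshredded) without special cases.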

### Related

- PR #18065
- #18066
- #17748

