Task Description
What needs to be done:
https://github.com/apache/hudi/pull/17833/changes#r2874744476
#17833 (comment)
Why this task is needed:
Comment from @vinothchandar
What kind of evolution are we able to support with Variant columns? Can we ensure there are tests around.
- Adding a Variant column to an existing table
- Removing a Variant column
- maxColumnId calculation when Variant fields (with negative IDs) coexist with regular fields
- Round-trip: HoodieSchema → InternalSchema → HoodieSchema for Variant types
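As a starting point for the maxColumnId test, here is a minimal, self-contained sketch of the property such a test would likely assert. It assumes the intended behaviour is that negative Variant IDs never influence maxColumnId, which should be confirmed against the actual InternalSchema code. The class and method names are hypothetical and not part of Hudi.

```java
public class MaxColumnIdSketch {
    // Hypothetical helper, not Hudi's implementation.
    // Assumption: negative IDs mark Variant-internal fields and are skipped,
    // so only regular (non-negative) IDs contribute to the maximum.
    // Returns -1 when no regular field is present.
    static int maxColumnId(int... fieldIds) {
        int max = -1;
        for (int id : fieldIds) {
            if (id >= 0 && id > max) {
                max = id;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Regular fields 0..2 coexisting with Variant-internal fields -1 and -2.
        int max = maxColumnId(0, 1, 2, -1, -2);
        System.out.println(max);     // prints 2
        // A column added afterwards would take the next free regular ID.
        System.out.println(max + 1); // prints 3
    }
}
```

A real test would build the schema through HoodieSchema and InternalSchemaConverter rather than passing raw IDs, and would also cover a table whose only column is a Variant.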
Additional consideration: variant projection vs schema-evolution ordering (from PR #18923)
PR #18923 moved Spark 4.1 PushVariantIntoScan projection out of the engine-neutral merge buffer into the log readers (parquet reader native projection, avro deserializeRecords). This flips the order of schema-evolution projection and variant projection in the MOR read path:
- Before: evolve then project. composeEvolvedSchemaTransformer ran against full VariantType rows (consistent with dataBlock.getSchema()), then variant projection rewrote them.
- After: project then evolve. Rows reach the buffer already projected, but composeEvolvedSchemaTransformer still builds its projector with from = dataBlock.getSchema() (variant typed as VariantType). When internalSchema is non-empty, the evolve step would mis-decode the projected struct bytes as VariantVal.
This only triggers with the (schema-on-read active + variant column + PushVariantIntoScan + MOR log blocks) combination, which is the same unsupported matrix this task covers. When designing variant schema evolution, make composeEvolvedSchemaTransformer aware that variant fields may already be projected (build the evolve step against the projected schema) or otherwise reconcile the ordering. Refs: FileGroupRecordBuffer.composeEvolvedSchemaTransformer, InternalSchemaConverter line 510.
Exact trigger: internalSchema is non-empty AND shouldProjectVariants is true (Spark only). If either condition is false, there is no mismatch. With an empty internalSchema the evolve step is identity (composeEvolvedSchemaTransformer returns empty), so rows pass through untouched. With shouldProjectVariants false the rows are never projected upstream (both the avro projectLogBlockRecords and the parquet getUnsafeRowIterator overload are gated on it), so the evolve step sees consistent full VariantType rows.
Guard placement caveat: a fail-fast guard is awkward to add today because internalSchema lives in the engine-neutral buffer while shouldProjectVariants lives in the Spark reader context. The buffer cannot see the variant gate directly, so a clean guard would need a small reader-context hook (e.g. readerContext.hasPendingVariantProjection()) checked alongside the non-empty internalSchema.
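The guard described above could look roughly like the following sketch. Everything here is hypothetical: the ReaderContext interface is a stand-in for the real reader context, hasPendingVariantProjection() is only the name suggested in this issue, and the boolean parameter stands in for the buffer's check that internalSchema is empty.

```java
public class VariantEvolutionGuardSketch {
    // Hypothetical stand-in for the reader context; only the hook name
    // comes from this issue, and it does not exist in Hudi today.
    interface ReaderContext {
        boolean hasPendingVariantProjection();
    }

    // Fails fast only when schema-on-read is active (non-empty internalSchema)
    // AND rows have already been variant-projected upstream.
    static void checkSupported(boolean internalSchemaEmpty, ReaderContext ctx) {
        if (!internalSchemaEmpty && ctx.hasPendingVariantProjection()) {
            throw new UnsupportedOperationException(
                "Schema evolution on variant-projected rows is not supported");
        }
    }

    // Convenience wrapper so the four combinations can be checked easily.
    static boolean isRejected(boolean internalSchemaEmpty, boolean pendingProjection) {
        try {
            checkSupported(internalSchemaEmpty, () -> pendingProjection);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected(false, true));  // prints true
        System.out.println(isRejected(true, true));   // prints false
        System.out.println(isRejected(false, false)); // prints false
    }
}
```

Engine-neutral contexts could default the hook to false, so only the Spark reader context would need to override it.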
Task Type
Code improvement/refactoring
Related Issues
Parent feature issue: (if applicable)
Related issues:
NOTE: Use Relationships button to add parent/blocking issues after issue is created.