Flesh out schema evolution support for variant #18285

@voonhous

Task Description

What needs to be done:

https://github.com/apache/hudi/pull/17833/changes#r2874744476

#17833 (comment)

Why this task is needed:

Comment from @vinothchandar
What kind of evolution are we able to support with Variant columns? Can we ensure there are tests around.

  • Adding a Variant column to an existing table
  • Removing a Variant column
  • maxColumnId calculation when Variant fields (with negative IDs) coexist with regular fields
  • Round-trip: HoodieSchema → InternalSchema → HoodieSchema for Variant types
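The maxColumnId case in the list above can be pinned down with a small self-contained sketch. This is not Hudi's InternalSchema API: `FieldId` and `maxColumnId` are hypothetical stand-ins, and the rule that negative IDs are ignored (with -1 as the "no regular columns" value) is an assumption that the real tests would need to confirm or replace.

```java
import java.util.List;

// Hypothetical stand-in for a schema field with a column ID; not a Hudi class.
final class FieldId {
    final String name;
    final int id;

    FieldId(String name, int id) {
        this.name = name;
        this.id = id;
    }
}

final class MaxColumnIdSketch {
    // Assumption: fields carrying negative IDs (the Variant case) must not
    // influence maxColumnId; only non-negative IDs are considered.
    static int maxColumnId(List<FieldId> fields) {
        int max = -1; // assumed sentinel when no regular column exists
        for (FieldId f : fields) {
            if (f.id >= 0) {
                max = Math.max(max, f.id);
            }
        }
        return max;
    }
}
```

A real test would build the schema through HoodieSchema and InternalSchema and assert the same property: adding or removing a Variant column leaves the computed maximum determined by the regular fields only.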

Additional consideration: variant projection vs schema-evolution ordering (from PR #18923)

PR #18923 moved Spark 4.1 PushVariantIntoScan projection out of the engine-neutral merge buffer into the log readers (parquet reader native projection, avro deserializeRecords). This flips the order of schema-evolution projection and variant projection in the MOR read path:

  • Before: evolve then project. composeEvolvedSchemaTransformer ran against full VariantType rows (consistent with dataBlock.getSchema()), then variant projection rewrote them.
  • After: project then evolve. Rows reach the buffer already projected, but composeEvolvedSchemaTransformer still builds its projector with from = dataBlock.getSchema() (variant typed as VariantType). When internalSchema is non-empty, the evolve step would mis-decode the projected struct bytes as VariantVal.
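The ordering problem described in the two bullets above can be illustrated with a toy model. Everything here is hypothetical: `FullVariant` and `ProjectedStruct` stand in for a full VariantType cell and an already projected struct cell, and the evolve step throws where the real code would silently mis-decode bytes, purely to make the mismatch visible.

```java
final class OrderingSketch {
    // Hypothetical cell shapes; not Spark or Hudi types.
    static final class FullVariant {
        final String payload;
        FullVariant(String payload) { this.payload = payload; }
    }

    static final class ProjectedStruct {
        final String extracted;
        ProjectedStruct(String extracted) { this.extracted = extracted; }
    }

    // Evolve step built with "from = full schema": it only understands FullVariant cells.
    static Object evolve(Object cell) {
        if (!(cell instanceof FullVariant)) {
            throw new IllegalStateException(
                "evolve step expected a full variant cell, got " + cell.getClass().getSimpleName());
        }
        return cell; // evolution itself is a no-op in this toy model
    }

    // Variant projection rewrites a full variant cell into a projected struct.
    static Object project(Object cell) {
        return new ProjectedStruct(((FullVariant) cell).payload);
    }

    // Old order: the evolve step sees the shape it was built for.
    static Object evolveThenProject(Object cell) {
        return project(evolve(cell));
    }

    // New order: the evolve step receives an already projected cell.
    static Object projectThenEvolve(Object cell) {
        return evolve(project(cell));
    }
}
```

In this model the fix corresponds to building the evolve step against the projected shape, so that it accepts `ProjectedStruct` when projection has already happened upstream.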

This only triggers with the (schema-on-read active + variant column + PushVariantIntoScan + MOR log blocks) combination, which is the same unsupported matrix this task covers. When designing variant schema evolution, make composeEvolvedSchemaTransformer aware that variant fields may already be projected (build the evolve step against the projected schema) or otherwise reconcile the ordering. Refs: FileGroupRecordBuffer.composeEvolvedSchemaTransformer, InternalSchemaConverter line 510.

Exact trigger: internalSchema is non-empty AND shouldProjectVariants is true (Spark only). If either condition is false, there is no mismatch. With an empty internalSchema the evolve step is the identity (composeEvolvedSchemaTransformer returns empty), so rows pass through untouched. With shouldProjectVariants false the rows are never projected upstream (both the avro projectLogBlockRecords and the parquet getUnsafeRowIterator overload are gated on it), so the evolve step sees consistent full VariantType rows.

Guard placement caveat: a fail-fast guard is awkward to add today because internalSchema lives in the engine-neutral buffer while shouldProjectVariants lives in the Spark reader context. The buffer cannot see the variant gate directly, so a clean guard would need a small reader-context hook (e.g. readerContext.hasPendingVariantProjection()) checked alongside the non-empty internalSchema.
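A minimal sketch of what such a guard could look like, assuming the hook suggested above exists. `ReaderContext` here is a hypothetical one-method interface, not Hudi's actual reader context, and the internalSchema check is reduced to a boolean for illustration.

```java
final class VariantEvolutionGuardSketch {
    // Hypothetical hook: the Spark reader context would return its shouldProjectVariants state,
    // while engines without variant projection would return false.
    interface ReaderContext {
        boolean hasPendingVariantProjection();
    }

    // The mismatch needs both conditions: schema-on-read active and variants projected upstream.
    static boolean isUnsupported(boolean internalSchemaEmpty, ReaderContext ctx) {
        return !internalSchemaEmpty && ctx.hasPendingVariantProjection();
    }

    // Fail fast from the engine-neutral buffer without it knowing about Spark specifics.
    static void failFastIfUnsupported(boolean internalSchemaEmpty, ReaderContext ctx) {
        if (isUnsupported(internalSchemaEmpty, ctx)) {
            throw new UnsupportedOperationException(
                "Schema evolution on rows with already projected variant columns is not supported");
        }
    }
}
```

Keeping the hook as a single boolean method would let the buffer stay engine-neutral: it asks the reader context a yes/no question instead of inspecting Spark-side state.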

Task Type

Code improvement/refactoring

Related Issues

Parent feature issue: (if applicable )
Related issues:

Metadata

Assignees

No one assigned

    Labels

    type:devtask (Development tasks and maintenance work)

    Status
    Open
