Support Union data types for row format #8828

@friendlymatthew

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Because arrow-row currently lacks support for Union data types, we cannot use RowConverter to sort Union columns. This forces us to fall back to converting Union values to strings, which is inefficient and loses proper type ordering semantics.

Describe the solution you'd like
Add support for encoding and decoding Union data types in the row format. A Union type represents a tagged union: each row carries a type ID indicating which variant is active, plus the corresponding value.

Proposed encoding format

Each union row would be encoded as:

[null_sentinel: 1 byte][type_id: 1 byte][child_row: variable length]

where:

  • null_sentinel: 0x00 or 0x01 following existing conventions (inverted for desc sort)
  • type_id: the i8 value from the union's type_ids buffer
  • child_row: the encoded bytes from the appropriate child field's converter
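
To make the layout concrete, here is a minimal sketch of how one row could be assembled. `encode_union_row` is a hypothetical helper, not an arrow-row API. It assumes that "inverted for desc sort" means the two header bytes are bitwise negated, that a null row consists of the sentinel alone, and that type ids are non-negative (a negative i8 written as a raw byte would not compare correctly under memcmp without something like a sign-bit flip). The child bytes are assumed to already reflect the sort options of the child's converter.

```rust
/// Hypothetical helper (not part of arrow-row): encodes one union row as
/// [null_sentinel: 1 byte][type_id: 1 byte][child_row: variable length].
///
/// `row` is `None` for a null row; otherwise it holds the active type id and
/// the bytes already produced by that child's converter.
pub fn encode_union_row(row: Option<(i8, &[u8])>, descending: bool) -> Vec<u8> {
    // Assumption: for a descending sort the header bytes are bitwise negated.
    let flip: u8 = if descending { 0xFF } else { 0x00 };
    match row {
        // Assumption: a null row is just the null sentinel.
        None => vec![0x00 ^ flip],
        Some((type_id, child_row)) => {
            let mut out = Vec::with_capacity(2 + child_row.len());
            // 0x01 marks a valid (non-null) row.
            out.push(0x01 ^ flip);
            // Raw byte of the i8 type id; assumes ids are non-negative.
            out.push((type_id as u8) ^ flip);
            // Child bytes are copied unchanged.
            out.extend_from_slice(child_row);
            out
        }
    }
}
```

Because the type id sits directly after the sentinel, a byte-wise comparison of two valid rows is decided by the type id before any child byte is examined.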

I think the main design decision is the ordering semantics across different union variants. Rows would sort by type id first and by value second. This maintains the row format's memcmp-based sorting invariant while providing a predictable cross-type ordering.

For example, in a Union<0: Int32, 1: Utf8>:

  • all i32 values sort before all string values
  • within integers: standard numeric ordering
  • within strings: lexicographic ordering
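
The ordering above can be checked with a small standalone sketch. `Value`, `encode`, and `sort_by_row` are hypothetical names for illustration only. The child encodings are simplified assumptions: the Int32 child is written big-endian with the sign bit flipped so that byte order matches numeric order, and the Utf8 child is written as raw UTF-8 bytes rather than a real variable-length row encoding. Only ascending order and non-null rows are covered.

```rust
/// Hypothetical value type standing in for a Union<0: Int32, 1: Utf8>.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i32),
    Str(String),
}

/// Sketch encoder: [0x01 valid sentinel][type_id][child bytes], ascending only.
pub fn encode(v: &Value) -> Vec<u8> {
    let mut out = vec![0x01u8];
    match v {
        Value::Int(i) => {
            out.push(0); // type id 0: Int32
            // Big-endian with the sign bit flipped, so byte order equals numeric order.
            out.extend_from_slice(&((*i as u32) ^ 0x8000_0000).to_be_bytes());
        }
        Value::Str(s) => {
            out.push(1); // type id 1: Utf8
            // Simplification: raw UTF-8 bytes, compared lexicographically.
            out.extend_from_slice(s.as_bytes());
        }
    }
    out
}

/// Sorts values by byte-wise comparison of their encoded rows.
pub fn sort_by_row(values: &mut [Value]) {
    values.sort_by_key(|v| encode(v));
}
```

Sorting `Str("b"), Int(5), Str("a"), Int(-3)` with `sort_by_row` yields `Int(-3), Int(5), Str("a"), Str("b")`: integers first, then strings, each group in its own natural order.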

Both sparse and dense union modes use the same row encoding format; they differ only in how child arrays are indexed during encoding.
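
As a rough sketch of that difference: in a sparse union every child array has the same length as the union, so the child is indexed by the row number itself, while a dense union carries an offsets buffer that maps each row to a position in its active child. `UnionLayout` and `child_index` are hypothetical names for illustration, not arrow-rs types.

```rust
/// Hypothetical, simplified view of a union's layout (not an arrow-rs type).
pub enum UnionLayout<'a> {
    /// Sparse: each child array is as long as the union itself.
    Sparse,
    /// Dense: `offsets[row]` is the position of `row` in its active child.
    Dense { offsets: &'a [i32] },
}

/// Returns the index into the active child array for logical row `row`.
pub fn child_index(layout: &UnionLayout<'_>, row: usize) -> usize {
    match layout {
        UnionLayout::Sparse => row,
        UnionLayout::Dense { offsets } => offsets[row] as usize,
    }
}
```

Once the child index is resolved, the row bytes are assembled the same way in both modes.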

Metadata

Labels

enhancement: Any new improvement worthy of an entry in the changelog
