Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Since arrow-row currently lacks support for Union data types, we cannot use RowConverter for sorting operations on Union columns. This forces us to fall back to converting Union types to strings, which is inefficient and loses proper type ordering semantics.
Describe the solution you'd like
Add support for encoding and decoding Union data types in the row format. A Union type represents a tagged union: each row carries a type ID indicating which variant is active, along with the corresponding value.
Proposed encoding format
Each union row would be encoded as:
[null_sentinel: 1 byte][type_id: 1 byte][child_row: variable length]
where:
- null_sentinel: 0x00 or 0x01 following existing conventions (inverted for desc sort)
- type_id: the i8 value from the union's type_ids buffer
- child_row: the encoded bytes from the appropriate child field's converter
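To make the layout concrete, here is a minimal sketch of assembling and splitting one such row. The helper names `encode_union_row` and `decode_union_row` are hypothetical and not part of arrow-row. The sketch covers only ascending order with nulls first, assumes 0x00 marks null and 0x01 marks valid, assumes non-negative type ids, and takes the child bytes as already encoded:

```rust
/// Hypothetical helper (not part of arrow-row): assembles one encoded union row
/// as [null_sentinel][type_id][child_row].
/// Assumption: ascending order, nulls first, 0x00 = null and 0x01 = valid.
fn encode_union_row(is_valid: bool, type_id: i8, child_row: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(2 + child_row.len());
    out.push(if is_valid { 0x01 } else { 0x00 });
    // Assumption: type ids are non-negative, so the raw byte already
    // compares in the same order as the i8 value.
    out.push(type_id as u8);
    // `child_row` is assumed to come from the matching child's converter.
    out.extend_from_slice(child_row);
    out
}

/// Splits an encoded row back into (is_valid, type_id, child_row bytes).
/// Returns None if the row is too short to hold the two header bytes.
fn decode_union_row(row: &[u8]) -> Option<(bool, i8, &[u8])> {
    match row {
        [sentinel, type_id, child @ ..] => Some((*sentinel == 0x01, *type_id as i8, child)),
        _ => None,
    }
}
```

Handling descending order and the null placement options is left out here; it would follow whatever the existing row format conventions prescribe.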
I guess the main design decision involves the ordering semantics across different union variants. Rows would sort by type id first, value second. This maintains the row format's memcmp-based sorting invariant while providing a predictable cross-type ordering.
For example, in a Union<0: Int32, 1: Utf8>:
- all i32 values sort before all string values
- within integers: standard numeric ordering
- within strings: lexicographic ordering
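The ordering above can be checked with a small sketch that sorts encoded rows bytewise. Everything here is illustrative: `encode_i32` and `union_row` are hypothetical helpers, the Int32 child is simplified to big-endian bytes with the sign bit flipped, and the Utf8 child is simplified to raw UTF-8 bytes rather than the row format's real variable-length encoding:

```rust
/// Simplified stand-in for the Int32 child encoding: big-endian with the
/// sign bit flipped, so bytewise order matches numeric order.
fn encode_i32(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Hypothetical row builder for a valid row: [0x01][type_id][child bytes].
fn union_row(type_id: i8, child: &[u8]) -> Vec<u8> {
    let mut row = vec![0x01, type_id as u8];
    row.extend_from_slice(child);
    row
}

/// Builds rows for Union<0: Int32, 1: Utf8> and returns them sorted bytewise.
fn sorted_example_rows() -> Vec<Vec<u8>> {
    let mut rows = vec![
        union_row(1, "apple".as_bytes()),
        union_row(0, &encode_i32(7)),
        union_row(1, "banana".as_bytes()),
        union_row(0, &encode_i32(-3)),
    ];
    // Vec<u8> compares lexicographically byte by byte, like memcmp.
    rows.sort();
    rows
}
```

Because the type id byte precedes the child bytes, the two integer rows come first (-3, then 7), followed by "apple" and "banana".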
Both sparse and dense union modes use the same row encoding format, differing only in how they index child arrays during encoding.
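As a rough illustration of that indexing difference, the sketch below picks the position inside the selected child array for a given union row. `UnionLayout` and `child_index` are hypothetical names for this example only. It relies on the Arrow layout rules that sparse children have the same length as the union, while dense unions carry an i32 offsets buffer:

```rust
/// The two Arrow union layouts, reduced to what matters for child lookup.
enum UnionLayout<'a> {
    /// Sparse: every child has the same length as the union itself.
    Sparse,
    /// Dense: an offsets buffer gives each row's position within its child.
    Dense { offsets: &'a [i32] },
}

/// Returns the index within the selected child array that holds the value
/// for union row `row`. The child itself is chosen by the row's type id.
fn child_index(layout: &UnionLayout<'_>, row: usize) -> usize {
    match layout {
        UnionLayout::Sparse => row,
        UnionLayout::Dense { offsets } => offsets[row] as usize,
    }
}
```

Once the child index is known, the encoder would emit the same sentinel, type id, and child bytes regardless of the union mode.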