Commit 32ea155

[VARIANT] Accept variantType RFC (#4096)
#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Moves the variant type RFC into the accepted folder and inlines it into `PROTOCOL.md`. Variant has been used in production systems for many months now and has also been added as a logical type in Parquet [here](apache/parquet-format@dff0b3e). Additionally, Spark has a robust implementation of the variant type.

This PR also specifies that variant shredding will be a separate table feature, and removes the statement that struct fields starting with `_` are ignored.

## How was this patch tested?

Manually tested that the links work.

## Does this PR introduce _any_ user-facing changes?
1 parent 41a4e81 commit 32ea155

2 files changed: +100 -3 lines changed

Diff for: PROTOCOL.md

+94
@@ -62,6 +62,11 @@
6262
- [Reader Requirements for Vacuum Protocol Check](#reader-requirements-for-vacuum-protocol-check)
6363
- [Clustered Table](#clustered-table)
6464
- [Writer Requirements for Clustered Table](#writer-requirements-for-clustered-table)
65+
- [Variant Data Type](#variant-data-type)
66+
- [Variant data in Parquet](#variant-data-in-parquet)
67+
- [Writer Requirements for Variant Data Type](#writer-requirements-for-variant-data-type)
68+
- [Reader Requirements for Variant Data Type](#reader-requirements-for-variant-data-type)
69+
- [Compatibility with other Delta Features](#compatibility-with-other-delta-features)
6570
- [Requirements for Writers](#requirements-for-writers)
6671
- [Creation of New Log Entries](#creation-of-new-log-entries)
6772
- [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
@@ -100,6 +105,7 @@
100105
- [Struct Field](#struct-field)
101106
- [Array Type](#array-type)
102107
- [Map Type](#map-type)
108+
- [Variant Type](#variant-type)
103109
- [Column Metadata](#column-metadata)
104110
- [Example](#example)
105111
- [Checkpoint Schema](#checkpoint-schema)
@@ -1353,6 +1359,86 @@ The example above converts `configuration` field into JSON format, including esc
13531359
}
13541360
```
13551361

1362+
1363+
# Variant Data Type
1364+
1365+
This feature enables support for the `variant` data type, which stores semi-structured data.
1366+
The schema serialization method is described in [Schema Serialization Format](#schema-serialization-format).
1367+
1368+
To support this feature:
1369+
- The table must be on Reader Version 3 and Writer Version 7.
1370+
- The feature `variantType` must exist in the table `protocol`'s `readerFeatures` and `writerFeatures`.
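
As a rough illustration of the two conditions above, the following Python sketch checks a parsed `protocol` action. The helper name `supports_variant_type` is hypothetical and not part of any Delta library, and the exact-equality check on the versions is an assumption based on the wording "must be on".

```python
def supports_variant_type(protocol: dict) -> bool:
    """Return True if a parsed `protocol` action enables `variantType`.

    Hypothetical helper: assumes the action has already been parsed
    from the Delta log into a plain dict.
    """
    reader_features = protocol.get("readerFeatures") or []
    writer_features = protocol.get("writerFeatures") or []
    return (
        protocol.get("minReaderVersion") == 3
        and protocol.get("minWriterVersion") == 7
        and "variantType" in reader_features
        and "variantType" in writer_features
    )


example_protocol = {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    "readerFeatures": ["variantType"],
    "writerFeatures": ["variantType"],
}
print(supports_variant_type(example_protocol))
```

The feature must appear in both lists, so a table that lists `variantType` only under `writerFeatures` is rejected by this check.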
1371+
1372+
## Example JSON-Encoded Delta Table Schema with Variant types
1373+
1374+
```
1375+
{
1376+
"type" : "struct",
1377+
"fields" : [ {
1378+
"name" : "raw_data",
1379+
"type" : "variant",
1380+
"nullable" : true,
1381+
"metadata" : { }
1382+
}, {
1383+
"name" : "variant_array",
1384+
"type" : {
1385+
"type" : "array",
1386+
"elementType" : {
1387+
"type" : "variant"
1388+
},
1389+
"containsNull" : false
1390+
},
1391+
"nullable" : false,
1392+
"metadata" : { }
1393+
} ]
1394+
}
1395+
```
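
To make the serialized form more concrete, here is a minimal Python sketch that walks the schema shown above and lists where `variant` appears. The function name `variant_paths` and the `.element`, `.key`, and `.value` path suffixes are illustrative assumptions, not names defined by the protocol.

```python
import json

SCHEMA_JSON = """
{
  "type": "struct",
  "fields": [
    {"name": "raw_data", "type": "variant", "nullable": true, "metadata": {}},
    {"name": "variant_array",
     "type": {"type": "array",
              "elementType": {"type": "variant"},
              "containsNull": false},
     "nullable": false, "metadata": {}}
  ]
}
"""


def variant_paths(data_type, prefix=""):
    """Yield a dotted path for every position whose type is `variant`."""
    if data_type == "variant":
        yield prefix
    elif isinstance(data_type, dict):
        kind = data_type.get("type")
        if kind == "variant":  # object form, as used for elementType above
            yield prefix
        elif kind == "struct":
            for field in data_type["fields"]:
                name = field["name"]
                path = f"{prefix}.{name}" if prefix else name
                yield from variant_paths(field["type"], path)
        elif kind == "array":
            yield from variant_paths(data_type["elementType"], prefix + ".element")
        elif kind == "map":
            yield from variant_paths(data_type["keyType"], prefix + ".key")
            yield from variant_paths(data_type["valueType"], prefix + ".value")


print(list(variant_paths(json.loads(SCHEMA_JSON))))
# ['raw_data', 'variant_array.element']
```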
1396+
1397+
## Variant data in Parquet
1398+
1399+
The Variant data type is represented as two binary encoded values, according to the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
1400+
The two binary values are named `value` and `metadata`.
1401+
1402+
When writing Variant data to parquet files, the Variant data is written as a single Parquet struct, with the following fields:
1403+
1404+
Struct field name | Parquet primitive type | Description
1405+
-|-|-
1406+
value | binary | The binary-encoded Variant value, as described in [Variant binary encoding](https://github.com/apache/spark/blob/master/common/variant/README.md)
1407+
metadata | binary | The binary-encoded Variant metadata, as described in [Variant binary encoding](https://github.com/apache/spark/blob/master/common/variant/README.md)
1408+
1409+
The parquet struct must include the two struct fields `value` and `metadata`.
1410+
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
1411+
1412+
[Variant shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) will be introduced later, as a separate `variantShredding` table feature.
1413+
1414+
## Writer Requirements for Variant Data Type
1415+
1416+
When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
1417+
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
1418+
- must not write a parquet struct field named `typed_value`, to avoid confusion with the field of the same name required by [Variant shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md).
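
The structural part of these writer requirements can be sketched as a small check. This is only an illustration: the physical struct is modelled as a plain mapping from field name to Parquet primitive type name, and `check_variant_struct` is a hypothetical helper. It does not verify that the bytes conform to the Variant binary encoding.

```python
REQUIRED_FIELDS = {"value": "binary", "metadata": "binary"}


def check_variant_struct(fields: dict) -> list:
    """Return a list of violations for one Parquet struct holding a variant.

    `fields` maps struct field name -> Parquet primitive type name
    (a simplified stand-in for a real Parquet schema object).
    """
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        actual = fields.get(name)
        if actual is None:
            problems.append(f"missing required field `{name}`")
        elif actual != expected:
            problems.append(f"field `{name}` must be {expected}, found {actual}")
    if "typed_value" in fields:
        problems.append("field `typed_value` must not be written under `variantType`")
    return problems


print(check_variant_struct({"value": "binary", "metadata": "binary"}))
print(check_variant_struct({"value": "binary", "typed_value": "int64"}))
```

Returning a list of messages rather than raising on the first problem lets a writer report every violation for a column at once.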
1419+
1420+
## Reader Requirements for Variant Data Type
1421+
1422+
When Variant type is supported (`readerFeatures` field of a table's `protocol` action contains `variantType`), readers:
1423+
- must recognize and tolerate a `variant` data type in a Delta schema
1424+
- must use the correct physical schema (struct-of-binary, with fields `value` and `metadata`) when reading a Variant data type from file
1425+
- must make the column available to the engine:
1426+
- [Recommended] Expose and interpret the struct-of-binary as a single Variant field in accordance with the [Spark Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md).
1427+
- [Alternate] Expose the raw physical struct-of-binary, e.g. if the engine does not support Variant.
1428+
- [Alternate] Convert the struct-of-binary to a string, and expose the string representation, e.g. if the engine does not support Variant.
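
One way to picture the three options above is as a small decision function. The names `Exposure` and `choose_exposure` are hypothetical, and the preference for the string form over the raw struct when the engine lacks Variant support is an assumption of this sketch; the protocol lists both as alternates without ranking them.

```python
from enum import Enum


class Exposure(Enum):
    VARIANT = "single Variant field"      # [Recommended]
    RAW_STRUCT = "raw struct-of-binary"   # [Alternate]
    STRING = "string representation"      # [Alternate]


def choose_exposure(engine_supports_variant: bool,
                    can_render_string: bool = False) -> Exposure:
    """Pick how a reader surfaces a variant column to the engine."""
    if engine_supports_variant:
        return Exposure.VARIANT
    if can_render_string:
        # Assumption: prefer a readable string when conversion is available.
        return Exposure.STRING
    return Exposure.RAW_STRUCT


print(choose_exposure(engine_supports_variant=False).value)
```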
1429+
1430+
## Compatibility with other Delta Features
1431+
1432+
Feature | Support for Variant Data Type
1433+
-|-
1434+
Partition Columns | **Supported:** A Variant column is allowed to be a non-partitioning column of a partitioned table. <br/> **Unsupported:** Variant is not a comparable data type, so it cannot be used as a partition column.
1435+
Clustered Tables | **Supported:** A Variant column is allowed to be a non-clustering column of a clustered table. <br/> **Unsupported:** Variant is not a comparable data type, so it cannot be used as a clustering column.
1436+
Delta Column Statistics | **Supported:** A Variant column supports the `nullCount` statistic. <br/> **Unsupported:** Variant is not a comparable data type, so a Variant column does not support the `minValues` and `maxValues` statistics.
1437+
Generated Columns | **Supported:** A Variant column is allowed to be used as a source in a generated column expression, as long as the Variant type is not the result type of the generated column expression. <br/> **Unsupported:** The Variant data type is not allowed to be the result type of a generated column expression.
1438+
Delta CHECK Constraints | **Supported:** A Variant column is allowed to be used for a CHECK constraint expression.
1439+
Default Column Values | **Supported:** A Variant column is allowed to have a default column value.
1440+
Change Data Feed | **Supported:** A table using the Variant data type is allowed to enable the Delta Change Data Feed.
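
Two rows of the table above (partition/clustering columns and column statistics) lend themselves to a short sketch. The helper names are hypothetical, only top-level columns are considered, and the per-file statistics are assumed to be a dict keyed by `nullCount`, `minValues`, and `maxValues`.

```python
def check_layout_columns(schema: dict, partition_columns=(), clustering_columns=()) -> list:
    """Report variant columns used as partition or clustering columns.

    Simplification: looks only at top-level fields of the table schema.
    """
    types = {f["name"]: f["type"] for f in schema["fields"]}
    errors = []
    for role, columns in (("partition", partition_columns),
                          ("clustering", clustering_columns)):
        for column in columns:
            if types.get(column) == "variant":
                errors.append(f"{column}: variant cannot be a {role} column")
    return errors


def drop_variant_min_max(stats: dict, variant_columns: set) -> dict:
    """Remove minValues/maxValues entries for variant columns; keep nullCount."""
    cleaned = dict(stats)
    for key in ("minValues", "maxValues"):
        if key in cleaned:
            cleaned[key] = {c: v for c, v in cleaned[key].items()
                            if c not in variant_columns}
    return cleaned


schema = {"type": "struct", "fields": [
    {"name": "id", "type": "long", "nullable": False, "metadata": {}},
    {"name": "raw_data", "type": "variant", "nullable": True, "metadata": {}},
]}
print(check_layout_columns(schema, partition_columns=["raw_data"]))
```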
1441+
13561442
# In-Commit Timestamps
13571443

13581444
The In-Commit Timestamps writer feature strongly associates a monotonically increasing timestamp with each commit by storing it in the commit's metadata.
@@ -1965,6 +2051,14 @@ type| Always the string "map".
19652051
keyType| The type of element used for the key of this map, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition
19662052
valueType| The type of element used for the value of this map, represented as a string containing the name of a primitive type, a struct definition, an array definition or a map definition
19672053

2054+
### Variant Type
2055+
2056+
Variant data uses the Delta type name `variant` for Delta schema serialization.
2057+
2058+
Field Name | Description
2059+
-|-
2060+
type | Always the string "variant"
2061+
19682062
### Column Metadata
19692063
A column metadata stores various information about the column.
19702064
For example, this MAY contain some keys like [`delta.columnMapping`](#column-mapping) or [`delta.generationExpression`](#generated-columns) or [`CURRENT_DEFAULT`](#default-columns).

Diff for: protocol_rfcs/variant-type.md renamed to protocol_rfcs/accepted/variant-type.md

+6-3
@@ -1,4 +1,6 @@
11
# Variant Data Type
2+
**Folded into [PROTOCOL.md](../../PROTOCOL.md#variant-data-type)**
3+
24
**Associated Github issue for discussions: https://github.com/delta-io/delta/issues/2864**
35

46
This protocol change adds support for the Variant data type.
@@ -10,7 +12,7 @@ The Variant data type is beneficial for storing and processing semi-structured d
1012
1113
# Variant Data Type
1214

13-
This feature enables support for the Variant data type, for storing semi-structured data.
15+
This feature enables support for the `variant` data type, which stores semi-structured data.
1416
The schema serialization method is described in [Schema Serialization Format](#schema-serialization-format).
1517

1618
To support this feature:
@@ -56,13 +58,14 @@ metadata | binary | The binary-encoded Variant metadata, as described in [Varian
5658

5759
The parquet struct must include the two struct fields `value` and `metadata`.
5860
Supported writers must write the two binary fields, and supported readers must read the two binary fields.
59-
Struct fields which start with `_` (underscore) can be safely ignored.
61+
62+
[Variant shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md) will be introduced later, as a separate `variantShredding` table feature.
6063

6164
## Writer Requirements for Variant Data Type
6265

6366
When Variant type is supported (`writerFeatures` field of a table's `protocol` action contains `variantType`), writers:
6467
- must write a column of type `variant` to parquet as a struct containing the fields `value` and `metadata` and storing values that conform to the [Variant binary encoding specification](https://github.com/apache/spark/blob/master/common/variant/README.md)
65-
- must not write additional, non-ignorable parquet struct fields. Writing additional struct fields with names starting with `_` (underscore) is allowed.
68+
- must not write a parquet struct field named `typed_value`, to avoid confusion with the field of the same name required by [Variant shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md).
6669

6770
## Reader Requirements for Variant Data Type
6871

0 commit comments
