[Feature Request] [Spark] Optionally sort within partitions when Z-ordering #4000
Which Delta project/connector is this regarding?

Spark
Overview
Z-ordering a table does not sort data within partitions (files), and consequently data skipping at the Parquet level, which relies on row-group metadata, is inefficient.
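For context, the row-group statistics that this kind of skipping relies on can be inspected directly from the Parquet footers. A minimal sketch using the parquet-hadoop API (available on the Spark classpath); the file path is a placeholder:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Dump per-row-group min/max statistics for one file of the table.
// Wide min/max ranges in every row group mean the reader cannot skip any of them.
val inputFile = HadoopInputFile.fromPath(
  new Path("/path/to/table/part-00000.snappy.parquet"), // placeholder path
  new Configuration())
val reader = ParquetFileReader.open(inputFile)
try {
  var rowGroup = 0
  reader.getFooter.getBlocks.forEach { block =>
    block.getColumns.forEach { col =>
      val stats = col.getStatistics
      println(s"row group $rowGroup, column ${col.getPath}: " +
        s"min=${stats.genericGetMin}, max=${stats.genericGetMax}")
    }
    rowGroup += 1
  }
} finally {
  reader.close()
}
```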
Motivation
To increase read efficiency by leveraging multi-dimensional clustering (MDC) at the row-group level. A global sort is considered in the design details but was deemed too slow. Sorting within partitions, on the other hand, is relatively fast because it does not introduce a shuffle. It can optionally be applied after the current `repartitionByRange` step. To the best of my knowledge, this approach has not been considered.
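A minimal sketch of the idea in plain Spark terms; this is not the actual optimize/Z-order code path, and the column name, file count, and output path are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("sortWithinPartitionsSketch").getOrCreate()

// Toy stand-in for the interleaved Z-order key produced by the clustering expression.
val df = spark.range(10000000L).withColumn("zorderKey", col("id") % 1000)

val numFiles = 8
val clustered = df
  .repartitionByRange(numFiles, col("zorderKey")) // existing step: one shuffle into range partitions
  .sortWithinPartitions(col("zorderKey"))         // proposed step: local sort only, no extra shuffle

clustered.write.mode("overwrite").parquet("/tmp/sortWithinPartitionsSketch") // placeholder output path
```

`sortWithinPartitions` plans a sort with no additional exchange, which is why the extra cost stays small compared to a global `orderBy`.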
Further details
I originally discussed this problem in the Slack channel with @Kimahriman, who suggested I raise an issue here.
I've implemented the feature by adding a configuration property, `spark.databricks.io.skipping.mdc.sortWithinPartitions`, defaulting to `false`. When the property is enabled, the partitions are sorted on `repartitionKeyColName` after `repartitionByRange`.
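With the change applied, usage would look roughly like this (the property only exists with the PR; the table path and Z-order columns are just examples, assuming an active SparkSession `spark`):

```scala
// Enable the proposed flag, then Z-order the table as usual.
spark.conf.set("spark.databricks.io.skipping.mdc.sortWithinPartitions", "true")

// SQL form:
spark.sql("OPTIMIZE delta.`/tmp/my_table` ZORDER BY (id1, id2)")

// Or the equivalent DeltaTable API:
// io.delta.tables.DeltaTable.forPath(spark, "/tmp/my_table").optimize().executeZOrderBy("id1", "id2")
```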
I ran a comparison based on the Delta Lake Z Order blog post and notebook by @MrPowers. I don't have enough local disk for the large data set (`G1_1e9_1e2_0_0.csv`), so I used a medium-sized one instead (`G1_1e8_1e8_100_0.csv`) and timed `query_c` on four table versions, among them one Z-ordered on `id1` and `id2`, and one Z-ordered on `id1` and `id2` and additionally sorted within partitions.
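For reference, a rough sketch of how such a timing comparison can be run; the exact `query_c` is defined in the notebook linked above, so the columns, id values, and helper name below are only illustrative (again assuming an active SparkSession `spark`):

```scala
import org.apache.spark.sql.functions.{col, sum}

// Time one run of a selective filter-plus-aggregate query against a table version.
// This only mirrors the shape of query_c; it is not the exact query from the notebook.
def timeQueryMillis(path: String): Long = {
  val start = System.nanoTime()
  spark.read.format("delta").load(path)
    .filter(col("id1") === "id016" && col("id2") === "id016") // placeholder id values
    .agg(sum(col("v1")))
    .collect()
  (System.nanoTime() - start) / 1000000
}
```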
On a 2021 MacBook Pro with 16 GB of RAM, the results were:
The `id` column values queried are different because the original combination did not exist in my data set.

Update: I ran the experiment on the larger data set (`G1_1e9_1e9_100_0.csv`) using cloud storage and the results are:

Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
I have opened PR #4006.