Display sort order in Hash Repartition #18594

@gene-bordegaray

Is your feature request related to a problem or challenge?

Hash repartitions track whether input sort order is maintained, but this property is not always displayed when using EXPLAIN on a query. We should display whether a RepartitionExec maintains input order, but only if the repartition's input has a sort ordering requirement.

Describe the solution you'd like

We should display a maintains_sort_order tag, derived from the maintains_input_order() function, on RepartitionExec operators whose input has an ordering requirement. This function returns true if preserve_order=true or input_partitions <= 1, so the tag will be displayed whenever a repartition maintains input order, whether explicitly or implicitly.

This tag should only be displayed if the input data is sorted since we only care to know if order is maintained when the sorting has significance.

These checks will ensure that we are only displaying if a RepartitionExec preserves order in cases where it is beneficial (the data is sorted).
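The two checks above can be sketched as a small helper. This is only an illustration of the proposed display condition, not DataFusion code: the function name `sort_order_tag` and its parameters are hypothetical, and it assumes the tag is omitted entirely (rather than printed as `false`) when order is not maintained.

```rust
/// Hypothetical helper: decides whether the EXPLAIN line for a repartition
/// should carry the proposed `maintains_sort_order` tag.
///
/// `input_is_sorted`  - assumed flag: the repartition's input has an ordering
/// `preserve_order`   - the repartition's preserve_order setting
/// `input_partitions` - number of input partitions
fn sort_order_tag(
    input_is_sorted: bool,
    preserve_order: bool,
    input_partitions: usize,
) -> Option<&'static str> {
    // Without a sorted input, "maintaining order" carries no information.
    if !input_is_sorted {
        return None;
    }
    // Same condition the issue attributes to maintains_input_order().
    let maintains = preserve_order || input_partitions <= 1;
    if maintains {
        Some("maintains_sort_order=true")
    } else {
        // Assumption: nothing is shown when order is not maintained.
        None
    }
}
```

With this shape, an unsorted input never produces the tag, regardless of how the repartition is configured.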

Describe alternatives you've considered

Displaying only the output of maintains_input_order() and eliminating the preserve_order display => This would lose visibility into which implicit and explicit decisions are being made.

Additional context

Order within repartitions depends on two things: the preserve_order flag and input_partition_count.

  1. If the preserve order flag is true then no matter the number of input partitions the order will be preserved
  2. If the preserve order flag is false then order is only preserved if there is a single input partition
  3. Otherwise ordering is not preserved
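The three rules above reduce to a single boolean expression. The sketch below is a standalone stand-in for that logic, not the actual RepartitionExec implementation; the function name `repartition_preserves_order` and its parameter names are made up for illustration.

```rust
/// Hypothetical stand-in for the ordering rules listed above.
fn repartition_preserves_order(preserve_order: bool, input_partition_count: usize) -> bool {
    // Rule 1: an explicit preserve_order flag always keeps the order.
    // Rule 2: otherwise order survives only with at most one input partition.
    // Rule 3: every other combination loses the order.
    preserve_order || input_partition_count <= 1
}
```

For example, `repartition_preserves_order(false, 1)` is true (the implicit case), while `repartition_preserves_order(false, 8)` is false.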

Here are two example plans that will clearly highlight when this display is useful and when it is not:

When it is helpful

... More Nodes ...
  SortExec: expr=[a@0 ASC, b@1 ASC], ....
    RepartitionExec: ... **maintains_sort_order=true**, input_partitions=1
      DataSourceExec: file_groups={..., **output_ordering=[a@0 ASC, b@1 ASC]**}

The maintains_sort_order tag is useful here because the data is already sorted before the RepartitionExec, so having a SortExec above it is not optimal. Displaying that the RepartitionExec maintains sort order would help spot inefficiencies like this in plans.

When it is not helpful (this is why we need the output order condition)

... More Nodes ...
  SortExec: expr=[a@0 ASC, b@1 ASC], ....
    RepartitionExec: ... **maintains_sort_order=true**, input_partitions=1
      DataSourceExec: file_groups={...}

If we did not check whether the input data was sorted and only checked whether the RepartitionExec maintains order, the tag would NOT be useful here, because the data is NOT sorted before the RepartitionExec. Saying that we maintain sort order has no significance when the input is not actually sorted.

Labels: enhancement (New feature or request)
