Improve dictionary null handling in hashing and expand aggregate test coverage for nulls #16466
datafusion/common/src/hash_utils.rs | 50 +-
datafusion/core/tests/sql/aggregates.rs | 1442 ++++++++++++++++++++
While working on #16266, there were initially many fuzz failures.
Which issue does this PR close?
Rationale for this change
This change addresses a bug where `combine_hashes` was applied even if a dictionary value was null, leading to incorrect hash computations. This was discovered while investigating #16266.
Additionally, this PR extends the test coverage for aggregate functions to better validate behavior with dictionary arrays containing nulls.
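The null-handling rule described in this rationale can be sketched with a simplified, standard-library-only model. This is not the code from `hash_utils.rs`: the dictionary is modelled as plain slices, `hash_dictionary_sketch` and `hash_value` are hypothetical names, and the `combine_hashes` body here is an illustrative stand-in. The point it shows is that a row's hash is only touched when both the key and the dictionary value it points to are valid.

```rust
/// Hypothetical per-value hash (a stand-in, not DataFusion's hasher).
/// The result is always non-zero for the inputs used in the example.
fn hash_value(v: i64) -> u64 {
    (v as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15).wrapping_add(1)
}

/// Illustrative stand-in for `combine_hashes`; the constants are assumptions.
fn combine_hashes(l: u64, r: u64) -> u64 {
    let hash = (17u64 * 37).wrapping_add(l);
    hash.wrapping_mul(37).wrapping_add(r)
}

/// Simplified dictionary model: `keys[i]` is an index into `values`.
/// A row is logically null if its key is null OR the value it points to is null.
fn hash_dictionary_sketch(
    keys: &[Option<usize>],
    values: &[Option<i64>],
    hashes: &mut [u64],
    multi_col: bool,
) {
    // Hash each dictionary entry once. Null entries get a placeholder
    // that is never read, because of the validity check below.
    let value_hashes: Vec<u64> = values
        .iter()
        .map(|v| match v {
            Some(x) => hash_value(*x),
            None => 0,
        })
        .collect();

    for (hash, key) in hashes.iter_mut().zip(keys.iter()) {
        // Null key: leave the row's hash untouched.
        if let Some(idx) = *key {
            // Null dictionary value: also leave the hash untouched,
            // instead of combining a meaningless value hash into it.
            if values[idx].is_some() {
                *hash = if multi_col {
                    combine_hashes(value_hashes[idx], *hash)
                } else {
                    value_hashes[idx]
                };
            }
        }
    }
}
```

In this model, a row with a null key and a row whose key points at a null dictionary entry end up with the same hash, so both are treated as the same logical null.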
What changes are included in this PR?
- Changes to `hash_dictionary` to ensure `combine_hashes` is only applied when the dictionary value is valid.
- Tests for aggregate functions (`COUNT`, `SUM`, `MIN`, `MAX`, `MEDIAN`, `FIRST_VALUE`, `LAST_VALUE`) using dictionary arrays with null keys and values.
Are these changes tested?
Yes, extensive new tests are added covering:
Are there any user-facing changes?
No direct API changes, but query behavior involving dictionary arrays with nulls will now produce correct and consistent results in line with SQL semantics.
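As a rough illustration of what "in line with SQL semantics" means for a `GROUP BY` on a dictionary column, here is a small standard-library-only model. It is not a DataFusion test. `group_count_sum` is a hypothetical helper that mimics `SELECT g, COUNT(v), SUM(v) ... GROUP BY g`, assuming the usual SQL rules: all NULL group keys fall into a single group, and `COUNT(v)` and `SUM(v)` skip NULL inputs.

```rust
use std::collections::BTreeMap;

/// Hypothetical model of `SELECT g, COUNT(v), SUM(v) FROM t GROUP BY g`,
/// where `g` is a dictionary column given as `keys` into `dict`.
fn group_count_sum(
    keys: &[Option<usize>],
    dict: &[Option<&str>],
    vals: &[Option<i64>],
) -> BTreeMap<Option<String>, (u64, Option<i64>)> {
    let mut groups: BTreeMap<Option<String>, (u64, Option<i64>)> = BTreeMap::new();

    for (key, val) in keys.iter().zip(vals.iter()) {
        // A null key and a key pointing at a null dictionary entry
        // both resolve to the same logical NULL group.
        let group: Option<String> = key.and_then(|k| dict[k]).map(str::to_string);

        let entry = groups.entry(group).or_insert((0, None));
        // COUNT(v) and SUM(v) ignore NULL inputs.
        if let Some(v) = val {
            entry.0 += 1;
            entry.1 = Some(entry.1.unwrap_or(0) + *v);
        }
    }
    groups
}
```

With keys `[Some(0), None, Some(1), Some(0), Some(1)]`, dictionary `[Some("a"), None]`, and values `[Some(1), Some(2), Some(3), None, Some(4)]`, the model yields two groups: `"a"` with count 1 and sum 1, and a single NULL group with count 3 and sum 9.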