Skip to content

Improve dictionary null handling in hashing and expand aggregate test coverage for nulls #16466

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 48 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
bf9f396
fuzz: include dict with null values
kosiew Jun 17, 2025
3596051
feat: enhance error reporting in AggregationFuzzTestTask with dataset…
kosiew Jun 17, 2025
befef48
Fix typo in expect message
kosiew Jun 18, 2025
54742d1
feat: add tests for GROUP BY with MAX aggregations, including handlin…
kosiew Jun 18, 2025
397d159
feat: add tests for GROUP BY with MAX aggregations on dictionary arra…
kosiew Jun 18, 2025
8f23b9b
feat: enhance GROUP BY tests with detailed assertions for dictionary …
kosiew Jun 18, 2025
0f27f18
feat: enhance GROUP BY MAX tests with comprehensive assertions for nu…
kosiew Jun 18, 2025
abd9a64
feat: update test assertions for GROUP BY MAX with nulls and dictiona…
kosiew Jun 18, 2025
af4a7c8
feat: add tests for COUNT, SUM, MIN, MEDIAN, and FIRST/LAST_VALUE han…
kosiew Jun 18, 2025
bce3223
feat: add comprehensive tests for GROUP BY with dictionary columns ha…
kosiew Jun 18, 2025
3a1d9a8
fix: correct median value assertions in GROUP BY tests for dictionary…
kosiew Jun 18, 2025
08cd65a
feat: remove redundant GROUP BY tests for MAX aggregations and enhanc…
kosiew Jun 18, 2025
dc08a58
feat: enhance COUNT, SUM, MIN, and MEDIAN tests for null handling wit…
kosiew Jun 18, 2025
c7fb93a
feat: enhance COUNT, SUM, MIN, and MEDIAN tests for null handling wit…
kosiew Jun 18, 2025
721ec4f
feat: enhance COUNT, SUM, MIN, and MEDIAN tests for null handling wit…
kosiew Jun 18, 2025
e0801d1
feat: add helper functions and test data structures for handling null…
kosiew Jun 18, 2025
3e415c7
Merge branch 'main' into fuzz-16266a
kosiew Jun 19, 2025
ebff6ff
refactor: improve batch splitting logic in setup_test_contexts
kosiew Jun 19, 2025
c0a5cec
refactor: add num_partitions variable, streamline setup_test_contexts…
kosiew Jun 19, 2025
a88dfe0
refactor: simplify setup_test_contexts by introducing create_context_…
kosiew Jun 19, 2025
5c86e08
refactor: consolidate dictionary creation functions in aggregate tests
kosiew Jun 19, 2025
dc058d1
refactor: introduce string_dict_type function to simplify dictionary …
kosiew Jun 19, 2025
41fc02b
refactor: move create_context_with_partitions function to improve tes…
kosiew Jun 19, 2025
ad1478e
add more first_value, last_value tests
kosiew Jun 19, 2025
f9da39b
test: add fuzz tests for MAX with dictionary columns containing null …
kosiew Jun 19, 2025
3148a27
test: add fuzz tests for MIN aggregation with timestamp and dictionar…
kosiew Jun 19, 2025
5b66ce6
test: add COUNT and COUNT DISTINCT tests for fuzz table with dictiona…
kosiew Jun 19, 2025
f9a9f08
test: add median and median distinct tests for fuzz table with numeri…
kosiew Jun 19, 2025
5f4bd43
test: fix first and last value null handling in aggregate tests
kosiew Jun 19, 2025
8198035
test: update expected values in count distinct test for dictionary co…
kosiew Jun 19, 2025
c0486b7
test: update expected values in aggregate tests for dictionary column…
kosiew Jun 19, 2025
824bce4
fix: improve null handling in dictionary hash calculations
kosiew Jun 19, 2025
b3c66c8
test: enable generation of null values in record batch generator
kosiew Jun 19, 2025
f0373e4
fix: remove dataset rows and sort keys from error report in Aggregati…
kosiew Jun 19, 2025
3746634
test: remove overlapping tests
kosiew Jun 19, 2025
b2b6053
fix: correct formatting in AggregationFuzzTestTask output
kosiew Jun 19, 2025
f7e3feb
Merge branch 'main' into fuzz-16266a
kosiew Jun 19, 2025
fbc9980
fix: rename variables for clarity in hash_dictionary function
kosiew Jun 20, 2025
936446a
fix: refactor hash_dictionary to use helper function for updating dic…
kosiew Jun 20, 2025
19abf53
fix: add debug assertion for null percentage in RecordBatchGenerator
kosiew Jun 20, 2025
2fadece
fix: introduce create_test_dict helper function for improved dictiona…
kosiew Jun 20, 2025
cbe3b81
refactor: remove legacy create_dict function in favor of create_test_…
kosiew Jun 20, 2025
82e8446
refactor: enhance test helper functions for improved readability and …
kosiew Jun 20, 2025
912c91e
feat: add fuzz timestamp test data structure and related setup functions
kosiew Jun 20, 2025
1675ec2
refactor: simplify run_snapshot_test function by removing unused para…
kosiew Jun 20, 2025
5410004
Merge branch 'main' into fuzz-16266b
kosiew Jun 20, 2025
07eeaa4
fix clippy errors
kosiew Jun 20, 2025
dcdf0db
Merge branch 'main' into fuzz-16266b
kosiew Jun 20, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 35 additions & 15 deletions datafusion/common/src/hash_utils.rs
Original file line number Diff line number Diff line change
Expand Up @@ -184,6 +184,26 @@ fn hash_array<T>(
}
}

/// Helper function to update hash for a dictionary key if the value is valid
#[cfg(not(feature = "force_hash_collisions"))]
#[inline]
fn update_hash_for_dict_key(
hash: &mut u64,
dict_hashes: &[u64],
dict_values: &dyn Array,
idx: usize,
multi_col: bool,
) {
if dict_values.is_valid(idx) {
if multi_col {
*hash = combine_hashes(dict_hashes[idx], *hash);
} else {
*hash = dict_hashes[idx];
}
}
// no update for invalid dictionary value
}

/// Hash the values in a dictionary array
#[cfg(not(feature = "force_hash_collisions"))]
fn hash_dictionary<K: ArrowDictionaryKeyType>(
Expand All @@ -195,23 +215,23 @@ fn hash_dictionary<K: ArrowDictionaryKeyType>(
// Hash each dictionary value once, and then use that computed
// hash for each key value to avoid a potentially expensive
// redundant hashing for large dictionary elements (e.g. strings)
let values = Arc::clone(array.values());
let mut dict_hashes = vec![0; values.len()];
create_hashes(&[values], random_state, &mut dict_hashes)?;
let dict_values = Arc::clone(array.values());
let mut dict_hashes = vec![0; dict_values.len()];
create_hashes(&[dict_values], random_state, &mut dict_hashes)?;

// combine hash for each index in values
if multi_col {
for (hash, key) in hashes_buffer.iter_mut().zip(array.keys().iter()) {
if let Some(key) = key {
*hash = combine_hashes(dict_hashes[key.as_usize()], *hash)
} // no update for Null, consistent with other hashes
}
} else {
for (hash, key) in hashes_buffer.iter_mut().zip(array.keys().iter()) {
if let Some(key) = key {
*hash = dict_hashes[key.as_usize()]
} // no update for Null, consistent with other hashes
}
let dict_values = array.values();
for (hash, key) in hashes_buffer.iter_mut().zip(array.keys().iter()) {
if let Some(key) = key {
let idx = key.as_usize();
update_hash_for_dict_key(
hash,
&dict_hashes,
dict_values.as_ref(),
idx,
multi_col,
);
} // no update for Null key
}
Ok(())
}
Expand Down
8 changes: 4 additions & 4 deletions datafusion/core/tests/fuzz_cases/aggregation_fuzzer/fuzzer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,7 @@ impl AggregationFuzzer {
let datasets = self
.dataset_generator
.generate()
.expect("should success to generate dataset");
.expect("should succeed to generate dataset");

// Then for each of them, we random select a test sql for it
let query_groups = datasets
Expand Down Expand Up @@ -216,16 +216,16 @@ impl AggregationFuzzer {
// Generate the baseline context, and get the baseline result firstly
let baseline_ctx_with_params = ctx_generator
.generate_baseline()
.expect("should success to generate baseline session context");
.expect("should succeed to generate baseline session context");
let baseline_result = run_sql(&sql, &baseline_ctx_with_params.ctx)
.await
.expect("should success to run baseline sql");
.expect("should succeed to run baseline sql");
let baseline_result = Arc::new(baseline_result);
// Generate test tasks
for _ in 0..CTX_GEN_ROUNDS {
let ctx_with_params = ctx_generator
.generate()
.expect("should success to generate session context");
.expect("should succeed to generate session context");
let task = AggregationFuzzTestTask {
dataset_ref: dataset_ref.clone(),
expected_result: baseline_result.clone(),
Expand Down
6 changes: 2 additions & 4 deletions datafusion/core/tests/fuzz_cases/record_batch_generator.rs
Original file line number Diff line number Diff line change
Expand Up @@ -724,15 +724,13 @@ impl RecordBatchGenerator {
{
// We generate just num_distinct values because they will be reused by different keys
let mut array_gen_rng = array_gen_rng;

debug_assert!((0.0..=1.0).contains(&null_pct));
let values = Self::generate_array_of_type_inner(
&ColumnDescr::new("values", *value_type.clone()),
num_distinct,
batch_gen_rng,
array_gen_rng.clone(),
// Once https://github.com/apache/datafusion/issues/16228 is fixed
// we can also generate nulls in values
0.0, // null values are generated on the key level
null_pct, // generate some null values
);

match key_type.as_ref() {
Expand Down
Loading