perf(window): Skip RowContainer round-trip in streaming window build #17558

Open
zhli1142015 wants to merge 6 commits into facebookincubator:main from zhli1142015:optimize-rows-streaming-window-vector

Contributor

@zhli1142015 zhli1142015 commented May 19, 2026

RowsStreamingWindowBuild is used when input is already sorted by the partition and order keys. In this path, copying every row into RowContainer and then extracting the same rows back into vectors adds unnecessary column-to-row and row-to-column conversion work before window functions can run.

This PR removes that round trip while keeping the rows-streaming execution model and partial-partition behavior unchanged.

Key takeaways:

  • What's the optimization? Instead of materializing rows into RowContainer and reading them back into vectors, rows-streaming window now retains input RowVector ranges directly and exposes them through a vector-backed WindowPartition.
  • When does it help most? It helps most when the window function itself is cheap and RowContainer conversion is a large part of the cost. In these median results, rank alone reduces Window CPU time by 44-47%. The rank + sum case still improves, but less (18-20%), because function evaluation accounts for more of the runtime. With 7 funcs, the improvement is smaller still (about 7%) because the removed conversion cost is amortized across more function work.
  • What's the tradeoff? Active rows are retained by keeping their input vectors alive instead of copying all rows into RowContainer. To reduce avoidable retention, the cross-batch previous-row state copies only the needed one-row key snapshot, so it does not pin a processed input vector just for boundary comparison.
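The retention idea in the takeaways above can be sketched with stand-in types. This is only an illustration under stated assumptions: `Batch`, `RowRange`, and `VectorBackedPartition` are hypothetical names and do not reflect the actual Velox classes or their interfaces. The point it shows is that a partition can be a list of (batch, begin, end) ranges that share ownership of the input, so no per-row copy is made.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for one columnar input batch (hypothetical; not Velox's RowVector).
struct Batch {
  std::vector<int64_t> values;
};

// A contiguous run of rows inside one retained batch.
struct RowRange {
  std::shared_ptr<const Batch> batch; // Shares ownership; rows are not copied.
  size_t begin;
  size_t end; // Exclusive.
};

// Hypothetical vector-backed partition: rows are read in place from ranges.
class VectorBackedPartition {
 public:
  void addRange(std::shared_ptr<const Batch> batch, size_t begin, size_t end) {
    assert(begin <= end && end <= batch->values.size());
    numRows_ += end - begin;
    ranges_.push_back(RowRange{std::move(batch), begin, end});
  }

  size_t numRows() const {
    return numRows_;
  }

  // Maps a partition-relative row number to a range and reads the value there.
  int64_t valueAt(size_t row) const {
    assert(row < numRows_);
    for (const auto& range : ranges_) {
      const size_t size = range.end - range.begin;
      if (row < size) {
        return range.batch->values[range.begin + row];
      }
      row -= size;
    }
    return 0; // Unreachable when the precondition holds.
  }

 private:
  std::vector<RowRange> ranges_;
  size_t numRows_{0};
};
```

The tradeoff from the last bullet is visible here: each `RowRange` keeps its whole batch alive through the `shared_ptr`, even when the range covers only a few of its rows.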

Benchmark setup:

  • Main: origin/main plus this PR's benchmark harness only.
  • This change: current PR branch after the review follow-up commits.
  • Build: Release; median of 5 runs; each timing cell is total time / Window CPU.

| Case | Rows | Main (total / Window CPU) | This change (total / Window CPU) | Window CPU reduction |
|---|---|---|---|---|
| rank | 10K | 496.01us / 500.03us | 269.85us / 264.29us | 47.1% |
| rank + sum | 10K | 1.10ms / 1.15ms | 887.55us / 915.22us | 20.4% |
| 7 funcs | 10K | 2.71ms / 2.86ms | 2.51ms / 2.65ms | 7.3% |
| rank | 100K | 4.71ms / 4.89ms | 2.65ms / 2.67ms | 45.4% |
| rank + sum | 100K | 10.89ms / 11.46ms | 9.01ms / 9.46ms | 17.5% |
| 7 funcs | 100K | 27.71ms / 29.28ms | 25.92ms / 27.34ms | 6.6% |
| rank | 1M | 49.54ms / 51.22ms | 28.46ms / 28.82ms | 43.7% |
| rank + sum | 1M | 113.31ms / 119.40ms | 92.72ms / 97.23ms | 18.6% |
| 7 funcs | 1M | 284.69ms / 300.51ms | 264.09ms / 280.62ms | 6.6% |
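
For readers checking the last column: it appears to be derived from the Window CPU half of each timing cell, as 1 - after / before. The helper below is a small sketch of that arithmetic; the function names are made up for illustration and are not part of the benchmark harness.

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction going from `before` to `after` (same unit for both).
inline double reductionPercent(double before, double after) {
  assert(before > 0.0);
  return (1.0 - after / before) * 100.0;
}

// Rounds to one decimal place, matching the precision used in the table.
inline double roundToTenth(double value) {
  return std::round(value * 10.0) / 10.0;
}
```

For the rank / 10K row, the Window CPU values are 500.03us on main and 264.29us with this change, which this formula rounds to 47.1%.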

@zhli1142015 zhli1142015 requested a review from majetideepak as a code owner May 19, 2026 12:46

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 19, 2026
github-actions Bot commented May 19, 2026

Build Impact Analysis

Selective Build Targets (building these covers all 319 affected)

cmake --build _build/release --target aggregate_companion_functions_test physical_size_aggregator_test presto_sql_test spark_aggregation_fuzzer_test spark_expression_fuzzer_test velox_abfs_test velox_aggregates_GeometryAggregateTest velox_aggregates_reduce_agg_bm velox_aggregates_simple_aggregates_bm velox_aggregates_string_keys_bm velox_aggregates_test_group0 velox_aggregates_test_group1 velox_aggregates_test_group2 velox_aggregates_test_group3 velox_aggregates_test_group4 velox_aggregation_fuzzer_test velox_aggregation_runner_test velox_benchmark_array_writer_no_nulls velox_benchmark_array_writer_with_nulls velox_benchmark_basic_comparison_conjunct velox_benchmark_basic_decoded_vector velox_benchmark_basic_preproc velox_benchmark_basic_selectivity_vector velox_benchmark_basic_simple_arithmetic velox_benchmark_basic_simple_cast velox_benchmark_basic_vector_compare velox_benchmark_basic_vector_fuzzer velox_benchmark_basic_vector_slice velox_benchmark_estimate_flat_size velox_benchmark_expr_flat_no_nulls velox_benchmark_feature_normalization velox_benchmark_map_writer_no_nulls velox_benchmark_map_writer_with_nulls velox_benchmark_nested_array_writer_no_nulls velox_benchmark_nested_array_writer_with_nulls velox_cache_fuzzer velox_cast_benchmark velox_common_test velox_core_plan_consistency_checker_test velox_core_test velox_date_extract_benchmark velox_driver_test velox_duckdb_conversion_test velox_dwio_cache_test velox_dwio_common_test velox_dwio_dwrf_buffered_output_stream_test velox_dwio_dwrf_byte_rle_encoder_test velox_dwio_dwrf_byte_rle_test velox_dwio_dwrf_checksum_test velox_dwio_dwrf_column_reader_test velox_dwio_dwrf_column_statistics_test velox_dwio_dwrf_compression_test velox_dwio_dwrf_config_test velox_dwio_dwrf_data_buffer_holder_test velox_dwio_dwrf_decompression_test velox_dwio_dwrf_decryption_test velox_dwio_dwrf_dictionary_encoder_test velox_dwio_dwrf_dictionary_encoding_utils_test velox_dwio_dwrf_encoding_selector_test 
velox_dwio_dwrf_encryption_test velox_dwio_dwrf_flush_policy_test velox_dwio_dwrf_index_builder_test velox_dwio_dwrf_int_direct_test velox_dwio_dwrf_int_encoder_test velox_dwio_dwrf_layout_planner_test velox_dwio_dwrf_ratio_checker_test velox_dwio_dwrf_reader_base_test velox_dwio_dwrf_reader_test velox_dwio_dwrf_rle_test velox_dwio_dwrf_rlev1_encoder_test velox_dwio_dwrf_stream_labels_test velox_dwio_dwrf_stripe_dictionary_cache_test velox_dwio_dwrf_stripe_reader_base_test velox_dwio_dwrf_stripe_stream_test velox_dwio_dwrf_writer_context_test velox_dwio_dwrf_writer_encoding_manager_test velox_dwio_dwrf_writer_sink_test velox_dwio_dwrf_writer_test velox_dwio_iceberg_reader_benchmark velox_dwio_orc_column_statistics_test velox_dwio_orc_reader_filter_test velox_dwio_orc_reader_test velox_dwio_parquet_common_test velox_dwio_parquet_page_reader_test velox_dwio_parquet_reader_benchmark velox_dwio_parquet_reader_test velox_dwio_parquet_rlebp_decoder_test velox_dwio_parquet_structure_decoder_test velox_dwio_parquet_table_scan_test velox_dwio_parquet_tpch_test velox_dwrf_column_writer_index_test velox_dwrf_column_writer_stats_test velox_dwrf_column_writer_test velox_dwrf_e2e_filter_test velox_dwrf_e2e_reader_test velox_dwrf_e2e_writer_test velox_dwrf_float_column_writer_benchmark velox_dwrf_statistics_builder_utils_test velox_dwrf_writer_extended_test velox_dwrf_writer_flush_test velox_example_operator_extensibility velox_example_scan_orc velox_exchange_benchmark velox_exchange_fuzzer velox_exec_SpatialJoinTest velox_exec_bm_duplicate_project velox_exec_infra_test velox_exec_test_group0 velox_exec_test_group1 velox_exec_test_group2 velox_exec_test_group3 velox_exec_test_group4 velox_exec_test_group5 velox_exec_test_group6 velox_exec_test_group7 velox_exec_test_group8 velox_exec_util_test_group0 velox_exec_vector_hasher_benchmark velox_expression_fuzzer_test velox_expression_fuzzer_unit_test velox_expression_runner_test velox_expression_runner_unit_test velox_expression_test 
velox_expression_verifier_unit_test velox_filemetadata_test velox_filter_project_benchmark velox_format_datetime_benchmark velox_function_dynamic_link_test velox_function_registry_test velox_functions_aggregates_test velox_functions_benchmarks_compare velox_functions_benchmarks_row_writer_no_nulls velox_functions_benchmarks_simdjson_function_with_expr velox_functions_benchmarks_string_writer_no_nulls velox_functions_benchmarks_url velox_functions_iceberg_test velox_functions_lib_test velox_functions_prestosql_benchmarks_array_contains velox_functions_prestosql_benchmarks_array_min_max velox_functions_prestosql_benchmarks_array_position velox_functions_prestosql_benchmarks_array_sum velox_functions_prestosql_benchmarks_bitwise velox_functions_prestosql_benchmarks_cardinality velox_functions_prestosql_benchmarks_comparisons velox_functions_prestosql_benchmarks_concat velox_functions_prestosql_benchmarks_date_time velox_functions_prestosql_benchmarks_field_reference velox_functions_prestosql_benchmarks_generic velox_functions_prestosql_benchmarks_in velox_functions_prestosql_benchmarks_map_concat velox_functions_prestosql_benchmarks_map_except velox_functions_prestosql_benchmarks_map_input velox_functions_prestosql_benchmarks_map_intersect velox_functions_prestosql_benchmarks_map_subscript velox_functions_prestosql_benchmarks_map_zip_with velox_functions_prestosql_benchmarks_not velox_functions_prestosql_benchmarks_regexp_replace velox_functions_prestosql_benchmarks_row velox_functions_prestosql_benchmarks_string_ascii_utf_functions velox_functions_prestosql_benchmarks_uuid_cast velox_functions_prestosql_benchmarks_width_bucket velox_functions_prestosql_benchmarks_zip velox_functions_prestosql_benchmarks_zip_with velox_functions_spark_aggregates_test velox_functions_spark_test velox_functions_test velox_fuzzer_connector_test velox_gcs_file_test velox_gcs_insert_test velox_gcs_multiendpoints_test velox_gcsfile_example velox_hash_benchmark 
velox_hash_join_build_benchmark velox_hash_join_list_result_benchmark velox_hash_join_prepare_join_table_benchmark velox_hdfs_file_test velox_hdfs_insert_test velox_hive_connector_test velox_hive_iceberg_deletion_vector_test velox_hive_iceberg_deletion_vector_writer_test velox_hive_iceberg_dwrf_insert_test velox_hive_iceberg_equality_delete_test velox_hive_iceberg_insert_test velox_hive_iceberg_test velox_hive_paimon_connector velox_hive_paimon_data_file_meta_test velox_hive_paimon_deletion_file_test velox_hive_paimon_row_kind_test velox_hive_paimon_split_test velox_hive_partition_function_benchmark velox_hive_writer_options_adapter_test velox_in_10_min_demo velox_join_fuzzer velox_key_encoder_test velox_like_benchmark velox_like_tpch_benchmark velox_mark_distinct_fuzzer velox_mark_sorted_benchmark velox_memory_arbitration_fuzzer velox_memory_test velox_merge_benchmark velox_numeric_upcast_benchmark velox_orderby_benchmark velox_parquet_e2e_filter_test velox_parquet_writer_sink_test velox_parquet_writer_test velox_parse_test velox_prefixsort_benchmark velox_presto_types_fuzzer_utils_test velox_prestosql_coverage velox_query_replayer velox_re2_functions_benchmarks velox_row_number_fuzzer velox_rows_streaming_window_benchmark velox_rpc_operator_test velox_s3file_test velox_s3insert_test velox_s3metrics_test velox_s3multiendpoints_test velox_s3read_test velox_s3registration_test velox_serializer_test_group0 velox_simple_aggregate_test velox_sort_benchmark velox_spark_query_runner_test velox_spark_windows_test velox_sparksql_benchmarks_cast velox_sparksql_benchmarks_compare velox_sparksql_benchmarks_from_json velox_sparksql_benchmarks_get_funcs velox_sparksql_benchmarks_hash velox_sparksql_benchmarks_in velox_sparksql_benchmarks_simd_compare velox_sparksql_benchmarks_split velox_sparksql_coverage velox_spatial_join_benchmark velox_spatial_join_fuzzer velox_spiller_aggregate_benchmark velox_spiller_join_benchmark velox_streaming_aggregation_benchmark 
velox_table_evolution_fuzzer_test velox_test_util_test velox_text_reader_test velox_text_writer_test velox_tool_trace_test velox_topn_row_number_fuzzer velox_tpcds_benchmark velox_tpcds_connector_test velox_tpch_benchmark velox_tpch_connector_test velox_tpch_speed_test velox_trace_file_tool velox_unsafe_row_serialize_benchmark velox_wave_benchmark velox_wave_exec_test velox_window_fuzzer_test velox_window_prefixsort_benchmark velox_window_sub_partitioned_sort_benchmark velox_windows_agg_test velox_windows_rank_test velox_windows_value_test velox_writer_fuzzer_test

Total affected: 319/582 targets

Warning: 3 file(s) could not be mapped to any target. A full build may be needed.

  • velox/exec/benchmarks/CMakeLists.txt
  • velox/exec/tests/CMakeLists.txt
  • velox/exec/window/CMakeLists.txt
Affected targets (319)

Directly changed (15)

| Target | Changed Files |
|---|---|
| velox_coverage_util | WindowPartition.h |
| velox_exec | RowRange.h, RowsStreamingWindowBuild.h, SingleRowValues.h, WindowPartition.cpp, WindowPartition.h |
| velox_exec_infra_test | WindowPartition.h |
| velox_exec_test_group2 | WindowPartition.h |
| velox_exec_test_group3 | WindowPartition.h |
| velox_exec_test_group4 | RowRange.h, RowsStreamingWindowBuild.h, SingleRowValues.h, WindowPartition.h, WindowTest.cpp |
| velox_exec_test_group7 | WindowPartition.h |
| velox_exec_test_group8 | RowRange.h, SingleRowValues.h, VectorWindowPartition.h, VectorWindowPartitionTest.cpp, WindowPartition.h |
| velox_exec_test_lib | WindowPartition.h |
| velox_exec_window | RowRange.h, RowsStreamingWindowBuild.cpp, RowsStreamingWindowBuild.h, SingleRowValues.cpp, SingleRowValues.h, ... (+3 more) |
| velox_functions_window | WindowPartition.h |
| velox_rows_streaming_window_benchmark | RowsStreamingWindowBenchmark.cpp |
| velox_window | WindowPartition.h |
| velox_window_fuzzer | WindowPartition.h |
| velox_window_fuzzer_test | WindowPartition.h |

Transitively affected (304)

  • aggregate_companion_functions_test
  • physical_size_aggregator_test
  • presto_sql_test
  • spark_aggregation_fuzzer_test
  • spark_expression_fuzzer_test
  • velox_abfs_test
  • velox_aggregates
  • velox_aggregates_GeometryAggregateTest
  • velox_aggregates_reduce_agg_bm
  • velox_aggregates_simple_aggregates_bm
  • velox_aggregates_string_keys_bm
  • velox_aggregates_test_group0
  • velox_aggregates_test_group1
  • velox_aggregates_test_group2
  • velox_aggregates_test_group3
  • velox_aggregates_test_group4
  • velox_aggregation_fuzzer
  • velox_aggregation_fuzzer_base
  • velox_aggregation_fuzzer_test
  • velox_aggregation_result_verifier
  • velox_aggregation_runner_test
  • velox_benchmark_array_writer_no_nulls
  • velox_benchmark_array_writer_with_nulls
  • velox_benchmark_basic_comparison_conjunct
  • velox_benchmark_basic_decoded_vector
  • velox_benchmark_basic_preproc
  • velox_benchmark_basic_selectivity_vector
  • velox_benchmark_basic_simple_arithmetic
  • velox_benchmark_basic_simple_cast
  • velox_benchmark_basic_vector_compare
  • velox_benchmark_basic_vector_fuzzer
  • velox_benchmark_basic_vector_slice
  • velox_benchmark_builder
  • velox_benchmark_estimate_flat_size
  • velox_benchmark_expr_flat_no_nulls
  • velox_benchmark_feature_normalization
  • velox_benchmark_map_writer_no_nulls
  • velox_benchmark_map_writer_with_nulls
  • velox_benchmark_nested_array_writer_no_nulls
  • velox_benchmark_nested_array_writer_with_nulls
  • velox_cache_fuzzer
  • velox_cast_benchmark
  • velox_common_test
  • velox_core_plan_consistency_checker_test
  • velox_core_test
  • velox_date_extract_benchmark
  • velox_driver_test
  • velox_duckdb_conversion_test
  • velox_duckdb_parser
  • velox_dwio_arrow_parquet_writer
  • velox_dwio_cache_test
  • velox_dwio_common_test
  • velox_dwio_common_test_utils
  • velox_dwio_dwrf_buffered_output_stream_test
  • velox_dwio_dwrf_byte_rle_encoder_test
  • velox_dwio_dwrf_byte_rle_test
  • velox_dwio_dwrf_checksum_test
  • velox_dwio_dwrf_column_reader_test
  • velox_dwio_dwrf_column_statistics_test
  • velox_dwio_dwrf_compression_test
  • velox_dwio_dwrf_config_test
  • velox_dwio_dwrf_data_buffer_holder_test
  • velox_dwio_dwrf_decompression_test
  • velox_dwio_dwrf_decryption_test
  • velox_dwio_dwrf_dictionary_encoder_test
  • velox_dwio_dwrf_dictionary_encoding_utils_test
  • velox_dwio_dwrf_encoding_selector_test
  • velox_dwio_dwrf_encryption_test
  • velox_dwio_dwrf_flush_policy_test
  • velox_dwio_dwrf_index_builder_test
  • velox_dwio_dwrf_int_direct_test
  • velox_dwio_dwrf_int_encoder_test
  • velox_dwio_dwrf_layout_planner_test
  • velox_dwio_dwrf_ratio_checker_test
  • velox_dwio_dwrf_reader_base_test
  • velox_dwio_dwrf_reader_test
  • velox_dwio_dwrf_rle_test
  • velox_dwio_dwrf_rlev1_encoder_test
  • velox_dwio_dwrf_stream_labels_test
  • velox_dwio_dwrf_stripe_dictionary_cache_test
  • velox_dwio_dwrf_stripe_reader_base_test
  • velox_dwio_dwrf_stripe_stream_test
  • velox_dwio_dwrf_writer
  • velox_dwio_dwrf_writer_context_test
  • velox_dwio_dwrf_writer_encoding_manager_test
  • velox_dwio_dwrf_writer_sink_test
  • velox_dwio_dwrf_writer_test
  • velox_dwio_iceberg_reader_benchmark
  • velox_dwio_iceberg_reader_benchmark_lib
  • velox_dwio_orc_column_statistics_test
  • velox_dwio_orc_reader_filter_test
  • velox_dwio_orc_reader_test
  • velox_dwio_parquet_common_test
  • velox_dwio_parquet_page_reader_test
  • velox_dwio_parquet_reader_benchmark
  • velox_dwio_parquet_reader_benchmark_lib
  • velox_dwio_parquet_reader_test
  • velox_dwio_parquet_rlebp_decoder_test
  • velox_dwio_parquet_structure_decoder_test
  • velox_dwio_parquet_table_scan_test
  • velox_dwio_parquet_tpch_test
  • velox_dwio_parquet_writer
  • velox_dwrf_column_writer_index_test
  • velox_dwrf_column_writer_stats_test
  • velox_dwrf_column_writer_test
  • velox_dwrf_e2e_filter_test
  • velox_dwrf_e2e_reader_test
  • velox_dwrf_e2e_writer_test
  • velox_dwrf_float_column_writer_benchmark
  • velox_dwrf_statistics_builder_utils_test
  • velox_dwrf_test_utils
  • velox_dwrf_writer_extended_test
  • velox_dwrf_writer_flush_test
  • velox_example_operator_extensibility
  • velox_example_scan_orc
  • velox_exchange_benchmark
  • velox_exchange_fuzzer
  • velox_exec_SpatialJoinTest
  • velox_exec_bm_duplicate_project
  • velox_exec_test_group0
  • velox_exec_test_group1
  • velox_exec_test_group5
  • velox_exec_test_group6
  • velox_exec_util_test_group0
  • velox_exec_vector_hasher_benchmark
  • velox_expression_fuzzer
  • velox_expression_fuzzer_test
  • velox_expression_fuzzer_unit_test
  • velox_expression_runner
  • velox_expression_runner_test
  • velox_expression_runner_unit_test
  • velox_expression_test
  • velox_expression_test_utility
  • velox_expression_verifier
  • velox_expression_verifier_unit_test
  • velox_filemetadata_test
  • velox_filter_project_benchmark
  • velox_format_datetime_benchmark
  • velox_function_dynamic_link_test
  • velox_function_registry_test
  • velox_functions_aggregates
  • velox_functions_aggregates_test
  • velox_functions_aggregates_test_lib
  • velox_functions_benchmarks_compare
  • velox_functions_benchmarks_row_writer_no_nulls
  • velox_functions_benchmarks_simdjson_function_with_expr
  • velox_functions_benchmarks_string_writer_no_nulls
  • velox_functions_benchmarks_url
  • velox_functions_iceberg_test
  • velox_functions_lib_test
  • velox_functions_prestosql_benchmarks_array_contains
  • velox_functions_prestosql_benchmarks_array_min_max
  • velox_functions_prestosql_benchmarks_array_position
  • velox_functions_prestosql_benchmarks_array_sum
  • velox_functions_prestosql_benchmarks_bitwise
  • velox_functions_prestosql_benchmarks_cardinality
  • velox_functions_prestosql_benchmarks_comparisons
  • velox_functions_prestosql_benchmarks_concat
  • velox_functions_prestosql_benchmarks_date_time
  • velox_functions_prestosql_benchmarks_field_reference
  • velox_functions_prestosql_benchmarks_generic
  • velox_functions_prestosql_benchmarks_in
  • velox_functions_prestosql_benchmarks_map_concat
  • velox_functions_prestosql_benchmarks_map_except
  • velox_functions_prestosql_benchmarks_map_input
  • velox_functions_prestosql_benchmarks_map_intersect
  • velox_functions_prestosql_benchmarks_map_subscript
  • velox_functions_prestosql_benchmarks_map_zip_with
  • velox_functions_prestosql_benchmarks_not
  • velox_functions_prestosql_benchmarks_regexp_replace
  • velox_functions_prestosql_benchmarks_row
  • velox_functions_prestosql_benchmarks_string_ascii_utf_functions
  • velox_functions_prestosql_benchmarks_uuid_cast
  • velox_functions_prestosql_benchmarks_width_bucket
  • velox_functions_prestosql_benchmarks_zip
  • velox_functions_prestosql_benchmarks_zip_with
  • velox_functions_spark_aggregates
  • velox_functions_spark_aggregates_test
  • velox_functions_spark_test
  • velox_functions_spark_window
  • velox_functions_test
  • velox_functions_test_lib
  • velox_functions_window_test_lib
  • velox_fuzzer_connector_test
  • velox_fuzzer_util
  • velox_gcs_file_test
  • velox_gcs_insert_test
  • velox_gcs_multiendpoints_test
  • velox_gcsfile_example
  • velox_hash_benchmark
  • velox_hash_join_build_benchmark
  • velox_hash_join_list_result_benchmark
  • velox_hash_join_prepare_join_table_benchmark
  • velox_hdfs_file_test
  • velox_hdfs_insert_test
  • velox_hive_connector
  • velox_hive_connector_test
  • velox_hive_iceberg_deletion_vector_test
  • velox_hive_iceberg_deletion_vector_writer_test
  • velox_hive_iceberg_dwrf_insert_test
  • velox_hive_iceberg_equality_delete_test
  • velox_hive_iceberg_insert_test
  • velox_hive_iceberg_splitreader
  • velox_hive_iceberg_test
  • velox_hive_paimon_connector
  • velox_hive_paimon_data_file_meta_test
  • velox_hive_paimon_deletion_file_test
  • velox_hive_paimon_row_kind_test
  • velox_hive_paimon_split
  • velox_hive_paimon_split_test
  • velox_hive_partition_function
  • velox_hive_partition_function_benchmark
  • velox_hive_writer_options_adapter_test
  • velox_in_10_min_demo
  • velox_join_fuzzer
  • velox_key_encoder_test
  • velox_like_benchmark
  • velox_like_tpch_benchmark
  • velox_mark_distinct_fuzzer
  • velox_mark_distinct_fuzzer_lib
  • velox_mark_sorted_benchmark
  • velox_memory_arbitration_fuzzer
  • velox_memory_test
  • velox_merge_benchmark
  • velox_numeric_upcast_benchmark
  • velox_orderby_benchmark
  • velox_parquet_e2e_filter_test
  • velox_parquet_writer_sink_test
  • velox_parquet_writer_test
  • velox_parse_expression
  • velox_parse_parser
  • velox_parse_test
  • velox_parse_utils
  • velox_prefixsort_benchmark
  • velox_presto_types_fuzzer_utils_test
  • velox_prestosql_coverage
  • velox_query_benchmark
  • velox_query_replayer
  • velox_query_trace_replayer_base
  • velox_re2_functions_benchmarks
  • velox_row_number_fuzzer
  • velox_row_number_fuzzer_lib
  • velox_rpc_operator
  • velox_rpc_operator_test
  • velox_rpc_plan_node_translator
  • velox_s3file_test
  • velox_s3insert_test
  • velox_s3metrics_test
  • velox_s3multiendpoints_test
  • velox_s3read_test
  • velox_s3registration_test
  • velox_serializer_test_group0
  • velox_simple_aggregate
  • velox_simple_aggregate_test
  • velox_sort_benchmark
  • velox_spark_query_runner
  • velox_spark_query_runner_test
  • velox_spark_windows_test
  • velox_sparksql_benchmarks_cast
  • velox_sparksql_benchmarks_compare
  • velox_sparksql_benchmarks_from_json
  • velox_sparksql_benchmarks_get_funcs
  • velox_sparksql_benchmarks_hash
  • velox_sparksql_benchmarks_in
  • velox_sparksql_benchmarks_simd_compare
  • velox_sparksql_benchmarks_split
  • velox_sparksql_coverage
  • velox_spatial_join_benchmark
  • velox_spatial_join_fuzzer
  • velox_spill_fuzzer_base_lib
  • velox_spiller_aggregate_benchmark
  • velox_spiller_aggregate_benchmark_base
  • velox_spiller_join_benchmark
  • velox_spiller_join_benchmark_base
  • velox_streaming_aggregation_benchmark
  • velox_table_evolution_fuzzer_test
  • velox_test_util_test
  • velox_text_reader_test
  • velox_text_writer_test
  • velox_tool_trace_test
  • velox_topn_row_number_fuzzer
  • velox_topn_row_number_fuzzer_lib
  • velox_tpcds_benchmark
  • velox_tpcds_benchmark_lib
  • velox_tpcds_connector_test
  • velox_tpch_benchmark
  • velox_tpch_benchmark_lib
  • velox_tpch_connector
  • velox_tpch_connector_test
  • velox_tpch_speed_test
  • velox_trace_file_tool
  • velox_trace_file_tool_base
  • velox_unsafe_row_serialize_benchmark
  • velox_wave_benchmark
  • velox_wave_exec
  • velox_wave_exec_test
  • velox_wave_mock_reader
  • velox_window_prefixsort_benchmark
  • velox_window_sub_partitioned_sort_benchmark
  • velox_windows_agg_test
  • velox_windows_rank_test
  • velox_windows_value_test
  • velox_writer_fuzzer
  • velox_writer_fuzzer_test

Slow path • Graph generated from PR branch

@zhli1142015 zhli1142015 force-pushed the optimize-rows-streaming-window-vector branch from f8d8aa8 to 7aecf7d Compare May 20, 2026 03:03
@zhli1142015
Contributor Author

@JkSelf and @mbasmanova, could you please help review this PR?

Contributor

@mbasmanova mbasmanova left a comment

@zhli1142015, Thank you for the contribution! The benchmark numbers look promising.

Before diving into the code, please update the PR title and description.

The title "Optimize RowsStreamingWindowBuild with vectors" is vague — optimize how? Consider something like "Skip RowContainer round-trip in streaming window build" or similar.

The benchmark table is hard to interpret — please add a summary of the key takeaways, e.g.:

  • What's the optimization? (eliminate RowContainer c2r/r2c round-trip for streaming window)
  • When does it help most? (simple functions like rank: ~55% CPU reduction; diminishing returns for heavier functions like sum: ~33%; minimal impact with many functions: ~6-8%)
  • What's the tradeoff? (retains input vectors in memory instead of copying into RowContainer)

The raw table is useful for validation, but the reader needs the story first.

Contributor

@mbasmanova mbasmanova left a comment

@zhli1142015, Thank you for the contribution! A few concerns:

  1. Significant code duplication. VectorWindowPartition (~527 lines) reimplements extractColumn, extractNulls, computePeerBuffers, computeKRangeFrameBounds, searchFrameValue, linearSearchFrameValue, updateKRangeFrameBounds, isInvalidNanFrameBound, and isNanAt — all of which have parallel implementations in WindowPartition. Bug fixes to one will need to be manually applied to the other. Can the shared algorithmic logic (peer computation, frame bound search, NaN handling) be extracted into helpers that work with an abstract row accessor, so both implementations share the same algorithms?

  2. previousRef_ pins entire input vectors. RowReference holds a RowVectorPtr, which keeps the entire input vector alive even though only one row is needed for cross-batch comparisons. For large input vectors this wastes memory. Consider copying just the needed row values into a small vector instead.

  3. loadedVector() on all children.

for (auto& child : input->children()) {
    child->loadedVector();
}

This materializes all lazy columns, even those not used by the window function. Should this be limited to columns actually needed?

  4. PR size. +1890 lines is large for a single review. Could the WindowBuild/WindowPartition refactoring (making methods virtual, moving RowContainer init to subclasses) be split into a preparatory PR? That would make the core optimization easier to review.

@zhli1142015 zhli1142015 changed the title perf(window): Optimize RowsStreamingWindowBuild with vectors perf(window): Skip RowContainer round-trip in streaming window build May 21, 2026
@JkSelf
Collaborator

JkSelf commented May 21, 2026

@zhli1142015 Thanks for your great work.

Just curious about the overall impact here—could you provide some end-to-end performance benchmarks, such as the TPC-DS Q67 results before and after this fix?
One more question: since the input for partitionBased StreamingWindow is pre-sorted, would it also benefit from this PR?

@zhli1142015
Contributor Author

partitionBased StreamingWindow

@mbasmanova I updated the PR title and description to make the optimization and tradeoff clearer.

The description now starts with the main story: RowsStreamingWindowBuild avoids the RowContainer column-to-row and row-to-column round trip by retaining input RowVector ranges directly. I also added key takeaways before the benchmark table to explain when the change helps most and why the end-to-end impact depends on how much of the query time is spent in this removed conversion overhead.

@zhli1142015
Contributor Author

zhli1142015 commented May 22, 2026

> @zhli1142015, Thank you for the contribution! A few concerns:
>
> 1. Significant code duplication. VectorWindowPartition (~527 lines) reimplements extractColumn, extractNulls, computePeerBuffers, computeKRangeFrameBounds, searchFrameValue, linearSearchFrameValue, updateKRangeFrameBounds, isInvalidNanFrameBound, and isNanAt, all of which have parallel implementations in WindowPartition. Bug fixes to one will need to be manually applied to the other. Can the shared algorithmic logic (peer computation, frame bound search, NaN handling) be extracted into helpers that work with an abstract row accessor, so both implementations share the same algorithms?
> 2. previousRef_ pins entire input vectors. RowReference holds a RowVectorPtr, which keeps the entire input vector alive even though only one row is needed for cross-batch comparisons. For large input vectors this wastes memory. Consider copying just the needed row values into a small vector instead.
> 3. loadedVector() on all children.
>
> for (auto& child : input->children()) {
>     child->loadedVector();
> }
>
> This materializes all lazy columns, even those not used by the window function. Should this be limited to columns actually needed?
>
> 4. PR size. +1890 lines is large for a single review. Could the WindowBuild/WindowPartition refactoring (making methods virtual, moving RowContainer init to subclasses) be split into a preparatory PR? That would make the core optimization easier to review.

@mbasmanova I addressed the review comments and also split the preparatory WindowBuild/WindowPartition refactor into #17590. Could you please take a look at that change?

For the code duplication concern, #17590 extracts the shared peer-group and RANGE frame-bound algorithms into WindowPartitionAlgorithms behind storage accessors. The existing RowContainer-backed path is the first accessor. This PR then adds the vector-backed accessor, so the peer and frame-bound logic is shared instead of copied.
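
As a rough illustration of the accessor idea (not the code in #17590), the sketch below writes one peer-group search as a template over any storage type that exposes `numRows()` and `orderKey(row)`. `RowStoreAccessor`, `ColumnStoreAccessor`, `orderKey`, and `peerGroup` are hypothetical names chosen for this example, and a single integer order key is assumed for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical row-wise storage: each row holds all of its column values.
struct RowStoreAccessor {
  std::vector<std::vector<int64_t>> rows;
  size_t numRows() const {
    return rows.size();
  }
  int64_t orderKey(size_t row) const {
    return rows[row][0];
  }
};

// Hypothetical columnar storage: the order key lives in its own vector.
struct ColumnStoreAccessor {
  std::vector<int64_t> orderKeys;
  size_t numRows() const {
    return orderKeys.size();
  }
  int64_t orderKey(size_t row) const {
    return orderKeys[row];
  }
};

// Storage-agnostic peer-group search: returns [start, end) of the rows that
// share the order key of `row`. Assumes rows are already sorted by that key.
template <typename Accessor>
std::pair<size_t, size_t> peerGroup(const Accessor& accessor, size_t row) {
  assert(row < accessor.numRows());
  const int64_t key = accessor.orderKey(row);
  size_t start = row;
  while (start > 0 && accessor.orderKey(start - 1) == key) {
    --start;
  }
  size_t end = row + 1;
  while (end < accessor.numRows() && accessor.orderKey(end) == key) {
    ++end;
  }
  return {start, end};
}
```

Both storage layouts go through the same `peerGroup` body, so a fix to the search logic applies to both without being copied.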

For previous-row retention, I replaced the RowVectorPtr-based previous-row reference with an owned one-row key snapshot. This keeps only the key values needed for cross-batch partition and peer comparisons, instead of pinning a processed input vector.
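
A minimal sketch of what such a one-row snapshot could look like, assuming integer key columns and stand-in types. `Batch` and `KeySnapshot` are hypothetical and are not the classes added in this PR; the sketch only shows that copying the key values lets the previous batch be released.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for an input batch; each inner vector is one column (hypothetical).
struct Batch {
  std::vector<std::vector<int64_t>> columns;
  size_t numRows() const {
    return columns.empty() ? 0 : columns[0].size();
  }
};

// Owned copy of the key values of a single row. Holds no pointer to the batch.
class KeySnapshot {
 public:
  KeySnapshot(const Batch& batch, size_t row, std::vector<size_t> keyChannels)
      : keyChannels_(std::move(keyChannels)) {
    assert(row < batch.numRows());
    values_.reserve(keyChannels_.size());
    for (size_t channel : keyChannels_) {
      values_.push_back(batch.columns[channel][row]);
    }
  }

  // True if `row` of `batch` has the same key values as the snapshot.
  bool equals(const Batch& batch, size_t row) const {
    for (size_t i = 0; i < keyChannels_.size(); ++i) {
      if (batch.columns[keyChannels_[i]][row] != values_[i]) {
        return false;
      }
    }
    return true;
  }

 private:
  std::vector<size_t> keyChannels_;
  std::vector<int64_t> values_;
};
```

After taking the snapshot of the last row, the owner of the previous batch can drop its reference, and the first row of the next batch can still be compared against the stored keys.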

For lazy vectors, RowsStreamingWindowBuild no longer loads every child eagerly. It loads only the partition/order boundary columns needed to decide partition and peer boundaries. Function payload columns can remain lazy until they are actually read by window function evaluation.
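
The selective loading described here can be sketched as follows, with a stand-in `LazyColumn` type rather than Velox's lazy vectors. `LazyColumn` and `loadBoundaryColumns` are hypothetical names; the only point is that the loader runs for the boundary channels and nothing else.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical lazily loaded column: the loader runs on first access only.
class LazyColumn {
 public:
  explicit LazyColumn(std::function<std::vector<int64_t>()> loader)
      : loader_(std::move(loader)) {}

  bool isLoaded() const {
    return values_.has_value();
  }

  const std::vector<int64_t>& load() {
    if (!values_) {
      values_ = loader_();
    }
    return *values_;
  }

 private:
  std::function<std::vector<int64_t>()> loader_;
  std::optional<std::vector<int64_t>> values_;
};

// Loads only the channels needed for partition and peer boundary checks.
inline void loadBoundaryColumns(
    std::vector<LazyColumn>& columns,
    const std::vector<size_t>& boundaryChannels) {
  for (size_t channel : boundaryChannels) {
    assert(channel < columns.size());
    columns[channel].load();
  }
}
```

A payload column that no window function reads is never loaded in this model; one that is read gets loaded at that point through `load()`.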

For PR size, #17590 contains the prep refactor. After that lands, I will rebase this PR so the remaining diff focuses on the vector-backed RowsStreamingWindowBuild optimization.

@zhli1142015
Contributor Author

> @zhli1142015 Thanks for your great work.
>
> Just curious about the overall impact here, could you provide some end-to-end performance benchmarks, such as the TPC-DS Q67 results before and after this fix? One more question: since the input for partitionBased StreamingWindow is pre-sorted, would it also benefit from this PR?

@JkSelf In TPC-DS Q67, the Window operator is replaced by WindowGroupLimit, so this PR's RowsStreamingWindowBuild path is not exercised. We do have a real workload where this change improves end-to-end runtime by about 30%. The direct improvement, as shown by the benchmark, is removing the RowContainer round trip: rows-streaming window no longer copies each row into RowContainer and then copies it back out to vectors for window function evaluation. The end-to-end impact depends on the query shape and on how much of total time is spent in this removed overhead. If window function evaluation or surrounding operators dominate, the overall gain will be smaller.
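
One way to reason about that dependence is a simple Amdahl-style estimate: if the Window operator accounts for a share s of total query time and its cost drops by a fraction r, the whole query gets roughly s * r cheaper. The helper below is only a back-of-the-envelope sketch; the function name is made up and any inputs are illustrative, not measurements from this PR.

```cpp
#include <cassert>

// Estimated fraction of total query time saved when an operator that accounts
// for `operatorShare` of the total becomes `operatorReduction` cheaper.
// Both arguments are fractions in [0, 1].
inline double endToEndReduction(double operatorShare, double operatorReduction) {
  assert(operatorShare >= 0.0 && operatorShare <= 1.0);
  assert(operatorReduction >= 0.0 && operatorReduction <= 1.0);
  return operatorShare * operatorReduction;
}
```

For example, a hypothetical query that spends half its time in Window and sees a 40% Window reduction would get about 20% faster overall, while the same reduction on a query that spends 10% of its time in Window would give about 4%.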

Partition-based StreamingWindow could also use this approach, but it is not included in this PR. We have spill support for partition-based StreamingWindow built on top of this optimization, and extracting only the vector-backed partition optimization for that path would require more work. I kept this PR focused on RowsStreamingWindowBuild.

@zhli1142015 zhli1142015 requested a review from mbasmanova May 22, 2026 07:39
meta-codesync Bot pushed a commit that referenced this pull request Jun 1, 2026
Summary:
Rows-streaming window can keep future partitions as input vectors instead of copying every row into a `RowContainer`. This PR extracts peer-group and k RANGE frame-bound code so both RowContainer-backed and future vector-backed storage layouts can share the same window partition algorithms.

### Before

- `WindowBuild` always initialized `RowContainer` and reusable `DecodedVector` state.
- `WindowPartition` owned RowContainer-backed row storage and also contained peer computation, frame-bound search, and NaN frame-bound handling directly coupled to `RowContainer`.

### After

- `WindowBuild` holds optional `RowContainer` and `DecodedVector` state; subclasses decide whether to initialize them.
- `WindowPartition` remains the RowContainer-backed partition representation used today, but delegates peer/frame algorithms through a RowContainer accessor.
- `RowAccessor` is a C++20 concept that defines the storage contract shared by the window partition algorithms.
- `PeerGroupComputation` contains storage-agnostic peer group bound logic.
- `KRangeFrameBound` contains storage-agnostic k RANGE frame-bound search and NaN frame-bound handling.
- `RowContainerAccessor` adapts the existing RowContainer storage to the shared `RowAccessor` concept.

### Current assembly

| WindowBuild subclass | Base RowContainer | DecodedVectors | Partition accessor |
|---|---|---|---|
| `SortWindowBuild` | yes | yes | RowContainer |
| `SubPartitionedSortWindowBuild` | no (manages its own per sub-partition) | yes | RowContainer |
| `PartitionStreamingWindowBuild` | yes | yes | RowContainer |
| `RowsStreamingWindowBuild` | yes | yes | RowContainer |

After #17558, `RowsStreamingWindowBuild` will skip RowContainer materialization and use vector-backed accessors. That is the only new storage combination this refactor is intended to enable.

### Notes

The storage-agnostic algorithms use the shared `RowAccessor` C++20 concept rather than a virtual interface so the hot peer/frame search paths remain inlineable. The current PR keeps all existing behavior RowContainer-backed; #17558 can add the vector-backed accessor implementation against the same concept.
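
As a rough illustration of why a compile-time contract keeps the hot path inlineable, the sketch below writes a peer-group search against a template parameter. `VectorAccessor`, `peerGroupEnd`, and the `numRows()`/`peersEqual()` members are hypothetical names, not the actual `RowAccessor` contract, and the template parameter is left unconstrained so the sketch also compiles before C++20 (the refactor itself constrains it with a concept).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical accessor over a plain vector of sort-key values. Any type
// that provides numRows() and peersEqual(i, j) works with peerGroupEnd.
struct VectorAccessor {
  const std::vector<int64_t>& sortKeys;

  int32_t numRows() const {
    return static_cast<int32_t>(sortKeys.size());
  }

  bool peersEqual(int32_t lhs, int32_t rhs) const {
    return sortKeys[lhs] == sortKeys[rhs];
  }
};

// Returns one past the last row that is a peer of 'start'. Because Accessor
// is a template parameter, the calls below are resolved at compile time and
// can be inlined; no virtual dispatch is involved.
template <typename Accessor>
int32_t peerGroupEnd(const Accessor& rows, int32_t start) {
  int32_t end = start + 1;
  while (end < rows.numRows() && rows.peersEqual(start, end)) {
    ++end;
  }
  return end;
}
```

A second accessor type over a different storage layout could reuse `peerGroupEnd` unchanged, which is the sharing this refactor aims for.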

Pull Request resolved: #17590

Reviewed By: kevinwilfong

Differential Revision: D106508428

Pulled By: kKPulla

fbshipit-source-id: e58097e58f2e5ed0f59d05fae426f3524fc8332c
@zhli1142015 zhli1142015 force-pushed the optimize-rows-streaming-window-vector branch from e916db8 to b261a07 Compare June 2, 2026 06:44
@zhli1142015
Contributor Author

@mbasmanova, could you please take another look when you have a chance?

I rebased this PR after #17590 and addressed the review comments:

  • The preparatory WindowBuild / WindowPartition refactor was split into refactor(window): Decouple partition logic from RowContainer #17590.
  • The peer-group and RANGE-frame bound logic is now shared through WindowPartitionAccessor, PeerGroupComputation, and KRangeFrameBound, with the vector-backed path keeping only storage-specific access/copy logic.
  • The previous-row state no longer retains a processed RowVector; it now stores an owned one-row key snapshot.
  • RowsStreamingWindowBuild::addInput() now loads only the partition/order boundary columns instead of eagerly loading all lazy children.
  • I also reduced the VectorWindowPartition public API surface and moved the key snapshot / key-channel collection helper into WindowPartitionKeys.

The current CI Spark expression fuzzer failure looks unrelated to this PR. I reproduced it locally and tracked it separately in #17697.

Contributor

@mbasmanova mbasmanova left a comment

Thank you for the updates!

  • WindowPartitionKeys.h puts all types in detail:: namespace, but they're used across multiple files. detail is for implementation details private to a single header. Move to the parent namespace.

  • WindowPartitionKeys.h is a grab-bag of three unrelated types:

    • WindowPartitionRowReference — only used within VectorWindowPartition.cpp. Move to anonymous namespace in the .cpp instead of exposing in a shared header.
    • WindowPartitionKeyRowSnapshot — used by both RowsStreamingWindowBuild and VectorWindowPartition. Legitimately shared, but the name is generic — it copies a subset of columns from one row for later comparison, nothing window-specific about it.
    • WindowPartitionKeyChannels — two static methods that loop and deduplicate column indices, called from two places. A class is over-engineered for this — a free function or inline at the call sites would be simpler. appendUnique is a private static method — use a free function in anonymous namespace in the .cpp.

    There are now 22+ window-related files flat in velox/exec/. Please submit a separate preparatory PR to move existing window files into an exec/window/ subdirectory with its own namespace, then rebase this PR on top. Adding 4 more window files to the flat exec/ directory makes the organization worse.

@mbasmanova
Contributor

A few more:

  • blockPrefixSums_.push_back(0) in the constructor — initialize inline: std::vector<vector_size_t> blockPrefixSums_{0}.
  • Trivial one-liner getters like numRows() and numRowsForProcessing() are in the .cpp. Move to the header for readability.
  • addBlock / RowBlock — "block" is a new term not used elsewhere in window code. This is a row range from an input vector. Consider addRows(input, startRow, endRow) to match the existing addRows pattern.
  • RowBlock is defined identically in both RowsStreamingWindowBuild.h and VectorWindowPartition.h. Define once.
  • RowBlock validation is inconsistent: flushBlock checks start >= end before constructing, addBlock constructs first then validates via VELOX_CHECK. If validation is needed, put it in RowBlock's constructor so all creation paths are safe and consistent.
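
A minimal sketch of the last two points combined with the inline prefix-sum initializer, under stated assumptions: the PR does not show how `blockPrefixSums_` is used, so the lookup below is only one plausible use (mapping a partition-relative row number to a retained range), and `RangeIndex` and `locate` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Half-open row range [start, end) within one input vector. Validation
// lives in the constructor so every creation path is checked the same way.
struct RowRange {
  RowRange(int32_t start, int32_t end) : start(start), end(end) {
    if (start < 0 || start >= end) {
      throw std::invalid_argument("RowRange requires 0 <= start < end");
    }
  }

  int32_t size() const {
    return end - start;
  }

  int32_t start;
  int32_t end;
};

class RangeIndex {
 public:
  void addRows(const RowRange& range) {
    ranges_.push_back(range);
    prefixSums_.push_back(prefixSums_.back() + range.size());
  }

  int32_t numRows() const {
    return prefixSums_.back();
  }

  // Maps a partition-relative row number to {range index, row number
  // inside that range's input vector} using binary search.
  std::pair<size_t, int32_t> locate(int32_t row) const {
    if (row < 0 || row >= numRows()) {
      throw std::out_of_range("row is outside the partition");
    }
    auto it = std::upper_bound(prefixSums_.begin(), prefixSums_.end(), row);
    const size_t index = static_cast<size_t>(it - prefixSums_.begin()) - 1;
    return {index, ranges_[index].start + (row - prefixSums_[index])};
  }

 private:
  std::vector<RowRange> ranges_;
  // prefixSums_[i] is the number of rows before range i. The leading 0 is
  // set by the inline initializer rather than in a constructor body.
  std::vector<int32_t> prefixSums_{0};
};
```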

@zhli1142015
Contributor Author

Hi @mbasmanova, thank you so much for the thorough review!

Following your suggestion, I've split the preparatory file relocation (moving the window files under velox/exec/window/) into a dedicated PR so this one can stay focused on the optimization itself: #17710. Whenever you have a moment, could you please take a look at #17710 first? Once it lands I'll rebase this PR on top of it.

I've also addressed your other comments here — moving the shared types out of the detail namespace, splitting up the WindowPartitionKeys grab-bag header, and renaming the shared snapshot to a generic RowColumnsSnapshot since it isn't window-specific.

Thanks again for your time and guidance!

zhli1142015 added a commit to zhli1142015/velox that referenced this pull request Jun 3, 2026
Preparatory, mechanical refactor for facebookincubator#17558. The window subsystem lived as a
flat set of files under velox/exec/. This moves the internal implementation
files into a dedicated velox/exec/window/ directory and
facebook::velox::exec::window namespace, while keeping the public-facing files
in velox/exec/ and the facebook::velox::exec namespace.

Stay in velox/exec/ (exec namespace):
- Window.h/.cpp -- the operator itself, alongside HashJoin, Aggregate, and
  TableScan.
- WindowFunction.h/.cpp -- the public API that window function authors
  implement against, like AggregateFunction.h.

Move under velox/exec/window/ (exec::window namespace):
- WindowBuild and its variants, WindowPartition, AggregateWindow,
  KRangeFrameBound, PeerGroupComputation, WindowPartitionAccessor -- internal
  implementation details.

There are no behavior changes; only file locations, the enclosing namespace of
the internal files, and references to them change. The operator and the public
API reference the internal types through the window:: namespace.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
meta-codesync Bot pushed a commit that referenced this pull request Jun 4, 2026
Summary:
Preparatory, mechanical refactor for #17558. The window subsystem lived as a flat set of files under `velox/exec/`. This moves the internal implementation files into a dedicated `velox/exec/window/` directory and `facebook::velox::exec::window` namespace, while keeping the public-facing files in `velox/exec/` and the `facebook::velox::exec` namespace.

**Stay in `velox/exec/` (`exec` namespace):**
- `Window.h`/`.cpp` — the operator itself, alongside `HashJoin`, `Aggregate`, and `TableScan`.
- `WindowFunction.h`/`.cpp` — the public API that window function authors implement against, like `AggregateFunction.h`.

**Move under `velox/exec/window/` (`exec::window` namespace):**
- `WindowBuild` and its variants, `WindowPartition`, `AggregateWindow`, `KRangeFrameBound`, `PeerGroupComputation`, `WindowPartitionAccessor` — internal implementation details.

There are no behavior changes; only file locations, the enclosing namespace of the internal files, and references to them change. The operator and the public API reference the internal types through the `window::` namespace.

Pull Request resolved: #17710

Reviewed By: apurva-meta

Differential Revision: D107420866

Pulled By: bikramSingh91

fbshipit-source-id: 7a9a306463fd5005d5b48db395d58afde2d32ea0
@zhli1142015 zhli1142015 force-pushed the optimize-rows-streaming-window-vector branch 2 times, most recently from e95ccdc to 07fc8d1 Compare June 4, 2026 01:50
@github-actions

github-actions Bot commented Jun 4, 2026

CI Failure Analysis

Auto-generated by the CI Failure Analysis workflow. This comment is updated in place each time CI fails on a new commit, so it always reflects the latest run — re-pushing or re-running CI will refresh the analysis below. Last updated 2026-06-04 02:16:54 UTC from workflow run 26925050793.

🟡 Window Fuzzer with Presto as source of truth — FUZZER Failure View logs

Fuzzer: Window Fuzzer (Presto as source of truth)
Failed instance: 4 of 4 (seed=501594481)
Instances 1-3: Passed

Root cause: The fuzzer's verification rate dropped below the required 50% threshold (49.25% < 50%). This was caused by too many Presto reference query failures (15 out of 134 iterations = 11.19%), which reduced the number of successfully verified iterations below the minimum.

Expression: (stats_.numVerified + stats_.numVerificationSkipped) / (double)iteration >= 0.5
             (0.4925373134328358 vs. 0.5)
File: velox/exec/fuzzer/WindowFuzzer.cpp:593
Function: go

Total iterations: 134
  Verified against reference DB:    66 (49.25%)
  Verification skipped:              0 (0.00%)
  Not supported by reference DB:    36 (26.87%)
  Reference DB failed:              15 (11.19%)
  Total failed functions:           17 (12.69%)

Presto reference query errors observed:
  - "Window frame offset value must not be negative or null" (11 occurrences)
  - "integer overflow" (1 occurrence)
  - "Unsupported type parameters ... for make_set_digest" (1 occurrence)
  - "Unknown type: hugeint" (1 occurrence)

The fuzzer aborted with SIGABRT after the VELOX_CHECK on the verification rate failed.
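
Worked through with the counts above, the threshold check evaluates to:

$$
\frac{\text{numVerified} + \text{numVerificationSkipped}}{\text{iterations}} = \frac{66 + 0}{134} \approx 0.4925 < 0.5
$$

With 134 iterations, 67 verified or skipped iterations would have met the threshold exactly ($67 / 134 = 0.5$), so this run fell short by a single iteration.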


Correlation with PR changes:

  • This failure is not caused by the PR changes. PR perf(window): Skip RowContainer round-trip in streaming window build #17558 modifies WindowPartition (adding virtual methods and a protected constructor for subclassing) and adds VectorWindowPartition/RowsStreamingWindowBuild improvements. These are structural/API changes that don't affect the fuzzer's interaction with the Presto reference query runner.
  • The failure is purely in the Presto reference DB query execution path (PrestoQueryRunner.cpp:120), where Presto itself rejects certain generated queries (e.g., negative window frame offsets, integer overflow, unsupported types). The PR does not touch PrestoQueryRunner or any fuzzer infrastructure.

Known issues:

Reproduce locally:

# Build the window fuzzer
make debug
# Run with the same seed (requires a Presto instance as reference)
./_build/debug/velox/functions/prestosql/fuzzer/velox_window_fuzzer_test \
  --seed 501594481 \
  --duration_sec 300

Note: Reproduction requires a running Presto instance as the reference query runner, which makes local reproduction difficult without the full CI environment.

Recommended fix:
No fix needed in this PR. This is a known flaky test tracked by #16917. Re-running the CI workflow should resolve the issue since the failure is non-deterministic and depends on which random queries the fuzzer generates against the Presto reference.

RowsStreamingWindowBuild handles rows-streaming window input that is already
sorted by the partition and order keys. Because the input is already in window
order and the operator can stream rows through partial partitions, copying every
row into RowContainer and then extracting the same rows back into vectors adds
unnecessary column-to-row and row-to-column conversion work.

This change keeps the existing RowsStreamingWindowBuild path, but retains input
RowVector ranges directly and exposes them through a vector-backed
WindowPartition. This removes the RowContainer materialization/extraction round
trip for rows-streaming window execution while preserving partial-partition
processing and range-frame peer/NaN semantics.

Add a rows-streaming window benchmark with pre-sorted Values input. Folly timing
excludes cursor/task setup and stats collection via BENCHMARK_SUSPEND;
windowCpu/windowWall report the Window operator's addInput + getOutput timings.

Benchmark setup:
- Input: p INTEGER, s INTEGER, v BIGINT, sorted by p, s.
- Sizes: 10K, 100K, 1M rows; 10K rows per input vector.
- Larger inputs use 25K rows per partition to cross vector boundaries.
- Cases: rank, sum(v), rank()+sum(v), and 7 funcs.
- 7 funcs: rank, dense_rank, row_number, sum, count, min, max.
- Output batch: 1K rows.
- Build: _build/release (-O3 -DNDEBUG), main 32e6cc4 vs this change.
- Command: `velox_rows_streaming_window_benchmark --bm_regex='rowsStreamingWindow.*' --bm_min_iters=5 --bm_min_usec=1000000`

Benchmark results. Each timing cell is total time / Window CPU. CPU reduction is computed from the Window CPU values.

| Case | Rows | Main | This change | CPU reduction |
| --- | ---: | ---: | ---: | ---: |
| rank | 10K | 638.83us / 589.03us | 303.48us / 258.87us | 56.1% |
| sum | 10K | 1.47ms / 1.42ms | 1.02ms / 951.05us | 33.0% |
| rank+sum | 10K | 1.50ms / 1.45ms | 1.01ms / 956.86us | 34.0% |
| 7 funcs | 10K | 3.13ms / 3.09ms | 2.85ms / 2.84ms | 8.1% |
| rank | 100K | 5.71ms / 5.52ms | 2.77ms / 2.57ms | 53.4% |
| sum | 100K | 14.28ms / 14.10ms | 9.67ms / 9.45ms | 33.0% |
| rank+sum | 100K | 14.24ms / 14.09ms | 9.75ms / 9.57ms | 32.1% |
| 7 funcs | 100K | 30.00ms / 30.21ms | 28.11ms / 28.30ms | 6.3% |
| rank | 1M | 56.18ms / 54.83ms | 27.95ms / 26.52ms | 51.6% |
| sum | 1M | 145.83ms / 140.15ms | 96.36ms / 95.24ms | 32.0% |
| rank+sum | 1M | 147.18ms / 141.30ms | 97.40ms / 96.25ms | 31.9% |
| 7 funcs | 1M | 299.70ms / 302.85ms | 282.31ms / 285.03ms | 5.9% |
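
The CPU reduction column is consistent with the usual relative-difference formula applied to the Window CPU figures. A small sketch, where `cpuReductionPercent` is a hypothetical helper and not part of the benchmark harness:

```cpp
#include <cassert>
#include <cmath>

// Percentage reduction in Window CPU time. Both arguments must use the
// same unit (for example, both in microseconds).
inline double cpuReductionPercent(double mainCpu, double changeCpu) {
  assert(mainCpu > 0);
  return (mainCpu - changeCpu) / mainCpu * 100.0;
}
```

For example, the rank case at 10K rows goes from 589.03us to 258.87us of Window CPU, which this formula rounds to the 56.1% shown in the table.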

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
zhli1142015 and others added 4 commits June 4, 2026 19:02
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
WindowPartitionKeys.h bundled three unrelated types in the detail namespace
even though they were used across translation units. Separate them by
responsibility:

- RowColumnsSnapshot (renamed from WindowPartitionKeyRowSnapshot) is the only
  genuinely shared type. It copies a subset of a row's columns so the row can
  be compared after its source vector is gone, which is not window-specific.
  It moves to RowColumnsSnapshot.{h,cpp} in the exec namespace.
- The reference into a retained input vector is used only by
  VectorWindowPartition, so it moves to an anonymous namespace in the .cpp
  together with rowAt and rowsEqual.
- The key-channel deduplication becomes a small free function inlined into
  each of the two callers instead of a class with static methods.

No behavior change.
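
For illustration only, a key-channel deduplication helper of the kind described above could look like the sketch below. `ChannelIndex` and `collectKeyChannels` are hypothetical names, and the ordering rule (first occurrence wins) is an assumption rather than something stated in this commit.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical alias for a column index in the input row type.
using ChannelIndex = uint32_t;

// Returns the partition-key channels followed by the sort-key channels,
// skipping any channel that was already added. First-occurrence order is
// preserved. A linear search is fine because key lists are short.
inline std::vector<ChannelIndex> collectKeyChannels(
    const std::vector<ChannelIndex>& partitionChannels,
    const std::vector<ChannelIndex>& sortChannels) {
  std::vector<ChannelIndex> result;
  result.reserve(partitionChannels.size() + sortChannels.size());

  auto appendUnique = [&result](ChannelIndex channel) {
    if (std::find(result.begin(), result.end(), channel) == result.end()) {
      result.push_back(channel);
    }
  };

  for (ChannelIndex channel : partitionChannels) {
    appendUnique(channel);
  }
  for (ChannelIndex channel : sortChannels) {
    appendUnique(channel);
  }
  return result;
}
```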

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move the duplicated RowBlock struct into a single shared header
  velox/exec/window/RowBlock.h and validate its invariants in the
  constructor so all creation paths are consistent.
- Rename VectorWindowPartition::addBlock to addRows to match the
  existing add-rows vocabulary.
- Inline the trivial numRows and numRowsForProcessing getters in the
  header.
- Initialize blockPrefixSums_ inline instead of in the constructor body.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@zhli1142015 zhli1142015 force-pushed the optimize-rows-streaming-window-vector branch from 07fc8d1 to 652168e Compare June 4, 2026 11:02
@zhli1142015 zhli1142015 requested a review from mbasmanova June 4, 2026 11:40
Contributor

@mbasmanova mbasmanova left a comment

Thank you for the updates!

  • RowBlock — consider renaming to RowRange. "Block" implies a data structure; this is a row range within a vector.
  • RowColumnsSnapshot — consider SingleRowValues. Simple, direct — it stores values from a single row.
  • The doc comment ("Copies a subset of columns from one row into self-contained one-row vectors so the row can be compared against later rows after its source vector is gone") is hard to parse. Simpler: "Stores copies of selected column values from a single row for later comparison."
  • isValid() / clear() — use hasValue() / reset() to match std::optional vocabulary. The doc comment should describe the lifecycle: empty by default, capture() populates, reset() clears, hasValue() checks — like std::optional.
  • rowsEqual() — redundant with the class name. Just equals().
  • capture() takes channels on every call, but the same channels are passed every time. equals() takes keyInfo and inputChannels on every call, also always the same. Make these constructor parameters. Then capture(input, row) and equals(input, row) become simple — the object knows what to capture and compare.
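
The lifecycle suggested in the last two points can be sketched as follows. This is a toy model over plain `int64_t` columns, not the Velox class: `SingleRowValuesSketch` and `Batch` are hypothetical, and returning `false` from `equals()` when nothing has been captured is a choice made for this sketch only.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Toy columnar batch: batch[c][r] is the value of column c at row r.
using Batch = std::vector<std::vector<int64_t>>;

class SingleRowValuesSketch {
 public:
  // The columns to capture and compare are fixed at construction.
  explicit SingleRowValuesSketch(std::vector<size_t> channels)
      : channels_(std::move(channels)) {}

  // Empty by default; capture() populates, reset() clears.
  bool hasValue() const {
    return values_.has_value();
  }

  void reset() {
    values_.reset();
  }

  // Copies the configured columns of 'row', so 'input' does not need to
  // stay alive afterwards.
  void capture(const Batch& input, size_t row) {
    std::vector<int64_t> copied;
    copied.reserve(channels_.size());
    for (size_t channel : channels_) {
      copied.push_back(input.at(channel).at(row));
    }
    values_ = std::move(copied);
  }

  // Compares the captured values with 'row' of 'input' on the configured
  // columns. Returns false if nothing has been captured (sketch-only rule).
  bool equals(const Batch& input, size_t row) const {
    if (!values_) {
      return false;
    }
    for (size_t i = 0; i < channels_.size(); ++i) {
      if ((*values_)[i] != input.at(channels_[i]).at(row)) {
        return false;
      }
    }
    return true;
  }

 private:
  std::vector<size_t> channels_;
  std::optional<std::vector<int64_t>> values_;
};
```

With the channels held by the object, a caller that tracks partition boundaries across batches only needs `capture(input, lastRow)` at the end of one batch and `equals(nextInput, 0)` at the start of the next.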

Rename RowBlock to RowRange and RowColumnsSnapshot to SingleRowValues, and
rework SingleRowValues to bake its columns and pool into the constructor so
capture/equals take only an input row. Adopt std::optional vocabulary
(hasValue/reset) and rename the block-based members and helpers to range.

RowsStreamingWindowBuild now keeps separate partition-key and sort-key value
snapshots instead of one combined snapshot.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@zhli1142015 zhli1142015 force-pushed the optimize-rows-streaming-window-vector branch from 8384f11 to 13fe802 Compare June 4, 2026 14:02
@zhli1142015
Contributor Author

zhli1142015 commented Jun 4, 2026

Thank you for the detailed review, @mbasmanova! I've addressed all the points in commit 13fe802:

  • Renamed RowBlock to RowRange, and updated the related members/helpers to use "range" vocabulary for consistency.
  • Renamed RowColumnsSnapshot to SingleRowValues.
  • Simplified the class doc comment to your suggested wording and documented the std::optional-like lifecycle.
  • Renamed isValid()/clear() to hasValue()/reset(), and rowsEqual() to equals().
  • Moved the constant columns and pool into the constructor, so capture(input, row) and equals(input, row) are now simple. In RowsStreamingWindowBuild, since the previous single snapshot captured the union of partition and sort keys but compared subsets, I split it into two single-purpose snapshots (partitionKeyValues_ and peerKeyValues_) so each object knows exactly what to capture and compare.

All window unit tests and the benchmark pass with no regression. Please take another look when you have a chance.

Contributor

@mbasmanova mbasmanova left a comment

Thank you for the thorough updates.

@mbasmanova mbasmanova added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Jun 4, 2026
@meta-codesync

meta-codesync Bot commented Jun 4, 2026

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this in D107544696.

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

3 participants