Aggregate functions on string columns #17192

GWphua · 2024-09-30T10:52:34Z

Description

This PR introduces min() and max() aggregators for string columns: StringMinAggregator and StringMaxAggregator, and their respective buffer and vector aggregators. These aggregators compute the minimum and maximum string values respectively in a column during query execution.

Throughout the implementation process, I had some difficulty understanding some of the items below, and would appreciate extra care when reviewing them:

Usage of expression in the aggregator factories.
Use cases of SingleValueDimensionVectorSelector and MultiValueDimensionVectorSelector for vector aggregation.
VectorAggregator classes in general.
Should I mention my aggregators under moving-average-query.md?

Regarding the issue #11659 that is linked to #16956, I did some testing, and while my string aggregators can run on datasources:

SELECT min("user")
FROM "wikipedia"

Results:

*feridiák

Running the following still does not work:

SELECT
  min("start"), max("end")
FROM sys.segments

Results:

Error: RUNTIME_FAILURE (OPERATOR)
java.lang.reflect.InvocationTargetException
java.lang.RuntimeException

With that out of the way, let's go into the description of my implementations.

Add StringMin and StringMax Aggregators

Implemented StringMinAggregator to find the minimum string value in a column.
Implemented StringMaxAggregator to find the maximum string value in a column.
Implemented StringMinBufferAggregator and StringMaxBufferAggregator for efficient in-memory aggregation.
Implemented StringMinVectorAggregator and StringMaxVectorAggregator for vectorized query execution.
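
To illustrate what the aggregators listed above compute, here is a minimal sketch of a null-skipping string min/max fold. `StringMinMaxSketch` and its methods are hypothetical names for illustration and are not the classes in this PR. It assumes plain `String#compareTo` ordering and that null means "no value seen yet"; the comparator actually used in the PR may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the PR's StringMinAggregator/StringMaxAggregator.
final class StringMinMaxSketch {
    // Returns the smaller of the two strings, skipping nulls (assumption: null = no value yet).
    static String min(String current, String candidate) {
        if (current == null) {
            return candidate;
        }
        if (candidate == null) {
            return current;
        }
        return current.compareTo(candidate) <= 0 ? current : candidate;
    }

    // Returns the larger of the two strings, skipping nulls.
    static String max(String current, String candidate) {
        if (current == null) {
            return candidate;
        }
        if (candidate == null) {
            return current;
        }
        return current.compareTo(candidate) >= 0 ? current : candidate;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("bob", null, "alice", "carol");
        String lo = null;
        String hi = null;
        // Fold every row into the running min and max, as an aggregator would do per row.
        for (String user : users) {
            lo = min(lo, user);
            hi = max(hi, user);
        }
        System.out.println(lo + " " + hi); // prints "alice carol"
    }
}
```

The same two-argument function can serve both for per-row aggregation and for combining partial results, since min and max are associative.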

Setting up Aggregator Factory classes

Implemented StringMinAggregatorFactory and StringMaxAggregatorFactory to create the respective aggregators.
Allow creation of StringMinAggregatorFactory and StringMaxAggregatorFactory under createMinAggregatorFactory and createMaxAggregatorFactory respectively.
Change logic of Aggregations#getArgumentsForSimpleAggregator to accommodate the new String aggregation functions.
Add new cache key IDs under AggregatorUtils
Add the new aggregators as modules to AggregatorsModule.
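
A buffer-backed string aggregator needs a fixed number of bytes per slot, which is what a factory reports as its maximum intermediate size. The sketch below shows one possible slot layout under stated assumptions: a 4-byte length prefix (-1 for null) followed by at most `maxStringBytes` bytes of UTF-8. `StringSlotSketch`, `slotSize`, `write`, and `read` are hypothetical names, and the layout is an assumption for illustration, not the layout used in this PR.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; the PR's buffer aggregators may use a different layout.
final class StringSlotSketch {
    // Bytes needed per slot: length prefix plus the capped string payload.
    // Math.addExact throws ArithmeticException instead of silently overflowing.
    static int slotSize(int maxStringBytes) {
        if (maxStringBytes < 0) {
            throw new IllegalArgumentException("maxStringBytes must be >= 0");
        }
        return Math.addExact(Integer.BYTES, maxStringBytes);
    }

    // Writes the value at the given slot position, truncating to maxStringBytes bytes.
    // Note: truncating at a byte boundary can split a multi-byte UTF-8 character.
    static void write(ByteBuffer buf, int position, String value, int maxStringBytes) {
        if (value == null) {
            buf.putInt(position, -1);
            return;
        }
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int length = Math.min(bytes.length, maxStringBytes);
        buf.putInt(position, length);
        for (int i = 0; i < length; i++) {
            buf.put(position + Integer.BYTES + i, bytes[i]);
        }
    }

    // Reads the value stored at the given slot position; returns null for the null marker.
    static String read(ByteBuffer buf, int position) {
        int length = buf.getInt(position);
        if (length < 0) {
            return null;
        }
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = buf.get(position + Integer.BYTES + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Absolute `ByteBuffer` reads and writes are used so that the buffer's own position is never moved, which matters when many slots share one buffer.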

Unit Tests

Added unit tests for all above classes, except VectorAggregators.
Added the new aggregator factory classes in AggregatorFactoryTest.
Remove AGGREGATION_NOT_SUPPORT_TYPE annotations under DrillWindowQueryTest to support testing of new aggregators.
Add new tests under DrillWindowQueryTest

Release note

New: You can now query string columns with min() and max() functions.

Key changed/added classes in this PR

StringMinAggregator
StringMaxAggregator
StringMinBufferAggregator
StringMaxBufferAggregator
StringMinVectorAggregator
StringMaxVectorAggregator
StringMinAggregatorFactory
StringMaxAggregatorFactory
StringMinAggregatorTest
StringMaxAggregatorTest
StringMinBufferAggregatorTest
StringMaxBufferAggregatorTest
StringMinVectorAggregatorTest
StringMaxVectorAggregatorTest
StringMinAggregatorFactoryTest
StringMaxAggregatorFactoryTest
aggregations.md

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

Add StringMinBufferAggregator
Settle all fields in StringMinAggregatorFactory
Add min aggregator to Guice
Fix combine logic in aggregator
Fix byte logic in bufferAggregator
Make Factory Java8 compatible
Add String Aggregator to AggregationType
Implement methods to pass Unit AggregatorFactoryTest
Implement superclass for StringAggregatorFactory
Remove re-implementation of superclass
Remove duplicated implementation of factorize()
Finish StringMinBufferAggregator
Add limit to number of bytes
Add and run unit tests for StringMinBufferAggregator
Add MinAggregationFactory methods
Add StringMinVectorAggregator
Change AggregatorFactoryTest declarations
Fix checkstyle

processing/src/main/java/org/apache/druid/query/aggregation/SimpleStringAggregatorFactory.java

GWphua · 2024-10-02T09:48:19Z

sql/src/test/resources/drill/window/queries/nestedAggs/woutPrtnBy_6.e

@@ -3,4 +3,4 @@ b	a
 c	a
 d	a
 e	a
-null	a
+null	null


I would like to justify my changes to the expected test result for test cases woutPrtnBy_6 and woutPrtnBy_7.

Taking woutPrtnBy_6 for example, the query is as follows:

SELECT c2, MIN(MIN(c2)) OVER( ORDER BY c2 ) min_c2 FROM "tblWnulls.parquet" GROUP BY c2

The original expected and actual results are as follows:

2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Expected results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, "a"}
----
2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Actual results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, null}
----

It turns out that either OVER(ORDER BY c2) or GROUP BY (I have not yet been able to find out which) produces a partial result in the following order: [null, a, b, c, d, e]. As such, our MinStringAggregator#compare iterates over the strings as follows:

compare(null, null) --> null
compare(null, 'a') --> 'a'
compare('a', 'b') --> 'a'
compare('a', 'c') --> 'a'
compare('a', 'd') --> 'a'
compare('a', 'e') --> 'a'

As we can see, because of the order in which compare() runs, the null produced for the first row cannot be eliminated. The same ordering also occurs for woutPrtnBy_7, which uses MAX instead of MIN.
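
The trace above can be reproduced with a small sketch. It assumes the window is evaluated as a running aggregate over rows that arrive with nulls first, and that the combine step skips nulls; `RunningMinSketch`, `combine`, and `runningMin` are hypothetical names for illustration, not code from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a running MIN over an already-ordered partial result.
final class RunningMinSketch {
    // Null-skipping min: a null operand yields the other operand.
    static String combine(String lhs, String rhs) {
        if (lhs == null) {
            return rhs;
        }
        if (rhs == null) {
            return lhs;
        }
        return lhs.compareTo(rhs) <= 0 ? lhs : rhs;
    }

    // Emits one running-min value per input row, in arrival order.
    static List<String> runningMin(List<String> ordered) {
        List<String> out = new ArrayList<>();
        String acc = null;
        for (String value : ordered) {
            acc = combine(acc, value);
            out.add(acc);
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows arrive with the null first, as observed in the partial result.
        List<String> ordered = Arrays.asList(null, "a", "b", "c", "d", "e");
        System.out.println(runningMin(ordered)); // prints [null, a, a, a, a, a]
    }
}
```

Because the null row is processed before any non-null value has been seen, its running minimum has nothing to fall back on, which matches the [null, null] row in the actual results.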

As I continued to work and experiment, I realised that this ordering is not limited to my new StringMin implementation. A tweak to woutPrtnBy_2 shows the following results:

Original Query:

SELECT c2, MIN(MAX(c1)) OVER( ORDER BY c2 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c2

Changed Query:

SELECT c1, MIN(MAX(c1)) OVER( ORDER BY c1 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c1

Results:

2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #0: [null, null]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #1: [-1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #2: [0, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #3: [1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #4: [2, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #5: [4, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #6: [5, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #7: [6, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #8: [8, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #9: [9, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #10: [10, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #11: [11, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #12: [12, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #13: [13, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #14: [14, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #15: [15, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #16: [17, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #17: [19, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #18: [11111, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #19: [65536, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #20: [1000000, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #21: [2147483647, -1]
java.lang.RuntimeException: Can't parse input!

Although the test ran into a runtime exception, the logged results show the presence of [null, null]. This means that the partial query generated in this case also does not follow nulls-last comparator logic. Oh, before you panic, I reverted my edits to woutPrtnBy_2 after running these tests 😅.

Seeing that the behaviour is the same across different ColumnTypes, I think that fixing this bug (if we do see it as one) will require another PR.

cc @imply-cheddar

adarshsanjeev · 2024-10-08T05:05:50Z

Thanks for making the PR! I'll try to review this soon.

GWphua · 2024-11-04T01:58:33Z

Hello @adarshsanjeev, I would like to ping and see how it's going 😄

github-actions · 2025-01-04T00:20:40Z

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the [email protected] list.
Thank you for your contributions.

GWphua added 17 commits September 30, 2024 16:42

Add min and max aggregators

53c8cff

Remove StringMaxAggregator, deal with StringMin first

67b883a

Complete StringMinAggregatorFactory

5f02730

Allow min/max aggregators to parse expression correctly for strings

a9c5753

Allow StringMin / Max to be processed as a Direct Column Access.

5202445

Add utility support for stringMin

600c66d

Add SimpleString super class

f3101c5

Implement SimpleStringAggregator, MinStringAggregator works

6b0c4ed

Implement StringMinBufferAggregatorTest

6dc2a97

Implement StringMinAggregatorTest

13cecd8

Add factorizeVector

90b30b8

Close all StringMinAggregator in test files to avoid memory leaks.

3cf5443

Add Unit Test for StringMinAggregatorFactoryTest

db336e0

Fix checkstyle

478b13f

Add StringMaxAggregator

b872a00

Add StringMaxAggregation tests

c19e3e7

github-actions bot added the Area - Querying label Sep 30, 2024

Fix checkstyle errors

ef33d79

github-advanced-security bot found potential problems Sep 30, 2024

processing/src/main/java/org/apache/druid/query/aggregation/SimpleStringAggregatorFactory.java Fixed

GWphua added 10 commits October 1, 2024 10:15

Add checks to ensure getMaxIntermediateSize will not overflow

21607f1

Fix Forbidden API violation on String.format()

9a4f6f0

Fix MIN(STRING) unsupported test under SqlResourceTest

fc1278e

Add unit test for AggregatorUtil initialization

a2a681b

Add maxStringBytes check to address CodeQL scanning

ff7f9ad

100% coverage for Factory classes

16848c4

Add more tests for SimpleString

f9e2e0c

Remove unsupported annotations for supported tests

8c83661

Remove notYetSupported annotation for minString/maxString

5fe883a

Fix checkstyle assertions in StringMinAggregatorFactoryTest

88a7312

GWphua added 3 commits October 2, 2024 16:54

Rectify DrillWindowQueryTest expected output

5cd5f62

Add new test for nested aggregation with nulls

46226e5

Fix MaxAggregatorFactoryTest checkstyle

2b4f40f

GWphua commented Oct 2, 2024

GWphua added 2 commits October 3, 2024 10:12

Merge remote-tracking branch 'origin/master' into aggregate-functions…

74e3a38

…-on-string-columns

Update documentation on stringMin/stringMax

f56936c

github-actions bot added the Area - Documentation label Oct 3, 2024

github-actions bot added the stale label Jan 4, 2025

Merge branch 'apache:master' into aggregate-functions-on-string-columns

0656f7e
