-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregate functions on string columns #17192
base: master
Are you sure you want to change the base?
Aggregate functions on string columns #17192
Conversation
Add StringMinBufferAggregator Settle all fields in StringMinAggregatorFactory Add min aggregator to Guice Fix combine logic in aggregator Fix byte logic in bufferAggregator Make Factory Java8 compatible Add String Aggregator to AggregationType Implement methods to pass Unit AggregatorFactoryTest Implement superclass for StringAggregatorFactory Remove re-implementation of superclass Remove duplicated implementation of factorize() Finish StringMinBufferAggregator Add limit to number of bytes Add and run unit tests for StringMinBufferAggregator Add MinAggregationFactory methods Add StringMinVectorAggregator Change AggregatorFactoryTest declarations Fix checkstyle
processing/src/main/java/org/apache/druid/query/aggregation/SimpleStringAggregatorFactory.java
Fixed
Show fixed
Hide fixed
@@ -3,4 +3,4 @@ b a | |||
c a | |||
d a | |||
e a | |||
null a | |||
null null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to justify my changes to the expected test result for test cases woutPrtnBy_6
and woutPrtnBy_7
.
Taking woutPrtnBy_6
for example, the query is as follows:
SELECT c2, MIN(MIN(c2)) OVER( ORDER BY c2 ) min_c2
FROM "tblWnulls.parquet"
GROUP BY c2
the original expected and actual results are as follows:
2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Expected results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, "a"}
----
2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Actual results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, null}
----
Turns out that either one of OVER(ORDER BY c2)
or GROUP BY
(I still have difficulty finding out where this happened) will a partial-result of the following order: [null, a, b, c, d, e]
. As such, our MinStringAggregator#compare
will be iterating strings as follows:
compare(null, null) --> null
compare(null, 'a') --> 'a'
compare('a', 'b') --> 'a'
compare('a', 'c') --> 'a'
compare('a', 'd') --> 'a'
compare('a', 'e') --> 'a'
As we can see, due to the order in which compare()
runs, we are not able to do away with null
. This order of things also happened for woutPrtnBy_7
, which uses MAX instead of MIN.
While I continued to work and experimented with things, I realised that this order is not only limited to my new StringMin implementation. A tweak to woutPrtnBy_2
shows the following results:
Original Query:
SELECT c2, MIN(MAX(c1)) OVER( ORDER BY c2 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c2
Changed Query:
SELECT c1, MIN(MAX(c1)) OVER( ORDER BY c1 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c1
Results:
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #0: [null, null]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #1: [-1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #2: [0, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #3: [1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #4: [2, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #5: [4, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #6: [5, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #7: [6, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #8: [8, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #9: [9, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #10: [10, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #11: [11, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #12: [12, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #13: [13, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #14: [14, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #15: [15, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #16: [17, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #17: [19, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #18: [11111, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #19: [65536, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #20: [1000000, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #21: [2147483647, -1]
java.lang.RuntimeException: Can't parse input!
Although the test ran into a runtime exception, we can see that the results that are logged shows the presence of [null, null]. Hence, this means that the partial query generated in this case also does not follow a null last comparator logic. Oh, before you panic, I have changed my edits to woutPrtnBy_2
after making such tests 😅.
Seeing that the behaviour is the same across different ColumnTypes, I think that fixing this bug (if do we see it as one) will require another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the PR! I'll try to review this soon. |
Hello @adarshsanjeev, would like to ping and see how's it going 😄 |
This pull request has been marked as stale due to 60 days of inactivity. |
Fixes #16956.
Description
This PR introduces min() and max() aggregators for string columns:
StringMinAggregator
andStringMaxAggregator
, and their respective buffer and vector aggregators. These aggregators compute the minimum and maximum string values respectively in a column during query execution.Throughout the implementation process, I had some difficulty understand some of the below, and would appreciate extra care in the reviewing process for:
expression
in the aggregator factories.SingleValueDimensionVectorSelector
andMultiValueDimensionVectorSelector
for vector aggregation.moving-average-query.md
?Regarding the issue #11659 that is linked to #16956, I did some testing, and while my string aggregators can run on datasources:
Results:
Running the following will still not work.
Results:
With that out of the way, let's go into the description of my implementations.
Add StringMin and StringMax Aggregators
StringMinAggregator
to find the minimum string value in a column.StringMaxAggregator
to find the maximum string value in a column.StringMinBufferAggregator
andStringMaxBufferAggregator
for efficient in-memory aggregation.StringMinVectorAggregator
andStringMaxVectorAggregator
for vectorized query execution.Setting up Aggregator Factory classes
StringMinAggregatorFactory
andStringMaxAggregatorFactory
to create the respective aggregators.StringMinAggregatorFactory
andStringMaxAggregatorFactory
undercreateMinAggregatorFactory
andcreateMaxAggregatorFactory
respectively.Aggregations#getArgumentsForSimpleAggregator
to accommodate the new String aggregation functions.AggregatorUtils
AggregatorsModule
.Unit Tests
AggregatorFactoryTest
.AGGREGATION_NOT_SUPPORT_TYPE
annotations underDrillWindowQueryTest
to support testing of new aggregators.DrillWindowQueryTest
Release note
New: You can now query string columns with min() and max() functions.
Key changed/added classes in this PR
StringMinAggregator
StringMaxAggregator
StringMinBufferAggregator
StringMaxBufferAggregator
StringMinVectorAggregator
StringMaxVectorAggregator
StringMinAggregatorFactory
StringMaxAggregatorFactory
StringMinAggregatorTest
StringMaxAggregatorTest
StringMinBufferAggregatorTest
StringMaxBufferAggregatorTest
StringMinVectorAggregatorTest
StringMaxVectorAggregatorTest
StringMinAggregatorFactoryTest
StringMaxAggregatorFactoryTest
aggregations.md
This PR has: