Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregate functions on string columns #17192

Open
wants to merge 34 commits into
base: master
Choose a base branch
from

Conversation

GWphua
Copy link
Contributor

@GWphua GWphua commented Sep 30, 2024

Fixes #16956.

Description

This PR introduces min() and max() aggregators for string columns: StringMinAggregator and StringMaxAggregator, and their respective buffer and vector aggregators. These aggregators compute the minimum and maximum string values respectively in a column during query execution.

Throughout the implementation process, I had some difficulty understand some of the below, and would appreciate extra care in the reviewing process for:

  • Usage of expression in the aggregator factories.
  • Use cases of SingleValueDimensionVectorSelector and MultiValueDimensionVectorSelector for vector aggregation.
  • VectorAggregator classes in general.
  • Should I mention my aggregators under moving-average-query.md?

Regarding the issue #11659 that is linked to #16956, I did some testing, and while my string aggregators can run on datasources:

SELECT min("user")
FROM "wikipedia"

Results:

*feridiák

Running the following will still not work.

SELECT
  min("start"), max("end")
FROM sys.segments

Results:

Error: RUNTIME_FAILURE (OPERATOR)
java.lang.reflect.InvocationTargetException
java.lang.RuntimeException

With that out of the way, let's go into the description of my implementations.


Add StringMin and StringMax Aggregators

  • Implemented StringMinAggregator to find the minimum string value in a column.
  • Implemented StringMaxAggregator to find the maximum string value in a column.
  • Implemented StringMinBufferAggregator and StringMaxBufferAggregator for efficient in-memory aggregation.
  • Implemented StringMinVectorAggregator and StringMaxVectorAggregator for vectorized query execution.

Setting up Aggregator Factory classes

  • Implemented StringMinAggregatorFactory and StringMaxAggregatorFactory to create the respective aggregators.
  • Allow creation of StringMinAggregatorFactory and StringMaxAggregatorFactory under createMinAggregatorFactory and createMaxAggregatorFactory respectively.
  • Change logic of Aggregations#getArgumentsForSimpleAggregator to accommodate the new String aggregation functions.
  • Add new cache key IDs under AggregatorUtils
  • Add the new aggregators as modules to AggregatorsModule.

Unit Tests

  • Added unit tests for all above classes, except VectorAggregators.
  • Added the new aggregator factory classes in AggregatorFactoryTest.
  • Remove AGGREGATION_NOT_SUPPORT_TYPE annotations under DrillWindowQueryTest to support testing of new aggregators.
  • Add new tests under DrillWindowQueryTest

Release note

New: You can now query string columns with min() and max() functions.


Key changed/added classes in this PR
  • StringMinAggregator
  • StringMaxAggregator
  • StringMinBufferAggregator
  • StringMaxBufferAggregator
  • StringMinVectorAggregator
  • StringMaxVectorAggregator
  • StringMinAggregatorFactory
  • StringMaxAggregatorFactory
  • StringMinAggregatorTest
  • StringMaxAggregatorTest
  • StringMinBufferAggregatorTest
  • StringMaxBufferAggregatorTest
  • StringMinVectorAggregatorTest
  • StringMaxVectorAggregatorTest
  • StringMinAggregatorFactoryTest
  • StringMaxAggregatorFactoryTest
  • aggregations.md

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Add StringMinBufferAggregator

Settle all fields in StringMinAggregatorFactory

Add min aggregator to Guice

Fix combine logic in aggregator

Fix byte logic in bufferAggregator

Make Factory Java8 compatible

Add String Aggregator to AggregationType

Implement methods to pass Unit AggregatorFactoryTest

Implement superclass for StringAggregatorFactory

Remove re-implementation of superclass

Remove duplicated implementation of factorize()

Finish StringMinBufferAggregator

Add limit to number of bytes

Add and run unit tests for StringMinBufferAggregator

Add MinAggregationFactory methods

Add StringMinVectorAggregator

Change AggregatorFactoryTest declarations

Fix checkstyle
@@ -3,4 +3,4 @@ b a
c a
d a
e a
null a
null null
Copy link
Contributor Author

@GWphua GWphua Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to justify my changes to the expected test result for test cases woutPrtnBy_6 and woutPrtnBy_7.

Taking woutPrtnBy_6 for example, the query is as follows:

SELECT c2, MIN(MIN(c2)) OVER( ORDER BY c2 ) min_c2 
FROM "tblWnulls.parquet" 
GROUP BY c2

the original expected and actual results are as follows:

2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Expected results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, "a"}
----
2024-10-02T08:47:04,748 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_6] org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Actual results --
new Object[]{"a", "a"},
new Object[]{"b", "a"},
new Object[]{"c", "a"},
new Object[]{"d", "a"},
new Object[]{"e", "a"},
new Object[]{null, null}
----

Turns out that either one of OVER(ORDER BY c2) or GROUP BY (I still have difficulty finding out where this happened) will a partial-result of the following order: [null, a, b, c, d, e]. As such, our MinStringAggregator#compare will be iterating strings as follows:

compare(null, null) --> null
compare(null, 'a') --> 'a'
compare('a', 'b') --> 'a'
compare('a', 'c') --> 'a'
compare('a', 'd') --> 'a'
compare('a', 'e') --> 'a'

As we can see, due to the order in which compare() runs, we are not able to do away with null. This order of things also happened for woutPrtnBy_7, which uses MAX instead of MIN.

While I continued to work and experimented with things, I realised that this order is not only limited to my new StringMin implementation. A tweak to woutPrtnBy_2 shows the following results:

Original Query:

SELECT c2, MIN(MAX(c1)) OVER( ORDER BY c2 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c2

Changed Query:

SELECT c1, MIN(MAX(c1)) OVER( ORDER BY c1 ) min_mx_c1 FROM "tblWnulls.parquet" GROUP BY c1

Results:

2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #0: [null, null]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #1: [-1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #2: [0, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #3: [1, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #4: [2, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #5: [4, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #6: [5, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #7: [6, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #8: [8, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #9: [9, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #10: [10, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #11: [11, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #12: [12, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #13: [13, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #14: [14, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #15: [15, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #16: [17, -1]
2024-10-02T08:43:55,868 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #17: [19, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #18: [11111, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #19: [65536, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #20: [1000000, -1]
2024-10-02T08:43:55,869 INFO [drillWindowQuery-nestedAggs/woutPrtnBy_2] org.apache.druid.sql.calcite.BaseCalciteQueryTest - row #21: [2147483647, -1]

java.lang.RuntimeException: Can't parse input!

Although the test ran into a runtime exception, we can see that the results that are logged shows the presence of [null, null]. Hence, this means that the partial query generated in this case also does not follow a null last comparator logic. Oh, before you panic, I have changed my edits to woutPrtnBy_2 after making such tests 😅.

Seeing that the behaviour is the same across different ColumnTypes, I think that fixing this bug (if do we see it as one) will require another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adarshsanjeev
Copy link
Contributor

Thanks for making the PR! I'll try to review this soon.

@GWphua
Copy link
Contributor Author

GWphua commented Nov 4, 2024

Hello @adarshsanjeev, would like to ping and see how's it going 😄

Copy link

github-actions bot commented Jan 4, 2025

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the [email protected] list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Min() and Max() aggregate functions on string columns
3 participants