RFC-0005 Phase 2, annotated string functions and added tests. #1

Open
wants to merge 3 commits into
base: RFC_5_UDF_STATS_PHASE_1

Conversation

ScrapCodes
Owner

@ScrapCodes ScrapCodes commented Oct 9, 2024

Description

  1. Annotated the scalar functions in the StringFunctions class with ScalarFunctionConstantStats and ScalarPropagateSourceStats.
  2. Added appropriate tests to check if the stats propagation works as expected.
  3. Updated AbstractCostBasedPlanTests to generate plans with this feature on and off.

Motivation and Context

https://github.com/prestodb/rfcs/blob/main/RFC-0005-functions-stats.md

Impact

None, unless the user enables the feature by setting the session flag or the feature config.
A new session flag, scalar_function_stats_propagation_enabled, and a new feature config, optimizer.scalar-function-stats-propagation-enabled, will be introduced. Setting either one turns this feature on or off.

When the feature is enabled, the effect of stats propagation can be measured, or seen as plan changes, because the string functions are annotated.

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_2 branch 3 times, most recently from 7f5dc5c to 55e620a Compare October 9, 2024 10:34
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from a9f04a8 to 99302d3 Compare October 10, 2024 08:34
1. Support for annotating functions with both constant stats and propagating source stats.
2. Added tests for the same.
3. Added Scalar stats calculation based on annotation and tests for the same.

SQLInvokedScalarFunctions are not added.
Built-in functions are not annotated, as that is covered in the next implementation phase.
C++ changes are not added, as this phase covers only the Java side of the changes.

Added documentation for the new properties and ...
 1. Previously, if any of the source stats were missing, we would still compute the max/min/sum etc. of the argument stats.
  Now we propagate NaN if the stats of any one argument are missing.

2. For distinct values count, upper bounding it to the row count is as good as unknown. Therefore, when distinctValuesCount is provided via annotation and is greater than the row count, we set it to unknown.
The function developer has full control here: for example, the developer can choose whether to upper bound by selecting the appropriate StatsPropagationBehavior value.

 3. For average row size:
    a) If the average row size is provided via the ScalarFunctionConstantStats annotation, we allow it even if the size is greater than the function's return type width.
    b) If the average row size is provided via one of the StatsPropagationBehavior values, we upper bound it to the function's return type width, if available.
    If both (a) and (b) are unknown, we default to the function's return type width, if available.

This way the function developer has greater control.
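
The three rules above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code in this PR: the class and method names are hypothetical, NaN stands for "unknown" as in the description above, and the choices to let a constant size win over a propagated one and to leave the distinct values count untouched when the row count is unknown are assumptions.

```java
// Hypothetical sketch; none of these names come from the Presto code base.
final class ScalarStatsRulesSketch
{
    private ScalarStatsRulesSketch() {}

    // Rule 1: if any argument statistic is missing (NaN), the result is NaN.
    // Shown for "max"; min and sum would follow the same pattern.
    static double maxOfArguments(double... argumentStats)
    {
        if (argumentStats.length == 0) {
            return Double.NaN;
        }
        double result = Double.NEGATIVE_INFINITY;
        for (double value : argumentStats) {
            if (Double.isNaN(value)) {
                return Double.NaN;
            }
            result = Math.max(result, value);
        }
        return result;
    }

    // Rule 2: an annotated distinct values count above the row count becomes unknown.
    // Assumption: with an unknown (NaN) row count the comparison is false and the value is kept.
    static double annotatedDistinctValuesCount(double annotatedNdv, double rowCount)
    {
        return annotatedNdv > rowCount ? Double.NaN : annotatedNdv;
    }

    // Rule 3: average row size.
    static double averageRowSize(double constantSize, double propagatedSize, double returnTypeWidth)
    {
        if (!Double.isNaN(constantSize)) {
            // (a) constant from the annotation is accepted as-is, even above the type width
            return constantSize;
        }
        if (!Double.isNaN(propagatedSize)) {
            // (b) propagated value is capped at the return type width when that width is known
            return Double.isNaN(returnTypeWidth) ? propagatedSize : Math.min(propagatedSize, returnTypeWidth);
        }
        // neither is known: fall back to the return type width (which may itself be NaN)
        return returnTypeWidth;
    }

    public static void main(String[] args)
    {
        System.out.println(maxOfArguments(1.0, Double.NaN));
        System.out.println(annotatedDistinctValuesCount(200, 100));
        System.out.println(averageRowSize(Double.NaN, 40, 8));
    }
}
```

Returning NaN instead of clamping in rule 2 keeps a suspicious annotation from looking like a real estimate, which is the reasoning given in item 2.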

Added a new behaviour, SUM_ARGUMENTS_UPPER_BOUND_ROW_COUNT, which upper bounds the summed value to the row count, so that the sum of the distinct values counts does not exceed the row count.
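
A minimal sketch of what a behaviour like SUM_ARGUMENTS_UPPER_BOUND_ROW_COUNT computes, assuming NaN marks an unknown statistic. The class and method names are hypothetical and are not taken from this PR; returning the uncapped sum when the row count itself is unknown is also an assumption.

```java
// Hypothetical sketch; not the implementation in this PR.
final class SumUpperBoundSketch
{
    private SumUpperBoundSketch() {}

    // Sums the per-argument distinct values counts and caps the total at the row count.
    static double sumArgumentsUpperBoundRowCount(double rowCount, double... argumentNdvs)
    {
        double sum = 0;
        for (double ndv : argumentNdvs) {
            if (Double.isNaN(ndv)) {
                // a missing argument statistic makes the whole result unknown
                return Double.NaN;
            }
            sum += ndv;
        }
        // Assumption: without a known row count there is nothing to cap against.
        return Double.isNaN(rowCount) ? sum : Math.min(sum, rowCount);
    }

    public static void main(String[] args)
    {
        System.out.println(sumArgumentsUpperBoundRowCount(100, 80, 90));
    }
}
```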
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from 99302d3 to 9d77026 Compare October 15, 2024 10:03
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_2 branch 2 times, most recently from 608d2ab to 8962804 Compare October 21, 2024 07:06
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_2 branch 3 times, most recently from 4eab57e to 4dc9602 Compare October 24, 2024 09:33
…ions` class, with `ScalarFunctionConstantStats` and `ScalarPropagateSourceStats` .

2. Added appropriate tests to check if the stats propagation works as expected.
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_2 branch from 4dc9602 to c295d03 Compare October 24, 2024 10:03
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from e248c01 to 40b6f47 Compare November 20, 2024 07:07
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from 9551c16 to 7693879 Compare December 13, 2024 09:09
@ScrapCodes ScrapCodes force-pushed the RFC_5_UDF_STATS_PHASE_1 branch from c334fdb to 07d2d0c Compare December 28, 2024 17:05