Skip to content

[FEATURE] Optimize PPL count aggregation performance by utilizing native doc_count #4265

@noCharger

Description

@noCharger

Is your feature request related to a problem?

When PPL queries using stats count() by span(<date>), it generates unnecessarily slow queries due to redundant aggregations. The current implementation creates both a composite date_histogram aggregation AND a separate value_count aggregation on the _index field, even though the date_histogram buckets already provide a doc_count field with the same information from the underlying DSL.

PPL:

"query": "source = {{index_name | default('big5')}} | stats count() by span(`@timestamp`, 1h)"

PPL query plan:

"calcite": {
  "logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n  LogicalProject(count()=[$1], span(`@timestamp`,1h)=[$0])\n    LogicalAggregate(group=[{0}], count()=[COUNT()])\n      LogicalProject(span(`@timestamp`,1h)=[SPAN($17, 1, 'h')])\n        CalciteLogicalIndexScan(table=[[OpenSearch, big5]])\n",
  "physical": "EnumerableLimit(fetch=[10000])\n  EnumerableCalc(expr#0..1=[{inputs}], count()=[$t1], span(`@timestamp`,1h)=[$t0])\n    CalciteEnumerableIndexScan(table=[[OpenSearch, big5]], PushDownContext=[[AGGREGATION->rel#5755:LogicalAggregate.NONE.[](input=RelSubset#5754,group={0},count()=COUNT())], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"span(`@timestamp`,1h)\":{\"date_histogram\":{\"field\":\"@timestamp\",\"missing_bucket\":true,\"missing_order\":\"first\",\"order\":\"asc\",\"fixed_interval\":\"1h\"}}}]},\"aggregations\":{\"count()\":{\"value_count\":{\"field\":\"_index\"}}}}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])\n"
}	

DSL in PPL plan:

"opensearch_dsl": {
  "from": 0,
  "size": 0,
  "timeout": "1m",
  "aggregations": {
    "composite_buckets": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "span(`@timestamp`,1h)": {
              "date_histogram": {
                "field": "@timestamp",
                "missing_bucket": true,
                "missing_order": "first",
                "order": "asc",
                "fixed_interval": "1h"
              }
            }
          }
        ]
      },
      "aggregations": {
        "count()": {
          "value_count": {
            "field": "_index"
          }
        }
      }
    }
  }
}	

What solution would you like?

PPL-to-DSL translation can be optimized to avoid creating redundant value_count aggregations when using count() with date histogram aggregations on span().

Do you have any additional context?

Performance comparison of DSL queries:

Simple date histogram (fast):

{
  "size": 0,
  "aggs": {
    "by_hour": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "hour"
      }
    }
  }
}

With additional agg on count() (10x slower):

{
  "size": 0,
  "aggs": {
    "by_hour": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "hour"
      }
    },
    "count()": {
      "value_count": {
        "field": "_index"
      }
    }
  }
}

Metadata

Metadata

Assignees

Labels

PPLPiped processing languageenhancementNew feature or requestpushdownpushdown related issues

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions