[FEATURE] Optimize PPL count aggregation performance by utilizing native doc_count

**Is your feature request related to a problem?**

When PPL queries using `stats count() by span(<date>)`, it generates unnecessarily slow queries due to redundant aggregations. The current implementation creates both a composite date_histogram aggregation AND a separate `value_count` aggregation on the `_index` field, even though the `date_histogram` buckets already provide a `doc_count` field with the same information from the underlying DSL.

PPL:
```
"query": "source = {{index_name | default('big5')}} | stats count() by span(`@timestamp`, 1h)"
```

PPL query plan:
```
"calcite": {
  "logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n  LogicalProject(count()=[$1], span(`@timestamp`,1h)=[$0])\n    LogicalAggregate(group=[{0}], count()=[COUNT()])\n      LogicalProject(span(`@timestamp`,1h)=[SPAN($17, 1, 'h')])\n        CalciteLogicalIndexScan(table=[[OpenSearch, big5]])\n",
  "physical": "EnumerableLimit(fetch=[10000])\n  EnumerableCalc(expr#0..1=[{inputs}], count()=[$t1], span(`@timestamp`,1h)=[$t0])\n    CalciteEnumerableIndexScan(table=[[OpenSearch, big5]], PushDownContext=[[AGGREGATION->rel#5755:LogicalAggregate.NONE.[](input=RelSubset#5754,group={0},count()=COUNT())], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"span(`@timestamp`,1h)\":{\"date_histogram\":{\"field\":\"@timestamp\",\"missing_bucket\":true,\"missing_order\":\"first\",\"order\":\"asc\",\"fixed_interval\":\"1h\"}}}]},\"aggregations\":{\"count()\":{\"value_count\":{\"field\":\"_index\"}}}}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])\n"
}	
```

DSL in PPL plan:
```
"opensearch_dsl": {
  "from": 0,
  "size": 0,
  "timeout": "1m",
  "aggregations": {
    "composite_buckets": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "span(`@timestamp`,1h)": {
              "date_histogram": {
                "field": "@timestamp",
                "missing_bucket": true,
                "missing_order": "first",
                "order": "asc",
                "fixed_interval": "1h"
              }
            }
          }
        ]
      },
      "aggregations": {
        "count()": {
          "value_count": {
            "field": "_index"
          }
        }
      }
    }
  }
}	
```

**What solution would you like?**

PPL-to-DSL translation can be optimized to avoid creating redundant `value_count` aggregations when using count() with date histogram aggregations on span(). 


**Do you have any additional context?**

Performance comparison of DSL queries:

Simple date histogram (fast):

```
{
  "size": 0,
  "aggs": {
    "by_hour": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "hour"
      }
    }
  }
}
```

With additional agg on count() (10x slower):

```
{
  "size": 0,
  "aggs": {
    "by_hour": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "hour"
      }
    },
    "count()": {
      "value_count": {
        "field": "_index"
      }
    }
  }
}
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[FEATURE] Optimize PPL count aggregation performance by utilizing native doc_count #4265

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEATURE] Optimize PPL count aggregation performance by utilizing native doc_count #4265

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions