-
Notifications
You must be signed in to change notification settings - Fork 172
Description
Is your feature request related to a problem?
When PPL queries using stats count() by span(<date>)
, it generates unnecessarily slow queries due to redundant aggregations. The current implementation creates both a composite date_histogram aggregation AND a separate value_count
aggregation on the _index
field, even though the date_histogram
buckets already provide a doc_count
field with the same information from the underlying DSL.
PPL:
"query": "source = {{index_name | default('big5')}} | stats count() by span(`@timestamp`, 1h)"
PPL query plan:
"calcite": {
"logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n LogicalProject(count()=[$1], span(`@timestamp`,1h)=[$0])\n LogicalAggregate(group=[{0}], count()=[COUNT()])\n LogicalProject(span(`@timestamp`,1h)=[SPAN($17, 1, 'h')])\n CalciteLogicalIndexScan(table=[[OpenSearch, big5]])\n",
"physical": "EnumerableLimit(fetch=[10000])\n EnumerableCalc(expr#0..1=[{inputs}], count()=[$t1], span(`@timestamp`,1h)=[$t0])\n CalciteEnumerableIndexScan(table=[[OpenSearch, big5]], PushDownContext=[[AGGREGATION->rel#5755:LogicalAggregate.NONE.[](input=RelSubset#5754,group={0},count()=COUNT())], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":0,\"timeout\":\"1m\",\"aggregations\":{\"composite_buckets\":{\"composite\":{\"size\":1000,\"sources\":[{\"span(`@timestamp`,1h)\":{\"date_histogram\":{\"field\":\"@timestamp\",\"missing_bucket\":true,\"missing_order\":\"first\",\"order\":\"asc\",\"fixed_interval\":\"1h\"}}}]},\"aggregations\":{\"count()\":{\"value_count\":{\"field\":\"_index\"}}}}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])\n"
}
DSL in PPL plan:
"opensearch_dsl": {
"from": 0,
"size": 0,
"timeout": "1m",
"aggregations": {
"composite_buckets": {
"composite": {
"size": 1000,
"sources": [
{
"span(`@timestamp`,1h)": {
"date_histogram": {
"field": "@timestamp",
"missing_bucket": true,
"missing_order": "first",
"order": "asc",
"fixed_interval": "1h"
}
}
}
]
},
"aggregations": {
"count()": {
"value_count": {
"field": "_index"
}
}
}
}
}
}
What solution would you like?
PPL-to-DSL translation can be optimized to avoid creating redundant value_count
aggregations when using count() with date histogram aggregations on span().
Do you have any additional context?
Performance comparison of DSL queries:
Simple date histogram (fast):
{
"size": 0,
"aggs": {
"by_hour": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "hour"
}
}
}
}
With additional agg on count() (10x slower):
{
"size": 0,
"aggs": {
"by_hour": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "hour"
}
},
"count()": {
"value_count": {
"field": "_index"
}
}
}
}
Metadata
Metadata
Assignees
Labels
Type
Projects
Status