[Feature Request] [RFC] Grouping similar Top N Queries by Latency and Resource Usage

### Is your feature request related to a problem? Please describe

### Background

As part of the [search query categorization](https://github.com/opensearch-project/OpenSearch/issues/11596) initiative, we added support for computing the query shape of incoming search queries and logging the basic query shape at debug level. We also added instrumentation on the search path to collect the types of queries, types of aggregations, sort order, etc.

As part of [Top N queries by latency](https://github.com/opensearch-project/OpenSearch/issues/11186), we implemented a priority queue-based in-memory data store, with configurable window size, on the coordinator node, designed to efficiently store the top N queries. 


### Problem

For Top N queries by latency, we can encounter scenarios where some (or most) of the Top N queries are duplicates of each other. Say the same dashboard query is triggered continuously and happens to be the most expensive query in terms of latency. In this scenario the Top N queries by latency will likely be dominated by that one query. To overcome such scenarios and to get a more detailed view of the Top N query patterns, we are proposing to implement Top N query shapes by resource usage (latency, memory, disk, etc).


### Describe the solution you'd like

<h2>Design</h2><h3>Query Shape</h3>For every incoming query, we will compute the query shape. The query shape is essentially lossy transformations on the actual query to help de-duplicate the top N queries. Consider the sample query:
<pre>
{
  "query": {
    "bool": {
      "must": [
        { "term": { "product_category": "Electronics" } },
        { "term": { "product_description": "Smartphone" } },
        { "range": { "purchase_date": { "gte": "2022-01-01", "lte": "2022-12-31" } } }
      ],
      "must_not": [
        { "exists": { "field": "delivery_date" } }
      ],
      "filter": {
        "terms": { "customer_type": ["New Customers", "Returning Customers"] }
      },
      "should": [
        { "match": { "customer_feedback": "Great product" } }
      ]
    }
  },
  "sort": [{ "product_rating": {"order": "asc"}}],
  "aggs": {
    "terms_agg": {
      "terms": { "field": "product_category" },
      "aggs": {
        "date_histogram_agg": { "date_histogram": { "field": "delivery_date", "calendar_interval": "month" } },
        "avg_agg": { "avg": { "field": "product_rating" } },
        "sum_agg": { "sum": { "field": "product_rating" } },
        "min_agg": { "min": { "field": "product_rating" } },
        "max_agg": { "max": { "field": "product_rating" } },
        "cardinality_agg": { "cardinality": { "field": "product_category" } },
        "percentile_ranks_agg": { "percentile_ranks": { "field": "product_rating", "values": [25, 50, 75] } }
      }
    }
  }
}


</pre>

The proposed query shape for the above query will look something like:
<pre>

bool
  filter:
    terms [customer_type:["New Customers","Returning Customers"]]
  must:
    range [field:purchase_date, width:12, unit: months]
    term [product_category:"Electronics"]
    term [product_description:"Smartphone"]
  must_not:
    exists [field:delivery_date]
  should:
    match [customer_feedback:"Great product"]
sort:
  asc [product_rating:asc]
aggregation:
  terms [field:product_category]
    aggregation:
      avg [field:product_rating]
      cardinality [field:product_category]
      date_histogram [field:delivery_date, calendar_interval:"month"]
      max [field:product_rating]
      min [field:product_rating]
      percentile_ranks [field:product_rating]
      sum [field:product_rating]
pipeline_aggregation:

</pre>
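
As an illustration of how the `range` line in a shape like the one above could be derived, here is a minimal Python sketch. The RFC does not say how the width or its unit is chosen, so the month unit, the round-up rule and the helper names `range_width_in_months` and `range_shape` are assumptions made for this example only.

```python
from datetime import date


def range_width_in_months(gte: str, lte: str) -> int:
    """Width of an inclusive ISO date range, rounded up to whole months (assumed rule)."""
    start, end = date.fromisoformat(gte), date.fromisoformat(lte)
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # A partially covered trailing month counts as a full month.
    if end.day >= start.day:
        months += 1
    return months


def range_shape(field: str, bounds: dict) -> str:
    """Render a range clause using only the field name and the range width."""
    width = range_width_in_months(bounds["gte"], bounds["lte"])
    return f"range [field:{field}, width:{width}, unit: months]"


# Two windows of the same size collapse to the same shape line.
shape_2022 = range_shape("purchase_date", {"gte": "2022-01-01", "lte": "2022-12-31"})
shape_2023 = range_shape("purchase_date", {"gte": "2023-01-01", "lte": "2023-12-31"})
```

Because the bounds themselves never appear in the output, a dashboard query that slides its window forward keeps the same shape as long as the window size is unchanged.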

The query shape captures the structure of the query, sort, aggregations and pipeline aggregations. It records the sort order, field names, field values and range width (the difference between the upper and lower bounds of the range). Capturing the range width rather than the bounds lets us de-duplicate queries that have the same range size but a different window, for example the same dashboard query run continuously over different time ranges.<br><br><h3>Normalize Query Shape</h3>We need to normalize the query shape so that two equivalent queries always map to exactly the same query shape, regardless of the ordering of the clauses, of the queries within the clauses, of the aggregations, etc. The following normalizations will be required:<br>

1. Ordering of Query Shape Major Components: The query shape as shown above is divided into 4 distinct parts. The ordering of these parts will be as shown above: QueryBuilder tree, Sort, Aggregation tree followed by Pipeline aggregations.
2. Ordering of Bool Clauses: The order of the four boolean clauses will follow a lexicographic sequence as follows: filter, must, must_not followed by should.
3. Order of Sub-Queries within a Clause: The order of sub-queries within a clause will follow a lexicographic sequence.
4. Order of Aggregation types: The order of Aggregation types will follow a lexicographic sequence.
5. Order of Pipeline aggregation types: The order of Pipeline aggregation types will follow a lexicographic sequence.
6. Order of Sorts: Sorts will be lexicographically ordered based on sort order.
7. Order of same query type, aggregation type, sort order: If multiple queries, aggregation types, pipeline aggregations or sorts have the same type we will sort based on the real field name (not obfuscated field name).
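
The ordering rules for bool clauses and sub-queries (rules 2, 3 and 7) can be sketched in Python as follows. This is a simplified illustration, not the proposed implementation: each sub-query is reduced to a hypothetical `(query_type, field_name)` pair, field values are left out, and SHA-256 is an assumed choice of hash function.

```python
import hashlib

# Rule 2: the four bool clauses in lexicographic order.
BOOL_CLAUSES = ("filter", "must", "must_not", "should")


def normalize_bool(bool_query: dict) -> list:
    """Return shape lines for a bool query with clauses and sub-queries in canonical order."""
    lines = ["bool"]
    for clause in BOOL_CLAUSES:
        sub_queries = bool_query.get(clause, [])
        if not sub_queries:
            continue
        lines.append(f"  {clause}:")
        # Rules 3 and 7: tuples sort by query type first, then by field name.
        for query_type, field in sorted(sub_queries):
            lines.append(f"    {query_type} [field:{field}]")
    return lines


def shape_hash(lines: list) -> str:
    """Hash the normalized shape text so equivalent queries share one key."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# The same query written with different clause and sub-query ordering.
query_a = {
    "must": [("term", "product_description"), ("range", "purchase_date"), ("term", "product_category")],
    "filter": [("terms", "customer_type")],
}
query_b = {
    "filter": [("terms", "customer_type")],
    "must": [("range", "purchase_date"), ("term", "product_category"), ("term", "product_description")],
}
same_shape = shape_hash(normalize_bool(query_a)) == shape_hash(normalize_bool(query_b))
```

Sorting on the `(type, field)` pair keeps the tie-break on field name in one place, so two sub-queries of the same type are always emitted in the same order.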

<h3>Top N Query Shapes</h3>We will create a new processor in the Query Insights framework to keep track of the Top N query shapes in the current window, similar to what we do for Top N Queries by latency.<br><br>For every window, we will maintain a map from the query shape hash to (count_of_queries, total_latency_across_all_queries). This will enable us to calculate the average latency for each query shape. We will also have a priority queue to keep track of the Top N query shapes for the window.<br><br>The algorithm will be as follows:


![query_shape_insights](https://github.com/opensearch-project/OpenSearch/assets/13984468/6393e824-bcf5-4fac-b9da-e88bdb355e98)
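
The figure above defines the proposed flow. As a rough companion, the per-window bookkeeping described in the text (a map from shape hash to count and total latency, ranked by average latency) might be sketched in Python like this. The class name `ShapeWindow` and its methods are hypothetical, and for brevity the sketch ranks shapes with `heapq.nlargest` when the top N is read instead of maintaining a priority queue on every insert.

```python
import heapq
from collections import defaultdict


class ShapeWindow:
    """Tracks per-shape latency statistics for one window (illustrative only)."""

    def __init__(self, top_n: int):
        self.top_n = top_n
        # shape hash -> [count_of_queries, total_latency_across_all_queries]
        self.stats = defaultdict(lambda: [0, 0.0])

    def record(self, shape_hash: str, latency_ms: float) -> None:
        entry = self.stats[shape_hash]
        entry[0] += 1
        entry[1] += latency_ms

    def top_shapes(self) -> list:
        """Return up to top_n (average_latency, shape_hash) pairs, highest average first."""
        averages = ((total / count, h) for h, (count, total) in self.stats.items())
        return heapq.nlargest(self.top_n, averages)

    def reset(self) -> None:
        """Clear all statistics at the end of the window."""
        self.stats.clear()


window = ShapeWindow(top_n=2)
for shape, latency in [("a", 100.0), ("a", 300.0), ("b", 150.0), ("c", 50.0)]:
    window.record(shape, latency)
top = window.top_shapes()
```

Here shape `a` is seen twice and contributes a single entry with the average of its two latencies, which is the de-duplication effect the proposal is after.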


After every window we will clear the hash to (count, latency) mapping and Priority Queue. We will [export](https://github.com/opensearch-project/OpenSearch/pull/12982/files) the Top N query shape data for the previous window to a local index similar to what we are aiming to do for the Top N queries by latency.<br><br><h3>APIs</h3>We will be extending the Top N queries by resource usage APIs as part of query shape insights.<br>Existing APIs:<br><pre># get top n queries by latency in the last window<br>GET /insights/top_queries?type=latency<br># get top n queries by cpu in the last window<br>GET /insights/top_queries?type=cpu</pre>Additions:<br><pre># get top n query shapes by latency in the last window<br>GET /insights/top_queries?type=latency&group_by=shape<br># get top n query shapes by cpu in the last window<br>GET /insights/top_queries?type=cpu&group_by=shape</pre>
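
For client code, the request paths listed above could be built with a small Python helper such as the one below. The helper name `top_queries_path` is hypothetical, and `group_by=shape` is the parameter proposed in this RFC rather than an existing one.

```python
from urllib.parse import urlencode


def top_queries_path(metric: str, group_by: str = None) -> str:
    """Build the top-queries request path; group_by is the parameter proposed here."""
    params = {"type": metric}
    if group_by is not None:
        params["group_by"] = group_by
    return "/insights/top_queries?" + urlencode(params)


existing_path = top_queries_path("latency")
proposed_path = top_queries_path("cpu", group_by="shape")
```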





### Related component

Search

### Describe alternatives you've considered

1. We can use the **actual query** to de-duplicate the Top N queries by latency. However, this will not group queries that are identical except for the range bounds, for example the same dashboard query that is run continuously but for different time ranges. It will also not group identical queries whose clauses are ordered differently.
2. Use a more generic query shape that does not contain the field names, field values and range width information. Similar to:
```
bool
  must:
    term
    terms
  filter:
    constant_score
      filter:
        range
  should:
    bool
      must:
        match
      must_not:
        regexp
```
This approach might end up grouping queries with varying latency characteristics into the same group. For example, the same query on two different fields can exhibit different latency because of cardinality differences between the fields (one high cardinality, the other low cardinality).


Please let me know your thoughts on the above proposal!
