Description
Is your feature request related to a problem? Please describe
Background
As part of the search query categorization initiative, we added support for computing the query shape of incoming search queries and logged the basic query shape at debug level. We also added instrumentation on the search path to collect the types of queries, types of aggregations, sort order, etc.
As part of Top N queries by latency, we implemented a priority queue-based in-memory data store, with configurable window size, on the coordinator node, designed to efficiently store the top N queries.
Problem
For Top N queries by latency, we can encounter scenarios where some (or most) of the Top N queries contain duplicate queries. Say the same dashboard query is triggered continuously and happens to be the most expensive query in terms of latency - in this scenario all the Top N queries by latency will likely be spammed by the same query. To overcome such scenarios and to get a more detailed view of the Top N query patterns we are proposing to implement Top N query shapes by resource usage (latency, memory, disk, etc).
Describe the solution you'd like
Design
Query Shape
For every incoming query, we will compute the query shape. The query shape is essentially a set of lossy transformations on the actual query that help de-duplicate the top N queries. Consider the sample query:
{
  "query": {
    "bool": {
      "must": [
        { "term": { "product_category": "Electronics" } },
        { "term": { "product_description": "Smartphone" } },
        { "range": { "purchase_date": { "gte": "2022-01-01", "lte": "2022-12-31" } } }
      ],
      "must_not": [
        { "exists": { "field": "delivery_date" } }
      ],
      "filter": { "terms": { "customer_type": ["New Customers", "Returning Customers"] } },
      "should": [
        { "match": { "customer_feedback": "Great product" } }
      ]
    }
  },
  "sort": [{ "product_rating": { "order": "asc" } }],
  "aggs": {
    "terms_agg": {
      "terms": { "field": "product_category" },
      "aggs": {
        "date_histogram_agg": { "date_histogram": { "field": "delivery_date", "calendar_interval": "month" } },
        "avg_agg": { "avg": { "field": "product_rating" } },
        "sum_agg": { "sum": { "field": "product_rating" } },
        "min_agg": { "min": { "field": "product_rating" } },
        "max_agg": { "max": { "field": "product_rating" } },
        "cardinality_agg": { "cardinality": { "field": "product_category" } },
        "percentile_ranks_agg": { "percentile_ranks": { "field": "product_rating", "values": [25, 50, 75] } }
      }
    }
  }
}
The proposed query shape for the above query will look something like:
bool
  must:
    range [field:purchase_date, width:12, unit: months]
    term [product_category:"Electronics"]
    term [product_description:"Smartphone"]
  must_not:
    exists [field:delivery_date]
  should:
    match [customer_feedback:"Great product"]
  filter:
    terms [customer_type:["New Customers","Returning Customers"]]
sort:
  asc [product_rating:asc]
aggregation:
  terms [field:product_category]
    aggregation:
      avg [field:product_rating]
      cardinality [field:product_category]
      date_histogram [field:delivery_date, calendar_interval:"month"]
      max [field:product_rating]
      min [field:product_rating]
      percentile_ranks [field:product_rating]
      sum [field:product_rating]
pipeline_aggregation:
We are capturing the shape of the query, sort, aggregations and pipeline aggregations. We capture the sort order, field names, field values and range width (difference between upper-bound and lower-bound of the range). The range width is captured to help us de-duplicate queries with the same range size but different window. For example, the same dashboard query that is run continuously but for different time ranges.
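To illustrate why the range width helps with de-duplication, here is a minimal sketch. The function names (`range_width_months`, `range_shape`) and the rounding rule (inclusive day count divided by an average month length) are assumptions made for illustration only; the proposal does not specify how the width or its unit is derived.

```python
from datetime import date

# Assumption: average Gregorian month length, used only to round days to months.
AVG_DAYS_PER_MONTH = 30.4375


def range_width_months(gte: str, lte: str) -> int:
    """Width of an inclusive ISO-date range, rounded to whole months."""
    start = date.fromisoformat(gte)
    end = date.fromisoformat(lte)
    days = (end - start).days + 1  # count both bounds
    return round(days / AVG_DAYS_PER_MONTH)


def range_shape(field: str, gte: str, lte: str) -> str:
    """Shape entry that keeps the field and the width but drops the bounds."""
    width = range_width_months(gte, lte)
    return f"range [field:{field}, width:{width}, unit: months]"


# Same width, different window: both map to the same shape entry.
shape_2022 = range_shape("purchase_date", "2022-01-01", "2022-12-31")
shape_2021 = range_shape("purchase_date", "2021-01-01", "2021-12-31")
```

Under these assumptions, the two yearly windows produce an identical entry, while a six-month window on the same field produces a different one.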
Normalize Query Shape
We need to normalize the query shape to make sure that two identical queries always map to exactly the same query shape, including the ordering of the clauses, the ordering of the queries within the clauses, the ordering of the aggregations, etc. The following normalizations will be required:
- Ordering of Query Shape Major Components: The query shape as shown above is divided into 4 distinct parts. The ordering of these parts will be as shown above: QueryBuilder tree, Sort, Aggregation tree followed by Pipeline aggregations.
- Ordering of Bool Clauses: The order of the four boolean clauses will follow a lexicographic sequence as follows: filter, must, must_not followed by should.
- Order of Sub-Queries within a Clause: The order of sub-queries within a clause will follow a lexicographic sequence.
- Order of Aggregation types: The order of Aggregation types will follow a lexicographic sequence.
- Order of Pipeline aggregation types: The order of Pipeline aggregation types will follow a lexicographic sequence.
- Order of Sorts: Sorts will be lexicographically ordered based on sort order.
- Order of same query type, aggregation type, sort order: If multiple queries, aggregation types, pipeline aggregations or sorts have the same type we will sort based on the real field name (not obfuscated field name).
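The ordering rules for bool clauses and sub-queries can be sketched as follows. This is a simplified illustration, not the proposed implementation: the input representation (a dict of clause name to a list of `(query_type, field_name)` pairs) and the function name `normalize_bool` are hypothetical, and only leaf queries are handled.

```python
def normalize_bool(clauses: dict) -> str:
    """Render a bool query shape with deterministic ordering.

    `clauses` maps a clause name (filter, must, must_not, should) to a list
    of (query_type, field_name) tuples.
    """
    lines = ["bool"]
    # Lexicographic clause order: filter, must, must_not, should.
    for clause in sorted(clauses):
        lines.append(f"  {clause}:")
        # Sort by query type first, then by the real field name as tie-breaker.
        for query_type, field in sorted(clauses[clause]):
            lines.append(f"    {query_type} [field:{field}]")
    return "\n".join(lines)


# The same query written with two different clause and sub-query orderings.
query_a = {
    "must": [("term", "product_description"), ("range", "purchase_date"),
             ("term", "product_category")],
    "filter": [("terms", "customer_type")],
}
query_b = {
    "filter": [("terms", "customer_type")],
    "must": [("range", "purchase_date"), ("term", "product_category"),
             ("term", "product_description")],
}
shape_a = normalize_bool(query_a)
shape_b = normalize_bool(query_b)
```

Both inputs yield the same string, so hashing the normalized shape gives the same key regardless of how the client ordered the clauses.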
Top N Query Shapes
We will create a new processor in the Query Insights framework to keep track of the Top N query shapes in the current window, similar to what we do for Top N Queries by latency. For every window, we will maintain a mapping from the query shape hash to (count_of_queries, total_latency_across_all_queries). This will enable us to calculate the average latency for the queries that share a shape. We will also have a priority queue to keep track of the Top N query shapes for the window.
The algorithm will be as follows:
After every window we will clear the hash to (count, latency) mapping and Priority Queue. We will export the Top N query shape data for the previous window to a local index similar to what we are aiming to do for the Top N queries by latency.
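The per-window bookkeeping described in this section might look roughly like the sketch below. It is written under several assumptions that the proposal does not confirm: shapes are hashed with SHA-256, ranking is by average latency, and the top N is selected with `heapq.nlargest` when the window closes rather than with a continuously maintained priority queue. The class and method names are hypothetical.

```python
import hashlib
import heapq


class TopNShapes:
    """Tracks (count, total latency) per query shape hash for one window."""

    def __init__(self, n: int):
        self.n = n
        self.stats = {}   # shape hash -> [count, total_latency_ms]
        self.shapes = {}  # shape hash -> normalized shape string

    def record(self, shape: str, latency_ms: float) -> None:
        key = hashlib.sha256(shape.encode("utf-8")).hexdigest()
        entry = self.stats.setdefault(key, [0, 0.0])
        entry[0] += 1
        entry[1] += latency_ms
        self.shapes[key] = shape

    def top(self) -> list:
        """Return up to n (avg_latency_ms, count, shape) rows, highest first."""
        rows = [
            (total / count, count, self.shapes[key])
            for key, (count, total) in self.stats.items()
        ]
        return heapq.nlargest(self.n, rows)

    def roll_window(self) -> list:
        """Snapshot the top N for export, then clear state for the next window."""
        result = self.top()
        self.stats.clear()
        self.shapes.clear()
        return result


tracker = TopNShapes(n=2)
tracker.record("shape-a", 10)
tracker.record("shape-a", 30)  # same shape: aggregated, not duplicated
tracker.record("shape-b", 50)
tracker.record("shape-c", 5)
previous_window = tracker.roll_window()
```

Here the repeated `shape-a` query occupies a single slot with an average latency of 20 ms instead of filling the Top N list with duplicates; the value returned by `roll_window` is what would be exported for the previous window.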
APIs
We will be extending the Top N queries by resource usage APIs as part of query shape insights.
Existing APIs:
# get top n queries by latency in the last window
GET /insights/top_queries?type=latency
# get top n queries by cpu in the last window
GET /insights/top_queries?type=cpu
Additions:
# get top n query shapes by latency in the last window
GET /insights/top_queries?type=latency&group_by=shape
# get top n query shapes by cpu in the last window
GET /insights/top_queries?type=cpu&group_by=shape
Related component
Search
Describe alternatives you've considered
- We can use the actual query to de-duplicate the Top N queries by latency. However, this will not group queries that are identical except for their range bounds (same range width, different window). For example, the same dashboard query that is run continuously but for different time ranges. It will also not group identical queries whose clauses are merely ordered differently.
- Use a more generic query shape that does not contain the field names, field values and range width information. Similar to:
bool
  must:
    term
    terms
  filter:
    constant_score
      filter:
        range
  should:
    bool
      must:
        match
      must_not:
        regex
This approach might end up grouping queries with very different latency characteristics into the same group. For example, the same query run on two different fields can exhibit different latency because of cardinality differences between the fields (one high cardinality, the other low cardinality).
Please let me know your thoughts on the above proposal!