
[WLM] Automated labeling of search requests #16797

Open
kaushalmahi12 opened this issue Dec 6, 2024 · 11 comments
@kaushalmahi12
Contributor

kaushalmahi12 commented Dec 6, 2024

Author: Kaushal Kumar

Is your feature request related to a problem? Please describe

Recently we launched the WLM sub-feature, i.e., multitenant search resiliency. However, the feature still requires an external hint to be sent along with each request via an HTTP header, which puts the burden on the user to apply these hints intelligently.

This can become a real pain when access is programmatic; if not planned properly, programmatic multitenant access can become unmanageable. It would be much better if users could simply define rules that determine the right tenant for a certain class of requests (those conforming to a rule).

We have touched on this idea in the following RFCs.

In this issue I want to go over the high-level approach to achieve this.

Describe the solution you'd like

Given that this tagging component will sit directly on the search path, we will keep an efficient in-memory snapshot of the rules for faster processing. The label assignment will happen only once per request, at the coordinator node, irrespective of the number of shards it is going to hit.

Rules schema and Storage options

Rule Schema

{
   "attribute1": ["value*"],
   "attribute2": ["value*"],
   "label": "fjagjag9243421_425285",
   "updatedAt": "12-03-2024T18:00:23Z"
}
  • Cluster State - If we use search pipelines to encapsulate the rules for determining the label, the number of pipelines will soon explode, which can be detrimental to cluster state processing and could become a bottleneck for cluster stability. In addition, the cluster state is already quite bloated, so this is probably not a great option.
  • System Index - This will definitely help us decouple rule storage and processing from cluster-manager tasks. But since indices have no mechanism to propagate changes to all nodes, it will compel us to either periodically refresh the rules on all nodes or define custom request handlers to carry out the refresh.

In-memory Structure for Rules

Since we want to hold all the rules in memory and do fast prefix-based string matching, a trie data structure becomes a natural choice for this problem.

We will keep one trie per attribute in memory; each trie will give us a possible list of matching labels.
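
As a rough illustration, one such per-attribute trie could look like the sketch below. This is a minimal sketch only; the class and method names (AttributeRuleTrie, addRule) are hypothetical and not part of the actual implementation.

// Hypothetical sketch: one prefix trie per attribute, mapping rule values to labels.
import java.util.HashMap;
import java.util.Map;

class AttributeRuleTrie {

    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        String label; // label of the rule whose attribute value ends at this node, if any
    }

    private final Node root = new Node();

    /** Index one rule value (e.g. "logs" for the pattern "logs*") under its label. */
    void addRule(String value, String label) {
        Node node = root;
        for (char c : value.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.label = label;
    }
}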

Rules storage

The following diagram illustrates the rules storage process and how the structure evolves over time with incremental rule additions. [Note: in the diagrams I have used query groups, but this will be a generic label which other features can also use.]

[Diagram: RulesStorageAndMatching]

Rules Matching

Given that the rules are stored in an in-memory trie data structure, a single attribute value match could yield multiple results. There are the following scenarios for the string search in the trie:

  1. The node where the search ends already has a label value.
  2. The node where the search ends doesn't have a label but has some child subtrees. In this case the possible matches will be the queryGroupIds of the closest nodes below this node, to keep the list minimal.

Now, given these N lists of matches (one per attribute), we can select the item which appears in the most lists. If there is a tie, pick the one with the shortest depth in the tree. If the match still results in a tie even with depth as a parameter, we will use the lexicographically first query group from the list.
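
To make the selection step concrete, here is a minimal sketch of how the per-attribute candidate lists could be reduced to a single label under the rules above (most lists, then shortest trie depth, then lexicographic order). The names and the Candidate shape are assumptions for illustration, not the actual implementation.

import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LabelResolver {

    /** A candidate label together with the trie depth at which it matched (hypothetical shape). */
    record Candidate(String label, int depth) {}

    static String resolve(List<List<Candidate>> perAttributeMatches) {
        Map<String, Integer> listCount = new HashMap<>(); // label -> number of attribute lists containing it
        Map<String, Integer> minDepth = new HashMap<>();  // label -> shallowest depth observed for the label
        for (List<Candidate> matches : perAttributeMatches) {
            Set<String> seen = new HashSet<>();
            for (Candidate c : matches) {
                if (seen.add(c.label())) {
                    listCount.merge(c.label(), 1, Integer::sum);
                }
                minDepth.merge(c.label(), c.depth(), Math::min);
            }
        }
        return listCount.keySet().stream()
            .min(Comparator.comparing((String l) -> -listCount.get(l)) // most lists first
                .thenComparing(minDepth::get)                          // then shortest depth
                .thenComparing(Comparator.naturalOrder()))             // then lexicographically first
            .orElse(null);                                             // no rule matched at all
    }
}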

[Diagram: RuleMatching]

Related component

Search

Describe alternatives you've considered

No response

Additional context

No response

kaushalmahi12 added the enhancement and untriaged labels Dec 6, 2024
github-actions bot added the Search label Dec 6, 2024
kaushalmahi12 self-assigned this Dec 6, 2024
@kaushalmahi12
Contributor Author

kaushalmahi12 commented Dec 9, 2024

@msfroh @reta @jainankitk @backslasht @andrross
Can you provide your suggestions and review this?

@kaushalmahi12
Contributor Author

kaushalmahi12 commented Dec 9, 2024

If we want to separate out the rules per feature, we could add an additional field with a limited, choice-based value indicating which feature would use the rule.

The schema could look like the following:

{
   "attribute1": ["value*"],
   "attribute2": ["value*"],
   "label": "fjagjag9243421_425285",
   "updatedAt": "12-03-2024T18:00:23Z",
   "feature|SomeBetterName": "WLM"
}

@reta
Collaborator

reta commented Dec 11, 2024

Thanks @kaushalmahi12 (sorry for the delay).

System Index - This will definitely help us decouple rule storage and processing from cluster-manager tasks. But since indices have no mechanism to propagate changes to all nodes, it will compel us to either periodically refresh the rules on all nodes or define custom request handlers to carry out the refresh.

I think this is the right approach to manage rules. Also, I suspect the labeling (rule matching) should only be applied on coordinator node(s)? Regarding the data structures, I think it would be great to understand how exactly the attributes to match against are extracted from the search requests. Do we have an RFC/feature request for it? (Sorry if I missed it.)

@kaushalmahi12
Contributor Author

kaushalmahi12 commented Dec 11, 2024

Thanks @reta for looking into it.
I will be writing a detailed design for Rule Matching and the LLD; it will be part of the second sub-issue in the list of issues mentioned in #16813. This is just a high-level proposal that briefly outlines the approach.

I suspect the labeling (rule matching) should only be applied on coordinator node(s)?

Not sure if I follow completely, but IMO the mapping should be 1:1 between a user-level request and how we treat it within the system. That being said, maybe the msearch and mget APIs are exceptions.

@kkhatua
Member

kkhatua commented Jan 3, 2025

Not sure if I follow completely, but IMO the mapping should be 1:1 between a user-level request and how we treat it within the system. That being said, maybe the msearch and mget APIs are exceptions.
@kaushalmahi12
I think what @reta is saying is that you should only need to apply it once. From what I recall, that is already the case, because once a request is mapped to a resource group, all the child tasks (e.g. shard tasks) spawned across the cluster will also be in the same group.

Now, given these N lists of matches (one per attribute), we can select the item which appears in the most lists. If there is a tie, pick the one with the shortest depth in the tree. If the match still results in a tie even with depth as a parameter, we will use the lexicographically first query group from the list.
Regarding the above, we currently provide hints in the request header for mapping to a resource group. What will be the precedence?

@kaushalmahi12
Contributor Author

kaushalmahi12 commented Jan 3, 2025

Regarding the above, we currently provide hints in the request header for mapping to a resource group. What will be the precedence?

There are two choices for that.

  1. Remove the HTTP header altogether, given that the header provides a way to abuse system resources.
  2. Let the HTTP header be the fallback value for this.
    I think having a mechanism to decide whether to tag the request with a value seems more appropriate, because an entity can always keep passing a higher-limit queryGroupId in the header.

@jainankitk
Collaborator

Thanks @kaushalmahi12 for putting together this RFC. A few comments:

if users could simply define rules that determine the right tenant for a certain class of requests (those conforming to a rule)

IMO, only admins should be able to define the rules for assigning labels for Workload Management.

The label assignment will happen only once per request, at the coordinator node, irrespective of the number of shards it is going to hit

IMO, the evaluation can be on the coordinator node or the data node, depending on the type of rule. For example, if we allow box_type as one of the possible attributes, the evaluation for that rule should happen on the data node.

I feel the RFC is covering both the high-level details as well as low-level implementation details. IMO, we should keep this RFC limited to the customer experience we are planning to build; the low-level implementation design discussion is better done as a follow-up in a separate RFC.

@smacrakis

It would be nice to have a specification of the semantics of tag inferencing before going into implementation.

@kaushalmahi12
Contributor Author

kaushalmahi12 commented Jan 16, 2025

1. Introduction

Workload Management Overview

1.1 Purpose

This document specifies the requirements and functionality for the Workload Management feature in OpenSearch, which uses autotagging to categorize incoming search requests into QueryGroups. We will use Rules to categorize the search requests. Rules will be managed by admin users using REST APIs, and later we can provide a UI for managing them.

1.2 Scope

The feature covers the creation, management, and application of rules for categorizing search requests into QueryGroups (workload groups). The feature should also be reusable by other components or use cases in the future, and flexible enough to accommodate additional attributes for deciding the query group or some other label.

1.3 Definitions

  • QueryGroup (workload group): A logical grouping of requests sharing similar resource requirements or access patterns. This entity is already present in WLM and carries the system resource limits as part of its schema definition.
{
    "_id": "fafjafjkaf9ag8a9ga9g7ag0aagaga",
    "resource_limits": {
        "memory": 0.4,
        "cpu": 0.2
    },
    "resiliency_mode": "enforced",
    "name": "analytics",
    "updated_at": 4513232415
}

To understand a little bit more about the existing constructs and the feature, see: WLM Overview

  • Rule: A set of criteria used to map requests to QueryGroups. The criteria are defined as key-value pairs,
    e.g.:
{
   "index_pattern": ["logs*", "anomaly_events*"],
   "label": "dev_query_group_id",
   "feature": "WLM",
   "updated_at": "01-10-2025T21:23:456Z"
}
  • Attribute: A characteristic of a request used for matching, e.g., index_pattern, or security-context-related info such as user group, user role, or user name. For WLM we will use index_pattern as the primary attribute to tag the requests.

1.4 Use cases this feature will solve

  • I have multiple groups of users: backend systems, apps, analytics, etc. I am seeing resource spikes due to complex dashboard queries, which are affecting other user groups. I want to limit the resource usage of dashboard users.
  • I have a 1:1 mapping from my users/clients to the indices where I store my data. I have seen that 80% of the resource usage is consumed by 20% of the users, and this is creating problems for other low-traffic clients. I want to ensure that the top 20% of users don't get more than 60% of the system resources.
  • I want to divide cluster-level resources based on custom attributes. For example, for a custom attribute X, I want to ensure that the group does not consume more than 20% of CPU in the contended case.

System Overview

WLM is designed to ensure resource isolation amongst workload groups, but the system is missing the capability to automatically categorize incoming requests into query groups (workload groups).
After this feature, WLM will let admin users manage Rules and QueryGroups (tenants), and search requests will be auto-tagged with the appropriate query group using Rules.
The result of a Rule will be a String called label, as mentioned in the schema. Each feature will have its own separate set of rules, hence the label could mean a different thing for each feature; e.g., for WLM this label would be the queryGroupId.

Goals

  • Admin users should be able to manage (create/update/delete) Rules in real time.
  • Rules should be able to correctly tag the request with a single query group.

3 Functional Requirements

  • Rules Management: Admin users will be able to use REST APIs to create the Rule entities. Each feature will define the set of attributes that are valid for Rules concerning that feature.
  • Request Categorization: The request will use index_pattern as the primary input. The following cases can occur:
    1. The SearchRequest's target indices all conform to a single Rule. In this case the Rule's label will be assigned as the tag for the request.
    2. The SearchRequest's target indices don't conform to any Rule; the request should either be tracked under the catch-all query group or rejected, based on the feature setting.
    3. The SearchRequest's target indices conform to multiple Rules, each partially. In this case we have two choices: either track the child requests in their corresponding query groups, or track the request in the catch-all group or reject it. Rejection vs. catch-all again depends on the user's preference, but we can safely assume that cluster admins have a deep understanding of their data access patterns and define the Rules based on that.
  • Conflict Resolution: As long as we have only single-attribute-based tenancy this is fine, but as soon as we introduce additional attributes to decide the tenancy, it opens the door to conflicts. For example, let's say we have the following rules in the system:
  1. { "index_pattern": ["logs*"], "label": "queryGroupA", ... }
  2. { "user_pattern": ["dev*"], "label": "queryGroupB", ... }
    Now assume the request targets logs_q1_24 and comes from user dev_kaushal; then we have a conflict, and we have to give precedence to one attribute over the other. We will prioritize security-related attributes over index-related attributes (see the sketch after this list).
    Assuming that in the future we plan to support co-located hot and warm indices: in that scenario the tagging will happen at the shard level, to uniquely attribute the requests to the right query group, because we can only identify the type at the shard level when the request spans hot and warm indices.
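
A minimal sketch of how the attribute precedence described above could be applied when rules on different attributes match the same request is shown below. The enum values and class names are illustrative assumptions, not a final API; the only behavior taken from this proposal is that security-related attributes win over index-related ones.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class ConflictResolver {

    /** Lower ordinal means higher precedence: security-related attributes outrank index-related ones. */
    enum AttributeType { USER_PATTERN, USER_ROLE, INDEX_PATTERN }

    /** A rule that matched the request on a single attribute (hypothetical shape). */
    record MatchedRule(AttributeType attribute, String label) {}

    /** Pick the label of the matched rule whose attribute has the highest precedence. */
    static Optional<String> resolve(List<MatchedRule> matches) {
        return matches.stream()
            .min(Comparator.comparingInt(m -> m.attribute().ordinal()))
            .map(MatchedRule::label);
    }
}

For the example above, a match on index_pattern "logs*" (queryGroupA) together with a match on user_pattern "dev*" (queryGroupB) would resolve to queryGroupB, since the user attribute takes precedence.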

@jainankitk
Collaborator

@kaushalmahi12 - Thank you for the revision and for putting more emphasis on the user experience. A few comments:

  • I am assuming we are taking both feature and label to allow use of tagging for other features as well?
"label": "dev_query_group_id",
"feature": "WLM",

I am wondering if it is better specified as a key-value pair. For example:

"index_pattern": ["logs*", "anomaly_events*"],
"query_group": "dev_query_group_id",
"updated_at": "01-10-2025T21:23:456Z"
  • Is it better to call it indices instead of index_pattern, to allow a single index as well as prefix patterns? Also, will the evaluation of the rule happen after the index pattern resolution in the search request, or before?
  • Do we allow creating/updating conflicting rule definitions? For example:
"index_pattern": ["logs123"],
"query_group": "dev_query_group_id",
"updated_at": "01-10-2025T21:23:456Z"
"indices": ["logs*"],
"query_group": "finance_query_group_id",
"updated_at": "01-10-2025T21:23:456Z"
  • Can you also provide a list of allowed attributes along with their definitions?

@kaushalmahi12
Contributor Author

I am assuming we are taking both feature and label to allow use of tagging for other features as well?

Yes.

I am wondering if it is better specified as a key-value pair. For example:

"index_pattern": ["logs*", "anomaly_events*"],
"query_group": "dev_query_group_id",
"updated_at": "01-10-2025T21:23:456Z"

I think this is a little better! Thanks @jainankitk for the suggestion.

Is it better to call it indices instead of index_pattern to allow single index as well as prefix patterns? Also, will the evaluation of

I think either is fine since both of them convey the intent partially here. As long as the documentation is detailed enough, it should be fine IMO.

rule happen after the index pattern resolution in search request or before?

Index pattern resolution happens after the ActionFilters, AFAIK (ref1, ref2).

Do we allow creating/updating conflicting rule definitions? For example:

Yes, we will, and the value from the most recently updated rule will take effect.
