[FEATURE] Introduce node level circuit breaker settings for k-NN #2263

Open
kotwanikunal opened this issue Nov 8, 2024 · 2 comments · May be fixed by #2509

@kotwanikunal
Member

Is your feature request related to a problem?
The existing k-NN plugin uses a cluster-level circuit breaker to prevent excessive memory consumption. While effective, this approach may not be optimal for all cluster configurations, especially in heterogeneous environments where nodes have varying capacities. It lacks fine-grained control over memory usage on individual nodes.

What solution would you like?
Implement node-level circuit breakers for the k-NN plugin, allowing memory limits to be set and enforced on a per-node basis.
The enhancement will refactor the circuit breaker system in OpenSearch to support differentiated limits based on node attributes.

The approach involves defining circuit breaker limits at the cluster level, but with distinct values for different node types. Nodes would be categorized using attributes such as "node.attr.type" set to values like "big" or "small" in their opensearch.yml configuration.
For example:

PUT _cluster/settings
{
  "persistent": {
    "plugins.knn.circuit_breaker.limit.big": "70%",
    "plugins.knn.circuit_breaker.limit.small": "40%"
  }
}

The k-NN plugin would then apply the appropriate limit based on each node's attributes. This method leverages existing OpenSearch configuration mechanisms, allows for centralized management, and provides the necessary flexibility for mixed-capability clusters. It maintains backwards compatibility by falling back to a default or existing cluster-wide setting for nodes without specified attributes.

Implementation would require modifying the k-NN plugin to read node attributes, select the corresponding circuit breaker limit, and apply it in the circuit breaker logic. This solution offers a balance of granular control and ease of management, tailoring resource allocation to node capabilities while keeping configuration centralized and straightforward.
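The selection step described above could look roughly like the sketch below. This is only an illustration of the lookup order (tier-specific limit first, then the cluster-wide setting, then a default), not k-NN plugin code. The function name `resolve_limit`, the constants, and the "50%" default are placeholders assumed for the example; the setting keys follow the `PUT _cluster/settings` example above.

```python
from typing import Mapping

# Hypothetical names: these mirror the proposal, not the plugin's actual code.
TIER_ATTRIBUTE = "type"  # corresponds to node.attr.type in opensearch.yml
LIMIT_KEY = "plugins.knn.circuit_breaker.limit"  # key used in the example above


def resolve_limit(node_attrs: Mapping[str, str],
                  cluster_settings: Mapping[str, str],
                  default_limit: str = "50%") -> str:
    """Return the circuit breaker limit that applies to a single node.

    default_limit is a placeholder value assumed for this sketch.
    """
    tier = node_attrs.get(TIER_ATTRIBUTE)
    if tier is not None:
        # Look for a tier-specific override such as "<LIMIT_KEY>.big".
        override = cluster_settings.get(f"{LIMIT_KEY}.{tier}")
        if override is not None:
            return override
    # No attribute or no override: fall back to the cluster-wide setting,
    # and finally to the default.
    return cluster_settings.get(LIMIT_KEY, default_limit)


settings = {
    "plugins.knn.circuit_breaker.limit.big": "70%",
    "plugins.knn.circuit_breaker.limit.small": "40%",
}
print(resolve_limit({"type": "big"}, settings))
print(resolve_limit({}, settings))
```

A node without the attribute, or with a tier that has no configured limit, ends up on the cluster-wide value, which is what keeps existing clusters unaffected.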

What alternatives have you considered?

  • Moving to per-node configuration within opensearch.yml as overrides
@markwu-sde
Contributor

I can look into this.

@markwu-sde
Contributor

markwu-sde commented Feb 4, 2025

Initial Thoughts

The implementation centers around modifying KNNSettings.getCircuitBreakerLimit() to consider node-level attributes while using the cluster-level value as a fallback default. Currently, this configuration is set dynamically and maintains a 1:1 key/value mapping.

I understood the approach above as setting a fixed attribute (e.g., circuit_breaker_type: "big") as a node attribute and using this predetermined key to check for dynamic overrides for that particular type. If no override exists, we fall back to the cluster-level circuit breaker percentage.

Users would need to update their node-level configuration with this categorization to utilize the updated circuit breaker value. Only the specified key would be considered among user-provided node attributes, requiring users to classify nodes independently.

Since the listeners for these settings are defined at runtime, we could leverage groupSetting to listen to all changes to settings under the limit key prefix.
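To illustrate what listening on a prefix buys us, here is a rough Python sketch of the grouping step only: collecting every setting under the limit prefix into a type-to-limit map. It does not use the OpenSearch settings API; `limits_by_type` and `LIMIT_PREFIX` are hypothetical names, and the prefix is taken from the key used in the test request in this comment.

```python
from typing import Dict, Mapping

# Assumed prefix, based on the key used in the test request in this comment.
LIMIT_PREFIX = "knn.memory.circuit_breaker.limit."


def limits_by_type(flat_settings: Mapping[str, str]) -> Dict[str, str]:
    """Collect every setting under LIMIT_PREFIX into a type -> limit map."""
    return {
        key[len(LIMIT_PREFIX):]: value
        for key, value in flat_settings.items()
        # Skip the bare prefix itself and any unrelated settings.
        if key.startswith(LIMIT_PREFIX) and len(key) > len(LIMIT_PREFIX)
    }


cluster_settings = {
    "knn.memory.circuit_breaker.limit": "50%",
    "knn.memory.circuit_breaker.limit.test": "92%",
}
print(limits_by_type(cluster_settings))
```

With a map like this, new types can be added through the cluster settings API without registering a separate setting per type up front.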

Initial local changes look promising but require additional testing.

Test Output

curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent" : {
    "knn.memory.circuit_breaker.limit.test" : "92%"
  }
}
'


[2025-02-03T18:51:34,622][INFO ][stdout                   ] [integTest-0] [KNN] circuit breaker size: 14627635
[2025-02-03T18:51:43,984][DEBUG][o.o.c.c.Coordinator      ] [integTest-0] initialized PublicationContext using class: class org.opensearch.cluster.coordination.PublicationTransportHandler$PublicationContext
[2025-02-03T18:51:43,988][DEBUG][o.o.c.c.C.CoordinatorPublication] [integTest-0] publishing version 7 to [PublicationTarget{discoveryNode={integTest-0}{FS6n7xQ8TGi149fW29-LpQ}{xeuvK1QyRgS4AY4fn1B61Q}{*********}{*********:9300}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, state=NOT_STARTED, ackIsPending=true}]
[2025-02-03T18:51:43,991][DEBUG][o.o.c.c.PublicationTransportHandler] [integTest-0] received diff cluster state version [7] with uuid [uOPk9u6QQBGIPUaTt-PbGQ], diff size [201]
[2025-02-03T18:51:43,992][DEBUG][o.o.c.c.Coordinator      ] [integTest-0] handlePublishRequest: handling version [7] from [{integTest-0}{FS6n7xQ8TGi149fW29-LpQ}{xeuvK1QyRgS4AY4fn1B61Q}{*********}{*********:9300}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}]
[2025-02-03T18:51:44,165][INFO ][o.o.k.i.m.NativeMemoryCacheManager] [integTest-0] KNN Cache rebuilding.
[2025-02-03T18:51:44,170][DEBUG][o.o.c.c.C.CoordinatorPublication] [integTest-0] publication ended successfully: Publication{term=1, version=7}
[2025-02-03T18:51:44,175][WARN ][o.o.c.r.a.AllocationService] [integTest-0] Falling back to single shard assignment since batch mode disable or multiple custom allocators set
[2025-02-03T18:51:44,626][INFO ][stdout                   ] [integTest-0] [KNN] circuit breaker size: 15440281

We could potentially improve the experience further by letting users define custom circuit breaker policies based on their existing node attributes. Similar to routing or load balancing policies, a circuit breaker policy would determine each node's circuit breaker value by evaluating user-provided static conditionals in memory. This is a more complex effort that may be outside the current scope, but it could be explored if there is a need for it.
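As a very rough sketch of what such a policy might look like (purely hypothetical, nothing like this exists in the plugin today), the rules could be an ordered list of conditions over node attributes where the first match wins. `Rule`, `evaluate_policy`, and the example attribute values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

Attrs = Mapping[str, str]


@dataclass(frozen=True)
class Rule:
    """Hypothetical policy rule: a condition on node attributes plus a limit."""
    matches: Callable[[Attrs], bool]
    limit: str


def evaluate_policy(rules: Sequence[Rule], node_attrs: Attrs, fallback: str) -> str:
    """Return the limit of the first matching rule, else the fallback."""
    for rule in rules:
        if rule.matches(node_attrs):
            return rule.limit
    return fallback


# Example policy keyed on attributes a user might already have set.
policy = [
    Rule(lambda a: a.get("testattr") == "test", "92%"),
    Rule(lambda a: a.get("zone") == "a", "60%"),
]
print(evaluate_policy(policy, {"testattr": "test"}, fallback="50%"))
```

The fallback argument plays the same role as the cluster-level default in the simpler attribute-based approach.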

Projects
Status: Backlog (Hot)
3 participants