
(WIP) Collect node thread pool usage for shard balancing #131249


Draft · wants to merge 3 commits into base: main
Conversation

DiannaHohensee
Contributor

@DiannaHohensee DiannaHohensee commented Jul 14, 2025

Adds a new transport action to collect usage stats from the
data nodes. ClusterInfoService uses the action to pull thread
pool usage information from the data nodes to the master node
periodically. Sets up a new thread pool usage monitor class
to receive new ClusterInfo: in future this class will initiate
rebalancing if any data node's write thread pool is
hot-spotting.

Relates ES-12316, ES-11991


I wanted to get agreement before polishing and testing. I can put up a separate patch for the *Monitor, but wanted to show how things would connect together.
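To illustrate how the pieces described above would connect, here is a minimal, self-contained sketch of the monitor side: a class that receives per-node write-pool utilization (as would arrive via ClusterInfo) and flags hot-spotting nodes. The class name, method names, and the 0.9 threshold are illustrative assumptions, not the PR's actual code.

```java
import java.util.List;
import java.util.Map;

final class ThreadPoolUsageMonitor {
    // Assumed hot-spot threshold for illustration: a node whose write thread
    // pool utilization exceeds this fraction is considered hot-spotting.
    private static final double HOT_SPOT_THRESHOLD = 0.9;

    /**
     * Returns the IDs of nodes whose write thread pool utilization exceeds
     * the threshold, sorted for deterministic output.
     */
    static List<String> findHotSpottingNodes(Map<String, Double> writePoolUtilizationByNode) {
        return writePoolUtilizationByNode.entrySet().stream()
            .filter(e -> e.getValue() > HOT_SPOT_THRESHOLD)
            .map(Map.Entry::getKey)
            .sorted()
            .toList();
    }

    public static void main(String[] args) {
        Map<String, Double> usage = Map.of("node-1", 0.95, "node-2", 0.40);
        System.out.println(findHotSpottingNodes(usage)); // [node-1]
    }
}
```

In the PR's design, such a check would run each time the monitor receives a fresh ClusterInfo, with rebalancing initiated for flagged nodes in a follow-up change.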

@DiannaHohensee self-assigned this on Jul 14, 2025
@DiannaHohensee added the >non-issue, :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.1.1 labels on Jul 14, 2025
@DiannaHohensee changed the title from "Collect node thread pool usage for shard balancing" to "(WIP) Collect node thread pool usage for shard balancing" on Jul 14, 2025
Closes ES-12316
@DiannaHohensee DiannaHohensee force-pushed the 2025/07/10/new-cluster-info-transport-action branch from 849b1ab to 02b7126 Compare July 14, 2025 22:33
// The class doesn't have any members at the moment so return the same hash code
return Objects.hash(NAME);
}
}
Contributor

I don't think Request should have equals/hashCode, should it? I looked at a few others and they don't seem to have it.

Contributor Author

Seems hit or miss -- some impls have it overridden, some don't. I'll get rid of them and see if anything complains.
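A small self-contained demo of why the overrides add little for a field-less request class (the class name and NAME constant are illustrative, modeled on the diff above): a constant hashCode makes all instances hash identically, but without an equals override they still compare by identity, so the pair of overrides carries no value semantics.

```java
import java.util.Objects;

// Illustrative stand-in for a request class with no members.
final class EmptyRequest {
    static final String NAME = "node_usage_stats_for_thread_pools";

    @Override
    public int hashCode() {
        // The class doesn't have any members, so return the same hash code
        // for every instance.
        return Objects.hash(NAME);
    }
    // Note: equals is NOT overridden, so it remains identity-based.
}

public class Demo {
    public static void main(String[] args) {
        EmptyRequest a = new EmptyRequest();
        EmptyRequest b = new EmptyRequest();
        System.out.println(a.hashCode() == b.hashCode()); // true: constant hash
        System.out.println(a.equals(b));                  // false: identity equals
    }
}
```

This is legal per the equals/hashCode contract (equal hash codes never require equal objects), but it suggests the overrides can be dropped with no observable change for a value-less class.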

@Override
public int hashCode() {
return Objects.hash(nodeUsageStatsForThreadPools, getNode());
}
Contributor
Also, do we need equals/hashCode in this response?

client.execute(
    TransportNodeUsageStatsForThreadPoolsAction.TYPE,
    new NodeUsageStatsForThreadPoolsAction.Request(),
    listener.map(response -> response.getAllNodeUsageStatsForThreadPools())
);
Contributor
We don't check for failures here (e.g. if one node failed) I wonder if it makes sense to pass a partial result through.

Contributor Author
@DiannaHohensee Jul 15, 2025

I think we could use the last stats sent for any node that fails to respond, and have the *Monitor log a debug message about the missing response. I'd expect the cluster has other issues if a node fails to respond, or perhaps it's a race with removing a node. I can't think of any obvious harm in using stats that are a little stale. We could go as far as logging a warning if a node is not updated after X rounds -- keep some kind of timestamp, or counter, of the last stats update per node.
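The fallback described above can be sketched with plain collections: merge the latest round's stats with the previous round's, reusing a node's last-known entry when it failed to respond. The class and method names are hypothetical, and the stat value is simplified to a single long for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class UsageStatsMerger {
    /**
     * Merges the latest per-node stats with the previous round's stats,
     * falling back to stale entries for current nodes that did not respond.
     * Nodes no longer in the cluster are dropped entirely.
     */
    static Map<String, Long> mergeWithStale(
        Map<String, Long> latest,
        Map<String, Long> previous,
        Set<String> currentNodes
    ) {
        Map<String, Long> merged = new HashMap<>();
        for (String node : currentNodes) {
            if (latest.containsKey(node)) {
                merged.put(node, latest.get(node));
            } else if (previous.containsKey(node)) {
                // Node missed this round; keep its slightly stale stats.
                // A real implementation might also track a per-node staleness
                // counter and log a warning after X missed rounds.
                merged.put(node, previous.get(node));
            }
            // A node with no stats in either round is simply absent.
        }
        return merged;
    }
}
```

Keying the fallback on the current node set also handles the node-removal race mentioned above: a departed node's stale entry ages out of the map automatically.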

Labels: >non-issue, :Distributed Coordination/Allocation, Team:Distributed Coordination, v9.1.1, v9.2.0
3 participants