
(WIP) Collect node thread pool usage for shard balancing #131249


Draft · wants to merge 3 commits into base: main
Conversation

DiannaHohensee
Contributor

@DiannaHohensee DiannaHohensee commented Jul 14, 2025

Adds a new transport action to collect usage stats from the
data nodes. ClusterInfoService uses the action to pull thread
pool usage information from the data nodes to the master node
periodically. Sets up a new thread pool usage monitor class
to receive new ClusterInfo: in future this class will initiate
rebalancing if any data node's write thread pool is
hot-spotting.

Relates ES-12316, ES-11991


I wanted to get agreement before polishing and testing. I can put up a separate patch for the *Monitor, but wanted to show how things would connect together.
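To illustrate how the pieces described above would connect, here is a minimal, self-contained sketch of the monitor side: a class that receives per-node write-pool utilization (as would arrive via ClusterInfo) and flags hot-spotting nodes. The class name, method names, and the 0.9 threshold are illustrative assumptions, not the PR's actual code.

```java
import java.util.List;
import java.util.Map;

final class ThreadPoolUsageMonitor {
    // Assumed hot-spot threshold for illustration: a node whose write thread
    // pool utilization exceeds this fraction is considered hot-spotting.
    private static final double HOT_SPOT_THRESHOLD = 0.9;

    /**
     * Returns the IDs of nodes whose write thread pool utilization exceeds
     * the threshold, sorted for deterministic output.
     */
    static List<String> findHotSpottingNodes(Map<String, Double> writePoolUtilizationByNode) {
        return writePoolUtilizationByNode.entrySet().stream()
            .filter(e -> e.getValue() > HOT_SPOT_THRESHOLD)
            .map(Map.Entry::getKey)
            .sorted()
            .toList();
    }

    public static void main(String[] args) {
        Map<String, Double> usage = Map.of("node-1", 0.95, "node-2", 0.40);
        System.out.println(findHotSpottingNodes(usage)); // [node-1]
    }
}
```

In the PR's design, such a check would run each time the monitor receives a fresh ClusterInfo, with rebalancing initiated for flagged nodes in a follow-up change.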

@DiannaHohensee self-assigned this on Jul 14, 2025
@DiannaHohensee added the >non-issue, :Distributed Coordination/Allocation, Team:Distributed Coordination, and v9.1.1 labels on Jul 14, 2025
@DiannaHohensee changed the title from "Collect node thread pool usage for shard balancing" to "(WIP) Collect node thread pool usage for shard balancing" on Jul 14, 2025
Closes ES-12316
@DiannaHohensee DiannaHohensee force-pushed the 2025/07/10/new-cluster-info-transport-action branch from 849b1ab to 02b7126 Compare July 14, 2025 22:33
// The class doesn't have any members at the moment so return the same hash code
return Objects.hash(NAME);
}
}
Contributor

I don't think Request should have equals/hashCode, should it? I looked at a few others and they don't seem to have it.

Contributor Author

Seems hit or miss -- some impls have it overridden, some don't. I'll get rid of them and see if anything complains.
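A small self-contained demo of why the overrides add little for a field-less request class (the class name and NAME constant are illustrative, modeled on the diff above): a constant hashCode makes all instances hash identically, but without an equals override they still compare by identity, so the pair of overrides carries no value semantics.

```java
import java.util.Objects;

// Illustrative stand-in for a request class with no members.
final class EmptyRequest {
    static final String NAME = "node_usage_stats_for_thread_pools";

    @Override
    public int hashCode() {
        // The class doesn't have any members, so return the same hash code
        // for every instance.
        return Objects.hash(NAME);
    }
    // Note: equals is NOT overridden, so it remains identity-based.
}

public class Demo {
    public static void main(String[] args) {
        EmptyRequest a = new EmptyRequest();
        EmptyRequest b = new EmptyRequest();
        System.out.println(a.hashCode() == b.hashCode()); // true: constant hash
        System.out.println(a.equals(b));                  // false: identity equals
    }
}
```

This is legal per the equals/hashCode contract (equal hash codes never require equal objects), but it suggests the overrides can be dropped with no observable change for a value-less class.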

@Override
public int hashCode() {
return Objects.hash(nodeUsageStatsForThreadPools, getNode());
}
Contributor
Also, do we need equals/hashCode in this response?

client.execute(
    TransportNodeUsageStatsForThreadPoolsAction.TYPE,
    new NodeUsageStatsForThreadPoolsAction.Request(),
    listener.map(response -> response.getAllNodeUsageStatsForThreadPools())
);
Contributor
We don't check for failures here (e.g. if one node failed) I wonder if it makes sense to pass a partial result through.

Contributor Author
@DiannaHohensee Jul 15, 2025

I think we could use the last stats sent for any node that fails to respond, and have the *Monitor log a debug message about the missing response. I'd expect the cluster has other issues if a node fails to respond, or perhaps it's a race with removing a node. I can't think of any obvious harm in using stats that are a little stale. We could go as far as logging a warning if a node is not updated after X rounds -- keep some kind of timestamp, or counter, of the last stats update per node.
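The fallback described above can be sketched with plain collections: merge the latest round's stats with the previous round's, reusing a node's last-known entry when it failed to respond. The class and method names are hypothetical, and the stat value is simplified to a single long for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class UsageStatsMerger {
    /**
     * Merges the latest per-node stats with the previous round's stats,
     * falling back to stale entries for current nodes that did not respond.
     * Nodes no longer in the cluster are dropped entirely.
     */
    static Map<String, Long> mergeWithStale(
        Map<String, Long> latest,
        Map<String, Long> previous,
        Set<String> currentNodes
    ) {
        Map<String, Long> merged = new HashMap<>();
        for (String node : currentNodes) {
            if (latest.containsKey(node)) {
                merged.put(node, latest.get(node));
            } else if (previous.containsKey(node)) {
                // Node missed this round; keep its slightly stale stats.
                // A real implementation might also track a per-node staleness
                // counter and log a warning after X missed rounds.
                merged.put(node, previous.get(node));
            }
            // A node with no stats in either round is simply absent.
        }
        return merged;
    }
}
```

Keying the fallback on the current node set also handles the node-removal race mentioned above: a departed node's stale entry ages out of the map automatically.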

Labels: >non-issue, :Distributed Coordination/Allocation, Team:Distributed Coordination, v9.1.1, v9.2.0
3 participants