-
Notifications
You must be signed in to change notification settings - Fork 25.3k
(WIP) Collect node thread pool usage for shard balancing #131249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
(WIP) Collect node thread pool usage for shard balancing #131249
Conversation
Adds a new transport action to collect usage stats from the data nodes. ClusterInfoService uses the action to pull thread pool usage information from the data nodes to the master node periodically. Sets up a new thread pool usage monitor class to receive new ClusterInfo: in future this class will initiate rebalancing if any data node's write thread pool is hot-spotting. Closes ES-12316
849b1ab
to
02b7126
Compare
// The class doesn't have any members at the moment so return the same hash code | ||
return Objects.hash(NAME); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Request
should have equals
/hashCode
should it? I looked at a few others and they don't seem to have it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems hit or miss -- some impls have it overridden, some don't. I'll get rid of them, see if anything complains
@Override | ||
public int hashCode() { | ||
return Objects.hash(nodeUsageStatsForThreadPools, getNode()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also in this response do we need equals
/hashCode
?
client.execute( | ||
TransportNodeUsageStatsForThreadPoolsAction.TYPE, | ||
new NodeUsageStatsForThreadPoolsAction.Request(), | ||
listener.map(response -> response.getAllNodeUsageStatsForThreadPools()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't check for failures here (e.g. if one node failed) I wonder if it makes sense to pass a partial result through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could use the last stats sent for any node that fails to respond, and have the *Monitor log a a debug message about the missing response. I'd expect the cluster has other issues if a node fails to respond, or perhaps a race with removing a node. I can't think of any obvious harm in using stats that are a little stale. We could go as far as logging a warning if a node is not updated after X rounds -- keep some kind of timestamp, or counter, of last stats update per node.
Adds a new transport action to collect usage stats from the
data nodes. ClusterInfoService uses the action to pull thread
pool usage information from the data nodes to the master node
periodically. Sets up a new thread pool usage monitor class
to receive new ClusterInfo: in future this class will initiate
rebalancing if any data node's write thread pool is
hot-spotting.
Relates ES-12316, ES-11991
I wanted to get agreement before polishing and testing. I can put up a separate patch for the *Monitor, but wanted to show how things would connect together.