IndicesClusterStateService blocking ClusterApplier thread and causing node drop. #8590
Comments
I have explored a couple of approaches to see whether it is feasible to decouple the two pieces of logic from the synchronized behavior, but it seems we need to keep both under the synchronized block. Cluster applier thread calls
Approach 1) Remove synchronized behavior with current responsibility
Both methods might act on the same shard, so simultaneous execution can introduce multiple race conditions and other issues. I have listed a few race conditions based on a high-level analysis, but there can be more.
Approach 2) Change responsibility of
Closing this. Since it doesn't seem feasible to decouple the logic, we will need to keep it in the synchronized block.
What are the next action items for this issue?
Reopening this issue, as it impacts cluster stability every time the applier threads are stuck and causes node drops.
Hey @shwetathareja @dhwanilpatel, from the knn side, it seems that some kind of graph creation/indexing is happening that would cause such a long block. I'm wondering whether, in this case, we could provide a signal to cancel building the graph and abandon the process. Is the expected behavior that the operation taken by knn will be abandoned anyway? In other words, would it be correct behavior to abandon the segment generation that is in progress?
@jmazanec15 Yes, ideally knn graph generation should be abandoned during closeShard, but I think there is no easy way to interrupt the library during graph generation.
@shwetathareja From the faiss perspective, there are InterruptHandlers (https://github.com/facebookresearch/faiss/blob/b9fe1dcdf71602f5d733dbd78adce06bba20d615/faiss/IndexHNSW.cpp#L118) that could be leveraged to halt graph creation. So, assuming that there is a suitable extension point on closeShard, we could interrupt the creation for selective engines that support interrupts and stop graph creation. It seems this would mitigate this issue. Update: on closer examination, faiss implements the InterruptHandlers statically, which may make it more complex to integrate. Anyway, I think this is worthwhile to investigate adding.
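To make the idea concrete, here is a minimal sketch of cooperative cancellation in Java. It is not faiss's or k-NN's API: the `GraphBuild` class, its `cancel()` method and the notion of a shard-close hook calling it are all hypothetical, and the per-vector work is a stand-in. It only illustrates a long-running build that polls a flag between units of work so another thread can ask it to stop.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class CancellableBuildDemo {

    /** Hypothetical long-running build that polls a cancellation flag between units of work. */
    static class GraphBuild {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);

        /** Assumption: something like a shard-close hook would call this from another thread. */
        void cancel() {
            cancelled.set(true);
        }

        /** Returns the number of vectors processed before finishing or being cancelled. */
        int run(int totalVectors) {
            int processed = 0;
            while (processed < totalVectors) {
                if (cancelled.get()) {
                    break; // abandon the build; the partial result is discarded by the caller
                }
                processed++; // stand-in for inserting one vector into the graph
            }
            return processed;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        GraphBuild build = new GraphBuild();
        AtomicInteger processed = new AtomicInteger();

        // The build runs on its own thread, as a long graph creation would.
        Thread builder = new Thread(() -> processed.set(build.run(Integer.MAX_VALUE)), "graph-build");
        builder.start();

        Thread.sleep(50);   // let the build make some progress
        build.cancel();     // simulated close request
        builder.join();

        System.out.println("vectors processed before stop: " + processed.get());
    }
}
```

The limitation raised above still applies: this only works if the native library checks such a flag at a reasonable granularity, and a static handler shared by all builds would need extra care to cancel only the affected shard.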
Thanks @jmazanec15, I agree. If there is an option to interrupt the graph generation, we should explore it. Would you have some bandwidth to look into it?
@luyuncheng is looking at this in opensearch-project/k-NN#2529
Describe bug
IndicesClusterStateService is one of the cluster state appliers. It is also responsible for other activities, such as handling peer recovery failures and shard failures. The three methods applyClusterState, handleRecoveryFailure and FailedShardHandler are all synchronized, so different threads can block each other while performing these operations.
We have seen cases where the clusterApplierService thread gets blocked on IndicesClusterStateService because another generic thread has already taken the lock on the service. This can be due to either a recovery failure or a shard failure. If those operations take a long time to complete, they block the Cluster Applier thread for just as long and cause frequent node drops. The node will cycle through node-join and node-left events until the applier thread is unblocked. Similarly, this behavior can block other generic threads as well.
For example, in one case we saw the cluster applier thread blocked for a long time due to a failed recovery of a KNN index, where handleRecoveryFailure took a very long time and caused frequent node drops in the cluster.
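The blocking pattern described above can be reproduced in isolation. The following is a minimal sketch, not the OpenSearch code: the `Service` class is a hypothetical stand-in whose two synchronized methods only borrow the names from this issue, and the slow failure handling is simulated with `Thread.sleep`. It shows that a fast synchronized call waits for as long as another thread holds the same monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SynchronizedContentionDemo {

    /** Hypothetical stand-in for a service whose entry points share one monitor. */
    static class Service {
        synchronized void applyClusterState() {
            // Fast on its own; it only has to acquire the monitor.
        }

        synchronized void handleRecoveryFailure(long workMillis, CountDownLatch lockHeld)
                throws InterruptedException {
            lockHeld.countDown();      // signal that the monitor is now held
            Thread.sleep(workMillis);  // simulated slow cleanup; sleep does not release the monitor
        }
    }

    /** Returns how long (in ms) the calling thread waited to get through applyClusterState. */
    static long measureApplierWaitMillis(long slowWorkMillis) throws InterruptedException {
        Service service = new Service();
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread generic = new Thread(() -> {
            try {
                service.handleRecoveryFailure(slowWorkMillis, lockHeld);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "generic");
        generic.start();

        lockHeld.await();              // the generic thread owns the monitor from here on
        long start = System.nanoTime();
        service.applyClusterState();   // plays the applier thread: blocks until the monitor is free
        long waited = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        generic.join();
        return waited;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("applier waited ~" + measureApplierWaitMillis(500) + " ms");
    }
}
```

In the real service the wait is bounded only by how long the failure handling takes, which is why a slow shard close on a generic thread can stall cluster state application long enough for the node to be dropped.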
Implication:
Too many node-join/node-left events will occur in the cluster.
Thread dumps supporting problem
Blocked Cluster Applier thread for IndicesClusterStateService
Generic thread holding the lock due to recovery failure
Other blocked generic thread