
IndicesClusterStateService blocking ClusterApplier thread and causing node drop. #8590

Open
dhwanilpatel opened this issue Jul 10, 2023 · 10 comments
Labels: bug, Cluster Manager

Comments

@dhwanilpatel
Contributor

Describe the bug

IndicesClusterStateService is one of the cluster state appliers. It is also responsible for other activities, such as handling peer recovery failures and shard failures. All three entry points, applyClusterState, handleRecoveryFailure, and FailedShardHandler, are synchronized methods, so different threads can block each other while performing these operations.

We have seen cases where the clusterApplierService thread gets blocked on IndicesClusterStateService because another generic thread has already taken the lock on the service, either due to a recovery failure or a shard failure. If that operation takes a long time to complete, it blocks the cluster applier thread for a long time and causes frequent node drops. The node keeps cycling through node-join and node-left events until the applier thread is unblocked. This behavior can similarly block other generic threads.
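
For illustration only, here is a minimal sketch (class name and method bodies are simplified, not the actual OpenSearch code) of how the three synchronized entry points contend on one instance monitor:

```java
// Minimal sketch (simplified, not the actual OpenSearch class): all three entry
// points are synchronized on the same instance monitor, so a slow failure
// handler holds the lock while the applier thread waits.
class IndicesClusterStateServiceSketch {

    // runs on the clusterApplierService#updateTask thread
    synchronized void applyClusterState(Object clusterChangedEvent) {
        // create/update/delete shards and indices for the new state
    }

    // runs on a generic thread when a peer recovery fails
    synchronized void handleRecoveryFailure(String shardId, boolean sendShardFailure) {
        failAndRemoveShard(shardId, sendShardFailure);
    }

    private void failAndRemoveShard(String shardId, boolean sendShardFailure) {
        // closing the shard can take minutes (e.g. a k-NN engine still building
        // its graph), and the monitor is held for the whole duration
    }
}
```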

For example, in one case we have seen the cluster applier thread blocked for a long time due to a failed recovery of a k-NN index, where handleRecoveryFailure took very long and caused frequent node drops in the cluster.

Implication:
Too many node-join/node-left events will occur in the cluster.

Thread dumps supporting the problem

Cluster applier thread blocked waiting on IndicesClusterStateService

opensearch[59a08cdd7fbe4cdbecca8f5917fd78d9][clusterApplierService#updateTask][T#1] 
Stack Trace is: 
java.lang.Thread.State: BLOCKED (on object monitor)
at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:256)
- waiting to lock <0x000000100bd24c78> (a org.opensearch.indices.cluster.IndicesClusterStateService)
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606)
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593)
at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561)
at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484)
at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base/Thread.java:829)

Generic thread holding the lock due to recovery failure

opensearch[59a08cdd7fbe4cdbecca8f5917fd78d9][generic][T#3]
Stack Trace is:
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
- parking to wait for  <0x00000010230dac70> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base/ReentrantReadWriteLock.java:959)
at org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
at org.opensearch.index.engine.Engine.close(Engine.java:1959)
at org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:87)
at org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:129)
at org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:79)
at org.opensearch.index.shard.IndexShard.close(IndexShard.java:1691)
- locked <0x00000010230d2368> (a java.lang.Object)
at org.opensearch.index.IndexService.closeShard(IndexService.java:602)
at org.opensearch.index.IndexService.removeShard(IndexService.java:582)
- locked <0x00000010230a2a70> (a org.opensearch.index.IndexService)
at org.opensearch.indices.cluster.IndicesClusterStateService.failAndRemoveShard(IndicesClusterStateService.java:795)
at org.opensearch.indices.cluster.IndicesClusterStateService.handleRecoveryFailure(IndicesClusterStateService.java:775)
- locked <0x000000100bd24c78> (a org.opensearch.indices.cluster.IndicesClusterStateService)
at org.opensearch.indices.recovery.RecoveryListener.onFailure(RecoveryListener.java:53)
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139)
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176)
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:710)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:636)
at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:316)
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379)
at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420)
at org.opensearch.transport.InboundHandler$$Lambda$4982/0x00000017a264bc40.run(Unknown Source)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base/Thread.java:829)

Another blocked generic thread

opensearch[59a08cdd7fbe4cdbecca8f5917fd78d9][generic][T#6] 
Stack Trace is: 
java.lang.Thread.State: BLOCKED (on object monitor)
at org.opensearch.indices.cluster.IndicesClusterStateService.handleRecoveryFailure(IndicesClusterStateService.java:775)
- waiting to lock <0x000000100bd24c78> (a org.opensearch.indices.cluster.IndicesClusterStateService)
at org.opensearch.indices.recovery.RecoveryListener.onFailure(RecoveryListener.java:53)
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139)
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176)
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:710)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:636)
at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:316)
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379)
at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420)
at org.opensearch.transport.InboundHandler$$Lambda$4982/0x00000017a264bc40.run(Unknown Source)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base/Thread.java:829)
@dhwanilpatel dhwanilpatel added bug Something isn't working untriaged labels Jul 10, 2023
@dhwanilpatel dhwanilpatel changed the title IndiceClusterStateService blocking ClusterApplier thread on causing node drop. IndiceClusterStateService blocking ClusterApplier thread and causing node drop. Jul 10, 2023
@andrross andrross changed the title IndiceClusterStateService blocking ClusterApplier thread and causing node drop. IndicesClusterStateService blocking ClusterApplier thread and causing node drop. Jul 18, 2023
@dhwanilpatel
Contributor Author

I have tried exploring a couple of approaches to see whether it is feasible to decouple these code paths from the synchronized behavior, but it looks like we need to keep both under the synchronized block.

The cluster applier thread calls applyClusterState, and shard/recovery failures call the failAndRemoveShard method. I have analyzed the responsibilities of both methods to assess whether they can run simultaneously.
Responsibility of applyClusterState: it applies the new cluster state and performs the relevant operations, such as creating/updating/deleting shards or indices.
Responsibility of failAndRemoveShard: it removes the shard files and sends shard-failed tasks to the cluster manager, depending on the method parameters. It is used from handleRecoveryFailure and FailedShardHandler.

Approach 1) Remove the synchronized behavior while keeping the current responsibilities

Both methods might act on the same shard, so simultaneous execution can introduce race conditions. I have listed a few based on a high-level analysis, but there can be more:

  1. failedShardsCache is maintained across these two methods to prevent repeated recovery of failed shards on each cluster state update. Without the shared lock this can break, and recovery could be restarted on every state update (see the sketch after this list).

  2. If applyClusterState deletes a shard as part of an index deletion and, at the same time, failAndRemoveShard also removes the shard, accessing the underlying node environment can fail, since these operations might not use the same lock in IndexService. This is just one example; there can be more race conditions.
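
A hypothetical sketch of the check-then-act race from point 1, assuming the two paths stopped sharing the monitor (all names here are illustrative, not the real fields or signatures):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: shows how the failedShardsCache coordination could break
// if applyClusterState and the failure handler no longer shared a lock.
class FailedShardsCacheRaceSketch {

    private final Map<String, String> failedShardsCache = new HashMap<>();

    // applier thread: skips shards that recently failed
    void applyClusterState(String shardId) {
        if (!failedShardsCache.containsKey(shardId)) {
            startRecovery(shardId); // can interleave with the put() below and
                                    // restart recovery on every state update
        }
    }

    // generic thread: records the failure after removing shard files
    void handleRecoveryFailure(String shardId, String reason) {
        removeShardFiles(shardId);
        failedShardsCache.put(shardId, reason);
    }

    private void startRecovery(String shardId) { /* ... */ }

    private void removeShardFiles(String shardId) { /* ... */ }
}
```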

Approach 2) Change the responsibility of failAndRemoveShard to only report the shard failure to the cluster manager node

The failAndRemoveShard method removes the local shard files and sends a shard-failed task to the cluster manager node, depending on the method parameters. We could change its responsibility to only send the shard failure to the cluster manager node and then rely on the cluster state applier to remove the shard files. After this we might be able to remove the synchronization between these methods (a rough sketch of this shape follows the caveats below).

There are some issues with this as well:

  1. In some cases (e.g., the source node cancels the recovery, or the source shard is closed), the target data node does not send shard-failed to the cluster manager node and directly removes the shard files. We would need to handle such scenarios separately.

  2. If recovery fails due to disk space, i.e., the target node's disk is full, the data node might not be able to apply the new cluster state at all, so it would never delete those shard files and release space. For this case we need the current behavior, where the files are removed first and shard-failed is sent to the cluster manager afterwards.
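
A rough sketch of what approach 2 could look like, purely to make the shape concrete (names and signatures are hypothetical, not the existing API):

```java
// Hypothetical shape of approach 2: the failure handler only reports to the
// cluster manager and no longer deletes files, so it would not need the shared
// monitor. File removal is deferred to the applier thread.
class FailOnlyRecoveryHandlerSketch {

    void handleRecoveryFailure(String shardId, Exception cause) {
        // no local file deletion here
        sendShardFailedToClusterManager(shardId, cause);
    }

    // runs later on the applier thread once the cluster manager has marked the
    // shard as failed; caveat 2 above: this never runs if the node cannot apply
    // cluster state at all (e.g. its disk is already full)
    void applyClusterState(String shardIdMarkedFailed) {
        removeShardFiles(shardIdMarkedFailed);
    }

    private void sendShardFailedToClusterManager(String shardId, Exception cause) { /* ... */ }

    private void removeShardFiles(String shardId) { /* ... */ }
}
```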

After analyzing the above two approaches, it looks like removing the synchronized behavior between applyClusterState, handleRecoveryFailure and FailedShardHandler is not feasible.

I am open to hearing more suggestions/approaches for decoupling it.

@dhwanilpatel
Contributor Author

Closing this: since it doesn't seem feasible to decouple, we will need to keep it in the synchronized block.

@romiras

romiras commented Dec 23, 2023

@dhwanilpatel

Closing this: since it doesn't seem feasible to decouple, we will need to keep it in the synchronized block.

What are the next action items for this issue?

@shwetathareja
Member

Reopening this issue, as it impacts cluster stability: every time the applier threads get stuck, it causes node drops.

@dblock
Member

dblock commented Aug 19, 2024

[Catch All Triage - 1, 2]

@jmazanec15
Member

Hey @shwetathareja @dhwanilpatel, from the k-NN side, it seems some kind of graph creation/indexing is happening that would cause such a long block. I'm wondering if, in this case, we could provide a signal to cancel building the graph and abandon the process. Is the expected behavior that the operation taken by k-NN will be abandoned anyway? In other words, would it be correct behavior to abandon the segment generation that is occurring?

@shwetathareja
Member

@jmazanec15 yes, ideally during closeShard the k-NN graph generation should be abandoned, but I think there is no easy way to interrupt the library during graph generation.

@jmazanec15
Member

jmazanec15 commented Jan 14, 2025

@shwetathareja From the Faiss perspective, there are InterruptHandlers (https://github.com/facebookresearch/faiss/blob/b9fe1dcdf71602f5d733dbd78adce06bba20d615/faiss/IndexHNSW.cpp#L118) that could be leveraged to halt graph creation. So, assuming there is a suitable extension point on closeShard, we could interrupt creation for the engines that support interrupts and stop graph creation. That should mitigate this issue.

Update: on closer examination, Faiss implements the InterruptHandlers statically, which may make it more complex to integrate. I still think this is worthwhile to investigate.
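
To make the idea concrete, a purely illustrative sketch of such an extension point (the hook, the flag, and the build loop below are hypothetical; none of this is an existing k-NN or Faiss API):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: closeShard flips a cancellation flag that the long-running
// graph build polls, analogous to Faiss checking its interrupt handler during
// HNSW construction.
class GraphBuildCancellationSketch {

    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    // would be driven from an extension point on closeShard
    void onCloseShard() {
        cancelled.set(true);
    }

    // the graph build periodically checks the flag and abandons the work
    void buildGraph(float[][] vectors) {
        for (float[] vector : vectors) {
            if (cancelled.get()) {
                return; // abandon the in-progress graph/segment
            }
            addToGraph(vector);
        }
    }

    private void addToGraph(float[] vector) { /* ... */ }
}
```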

@shwetathareja
Member

Thanks @jmazanec15, I agree. If there is an option to interrupt the graph generation, we should explore it. Would you have some bandwidth to look into it?

@jmazanec15
Member

@luyuncheng is looking at this in opensearch-project/k-NN#2529
