
IndicesClusterStateService blocking ClusterApplier thread and causing node drop. #8590

Open
dhwanilpatel opened this issue Jul 10, 2023 · 10 comments
Labels: bug, Cluster Manager

Comments

@dhwanilpatel
Contributor

Describe the bug

IndicesClusterStateService is one of the cluster state appliers. It is also responsible for other activities, such as handling peer recovery failures and shard failures. All three entry points, applyClusterState, handleRecoveryFailure, and FailedShardHandler, are synchronized methods, so different threads can block each other while performing these operations.

We have seen cases where the clusterApplierService thread gets blocked on IndicesClusterStateService because another generic thread has already taken the lock on the service, either due to a recovery failure or a shard failure. If that operation takes a long time to complete, it blocks the cluster applier thread for a long time and causes frequent node drops. The node keeps cycling through node-join and node-left events until the applier thread is unblocked. This behavior can similarly block other generic threads.
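
For illustration only, here is a minimal sketch (class name and method bodies are simplified, not the actual OpenSearch code) of how the three synchronized entry points contend on one instance monitor:

```java
// Minimal sketch (simplified, not the actual OpenSearch class): all three entry
// points are synchronized on the same instance monitor, so a slow failure
// handler holds the lock while the applier thread waits.
class IndicesClusterStateServiceSketch {

    // runs on the clusterApplierService#updateTask thread
    synchronized void applyClusterState(Object clusterChangedEvent) {
        // create/update/delete shards and indices for the new state
    }

    // runs on a generic thread when a peer recovery fails
    synchronized void handleRecoveryFailure(String shardId, boolean sendShardFailure) {
        failAndRemoveShard(shardId, sendShardFailure);
    }

    private void failAndRemoveShard(String shardId, boolean sendShardFailure) {
        // closing the shard can take minutes (e.g. a k-NN engine still building
        // its graph), and the monitor is held for the whole duration
    }
}
```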

For example, in one case we have seen the cluster applier thread blocked for a long time due to a failed recovery of a k-NN index, where handleRecoveryFailure took very long and caused frequent node drops in the cluster.

Implication:
Too many node-join/node-left events will occur in the cluster.

Thread dumps supporting the problem

Cluster applier thread blocked waiting on IndicesClusterStateService

opensearch[59a08cdd7fbe4cdbecca8f5917fd78d9][clusterApplierService#updateTask][T#1] 
Stack Trace is: 
java.lang.Thread.State: BLOCKED (on object monitor)
at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:256)
- waiting to lock <0x000000100bd24c78> (a org.opensearch.indices.cluster.IndicesClusterStateService)
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606)
at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593)
at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561)
at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484)
at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282)
at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base/Thread.java:829)

Generic thread holding the lock due to recovery failure

opensearch[59a08cdd7fbe4cdbecca8f5917fd78d9][generic][T#3]
Stack Trace is:
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base/Native Method)
- parking to wait for  <0x00000010230dac70> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(java.base/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base/ReentrantReadWriteLock.java:959)
at org.opensearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:69)
at org.opensearch.index.engine.Engine.close(Engine.java:1959)
at org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:87)
at org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:129)
at org.opensearch.core.internal.io.IOUtils.close(IOUtils.java:79)
at org.opensearch.index.shard.IndexShard.close(IndexShard.java:1691)
- locked <0x00000010230d2368> (a java.lang.Object)
at org.opensearch.index.IndexService.closeShard(IndexService.java:602)
at org.opensearch.index.IndexService.removeShard(IndexService.java:582)
- locked <0x00000010230a2a70> (a org.opensearch.index.IndexService)
at org.opensearch.indices.cluster.IndicesClusterStateService.failAndRemoveShard(IndicesClusterStateService.java:795)
at org.opensearch.indices.cluster.IndicesClusterStateService.handleRecoveryFailure(IndicesClusterStateService.java:775)
- locked <0x000000100bd24c78> (a org.opensearch.indices.cluster.IndicesClusterStateService)
at org.opensearch.indices.recovery.RecoveryListener.onFailure(RecoveryListener.java:53)
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139)
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176)
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:710)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:636)
at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:316)
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379)
at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420)
at org.opensearch.transport.InboundHandler$$Lambda$4982/0x00000017a264bc40.run(Unknown Source)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base/Thread.java:829)

Another blocked generic thread

opensearch[59a08cdd7fbe4cdbecca8f5917fd78d9][generic][T#6] 
Stack Trace is: 
java.lang.Thread.State: BLOCKED (on object monitor)
at org.opensearch.indices.cluster.IndicesClusterStateService.handleRecoveryFailure(IndicesClusterStateService.java:775)
- waiting to lock <0x000000100bd24c78> (a org.opensearch.indices.cluster.IndicesClusterStateService)
at org.opensearch.indices.recovery.RecoveryListener.onFailure(RecoveryListener.java:53)
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:139)
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176)
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:194)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:710)
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:636)
at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleException(SecurityInterceptor.java:316)
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379)
at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420)
at org.opensearch.transport.InboundHandler$$Lambda$4982/0x00000017a264bc40.run(Unknown Source)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:756)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base/Thread.java:829)
@dhwanilpatel dhwanilpatel added bug Something isn't working untriaged labels Jul 10, 2023
@dhwanilpatel dhwanilpatel changed the title IndiceClusterStateService blocking ClusterApplier thread on causing node drop. IndiceClusterStateService blocking ClusterApplier thread and causing node drop. Jul 10, 2023
@andrross andrross changed the title IndiceClusterStateService blocking ClusterApplier thread and causing node drop. IndicesClusterStateService blocking ClusterApplier thread and causing node drop. Jul 18, 2023
@dhwanilpatel
Contributor Author

I have tried exploring a couple of approaches to see whether it is feasible to decouple these code paths from the synchronized behavior, but it looks like we need to keep both under the synchronized block.

The cluster applier thread calls applyClusterState, and shard/recovery failures call the failAndRemoveShard method. I have analyzed the responsibilities of both methods to assess whether they can run simultaneously.
Responsibility of applyClusterState: it applies the new cluster state and performs the relevant operations, such as creating/updating/deleting shards or indices.
Responsibility of failAndRemoveShard: it removes the shard files and sends shard-failed tasks to the cluster manager, depending on the method parameters. It is used from handleRecoveryFailure and FailedShardHandler.

Approach 1) Remove the synchronized behavior while keeping the current responsibilities

Both methods might act on the same shard, so simultaneous execution can introduce race conditions. I have listed a few based on a high-level analysis, but there can be more:

  1. failedShardsCache is maintained across these two methods to prevent repeated recovery of failed shards on each cluster state update. Without the shared lock this can break, and recovery could be restarted on every state update (see the sketch after this list).

  2. If applyClusterState deletes a shard as part of an index deletion and, at the same time, failAndRemoveShard also removes the shard, accessing the underlying node environment can fail, since these operations might not use the same lock in IndexService. This is just one example; there can be more race conditions.
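
A hypothetical sketch of the check-then-act race from point 1, assuming the two paths stopped sharing the monitor (all names here are illustrative, not the real fields or signatures):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: shows how the failedShardsCache coordination could break
// if applyClusterState and the failure handler no longer shared a lock.
class FailedShardsCacheRaceSketch {

    private final Map<String, String> failedShardsCache = new HashMap<>();

    // applier thread: skips shards that recently failed
    void applyClusterState(String shardId) {
        if (!failedShardsCache.containsKey(shardId)) {
            startRecovery(shardId); // can interleave with the put() below and
                                    // restart recovery on every state update
        }
    }

    // generic thread: records the failure after removing shard files
    void handleRecoveryFailure(String shardId, String reason) {
        removeShardFiles(shardId);
        failedShardsCache.put(shardId, reason);
    }

    private void startRecovery(String shardId) { /* ... */ }

    private void removeShardFiles(String shardId) { /* ... */ }
}
```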

Approach 2) Change the responsibility of failAndRemoveShard to only report the shard failure to the cluster manager node

The failAndRemoveShard method removes the local shard files and sends a shard-failed task to the cluster manager node, depending on the method parameters. We could change its responsibility to only send the shard failure to the cluster manager node and then rely on the cluster state applier to remove the shard files. After this we might be able to remove the synchronization between these methods (a rough sketch of this shape follows the caveats below).

There are some issues with this as well:

  1. In some cases (e.g., the source node cancels the recovery, or the source shard is closed), the target data node does not send shard-failed to the cluster manager node and directly removes the shard files. We would need to handle such scenarios separately.

  2. If recovery fails due to disk space, i.e., the target node's disk is full, the data node might not be able to apply the new cluster state at all, so it would never delete those shard files and release space. For this case we need the current behavior, where the files are removed first and shard-failed is sent to the cluster manager afterwards.
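
A rough sketch of what approach 2 could look like, purely to make the shape concrete (names and signatures are hypothetical, not the existing API):

```java
// Hypothetical shape of approach 2: the failure handler only reports to the
// cluster manager and no longer deletes files, so it would not need the shared
// monitor. File removal is deferred to the applier thread.
class FailOnlyRecoveryHandlerSketch {

    void handleRecoveryFailure(String shardId, Exception cause) {
        // no local file deletion here
        sendShardFailedToClusterManager(shardId, cause);
    }

    // runs later on the applier thread once the cluster manager has marked the
    // shard as failed; caveat 2 above: this never runs if the node cannot apply
    // cluster state at all (e.g. its disk is already full)
    void applyClusterState(String shardIdMarkedFailed) {
        removeShardFiles(shardIdMarkedFailed);
    }

    private void sendShardFailedToClusterManager(String shardId, Exception cause) { /* ... */ }

    private void removeShardFiles(String shardId) { /* ... */ }
}
```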

After analyzing the above two approaches, it looks like removing the synchronized behavior between applyClusterState, handleRecoveryFailure and FailedShardHandler is not feasible.

I am open to hearing more suggestions/approaches for decoupling it.

@dhwanilpatel
Contributor Author

Closing this: since it doesn't seem feasible to decouple, we will need to keep it in the synchronized block.

@romiras

romiras commented Dec 23, 2023

@dhwanilpatel

Closing this: since it doesn't seem feasible to decouple, we will need to keep it in the synchronized block.

What are the next action items for this issue?

@shwetathareja
Member

Reopening this issue, as it impacts cluster stability: every time the applier threads get stuck, it causes node drops.

@dblock
Member

dblock commented Aug 19, 2024

[Catch All Triage - 1, 2]

@jmazanec15
Member

Hey @shwetathareja @dhwanilpatel, from the k-NN side, it seems some kind of graph creation/indexing is happening that would cause such a long block. I'm wondering if, in this case, we could provide a signal to cancel building the graph and abandon the process. Is the expected behavior that the operation taken by k-NN will be abandoned anyway? In other words, would it be correct behavior to abandon the segment generation that is occurring?

@shwetathareja
Member

@jmazanec15 yes, ideally during closeShard the k-NN graph generation should be abandoned, but I think there is no easy way to interrupt the library during graph generation.

@jmazanec15
Member

jmazanec15 commented Jan 14, 2025

@shwetathareja From the Faiss perspective, there are InterruptHandlers (https://github.com/facebookresearch/faiss/blob/b9fe1dcdf71602f5d733dbd78adce06bba20d615/faiss/IndexHNSW.cpp#L118) that could be leveraged to halt graph creation. So, assuming there is a suitable extension point on closeShard, we could interrupt creation for the engines that support interrupts and stop graph creation. That should mitigate this issue.

Update: on closer examination, Faiss implements the InterruptHandlers statically, which may make it more complex to integrate. I still think this is worthwhile to investigate.
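
To make the idea concrete, a purely illustrative sketch of such an extension point (the hook, the flag, and the build loop below are hypothetical; none of this is an existing k-NN or Faiss API):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: closeShard flips a cancellation flag that the long-running
// graph build polls, analogous to Faiss checking its interrupt handler during
// HNSW construction.
class GraphBuildCancellationSketch {

    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    // would be driven from an extension point on closeShard
    void onCloseShard() {
        cancelled.set(true);
    }

    // the graph build periodically checks the flag and abandons the work
    void buildGraph(float[][] vectors) {
        for (float[] vector : vectors) {
            if (cancelled.get()) {
                return; // abandon the in-progress graph/segment
            }
            addToGraph(vector);
        }
    }

    private void addToGraph(float[] vector) { /* ... */ }
}
```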

@shwetathareja
Member

Thanks @jmazanec15, I agree. If there is an option to interrupt the graph generation, we should explore it. Would you have some bandwidth to look into it?

@jmazanec15
Member

@luyuncheng is looking at this in opensearch-project/k-NN#2529
