-
Notifications
You must be signed in to change notification settings - Fork 140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Enhancement] Make Merge in nativeEngine can Abort #2529
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: luyuncheng <[email protected]>
@luyuncheng This is interesting - but can the interrupt callback with faiss be per graph or would it be for all graphs? In other words, would it cancel all graph builds happening on the node instead of just the one for the closed shard. |
@luyuncheng thanks for creating the GH issue. and I think we ourselves have seen this problem in couple of places(ref: opensearch-project/OpenSearch#14828, @kotwanikunal created this), and seeing a solution around this problem is really great. I looked through the code and I want to know how this code is even working? Because as per my understanding of the code you are checking if merge is aborted if somehow write fails and then eating up that exception. Is this PR only to see if the merge is aborted and write to directory fails how we can handle the errors in case shard is not present on the node because it moved? Because if that is the case then in the 2.19 version of k-NN plugin we added the support writing index using IndexInput/Output. So if a rather than checking if merge is aborted or not, can we not see if IndexInput/Output is closed or not? |
@jmazanec15 In Sample Code, k-NN/src/main/java/org/apache/lucene/index/KNNMergeHelper.java Lines 11 to 16 in 32e151a
current merge thread like OS code https://github.com/opensearch-project/OpenSearch/blob/99a9a81da366173b0c2b963b26ea92e15ef34547/server/src/main/java/org/apache/lucene/index/OneMergeHelper.java#L65-L69
|
@navneet1v i think the call chain is like following shows
Nice catch, we need to handle |
@luyuncheng right, but in the faiss InterruptCallback: https://github.com/facebookresearch/faiss/blob/657c563604c774461aed0394ae99210713145e03/faiss/impl/AuxIndexStructures.h#L135-L162, it operates globally. So, setting the instance in one thread, will set it in another thread. So, for instance, here is the interrupt check in HNSW during graph build: https://github.com/facebookresearch/faiss/blob/657c563604c774461aed0394ae99210713145e03/faiss/IndexHNSW.cpp#L178-L180. If the interrupt gets set and another shard is building a graph on the node, wont this shard's graph build fail too? |
@jmazanec15 exactly right,
every different thread call so every graph build would go into check also, multi thread would get into lock area which would reduce graph build performance, but |
Oh I see (https://github.com/facebookresearch/faiss/blob/657c563604c774461aed0394ae99210713145e03/faiss/impl/AuxIndexStructures.cpp#L223-L229). That makes sense. Let me take look at this PR. This would also help address: opensearch-project/OpenSearch#8590. |
@@ -463,4 +463,8 @@ public static native KNNQueryResult[] rangeSearchIndex( | |||
int indexMaxResultWindow, | |||
int[] parentIds | |||
); | |||
|
|||
public static native void setMergeInterruptCallback(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove these and just have one interrupt callback registered at the time of library initialization? I dont think these need to be in the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just prefer not to do this when its a global, static callback.
public class KNNMergeHelper { | ||
|
||
private KNNMergeHelper() {} | ||
public static boolean isMergeAborted() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we generalize this to include shard closed events? See this comment: opensearch-project/OpenSearch#8590 (comment). Im not sure if its possible or not to look up that kind of information. @shwetathareja do you have any ideas around here?
buildAndWriteIndex(knnVectorValues, totalLiveDocs); | ||
endMergeStats(totalLiveDocs, bytesPerVector); | ||
} catch (Exception ex) { | ||
//TODO handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to catch this?
Description
When there is a scenarios:
Node1#Index1#Shard1
(long time running)Node1#Index1#Shard1
TONode2#Index1#Shard1
Proposal
i think we can introduce abort mechanism for long time merge task meanwhile close shard called.
i think we can introduce
KNNMergeHelper
class to check if merge aborted. and when build the graph, we can reuse faiss::InterruptCallback which is interrupt callback mechanism to check whether aborted or notBUT
ConcurrentMergeScheduler#MergeThread
is a internal class, we can not call this directly. it throwsorg.apache.lucene.index.ConcurrentMergeScheduler$MergeThread is in unnamed module of loader 'app'
we can added this static method into OpenSearch Core like
OneMergeHelper
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
#2530
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.