[Remote Vector Index Build] Introduce RemoteNativeIndexBuildStrategy skeleton #2525

Open
wants to merge 1 commit into
base: main

Conversation

jed326
Contributor

@jed326 jed326 commented Feb 13, 2025

Description

First PR for #2465

In order to review changes incrementally, this PR is scoped down to only the following:

  • Feature Flag
  • Index Setting
  • Repository name cluster setting
  • RemoteIndexBuilder skeleton with repository service and index settings wired through the Codec

I am keeping the vector upload changes in a separate follow-up PR, as that will deserve its own in-depth discussion.

The key part of this PR is that we need to pass a Supplier for KNNVectorValues through to the NativeIndexBuildStrategy, because supporting parallel blob uploads requires opening multiple InputStreams over multiple KNNVectorValues instances. The actual implementation of this parallel upload will come in the next PR; this skeleton implementation lays the groundwork for it.
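
As a rough illustration of why a Supplier is passed rather than a single instance, here is a minimal sketch. SketchVectorValues and SupplierSketch are hypothetical stand-ins written for this example and are not the plugin's classes; the only assumption carried over from the description is that a vector-values object is a forward-only cursor, so each parallel reader needs its own instance.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical stand-in for KNNVectorValues: a forward-only, single-use cursor.
final class SketchVectorValues {
    private final Iterator<float[]> cursor;

    SketchVectorValues(List<float[]> vectors) {
        this.cursor = vectors.iterator();
    }

    // Returns the next vector, or null once the cursor is exhausted.
    float[] next() {
        return cursor.hasNext() ? cursor.next() : null;
    }
}

final class SupplierSketch {
    // Each reader asks the supplier for its own cursor, so no iteration state is shared.
    static List<Integer> countInParallel(Supplier<SketchVectorValues> supplier, int readers) {
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < readers; i++) {
                futures.add(pool.submit(() -> {
                    SketchVectorValues values = supplier.get(); // fresh cursor per reader
                    int seen = 0;
                    while (values.next() != null) {
                        seen++;
                    }
                    return seen;
                }));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get());
            }
            return counts;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<float[]> vectors = List.of(new float[] {1f, 2f}, new float[] {3f, 4f}, new float[] {5f, 6f});
        // Every one of the four readers iterates all three vectors independently.
        System.out.println(countInParallel(() -> new SketchVectorValues(vectors), 4));
    }
}
```

Had a single SketchVectorValues been shared instead, the readers would have drained one cursor between them, which is the situation the Supplier is meant to avoid.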

Related Issues

Relates: #2465

Check List

  • New functionality includes testing.
    - [ ] New functionality has been documented.
    - [ ] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
    - [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@jed326 jed326 force-pushed the remote-vector-dev branch 2 times, most recently from 3f39eba to 739d2ef Compare February 13, 2025 04:14
@jed326
Contributor Author

jed326 commented Feb 13, 2025

Thanks @navneet1v , I also agree that a new writer isn't completely necessary as the underlying formats are not changing. Moreover, the remote build should be format agnostic anyways, so I've refactored NativeIndexWriter into an interface to be used either locally or remotely when performing flush or merge.

Member

@kotwanikunal kotwanikunal left a comment

Looks good overall @jed326 .

@jed326 jed326 force-pushed the remote-vector-dev branch 2 times, most recently from e7f50da to 4f8648c Compare February 14, 2025 06:08
@jed326 jed326 force-pushed the remote-vector-dev branch 2 times, most recently from 0531425 to 7e93b7f Compare February 17, 2025 01:06
@jed326
Contributor Author

jed326 commented Feb 17, 2025

@shatejas I've reworked this skeleton in the form of RemoteIndexBuildStrategy implements NativeIndexBuildStrategy. Javadoc is still needed in a few places, but this PR is ready for review on the overall approach. Thanks!

@jed326 jed326 force-pushed the remote-vector-dev branch 2 times, most recently from 7964781 to 0c52e14 Compare February 18, 2025 00:26
}
} catch (Exception e) {
throw new RuntimeException(e);
}
});
final Long expectedTimesGetVectorValuesIsCalled = vectorsPerField.stream().filter(Predicate.not(Map::isEmpty)).count();
knnVectorValuesFactoryMockedStatic.verify(
() -> KNNVectorValuesFactory.getVectorValues(any(VectorDataType.class), any(DocsWithFieldSet.class), any()),
times(Math.toIntExact(expectedTimesGetVectorValuesIsCalled) * 2)
Contributor Author

In the quantization case, getVectorValues is called twice. However, we now retrieve the supplier itself via getVectorValuesSupplier, which we only need to do once, and then pass the supplier through the rest of the flow.

Vikasht34
Vikasht34 previously approved these changes Feb 18, 2025
Collaborator

@Vikasht34 Vikasht34 left a comment

@shatejas had already made some good suggestions on the PR, and I see those being addressed. I am good with the approach and class hierarchy.

Member

@jmazanec15 jmazanec15 left a comment

Can we hide the repositories service dependency inside some RemoteIndexBuilder abstraction instead of taking a direct dependency on it in the codec classes?

Also, I think this should go into a feature branch, especially now that main is a release branch. I think the overall structure is good, but it's still WIP and doesn't have tests.

@@ -54,15 +55,29 @@ public class NativeEngines990KnnVectorsWriter extends KnnVectorsWriter {
private final List<NativeEngineFieldVectorsWriter<?>> fields = new ArrayList<>();
private boolean finished;
private final Integer approximateThreshold;
private final Supplier<RepositoriesService> repositoriesServiceSupplier;
Member

Instead of taking a hard dependency on the RepositoriesService, can we build an abstraction, like RemoteIndexBuilder, that hides these details from the IndexWriter class?

Contributor Author

One approach I previously considered was to do the feature checks in NativeEngines990KnnVectorsFormat itself and then create an instance of RemoteIndexBuildStrategy to pass to the NativeEngines990KnnVectorsWriter. However, it seemed better to me not to instantiate any of the remote index build related classes unless they were actually needed, so I went with this approach of passing the repositoriesServiceSupplier through to the index writer.

Contributor Author

Introduced a NativeIndexBuildStrategyFactory class in the latest revision
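
For readers following the thread, the general shape of such a factory might look like the sketch below. Everything in it is hypothetical and written only for illustration: the type names, the boolean flags, and the Object standing in for whatever the RepositoriesService provides are assumptions, not the signatures in this PR. The point it shows is that the writer depends on the factory alone, and the repository supplier is only invoked when a remote strategy is actually selected.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the build strategy interface.
interface BuildStrategySketch {
    String name();
}

final class LocalBuildSketch implements BuildStrategySketch {
    @Override
    public String name() {
        return "local";
    }
}

final class RemoteBuildSketch implements BuildStrategySketch {
    // Stands in for whatever the repositories service would resolve (assumption).
    private final Object repository;

    RemoteBuildSketch(Object repository) {
        this.repository = repository;
    }

    @Override
    public String name() {
        return "remote";
    }
}

// Hypothetical factory: hides the repository dependency from the writer.
final class BuildStrategyFactorySketch {
    private final Supplier<Object> repositorySupplier;
    private final boolean featureFlagEnabled;

    BuildStrategyFactorySketch(Supplier<Object> repositorySupplier, boolean featureFlagEnabled) {
        this.repositorySupplier = repositorySupplier;
        this.featureFlagEnabled = featureFlagEnabled;
    }

    // The supplier is only called, and remote classes only instantiated, when both switches are on.
    BuildStrategySketch create(boolean indexSettingEnabled) {
        if (featureFlagEnabled && indexSettingEnabled) {
            return new RemoteBuildSketch(repositorySupplier.get());
        }
        return new LocalBuildSketch();
    }
}
```

This keeps the lazy-instantiation property discussed above while still giving the codec classes a single abstraction to depend on.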

@jed326
Contributor Author

jed326 commented Feb 18, 2025

Thanks @jmazanec15

Also, I think this should go into a feature branch, especially now that main is a release branch. I think the overall structure is good, but it's still WIP and doesn't have tests.

I do have a feature branch in my fork that I have been developing on (https://github.com/jed326/k-NN/commits/remote-vector-staging-2.19/), but I strongly believe we need to start merging these changes into main incrementally; otherwise it feels like we are just kicking the can down the road on these high-level discussions.

In terms of testing I will add some coverage to ensure the fallback mechanism is working. I was thinking these weren't needed for now since all the functional testing would come in subsequent PRs that include the functionality itself.

@jmazanec15
Member

@navneet1v @shatejas @Vikasht34 What do you guys think on main vs. feature branch?

@Vikasht34
Collaborator

@navneet1v @shatejas @Vikasht34 What do you guys think on main vs. feature branch?

Feature Branch

@navneet1v
Collaborator

@navneet1v @shatejas @Vikasht34 What do you guys think on main vs. feature branch?

If tests are there and the code is buildable, then we should go with the main branch. There is no point in a feature branch.

@jed326 jed326 force-pushed the remote-vector-dev branch 2 times, most recently from 13b309b to c4a71b6 Compare February 19, 2025 01:05
@jed326
Contributor Author

jed326 commented Feb 19, 2025

@navneet1v @shatejas @Vikasht34 What do you guys think on main vs. feature branch?

I would strongly prefer to not use a feature branch. I think this PR itself is a prime example of why it's important to get a wide range of opinions on changes like this that include a decent amount of refactoring and I think this PR would not have gotten the same amount of feedback if it were directed at a feature branch.

For testing, in the latest revision I have added some randomization to the base test class to randomly enable/disable the new settings and feature flag to ensure the fallback mechanisms are working correctly.

@jmazanec15
Member

Okay, let's add some tests and we can develop on main, assuming a proper feature flag.

… to accept vector value supplier

Signed-off-by: Jay Deng <[email protected]>
@jed326
Contributor Author

jed326 commented Feb 19, 2025

Added a new RemoteIndexBuildStrategyTests.java as well to test the fallback mechanism. I think that + the settings updates in the KNNRestTestCase base class for ITs should give us good coverage on this change, but please let me know if there's any other coverage you think is missing, thanks!
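
As a loose sketch of what a fallback check could exercise, consider the example below. It assumes, purely for illustration, that "fallback" means the local build is used when the remote build throws; the thread does not spell out the exact trigger, and all type names here are hypothetical rather than taken from RemoteIndexBuildStrategyTests.java.

```java
// Hypothetical stand-in for a build strategy; returns a label naming who built the index.
interface IndexBuildSketch {
    String build(int vectorCount);
}

// Hypothetical wrapper: try the remote build first, fall back to local on failure (assumed behavior).
final class FallbackBuildSketch implements IndexBuildSketch {
    private final IndexBuildSketch remote;
    private final IndexBuildSketch local;

    FallbackBuildSketch(IndexBuildSketch remote, IndexBuildSketch local) {
        this.remote = remote;
        this.local = local;
    }

    @Override
    public String build(int vectorCount) {
        try {
            return remote.build(vectorCount);
        } catch (RuntimeException e) {
            // Remote build failed; the local build still produces an index.
            return local.build(vectorCount);
        }
    }
}

final class FallbackDemoSketch {
    public static void main(String[] args) {
        IndexBuildSketch failingRemote = n -> {
            throw new IllegalStateException("remote build unavailable");
        };
        IndexBuildSketch local = n -> "local:" + n;
        // The failing remote build is swallowed and the local strategy is used instead.
        System.out.println(new FallbackBuildSketch(failingRemote, local).build(10));
    }
}
```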

@jed326 jed326 changed the title [Remote Vector Index Build] Introduce RemoteIndexBuilder skeleton [Remote Vector Index Build] Introduce RemoteNativeIndexBuildStrategy skeleton Feb 19, 2025

6 participants