Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrating GPU based Vector Search using cuVS #14131

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

chatman
Copy link
Contributor

@chatman chatman commented Jan 10, 2025

Description

  • NVIDIA's cuVS library (https://github.com/rapidsai/cuvs) is a state-of-the-art vector search library. It supports fast indexing and search using GPUs.
  • This pull request is to integrate this library based on a custom KnnVectorFormat and Codec into the Lucene sandbox.
  • cuVS library is a C library. There is an in-progress Java API (to be released soon): Initial cut for a cuVS Java API rapidsai/cuvs#450. This uses Project Panama for integration.

This is an in-progress PR at the moment. Here is a way to test it out:

  • Clone the cuvs repository from the PR branch.
  • ./build.sh libcuvs && ./build.sh java
  • (The above will install the cuvs-java artifacts in local Maven repository)
  • Compile and use this branch in an IDE.

TODO:

  • TestCuVS works via IDE, but not via gradle (some native access security issues).
  • Make this branch work with released version of cuvs-java, once it is released.
  • Add more tests.
  • Publish benchmarks.

This work is mainly done by @narangvivek10, @punAhuja and me, along with help from @cjnolet.

@chatman
Copy link
Contributor Author

chatman commented Jan 10, 2025

@@ -22,6 +22,7 @@ plugins {
}

repositories {
mavenLocal()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove mavenLocal before merging, if it happens. There will be issues with it - some are very cryptic and hard to diagnose (like different artifact hashes). It's going to be a major headache if it's left in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, Dawid, will add a nocommit comment there. I had added nocommits earlier, but had to remove them to make "./gradlew check" work. Is there a way to have it check everything other than the nocommits?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can skip the entire task that looks for nocommits (and other things) by running ./gradlew check -x validateSourcePatterns

@navneet1v
Copy link
Contributor

@chatman thanks for creating the PR. This looks very interesting. is the idea here is the Lucene library will on a GPU machine and running the CUVS.

.withNumWriterThreads(cuvsWriterThreads)
.withIntermediateGraphDegree(intGraphDegree)
.withGraphDegree(graphDegree)
.withCagraGraphBuildAlgo(CagraGraphBuildAlgo.NN_DESCENT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some experience with building the Cagra index, and I think NN_DESCENT is faster in cagra index creation but it has a high GPU memory footprint. Should we use IVF_PQ here? Or can we have a hybrid approach where if doc count is < a specific number then we use NN_DESCENT else IVF_PQ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we encountered some crashes while trying IVF_PQ, hence using NN_DESCENT as the default. Maybe we can make it configurable once those crashes can be thoroughly verified.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems pretty far from ready yet. I left some comments on some glaring issues. However, there are other things like:

  • tests for queries
  • tests for the format
  • preventing bad behavior (e.g. using byte[])

I haven't touched on the validity of having an NVIDIA only GPU backed index in Lucene sandbox directly. The new dependencies are huge. IDK if whomever downloads and builds lucene then now have to download and build with these? I am unsure how the sandbox stuff works.

Comment on lines +204 to +213
float vectors[][] = new float[field.vectors.size()][field.vectors.get(0).length];
for (int i = 0; i < vectors.length; i++) {
for (int j = 0; j < vectors[i].length; j++) {
vectors[i][j] = field.vectors.get(i)[j];
}
}

cagraIndexBytes = createCagraIndex(vectors, new ArrayList<Integer>(field.vectors.keySet()));
bruteForceIndexBytes = createBruteForceIndex(vectors);
hnswIndexBytes = createHnswIndex(vectors);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, you load all the vectors onto heap. This is frankly untenable in production.

The amount of GC & JVM heap would be enormous on medium size indices (10M+).

Comment on lines +166 to +167
cagraIndexForHnsw =
new CagraIndex.Builder(resources).withDataset(vectors).withIndexParams(indexParams).build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cagra index builder should instead accept a file handle or something. Having to feed it on-heap vectors is not good.

Comment on lines +128 to +130
File tmpFile =
File.createTempFile(
"tmpindex", "cag"); // TODO: Should we make this a file with random names?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tmp file name basically means you cannot have different fields with different settings running at the same time? Seems like a very bad idea.

this.segmentWriteState.segmentInfo.getId(),
this.segmentWriteState.segmentSuffix);

CuVSSegmentFile cuVSFile = new CuVSSegmentFile(new SegmentOutputStream(cuVSIndex, 100000));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use all the built in inputs/outputs? Seems weird to have a unique buffered output.

Comment on lines +97 to +101
for (StackFrame s : stackTrace) {
if (s.toString().startsWith("org.apache.lucene.index.IndexWriter.merge")) {
isMergeCase = true;
// log.info("Reader opening on merge call");
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should instead use getMergeInstance which allows you do set merge settings, or whatever for a given reader.

Comment on lines +116 to +123
private Map<String, List<CuVSIndex>> loadCuVSIndex(ZipInputStream zis, boolean isMergeCase)
throws Throwable {
Map<String, List<CuVSIndex>> ret = new HashMap<String, List<CuVSIndex>>();
Map<String, CagraIndex> cagraIndexes = new HashMap<String, CagraIndex>();
Map<String, BruteForceIndex> bruteforceIndexes = new HashMap<String, BruteForceIndex>();
Map<String, HnswIndex> hnswIndexes = new HashMap<String, HnswIndex>();
Map<String, List<Integer>> mappings = new HashMap<String, List<Integer>>();
Map<String, List<float[]>> vectors = new HashMap<String, List<float[]>>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, every time this reader is opened, you load everything on heap and THEN build the cuvsindex by scratch?

Why not serialize the cuvindex itself? Doesn't it have a file format or something?

@chatman
Copy link
Contributor Author

chatman commented Jan 17, 2025

This seems pretty far from ready yet. I left some comments on some glaring issues. However, there are other things like:

* tests for queries

* tests for the format

* preventing bad behavior (e.g. using `byte[]`)

I haven't touched on the validity of having an NVIDIA only GPU backed index in Lucene sandbox directly. The new dependencies are huge. IDK if whomever downloads and builds lucene then now have to download and build with these? I am unsure how the sandbox stuff works.

Indeed, Ben. This is WIP at the moment. More tests are WIP. As for loading the entire index in byte[], this is something that we're working with the NVIDIA/cuVS team to see if streaming can be supported (right now it is not).

@chatman
Copy link
Contributor Author

chatman commented Jan 17, 2025

@chatman thanks for creating the PR. This looks very interesting. is the idea here is the Lucene library will on a GPU machine and running the CUVS.

Yes, exactly.

@chatman
Copy link
Contributor Author

chatman commented Jan 17, 2025

I haven't touched on the validity of having an NVIDIA only GPU backed index in Lucene sandbox directly. The new dependencies are huge. IDK if whomever downloads and builds lucene then now have to download and build with these? I am unsure how the sandbox stuff works.

This is something we need to work out as we want this to be. Here are my thoughts at the moment, and things we need consensus on:

  • Right now, the cuvs-java dependency comes from local Maven and it should come from Maven central once the artifacts are available.
  • TODO: If a system doesn't have cuda or GPUs, these codepaths should gracefully fail and indicate that support not available.
  • Continuous testing can be enabled on GPU enabled Jenkins instances (we can have a discussion around that).
  • To validate the integration at API level, some mock tests (that simulate the same functionality using the CPU) can be added to the cuvs-java API.
  • We can discuss whether shipping this by default with release artifacts is a problem or not.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next to Dawid's comments about the dependencies and the code, I have some major problems with:

  • cleanup the API and do not make everything public. Only the Codec and format class needs to be public, all other pkg-private.
  • remove public fields that are modifiable.
  • make all fields final, if possible.
  • there is an issue with a static field which gets initialized in the ctor. This looks wrong!
  • don't swallow exceptions and log errors. If you have no CUVS, fail hard. It makes no sense to proceed, because without the native library and a graphics adapter the whole codec won't work.

Anyways, I am happy to see this and that the underlying library was moved to Panama. So it was really a pleasure to talk to your colleagues last summer at berlinbuzzwords!

dependencies {
moduleApi project(':lucene:core')
moduleApi project(':lucene:queries')
moduleApi project(':lucene:facet')
moduleTestImplementation project(':lucene:test-framework')
moduleImplementation deps.commons.lang3
Copy link
Contributor

@uschindler uschindler Jan 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would really be great to remove the commons-lang3 (bullshit) library. Sorry for this, but this library is mostly useless. Sometimes some of its rarely used funtions can be replicated (if they are of good use for other parts). But in general the broken null handling in this library (it ignores nulls) and lots of legacy code that can be much easier written with more modern Java 21 APIs do not justify its usage.

My suggestion: Actually the code here only uses SerializationUtils.deserialize() and SerializationUtils.serialize(). Maybe just copy those to the Utils class of the package as it seems to be of not much use elsewhere in Lucene - and remove this dependency.

public final String fieldName;
public final ConcurrentHashMap<Integer, float[]> vectors =
new ConcurrentHashMap<Integer, float[]>();
public int fieldVectorDimension = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be final and only initialized by ctor.

try {
format = new CuVSVectorsFormat(1, 128, 64, MergeStrategy.NON_TRIVIAL_MERGE);
setKnnFormat(format);
} catch (LibraryNotFoundException ex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should fail hard. If cuvs is not available it should throw exception and not just log a severe error. Instead this throws NPE later when the knn format is retrieved.

return knnFormat;
}

public void setKnnFormat(KnnVectorsFormat format) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No setters please, codecs should be unmodifiable! It should initialize on ctor or fail if library cannot be loaded.

public class CuVSSegmentFile implements AutoCloseable {
private final ZipOutputStream zos;

private Set<String> filesAdded = new HashSet<String>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this must be final.

public final int cuvsWriterThreads;
public final int intGraphDegree;
public final int graphDegree;
public MergeStrategy mergeStrategy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should also be final and why is this one and the previous ones public?

// protected Logger log = Logger.getLogger(getClass().getName());

IndexInput vectorDataReader = null;
public String fileName = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all fields should not be public, and if they need to be public they should be final at least. I assume they can be package private or private.


// protected Logger log = Logger.getLogger(getClass().getName());

private List<CagraFieldVectorsWriter> fieldVectorWriters = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check which fields can be final.

*/
public class PerLeafCuVSKnnCollector implements KnnCollector {

public List<ScoreDoc> scoreDocs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again why is all this public?

/**
* Some Utils used in CuVS integration
*/
public class Util {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class is internal, make it pkg private

@@ -20,6 +20,9 @@
requires org.apache.lucene.core;
requires org.apache.lucene.queries;
requires org.apache.lucene.facet;
requires java.logging;
requires com.nvidia.cuvs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes cuVS a non-optional dependency. I would have expected to see this feature as optional, i.e. if not present you get a nice error message or something. I guess it kinda depends on how the cuVS and native implementation are tied together? That is, can one use or even instantiate types from the cuVS Java API without the native library being present?

Copy link
Contributor

@uschindler uschindler Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this in the snapshot module a problem. The code itsself should be written in a way that the codec can only be used, if the native library is there. Currently the code is a bit broken as it does not fail hard. I'd like to change this (see my review),
But if the dependency is on Maven central, it should work.
Of course we could make it also optional in module system, but then you would get CNFE when loading codecs or related classes, which would happen on SPI lookup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uschindler Agree - this feature should be optional. If the Java API is present and loadable without the native library - has at least one callable method, which we can call and catch, then that is fine. I'll try it out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code catches this, but just logs a warning and then fails later (due to NPE). I argued that the Codec should fail hard or at least fails when it is to be used.

This must be refactored before release:

  • Allow to load the Codec, but delay initialization of native library until the codec is actually used to create components.
  • Fail hard when an index with the custom codec is loaded and theres no native access.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking at the code I think best wozuld be the following: Create a small static "holder" class (inside the Codec impl) which has a static initializer loading and setting up the CUVS code. This should fail hard with some LinkageError if library is not there. The holder class should have getters for some CUVS entry points.

All getter methods to create formats and readers in the Codec delegate to the getters in the holder and pass through any exceptions. In addition, all CUVS classes should be package private (I mentioned this already) and only let the codec and a essential config classes public.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same for the postings formats and other public components loaded via SPI.

@ChrisHegarty
Copy link
Contributor

ChrisHegarty commented Jan 27, 2025

Hi, cuvs-java-25.02 is currently compiled with JDK 22, so has a minimum class file version of 66. Lucene compiles with a minimum of JDK 21, class file version 65. The reason why cuvs-java has a minimum of JDK 22 is that is uses Panama/foreign APIs directly, which is fine.

We have two options here:

  1. restructure cuvs-java so that it compiles to a minimum JDK 21, with an mr jar/version specific loading. Maybe it can even strip use class file version 65 and strip the preview bit, similar to what Lucene does. Depends on the nature of what the code is doing - which I've yet to fully look into. All said, this is quite involved, but may make that cuvs-java api more generally useful. ( one can at least develop against it more easily, if not run tests, etc )

  2. Restructure the code in Lucene so that it uses a toolchain and javac from JDK 22 and a java22 sourceset. Also make its loading conditional on the JDK version and/or use an mrjar.

Example of how this is currently failing:

Error occurred during initialization of boot layer
java.lang.module.FindException: Error reading module: /Users/chegar/.m2/repository/com/nvidia/cuvs/cuvs-java/25.02/cuvs-java-25.02.jar
Caused by: java.lang.module.InvalidModuleDescriptorException: Unsupported major.minor version 66.0

FTR - there is no suggestion of intent to increase the required minimum JDK for Lucene - it will remain JDK 21.

@ChrisHegarty
Copy link
Contributor

I want to start cleaning some of the outstanding items in this PR, but I do not have push access to SearchScale:cuvs-integration-main. Can I get access, or is there a better way to proceed?

@chatman
Copy link
Contributor Author

chatman commented Jan 27, 2025

I want to start cleaning some of the outstanding items in this PR, but I do not have push access to SearchScale:cuvs-integration-main. Can I get access, or is there a better way to proceed?

I've sent you a maintainer level access invitation. Thanks for your help!

@chatman
Copy link
Contributor Author

chatman commented Jan 27, 2025

restructure cuvs-java so that it compiles to a minimum JDK 21, with an mr jar/version specific loading. Maybe it can even strip use class file version 65 and strip the preview bit, similar to what Lucene does. Depends on the nature of what the code is doing - which I've yet to fully look into. All said, this is quite involved, but may make that cuvs-java api more generally useful. ( one can at least develop against it more easily, if not run tests, etc )

If this is possible, it would be very nice! I think @narangvivek10 has more details, but as I understand, the FFM functionality is being used in Project Panama, but those APIs are available only since JDK22.

@dweiss
Copy link
Contributor

dweiss commented Jan 27, 2025

The third option is to bump the minimum Java requirement to Java 22 on main? I know it's an interim release but maybe we should just do it, anticipating the next major lts (due to be released in September)?

@uschindler
Copy link
Contributor

uschindler commented Jan 27, 2025

restructure cuvs-java so that it compiles to a minimum JDK 21, with an mr jar/version specific loading. Maybe it can even strip use class file version 65 and strip the preview bit, similar to what Lucene does. Depends on the nature of what the code is doing - which I've yet to fully look into. All said, this is quite involved, but may make that cuvs-java api more generally useful. ( one can at least develop against it more easily, if not run tests, etc )

If this is possible, it would be very nice! I think @narangvivek10 has more details, but as I understand, the FFM functionality is being used in Project Panama, but those APIs are available only since JDK22.

Tha APIs are available in Java 21, too (with minimal changes regarding some specific parts like string handling). If you omit those, you can compile against java 21 and later strip the preview bit from all classes (Lucebe did this a while ago). Nowadays Lucene extracted a stub-jar with all public class signatures (without code) from java 21 and compiles against it with some compiler tricks.

I don't know if the APIs which are different are used by cuvs, so I can't give a recommendation.

If cuvs is only available with Java 22 due to incompatibility of APIs, we need to either upgrade minimum version of Lucene or we need toolkit magic to only compile this for Java 22 (which I'd like to avoid).

P.S. I'd like to bite into the apple and make Java 22 minimum requirement. Then we can use Memory segments in our public API and clean up a lot. I know people will argue and complain, but we have to go that route soon. I was heavily arguing in the OpenJDK community to make Panama be non-preview in 21, but that's now too late.

The best may be to delay this PR till next Java LTS comes out and jump to next LTS as soon as possible.

@benwtrent
Copy link
Member

@dweiss While I think that might work for unused/snapshot Lucene releases, I think @chatman et. al. is aiming for usage in the current Lucene Main so that Lucene focused search engines (e.g. OpenSearch, Elasticsearch, etc.) can take advantage of the cuvs format if they want.

@chatman the only way to get this to work in current major releases (e.g. this year) for Apache Lucene engines is to handle this Java version code and use the MR-Jar module stuff. Meaning, it cannot have a hard compilation dependency on java 22.

I'd like to bite into the apple and make Java 22 minimum requirement. Then we can use Memory segments in our public API and clean up a lot.

You mean bump the minimum Java for every Java release until the next LTS, assuming that Lucene 11 will require Java 25? I would be for this myself :). Releasing Lucene 11 with Java 25 will have many nice benefits.

But, I think there is a strong desire for @chatman and company to get this into production without waiting another year+.

@uschindler
Copy link
Contributor

uschindler commented Jan 27, 2025

P.P.S. Elasticsearch has a Gradle plugin to strip preview flags. Basically it patches one short at a fixed position in all class files that are created by Javac.

@ChrisHegarty
Copy link
Contributor

Tha APIs are available in Java 21, too (with minimal changes regarding some specific parts like string handling). If you omit those, you can compile against java 21 and later strip the preview bit from all classes (Lucebe did this a while ago). Nowadays Lucene extracted a stub-jar with all public class signatures (without code) from java 21 and compiles against it with some compiler tricks.

I don't know if the APIs which are different are used by cuvs, so I can't give a recommendation.

From a quick skim, then I think we can manage to make this work. For the small shimmer in the foreign API, we can provide static version specific stubs. We do similar in Elasticsearch.

If cuvs is only available with Java 22 due to incompatibility of APIs, we need to either upgrade minimum version of Lucene or we need toolkit magic to only compile this for Java 22 (which I'd like to avoid).

At a minimum, the cuvs-java API exposes MemorySegment, so all consumers - Lucene in this case - would have to deal with the preview-ness nonsense!. But again, this may be doable with a little fiddling in gradle and the class file.

P.S. I'd like to bite into the apple and make Java 22 minimum requirement. Then we can use Memory segments in our public API and clean up a lot.

As you know, I'm strongly in favour of moving to newer java versions. However, this is quite a move. I will result in every project consuming Lucene to keep upgrading constantly until the next Java LTS. Which I'm personally ok with, but I'm not sure anyone has really done this before.

@ChrisHegarty
Copy link
Contributor

P.P.S. Elasticsearch has a Gradle plugin to strip preview flags. Basically it patches one byte in all class files that are created by Javac.

Yes, we can do this. Along with an MRJAR plugin. In ES we use the toolchain to resolve newer JDKs, e.g. 22, etc

@uschindler
Copy link
Contributor

One additional important thing: no public API in this new Lucene module/codec must export any preview API, so it must all be private/pkg-private.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants