-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrating GPU based Vector Search using cuVS #14131
base: main
Are you sure you want to change the base?
Conversation
@@ -22,6 +22,7 @@ plugins { | |||
} | |||
|
|||
repositories { | |||
mavenLocal() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove mavenLocal before merging, if it happens. There will be issues with it - some are very cryptic and hard to diagnose (like different artifact hashes). It's going to be a major headache if it's left in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, Dawid, will add a nocommit comment there. I had added nocommits earlier, but had to remove them to make "./gradlew check" work. Is there a way to have it check everything other than the nocommits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can skip the entire task that looks for nocommits (and other things) by running ./gradlew check -x validateSourcePatterns
@chatman thanks for creating the PR. This looks very interesting. is the idea here is the Lucene library will on a GPU machine and running the CUVS. |
.withNumWriterThreads(cuvsWriterThreads) | ||
.withIntermediateGraphDegree(intGraphDegree) | ||
.withGraphDegree(graphDegree) | ||
.withCagraGraphBuildAlgo(CagraGraphBuildAlgo.NN_DESCENT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some experience with building the Cagra index, and I think NN_DESCENT is faster in cagra index creation but it has a high GPU memory footprint. Should we use IVF_PQ here? Or can we have a hybrid approach where if doc count is < a specific number then we use NN_DESCENT else IVF_PQ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, we encountered some crashes while trying IVF_PQ, hence using NN_DESCENT as the default. Maybe we can make it configurable once those crashes can be thoroughly verified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems pretty far from ready yet. I left some comments on some glaring issues. However, there are other things like:
- tests for queries
- tests for the format
- preventing bad behavior (e.g. using
byte[]
)
I haven't touched on the validity of having an NVIDIA only GPU backed index in Lucene sandbox directly. The new dependencies are huge. IDK if whomever downloads and builds lucene then now have to download and build with these? I am unsure how the sandbox stuff works.
float vectors[][] = new float[field.vectors.size()][field.vectors.get(0).length]; | ||
for (int i = 0; i < vectors.length; i++) { | ||
for (int j = 0; j < vectors[i].length; j++) { | ||
vectors[i][j] = field.vectors.get(i)[j]; | ||
} | ||
} | ||
|
||
cagraIndexBytes = createCagraIndex(vectors, new ArrayList<Integer>(field.vectors.keySet())); | ||
bruteForceIndexBytes = createBruteForceIndex(vectors); | ||
hnswIndexBytes = createHnswIndex(vectors); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, you load all the vectors onto heap. This is frankly untenable in production.
The amount of GC & JVM heap would be enormous on medium size indices (10M+).
cagraIndexForHnsw = | ||
new CagraIndex.Builder(resources).withDataset(vectors).withIndexParams(indexParams).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cagra index builder should instead accept a file handle or something. Having to feed it on-heap vectors is not good.
File tmpFile = | ||
File.createTempFile( | ||
"tmpindex", "cag"); // TODO: Should we make this a file with random names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tmp file name basically means you cannot have different fields with different settings running at the same time? Seems like a very bad idea.
this.segmentWriteState.segmentInfo.getId(), | ||
this.segmentWriteState.segmentSuffix); | ||
|
||
CuVSSegmentFile cuVSFile = new CuVSSegmentFile(new SegmentOutputStream(cuVSIndex, 100000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use all the built in inputs/outputs? Seems weird to have a unique buffered output.
for (StackFrame s : stackTrace) { | ||
if (s.toString().startsWith("org.apache.lucene.index.IndexWriter.merge")) { | ||
isMergeCase = true; | ||
// log.info("Reader opening on merge call"); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should instead use getMergeInstance
which allows you do set merge settings, or whatever for a given reader.
private Map<String, List<CuVSIndex>> loadCuVSIndex(ZipInputStream zis, boolean isMergeCase) | ||
throws Throwable { | ||
Map<String, List<CuVSIndex>> ret = new HashMap<String, List<CuVSIndex>>(); | ||
Map<String, CagraIndex> cagraIndexes = new HashMap<String, CagraIndex>(); | ||
Map<String, BruteForceIndex> bruteforceIndexes = new HashMap<String, BruteForceIndex>(); | ||
Map<String, HnswIndex> hnswIndexes = new HashMap<String, HnswIndex>(); | ||
Map<String, List<Integer>> mappings = new HashMap<String, List<Integer>>(); | ||
Map<String, List<float[]>> vectors = new HashMap<String, List<float[]>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I can tell, every time this reader is opened, you load everything on heap and THEN build the cuvsindex by scratch?
Why not serialize the cuvindex itself? Doesn't it have a file format or something?
Indeed, Ben. This is WIP at the moment. More tests are WIP. As for loading the entire index in byte[], this is something that we're working with the NVIDIA/cuVS team to see if streaming can be supported (right now it is not). |
Yes, exactly. |
This is something we need to work out as we want this to be. Here are my thoughts at the moment, and things we need consensus on:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next to Dawid's comments about the dependencies and the code, I have some major problems with:
- cleanup the API and do not make everything public. Only the Codec and format class needs to be public, all other pkg-private.
- remove public fields that are modifiable.
- make all fields final, if possible.
- there is an issue with a static field which gets initialized in the ctor. This looks wrong!
- don't swallow exceptions and log errors. If you have no CUVS, fail hard. It makes no sense to proceed, because without the native library and a graphics adapter the whole codec won't work.
Anyways, I am happy to see this and that the underlying library was moved to Panama. So it was really a pleasure to talk to your colleagues last summer at berlinbuzzwords!
dependencies { | ||
moduleApi project(':lucene:core') | ||
moduleApi project(':lucene:queries') | ||
moduleApi project(':lucene:facet') | ||
moduleTestImplementation project(':lucene:test-framework') | ||
moduleImplementation deps.commons.lang3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would really be great to remove the commons-lang3 (bullshit) library. Sorry for this, but this library is mostly useless. Sometimes some of its rarely used funtions can be replicated (if they are of good use for other parts). But in general the broken null handling in this library (it ignores nulls) and lots of legacy code that can be much easier written with more modern Java 21 APIs do not justify its usage.
My suggestion: Actually the code here only uses SerializationUtils.deserialize()
and SerializationUtils.serialize()
. Maybe just copy those to the Utils class of the package as it seems to be of not much use elsewhere in Lucene - and remove this dependency.
public final String fieldName; | ||
public final ConcurrentHashMap<Integer, float[]> vectors = | ||
new ConcurrentHashMap<Integer, float[]>(); | ||
public int fieldVectorDimension = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be final and only initialized by ctor.
try { | ||
format = new CuVSVectorsFormat(1, 128, 64, MergeStrategy.NON_TRIVIAL_MERGE); | ||
setKnnFormat(format); | ||
} catch (LibraryNotFoundException ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should fail hard. If cuvs is not available it should throw exception and not just log a severe error. Instead this throws NPE later when the knn format is retrieved.
return knnFormat; | ||
} | ||
|
||
public void setKnnFormat(KnnVectorsFormat format) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No setters please, codecs should be unmodifiable! It should initialize on ctor or fail if library cannot be loaded.
public class CuVSSegmentFile implements AutoCloseable { | ||
private final ZipOutputStream zos; | ||
|
||
private Set<String> filesAdded = new HashSet<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this must be final.
public final int cuvsWriterThreads; | ||
public final int intGraphDegree; | ||
public final int graphDegree; | ||
public MergeStrategy mergeStrategy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should also be final and why is this one and the previous ones public?
// protected Logger log = Logger.getLogger(getClass().getName()); | ||
|
||
IndexInput vectorDataReader = null; | ||
public String fileName = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all fields should not be public, and if they need to be public they should be final at least. I assume they can be package private or private.
|
||
// protected Logger log = Logger.getLogger(getClass().getName()); | ||
|
||
private List<CagraFieldVectorsWriter> fieldVectorWriters = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check which fields can be final.
*/ | ||
public class PerLeafCuVSKnnCollector implements KnnCollector { | ||
|
||
public List<ScoreDoc> scoreDocs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again why is all this public?
/** | ||
* Some Utils used in CuVS integration | ||
*/ | ||
public class Util { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class is internal, make it pkg private
@@ -20,6 +20,9 @@ | |||
requires org.apache.lucene.core; | |||
requires org.apache.lucene.queries; | |||
requires org.apache.lucene.facet; | |||
requires java.logging; | |||
requires com.nvidia.cuvs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes cuVS a non-optional dependency. I would have expected to see this feature as optional, i.e. if not present you get a nice error message or something. I guess it kinda depends on how the cuVS and native implementation are tied together? That is, can one use or even instantiate types from the cuVS Java API without the native library being present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this in the snapshot module a problem. The code itsself should be written in a way that the codec can only be used, if the native library is there. Currently the code is a bit broken as it does not fail hard. I'd like to change this (see my review),
But if the dependency is on Maven central, it should work.
Of course we could make it also optional in module system, but then you would get CNFE when loading codecs or related classes, which would happen on SPI lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@uschindler Agree - this feature should be optional. If the Java API is present and loadable without the native library - has at least one callable method, which we can call and catch, then that is fine. I'll try it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code catches this, but just logs a warning and then fails later (due to NPE). I argued that the Codec should fail hard or at least fails when it is to be used.
This must be refactored before release:
- Allow to load the Codec, but delay initialization of native library until the codec is actually used to create components.
- Fail hard when an index with the custom codec is loaded and theres no native access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When looking at the code I think best wozuld be the following: Create a small static "holder" class (inside the Codec impl) which has a static initializer loading and setting up the CUVS code. This should fail hard with some LinkageError if library is not there. The holder class should have getters for some CUVS entry points.
All getter methods to create formats and readers in the Codec delegate to the getters in the holder and pass through any exceptions. In addition, all CUVS classes should be package private (I mentioned this already) and only let the codec and a essential config classes public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same for the postings formats and other public components loaded via SPI.
Hi, We have two options here:
Example of how this is currently failing:
FTR - there is no suggestion of intent to increase the required minimum JDK for Lucene - it will remain JDK 21. |
I want to start cleaning some of the outstanding items in this PR, but I do not have push access to SearchScale:cuvs-integration-main. Can I get access, or is there a better way to proceed? |
I've sent you a maintainer level access invitation. Thanks for your help! |
If this is possible, it would be very nice! I think @narangvivek10 has more details, but as I understand, the FFM functionality is being used in Project Panama, but those APIs are available only since JDK22. |
The third option is to bump the minimum Java requirement to Java 22 on main? I know it's an interim release but maybe we should just do it, anticipating the next major lts (due to be released in September)? |
Tha APIs are available in Java 21, too (with minimal changes regarding some specific parts like string handling). If you omit those, you can compile against java 21 and later strip the preview bit from all classes (Lucebe did this a while ago). Nowadays Lucene extracted a stub-jar with all public class signatures (without code) from java 21 and compiles against it with some compiler tricks. I don't know if the APIs which are different are used by cuvs, so I can't give a recommendation. If cuvs is only available with Java 22 due to incompatibility of APIs, we need to either upgrade minimum version of Lucene or we need toolkit magic to only compile this for Java 22 (which I'd like to avoid). P.S. I'd like to bite into the apple and make Java 22 minimum requirement. Then we can use Memory segments in our public API and clean up a lot. I know people will argue and complain, but we have to go that route soon. I was heavily arguing in the OpenJDK community to make Panama be non-preview in 21, but that's now too late. The best may be to delay this PR till next Java LTS comes out and jump to next LTS as soon as possible. |
@dweiss While I think that might work for unused/snapshot Lucene releases, I think @chatman et. al. is aiming for usage in the current Lucene Main so that Lucene focused search engines (e.g. OpenSearch, Elasticsearch, etc.) can take advantage of the cuvs format if they want. @chatman the only way to get this to work in current major releases (e.g. this year) for Apache Lucene engines is to handle this Java version code and use the MR-Jar module stuff. Meaning, it cannot have a hard compilation dependency on java 22.
You mean bump the minimum Java for every Java release until the next LTS, assuming that Lucene 11 will require Java 25? I would be for this myself :). Releasing Lucene 11 with Java 25 will have many nice benefits. But, I think there is a strong desire for @chatman and company to get this into production without waiting another year+. |
P.P.S. Elasticsearch has a Gradle plugin to strip preview flags. Basically it patches one short at a fixed position in all class files that are created by Javac. |
From a quick skim, then I think we can manage to make this work. For the small shimmer in the foreign API, we can provide static version specific stubs. We do similar in Elasticsearch.
At a minimum, the cuvs-java API exposes MemorySegment, so all consumers - Lucene in this case - would have to deal with the preview-ness nonsense!. But again, this may be doable with a little fiddling in gradle and the class file.
As you know, I'm strongly in favour of moving to newer java versions. However, this is quite a move. I will result in every project consuming Lucene to keep upgrading constantly until the next Java LTS. Which I'm personally ok with, but I'm not sure anyone has really done this before. |
Yes, we can do this. Along with an MRJAR plugin. In ES we use the toolchain to resolve newer JDKs, e.g. 22, etc |
One additional important thing: no public API in this new Lucene module/codec must export any preview API, so it must all be private/pkg-private. |
Description
This is an in-progress PR at the moment. Here is a way to test it out:
TODO:
This work is mainly done by @narangvivek10, @punAhuja and me, along with help from @cjnolet.