-
Notifications
You must be signed in to change notification settings - Fork 1.2k
GroupVarInt Encoding Implementation for HNSW Graphs #14932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Hi @aylonsk ! Thank you for digging into this issue. I am sure you are still working on it, but I had some feedback:
Handling the format change can be complicated. So, my first step would be to justify the change with performance metrics. Then do all the complicated format stuff. Good luck! |
Thanks for your response! My apologies, I forgot to post my results from LuceneUtil. Because I noticed variance between each run, I decided to test each set of hyperparameters 10 times and take the median for latency, netCPU, and AvgCpuCount. Therefore, my results aren't in the standard table format. I ran 12 comparison tests in total, each a different combination of HPs. Here were the variables I kept the same: (topK=100, fanout=50, beamWidth=250, numSegments=1) Here are some specific tests: BENCHMARKS (10 runs per test):
Baseline: Candidate: Latency Improvement: ~4.11% speedup
Baseline: Candidate: Latency Improvement: ~4.3% speedup
Baseline: Candidate: Latency Improvement: ~3.17% speedup
Baseline: Candidate: Latency Improvement: 5.49% speedup
Baseline: Candidate: Latency Improvement: ~18.1% speedup While the degree of improvement varied between tests, all tests except 1 showed improvement in latency over the baseline. Considering how simple and non-intrusive this implementation is, I think it would be an easy net benefit. Thank you for letting me know about the backwards compatibility requirement. I will look into fixing that tomorrow. |
@aylonsk great looking numbers! I expect for cheaper vector ops (e.g. single bit quantization), the impact is even higher. |
@aylonsk To handle backward compatibility, I'd recommend doing the following:
|
Hello, and thank you for all of your suggestions. I have updated the reader and format files accordingly to allow for backwards compatibility using a VERSION_GROUPVARINT parameter in the format class, and an interface near the top level of the reader class to make impact on runtime minimal. The testing part was trickier, as I needed to create a new class (TestLucene99HnswVectorsFormatV2) that would extend the same class (BaseKnnVectorsFormatTestCase) as the original TestLucene99HnswVectorsFormat, but with a getCodec() method that would return the a format with the old writer and the new reader. At first I thought I would have to create my own test to read, write, and check docIDs in HNSW graphs, but then I realized that there are already tests in the BaseKnnVectorsFormatTestCase class that do this (such as testRecall). To make this possible, I created two new classes in the lucene99 backwards_codecs directory: A VarInt-only writer (Lucene99HnswVectorsWriterV0) and a format that returns the current backwards-compatible reader in its fieldsReader class and the VarInt-only writer in its fieldsWriter class. To confirm the validity of the test, a VarInt-only reader was also created but not commited (Lucene99HnswVectorsReaderV0), and when I flipped the format class to using the new writer and the old reader, the testRecall test failed. Any questions/comments/suggestions are appreciated. Thank you! |
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
@@ -212,7 +213,7 @@ public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException | |||
|
|||
@Override | |||
public int getMaxDimensions(String fieldName) { | |||
return 1024; | |||
return 4096; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably don't want to make this change? At least not as part of this PR :)
@@ -76,6 +79,7 @@ public final class Lucene99HnswVectorsReader extends KnnVectorsReader | |||
private final FieldInfos fieldInfos; | |||
private final IntObjectHashMap<FieldEntry> fields; | |||
private final IndexInput vectorIndex; | |||
private final Populator dataReader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dataReader
will confuse people since that is the name of a class (DataReader
). Maybe call it decoder
? Or neighborDecoder
?
@@ -0,0 +1,233 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think we are introducing these backward_codecs in order to be able to write graphs in the old VInt (v0) format in order to test that we are able to read them back with the existing codec?
Given that, I think any additional classes we add here could live in the tests rather than in backward_codecs, since we won't need these for reading old indexes (which is what backward_codecs are for).
Having said that, I wonder if we could add a test-only constructor to the format that would enable it to continue writing the old format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we have a package private constructor that allows setting the version for the writer? That way tests can write with the old version if necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think that is what you mean by "test only ctor".
I agree, a new package private ctor is likely best and easiest
Description
For HNSW Graphs, the alternate encoding I implemented was GroupVarInt encoding, which in theory should be less costly both in space and runtime. The pros of this encoding would be that it allocates all of the space for a group of 4 integers in advance, and that it can encode using all 8 bits per byte instead of the 7 for VarInt. The cons are that it can only encode integers (<=32bits), and uses the first byte to encode the size of each number. However, since we are using delta encoding to condense our integers, they will never be larger than 32bits, making this irrelevant.
Closes #12871