Decode doc ids in BKD leaves with auto-vectorized loops #14203
Conversation
On an AVX-512 Linux x86 machine:
Confused +1 ... but here is the comparison of step512 (baseline) and step32 (candidate):
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
@jpountz Hi, do you have any idea how we should move forward on this optimization? A few thoughts:
BTW, I got the previous AVX-512 results on a
I have some small concerns:
But my main concern is more that I would like to better understand why 512 performs so much better. There must be something that happens with this 512 step that doesn't happen otherwise, such as using different instructions, loop unrolling, better CPU pipelining, or something else. I have some discomfort merging something that is faster without having at least an intuition of why it's faster, so that I can also understand which JVMs and CPUs would enable this speedup. Could pipelining be the reason, since 24 (bits per value) * 32 (step) < 2 * 512 (bit width of SIMD instructions)? But then something like 128 should perform well, while your benchmark suggests it's still much worse than 512?
Thanks for pointing this out. I studied the asm profile again, and I can see that at least loop unrolling differs there. According to the asm printed by JMH, for bpv24 decoding:
This corresponds to the JMH result: Vector API > InnerLoop step-512 > InnerLoop step-128. Things might change in luceneutil because we found InnerLoop step-512 faster than the Vector API there. I confirmed the luceneutil result of step-512 (baseline) vs step-128 (candidate):
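For context, a JMH harness for this kind of step-size comparison could look roughly like the sketch below. It is hypothetical (class and method names are illustrative, not the actual benchmark from the patch): the constant step only becomes visible to C2 once decode is inlined and its argument constant-propagated, and running it under JMH's perfasm profiler produces the kind of asm output discussed above.

import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class InnerLoopStepBenchmark {
  private final int[] scratch = new int[512];
  private final int[] docIDs = new int[512];

  @Setup
  public void setup() {
    Random r = new Random(42);
    for (int i = 0; i < scratch.length; i++) {
      scratch[i] = r.nextInt();
    }
  }

  @Benchmark
  public int[] step128() {
    return decode(128);
  }

  @Benchmark
  public int[] step512() {
    return decode(512);
  }

  // Once inlined, the constant step turns the inner loop into a
  // constant-trip-count loop that C2 may unroll and auto-vectorize.
  private int[] decode(int step) {
    for (int base = 0; base < 512; base += step) {
      for (int i = base; i < base + step; ++i) {
        docIDs[i] = scratch[i] >>> 8;
      }
    }
    return docIDs;
  }
}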
Thanks for running benchmarks. So it looks like the JVM doesn't think these shorter loops (with step 128) are worth unrolling? This makes me wonder how something like the following performs on your AVX-512 CPU. I think you had something similar in one of your previous iterations. On my machine it's on par with the current version.

private void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
  if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
    // Same format, but enabling the JVM to specialize the decoding logic for the default number
    // of points per node proved to help on benchmarks
    doReadInts24(in, 512, docIDs);
  } else {
    doReadInts24(in, count, docIDs);
  }
}

private void doReadInts24(IndexInput in, int count, int[] docIDs) throws IOException {
  // Read the first (count - count % 4) values
  int quarter = count >> 2;
  int numInts = quarter * 3;
  in.readInts(scratch, 0, numInts);
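  // Layout: each of these ints carries one doc ID in its high 24 bits, plus
  // one byte of a trailing doc ID in its low 8 bits.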
  for (int i = 0; i < numInts; ++i) {
    docIDs[i] = scratch[i] >>> 8;
    scratch[i] &= 0xFF;
  }
  for (int i = 0; i < quarter; ++i) {
    docIDs[numInts + i] = scratch[i]
        | (scratch[quarter + i] << 8)
        | (scratch[2 * quarter + i] << 16);
  }
  // Now read the remaining 0, 1, 2 or 3 values
  for (int i = quarter << 2; i < count; ++i) {
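    // Low 16 bits come from the short, bits 16..23 from the byte.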
    docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
  }
}
On the AVX-512 machine:
I pushed the benchmark code to the patch; here are the results on my machine:
LuceneUtil hybridInnerLoop (baseline) vs specializedRead (candidate)
hybridInnerLoop (baseline) vs specializedDecodeMaskInRemainder (candidate)
I can refactor the code to the specialized decoding if it makes sense to you. BTW, should we disable the flexibility of changing
Again, thanks a lot for running benchmarks.
That would be great, thank you. Sorry for making it hard for you to move this PR forward; I was a bit annoyed that we needed something complicated to speed things up. I like the simplicity of specializedDecodeMaskInRemainder.
In my opinion, it's good enough the way things are today, as the default codec doesn't allow configuring the number of points per leaf node.
No apologies needed, it was exciting to play with the asm code, vectorize loops and make the code simpler! I learned so much through these iterations :)
LGTM. IMO we should bind the OpenStreet benchmark to DEFAULT_MAX_POINTS_IN_LEAF_NODE after this gets merged.
It looks good in general; I just left minor comments. Thank you!
  this(VERSION_CURRENT);
}

public Lucene90PointsFormat(int version) {
Could it be pkg-private? I think we only need it for testing?
    SegmentWriteState writeState, int maxPointsInLeafNode, double maxMBSortInHeap)
    throws IOException {
  this(writeState, maxPointsInLeafNode, maxMBSortInHeap, Lucene90PointsFormat.VERSION_CURRENT);
}
Let's make all constructors that take a version pkg-private?
      BKDWriter.VERSION_CURRENT);
}

public BKDWriter(
Can you add javadocs noting that this ctor should only be used for testing with older versions?
final int longIdx = i + numInts + start;
scratch[i] |= docIds[longIdx] >>> 16;
scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
scratch[i + quarter * 2] |= docIds[longIdx] & 0xFF; |
nit: maybe write bytes in little-endian order for consistency, so:

scratch[i] |= docIds[longIdx] & 0xFF;
scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
scratch[i + quarter * 2] |= docIds[longIdx] >>> 16;

etc.?
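For reference, here is a hypothetical sketch of the full encoder around this hunk with the suggested little-endian order applied; it is reconstructed from the decoder shown earlier in the thread, not copied from the patch, and assumes a preallocated int[] scratch field and doc IDs that fit in 24 bits.

private void writeInts24(DataOutput out, int start, int count, int[] docIds)
    throws IOException {
  int quarter = count >> 2;
  int numInts = quarter * 3;
  // Each of the first 3 * quarter doc IDs occupies the high 24 bits of one int.
  for (int i = 0; i < numInts; ++i) {
    scratch[i] = docIds[start + i] << 8;
  }
  // The three bytes of each remaining doc ID fill the low 8 bits of three
  // ints, one per group of `quarter`, lowest byte first.
  for (int i = 0; i < quarter; ++i) {
    final int longIdx = i + numInts + start;
    scratch[i] |= docIds[longIdx] & 0xFF;
    scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
    scratch[i + quarter * 2] |= docIds[longIdx] >>> 16;
  }
  for (int i = 0; i < numInts; ++i) {
    out.writeInt(scratch[i]);
  }
  // Trailing 0-3 values: low 16 bits as a short, then bits 16..23 as a byte,
  // matching the readShort/readByte tail on the decode side.
  for (int i = start + (quarter << 2); i < start + count; ++i) {
    out.writeShort((short) docIds[i]);
    out.writeByte((byte) (docIds[i] >>> 16));
  }
}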
// Now read the remaining 0, 1, 2 or 3 values
for (int i = quarter << 2; i < count; ++i) {
  docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
}
Out of curiosity, does it hurt performance if we add this as part of decode24? This would help save the above assertion.
I want to keep decode24 small, so I put it under the if/else block to save the assertion. luceneutil and JMH showed it has similar performance.
Looks great!
The nightly benchmark confirmed the speedup: https://benchmarks.mikemccandless.com/2025.03.16.18.04.58.html. Thanks again for the profiling guidance and for helping figure out simpler and faster code!
I raised a PR for the annotation: mikemccand/luceneutil#354.
Fantastic speedup. Nice to see tasks like
Context: #14176
I found that when running with a constant block size (512), the JIT can auto-vectorize the decoding loop, but it does not when the block size becomes variable, which can be the case in real BKD leaves. This PR proposes using the Vector API to decode doc IDs in BKD leaves.
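For illustration, a minimal sketch of what Vector API decoding of the high-24-bits portion can look like (requires --add-modules jdk.incubator.vector; the class and method names are illustrative, not the actual code from this PR):

import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class Shift24 {
  private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

  // Computes docIDs[i] = scratch[i] >>> 8 for the first numInts values in SIMD lanes.
  static void shiftRight8(int[] scratch, int[] docIDs, int numInts) {
    int i = 0;
    for (int bound = SPECIES.loopBound(numInts); i < bound; i += SPECIES.length()) {
      IntVector.fromArray(SPECIES, scratch, i)
          .lanewise(VectorOperators.LSHR, 8)
          .intoArray(docIDs, i);
    }
    for (; i < numInts; ++i) { // scalar tail
      docIDs[i] = scratch[i] >>> 8;
    }
  }
}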
Mac M2
Linux x86 (AVX-512 supported)