Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves #14176
Conversation
These numbers look great! I want to run this change on the geo benchmarks, where I expect similar speed-ups; I am planning to do it early next week. One thing I am unhappy with is the introduction of another scratch array. I wonder if we can move the docIds array here into the DocIdsWriter and avoid introducing this variable?
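One possible shape of that refactor, as a hedged sketch (the field and accessor below are hypothetical, not the actual DocIdsWriter API):

```java
// Hypothetical sketch: own the scratch buffer inside DocIdsWriter instead of
// threading an extra array through the read methods. The buffer grows lazily
// and is reused across leaves. Names are illustrative only.
final class DocIdsWriter {
  private int[] scratchDocIds = new int[0];

  private int[] scratchDocIds(int count) {
    if (scratchDocIds.length < count) {
      scratchDocIds = new int[count]; // real code would likely oversize
    }
    return scratchDocIds;
  }
}
```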
The speedup makes sense to me: the previous pattern could not auto-vectorize while the new one can. And 24 bits per value should apply to all segments with fewer than 16M docs, so it's quite widely applicable.
The thing that makes me a bit unhappy is that we're losing a lot of backward-compatibility testing for indices that use the old 24-bit encoding. Is there a way we can preserve this testing somehow?
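To illustrate the shape difference, here is a simplified sketch of the two decoding patterns (this is illustrative only; the packing layouts and `Bpv24DecodeSketch` are not the actual Lucene DocIdsWriter code):

```java
// A simplified sketch of the two 24-bit decoding shapes.
final class Bpv24DecodeSketch {

  // Interleaved packing: 8 values per 3 longs (assumes count % 8 == 0). Every
  // lane in the loop body needs a different shift/mask pattern, which defeats
  // HotSpot's auto-vectorizer.
  static void decodeInterleaved(long[] packed, int[] dst, int count) {
    for (int i = 0, p = 0; i < count; i += 8, p += 3) {
      long l1 = packed[p], l2 = packed[p + 1], l3 = packed[p + 2];
      dst[i] = (int) (l1 >>> 40);
      dst[i + 1] = (int) (l1 >>> 16) & 0xFFFFFF;
      dst[i + 2] = (int) (((l1 & 0xFFFF) << 8) | (l2 >>> 56));
      dst[i + 3] = (int) (l2 >>> 32) & 0xFFFFFF;
      dst[i + 4] = (int) (l2 >>> 8) & 0xFFFFFF;
      dst[i + 5] = (int) (((l2 & 0xFF) << 16) | (l3 >>> 48));
      dst[i + 6] = (int) (l3 >>> 24) & 0xFFFFFF;
      dst[i + 7] = (int) l3 & 0xFFFFFF;
    }
  }

  // Strided packing: values i, i+q, i+2q, i+3q share one long and one int
  // (assumes count % 4 == 0). Every iteration applies the same shifts, so the
  // loop body is uniform and auto-vectorizes.
  static void decodeStrided(long[] longs, int[] ints, int[] dst, int count) {
    final int q = count >> 2; // quarterLen
    for (int i = 0; i < q; i++) {
      long l = longs[i];
      int w = ints[i];
      dst[i] = (int) (l >>> 40);
      dst[i + q] = (int) (l >>> 16) & 0xFFFFFF;
      dst[i + 2 * q] = (int) ((l & 0xFFFF) << 8) | (w >>> 24);
      dst[i + 3 * q] = w & 0xFFFFFF;
    }
  }
}
```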
```java
out.writeLong(l1);
out.writeLong(l2);
out.writeLong(l3);
final int quarterLen = count >>> 2;
```
In other places in the code base, we prefer to use a signed shift, which is less likely to hide a bug if `count` ended up being negative for some reason (e.g. overflow):
```diff
- final int quarterLen = count >>> 2;
+ final int quarterLen = count >> 2;
```
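A tiny demo of why the signed shift fails more loudly (illustrative only):

```java
public class ShiftDemo {
  public static void main(String[] args) {
    int count = Integer.MIN_VALUE; // e.g. the result of an int overflow
    // Signed shift keeps the sign, so a negative count stays obviously broken:
    System.out.println(count >> 2);  // -536870912
    // Unsigned shift silently produces a plausible-looking positive length:
    System.out.println(count >>> 2); // 536870912
  }
}
```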
Thanks for the review! Here is the comparison of the current commit (candidate) and the vectorized decoding commit (baseline).
It turns out that the speed-up mainly comes from the reduction of virtual calls, not the vectorized decoding method. I'll propose to take this simpler patch.
Some new progress:
The previous result was obtained by
When I introduced a new task running
The PR to introduce
This tells me we cannot just check the performance of queries in isolation.
I tried this exact change on the geo benchmarks and did not see any change, but of course we are testing each query in its own JVM. I wonder if we should add a mixed workload there.
I like this simplicity, big +1; it would be good to have this mixed workload there.
Thanks @iverase! For the vectorized decoding, I benchmarked the decoding method with JMH; here is the result on my M2 Mac:
It suggests that
The vectorized decoding method using the Vector API seems to perform much better, but I still need to run luceneutil to confirm the end-to-end result. I'll keep this PR simple and leave the vectorized decoding optimization to another PR: #14203
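For reference, a minimal JMH harness along these lines (a sketch only: it benchmarks the illustrative `Bpv24DecodeSketch` from above, not the actual code that produced the numbers):

```java
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
public class Bpv24DecodeBench {
  private static final int COUNT = 512; // a typical BKD leaf size

  private final long[] packedInterleaved = new long[COUNT / 8 * 3];
  private final long[] packedLongs = new long[COUNT / 4];
  private final int[] packedInts = new int[COUNT / 4];
  private final int[] dst = new int[COUNT];

  @Setup
  public void setup() {
    Random r = new Random(42);
    for (int i = 0; i < packedInterleaved.length; i++) packedInterleaved[i] = r.nextLong();
    for (int i = 0; i < packedLongs.length; i++) packedLongs[i] = r.nextLong();
    for (int i = 0; i < packedInts.length; i++) packedInts[i] = r.nextInt();
  }

  @Benchmark
  public int[] interleaved() {
    Bpv24DecodeSketch.decodeInterleaved(packedInterleaved, dst, COUNT);
    return dst; // return the output to defeat dead-code elimination
  }

  @Benchmark
  public int[] strided() {
    Bpv24DecodeSketch.decodeStrided(packedLongs, packedInts, dst, COUNT);
    return dst;
  }
}
```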
LGTM, can you update the PR title and description? And add a CHANGES entry?
Thanks @jpountz! Updated. Could you please also help review mikemccand/luceneutil#335? I'd like to merge it first so that the nightly benchmark can catch this change.
Incredible speed-ups here https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here https://benchmarks.mikemccandless.com/IntNRQ.html
Amazing!
Yeah, wow!
The nightly geo benchy didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.
I think it is because we run each query in its own JVM, so it does not suffer from megamorphic calls on the IntersectVisitor.
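In other words, an illustrative sketch of the per-doc visiting loop (not the Lucene source):

```java
import java.io.IOException;
import org.apache.lucene.index.PointValues;

final class VisitLoopSketch {
  // If a JVM only ever runs one query type, this call site sees a single
  // concrete IntersectVisitor class and C2 can devirtualize and inline
  // visit(). With three or more implementations live (range, inverse range,
  // dynamic pruning), the site goes megamorphic and every doc ID pays a
  // virtual dispatch.
  static void visitAll(PointValues.IntersectVisitor visitor, int[] docIds, int count)
      throws IOException {
    for (int i = 0; i < count; i++) {
      visitor.visit(docIds[i]); // one virtual call per doc once megamorphic
    }
  }
}
```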
I pushed an annotation: mikemccand/luceneutil@e07e590
UPDATE:
This PR changed to only reduce virtual calls when visiting bpv24-encoded doc IDs. The vectorized decoding optimization will come in follow-up PRs.
Background
Proposal
This PR tries to introduce the bpv24 vectorized decoding again, and uses the new bulk visit method to reduce virtual calls, following #13149 and #14138.
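Roughly, the bulk-visit idea looks like this (a hedged sketch: `readAndVisit` and `decode24` are hypothetical names, and it assumes a bulk `visit(IntsRef)` overload on the visitor as described in this PR):

```java
import java.io.IOException;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.IntsRef;

final class BulkVisitSketch {
  private int[] scratch = new int[0];

  // Decode the whole leaf into a scratch array, then hand all doc IDs to the
  // visitor in one virtual call instead of one call per document.
  void readAndVisit(IndexInput in, int count, PointValues.IntersectVisitor visitor)
      throws IOException {
    if (scratch.length < count) {
      scratch = new int[count];
    }
    decode24(in, count, scratch);
    visitor.visit(new IntsRef(scratch, 0, count)); // single virtual dispatch
  }

  // Naive scalar stand-in for the bpv24 bulk decode (16 + 8 bits per value).
  private void decode24(IndexInput in, int count, int[] dst) throws IOException {
    for (int i = 0; i < count; i++) {
      dst[i] = (in.readShort() & 0xFFFF) << 8 | (in.readByte() & 0xFF);
    }
  }
}
```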
Luceneutil can now load 3 implementors of IntersectVisitor: the RangeQuery Visitor, the RangeQuery InverseVisitor, and the DynamicPruning Visitor. Here is the result on `wikimediumall` with `taskCountPerCat=5`: