
Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves #14176


Merged
gf2121 merged 6 commits into apache:main on Feb 9, 2025

Conversation

gf2121 (Contributor) commented Jan 28, 2025

UPDATE:

This PR was changed to only reduce virtual calls when visiting bpv24-encoded doc IDs. The vectorized decoding optimization will come in follow-up PRs.

Background

Proposal
This PR tries to reintroduce bpv24 vectorized decoding and uses the new bulk visit method to reduce virtual calls, building on #13149 and #14138.
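
To make the idea concrete, here is a minimal, hypothetical sketch (the class and method names below are illustrative, not the API this PR actually adds): rather than invoking the visitor once per decoded doc ID, the leaf block is decoded into a scratch buffer and handed to the visitor in a single bulk call, so the megamorphic virtual call moves out of the hot loop.

```java
// A minimal, hypothetical sketch of bulk visiting; the real Lucene API
// (PointValues.IntersectVisitor and DocIdsWriter) is shaped differently.
public class BulkVisitSketch {

  // Stand-in for an IntersectVisitor with an added bulk method.
  interface Visitor {
    void visit(int docId); // per-doc call: megamorphic when several implementations are loaded

    // Bulk entry point: one virtual dispatch per leaf block instead of one per doc.
    default void visit(int[] docIds, int count) {
      for (int i = 0; i < count; i++) {
        visit(docIds[i]);
      }
    }
  }

  // Old shape: the virtual call sits inside the innermost loop.
  static void visitOneByOne(int[] decodedDocIds, Visitor visitor) {
    for (int docId : decodedDocIds) {
      visitor.visit(docId);
    }
  }

  // New shape: decode the whole bpv24 block into a scratch buffer, then hand it over once.
  static void visitInBulk(int[] decodedDocIds, Visitor visitor, int[] scratch) {
    int count = decodedDocIds.length;
    System.arraycopy(decodedDocIds, 0, scratch, 0, count); // stands in for the decode loop
    visitor.visit(scratch, count);
  }

  public static void main(String[] args) {
    int[] docIds = {3, 17, 42, 1000};
    Visitor collector = docId -> System.out.println("collect doc " + docId);
    visitOneByOne(docIds, collector);
    visitInBulk(docIds, collector, new int[docIds.length]);
  }
}
```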

Luceneutil can now load three implementations of IntersectVisitor: the RangeQuery Visitor, the RangeQuery InverseVisitor, and the DynamicPruning Visitor. Here are the results on wikimediumall with taskCountPerCat=5:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      259.87      (3.9%)      269.26      (4.2%)    3.6% (  -4% -   12%) 0.005
             CountFilteredIntNRQ       61.70      (7.1%)       85.00      (2.0%)   37.8% (  26% -   50%) 0.000
                      TermDTSort      149.65      (6.2%)      232.85      (9.6%)   55.6% (  37% -   76%) 0.000
                  FilteredIntNRQ       82.76     (10.0%)      135.48      (3.7%)   63.7% (  45% -   85%) 0.000
                          IntNRQ       84.62     (10.5%)      139.05      (2.6%)   64.3% (  46% -   86%) 0.000

Tasks

TermDayOfYearSort: dayofyeardvsort//0 # freq=708472
TermDayOfYearSort: dayofyeardvsort//names # freq=402762
TermDayOfYearSort: dayofyeardvsort//nbsp # freq=492778
TermDayOfYearSort: dayofyeardvsort//part # freq=588644
TermDayOfYearSort: dayofyeardvsort//st # freq=306811

TermDateTimeSort: lastmodndvsort//0 # freq=708472
TermDateTimeSort: lastmodndvsort//names # freq=402762
TermDateTimeSort: lastmodndvsort//nbsp # freq=492778
TermDateTimeSort: lastmodndvsort//part # freq=588644
TermDateTimeSort: lastmodndvsort//st # freq=306811

IntNRQ: nrq//timesecnum 10044 66714
IntNRQ: nrq//timesecnum 1069 86092
IntNRQ: nrq//timesecnum 150 34646
IntNRQ: nrq//timesecnum 3110 51452
IntNRQ: nrq//timesecnum 3773 78558

FilteredIntNRQ: nrq//timesecnum 10044 66714 +filter=5%
FilteredIntNRQ: nrq//timesecnum 1069 86092 +filter=5%
FilteredIntNRQ: nrq//timesecnum 150 34646 +filter=5%
FilteredIntNRQ: nrq//timesecnum 3110 51452 +filter=5%
FilteredIntNRQ: nrq//timesecnum 3773 78558 +filter=5%

CountFilteredIntNRQ: count(nrq//timesecnum 10044 66714 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 1069 86092 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 150 34646 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 3110 51452 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 3773 78558 +filter=5%)

@gf2121 gf2121 changed the title bpv24 Introduce the bpv24 vectorized decoding for DocIdsWriter Jan 28, 2025
@gf2121 gf2121 changed the title Introduce the bpv24 vectorized decoding for DocIdsWriter Introduce bpv24 vectorized decoding for DocIdsWriter Jan 28, 2025
@gf2121 gf2121 requested review from iverase and jpountz January 28, 2025 03:23
iverase (Contributor) commented Feb 2, 2025

These numbers look great! I want to run this change on the geo benchmarks, but I expect similar speedups. I am planning to do it early next week.

One thing I am unhappy with is the introduction of another scratch array. I wonder if we can move the docIds array here into the DocIdsWriter and avoid introducing this variable?

jpountz (Contributor) left a comment

The speedup makes sense to me; the previous pattern could not auto-vectorize while the new one can. And 24 bits per value should apply to all segments with fewer than 16M docs, so it's quite widely applicable.
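
For illustration, a hedged sketch of the decode shape that can auto-vectorize (the actual bit layout and method names in DocIdsWriter differ from this example): unpacking four 24-bit values from three already-loaded ints is a loop of independent, branch-free iterations, unlike a loop that reads each value through per-value I/O calls.

```java
// Illustrative only: the real packing layout in DocIdsWriter differs, but the loop
// shape shows why a shift/mask unpacking loop over independent iterations is a good
// candidate for auto-vectorization.
public class Bpv24Sketch {

  // Pack 4 values of at most 24 bits into 3 ints (4 * 24 == 3 * 32 bits).
  static void pack4(int[] values, int vOff, int[] packed, int pOff) {
    packed[pOff]     = (values[vOff] << 8)      | (values[vOff + 1] >>> 16);
    packed[pOff + 1] = (values[vOff + 1] << 16) | (values[vOff + 2] >>> 8);
    packed[pOff + 2] = (values[vOff + 2] << 24) |  values[vOff + 3];
  }

  // Independent, branch-free iterations: friendly to auto-vectorization.
  static void unpack(int[] packed, int[] docs, int count) {
    for (int i = 0; i < count / 4; i++) {
      int i1 = packed[3 * i], i2 = packed[3 * i + 1], i3 = packed[3 * i + 2];
      docs[4 * i]     = i1 >>> 8;
      docs[4 * i + 1] = ((i1 & 0xFF) << 16)  | (i2 >>> 16);
      docs[4 * i + 2] = ((i2 & 0xFFFF) << 8) | (i3 >>> 24);
      docs[4 * i + 3] = i3 & 0xFFFFFF;
    }
  }

  public static void main(String[] args) {
    int[] docs = {1, 70000, 16_000_000, 42};
    int[] packed = new int[3];
    pack4(docs, 0, packed, 0);
    int[] decoded = new int[4];
    unpack(packed, decoded, 4);
    System.out.println(java.util.Arrays.toString(decoded)); // [1, 70000, 16000000, 42]
  }
}
```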

The thing that makes me a bit unhappy is that we're losing a lot of backward-compatibility testing for indices that use the old 24-bit encoding. Is there a way we can preserve this testing somehow?

out.writeLong(l1);
out.writeLong(l2);
out.writeLong(l3);
final int quarterLen = count >>> 2;

In other places in the codebase, we prefer to use a signed shift, which is less likely to hide a bug if count ended up being negative for some reason (e.g. overflow).

Suggested change:
-    final int quarterLen = count >>> 2;
+    final int quarterLen = count >> 2;
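
A tiny, hypothetical example of the failure mode being guarded against: if count has overflowed to a negative value, an unsigned shift silently produces a plausible-looking positive length, while a signed shift keeps the value negative so downstream checks fail loudly.

```java
// Hypothetical overflow scenario illustrating the signed vs. unsigned shift point.
public class ShiftCheck {
  public static void main(String[] args) {
    int count = Integer.MIN_VALUE;   // e.g. the result of an int overflow
    System.out.println(count >>> 2); // 536870912: a bogus but plausible-looking length
    System.out.println(count >> 2);  // -536870912: stays negative, trips bounds checks
  }
}
```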

gf2121 (Contributor, Author) commented Feb 5, 2025

Thanks for the review! Here is a comparison of the current commit (candidate) against the vectorized decoding commit (baseline).

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      188.16      (8.3%)      185.87      (7.3%)   -1.2% ( -15% -   15%) 0.629
             CountFilteredIntNRQ       85.44      (3.0%)       85.98      (3.0%)    0.6% (  -5% -    6%) 0.519
                          IntNRQ      139.73      (2.4%)      140.76      (3.1%)    0.7% (  -4% -    6%) 0.408
                  FilteredIntNRQ      135.19      (2.4%)      138.04      (4.5%)    2.1% (  -4% -    9%) 0.069
                      TermDTSort      185.33      (7.3%)      189.89      (7.6%)    2.5% ( -11% -   18%) 0.307

It turns out that the speedup mainly comes from the reduction of virtual calls, not from the vectorized decoding method. I propose taking this simpler patch.

gf2121 (Contributor, Author) commented Feb 5, 2025

Some new progress

> Luceneutil can now load three implementations of IntersectVisitor: the RangeQuery Visitor, the RangeQuery InverseVisitor, and the DynamicPruning Visitor. Here are the results on wikimediumall with taskCountPerCat=5:
>
>                             TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
>                TermDayOfYearSort      259.87      (3.9%)      269.26      (4.2%)    3.6% (  -4% -   12%) 0.005
>              CountFilteredIntNRQ       61.70      (7.1%)       85.00      (2.0%)   37.8% (  26% -   50%) 0.000
>                       TermDTSort      149.65      (6.2%)      232.85      (9.6%)   55.6% (  37% -   76%) 0.000
>                   FilteredIntNRQ       82.76     (10.0%)      135.48      (3.7%)   63.7% (  45% -   85%) 0.000
>                           IntNRQ       84.62     (10.5%)      139.05      (2.6%)   64.3% (  46% -   86%) 0.000

The previous result was obtained with taskRepeatCount=20. I found that the speedup disappeared when taskRepeatCount was increased to 50:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      196.21      (8.7%)      194.85     (11.2%)   -0.7% ( -18% -   21%) 0.871
             CountFilteredIntNRQ       84.92     (13.1%)       84.84     (12.1%)   -0.1% ( -22% -   28%) 0.987
                          IntNRQ      137.14     (20.2%)      137.30     (18.4%)    0.1% ( -31% -   48%) 0.989
                  FilteredIntNRQ      134.41     (20.0%)      135.05     (18.1%)    0.5% ( -31% -   48%) 0.954
                      TermDTSort      196.18      (9.0%)      201.19      (9.0%)    2.6% ( -14% -   22%) 0.506

When I introduced a new task running PointInSetQuery, the speedup appears stable with taskRepeatCount=50 or taskRepeatCount=100:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      200.79      (8.3%)      200.43      (8.5%)   -0.2% ( -15% -   18%) 0.947
                          IntSet     1358.83      (3.5%)     1364.40      (3.6%)    0.4% (  -6% -    7%) 0.714
                      TermDTSort      200.55      (9.4%)      203.26      (8.2%)    1.3% ( -14% -   20%) 0.630
             CountFilteredIntNRQ       61.05      (8.2%)       87.15      (7.9%)   42.7% (  24% -   64%) 0.000
                  FilteredIntNRQ       82.13     (10.2%)      139.49      (9.7%)   69.9% (  45% -   99%) 0.000
                          IntNRQ       83.22     (10.2%)      141.61      (9.5%)   70.2% (  45% -  100%) 0.000

The PR to introduce the PointInSetQuery task: mikemccand/luceneutil#335. Maybe we should look into merging it before this PR.

iverase (Contributor) left a comment

This tells me we cannot just check the performance of queries in isolation.

I tried this exact change on the geo benchmarks and I did not see any change but of course we are testing each query in its own JVM. I wonder if we should add a mixed workload there.

I like the simplicity, big +1; a mixed workload like this is what we would expect in practice.

gf2121 (Contributor, Author) commented Feb 6, 2025

Thanks @iverase !

For the vectorized decoding, I benchmarked the decoding methods with JMH; here are the results on my M2 Mac:

Benchmark                             Mode  Cnt    Score   Error   Units
BKDCodecBenchmark.readInts16ForUtil  thrpt    5   94.529 ± 2.886  ops/ms
BKDCodecBenchmark.readInts16Vector   thrpt    5  194.320 ± 7.082  ops/ms
BKDCodecBenchmark.readInts24ForUtil  thrpt    5   93.435 ± 5.063  ops/ms
BKDCodecBenchmark.readInts24Legacy   thrpt    5   81.779 ± 1.390  ops/ms
BKDCodecBenchmark.readInts24Vector   thrpt    5  151.203 ± 0.460  ops/ms

It suggests that readInts24ForUtil and readInts24Legacy do not differ much, which is consistent with the previous luceneutil result:

> The previous result was obtained with taskRepeatCount=20. I found that the speedup disappeared when taskRepeatCount was increased to 50:
>
>                             TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
>                TermDayOfYearSort      196.21      (8.7%)      194.85     (11.2%)   -0.7% ( -18% -   21%) 0.871
>              CountFilteredIntNRQ       84.92     (13.1%)       84.84     (12.1%)   -0.1% ( -22% -   28%) 0.987
>                           IntNRQ      137.14     (20.2%)      137.30     (18.4%)    0.1% ( -31% -   48%) 0.989
>                   FilteredIntNRQ      134.41     (20.0%)      135.05     (18.1%)    0.5% ( -31% -   48%) 0.954
>                       TermDTSort      196.18      (9.0%)      201.19      (9.0%)    2.6% ( -14% -   22%) 0.506

The vectorized decoding method using the Vector API seems to perform much better; I still need to run luceneutil to confirm the end-to-end result. I'll keep this PR simple and leave the vectorized decoding optimization to another PR: #14203.
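
For reference, a minimal JMH skeleton in the spirit of the benchmark above; the class name, leaf size, and decode body here are assumptions for illustration, not the actual BKDCodecBenchmark code.

```java
// Minimal JMH sketch; decode24Scalar is a placeholder for a decode method under test,
// not the actual readInts24 implementation benchmarked above.
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class Bpv24DecodeBenchmark {

  private static final int COUNT = 512; // typical BKD leaf size
  private int[] packed;                 // 3 ints per 4 doc IDs
  private int[] docs;

  @Setup
  public void setup() {
    Random r = new Random(42);
    packed = new int[COUNT / 4 * 3];
    docs = new int[COUNT];
    for (int i = 0; i < packed.length; i++) {
      packed[i] = r.nextInt();
    }
  }

  @Benchmark
  public int[] decode24Scalar() {
    for (int i = 0; i < COUNT / 4; i++) {
      int i1 = packed[3 * i], i2 = packed[3 * i + 1], i3 = packed[3 * i + 2];
      docs[4 * i]     = i1 >>> 8;
      docs[4 * i + 1] = ((i1 & 0xFF) << 16)  | (i2 >>> 16);
      docs[4 * i + 2] = ((i2 & 0xFFFF) << 8) | (i3 >>> 24);
      docs[4 * i + 3] = i3 & 0xFFFFFF;
    }
    return docs; // return the buffer to defeat dead-code elimination
  }
}
```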

jpountz (Contributor) left a comment

LGTM, can you update the PR title and description? And add a CHANGES entry?

@gf2121 gf2121 changed the title Introduce bpv24 vectorized decoding for DocIdsWriter Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves Feb 6, 2025
gf2121 (Contributor, Author) commented Feb 6, 2025

Thanks @jpountz !

Updated.

Could you please also help review mikemccand/luceneutil#335? I'd like to merge it first so that the nightly benchmark can catch this change.

@gf2121 gf2121 merged commit fe50684 into apache:main Feb 9, 2025
6 checks passed
iverase (Contributor) commented Feb 10, 2025

Incredible speed ups here https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here https://benchmarks.mikemccandless.com/IntNRQ.html

jpountz (Contributor) commented Feb 10, 2025

Amazing!

mikemccand (Member) commented

> Incredible speed ups here https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here https://benchmarks.mikemccandless.com/IntNRQ.html

Yeah, wow!

> These numbers look great! I want to run this change on the geo benchmarks, but I expect similar speedups. I am planning to do it early next week.

The nightly geo benchy didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.

iverase (Contributor) commented Feb 11, 2025

> The nightly geo benchy didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.

I think it is because we run each query in its own JVM so it does not suffer from megamorphic calls on the IntersectVisitor.

jpountz (Contributor) commented Feb 14, 2025

I pushed an annotation. mikemccand/luceneutil@e07e590

@gf2121 gf2121 added this to the 10.2.0 milestone Mar 19, 2025