
Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves #14176


Merged
gf2121 merged 6 commits into apache:main on Feb 9, 2025

Conversation

gf2121 (Contributor) commented Jan 28, 2025

UPDATE:

This PR was changed to only reduce virtual calls when visiting bpv24-encoded doc IDs. The vectorized decoding optimization will come in follow-up PRs.

Background

Proposal
This PR tries to reintroduce bpv24 vectorized decoding and uses the new bulk visit method to reduce virtual calls, building on #13149 and #14138.
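
To make the idea concrete, here is a minimal, hypothetical sketch (the class and method names below are illustrative, not the API this PR actually adds): rather than invoking the visitor once per decoded doc ID, the leaf block is decoded into a scratch buffer and handed to the visitor in a single bulk call, so the megamorphic virtual call moves out of the hot loop.

```java
// A minimal, hypothetical sketch of bulk visiting; the real Lucene API
// (PointValues.IntersectVisitor and DocIdsWriter) is shaped differently.
public class BulkVisitSketch {

  // Stand-in for an IntersectVisitor with an added bulk method.
  interface Visitor {
    void visit(int docId); // per-doc call: megamorphic when several implementations are loaded

    // Bulk entry point: one virtual dispatch per leaf block instead of one per doc.
    default void visit(int[] docIds, int count) {
      for (int i = 0; i < count; i++) {
        visit(docIds[i]);
      }
    }
  }

  // Old shape: the virtual call sits inside the innermost loop.
  static void visitOneByOne(int[] decodedDocIds, Visitor visitor) {
    for (int docId : decodedDocIds) {
      visitor.visit(docId);
    }
  }

  // New shape: decode the whole bpv24 block into a scratch buffer, then hand it over once.
  static void visitInBulk(int[] decodedDocIds, Visitor visitor, int[] scratch) {
    int count = decodedDocIds.length;
    System.arraycopy(decodedDocIds, 0, scratch, 0, count); // stands in for the decode loop
    visitor.visit(scratch, count);
  }

  public static void main(String[] args) {
    int[] docIds = {3, 17, 42, 1000};
    Visitor collector = docId -> System.out.println("collect doc " + docId);
    visitOneByOne(docIds, collector);
    visitInBulk(docIds, collector, new int[docIds.length]);
  }
}
```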

Luceneutil can now load three implementations of IntersectVisitor: the RangeQuery Visitor, the RangeQuery InverseVisitor, and the DynamicPruning Visitor. Here are the results on wikimediumall with taskCountPerCat=5:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      259.87      (3.9%)      269.26      (4.2%)    3.6% (  -4% -   12%) 0.005
             CountFilteredIntNRQ       61.70      (7.1%)       85.00      (2.0%)   37.8% (  26% -   50%) 0.000
                      TermDTSort      149.65      (6.2%)      232.85      (9.6%)   55.6% (  37% -   76%) 0.000
                  FilteredIntNRQ       82.76     (10.0%)      135.48      (3.7%)   63.7% (  45% -   85%) 0.000
                          IntNRQ       84.62     (10.5%)      139.05      (2.6%)   64.3% (  46% -   86%) 0.000

Tasks

TermDayOfYearSort: dayofyeardvsort//0 # freq=708472
TermDayOfYearSort: dayofyeardvsort//names # freq=402762
TermDayOfYearSort: dayofyeardvsort//nbsp # freq=492778
TermDayOfYearSort: dayofyeardvsort//part # freq=588644
TermDayOfYearSort: dayofyeardvsort//st # freq=306811

TermDateTimeSort: lastmodndvsort//0 # freq=708472
TermDateTimeSort: lastmodndvsort//names # freq=402762
TermDateTimeSort: lastmodndvsort//nbsp # freq=492778
TermDateTimeSort: lastmodndvsort//part # freq=588644
TermDateTimeSort: lastmodndvsort//st # freq=306811

IntNRQ: nrq//timesecnum 10044 66714
IntNRQ: nrq//timesecnum 1069 86092
IntNRQ: nrq//timesecnum 150 34646
IntNRQ: nrq//timesecnum 3110 51452
IntNRQ: nrq//timesecnum 3773 78558

FilteredIntNRQ: nrq//timesecnum 10044 66714 +filter=5%
FilteredIntNRQ: nrq//timesecnum 1069 86092 +filter=5%
FilteredIntNRQ: nrq//timesecnum 150 34646 +filter=5%
FilteredIntNRQ: nrq//timesecnum 3110 51452 +filter=5%
FilteredIntNRQ: nrq//timesecnum 3773 78558 +filter=5%

CountFilteredIntNRQ: count(nrq//timesecnum 10044 66714 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 1069 86092 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 150 34646 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 3110 51452 +filter=5%)
CountFilteredIntNRQ: count(nrq//timesecnum 3773 78558 +filter=5%)

@gf2121 gf2121 changed the title bpv24 Introduce the bpv24 vectorized decoding for DocIdsWriter Jan 28, 2025
@gf2121 gf2121 changed the title Introduce the bpv24 vectorized decoding for DocIdsWriter Introduce bpv24 vectorized decoding for DocIdsWriter Jan 28, 2025
@gf2121 gf2121 requested review from iverase and jpountz January 28, 2025 03:23
iverase (Contributor) commented Feb 2, 2025

These numbers look great! I want to run this change on the geo benchmarks, but I expect similar speedups. I am planning to do it early next week.

One thing I am unhappy with is the introduction of another scratch array. I wonder if we can move the docIds array here into the DocIdsWriter and avoid introducing this variable?

jpountz (Contributor) left a comment

The speedup makes sense to me; the previous pattern could not auto-vectorize while the new one can. And 24 bits per value should apply to all segments with fewer than 16M docs, so it's quite widely applicable.
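
For illustration, a hedged sketch of the decode shape that can auto-vectorize (the actual bit layout and method names in DocIdsWriter differ from this example): unpacking four 24-bit values from three already-loaded ints is a loop of independent, branch-free iterations, unlike a loop that reads each value through per-value I/O calls.

```java
// Illustrative only: the real packing layout in DocIdsWriter differs, but the loop
// shape shows why a shift/mask unpacking loop over independent iterations is a good
// candidate for auto-vectorization.
public class Bpv24Sketch {

  // Pack 4 values of at most 24 bits into 3 ints (4 * 24 == 3 * 32 bits).
  static void pack4(int[] values, int vOff, int[] packed, int pOff) {
    packed[pOff]     = (values[vOff] << 8)      | (values[vOff + 1] >>> 16);
    packed[pOff + 1] = (values[vOff + 1] << 16) | (values[vOff + 2] >>> 8);
    packed[pOff + 2] = (values[vOff + 2] << 24) |  values[vOff + 3];
  }

  // Independent, branch-free iterations: friendly to auto-vectorization.
  static void unpack(int[] packed, int[] docs, int count) {
    for (int i = 0; i < count / 4; i++) {
      int i1 = packed[3 * i], i2 = packed[3 * i + 1], i3 = packed[3 * i + 2];
      docs[4 * i]     = i1 >>> 8;
      docs[4 * i + 1] = ((i1 & 0xFF) << 16)  | (i2 >>> 16);
      docs[4 * i + 2] = ((i2 & 0xFFFF) << 8) | (i3 >>> 24);
      docs[4 * i + 3] = i3 & 0xFFFFFF;
    }
  }

  public static void main(String[] args) {
    int[] docs = {1, 70000, 16_000_000, 42};
    int[] packed = new int[3];
    pack4(docs, 0, packed, 0);
    int[] decoded = new int[4];
    unpack(packed, decoded, 4);
    System.out.println(java.util.Arrays.toString(decoded)); // [1, 70000, 16000000, 42]
  }
}
```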

The thing that makes me a bit unhappy is that we're losing a lot of backward-compatibility testing for indices that use the old 24-bit encoding. Is there a way we can preserve this testing somehow?

out.writeLong(l1);
out.writeLong(l2);
out.writeLong(l3);
final int quarterLen = count >>> 2;

In other places in the codebase, we prefer to use a signed shift, which is less likely to hide a bug if count ended up being negative for some reason (e.g. overflow).

Suggested change:
-    final int quarterLen = count >>> 2;
+    final int quarterLen = count >> 2;
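
A tiny, hypothetical example of the failure mode being guarded against: if count has overflowed to a negative value, an unsigned shift silently produces a plausible-looking positive length, while a signed shift keeps the value negative so downstream checks fail loudly.

```java
// Hypothetical overflow scenario illustrating the signed vs. unsigned shift point.
public class ShiftCheck {
  public static void main(String[] args) {
    int count = Integer.MIN_VALUE;   // e.g. the result of an int overflow
    System.out.println(count >>> 2); // 536870912: a bogus but plausible-looking length
    System.out.println(count >> 2);  // -536870912: stays negative, trips bounds checks
  }
}
```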

gf2121 (Contributor, Author) commented Feb 5, 2025

Thanks for the review! Here is a comparison of the current commit (candidate) against the vectorized decoding commit (baseline).

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      188.16      (8.3%)      185.87      (7.3%)   -1.2% ( -15% -   15%) 0.629
             CountFilteredIntNRQ       85.44      (3.0%)       85.98      (3.0%)    0.6% (  -5% -    6%) 0.519
                          IntNRQ      139.73      (2.4%)      140.76      (3.1%)    0.7% (  -4% -    6%) 0.408
                  FilteredIntNRQ      135.19      (2.4%)      138.04      (4.5%)    2.1% (  -4% -    9%) 0.069
                      TermDTSort      185.33      (7.3%)      189.89      (7.6%)    2.5% ( -11% -   18%) 0.307

It turns out that the speedup mainly comes from the reduction of virtual calls, not from the vectorized decoding method. I propose taking this simpler patch.

gf2121 (Contributor, Author) commented Feb 5, 2025

Some new progress

> Luceneutil can now load three implementations of IntersectVisitor: the RangeQuery Visitor, the RangeQuery InverseVisitor, and the DynamicPruning Visitor. Here are the results on wikimediumall with taskCountPerCat=5:
>
>                             TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
>                TermDayOfYearSort      259.87      (3.9%)      269.26      (4.2%)    3.6% (  -4% -   12%) 0.005
>              CountFilteredIntNRQ       61.70      (7.1%)       85.00      (2.0%)   37.8% (  26% -   50%) 0.000
>                       TermDTSort      149.65      (6.2%)      232.85      (9.6%)   55.6% (  37% -   76%) 0.000
>                   FilteredIntNRQ       82.76     (10.0%)      135.48      (3.7%)   63.7% (  45% -   85%) 0.000
>                           IntNRQ       84.62     (10.5%)      139.05      (2.6%)   64.3% (  46% -   86%) 0.000

The previous result was obtained with taskRepeatCount=20. I found that the speedup disappeared when taskRepeatCount was increased to 50:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      196.21      (8.7%)      194.85     (11.2%)   -0.7% ( -18% -   21%) 0.871
             CountFilteredIntNRQ       84.92     (13.1%)       84.84     (12.1%)   -0.1% ( -22% -   28%) 0.987
                          IntNRQ      137.14     (20.2%)      137.30     (18.4%)    0.1% ( -31% -   48%) 0.989
                  FilteredIntNRQ      134.41     (20.0%)      135.05     (18.1%)    0.5% ( -31% -   48%) 0.954
                      TermDTSort      196.18      (9.0%)      201.19      (9.0%)    2.6% ( -14% -   22%) 0.506

When I introduced a new task running PointInSetQuery, the speedup appears stable with taskRepeatCount=50 or taskRepeatCount=100:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
               TermDayOfYearSort      200.79      (8.3%)      200.43      (8.5%)   -0.2% ( -15% -   18%) 0.947
                          IntSet     1358.83      (3.5%)     1364.40      (3.6%)    0.4% (  -6% -    7%) 0.714
                      TermDTSort      200.55      (9.4%)      203.26      (8.2%)    1.3% ( -14% -   20%) 0.630
             CountFilteredIntNRQ       61.05      (8.2%)       87.15      (7.9%)   42.7% (  24% -   64%) 0.000
                  FilteredIntNRQ       82.13     (10.2%)      139.49      (9.7%)   69.9% (  45% -   99%) 0.000
                          IntNRQ       83.22     (10.2%)      141.61      (9.5%)   70.2% (  45% -  100%) 0.000

The PR to introduce the PointInSetQuery task: mikemccand/luceneutil#335. Maybe we should look into merging it before this PR.

iverase (Contributor) left a comment

This tells me we cannot just check the performance of queries in isolation.

I tried this exact change on the geo benchmarks and I did not see any change but of course we are testing each query in its own JVM. I wonder if we should add a mixed workload there.

I like the simplicity, big +1; a mixed workload like this is what we would expect in practice.

gf2121 (Contributor, Author) commented Feb 6, 2025

Thanks @iverase !

For the vectorized decoding, I benchmarked the decoding methods with JMH; here are the results on my M2 Mac:

Benchmark                             Mode  Cnt    Score   Error   Units
BKDCodecBenchmark.readInts16ForUtil  thrpt    5   94.529 ± 2.886  ops/ms
BKDCodecBenchmark.readInts16Vector   thrpt    5  194.320 ± 7.082  ops/ms
BKDCodecBenchmark.readInts24ForUtil  thrpt    5   93.435 ± 5.063  ops/ms
BKDCodecBenchmark.readInts24Legacy   thrpt    5   81.779 ± 1.390  ops/ms
BKDCodecBenchmark.readInts24Vector   thrpt    5  151.203 ± 0.460  ops/ms

It suggests that readInts24ForUtil and readInts24Legacy do not differ much, which is consistent with the previous luceneutil result:

> The previous result was obtained with taskRepeatCount=20. I found that the speedup disappeared when taskRepeatCount was increased to 50:
>
>                             TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
>                TermDayOfYearSort      196.21      (8.7%)      194.85     (11.2%)   -0.7% ( -18% -   21%) 0.871
>              CountFilteredIntNRQ       84.92     (13.1%)       84.84     (12.1%)   -0.1% ( -22% -   28%) 0.987
>                           IntNRQ      137.14     (20.2%)      137.30     (18.4%)    0.1% ( -31% -   48%) 0.989
>                   FilteredIntNRQ      134.41     (20.0%)      135.05     (18.1%)    0.5% ( -31% -   48%) 0.954
>                       TermDTSort      196.18      (9.0%)      201.19      (9.0%)    2.6% ( -14% -   22%) 0.506

The vectorized decoding method using the Vector API seems to perform much better; I still need to run luceneutil to confirm the end-to-end result. I'll keep this PR simple and leave the vectorized decoding optimization to another PR: #14203.
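
For reference, a minimal JMH skeleton in the spirit of the benchmark above; the class name, leaf size, and decode body here are assumptions for illustration, not the actual BKDCodecBenchmark code.

```java
// Minimal JMH sketch; decode24Scalar is a placeholder for a decode method under test,
// not the actual readInts24 implementation benchmarked above.
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
public class Bpv24DecodeBenchmark {

  private static final int COUNT = 512; // typical BKD leaf size
  private int[] packed;                 // 3 ints per 4 doc IDs
  private int[] docs;

  @Setup
  public void setup() {
    Random r = new Random(42);
    packed = new int[COUNT / 4 * 3];
    docs = new int[COUNT];
    for (int i = 0; i < packed.length; i++) {
      packed[i] = r.nextInt();
    }
  }

  @Benchmark
  public int[] decode24Scalar() {
    for (int i = 0; i < COUNT / 4; i++) {
      int i1 = packed[3 * i], i2 = packed[3 * i + 1], i3 = packed[3 * i + 2];
      docs[4 * i]     = i1 >>> 8;
      docs[4 * i + 1] = ((i1 & 0xFF) << 16)  | (i2 >>> 16);
      docs[4 * i + 2] = ((i2 & 0xFFFF) << 8) | (i3 >>> 24);
      docs[4 * i + 3] = i3 & 0xFFFFFF;
    }
    return docs; // return the buffer to defeat dead-code elimination
  }
}
```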

jpountz (Contributor) left a comment

LGTM, can you update the PR title and description? And add a CHANGES entry?

@gf2121 gf2121 changed the title Introduce bpv24 vectorized decoding for DocIdsWriter Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves Feb 6, 2025
gf2121 (Contributor, Author) commented Feb 6, 2025

Thanks @jpountz !

Updated.

Could you please also help review mikemccand/luceneutil#335? I'd like to merge it first so that the nightly benchmark can catch this change.

@gf2121 gf2121 merged commit fe50684 into apache:main Feb 9, 2025
6 checks passed
iverase (Contributor) commented Feb 10, 2025

Incredible speed ups here https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here https://benchmarks.mikemccandless.com/IntNRQ.html

jpountz (Contributor) commented Feb 10, 2025

Amazing!

mikemccand (Member) commented

> Incredible speed ups here https://benchmarks.mikemccandless.com/FilteredIntNRQ.html and here https://benchmarks.mikemccandless.com/IntNRQ.html

Yeah, wow!

> These numbers look great! I want to run this change on the geo benchmarks, but I expect similar speedups. I am planning to do it early next week.

The nightly geo benchy didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.

iverase (Contributor) commented Feb 11, 2025

> The nightly geo benchy didn't seem impacted either way; maybe the tasks it runs are not exercising the optimized path here.

I think it is because we run each query in its own JVM so it does not suffer from megamorphic calls on the IntersectVisitor.

jpountz (Contributor) commented Feb 14, 2025

I pushed an annotation. mikemccand/luceneutil@e07e590

@gf2121 gf2121 added this to the 10.2.0 milestone Mar 19, 2025