Decode doc ids in BKD leaves with auto-vectorized loops #14203


Merged: 39 commits into apache:main on Mar 16, 2025

Conversation

@gf2121 commented Feb 6, 2025

Context: #14176

I find that when running with a constant block size (512), the JIT can auto-vectorize the decoding loop, but it does not when the block size becomes variable, which can happen in real BKD leaves. This PR proposes using the Vector API to decode doc IDs in BKD.
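
To illustrate the pattern, a minimal sketch (hypothetical method names, not code from this PR): HotSpot unrolls and auto-vectorizes the constant-bound loop, while the same body with a variable bound was observed to stay scalar here.

  // Sketch only: the same shift-loop body, once with a compile-time-constant
  // trip count and once with a variable one.
  static void decodeConstant(int[] scratch, int[] docIds) {
    for (int i = 0; i < 512; ++i) { // constant bound: JIT auto-vectorizes
      docIds[i] = scratch[i] >>> 8;
    }
  }

  static void decodeVariable(int[] scratch, int[] docIds, int count) {
    for (int i = 0; i < count; ++i) { // variable bound: observed to stay scalar
      docIds[i] = scratch[i] >>> 8;
    }
  }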

Mac M2

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score   Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   85.316 ± 2.181  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  208.971 ± 2.734  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   85.752 ± 2.129  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  147.652 ± 1.786  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  186.534 ± 2.376  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  213.891 ± 4.671  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  140.298 ± 2.189  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  134.398 ± 1.640  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   87.278 ± 1.432  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  201.612 ± 3.277  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   87.148 ± 1.704  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   84.830 ± 8.852  ops/ms

Linux x86 (AVX-512 supported)

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   27.711 ±  2.777  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  132.859 ± 16.914  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   34.672 ±  5.730  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   33.017 ±  5.080  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   99.538 ± 11.813  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  107.525 ± 11.693  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   69.268 ± 10.351  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   64.134 ±  7.790  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   27.531 ±  3.810  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  125.707 ±  9.652  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   22.528 ±  4.724  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   23.903 ±  3.505  ops/ms

@gf2121 commented Feb 7, 2025

The E2E result on Mac M2 is a bit disappointing:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      834.36      (3.4%)      835.17      (3.8%)    0.1% (  -6% -    7%) 0.933
             CountFilteredIntNRQ       90.86      (2.2%)       92.56      (3.2%)    1.9% (  -3% -    7%) 0.033
                      TermDTSort      201.03      (6.7%)      206.65      (4.9%)    2.8% (  -8% -   15%) 0.131
                          IntNRQ      147.82      (2.0%)      151.98      (2.9%)    2.8% (  -2% -    7%) 0.000
                  FilteredIntNRQ      145.10      (2.9%)      150.02      (3.1%)    3.4% (  -2% -    9%) 0.000
               TermDayOfYearSort      200.93      (5.7%)      208.48      (4.0%)    3.8% (  -5% -   14%) 0.016

Profiling suggests the bottleneck is FixedBitSet#set rather than decoding:
[profiler screenshot]
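
For context, a minimal sketch of the collection pattern where that time goes (assuming a FixedBitSet-backed collector; names are hypothetical):

  // Sketch only: each decoded doc ID is folded into the bit set one at a time.
  // The per-ID read-modify-write on the backing long[] is what shows up as
  // FixedBitSet#set in the profile. Uses org.apache.lucene.util.FixedBitSet.
  static void collect(FixedBitSet bitSet, int[] docIds, int count) {
    for (int i = 0; i < count; ++i) {
      bitSet.set(docIds[i]);
    }
  }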

@gf2121 commented Feb 8, 2025

On an AVX-512 Linux x86 machine:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      247.99      (4.1%)      244.08      (2.5%)   -1.6% (  -7% -    5%) 0.143
                      TermDTSort       82.84      (6.5%)       83.46      (8.8%)    0.8% ( -13% -   17%) 0.759
               TermDayOfYearSort       83.58      (4.6%)       85.12      (6.3%)    1.8% (  -8% -   13%) 0.290
             CountFilteredIntNRQ       38.61      (2.9%)       42.32      (2.6%)    9.6% (   4% -   15%) 0.000
                  FilteredIntNRQ       64.02      (3.6%)       75.48      (3.7%)   17.9% (  10% -   26%) 0.000
                          IntNRQ       66.28      (4.5%)       79.10      (2.9%)   19.3% (  11% -   28%) 0.000

@gf2121 changed the title from "[WIP] Introduce bpv24 vectorized decoding for DocIdsWriter" to "Introduce bpv24 vectorized decoding for DocIdsWriter" on Feb 8, 2025
@gf2121 requested review from jpountz and iverase on February 8, 2025 09:19

@gf2121 commented Feb 18, 2025

Confused as well, +1 ... but here is the comparison of step512 (baseline) and step32 (candidate):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntNRQ       77.37      (4.1%)       54.55      (3.4%)  -29.5% ( -35% -  -22%) 0.000
                  FilteredIntNRQ       73.44      (4.0%)       52.70      (2.2%)  -28.2% ( -33% -  -22%) 0.000
             CountFilteredIntNRQ       40.78      (3.4%)       33.68      (2.4%)  -17.4% ( -22% -  -12%) 0.000
                          IntSet      242.01      (4.1%)      224.59      (3.9%)   -7.2% ( -14% -    0%) 0.000

github-actions bot commented Mar 5, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions bot added the Stale label on Mar 5, 2025

@gf2121 commented Mar 14, 2025

@jpountz Hi, do you have any idea how we should move forward on this optimization? Several thoughts:

  • We can add another step32 for the hybrid-step decoding, which makes the code even more complex but resolves the concern that we might decrease the BKD leaf size in the future.

  • If the hybrid-step inner loop code is too complex and the single-step version has performance issues, should we reconsider the original Vector API approach?

BTW, I got the previous AVX-512 results on an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz chip. I see a similar regression running with -XX:UseAVX=2 or -XX:UseAVX=3. I also tried some other machines with Intel chips and saw the same result, so it does not seem like a corner case.

@jpountz commented Mar 14, 2025

I have some small concerns:

  • The fact that the 512 step is tied to the number of points per leaf, though it's not a big deal at all; postings are similar: their encoding logic is specialized for blocks of 128. I guess I'd just rather err on a smaller block size than 512, which feels large-ish.
  • Complexity: the encoding has 3 different sub-encodings: 512, 128 and remainder. Could we have only two?

But my main concern is more that I would like to better understand why 512 performs so much better. There must be something that happens with this 512 step that doesn't happen otherwise such as using different instructions, loop unrolling, better CPU pipelining or something else. I have some discomfort merging something that is faster without having at least an intuition of why it's faster, so that I can also understand which JVMs and CPUs would enable this speedup. Could pipelining be the reason as 24 (bits per value) * 32 (step) < 2 * 512 (bit width of SIMD instructions)? But then something like 128 should perform well while your benchmark suggests it's still much worse than 512?

@gf2121 commented Mar 14, 2025

There must be something that happens with this 512 step that doesn't happen otherwise such as using different instructions, loop unrolling, better CPU pipelining or something else.

Thanks for pointing this out. I studied the asm profile again, and I can see that at least loop unrolling differs. According to the asm printed by JMH, for bpv24 decoding:

  • The Vector API version unrolled the shift loop x8 (add 0x40 once) and the remainder loop x4 (add 0x20 once).
  • The InnerLoop 512 step unrolled the shift loop x4 (add 0x20 once) and the remainder loop x2 (add 0x10 once).
  • The InnerLoop 128 step got no loop unrolling for either the shift loop (add 0x8 once) or the remainder loop (add 0x8 once).

This corresponds to the JMH result: Vector API > InnerLoop step-512 > InnerLoop step-128.
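
For reference, this per-method asm can be dumped with JMH's perfasm profiler (equivalent to passing -prof perfasm on the command line; the runner class below is a hypothetical sketch and needs Linux perf plus the hsdis library):

  import org.openjdk.jmh.runner.Runner;
  import org.openjdk.jmh.runner.RunnerException;
  import org.openjdk.jmh.runner.options.Options;
  import org.openjdk.jmh.runner.options.OptionsBuilder;

  public class RunWithPerfAsm {
    public static void main(String[] args) throws RunnerException {
      // Attach perfasm so JMH prints the hottest compiled regions with their
      // disassembly, which is where the unrolling differences above show up.
      Options opt = new OptionsBuilder()
          .include("BKDCodecBenchmark")
          .addProfiler("perfasm")
          .build();
      new Runner(opt).run();
    }
  }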

Things might change in luceneutil, because we find InnerLoop step-512 faster than the Vector API there. I confirmed the luceneutil result of step-512 (baseline) vs step-128 (candidate):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  FilteredIntNRQ       80.02      (4.0%)       71.31      (3.0%)  -10.9% ( -17% -   -4%) 0.000
                          IntNRQ       80.94      (2.5%)       72.60      (3.6%)  -10.3% ( -16% -   -4%) 0.000
             CountFilteredIntNRQ       42.93      (2.9%)       40.22      (2.3%)   -6.3% ( -11% -   -1%) 0.001
                          IntSet       93.36      (2.1%)       93.85      (0.7%)    0.5% (  -2% -    3%) 0.633

@jpountz commented Mar 14, 2025

Thanks for running benchmarks. So it looks like the JVM doesn't think these shorter loops (with step 128) are worth unrolling? This makes me wonder how something like that performs on your AVX-512 CPU. I think you had something similar in one of your previous iterations. On my machine it's on par with the current version.

  private void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
    if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
      // Same format, but enabling the JVM to specialize the decoding logic for the default number
      // of points per node proved to help on benchmarks
      doReadInts24(in, 512, docIDs);
    } else {
      doReadInts24(in, count, docIDs);
    }
  }

  private void doReadInts24(IndexInput in, int count, int[] docIDs) throws IOException {
    // Read the first (count - count % 4) values
    int quarter = count >> 2;
    int numBytes = quarter * 3;
    in.readInts(scratch, 0, numBytes);
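    // Layout note: each of these ints packs one full 24-bit doc ID in its upper
    // 24 bits plus one byte of a fourth doc ID in its low 8 bits, so
    // quarter * 3 ints decode to quarter * 4 values.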
    for (int i = 0; i < numBytes; ++i) {
      docIDs[i] = scratch[i] >>> 8;
      scratch[i] &= 0xFF;
    }
    for (int i = 0; i < quarter; ++i) {
      docIDs[numBytes + i] = scratch[i]
         | (scratch[quarter + i] << 8)
         | (scratch[2 * quarter + i] << 16);
    }
    // Now read the remaining 0, 1, 2 or 3 values
    for (int i = quarter << 2; i < count; ++i) {
      docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
    }
  }

github-actions bot removed the Stale label on Mar 15, 2025

@gf2121 commented Mar 15, 2025

On the AVX-512 machine:

  • Specialized read does not vectorize the remainder loop; it seems the compiler failed to inline it.
  • Specializing only the decode logic helps vectorize the remainder loop.
  • Pushing the masks into the remainder loop seems to give better performance (see the sketch after this list).
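
A minimal sketch of what "pushing the masks into the remainder loop" might look like, derived from jpountz's doReadInts24 snippet above (names and byte order follow that snippet; this is not necessarily the merged code):

  // Sketch only: the shift loop now only shifts, and the 0xFF masking moves
  // into the combining (remainder) loop, which the JIT vectorized on the
  // AVX-512 machine. Decodes the same layout as doReadInts24 above.
  private static void decode24MaskInRemainder(int[] scratch, int[] docIDs, int count) {
    final int quarter = count >> 2;
    final int numInts = quarter * 3;
    for (int i = 0; i < numInts; ++i) {
      docIDs[i] = scratch[i] >>> 8; // no "scratch[i] &= 0xFF" here anymore
    }
    for (int i = 0; i < quarter; ++i) {
      docIDs[numInts + i] =
          (scratch[i] & 0xFF)
              | ((scratch[quarter + i] & 0xFF) << 8)
              | ((scratch[2 * quarter + i] & 0xFF) << 16);
    }
  }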

I pushed the benchmark code to the patch; here is the result on my machine:

Benchmark                                                         Mode  Cnt   Score   Error   Units
InnerLoopDecodingBenchmark.hybridInnerLoop                       thrpt    5  76.311 ± 0.177  ops/ms
InnerLoopDecodingBenchmark.hybridInnerLoop:asm                   thrpt          NaN             ---
InnerLoopDecodingBenchmark.specializedDecode                     thrpt    5  73.600 ± 0.123  ops/ms
InnerLoopDecodingBenchmark.specializedDecode:asm                 thrpt          NaN             ---
InnerLoopDecodingBenchmark.specializedDecodeMaskInRemainder      thrpt    5  80.902 ± 0.046  ops/ms
InnerLoopDecodingBenchmark.specializedDecodeMaskInRemainder:asm  thrpt          NaN             ---
InnerLoopDecodingBenchmark.specializedRead                       thrpt    5  37.195 ± 0.099  ops/ms
InnerLoopDecodingBenchmark.specializedRead:asm                   thrpt          NaN             ---

luceneutil

hybridInnerLoop (baseline) vs specializedRead (candidate)

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntNRQ       83.50      (3.7%)       78.07      (4.1%)   -6.5% ( -13% -    1%) 0.000
                  FilteredIntNRQ       81.66      (3.6%)       76.50      (3.8%)   -6.3% ( -13% -    1%) 0.000
             CountFilteredIntNRQ       44.30      (1.9%)       43.07      (2.8%)   -2.8% (  -7% -    1%) 0.000
                          IntSet       94.41      (1.7%)       94.38      (1.0%)   -0.0% (  -2% -    2%) 0.940

hybridInnerLoop (baseline) vs specializedDecodeMaskInRemainder (candidate)

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  FilteredIntNRQ       81.19      (2.9%)       80.63     (10.2%)   -0.7% ( -13% -   12%) 0.773
                          IntSet       95.53      (1.3%)       94.99      (1.3%)   -0.6% (  -3% -    2%) 0.174
             CountFilteredIntNRQ       44.44      (2.6%)       44.27      (6.6%)   -0.4% (  -9% -    9%) 0.808
                          IntNRQ       82.72      (3.7%)       82.41     (10.2%)   -0.4% ( -13% -   14%) 0.878

I can refactor the code to the specialized decoding if that makes sense to you. BTW, should we remove the flexibility of changing maxPointsInLeafNode if we make the optimization only work on the default config?

@jpountz commented Mar 15, 2025

Again, thanks a lot for running benchmarks.

I can refactor the code to the specialized decoding if it makes sense to you

That would be great, thank you. Sorry for making it hard for you to move this PR forward; I was a bit annoyed that we needed something complicated to speed things up. I like the simplicity of specializedDecodeMaskInRemainder.

BTW, should we disable the flexibility of changing maxPointsInLeafNode if we make optimization only work on default config?

In my opinion, it's good enough the way things are today as the default codec doesn't allow configuring the number of points per leaf node.

@gf2121 commented Mar 16, 2025

Sorry for making it hard for you to move this PR forward; I was a bit annoyed that we needed something complicated to speed things up. I like the simplicity of specializedDecodeMaskInRemainder.

No apologies needed! It was exciting to play with the asm code, vectorize loops and make the code simpler. I learned so much through these iterations :)

In my opinion, it's good enough the way things are today as the default codec doesn't allow configuring the number of points per leaf node.

LGTM. IMO we should bind the OpenStreetMaps benchmark to DEFAULT_MAX_POINTS_IN_LEAF_NODE after this gets merged.
https://github.com/mikemccand/luceneutil/blob/9724b69c9ef53715dae27baa3493a0ee7949748e/src/main/perf/IndexAndSearchOpenStreetMaps.java#L542

@jpountz left a comment:
It looks good in general, just left minor comments. Thank you!

this(VERSION_CURRENT);
}

public Lucene90PointsFormat(int version) {

@jpountz:
Could it be pkg-private? I think we only need it for testing?

SegmentWriteState writeState, int maxPointsInLeafNode, double maxMBSortInHeap)
throws IOException {
this(writeState, maxPointsInLeafNode, maxMBSortInHeap, Lucene90PointsFormat.VERSION_CURRENT);
}

@jpountz:
Let's make all constructors that take a version pkg-private?

BKDWriter.VERSION_CURRENT);
}

public BKDWriter(

@jpountz:
Can you add javadocs noting that this ctor should only be used for testing with older versions?

final int longIdx = i + numInts + start;
scratch[i] |= docIds[longIdx] >>> 16;
scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
scratch[i + quarter * 2] |= docIds[longIdx] & 0xFF;

@jpountz:
nit: maybe write bytes in little-endian order for consistency? e.g.

  scratch[i] |= docIds[longIdx] & 0xFF;
  scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
  scratch[i + quarter * 2] |= docIds[longIdx] >>> 16;

// Now read the remaining 0, 1, 2 or 3 values
for (int i = quarter << 2; i < count; ++i) {
docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
}

@jpountz:
Out of curiosity, does it hurt performance if we add this as part of decode24? That would help save the above assertion.


@gf2121 (author):
I want to keep decode24 small, so I put it under the if/else block to save the assertion. luceneutil and JMH showed similar performance.

@gf2121 changed the title from "Use Vector API to decode BKD docIds" to "Decode doc ids in BKD leaves with auto-vectorized loops" on Mar 16, 2025

@jpountz left a comment:
Looks great!

@gf2121 merged commit 9472dca into apache:main on Mar 16, 2025
7 checks passed

@gf2121 commented Mar 17, 2025

The nightly benchmark confirmed the speedup: https://benchmarks.mikemccandless.com/2025.03.16.18.04.58.html

Thanks again for the profiling guidance and for helping figure out simpler and faster code!

@gf2121 commented Mar 17, 2025

I raised a PR for the annotation: mikemccand/luceneutil#354.

@jpountz commented Mar 17, 2025

Fantastic speedup. Nice to see tasks like TermDayOfYearSort also take advantage of this change.
