Decode doc ids in BKD leaves with auto-vectorized loops #14203


Merged: 39 commits into apache:main on Mar 16, 2025

Conversation

@gf2121 commented Feb 6, 2025

Context: #14176

I find that when running with a constant block size (512), the JIT can auto-vectorize the decoding loop, but it does not when the block size becomes variable, which can happen in real BKD leaves. This PR proposes using the Vector API to decode doc IDs in BKD.
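
To illustrate the pattern, a minimal sketch (hypothetical method names, not code from this PR): HotSpot unrolls and auto-vectorizes the constant-bound loop, while the same body with a variable bound was observed to stay scalar here.

  // Sketch only: the same shift-loop body, once with a compile-time-constant
  // trip count and once with a variable one.
  static void decodeConstant(int[] scratch, int[] docIds) {
    for (int i = 0; i < 512; ++i) { // constant bound: JIT auto-vectorizes
      docIds[i] = scratch[i] >>> 8;
    }
  }

  static void decodeVariable(int[] scratch, int[] docIds, int count) {
    for (int i = 0; i < count; ++i) { // variable bound: observed to stay scalar
      docIds[i] = scratch[i] >>> 8;
    }
  }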

Mac M2

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score   Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   85.316 ± 2.181  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  208.971 ± 2.734  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   85.752 ± 2.129  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  147.652 ± 1.786  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  186.534 ± 2.376  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  213.891 ± 4.671  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  140.298 ± 2.189  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  134.398 ± 1.640  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   87.278 ± 1.432  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  201.612 ± 3.277  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   87.148 ± 1.704  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   84.830 ± 8.852  ops/ms

Linux x86 (AVX-512 supported)

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   27.711 ±  2.777  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  132.859 ± 16.914  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   34.672 ±  5.730  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   33.017 ±  5.080  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   99.538 ± 11.813  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  107.525 ± 11.693  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   69.268 ± 10.351  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   64.134 ±  7.790  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   27.531 ±  3.810  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  125.707 ±  9.652  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   22.528 ±  4.724  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   23.903 ±  3.505  ops/ms

@gf2121 commented Feb 7, 2025

The E2E result on Mac M2 is a bit disappointing:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      834.36      (3.4%)      835.17      (3.8%)    0.1% (  -6% -    7%) 0.933
             CountFilteredIntNRQ       90.86      (2.2%)       92.56      (3.2%)    1.9% (  -3% -    7%) 0.033
                      TermDTSort      201.03      (6.7%)      206.65      (4.9%)    2.8% (  -8% -   15%) 0.131
                          IntNRQ      147.82      (2.0%)      151.98      (2.9%)    2.8% (  -2% -    7%) 0.000
                  FilteredIntNRQ      145.10      (2.9%)      150.02      (3.1%)    3.4% (  -2% -    9%) 0.000
               TermDayOfYearSort      200.93      (5.7%)      208.48      (4.0%)    3.8% (  -5% -   14%) 0.016

Profiling suggests the bottleneck is FixedBitSet#set rather than decoding:
[profiler screenshot]
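
For context, a minimal sketch of the collection pattern where that time goes (assuming a FixedBitSet-backed collector; names are hypothetical):

  // Sketch only: each decoded doc ID is folded into the bit set one at a time.
  // The per-ID read-modify-write on the backing long[] is what shows up as
  // FixedBitSet#set in the profile. Uses org.apache.lucene.util.FixedBitSet.
  static void collect(FixedBitSet bitSet, int[] docIds, int count) {
    for (int i = 0; i < count; ++i) {
      bitSet.set(docIds[i]);
    }
  }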

@gf2121 commented Feb 8, 2025

On an AVX-512 Linux x86 machine:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      247.99      (4.1%)      244.08      (2.5%)   -1.6% (  -7% -    5%) 0.143
                      TermDTSort       82.84      (6.5%)       83.46      (8.8%)    0.8% ( -13% -   17%) 0.759
               TermDayOfYearSort       83.58      (4.6%)       85.12      (6.3%)    1.8% (  -8% -   13%) 0.290
             CountFilteredIntNRQ       38.61      (2.9%)       42.32      (2.6%)    9.6% (   4% -   15%) 0.000
                  FilteredIntNRQ       64.02      (3.6%)       75.48      (3.7%)   17.9% (  10% -   26%) 0.000
                          IntNRQ       66.28      (4.5%)       79.10      (2.9%)   19.3% (  11% -   28%) 0.000

@gf2121 changed the title from "[WIP] Introduce bpv24 vectorized decoding for DocIdsWriter" to "Introduce bpv24 vectorized decoding for DocIdsWriter" on Feb 8, 2025
@gf2121 requested review from jpountz and iverase on February 8, 2025 09:19

@gf2121 commented Feb 18, 2025

Confused as well, +1 ... but here is the comparison of step512 (baseline) and step32 (candidate):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntNRQ       77.37      (4.1%)       54.55      (3.4%)  -29.5% ( -35% -  -22%) 0.000
                  FilteredIntNRQ       73.44      (4.0%)       52.70      (2.2%)  -28.2% ( -33% -  -22%) 0.000
             CountFilteredIntNRQ       40.78      (3.4%)       33.68      (2.4%)  -17.4% ( -22% -  -12%) 0.000
                          IntSet      242.01      (4.1%)      224.59      (3.9%)   -7.2% ( -14% -    0%) 0.000

github-actions bot commented Mar 5, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions bot added the Stale label on Mar 5, 2025

@gf2121 commented Mar 14, 2025

@jpountz Hi, do you have any idea how we should move forward on this optimization? Several thoughts:

  • We can add another step32 for the hybrid-step decoding, which makes the code even more complex but resolves the concern that we might decrease the BKD leaf size in the future.

  • If the hybrid-step inner loop code is too complex and the single-step version has performance issues, should we reconsider the original Vector API approach?

BTW, I got the previous AVX-512 results on an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz chip. I see a similar regression running with -XX:UseAVX=2 or -XX:UseAVX=3. I also tried some other machines with Intel chips and saw the same result, so it does not seem like a corner case.

@jpountz commented Mar 14, 2025

I have some small concerns:

  • The fact that the 512 step is tied to the number of points per leaf, though it's not a big deal at all; postings are similar: their encoding logic is specialized for blocks of 128. I guess I'd just rather err on a smaller block size than 512, which feels large-ish.
  • Complexity: the encoding has 3 different sub-encodings: 512, 128 and remainder. Could we have only two?

But my main concern is more that I would like to better understand why 512 performs so much better. There must be something that happens with this 512 step that doesn't happen otherwise such as using different instructions, loop unrolling, better CPU pipelining or something else. I have some discomfort merging something that is faster without having at least an intuition of why it's faster, so that I can also understand which JVMs and CPUs would enable this speedup. Could pipelining be the reason as 24 (bits per value) * 32 (step) < 2 * 512 (bit width of SIMD instructions)? But then something like 128 should perform well while your benchmark suggests it's still much worse than 512?

@gf2121 commented Mar 14, 2025

There must be something that happens with this 512 step that doesn't happen otherwise such as using different instructions, loop unrolling, better CPU pipelining or something else.

Thanks for pointing this out. I studied the asm profile again, and I can see that at least loop unrolling differs. According to the asm printed by JMH, for bpv24 decoding:

  • The Vector API version unrolled the shift loop x8 (add 0x40 once) and the remainder loop x4 (add 0x20 once).
  • The InnerLoop 512 step unrolled the shift loop x4 (add 0x20 once) and the remainder loop x2 (add 0x10 once).
  • The InnerLoop 128 step got no loop unrolling for either the shift loop (add 0x8 once) or the remainder loop (add 0x8 once).

This corresponds to the JMH result: Vector API > InnerLoop step-512 > InnerLoop step-128.
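
For reference, this per-method asm can be dumped with JMH's perfasm profiler (equivalent to passing -prof perfasm on the command line; the runner class below is a hypothetical sketch and needs Linux perf plus the hsdis library):

  import org.openjdk.jmh.runner.Runner;
  import org.openjdk.jmh.runner.RunnerException;
  import org.openjdk.jmh.runner.options.Options;
  import org.openjdk.jmh.runner.options.OptionsBuilder;

  public class RunWithPerfAsm {
    public static void main(String[] args) throws RunnerException {
      // Attach perfasm so JMH prints the hottest compiled regions with their
      // disassembly, which is where the unrolling differences above show up.
      Options opt = new OptionsBuilder()
          .include("BKDCodecBenchmark")
          .addProfiler("perfasm")
          .build();
      new Runner(opt).run();
    }
  }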

Things might change in luceneutil, because we find InnerLoop step-512 faster than the Vector API there. I confirmed the luceneutil result of step-512 (baseline) vs step-128 (candidate):

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  FilteredIntNRQ       80.02      (4.0%)       71.31      (3.0%)  -10.9% ( -17% -   -4%) 0.000
                          IntNRQ       80.94      (2.5%)       72.60      (3.6%)  -10.3% ( -16% -   -4%) 0.000
             CountFilteredIntNRQ       42.93      (2.9%)       40.22      (2.3%)   -6.3% ( -11% -   -1%) 0.001
                          IntSet       93.36      (2.1%)       93.85      (0.7%)    0.5% (  -2% -    3%) 0.633

@jpountz commented Mar 14, 2025

Thanks for running benchmarks. So it looks like the JVM doesn't think these shorter loops (with step 128) are worth unrolling? This makes me wonder how something like that performs on your AVX-512 CPU. I think you had something similar in one of your previous iterations. On my machine it's on par with the current version.

  private void readInts24(IndexInput in, int count, int[] docIDs) throws IOException {
    if (count == BKDConfig.DEFAULT_MAX_POINTS_IN_LEAF_NODE) {
      // Same format, but enabling the JVM to specialize the decoding logic for the default number
      // of points per node proved to help on benchmarks
      doReadInts24(in, 512, docIDs);
    } else {
      doReadInts24(in, count, docIDs);
    }
  }

  private void doReadInts24(IndexInput in, int count, int[] docIDs) throws IOException {
    // Read the first (count - count % 4) values
    int quarter = count >> 2;
    int numBytes = quarter * 3;
    in.readInts(scratch, 0, numBytes);
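    // Layout note: each of these ints packs one full 24-bit doc ID in its upper
    // 24 bits plus one byte of a fourth doc ID in its low 8 bits, so
    // quarter * 3 ints decode to quarter * 4 values.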
    for (int i = 0; i < numBytes; ++i) {
      docIDs[i] = scratch[i] >>> 8;
      scratch[i] &= 0xFF;
    }
    for (int i = 0; i < quarter; ++i) {
      docIDs[numBytes + i] = scratch[i]
         | (scratch[quarter + i] << 8)
         | (scratch[2 * quarter + i] << 16);
    }
    // Now read the remaining 0, 1, 2 or 3 values
    for (int i = quarter << 2; i < count; ++i) {
      docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
    }
  }

github-actions bot removed the Stale label on Mar 15, 2025

@gf2121 commented Mar 15, 2025

On the AVX-512 machine:

  • Specialized read does not vectorize the remainder loop; it seems the compiler failed to inline it.
  • Specializing only the decode logic helps vectorize the remainder loop.
  • Pushing the masks into the remainder loop seems to give better performance (see the sketch after this list).
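
A minimal sketch of what "pushing the masks into the remainder loop" might look like, derived from jpountz's doReadInts24 snippet above (names and byte order follow that snippet; this is not necessarily the merged code):

  // Sketch only: the shift loop now only shifts, and the 0xFF masking moves
  // into the combining (remainder) loop, which the JIT vectorized on the
  // AVX-512 machine. Decodes the same layout as doReadInts24 above.
  private static void decode24MaskInRemainder(int[] scratch, int[] docIDs, int count) {
    final int quarter = count >> 2;
    final int numInts = quarter * 3;
    for (int i = 0; i < numInts; ++i) {
      docIDs[i] = scratch[i] >>> 8; // no "scratch[i] &= 0xFF" here anymore
    }
    for (int i = 0; i < quarter; ++i) {
      docIDs[numInts + i] =
          (scratch[i] & 0xFF)
              | ((scratch[quarter + i] & 0xFF) << 8)
              | ((scratch[2 * quarter + i] & 0xFF) << 16);
    }
  }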

I pushed the benchmark code to the patch; here is the result on my machine:

Benchmark                                                         Mode  Cnt   Score   Error   Units
InnerLoopDecodingBenchmark.hybridInnerLoop                       thrpt    5  76.311 ± 0.177  ops/ms
InnerLoopDecodingBenchmark.hybridInnerLoop:asm                   thrpt          NaN             ---
InnerLoopDecodingBenchmark.specializedDecode                     thrpt    5  73.600 ± 0.123  ops/ms
InnerLoopDecodingBenchmark.specializedDecode:asm                 thrpt          NaN             ---
InnerLoopDecodingBenchmark.specializedDecodeMaskInRemainder      thrpt    5  80.902 ± 0.046  ops/ms
InnerLoopDecodingBenchmark.specializedDecodeMaskInRemainder:asm  thrpt          NaN             ---
InnerLoopDecodingBenchmark.specializedRead                       thrpt    5  37.195 ± 0.099  ops/ms
InnerLoopDecodingBenchmark.specializedRead:asm                   thrpt          NaN             ---

luceneutil

hybridInnerLoop (baseline) vs specializedRead (candidate)

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          IntNRQ       83.50      (3.7%)       78.07      (4.1%)   -6.5% ( -13% -    1%) 0.000
                  FilteredIntNRQ       81.66      (3.6%)       76.50      (3.8%)   -6.3% ( -13% -    1%) 0.000
             CountFilteredIntNRQ       44.30      (1.9%)       43.07      (2.8%)   -2.8% (  -7% -    1%) 0.000
                          IntSet       94.41      (1.7%)       94.38      (1.0%)   -0.0% (  -2% -    2%) 0.940

hybridInnerLoop (baseline) vs specializedDecodeMaskInRemainder (candidate)

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                  FilteredIntNRQ       81.19      (2.9%)       80.63     (10.2%)   -0.7% ( -13% -   12%) 0.773
                          IntSet       95.53      (1.3%)       94.99      (1.3%)   -0.6% (  -3% -    2%) 0.174
             CountFilteredIntNRQ       44.44      (2.6%)       44.27      (6.6%)   -0.4% (  -9% -    9%) 0.808
                          IntNRQ       82.72      (3.7%)       82.41     (10.2%)   -0.4% ( -13% -   14%) 0.878

I can refactor the code to the specialized decoding if that makes sense to you. BTW, should we remove the flexibility of changing maxPointsInLeafNode if we make the optimization only work on the default config?

@jpountz commented Mar 15, 2025

Again, thanks a lot for running benchmarks.

I can refactor the code to the specialized decoding if it makes sense to you

That would be great, thank you. Sorry for making it hard for you to move this PR forward; I was a bit annoyed that we needed something complicated to speed things up. I like the simplicity of specializedDecodeMaskInRemainder.

BTW, should we disable the flexibility of changing maxPointsInLeafNode if we make optimization only work on default config?

In my opinion, it's good enough the way things are today as the default codec doesn't allow configuring the number of points per leaf node.

@gf2121 commented Mar 16, 2025

Sorry for making it hard for you to move this PR forward; I was a bit annoyed that we needed something complicated to speed things up. I like the simplicity of specializedDecodeMaskInRemainder.

No apologies needed! It was exciting to play with the asm code, vectorize loops and make the code simpler. I learned so much through these iterations :)

In my opinion, it's good enough the way things are today as the default codec doesn't allow configuring the number of points per leaf node.

LGTM. IMO we should bind the OpenStreetMaps benchmark to DEFAULT_MAX_POINTS_IN_LEAF_NODE after this gets merged.
https://github.com/mikemccand/luceneutil/blob/9724b69c9ef53715dae27baa3493a0ee7949748e/src/main/perf/IndexAndSearchOpenStreetMaps.java#L542

@jpountz left a comment:
It looks good in general, just left minor comments. Thank you!

this(VERSION_CURRENT);
}

public Lucene90PointsFormat(int version) {

@jpountz:
Could it be pkg-private? I think we only need it for testing?

SegmentWriteState writeState, int maxPointsInLeafNode, double maxMBSortInHeap)
throws IOException {
this(writeState, maxPointsInLeafNode, maxMBSortInHeap, Lucene90PointsFormat.VERSION_CURRENT);
}

@jpountz:
Let's make all constructors that take a version pkg-private?

BKDWriter.VERSION_CURRENT);
}

public BKDWriter(

@jpountz:
Can you add javadocs noting that this ctor should only be used for testing with older versions?

final int longIdx = i + numInts + start;
scratch[i] |= docIds[longIdx] >>> 16;
scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
scratch[i + quarter * 2] |= docIds[longIdx] & 0xFF;

@jpountz:
nit: maybe write bytes in little-endian order for consistency? e.g.

  scratch[i] |= docIds[longIdx] & 0xFF;
  scratch[i + quarter] |= (docIds[longIdx] >>> 8) & 0xFF;
  scratch[i + quarter * 2] |= docIds[longIdx] >>> 16;

// Now read the remaining 0, 1, 2 or 3 values
for (int i = quarter << 2; i < count; ++i) {
docIDs[i] = (in.readShort() & 0xFFFF) | (in.readByte() & 0xFF) << 16;
}

@jpountz:
Out of curiosity, does it hurt performance if we add this as part of decode24? That would help save the above assertion.


@gf2121 (author):
I want to keep decode24 small, so I put it under the if/else block to save the assertion. luceneutil and JMH showed similar performance.

@gf2121 changed the title from "Use Vector API to decode BKD docIds" to "Decode doc ids in BKD leaves with auto-vectorized loops" on Mar 16, 2025

@jpountz left a comment:
Looks great!

@gf2121 merged commit 9472dca into apache:main on Mar 16, 2025
7 checks passed

@gf2121 commented Mar 17, 2025

The nightly benchmark confirmed the speedup: https://benchmarks.mikemccandless.com/2025.03.16.18.04.58.html

Thanks again for the profiling guidance and for helping figure out simpler and faster code!

@gf2121 commented Mar 17, 2025

I raised a PR for the annotation: mikemccand/luceneutil#354.

@jpountz commented Mar 17, 2025

Fantastic speedup. Nice to see tasks like TermDayOfYearSort also take advantage of this change.
