RamakrishnaChilaka
Contributor

@RamakrishnaChilaka RamakrishnaChilaka commented Sep 17, 2025

This PR optimizes the expand8 routine by leveraging the JDK Vector API.

Benchmarks

I validated performance with a standalone benchmark (see postings_expand_benchmark) using block_size 256, run on an i5-13600K with 256-bit vectors. The results and key takeaways follow.

Benchmark          Mode  Cnt    Score   Error   Units
expand16 (Scalar)  thrpt   5  112.842 ± 0.221  ops/us
expand16 (Vector)  thrpt   5  105.594 ± 1.307  ops/us
expand8 (Scalar)   thrpt   5   66.726 ± 0.452  ops/us
expand8 (Vector)   thrpt   5  105.821 ± 0.272  ops/us
  • expand8: Vectorized version is ~59% faster than scalar (66.7 → 105.8 ops/us).
  • expand16: Scalar slightly outperforms vector (112.8 vs 105.6 ops/us).

Lucene Microbenchmarks


baseline
Benchmark                                (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode            2  thrpt   15  35.409 ± 0.120  ops/us
PostingIndexInputBenchmark.decode            3  thrpt   15  29.128 ± 0.017  ops/us
PostingIndexInputBenchmark.decode            4  thrpt   15  41.492 ± 0.305  ops/us
PostingIndexInputBenchmark.decode            5  thrpt   15  32.205 ± 0.350  ops/us
PostingIndexInputBenchmark.decode            6  thrpt   15  31.237 ± 0.245  ops/us
PostingIndexInputBenchmark.decode            7  thrpt   15  29.984 ± 0.582  ops/us
PostingIndexInputBenchmark.decode            8  thrpt   15  56.366 ± 0.134  ops/us
PostingIndexInputBenchmark.decode            9  thrpt   15  22.802 ± 0.077  ops/us
PostingIndexInputBenchmark.decode           10  thrpt   15  23.502 ± 0.037  ops/us
PostingIndexInputBenchmark.decodeVector      2  thrpt   15  53.151 ± 0.070  ops/us
PostingIndexInputBenchmark.decodeVector      3  thrpt   15  48.863 ± 1.455  ops/us
PostingIndexInputBenchmark.decodeVector      4  thrpt   15  54.284 ± 2.195  ops/us
PostingIndexInputBenchmark.decodeVector      5  thrpt   15  39.302 ± 0.659  ops/us
PostingIndexInputBenchmark.decodeVector      6  thrpt   15  38.414 ± 0.830  ops/us
PostingIndexInputBenchmark.decodeVector      7  thrpt   15  39.609 ± 0.551  ops/us
PostingIndexInputBenchmark.decodeVector      8  thrpt   15  56.373 ± 0.118  ops/us
PostingIndexInputBenchmark.decodeVector      9  thrpt   15  27.295 ± 0.351  ops/us
PostingIndexInputBenchmark.decodeVector     10  thrpt   15  30.058 ± 0.172  ops/us


contender
Benchmark                                (bpv)   Mode  Cnt   Score   Error   Units
PostingIndexInputBenchmark.decode            2  thrpt   15  35.238 ± 0.209  ops/us
PostingIndexInputBenchmark.decode            3  thrpt   15  29.214 ± 0.098  ops/us
PostingIndexInputBenchmark.decode            4  thrpt   15  41.559 ± 0.580  ops/us
PostingIndexInputBenchmark.decode            5  thrpt   15  32.543 ± 0.175  ops/us
PostingIndexInputBenchmark.decode            6  thrpt   15  31.323 ± 0.061  ops/us
PostingIndexInputBenchmark.decode            7  thrpt   15  29.525 ± 0.315  ops/us
PostingIndexInputBenchmark.decode            8  thrpt   15  52.348 ± 0.079  ops/us
PostingIndexInputBenchmark.decode            9  thrpt   15  24.919 ± 0.056  ops/us
PostingIndexInputBenchmark.decode           10  thrpt   15  26.581 ± 0.049  ops/us
PostingIndexInputBenchmark.decodeVector      2  thrpt   15  71.223 ± 6.921  ops/us
PostingIndexInputBenchmark.decodeVector      3  thrpt   15  53.237 ± 1.962  ops/us
PostingIndexInputBenchmark.decodeVector      4  thrpt   15  73.437 ± 0.284  ops/us
PostingIndexInputBenchmark.decodeVector      5  thrpt   15  41.201 ± 2.067  ops/us
PostingIndexInputBenchmark.decodeVector      6  thrpt   15  46.622 ± 0.289  ops/us
PostingIndexInputBenchmark.decodeVector      7  thrpt   15  45.505 ± 1.044  ops/us
PostingIndexInputBenchmark.decodeVector      8  thrpt   15  58.368 ± 0.977  ops/us
PostingIndexInputBenchmark.decodeVector      9  thrpt   15  27.243 ± 0.358  ops/us
PostingIndexInputBenchmark.decodeVector     10  thrpt   15  30.059 ± 0.105  ops/us

Summary

bpv 9 and 10 use a primitive size of 16, hence no change in decodeVector performance.
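
As a rough illustration of why bpv 9 and 10 are unaffected, the mapping from bits per value to primitive size can be sketched as below. The class and method names are hypothetical, and the thresholds are an assumption based on the statements in this thread (values of 8 bits or fewer go through expand8, bpv 9 and 10 use a primitive size of 16); this is not a copy of the Lucene source.

```java
// Hypothetical helper, not Lucene's actual API.
final class PrimitiveSizeSketch {
  /**
   * Returns the assumed primitive size (in bits) used to hold packed values
   * for a given number of bits per value (bpv).
   */
  static int primitiveSize(int bitsPerValue) {
    if (bitsPerValue <= 8) {
      return 8;   // expanded by the expand8 path
    }
    if (bitsPerValue <= 16) {
      return 16;  // expanded by the expand16 path (bpv 9 and 10 land here)
    }
    return 32;    // assumed: wider values need no 8- or 16-bit expansion
  }
}
```

Under this assumption, only bpv 2 through 8 in the tables above would exercise the vectorized expand8 path.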

bpv  baseline vector (ops/μs)  contender vector (ops/μs)        Δ
  2                      53.2                       71.2  +33.8 %
  3                      48.9                       53.2   +8.8 %
  4                      54.3                       73.4  +35.2 %
  5                      39.3                       41.2   +4.8 %
  6                      38.4                       46.6  +21.4 %
  7                      39.6                       45.5  +14.9 %
  8                      56.4                       58.4   +3.5 %
  9                      27.3                       27.2   -0.4 %
 10                      30.1                       30.1    0.0 %
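
For reference, the Δ column is the relative change in throughput from baseline to contender. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and exist only for illustration.

```java
// Hypothetical helper for reproducing the percentage deltas in this thread.
final class ThroughputDelta {
  /** Relative change of contender versus baseline throughput, in percent. */
  static double percentChange(double baseline, double contender) {
    return (contender - baseline) / baseline * 100.0;
  }
}
```

For example, `ThroughputDelta.percentChange(53.2, 71.2)` gives roughly 33.8 (the bpv 2 row), and `ThroughputDelta.percentChange(66.726, 105.821)` gives roughly 58.6, which matches the ~59% expand8 figure from the standalone benchmark.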

Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@github-actions github-actions bot added this to the 10.4.0 milestone Sep 17, 2025
@RamakrishnaChilaka
Contributor Author

luceneutil benchmark

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy1       52.76     (11.7%)       49.37      (9.3%)   -6.4% ( -24% -   16%) 0.055
         AndHighMedDayTaxoFacets       30.57      (3.9%)       29.19      (3.9%)   -4.5% ( -11% -    3%) 0.000
                         Respell       25.63      (9.5%)       24.90     (10.2%)   -2.8% ( -20% -   18%) 0.363
           BrowseMonthTaxoFacets        2.63      (5.1%)        2.56      (6.7%)   -2.8% ( -13% -    9%) 0.136
                      TermDTSort      152.52      (6.0%)      149.44      (7.4%)   -2.0% ( -14% -   12%) 0.344
            HighIntervalsOrdered       26.57      (9.0%)       26.08      (8.5%)   -1.9% ( -17% -   17%) 0.500
            HighTermTitleBDVSort       17.65      (7.5%)       17.36      (6.4%)   -1.6% ( -14% -   13%) 0.457
                          Fuzzy2       50.14     (11.6%)       49.33      (8.6%)   -1.6% ( -19% -   20%) 0.615
                    OrNotHighLow      987.37      (3.1%)      974.78      (4.7%)   -1.3% (  -8% -    6%) 0.313
            BrowseDateSSDVFacets        0.70      (8.8%)        0.69      (9.7%)   -0.9% ( -17% -   19%) 0.753
            BrowseDateTaxoFacets        2.49      (9.9%)        2.47     (11.0%)   -0.8% ( -19% -   22%) 0.800
     BrowseRandomLabelSSDVFacets        2.44     (12.9%)        2.43     (13.8%)   -0.4% ( -24% -   30%) 0.917
                        Wildcard      301.90      (2.7%)      300.69      (2.9%)   -0.4% (  -5% -    5%) 0.651
                         Prefix3      408.43      (2.4%)      407.03      (3.2%)   -0.3% (  -5% -    5%) 0.706
                     LowSpanNear      159.18      (8.2%)      158.74      (6.7%)   -0.3% ( -14% -   15%) 0.907
                   OrNotHighHigh      163.69      (7.6%)      163.29      (8.0%)   -0.3% ( -14% -   16%) 0.920
                      AndHighLow     1087.95      (4.9%)     1085.33      (4.1%)   -0.2% (  -8% -    9%) 0.866
                    OrHighNotMed      561.31      (6.0%)      560.41      (5.7%)   -0.2% ( -11% -   12%) 0.931
                          IntSet      497.27      (6.5%)      496.90      (8.1%)   -0.1% ( -13% -   15%) 0.975
             LowIntervalsOrdered       12.07      (4.1%)       12.07      (4.5%)    0.0% (  -8% -    8%) 0.977
               HighTermMonthSort      851.18      (5.4%)      852.11      (5.5%)    0.1% ( -10% -   11%) 0.950
                    OrNotHighMed      274.91      (7.4%)      275.42      (6.4%)    0.2% ( -12% -   15%) 0.933
        AndHighHighDayTaxoFacets        7.60      (7.3%)        7.61      (8.7%)    0.2% ( -14% -   17%) 0.933
                          IntNRQ      533.39     (13.8%)      534.64     (12.7%)    0.2% ( -23% -   31%) 0.956
             MedIntervalsOrdered        9.57      (4.5%)        9.60      (6.3%)    0.3% ( -10% -   11%) 0.881
            MedTermDayTaxoFacets       16.62      (6.5%)       16.68      (7.4%)    0.3% ( -12% -   15%) 0.882
                     MedSpanNear       87.38     (12.0%)       87.69      (8.5%)    0.4% ( -18% -   23%) 0.914
                      HighPhrase       56.33      (4.7%)       56.58      (5.0%)    0.4% (  -8% -   10%) 0.778
       BrowseDayOfYearTaxoFacets        2.51     (10.5%)        2.52     (11.7%)    0.5% ( -19% -   25%) 0.896
           BrowseMonthSSDVFacets        3.38     (10.8%)        3.40      (8.8%)    0.5% ( -17% -   22%) 0.864
               HighTermTitleSort       90.42      (2.9%)       91.10      (3.0%)    0.8% (  -4% -    6%) 0.414
     BrowseRandomLabelTaxoFacets        1.85      (5.1%)        1.87      (3.2%)    0.8% (  -7% -    9%) 0.540
                    OrHighNotLow      672.02      (8.7%)      679.42      (6.5%)    1.1% ( -13% -   17%) 0.652
           HighTermDayOfYearSort      218.30      (2.9%)      220.71      (3.3%)    1.1% (  -5% -    7%) 0.266
                        PKLookup      146.99     (13.5%)      149.20      (8.5%)    1.5% ( -18% -   27%) 0.673
                       MedPhrase      127.53      (4.9%)      129.64      (3.6%)    1.7% (  -6% -   10%) 0.224
                    HighSpanNear        8.31      (7.2%)        8.46      (5.3%)    1.8% (  -9% -   15%) 0.377
                           range     3829.26      (5.5%)     3901.35      (7.2%)    1.9% ( -10% -   15%) 0.350
                       OrHighMed      414.19      (7.1%)      421.98      (5.2%)    1.9% (  -9% -   15%) 0.338
                         LowTerm     1311.02      (7.6%)     1336.03      (5.8%)    1.9% ( -10% -   16%) 0.373
       BrowseDayOfYearSSDVFacets        3.15     (10.5%)        3.23      (6.2%)    2.5% ( -12% -   21%) 0.355
                      AndHighMed      449.54      (3.0%)      461.19      (2.3%)    2.6% (  -2% -    8%) 0.002
                   OrHighNotHigh      264.69      (7.4%)      271.60      (7.9%)    2.6% ( -11% -   19%) 0.279
                       LowPhrase       64.81      (3.4%)       66.60      (3.4%)    2.8% (  -3% -    9%) 0.009
                     AndHighHigh      183.33     (10.2%)      189.00     (10.8%)    3.1% ( -16% -   26%) 0.352
                 MedSloppyPhrase       50.66      (4.3%)       52.28      (3.6%)    3.2% (  -4% -   11%) 0.012
                      OrHighHigh      194.55     (10.0%)      200.90      (8.5%)    3.3% ( -13% -   24%) 0.266
          OrHighMedDayTaxoFacets        7.88     (11.1%)        8.17      (8.8%)    3.7% ( -14% -   26%) 0.240
                         MedTerm      792.86      (8.5%)      823.44      (7.5%)    3.9% ( -11% -   21%) 0.127
                 LowSloppyPhrase       29.20      (5.7%)       30.44      (7.8%)    4.2% (  -8% -   18%) 0.049
                HighSloppyPhrase       33.31     (10.7%)       34.91      (6.8%)    4.8% ( -11% -   24%) 0.089
                       OrHighLow      615.53      (6.9%)      645.25      (5.6%)    4.8% (  -7% -   18%) 0.015
                        HighTerm      586.61     (10.6%)      618.50      (9.7%)    5.4% ( -13% -   28%) 0.091

@mikemccand
Member

Very cool -- thank you for running JMH (micro benchmark) and luceneutil (macro?)!

What exactly is expand8 and where is it used in Lucene? Is it postings decode when bitwidth is 8?

@RamakrishnaChilaka
Contributor Author

Thank you @mikemccand, @jpountz for reviewing the PR.

What exactly is expand8 and where is it used in Lucene? Is it postings decode when bitwidth is 8?

The patch vectorises ForUtil.expand8 with the JDK Vector API.
expand8 is the low-level routine that inflates 1–8-bit packed integers back to 32-bit during postings decode; it is on the hot path for every segment that stores doc IDs, frequencies, or positions with ≤ 8 bits per value.
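
To make the description concrete, here is a minimal scalar sketch of what an expand8-style routine does. It assumes a 256-value block with four 8-bit values packed per int, highest byte first, which is one plausible layout and not necessarily the exact one in ForUtil. The class name and the collapse8 helper are hypothetical. The vectorized version in this PR would apply the same shifts and masks to several ints at a time.

```java
// Illustrative sketch only; layout and names are assumptions, not Lucene's code.
final class Expand8Sketch {
  static final int BLOCK_SIZE = 256;             // values per block (assumed)
  static final int PACKED_INTS = BLOCK_SIZE / 4; // 64 ints hold 256 8-bit values

  /** Packs 256 values in [0, 255] into the first 64 ints (inverse of expand8). */
  static int[] collapse8(int[] values) {
    int[] arr = new int[BLOCK_SIZE];
    for (int i = 0; i < PACKED_INTS; i++) {
      arr[i] = (values[i] << 24)
          | (values[PACKED_INTS + i] << 16)
          | (values[2 * PACKED_INTS + i] << 8)
          | values[3 * PACKED_INTS + i];
    }
    return arr;
  }

  /** Inflates the 64 packed ints in place into 256 values, one per 32-bit int. */
  static void expand8(int[] arr) {
    for (int i = 0; i < PACKED_INTS; i++) {
      int packed = arr[i]; // read before arr[i] is overwritten
      arr[i] = (packed >>> 24) & 0xFF;
      arr[PACKED_INTS + i] = (packed >>> 16) & 0xFF;
      arr[2 * PACKED_INTS + i] = (packed >>> 8) & 0xFF;
      arr[3 * PACKED_INTS + i] = packed & 0xFF;
    }
  }
}
```

Because each iteration touches four independent output slots with identical shift-and-mask operations, the loop maps naturally onto SIMD lanes, which is presumably why the routine is a good fit for the Vector API.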

Added Javadocs now!

@RamakrishnaChilaka RamakrishnaChilaka merged commit 15ed5d7 into apache:main Sep 21, 2025
8 checks passed
@RamakrishnaChilaka RamakrishnaChilaka deleted the vectorise_expand8_expand16 branch September 21, 2025 05:24
@RamakrishnaChilaka
Contributor Author

RamakrishnaChilaka commented Sep 23, 2025

Shows a good speedup in the nightly benchmarks (~1-3.5%). Will push an annotation.

https://benchmarks.mikemccandless.com/2025.09.21.18.04.40.html
