Eliminate per-code denormalization in uniform SQ distance computation by mulugetam · Pull Request #5166 · facebookresearch/faiss

mulugetam · 2026-04-30T15:46:38Z

This PR removes per-code denormalization from the inner loop of L2 and inner product distance computations for uniform scalar quantizers (QT_8bit_uniform and QT_4bit_uniform), yielding a speedup of up to 1.39x.

For uniform integer scalar quantizers, vmin and vdiff are scalars shared across all dimensions. The reconstructed value for each component is therefore a function of a per-code decode n that depends only on the byte code:

x_hat = vmin + vdiff * n

This structure lets us factor vmin and vdiff out of the per-database-vector inner loop entirely, instead of recomputing the transform on every code on every distance evaluation.

L2 distance.

||q - x_hat||^2 = ||q - (vmin + vdiff * n)||^2
                = vdiff^2 * ||(q - vmin) / vdiff - n||^2

We pre-adjust the query once in set_query() to q_adj = (q - vmin) / vdiff and precompute scale = vdiff^2. The hot loop then compares codes directly against q_adj in the codec's native decode space, applying scale exactly once at the end.
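
A minimal scalar sketch of the L2 identity above, for illustration only. It is not the PR's SIMD code: the names in `l2_sketch` are hypothetical, and the raw decode `(c + 0.5) / 255` is an assumed 8-bit mapping that depends only on the byte value.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace l2_sketch {

// Assumed raw decode n for an 8-bit code (hypothetical mapping).
inline float decode_raw(uint8_t c) {
    return (float(c) + 0.5f) / 255.0f;
}

// Baseline: denormalize every component inside the loop.
inline float l2_baseline(
        const std::vector<float>& q,
        const std::vector<uint8_t>& codes,
        float vmin,
        float vdiff) {
    assert(q.size() == codes.size());
    float accu = 0.0f;
    for (size_t i = 0; i < q.size(); i++) {
        float x = vmin + vdiff * decode_raw(codes[i]); // per-code denormalization
        float diff = q[i] - x;
        accu += diff * diff;
    }
    return accu;
}

// Done once per query: q_adj = (q - vmin) / vdiff. Requires vdiff != 0.
inline std::vector<float> adjust_query(
        const std::vector<float>& q,
        float vmin,
        float vdiff) {
    assert(vdiff != 0.0f);
    std::vector<float> q_adj(q.size());
    for (size_t i = 0; i < q.size(); i++) {
        q_adj[i] = (q[i] - vmin) / vdiff;
    }
    return q_adj;
}

// Hot loop: compare against the raw decode, apply scale = vdiff^2 once.
inline float l2_factored(
        const std::vector<float>& q_adj,
        const std::vector<uint8_t>& codes,
        float scale) {
    assert(q_adj.size() == codes.size());
    float accu = 0.0f;
    for (size_t i = 0; i < q_adj.size(); i++) {
        float diff = q_adj[i] - decode_raw(codes[i]);
        accu += diff * diff;
    }
    return scale * accu;
}

} // namespace l2_sketch
```

Both functions should agree up to floating-point rounding; the factored version moves the multiply-add by vmin and vdiff out of the per-code loop.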

Inner product.

<q, x_hat> = sum_i q[i] * (vmin + vdiff * n[i])
           = vmin * sum_i q[i] + vdiff * sum_i q[i] * n[i]
           = bias + scale * <q, n>

We compute bias = vmin * sum_i q[i] and scale = vdiff once in set_query(). The hot loop accumulates the dot product against the raw decode n, and the final result is formed once at the end as bias + scale * <q, n>.
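
The inner product factoring can be sketched the same way. This is a scalar illustration under the same assumptions as before: hypothetical names in `ip_sketch`, and an assumed 8-bit raw decode of `(c + 0.5) / 255`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace ip_sketch {

// Assumed raw decode n for an 8-bit code (hypothetical mapping).
inline float decode_raw(uint8_t c) {
    return (float(c) + 0.5f) / 255.0f;
}

// Per-query constants: <q, x_hat> = bias + scale * <q, n>.
struct IPParams {
    float bias;  // vmin * sum_i q[i]
    float scale; // vdiff
};

// Baseline: denormalize every component inside the loop.
inline float ip_baseline(
        const std::vector<float>& q,
        const std::vector<uint8_t>& codes,
        float vmin,
        float vdiff) {
    assert(q.size() == codes.size());
    float accu = 0.0f;
    for (size_t i = 0; i < q.size(); i++) {
        accu += q[i] * (vmin + vdiff * decode_raw(codes[i]));
    }
    return accu;
}

// Done once per query.
inline IPParams prepare_query(
        const std::vector<float>& q,
        float vmin,
        float vdiff) {
    float sum = 0.0f;
    for (float v : q) {
        sum += v;
    }
    return IPParams{vmin * sum, vdiff};
}

// Hot loop: dot product against the raw decode, bias and scale applied once.
inline float ip_factored(
        const std::vector<float>& q,
        const std::vector<uint8_t>& codes,
        const IPParams& p) {
    assert(q.size() == codes.size());
    float dot = 0.0f;
    for (size_t i = 0; i < q.size(); i++) {
        dot += q[i] * decode_raw(codes[i]);
    }
    return p.bias + p.scale * dot;
}

} // namespace ip_sketch
```

Unlike the L2 case, the query itself is left unchanged here; only the two scalars bias and scale depend on vmin and vdiff.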

In both cases this removes one FMA (the vmin + vdiff * n denormalization) from the inner loop per 8 components per database vector. Beyond the raw FLOP reduction, shortening the dependency chain lets SIMD pipelines better overlap the decode of the next lane with the accumulator update of the previous lane.

The Change

                    Current                           This PR
                    --------                          ---------
set_query(q):      store q                           q_adj = (q - vmin) / vdiff  [once]

per code (×N):     raw = decode(bytes)               raw = decode(bytes)
                   x = vmin + raw * vdiff  ← gone    diff = q_adj - raw
                   diff = q - x                      accu += diff^2
                   accu += diff^2

The optimization is gated on a C++20 requires check for a new decode_8_raw() method, defined only on the uniform QuantizerTemplate specializations. All other quantizer types fall through to the original compute_distance path unchanged.

Speedup

The table below shows speedups over the existing implementation, measured by running benchs/bench_scalar_quantizer.py for a dd build on SPR. Similar results were observed for AVX2.

|              | QT_4bit_uniform | QT_8bit_uniform |
|--------------|-----------------|-----------------|
| RS_minmax    | 0.99x           | 1.05x           |
| RS_minmax    | 1.07x           | 1.03x           |
| RS_minmax    | 0.83x           | 1.03x           |
| RS_minmax    | 0.96x           | 1.05x           |
| RS_minmax    | 0.89x           | 1.03x           |
| RS_minmax    | 1.14x           | 1.03x           |
| RS_minmax    | 0.99x           | 1.05x           |
| RS_meanstd   | 1.28x           | 1.09x           |
| RS_meanstd   | 1.10x           | 1.01x           |
| RS_meanstd   | 1.14x           | 1.06x           |
| RS_meanstd   | 1.18x           | 1.08x           |
| RS_meanstd   | 1.11x           | 1.05x           |
| RS_meanstd   | 1.18x           | 1.06x           |
| RS_meanstd   | 1.16x           | 1.08x           |
| RS_quantiles | 1.39x           | 1.07x           |
| RS_quantiles | 1.08x           | 1.01x           |
| RS_quantiles | 1.21x           | 0.99x           |
| RS_quantiles | 1.25x           | 1.10x           |
| RS_optim     | 1.03x           | 1.03x           |

The raw performance results are available here: https://gist.github.com/mulugetam/7db50f89279bb270a1fe336206730d60

For uniform integer scalar quantizers (QT_8bit_uniform and QT_4bit_uniform), vmin and vdiff are scalars shared across all dimensions. The reconstructed value for each component is a function of a per-code decode n that depends only on the byte code:

x_hat = vmin + vdiff * n

This lets us factor vmin and vdiff out of the per-database-vector inner loop.

For L2:

||q - x_hat||^2 = ||q - (vmin + vdiff * n)||^2
                = vdiff^2 * ||(q - vmin) / vdiff - n||^2

The query is pre-adjusted once in set_query() to q_adj = (q - vmin) / vdiff with scale = vdiff^2, and the hot loop compares codes directly against q_adj in the codec's native decode space, applying scale once at the end.

For inner product, the same factoring applies linearly:

<q, x_hat> = sum_i q[i] * (vmin + vdiff * n[i])
           = vmin * sum_i q[i] + vdiff * sum_i q[i] * n[i]
           = bias + scale * <q, n>

with bias = vmin * sum_i q[i] and scale = vdiff computed once in set_query(); the hot loop accumulates the dot product against the raw decode n and applies bias and scale once at the end.

In both cases this removes one FMA (the vmin + vdiff * n denormalization) from the inner loop per 8 components per database vector. All other quantizer types fall through to the original compute_distance path unchanged.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>

mulugetam · 2026-04-30T17:13:08Z

@subhadeepkaran @mnorris11 Could you take a look when you have time? Thanks!

mdouze

Thanks for the contribution.
I am mainly concerned by the pattern of if constexpr (xxx). Would it be possible to move this to SimilarityL2/SimilarityIP?

mdouze · 2026-05-04T07:29:32Z

                                          rsname)

-                index.rangestat_arg = val
+                index.sq.rangestat_arg = val


good catch!

mdouze · 2026-05-04T07:34:48Z

    void set_query(const float* x) final {
        q = x;
+        if constexpr (has_decode_raw()) {
+            if constexpr (Sim::metric_type == METRIC_L2) {


Is it possible to defer these tests to the Similarity object? This is where the IP / L2 distinction is managed.

mdouze · 2026-05-04T07:35:47Z

    Quantizer quant;

+    // Pre-adjusted query buffer for uniform quantizers
+    std::vector<float> q_adj;


what is the impact of dynamic allocation here? This object is intended to be very lightweight.

Move the metric-specific query pre-adjustment and raw-decode distance accumulation out of DCTemplate and into the Similarity classes, where the IP/L2 distinction is already managed.

- Add a static adjust_query_for_raw_decode() method to each SimilarityL2 and SimilarityIP specialization (AVX512, AVX2, NEON).
- Replace the if constexpr (Sim::metric_type == METRIC_L2) branches in DCTemplate::set_query() with a single call to Sim::adjust_query_for_raw_decode().
- Replace the hand-written SIMD loops in query_to_code_predecoded() with calls to the existing Similarity accumulator interface (begin_N / add_N_components / result_N).
- Fix bench_scalar_quantizer.py: filter out QT_count, since it is not a valid quantizer type.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>

mulugetam · 2026-05-04T15:52:25Z

Thank you @mdouze. I think that makes it even cleaner. I have made these changes:

- Added a static adjust_query_for_raw_decode() method to each SimilarityL2 and SimilarityIP specialization (AVX512, AVX2, NEON).
- Replaced the if constexpr (Sim::metric_type == METRIC_L2) branches in DCTemplate::set_query() with a single call to Sim::adjust_query_for_raw_decode().
- Replaced the hand-written SIMD loops in query_to_code_predecoded() with calls to the existing Similarity accumulator interface (begin_N / add_N_components / result_N).
- Fixed a bug in benchs/bench_scalar_quantizer.py that caused an error at the end of a run because QT_count was being treated as a quantizer type.

Regarding the cost of vector<float> q_adj: it contains d elements (resized once in the constructor, q_adj.resize(d)) and costs one malloc per thread at search setup time, so I think it's the right tradeoff to make.
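
To make the shape of this refactor concrete, here is a scalar sketch in which the metric-specific adjustment lives in the similarity type and the distance computer sizes q_adj once in its constructor. Everything in `sim_sketch` is hypothetical: the signatures are assumptions rather than the PR's actual ones, the accumulator is a one-component stand-in for begin_N / add_N_components / result_N, and the raw decode `(c + 0.5) / 255` is an assumed 8-bit mapping.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sim_sketch {

// Final distance is bias + scale * accumulated value.
struct Affine {
    float bias;
    float scale;
};

struct SimL2 {
    // L2: q_adj = (q - vmin) / vdiff, scale = vdiff^2, no bias.
    static Affine adjust_query_for_raw_decode(
            const float* q, float* q_adj, size_t d, float vmin, float vdiff) {
        for (size_t i = 0; i < d; i++) {
            q_adj[i] = (q[i] - vmin) / vdiff;
        }
        return {0.0f, vdiff * vdiff};
    }
    const float* y;
    size_t i = 0;
    float accu = 0.0f;
    explicit SimL2(const float* y_in) : y(y_in) {}
    void begin() { i = 0; accu = 0.0f; }
    void add_component(float x) { float diff = y[i++] - x; accu += diff * diff; }
    float result() const { return accu; }
};

struct SimIP {
    // IP: query unchanged, bias = vmin * sum(q), scale = vdiff.
    static Affine adjust_query_for_raw_decode(
            const float* q, float* q_adj, size_t d, float vmin, float vdiff) {
        float sum = 0.0f;
        for (size_t i = 0; i < d; i++) {
            q_adj[i] = q[i];
            sum += q[i];
        }
        return {vmin * sum, vdiff};
    }
    const float* y;
    size_t i = 0;
    float accu = 0.0f;
    explicit SimIP(const float* y_in) : y(y_in) {}
    void begin() { i = 0; accu = 0.0f; }
    void add_component(float x) { accu += y[i++] * x; }
    float result() const { return accu; }
};

template <class Sim>
struct DistanceComputer {
    float vmin;
    float vdiff;
    std::vector<float> q_adj; // allocated once, reused for every query
    Affine affine;

    DistanceComputer(size_t d, float vmin_in, float vdiff_in)
            : vmin(vmin_in), vdiff(vdiff_in), q_adj(d), affine{0.0f, 1.0f} {}

    // No metric-specific branching here: the similarity type decides.
    void set_query(const float* q) {
        affine = Sim::adjust_query_for_raw_decode(
                q, q_adj.data(), q_adj.size(), vmin, vdiff);
    }

    float query_to_code(const uint8_t* code) const {
        Sim sim(q_adj.data());
        sim.begin();
        for (size_t i = 0; i < q_adj.size(); i++) {
            sim.add_component((float(code[i]) + 0.5f) / 255.0f);
        }
        return affine.bias + affine.scale * sim.result();
    }
};

} // namespace sim_sketch
```

In this sketch set_query() never reallocates, so the only allocation is the one made when the distance computer is constructed.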

meta-codesync · 2026-05-23T00:00:25Z

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106148760.

meta-codesync · 2026-05-29T23:56:44Z

@mnorris11 merged this pull request in 1cb7601.

meta-cla Bot added the CLA Signed label Apr 30, 2026

mdouze reviewed May 4, 2026


mnorris11 added the to-benchmark label May 18, 2026

mdouze added 2 commits May 19, 2026 17:23

Merge branch 'main' into sq-optim

212a940

Merge branch 'main' into sq-optim

eae47b2

meta-codesync Bot closed this in 1cb7601 May 29, 2026

facebook-github-tools Bot added the Merged label May 29, 2026

mulugetam wants to merge 4 commits into facebookresearch:main from mulugetam:sq-optim
