Eliminate per-code denormalization in uniform SQ distance computation #5166

Closed
mulugetam wants to merge 4 commits into facebookresearch:main from mulugetam:sq-optim

@mulugetam
Contributor

@mulugetam mulugetam commented Apr 30, 2026

This PR removes per-code denormalization from the inner loop of L2 and inner product distance computations for uniform scalar quantizers (QT_8bit_uniform and QT_4bit_uniform), yielding a speedup of up to 1.39x.

For uniform integer scalar quantizers, vmin and vdiff are scalars shared across all dimensions. The reconstructed value for each component is therefore a function of a per-code decode n that depends only on the byte code:

x_hat = vmin + vdiff * n

This structure lets us factor vmin and vdiff out of the per-database-vector inner loop entirely, instead of recomputing the transform on every code on every distance evaluation.

L2 distance.

||q - x_hat||^2 = ||q - (vmin + vdiff * n)||^2
                = vdiff^2 * ||(q - vmin) / vdiff - n||^2

We pre-adjust the query once in set_query() to q_adj = (q - vmin) / vdiff and precompute scale = vdiff^2. The hot loop then compares codes directly against q_adj in the codec's native decode space, applying scale exactly once at the end.
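
To make the L2 identity above concrete, here is a minimal scalar sketch, not the PR's SIMD code. The function names are hypothetical, and the raw decode n is assumed to be the integer code value itself; the actual codec may map bytes to n differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: reconstruct x_hat for every code, then accumulate (q - x_hat)^2.
inline float l2_denormalized(
        const std::vector<float>& q,
        const std::vector<std::uint8_t>& codes,
        float vmin,
        float vdiff) {
    assert(q.size() == codes.size());
    float accu = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        // Per-code denormalization: the work the PR moves out of the loop.
        const float x = vmin + vdiff * static_cast<float>(codes[i]);
        const float diff = q[i] - x;
        accu += diff * diff;
    }
    return accu;
}

// Done once per query: q_adj = (q - vmin) / vdiff.
inline std::vector<float> adjust_query_l2(
        const std::vector<float>& q,
        float vmin,
        float vdiff) {
    assert(vdiff != 0.0f);
    std::vector<float> q_adj(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) {
        q_adj[i] = (q[i] - vmin) / vdiff;
    }
    return q_adj;
}

// Hot loop: compare q_adj against the raw decode, apply vdiff^2 once.
inline float l2_predecoded(
        const std::vector<float>& q_adj,
        const std::vector<std::uint8_t>& codes,
        float vdiff) {
    assert(q_adj.size() == codes.size());
    float accu = 0.0f;
    for (std::size_t i = 0; i < q_adj.size(); ++i) {
        const float diff = q_adj[i] - static_cast<float>(codes[i]);
        accu += diff * diff;
    }
    return vdiff * vdiff * accu;
}
```

Both functions should agree up to floating-point rounding, since the second only reorders the arithmetic of the first.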

Inner product.

<q, x_hat> = sum_i q[i] * (vmin + vdiff * n[i])
           = vmin * sum_i q[i] + vdiff * sum_i q[i] * n[i]
           = bias + scale * <q, n>

We compute bias = vmin * sum_i q[i] and scale = vdiff once in set_query(). The hot loop accumulates the dot product against the raw decode n, and the result bias + scale * <q, n> is formed once at the end.
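
The inner product factoring can be checked the same way. This is a scalar sketch under the same assumptions as before: IPQuerySketch and the function names are hypothetical, and n is taken to be the integer code value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-query constants applied after the hot loop (hypothetical helper type).
struct IPQuerySketch {
    float bias;  // vmin * sum_i q[i]
    float scale; // vdiff
};

// Baseline: reconstruct x_hat for every code inside the loop.
inline float ip_denormalized(
        const std::vector<float>& q,
        const std::vector<std::uint8_t>& codes,
        float vmin,
        float vdiff) {
    assert(q.size() == codes.size());
    float accu = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        accu += q[i] * (vmin + vdiff * static_cast<float>(codes[i]));
    }
    return accu;
}

// Done once per query: precompute bias and scale.
inline IPQuerySketch prepare_ip_query(
        const std::vector<float>& q,
        float vmin,
        float vdiff) {
    float sum = 0.0f;
    for (float v : q) {
        sum += v;
    }
    return {vmin * sum, vdiff};
}

// Hot loop: plain dot product against the raw decode, then bias + scale * dot.
inline float ip_predecoded(
        const std::vector<float>& q,
        const std::vector<std::uint8_t>& codes,
        const IPQuerySketch& p) {
    assert(q.size() == codes.size());
    float dot = 0.0f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        dot += q[i] * static_cast<float>(codes[i]);
    }
    return p.bias + p.scale * dot;
}
```

Unlike the L2 case, the query itself does not need to be rescaled here; only the two scalars change per query.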

In both cases this removes one FMA (the vmin + vdiff * n denormalization) from the inner loop per 8 components per database vector. Beyond the raw FLOP reduction, shortening the dependency chain lets SIMD pipelines better overlap the decode of the next lane with the accumulator update of the previous lane.

The Change

                    Current                           This PR
                    --------                          ---------
set_query(q):      store q                           q_adj = (q - vmin) / vdiff  [once]

per code (×N):     raw = decode(bytes)               raw = decode(bytes)
                   x = vmin + raw * vdiff  ← gone    diff = q_adj - raw
                   diff = q - x                      accu += diff^2
                   accu += diff^2

The optimization is gated on a C++20 requires check for a new decode_8_raw() method, defined only on the uniform QuantizerTemplate specializations. All other quantizer types fall through to the original compute_distance path unchanged.

Speedup

Below are the results from running benchs/bench_scalar_quantizer.py for a dd build on SPR, compared to the existing implementation. Similar results were observed for AVX2.

|              | QT_4bit_uniform | QT_8bit_uniform |
|--------------|-----------------|-----------------|
| RS_minmax    | 0.99x           | 1.05x           |
| RS_minmax    | 1.07x           | 1.03x           |
| RS_minmax    | 0.83x           | 1.03x           |
| RS_minmax    | 0.96x           | 1.05x           |
| RS_minmax    | 0.89x           | 1.03x           |
| RS_minmax    | 1.14x           | 1.03x           |
| RS_minmax    | 0.99x           | 1.05x           |
| RS_meanstd   | 1.28x           | 1.09x           |
| RS_meanstd   | 1.10x           | 1.01x           |
| RS_meanstd   | 1.14x           | 1.06x           |
| RS_meanstd   | 1.18x           | 1.08x           |
| RS_meanstd   | 1.11x           | 1.05x           |
| RS_meanstd   | 1.18x           | 1.06x           |
| RS_meanstd   | 1.16x           | 1.08x           |
| RS_quantiles | 1.39x           | 1.07x           |
| RS_quantiles | 1.08x           | 1.01x           |
| RS_quantiles | 1.21x           | 0.99x           |
| RS_quantiles | 1.25x           | 1.10x           |
| RS_optim     | 1.03x           | 1.03x           |

The raw performance results are available here: https://gist.github.com/mulugetam/7db50f89279bb270a1fe336206730d60

For uniform integer scalar quantizers (QT_8bit_uniform and
QT_4bit_uniform), vmin and vdiff are scalars shared across all
dimensions. The reconstructed value for each component is a
function of a per-code decode n that depends only on the byte code:

    x_hat = vmin + vdiff * n

This lets us factor vmin and vdiff out of the per-database-vector
inner loop. For L2:

  ||q - x_hat||^2 = ||q - (vmin + vdiff * n)||^2
                  = vdiff^2 * ||(q - vmin) / vdiff - n||^2

The query is pre-adjusted once in set_query() to q_adj = (q - vmin)/vdiff
with scale = vdiff^2, and the hot loop compares codes directly against
q_adj in the codec's native decode space, applying scale once at the
end.

For Inner Product, the same factoring applies linearly:

  <q, x_hat> = sum_i q[i] * (vmin + vdiff * n[i])
             = vmin * sum_i q[i] + vdiff * sum_i q[i] * n[i]
             = bias + scale * <q, n>

with bias = vmin * sum_i q[i] and scale = vdiff computed once in set_query();
the hot loop accumulates the dot product against the raw decode n and
forms bias + scale * <q, n> once at the end.

In both cases this removes one FMA (the vmin + vdiff·n denormalization)
from the inner loop per 8 components per database vector.

All other quantizer types fall through to the original compute_distance
path unchanged.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
@meta-cla meta-cla Bot added the CLA Signed label Apr 30, 2026
@mulugetam
Contributor Author

@subhadeepkaran @mnorris11 Could you take a look when you have time? Thanks!

Contributor

@mdouze mdouze left a comment

Thanks for the contribution.
I am mainly concerned by the pattern of if constexpr (xxx). Would it be possible to move this to SimilarityL2/SimilarityIP?

rsname)

index.rangestat_arg = val
index.sq.rangestat_arg = val
Contributor

good catch!

void set_query(const float* x) final {
    q = x;
    if constexpr (has_decode_raw()) {
        if constexpr (Sim::metric_type == METRIC_L2) {
Contributor

Is it possible to defer these tests to the Similarity object? This is where the IP / L2 distinction is managed.

Quantizer quant;

// Pre-adjusted query buffer for uniform quantizers
std::vector<float> q_adj;
Contributor

what is the impact of dynamic allocation here? This object is intended to be very lightweight.

Move the metric-specific query pre-adjustment and raw-decode distance
accumulation out of DCTemplate and into the Similarity classes, where
the IP/L2 distinction is already managed.

- Add a static adjust_query_for_raw_decode() method to each
  SimilarityL2 and SimilarityIP specialization (AVX512, AVX2, NEON).
- Replace the if constexpr (Sim::metric_type == METRIC_L2) branches
  in DCTemplate::set_query() with a single call to
  Sim::adjust_query_for_raw_decode().
- Replace the hand-written SIMD loops in query_to_code_predecoded()
  with calls to the existing Similarity accumulator interface
  (begin_N / add_N_components / result_N).
- Fix bench_scalar_quantizer.py: filter out QT_count, since it is
  not a valid quantizer type.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
@mulugetam
Contributor Author

mulugetam commented May 4, 2026

Thank you @mdouze. I think that makes it even cleaner. I have made these changes:

  • Added a static adjust_query_for_raw_decode() method to each SimilarityL2 and SimilarityIP specialization (AVX512, AVX2, NEON).
  • Replaced the if constexpr (Sim::metric_type == METRIC_L2) branches in DCTemplate::set_query() with a single call to Sim::adjust_query_for_raw_decode().
  • Replaced the hand-written SIMD loops in query_to_code_predecoded() with calls to the existing Similarity accumulator interface (begin_N / add_N_components / result_N).
  • Fixed a bug in benchs/bench_scalar_quantizer.py that caused an error at the end of a run because QT_count was being treated as a quantizer type.

Regarding the cost of vector<float> q_adj: it contains d elements (resized once in the constructor via q_adj.resize(d)) and costs one malloc per thread at search setup time, so I think it's the right tradeoff to make.
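
For readers following the refactoring, here is a rough scalar sketch of the resulting shape: the metric-specific query adjustment lives in a static method on the similarity class, and the distance computer owns a buffer allocated once at construction. The names SimL2Sketch, SimIPSketch, PostScaleSketch, add_component, and DistanceComputerSketch are hypothetical; the real classes are SIMD specializations with a begin_N / add_N_components / result_N interface, and whether the IP path copies q into the buffer is an assumption made here for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Applied once after the hot loop: result = bias + scale * accu.
struct PostScaleSketch {
    float bias;
    float scale;
};

// Hypothetical scalar stand-in for the L2 similarity class.
struct SimL2Sketch {
    static PostScaleSketch adjust_query(
            const float* q, std::size_t d, float vmin, float vdiff, float* q_adj) {
        assert(vdiff != 0.0f);
        for (std::size_t i = 0; i < d; ++i) {
            q_adj[i] = (q[i] - vmin) / vdiff;
        }
        return {0.0f, vdiff * vdiff};
    }
    static float add_component(float accu, float qi, float n) {
        const float diff = qi - n;
        return accu + diff * diff;
    }
};

// Hypothetical scalar stand-in for the inner product similarity class.
struct SimIPSketch {
    static PostScaleSketch adjust_query(
            const float* q, std::size_t d, float vmin, float vdiff, float* q_adj) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < d; ++i) {
            q_adj[i] = q[i]; // assumption: the query is copied unchanged
            sum += q[i];
        }
        return {vmin * sum, vdiff};
    }
    static float add_component(float accu, float qi, float n) {
        return accu + qi * n;
    }
};

template <class Sim>
struct DistanceComputerSketch {
    std::size_t d;
    float vmin;
    float vdiff;
    std::vector<float> q_adj; // sized once in the constructor
    PostScaleSketch post{0.0f, 1.0f};

    DistanceComputerSketch(std::size_t dim, float vmin_in, float vdiff_in)
            : d(dim), vmin(vmin_in), vdiff(vdiff_in), q_adj(dim) {}

    // No metric branching here: the similarity class decides the adjustment.
    void set_query(const float* q) {
        post = Sim::adjust_query(q, d, vmin, vdiff, q_adj.data());
    }

    float query_to_code(const std::uint8_t* code) const {
        float accu = 0.0f;
        for (std::size_t i = 0; i < d; ++i) {
            accu = Sim::add_component(accu, q_adj[i], static_cast<float>(code[i]));
        }
        return post.bias + post.scale * accu;
    }
};
```

In this sketch, repeated set_query() calls reuse the same buffer, so the only allocation happens when the computer is constructed.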

@meta-codesync
Contributor

meta-codesync Bot commented May 23, 2026

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106148760.

@meta-codesync meta-codesync Bot closed this in 1cb7601 May 29, 2026
@meta-codesync
Contributor

meta-codesync Bot commented May 29, 2026

@mnorris11 merged this pull request in 1cb7601.
