Eliminate per-code denormalization in uniform SQ distance computation #5166
mulugetam wants to merge 4 commits into
For uniform integer scalar quantizers (QT_8bit_uniform and
QT_4bit_uniform), vmin and vdiff are scalars shared across all
dimensions. The reconstructed value for each component is a
function of a per-code decode n that depends only on the byte code:
x_hat = vmin + vdiff * n
This lets us factor vmin and vdiff out of the per-database-vector
inner loop. For L2:
||q - x_hat||^2 = ||q - (vmin + vdiff * n)||^2
= vdiff^2 * ||(q - vmin) / vdiff - n||^2
The query is pre-adjusted once in set_query() to q_adj = (q - vmin)/vdiff
with scale = vdiff^2, and the hot loop compares codes directly against
q_adj in the codec's native decode space, applying scale once at the
end.
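The L2 factoring above can be checked with a small scalar sketch. This is not the PR's SIMD code: the names l2_reference, L2Query, adjust_query_l2 and l2_factored are hypothetical, and the per-code decode n is assumed here to be the raw byte value, which may differ from the codec's actual decode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference path: denormalize every code, then accumulate the squared difference.
float l2_reference(const std::vector<float>& q,
                   const std::vector<std::uint8_t>& codes,
                   float vmin, float vdiff) {
    assert(q.size() == codes.size());
    float acc = 0.f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        float x_hat = vmin + vdiff * static_cast<float>(codes[i]);
        float d = q[i] - x_hat;
        acc += d * d;
    }
    return acc;
}

// Hypothetical container for the per-query state computed once.
struct L2Query {
    std::vector<float> q_adj; // (q - vmin) / vdiff
    float scale;              // vdiff^2
};

// Plays the role of set_query(): one pass over the query, no codes involved.
L2Query adjust_query_l2(const std::vector<float>& q, float vmin, float vdiff) {
    assert(vdiff != 0.f);
    L2Query out;
    out.q_adj.reserve(q.size());
    for (float v : q) {
        out.q_adj.push_back((v - vmin) / vdiff);
    }
    out.scale = vdiff * vdiff;
    return out;
}

// Hot loop: compare against the raw decode n, apply scale once at the end.
float l2_factored(const L2Query& aq, const std::vector<std::uint8_t>& codes) {
    assert(aq.q_adj.size() == codes.size());
    float acc = 0.f;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        float d = aq.q_adj[i] - static_cast<float>(codes[i]); // no vmin/vdiff here
        acc += d * d;
    }
    return aq.scale * acc;
}
```

Both functions return the same distance up to floating-point rounding; the factored version simply moves the vmin and vdiff arithmetic from every code to a single per-query step.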
For Inner Product, the same factoring applies linearly:
<q, x_hat> = sum_i q[i] * (vmin + vdiff * n[i])
= vmin * sum_i q[i] + vdiff * sum_i q[i] * n[i]
= bias + scale * <q, n>
with bias = vmin * sum_i q[i] and scale = vdiff computed once in set_query();
the hot loop accumulates the dot product against the raw decode n and
applies bias + scale * <q, n> once at the end.
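The inner product case can be sketched the same way. Again this is a scalar illustration rather than the PR's implementation: ip_reference, IPQuery, adjust_query_ip and ip_factored are hypothetical names, and n is assumed to be the raw byte value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference path: denormalize every code inside the dot product.
float ip_reference(const std::vector<float>& q,
                   const std::vector<std::uint8_t>& codes,
                   float vmin, float vdiff) {
    assert(q.size() == codes.size());
    float acc = 0.f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        acc += q[i] * (vmin + vdiff * static_cast<float>(codes[i]));
    }
    return acc;
}

// Hypothetical per-query constants, computed once.
struct IPQuery {
    float bias;  // vmin * sum_i q[i]
    float scale; // vdiff
};

// Plays the role of set_query() for the inner product metric.
IPQuery adjust_query_ip(const std::vector<float>& q, float vmin, float vdiff) {
    float sum = 0.f;
    for (float v : q) {
        sum += v;
    }
    return IPQuery{vmin * sum, vdiff};
}

// Hot loop: plain dot product against the raw decode n,
// then bias + scale * <q, n> once at the end.
float ip_factored(const IPQuery& aq,
                  const std::vector<float>& q,
                  const std::vector<std::uint8_t>& codes) {
    assert(q.size() == codes.size());
    float dot = 0.f;
    for (std::size_t i = 0; i < q.size(); ++i) {
        dot += q[i] * static_cast<float>(codes[i]);
    }
    return aq.bias + aq.scale * dot;
}
```

Unlike the L2 case, the query itself is left untouched here; only two scalars need to be stored per query.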
In both cases this removes one FMA (the vmin + vdiff·n denormalization)
from the inner loop per 8 components per database vector.
All other quantizer types fall through to the original compute_distance
path unchanged.
Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
@subhadeepkaran @mnorris11 Could you take a look when you have time? Thanks!
mdouze left a comment
Thanks for the contribution.
I am mainly concerned by the pattern of if constexpr (xxx). Would it be possible to move this to SimilarityL2/SimilarityIP?
    rsname)

    index.rangestat_arg = val
    index.sq.rangestat_arg = val
    void set_query(const float* x) final {
        q = x;
        if constexpr (has_decode_raw()) {
            if constexpr (Sim::metric_type == METRIC_L2) {
Is it possible to defer these tests to the Similarity object? This is where the IP / L2 distinction is managed.
    Quantizer quant;

    // Pre-adjusted query buffer for uniform quantizers
    std::vector<float> q_adj;
What is the impact of dynamic allocation here? This object is intended to be very lightweight.
Move the metric-specific query pre-adjustment and raw-decode distance
accumulation out of DCTemplate and into the Similarity classes, where the
IP/L2 distinction is already managed.
- Add a static adjust_query_for_raw_decode() method to each SimilarityL2
  and SimilarityIP specialization (AVX512, AVX2, NEON).
- Replace the if constexpr (Sim::metric_type == METRIC_L2) branches in
  DCTemplate::set_query() with a single call to
  Sim::adjust_query_for_raw_decode().
- Replace the hand-written SIMD loops in query_to_code_predecoded() with
  calls to the existing Similarity accumulator interface
  (begin_N / add_N_components / result_N).
- Fix bench_scalar_quantizer.py: filter out QT_count, since it is not a
  valid quantizer type.
Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
Thank you @mdouze. I think that makes it even cleaner. I have made these changes:
Regarding the cost of |
@mnorris11 has imported this pull request. If you are a Meta employee, you can view this in D106148760.
@mnorris11 merged this pull request in 1cb7601.
This PR removes per-code denormalization from the inner loop of L2 and inner product distance computations for uniform scalar quantizers (QT_8bit_uniform and QT_4bit_uniform), yielding a speedup of up to 1.39x.

For uniform integer scalar quantizers, vmin and vdiff are scalars shared across all dimensions. The reconstructed value for each component is therefore a function of a per-code decode n that depends only on the byte code:

x_hat = vmin + vdiff * n

This structure lets us factor vmin and vdiff out of the per-database-vector inner loop entirely, instead of recomputing the transform on every code on every distance evaluation.

L2 distance.
We pre-adjust the query once in set_query() to q_adj = (q - vmin) / vdiff and precompute scale = vdiff^2. The hot loop then compares codes directly against q_adj in the codec's native decode space, applying scale exactly once at the end.

Inner product.
We compute bias = vmin * sum_i q[i] and scale = vdiff once in set_query(). The hot loop accumulates the dot product against the raw decode n and applies bias + scale * <q, n> once at the end.

In both cases this removes one FMA (the vmin + vdiff * n denormalization) from the inner loop per 8 components per database vector. Beyond the raw FLOP reduction, shortening the dependency chain lets SIMD pipelines better overlap the decode of the next lane with the accumulator update of the previous lane.

The Change
The optimization is gated on a C++20 requires check for a new decode_8_raw() method, defined only on the uniform QuantizerTemplate specializations. All other quantizer types fall through to the original compute_distance path unchanged.

Speedup
Below are the results from running benchs/bench_scalar_quantizer.py for addbuild on SPR, compared to the existing implementation. Similar results were observed for avx-2 as well.

The raw performance results are available here: https://gist.github.com/mulugetam/7db50f89279bb270a1fe336206730d60
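The compile-time gating described under The Change can be illustrated with a toy example. Everything here is an assumption made for illustration: ToyUniformQuantizer, ToyOtherQuantizer, decode_raw, reconstruct, has_raw_decode and decode_for_distance are hypothetical names, not the Faiss types, and the single-code decode_raw stands in for the 8-wide decode_8_raw() named above. The sketch uses a requires expression when the compiler supports concepts and falls back to a std::void_t detection idiom otherwise.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Toy uniform quantizer: exposes the raw decode in addition to reconstruction.
struct ToyUniformQuantizer {
    float vmin;
    float vdiff;
    constexpr float decode_raw(std::uint8_t code) const {
        return static_cast<float>(code);
    }
    constexpr float reconstruct(std::uint8_t code) const {
        return vmin + vdiff * decode_raw(code);
    }
};

// Toy quantizer without a raw decode: only full reconstruction is available.
struct ToyOtherQuantizer {
    float vmin;
    float vdiff;
    constexpr float reconstruct(std::uint8_t code) const {
        return vmin + vdiff * static_cast<float>(code);
    }
};

#if defined(__cpp_concepts)
// C++20: true exactly when q.decode_raw(c) is a valid expression.
template <class Q>
constexpr bool has_raw_decode =
        requires(const Q& q, std::uint8_t c) { q.decode_raw(c); };
#else
// Pre-C++20 fallback with the same meaning, using the detection idiom.
template <class Q, class = void>
constexpr bool has_raw_decode = false;
template <class Q>
constexpr bool has_raw_decode<
        Q,
        std::void_t<decltype(std::declval<const Q&>().decode_raw(
                std::uint8_t{}))>> = true;
#endif

// Quantizers with a raw decode take the factored path (value still needs the
// per-query scale/bias applied later); all others fall through to the
// original full reconstruction.
template <class Q>
constexpr float decode_for_distance(const Q& quant, std::uint8_t code) {
    if constexpr (has_raw_decode<Q>) {
        return quant.decode_raw(code);
    } else {
        return quant.reconstruct(code);
    }
}
```

Because the branch is resolved at compile time, quantizer types that lack the raw decode never instantiate the factored path and pay no runtime cost for the check.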