This file is meant to be a paper-facing draft for the experiments section.
It is anchored on the current figure slate and on the JSON artifacts in
benchmarks/results/.
Recommended main-text figure order:
- 12_status_quo_quantization.png
- 05_text_embedding_models.png
- 07_hybrid_dense_sparse.png
- 11_beir_relevance_server.png
The experimental story should be narrow and defensible:
- TurboQuant is materially stronger than standard PQ / OPQ baselines at the same storage budget.
- TurboQuant is competitive with SQ at matched storage, while remaining near full-precision dense retrieval quality.
- The compression cost remains close to zero inside a hybrid BM25+dense retrieval pipeline.
- TurboQuant has two different deployment stories: direct compressed search, which preserves packed storage savings, and reconstruction-based vector-store compatibility, which preserves quality but not server-side compression.
Avoid a broader thesis than this. In particular, the current repository does not support claims about billion-scale ANN routing, native compressed vector-store indexing, or universal latency wins over dense BLAS baselines.
We evaluate TurboQuant on official BeIR relevance judgments rather than proxy overlap with a full-precision ranking. Our main datasets are the SciFact and NFCorpus test splits, which give two distinct retrieval regimes: scientific fact-checking and medical abstract search. Dense encoders are all-MiniLM-L6-v2, all-mpnet-base-v2, and bge-m3, and all embeddings are L2-normalized before indexing or compression. Unless noted otherwise, evaluation uses NDCG@10, Recall@10, MRR@10, and query-level 95% bootstrap confidence intervals.
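As a reading aid for the metric definitions, the following is a minimal sketch of per-query NDCG@10, Recall@10, and MRR@10 computed from qrels. It is not the repository's evaluation code. The function names are hypothetical, and the linear-gain form of NDCG is an assumption; some evaluators use an exponential gain instead.

```python
import math

def ndcg_at_k(ranked_ids, qrels, k=10):
    """NDCG@k for one query; qrels maps doc_id -> graded relevance (int >= 0).

    Assumption: linear gain (gain = relevance grade).
    """
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, qrels, k=10):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

def mrr_at_k(ranked_ids, qrels, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

# Toy query: two relevant documents, retrieved at ranks 2 and 3.
qrels = {"d1": 1, "d2": 1}
ranked = ["d3", "d1", "d2"]
scores = (ndcg_at_k(ranked, qrels), recall_at_k(ranked, qrels), mrr_at_k(ranked, qrels))
```

Averaging these per-query values over the test queries gives the corpus-level numbers; keeping the per-query values around is what makes query-level bootstrap intervals possible.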
For status-quo comparisons, we benchmark TurboQuant against exact dense FP32 search and against standard FAISS quantizers: SQ4, SQ8, PQ16, and OPQ16. A key detail is that the x-axis and storage tables use actual serialized bytes per vector. For TurboQuant, this is the packed transferable representation; for FAISS, it is the serialized index size amortized over the corpus, which counts learned codebooks and transforms. This accounting matters for small and medium RAG corpora, where codebook overhead is not negligible.
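The accounting can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the repository's serialization format: the helper names are hypothetical, the 4-byte per-vector overhead is a guess (for example one float32 scalar stored per vector), and the PQ estimate ignores file headers and any OPQ transform.

```python
def packed_bytes_per_vector(dim, bits, overhead_bytes=4):
    """Bit-packed code size plus an assumed fixed per-vector overhead."""
    code_bytes = (dim * bits + 7) // 8  # round up to whole bytes
    return code_bytes + overhead_bytes

def pq_amortized_bytes(dim, m, nbits, n_vectors):
    """Approximate PQ bytes per vector with the codebook amortized over the corpus.

    The codebook holds m sub-codebooks of 2**nbits centroids, each of
    dim / m float32 values, so 2**nbits * dim * 4 bytes in total.
    """
    code_bytes = m * nbits / 8
    codebook_bytes = (2 ** nbits) * dim * 4
    return code_bytes + codebook_bytes / n_vectors

def compression_ratio(dim, bytes_per_vector):
    """Ratio of raw float32 storage to the compressed representation."""
    return (dim * 4) / bytes_per_vector

# 384-dimensional embeddings (the output size of all-MiniLM-L6-v2).
tq2 = packed_bytes_per_vector(384, bits=2)   # 96 code bytes + 4 overhead
tq4 = packed_bytes_per_vector(384, bits=4)   # 192 code bytes + 4 overhead
# Hypothetical corpus of 5,000 vectors, 16 sub-quantizers, 8 bits each.
pq16 = pq_amortized_bytes(384, m=16, nbits=8, n_vectors=5000)
```

With these assumptions a 16-byte PQ code costs roughly 95 bytes per vector on a 5,000-document corpus, because the codebook dominates. This is why amortized serialized size, rather than nominal code size, is the fair x-axis at this scale.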
For hybrid retrieval, the sparse leg is a bm25s Lucene-style BM25 baseline with English stemming and RRF@100 fusion. This is not a full search server benchmark; it is a controlled pipeline benchmark whose purpose is to answer the deployment question directly: if the dense leg is compressed, how much hybrid quality do we lose?
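For readers unfamiliar with the fusion step, here is a minimal sketch of reciprocal rank fusion. It assumes that "RRF@100" means fusing the top 100 results of each leg and uses the conventional smoothing constant k=60. Both are assumptions, and `rrf_fuse` is a hypothetical name rather than a function from the repository.

```python
def rrf_fuse(rankings, k=60, depth=100):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it
    within the first `depth` positions (ranks start at 1).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:depth], start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Because RRF only consumes ranks, compressing the dense leg affects the fused result only where it reorders the dense top list, which is the effect the hybrid benchmark isolates.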
For integration experiments, we separate two deployment modes. The first is direct in-process TurboQuant search over packed vectors. The second uploads TurboQuant-reconstructed float32 vectors into a Chroma HTTP server. This distinction is essential: the second path measures compatibility with a conventional vector store, not native compressed indexing.
Figure 12_status_quo_quantization.png is the main result. It answers the question that most compression papers leave partially unresolved: why use TurboQuant instead of the standard scalar or product quantizers already available in FAISS?
At the ultra-compact budget, TQ-2b clearly outperforms PQ16 at matched or smaller storage. On SciFact with MiniLM, TQ-2b improves NDCG@10 by +0.114 over PQ16 at essentially the same footprint (100.0 vs 99.9 bytes per vector). On NFCorpus with MiniLM the gain is +0.034, and with mpnet the gains are +0.060 on SciFact and +0.036 on NFCorpus while using equal or fewer bytes. This is the strongest novelty claim in the current paper package.
At the medium budget, TQ-4b is not a dramatic win over SQ4, but it remains competitive at near-matched storage while staying close to full precision. The observed TQ-4b - SQ4 NDCG@10 deltas are +0.001, -0.001, +0.003, and -0.001 across the four dataset-model pairs. That is the correct way to present the result: TurboQuant does not universally dominate scalar quantization, but it stays within 0.003 NDCG@10 of SQ4 in either direction while avoiding training and staying close to the theoretical TurboQuant design.
OPQ16 is consistently weaker than TQ-4b in the current experiments despite using more bytes per vector. This matters for the paper narrative because it shows that the advantage is not limited to the weakest baseline family.
Figure 05_text_embedding_models.png shows that the 4-bit result is not tied to one encoder. Across SciFact and NFCorpus, the TQ-4b - FP32 NDCG@10 deltas range from -0.0037 to +0.0052 over all-MiniLM-L6-v2, all-mpnet-base-v2, and bge-m3. On SciFact, the observed deltas are positive for all three encoders (+0.0045, +0.0014, +0.0052), while NFCorpus shows small negative deltas (-0.0037, -0.0007, -0.0027). This is consistent with the claim that 4-bit TurboQuant stays near full-precision retrieval quality across modern embedding models.
This figure should not be used as the main novelty figure. Its role is to show robustness after the status-quo figure has already established why TurboQuant is interesting.
Figure 07_hybrid_dense_sparse.png is the main RAG-facing figure. It uses paired query-wise deltas so that the paper does not overstate raw quality differences that are really driven by the sparse leg. The correct question is not whether hybrid retrieval is good in the abstract, but whether compressing the dense leg damages the hybrid lift.
The answer is that the damage is negligible in the current setup. On SciFact, the paired hybrid deltas are -0.0012 in NDCG@10, +0.0033 in Recall@10, and -0.0039 in MRR@10. On NFCorpus, the paired hybrid deltas are -0.0006, -0.0035, and -0.0008, respectively. These values remain close to zero and their confidence intervals overlap zero in most cases. This supports the deployment claim that TurboQuant can be inserted into a hybrid BM25+dense pipeline without materially changing retrieval effectiveness.
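A paired query-wise delta with a bootstrap interval can be sketched as follows. This is a standard-library illustration, not the benchmark's implementation. The function name, the percentile method, and the resample count are all assumptions.

```python
import random

def paired_bootstrap_ci(baseline, compressed, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-query delta (compressed - baseline) with a percentile bootstrap CI.

    Both inputs are per-query metric values in the same query order, so each
    resample draws whole queries and keeps the pairing intact.
    """
    if len(baseline) != len(compressed) or len(baseline) == 0:
        raise ValueError("need two equal-length, non-empty score lists")
    deltas = [c - b for b, c in zip(baseline, compressed)]
    n = len(deltas)
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return sum(deltas) / n, lo, hi

# Toy per-query NDCG@10 values for the FP32 hybrid and the compressed hybrid.
fp32_scores = [0.50, 0.62, 0.71, 0.80, 0.45, 0.90]
tq4_scores = [0.51, 0.60, 0.71, 0.81, 0.44, 0.90]
mean_delta, ci_low, ci_high = paired_bootstrap_ci(fp32_scores, tq4_scores)
```

Resampling queries rather than the two systems independently removes the variance both systems share through the sparse leg, so the interval reflects only the effect of compressing the dense leg.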
The figure should still be framed carefully. It is a strong pipeline result, but the sparse baseline is a local BM25 reference implementation rather than a Pyserini/Anserini or production search service benchmark.
Figure 11_beir_relevance_server.png is a systems clarification figure rather than a novelty figure. Its value is that it separates two deployment modes that are easy to conflate.
In the direct path, TurboQuant performs search over its packed representation and achieves real storage compression: the packed ratio is 7.84x to 7.92x in the current BeIR runs, while NDCG@10 remains within -0.0037 to +0.0045 of exact dense retrieval. The trade-off is that this reference NumPy implementation is slower than exact dense BLAS search on CPU, with search latency ratios between 1.28x and 1.58x.
In the Chroma compatibility path, quality remains close to the Chroma FP32 baseline, with NDCG@10 deltas between -0.0019 and +0.0026, but server-side storage remains 1.00x by construction because reconstructed float32 vectors are uploaded. This figure is therefore useful to explain deployment trade-offs, not to claim native compressed indexing inside an external vector store.
Claims the current evidence supports:
- TQ-2b materially outperforms PQ16 at matched or smaller serialized storage on BeIR qrels.
- TQ-4b remains near FP32 quality and is competitive with SQ4 at roughly 8x compression.
- The hybrid BM25+dense lift is preserved under 4-bit TurboQuant.
- The direct TurboQuant path yields real packed storage savings.
- Reconstruction-based vector-store integration preserves quality but not compressed storage inside the server.
Claims the current evidence does not support:
- TurboQuant is uniformly better than SQ across all budgets and datasets.
- TurboQuant is faster than exact dense CPU search in this reference implementation.
- Chroma results imply native compressed vector-store storage.
- The current experiments prove billion-scale ANN performance.
- The current experiments are a definitive ColBERT or multi-vector systems benchmark.
Preferred wording:
- "TurboQuant materially improves over PQ/OPQ at matched storage, and remains competitive with SQ while staying near full-precision quality."
Avoid:
- "TurboQuant dominates every standard quantizer."
Preferred wording:
- "The 4-bit result is stable across multiple dense encoders."
Avoid:
- "TurboQuant improves retrieval quality for every model and dataset."
Preferred wording:
- "The paired compression cost inside the hybrid pipeline remains close to zero."
Avoid:
- "TurboQuant improves hybrid retrieval."
Preferred wording:
- "TurboQuant supports a real compressed path and a compatibility path; only the direct path compresses stored vectors."
Avoid:
- "TurboQuant compresses Chroma."
The current figure set is produced by:
.venv/bin/python -u benchmarks/status_quo_quantization_bench.py
.venv/bin/python -u benchmarks/rag_benchmarks.py --skip 1 2 4 --include-experimental
.venv/bin/python -u benchmarks/beir_relevance_bench.py
Interpret latency carefully. The repository currently provides a NumPy reference implementation, so absolute search timings should be read as engineering diagnostics rather than as the main evidence for the method.
- Use Figure 12 as the first retrieval result and main novelty anchor.
- Use Figure 05 to show that the result is robust across embedding models.
- Use Figure 07 to make the RAG pipeline argument.
- Use Figure 11 to explain deployment trade-offs without over-claiming server-side compression.