-
Notifications
You must be signed in to change notification settings - Fork 0
IVF (Inverted File Index)
Inverted File Index / Inverted Index / Inverted File System is a classic approximate nearest neighbour (ANN) technique widely used in vector search systems (e.g. FAISS).
- Two-stage search structure that partitions the vector space into “coarse cells” and then searches only within a few of those cells, rather than across the entire dataset.
- The idea is to avoid exhaustive search by first grouping similar vectors together, and then for a given query only comparing against vectors in a few relevant groups.
-
Coarse quantisation (clustering)
- Run k-means on your dataset to obtain k centroids:
${c₁, c₂, ..., c_k}$ . - Each centroid represents a coarse cell (or “inverted list”).
- Run k-means on your dataset to obtain k centroids:
-
Assign data points to centroids
- Each vector
$x_i$ is assigned to its nearest centroid$c_j$ . - You store the residual vector
$rᵢ = xᵢ - c_j$ in the inverted list corresponding to centroid$j$ .- The centroid already represents the coarse location of that region in space.
- The residual encodes the local offset within that region.
- The inverted list acts like an “index” for that cell.
- Each vector
When querying with vector q:
-
Find the closest n_probe centroids to q:
$nearest cells = argmin_{j} || q - c_j ||$ (often just the top N centroids by proximity). -
Search only in those cells
- For each selected centroid
$c_j$ :- Compute the residual:
$r_q = q - c_j$ - Compare
$r_q$ to the residuals stored in that cell to find nearest neighbours.
- Compute the residual:
- For each selected centroid
- Re-rank the best candidates globally to return the final top-k.
Why do we calculate and store residuals
-
Residuals captures the fine-grained difference between the vector and the centroid.
-
Why this matters:
- The centroid already represents the coarse location of that region in space.
- The residual encodes the local offset within that region.
-
So, instead of storing large, absolute coordinates (which may vary widely across cells), you store compact, small residuals relative to the centroid.
-
This is critical when IVF is used with quantisation, e.g. IVF+PQ (Product Quantisation):
- PQ learns codebooks to compress these residuals, not the original vectors.
- Because residuals are local, they have smaller variance, making compression much more accurate and efficient.
If you’re using IVF+Flat, there’s no compression — you could technically skip residuals — but IVF implementations (like FAISS) keep this structure for consistency.
IVF can be used alone (IVF+Flat) or with other search algorithms/indexes, for example:
- IVF+PQ → stores compressed residuals using Product Quantisation.
- IVF+HNSW → uses HNSW as a coarse quantiser.
High-dimensional spaces make clustering and distance estimation harder — the curse of dimensionality.
- Impacts on IVF:
- Centroid quality degrades: k-means struggles to form meaningful clusters because distances between all points become similar.
- Residuals become noisier: offsets don’t capture local variation well.
- Recall drops unless you increase nprobe or nlist, which increases compute.
- Mitigation:
- Apply dimensionality reduction (e.g. PCA, OPQ).
- Use IVF+PQ or IVF+HNSW hybrid — quantisation + graph helps correct coarse errors.
- IVF+Flat → high-throughput, static datasets (e.g. FAISS on GPU).
- HNSW → dynamic, low-latency CPU search (e.g. Milvus, Qdrant, Pinecone).