This guide explains how to use Seismic's Rust code independently (standalone) or integrate it into your own Rust project (via Cargo).
First, we clone the Seismic Git repository:
git clone git@github.com:TusKANNy/seismic.git
cd seismic
Then, we have to compile the project. After executing the following command, the binary executable will be found in ./target/release.
RUSTFLAGS="-C target-cpu=native" cargo build --release
Let's now build an index on the Splade embeddings for the MS MARCO v1 passage collection. To download the encoded vectors, please refer to Setting up for the Experiment.
Seismic has several parameters that control the space/time trade-offs when building the index:
-i, --input-file: path to the source collection file (documents.bin).
-o, --output-file: output index base path; the binary appends .index.seismic.
-n, --n-postings: defines the posting list size, representing the average number of postings stored per list (default: 6000).
-s, --summary-energy: controls the summary size, preserving a fraction of the overall energy for each summary (default: 0.5).
--centroid-fraction: determines the number of centroids per posting list, capped at a fraction of the posting list length (default: 0.1).
--clustering-algorithm: selects the algorithm to cluster postings within each posting list. Options: random-kmeans, random-kmeans-inverted-index, random-kmeans-inverted-index-approx.
--pruning-strategy: selects the posting list pruning strategy. Options: global-threshold, coi-threshold, fixed-size.
-b, --block-size: block size used for fixed-size blocking (default: 10).
--kmeans-doc-cut: number of top components retained while clustering (default: 15).
--kmeans-pruning-factor: pruning factor used by the random k-means blocking (default: 0.005).
--min-cluster-size: minimum cluster size allowed (default: 2).
-a, --alpha: fraction of L1 mass preserved by the COI pruning strategy (default: 0.15).
-m, --max-fraction: maximum posting list length as a factor of n_postings (default: 1.5).
--knn: number of neighbors stored per vector for the k-NN graph (default: 0, i.e., disabled).
--knn-path: path to a precomputed nearest-neighbor file.
--component-type: component type, u16 (default) or u32 for large vocabularies.
-v, --value-type: value type, one of f16 (default), bf16, f32, fixedu8, fixedu16, or dotvbyte.
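To make the --summary-energy parameter more concrete, the following is a minimal sketch of energy-preserving pruning on a single sparse vector. It is not Seismic's implementation: the function name summarize_by_energy is hypothetical, and the sketch assumes that values are non-negative (as with Splade weights) and that "energy" means the sum of the values.

```rust
/// Illustrative only (hypothetical helper, not part of the Seismic API).
/// Keeps the largest-valued entries of a sparse vector until their sum
/// reaches `fraction` of the vector's total mass.
/// Assumption: all values are non-negative.
fn summarize_by_energy(components: &[u16], values: &[f32], fraction: f32) -> Vec<(u16, f32)> {
    let mut entries: Vec<(u16, f32)> = components
        .iter()
        .copied()
        .zip(values.iter().copied())
        .collect();

    // Visit the largest values first.
    entries.sort_by(|a, b| b.1.total_cmp(&a.1));

    let total: f32 = values.iter().sum();
    let target = fraction * total;

    let mut kept = Vec::new();
    let mut accumulated = 0.0f32;
    for (component, value) in entries {
        if accumulated >= target {
            break;
        }
        accumulated += value;
        kept.push((component, value));
    }

    // Restore ascending component order, as is usual for sparse vectors.
    kept.sort_by_key(|&(component, _)| component);
    kept
}
```

For the vector with components [0, 1, 2, 3] and values [1.0, 2.0, 3.0, 4.0], a fraction of 0.5 keeps only components 2 and 3, since 4.0 + 3.0 already covers half of the total mass of 10.0. Lower fractions give smaller summaries at the cost of coarser dot product estimates.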
For Splade on MS MARCO, good choices are:
--n-postings 4000 --summary-energy 0.4 --centroid-fraction 0.1 --clustering-algorithm random-kmeans-inverted-index-approx
To create a Seismic index serialized in the file documents.bin.4000_0.4_0.1.index.seismic, run:
./target/release/build_inverted_index \
-i ~/sparse_datasets/msmarco_v1_passage/cocondenser/data/documents.bin \
-o ~/sparse_datasets/msmarco_v1_passage/cocondenser/indexes/documents.bin.4000_0.4_0.1 \
--centroid-fraction 0.1 \
--summary-energy 0.4 \
--n-postings 4000 \
--clustering-algorithm random-kmeans-inverted-index-approx
To query the index, we need to use the perf_inverted_index executable. Its parameters are:
-i, --index-file: path to the serialized index file.
-q, --query-file: query file containing ground-truth vectors.
-o, --output-path: output file with the ranked results.
--query-cut: limits the search algorithm to the top query_cut components of the query (default: 10).
--heap-factor: skips blocks whose estimated dot product exceeds heap_factor times the smallest dot product in the top-k results (default: 0.7).
-k: number of top-k results to retrieve (default: 10).
--n-knn: number of neighbors of top results to rescore via the k-NN graph (default: 0).
-f, --first-sorted: whether to sort the first posting list by estimated dot products before search (default: false).
--n-queries: number of queries to evaluate (default: 10000).
-r, --n-runs: number of runs to average (default: 1).
--query-energy: optional query energy filter.
--component-type: component type, u16 (default) or u32.
-v, --value-type: value type, one of f16 (default), bf16, f32, fixedu8, or fixedu16.
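As an illustration of what --query-cut does to a query, here is a small sketch that keeps only the top query_cut components. It is a sketch under assumptions, not Seismic's code: the helper top_query_components is hypothetical, and "top" is assumed to mean the components with the largest values.

```rust
/// Illustrative only (hypothetical helper, not part of the Seismic API).
/// Returns at most `query_cut` (component, value) pairs of the query,
/// ordered from the largest value to the smallest.
/// Assumption: "top" components are those with the largest values.
fn top_query_components(components: &[u16], values: &[f32], query_cut: usize) -> Vec<(u16, f32)> {
    let mut entries: Vec<(u16, f32)> = components
        .iter()
        .copied()
        .zip(values.iter().copied())
        .collect();

    // Largest values first, then drop everything past the cut.
    entries.sort_by(|a, b| b.1.total_cmp(&a.1));
    entries.truncate(query_cut);
    entries
}
```

With components [10, 20, 30], values [0.5, 2.0, 1.0], and query_cut = 2, only components 20 and 30 remain. A smaller cut means fewer posting lists to visit, which trades some accuracy for speed.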
Example command:
./target/release/perf_inverted_index \
-i ~/sparse_datasets/msmarco_v1_passage/cocondenser/indexes/documents.bin.4000_0.4_0.1.index.seismic \
-q ~/sparse_datasets/msmarco_v1_passage/cocondenser/data/queries.bin \
-o results.tsv \
--query-cut 5 \
--heap-factor 0.7
The results are written to results.tsv. Each query produces k lines in the following format:
query_id\tdocument_id\tresult_rank\tdot_product
Where:
query_id: A progressive identifier for the query.
document_id: The document ID from the indexed dataset.
result_rank: The ranking of the result by dot product.
dot_product: The similarity score.
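The four-column format above is easy to consume from Rust using only the standard library. The sketch below parses a single line; the ResultRow struct and parse_result_line function are hypothetical names introduced for illustration, and the sketch assumes that the identifiers and the rank are non-negative integers.

```rust
/// One line of results.tsv. Hypothetical type, not part of the Seismic API.
/// Assumption: ids and rank are non-negative integers.
#[derive(Debug, PartialEq)]
struct ResultRow {
    query_id: usize,
    document_id: usize,
    result_rank: usize,
    dot_product: f32,
}

/// Parses a `query_id\tdocument_id\tresult_rank\tdot_product` line.
/// Returns `None` if a field is missing, malformed, or if there are extra fields.
fn parse_result_line(line: &str) -> Option<ResultRow> {
    let mut fields = line.trim_end().split('\t');
    let row = ResultRow {
        query_id: fields.next()?.parse().ok()?,
        document_id: fields.next()?.parse().ok()?,
        result_rank: fields.next()?.parse().ok()?,
        dot_product: fields.next()?.parse().ok()?,
    };
    // Reject lines that carry more than four fields.
    if fields.next().is_some() {
        return None;
    }
    Some(row)
}
```

Applying parse_result_line to every line of the file (for example through std::io::BufRead::lines) yields the rows needed to compute recall against a ground truth.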
To incorporate the Seismic library into your Rust project, navigate to your project directory and run:
cargo add seismic
Let's create a toy dataset with vectors using f32 values. We'll then convert it to use half-precision floating points (half::f16) and check the dataset's properties.
use seismic::SparseDataset;
use half::f16;
let data = vec![
(vec![0, 2, 4], vec![1.0, 2.0, 3.0]),
(vec![1, 3], vec![4.0, 5.0]),
(vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
];
let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();
assert_eq!(dataset.len(), 3); // Number of vectors
assert_eq!(dataset.dim(), 5); // Number of components
assert_eq!(dataset.nnz(), 9); // Number of non-zero components
To read a dataset in Seismic's internal binary format and quantize values to f16:
let dataset = SparseDataset::<f32>::read_bin_file(&input_filename).unwrap().quantize_f16();
Let's build an index using our toy dataset and search for a query.
use seismic::inverted_index::{Configuration, InvertedIndex};
use seismic::SparseDataset;
use half::f16;
let data = vec![
(vec![0, 2, 4], vec![1.0, 2.0, 3.0]),
(vec![1, 3], vec![4.0, 5.0]),
(vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
];
let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();
let inverted_index = InvertedIndex::build(dataset, Configuration::default());
let result = inverted_index.search(&vec![0, 1], &vec![1.0, 2.0], 1, 5, 0.7);
assert_eq!(result[0].0, 8.0);
assert_eq!(result[0].1, 1);
Key parameters to experiment with:
n_postings (PruningStrategy::GlobalThreshold): defines the posting list size.
summary_energy (SummarizationStrategy::EnergyPreserving): controls summary size.
centroid_fraction (BlockingStrategy::RandomKmeans): sets centroids per posting list.
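The values asserted in the toy search above (score 8.0 for document 1) can be double-checked with an exhaustive scan that does not use the index at all. The following sketch relies only on the standard library; sparse_dot and brute_force_top1 are hypothetical helpers, and component ids are assumed to be sorted in ascending order within each vector.

```rust
use std::cmp::Ordering;

/// Dot product of two sparse vectors given as parallel (components, values) slices.
/// Hypothetical helper; assumes component ids are sorted in ascending order.
fn sparse_dot(q_components: &[u16], q_values: &[f32], d_components: &[u16], d_values: &[f32]) -> f32 {
    let (mut i, mut j, mut sum) = (0usize, 0usize, 0.0f32);
    // Two-pointer merge: only components present in both vectors contribute.
    while i < q_components.len() && j < d_components.len() {
        match q_components[i].cmp(&d_components[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                sum += q_values[i] * d_values[j];
                i += 1;
                j += 1;
            }
        }
    }
    sum
}

/// Scores every document and returns the best (score, document id) pair, if any.
fn brute_force_top1(
    docs: &[(Vec<u16>, Vec<f32>)],
    q_components: &[u16],
    q_values: &[f32],
) -> Option<(f32, usize)> {
    docs.iter()
        .enumerate()
        .map(|(id, (components, values))| (sparse_dot(q_components, q_values, components, values), id))
        .max_by(|a, b| a.0.total_cmp(&b.0))
}
```

For the query with components [0, 1] and values [1.0, 2.0], the three toy documents score 1.0, 8.0, and 5.0 respectively, so the exhaustive scan returns (8.0, 1), matching the assertions on the index search. A scan like this is only practical on small collections, but it is a convenient way to produce ground truth when tuning the parameters listed above.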
For serialization/deserialization examples, see:
The search method signature:
pub fn search(
&self,
query: SparseVectorView<'_, ComponentFor<S>, f32>,
k: usize,
query_cut: usize,
heap_factor: f32,
n_knn: usize,
first_sorted: bool,
) -> Vec<ScoredVectorDotProduct>