Skip to content

Latest commit

 

History

History
181 lines (141 loc) · 7.45 KB

File metadata and controls

181 lines (141 loc) · 7.45 KB

Using the Rust Code

This guide explains how to use Seismic's Rust code independently (standalone) or integrate it into your own Rust project (via Cargo).

First, we clone the Seismic Git repository:

git clone git@github.com:TusKANNy/seismic.git
cd seismic

Then, we have to compile the project. After executing the following command, the binary executable will be found in ./target/release.

RUSTFLAGS="-C target-cpu=native" cargo build --release

Building the Index

Let's now build an index on the Splade embeddings for the MS MARCO v1 passage. To download the encoded vectors, please refer to Setting up for the Experiment.

Seismic has several parameters that control the space/time trade-offs when building the index:

  • -i, --input-file path to the source collection file (documents.bin).
  • -o, --output-file output index base path; the binary appends .index.seismic.
  • -n, --n-postings defines the posting list size, representing the average number of postings stored per list (default: 6000).
  • -s, --summary-energy controls the summary size, preserving a fraction of the overall energy for each summary (default: 0.5).
  • --centroid-fraction determines the number of centroids per posting list, capped at a fraction of the posting list length (default: 0.1).
  • --clustering-algorithm selects the algorithm to cluster postings within each posting list. Options: random-kmeans, random-kmeans-inverted-index, random-kmeans-inverted-index-approx.
  • --pruning-strategy selects the posting list pruning strategy. Options: global-threshold, coi-threshold, fixed-size.
  • -b, --block-size block size used for fixed-size blocking (default: 10).
  • --kmeans-doc-cut number of top components retained while clustering (default: 15).
  • --kmeans-pruning-factor pruning factor used by the random k-means blocking (default: 0.005).
  • --min-cluster-size minimum cluster size allowed (default: 2).
  • -a, --alpha fraction of L1 mass preserved by the COI pruning strategy (default: 0.15).
  • -m, --max-fraction maximum posting list length as a factor of n_postings (default: 1.5).
  • --knn number of neighbors stored per vector for the k-NN graph (default: 0, i.e., disabled).
  • --knn-path path to a precomputed nearest-neighbor file.
  • --component-type component type: u16 (default) or u32 for large vocabularies.
  • -v, --value-type value type: f16 (default), bf16, f32, fixedu8, fixedu16, or dotvbyte.

For Splade on MS MARCO, good choices are:

--n-postings 4000 --summary-energy 0.4 --centroid-fraction 0.1 --clustering-algorithm random-kmeans-inverted-index-approx

To create a Seismic index serialized in the file documents.bin.4000_0.4_0.1.index.seismic, run:

./target/release/build_inverted_index \
    -i ~/sparse_datasets/msmarco_v1_passage/cocondenser/data/documents.bin \
    -o ~/sparse_datasets/msmarco_v1_passage/cocondenser/indexes/documents.bin.4000_0.4_0.1 \
    --centroid-fraction 0.1 \
    --summary-energy 0.4 \
    --n-postings 4000 \
    --clustering-algorithm random-kmeans-inverted-index-approx

Executing Queries

To query the index we need to use the perf_inverted_index executable. Its parameters are:

  • -i, --index-file path to the serialized index file.
  • -q, --query-file query file containing ground-truth vectors.
  • -o, --output-path output file with the ranked results.
  • --query-cut limits the search algorithm to the top query_cut components of the query (default: 10).
  • --heap-factor skips blocks whose estimated dot product exceeds heap_factor times the smallest dot product in the top-k results (default: 0.7).
  • -k number of top-k results to retrieve (default: 10).
  • --n-knn number of neighbors of top results to rescore via the k-NN graph (default: 0).
  • -f, --first-sorted whether to sort the first posting list by estimated dot products before search (default: false).
  • --n-queries number of queries to evaluate (default: 10000).
  • -r, --n-runs number of runs to average (default: 1).
  • --query-energy optional query energy filter.
  • --component-type component type: u16 (default) or u32.
  • -v, --value-type value type: f16 (default), bf16, f32, fixedu8, or fixedu16.

Example command:

./target/release/perf_inverted_index \
    -i ~/sparse_datasets/msmarco_v1_passage/cocondenser/indexes/documents.bin.4000_0.4_0.1.index.seismic \
    -q ~/sparse_datasets/msmarco_v1_passage/cocondenser/data/queries.bin \
    -o results.tsv \
    --query-cut 5 \
    --heap-factor 0.7

The results are written to results.tsv. Each query produces k lines in the following format:

query_id\tdocument_id\tresult_rank\tdot_product

Where:

  • query_id: A progressive identifier for the query.
  • document_id: The document ID from the indexed dataset.
  • result_rank: The ranking of the result by dot product.
  • dot_product: The similarity score.

To incorporate the Seismic library into your Rust project, navigate to your project directory and run:

cargo add seismic

Creating a Toy Dataset

Let's create a toy dataset with vectors using f32 values. We'll then convert it to use half-precision floating points (half::f16) and check the dataset's properties.

use seismic::SparseDataset;
use half::f16;

let data = vec![
    (vec![0, 2, 4],    vec![1.0, 2.0, 3.0]),
    (vec![1, 3],       vec![4.0, 5.0]),
    (vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
];

let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();

assert_eq!(dataset.len(), 3);  // Number of vectors  
assert_eq!(dataset.dim(), 5);  // Number of components  
assert_eq!(dataset.nnz(), 9);  // Number of non-zero components  

To read a dataset in Seismic's internal binary format and quantize values to f16:

let dataset = SparseDataset::<f32>::read_bin_file(&input_filename).unwrap().quantize_f16();

Building and Querying an Index

Let's build an index using our toy dataset and search for a query.

use seismic::inverted_index::{Configuration, InvertedIndex};
use seismic::SparseDataset;
use half::f16;

let data = vec![
    (vec![0, 2, 4],    vec![1.0, 2.0, 3.0]),
    (vec![1, 3],       vec![4.0, 5.0]),
    (vec![0, 1, 2, 3], vec![1.0, 2.0, 3.0, 4.0])
];

let dataset: SparseDataset<f16> = data.into_iter().collect::<SparseDataset<f32>>().into();

let inverted_index = InvertedIndex::build(dataset, Configuration::default());

let result = inverted_index.search(&vec![0, 1], &vec![1.0, 2.0], 1, 5, 0.7);

assert_eq!(result[0].0, 8.0);
assert_eq!(result[0].1, 1);

Configuration Parameters

Key parameters to experiment with:

  • n_postings (PruningStrategy::GlobalThreshold) defines the posting list size.
  • summary_energy (SummarizationStrategy::EnergyPreserving) controls summary size.
  • centroid_fraction (BlockingStrategy::RandomKmeans) sets centroids per posting list.

For serialization/deserialization examples, see:

The search method signature:

pub fn search(
    &self,
    query: SparseVectorView<'_, ComponentFor<S>, f32>,
    k: usize,
    query_cut: usize,
    heap_factor: f32,
    n_knn: usize,
    first_sorted: bool,
) -> Vec<ScoredVectorDotProduct>