@alibeklfc
Contributor

Summary:
Introduction

This diff adds a new index, IndexRaBitQFastScan. It is based on the existing IndexRaBitQ but runs faster because it processes data vectors in batches of 32. It builds on the established IndexFastScan architecture for efficient batch processing and parallelism.
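
To make the batching idea concrete, here is a minimal scalar sketch of a fast-scan style lookup over 1-bit codes. It is not the code from this diff: the real IndexFastScan path uses SIMD and quantized tables, and the helper names (`build_lut`, `pack_nibbles`, `scan_block`, `direct_dot`) are hypothetical. It only shows that summing one 16-entry table lookup per group of 4 bits reproduces the bit-by-bit inner product for a block of 32 vectors.

```python
import random

BLOCK = 32  # vectors handled per batch, as stated in the description


def build_lut(query):
    """One 16-entry table per group of 4 dimensions.

    Entry c of table m holds the sum of the query components in group m
    whose bit is set in the 4-bit sub-code c.
    """
    assert len(query) % 4 == 0
    lut = []
    for m in range(len(query) // 4):
        chunk = query[4 * m: 4 * m + 4]
        lut.append([sum(chunk[j] for j in range(4) if (c >> j) & 1)
                    for c in range(16)])
    return lut


def pack_nibbles(bits):
    """Pack a 0/1 vector into 4-bit sub-codes (bit j of sub-code m is bits[4m+j])."""
    return [sum(bits[4 * m + j] << j for j in range(4))
            for m in range(len(bits) // 4)]


def scan_block(lut, block_codes):
    """Accumulate table entries for every vector of one block.

    The outer loop runs over tables so that each table is reused across
    the whole block, which is the access pattern SIMD fast scan exploits.
    """
    acc = [0.0] * len(block_codes)
    for m, table in enumerate(lut):
        for i, code in enumerate(block_codes):
            acc[i] += table[code[m]]
    return acc


def direct_dot(query, bits):
    """Reference: inner product of the query with a 0/1 vector."""
    return sum(q for q, b in zip(query, bits) if b)


rng = random.Random(0)
d = 16
query = [rng.uniform(-1, 1) for _ in range(d)]
vectors = [[rng.randint(0, 1) for _ in range(d)] for _ in range(BLOCK)]

scores = scan_block(build_lut(query), [pack_nibbles(v) for v in vectors])
assert all(abs(s - direct_dot(query, v)) < 1e-9
           for s, v in zip(scores, vectors))
```

The table-major loop order is the point of the sketch: each table is read once per block rather than once per vector.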

Implementation

  • New Source and Header Files: Added implementations for IndexRaBitQFastScan, following a similar interface to IndexRaBitQ.

  • Batched Processing: The search operation processes 32 data vectors per batch, taking advantage of low-level parallelism to improve throughput.

  • Specialized Post-processing Handler: A dedicated handler was added for IndexRaBitQFastScan to perform necessary post-processing during search because the LUT accumulates only partial distances. Unlike AQ Fast Scan's simple scalar post-processing, RaBitQ requires complex distance adjustments depending on both query and database vector factors.

  • LUT: IndexRaBitQFastScan produces slightly different results than IndexRaBitQ due to an extra quantization step in the IndexFastScan architecture. Specifically:

    • The LUT computes a float value as c1 * inner_product + c2 * popcount, which is then quantized. This quantization can cause the results to differ slightly from those of IndexRaBitQ.
    • It is possible to avoid this by storing only the inner_product in the LUT, but doing so would require calculating all data vector popcounts during search, introducing a tradeoff between speed and accuracy.
    • With the idea proposed in diff D80904214, the algorithm can be modified in the future to eliminate the popcount calculation step, potentially improving both efficiency and accuracy.
  • Query Offset Parameter: RaBitQ distance calculations depend on per-query factors. These are best computed in the compute_float_LUT method (the most efficient place, since rotated_qq is already calculated there) and then used by the handlers for the final distance calculations. However, the previous version of compute_quantized_LUT, which calls compute_float_LUT, did not know the query_offset, so query factors could not be stored at their global indices. To solve this, I added a query_offset parameter to both compute_quantized_LUT and compute_float_LUT. With this change, the computed query factors can be looked up by the correct global query index during distance calculations, avoiding expensive recalculation.
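
The effect described in the LUT item can be illustrated with a small sketch. It assumes a generic uniform 8-bit quantizer with a per-table minimum and one shared step. The actual quantization scheme in IndexFastScan may differ, and `c1`, `c2` and all function names here are placeholders. The sketch builds float table entries of the form c1 * inner_product + c2 * popcount, quantizes them, and checks that the reconstructed sum differs from the float sum by at most half a step per table.

```python
def float_lut(query, c1, c2):
    """Entry for sub-code c of group m: c1 * <q_chunk, bits(c)> + c2 * popcount(c)."""
    tables = []
    for m in range(len(query) // 4):
        chunk = query[4 * m: 4 * m + 4]
        table = []
        for c in range(16):
            ip = sum(chunk[j] for j in range(4) if (c >> j) & 1)
            table.append(c1 * ip + c2 * bin(c).count("1"))
        tables.append(table)
    return tables


def quantize_lut(tables, levels=255):
    """Uniform quantization (an assumption for illustration only).

    Each table is shifted by its own minimum; all tables share one step.
    Returns the integer tables, the step, and the summed bias.
    """
    lo = [min(t) for t in tables]
    span = max(max(t) - b for t, b in zip(tables, lo))
    step = span / levels if span > 0 else 1.0
    qtables = [[round((v - b) / step) for v in t] for t, b in zip(tables, lo)]
    return qtables, step, sum(lo)


def lookup(tables, code):
    """Sum one entry per table, selected by the vector's sub-codes."""
    return sum(t[c] for t, c in zip(tables, code))


query = [0.3, -1.2, 0.7, 0.05, -0.4, 0.9, -0.6, 1.1]
code = [0b1011, 0b0110]  # two 4-bit sub-codes of one data vector

tables = float_lut(query, c1=2.0, c2=-0.5)
qtables, step, bias = quantize_lut(tables)

exact = lookup(tables, code)
approx = lookup(qtables, code) * step + bias

# Rounding moves each entry by at most step / 2, so the summed error
# is bounded by (number of tables) * step / 2.
assert abs(exact - approx) <= len(tables) * step / 2 + 1e-12
```

The bound grows with the number of tables, which is consistent with the description's remark that results differ only slightly from IndexRaBitQ.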

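The query-offset item can be sketched as follows, assuming queries are processed in batches and the handler later looks up factors by global query index. The `QueryFactors` class, its single field, and both function signatures are hypothetical stand-ins, not the signatures added in this diff.

```python
from dataclasses import dataclass


@dataclass
class QueryFactors:
    # Hypothetical stand-in for whatever per-query values the handler needs.
    sq_norm: float


def compute_float_lut(batch, query_offset, factors_out):
    """Build tables for one batch and store each query's factors at its
    global index, i.e. query_offset + local index within the batch."""
    luts = []
    for i, q in enumerate(batch):
        factors_out[query_offset + i] = QueryFactors(
            sq_norm=sum(x * x for x in q))
        luts.append(list(q))  # placeholder for the real table construction
    return luts


def search(queries, batch_size):
    """Process queries batch by batch, passing the batch start as the offset."""
    factors = [None] * len(queries)
    for start in range(0, len(queries), batch_size):
        compute_float_lut(queries[start:start + batch_size], start, factors)
    return factors


queries = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
factors = search(queries, batch_size=2)
assert [f.sq_norm for f in factors] == [1.0, 4.0, 25.0]
```

Without the offset, the second batch would write its first query's factors to index 0 and overwrite those of the first batch.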
Testing

  • Conducted comprehensive tests in the test_rabitq suite covering accuracy comparisons with IndexRaBitQ for L2 and Inner Product metrics, encoding/decoding consistency, query quantization bit settings, small dataset functionality, performance against PQFastScan, serialization, memory management, error handling, and thread safety.
  • All tests passed successfully, validating the correctness and robustness of IndexRaBitQFastScan.

Results
[Figure: results_rabitq]

  • Performance Dependency: Performance measurements confirm that IndexRaBitQFastScan is notably faster than IndexRaBitQ when the qb value is high. While the original IndexRaBitQ experiences increased runtime with higher qb values, the fast scan variant maintains consistent runtime regardless of qb.
  • Parallelized Training Loop: The training loop is parallelized, greatly reducing training time. This parallelism should also be added to the original IndexRaBitQ.
  • Consistency Across Metrics: The performance advantages of IndexRaBitQFastScan hold true for both L2 and Inner Product metrics, demonstrating robustness across different distance measures.
  • One of the next steps is to benchmark IndexRaBitQFastScan against other algorithms to evaluate its performance in a broader context.

Differential Revision: D81787307

@meta-cla meta-cla bot added the CLA Signed label Sep 30, 2025
@facebook-github-bot
Contributor

@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating diff in D81787307.

alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Sep 30, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Sep 30, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Sep 30, 2025
alibeklfc added a commit that referenced this pull request Sep 30, 2025
Summary:
Pull Request resolved: #4595

alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Sep 30, 2025
@facebook-github-bot

@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D81787307.

facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2025
alibeklfc pushed a commit that referenced this pull request Sep 30, 2025
Summary:
Pull Request resolved: #4595

facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 1, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 1, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 9, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 9, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 9, 2025
@alibeklfc
Contributor Author

@alexanderguzhva thank you for your feedback, I have addressed your points.

@alexanderguzhva
Contributor

@alibeklfc added some more
I'll review your changes as well

facebook-github-bot pushed a commit that referenced this pull request Oct 10, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 10, 2025
alibeklfc pushed a commit that referenced this pull request Oct 10, 2025
Summary:
Pull Request resolved: #4595

alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 10, 2025
float* lut,
idx_t n,
const float* x,
const FastScanDistancePostProcessing& context = {}) const override;
Contributor

You should not specify default values in overrides.
This note applies to similar cases in this PR as well.
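
For context, a minimal sketch of why default values on overrides are risky in C++: the default argument is chosen from the static type of the expression, while the function body is chosen from the dynamic type of the object. The classes and function names below (`Base`, `Derived`, `scale`, `call_through_base`, `call_through_derived`) are hypothetical and are not taken from this PR.

```cpp
#include <cassert>

// Hypothetical classes for illustration only; not part of the PR.
struct Base {
    virtual ~Base() = default;
    // The default value is declared on the base declaration.
    virtual int scale(int x, int factor = 2) const {
        return x * factor;
    }
};

struct Derived : Base {
    // Redeclaring a different default compiles, but which default is used
    // depends on the static type of the caller's expression.
    int scale(int x, int factor = 10) const override {
        return x + factor;
    }
};

// Static type is Base: the default factor (2) comes from Base,
// while the body that runs is selected by the dynamic type.
inline int call_through_base(const Base& b) {
    return b.scale(1);
}

// Static type is Derived: the default factor (10) comes from Derived.
inline int call_through_derived(const Derived& d) {
    return d.scale(1);
}
```

With a `Derived` object, the same call `scale(1)` yields different results depending on whether it is made through a `Base` or a `Derived` reference, which is why keeping the default only on the base declaration avoids surprises.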

Contributor Author

Resolved


void IndexRaBitQFastScan::compute_codes(uint8_t* codes, idx_t n, const float* x)
const {
FAISS_ASSERT(codes != nullptr);
Contributor

No.

Basically, the design of the compute_codes() function is to produce, for every given input vector, a code that can be reconstructed back into the corresponding input vector (to the degree to which that is possible, of course). I understand the situation in which compute_codes() refers to some internal table or dictionary that is produced during train() and does not depend on the input data itself. But this particular implementation does not store the RaBitQ factors for every code, which is an absolute must; the RaBitQ factors are not a table, they depend on the dataset. As a result, this implementation violates the design.
The implementation of sa_decode() down below implies that it is designed to work correctly only if you pass the dataset that was used for train().

@mdouze do you have any comments?

Contributor Author

@alibeklfc alibeklfc Oct 13, 2025

Hi. I have resolved this issue.
The codes produced by compute_codes() now store both the bit patterns and the RaBitQ factors together. In sa_decode(), we extract both components from the embedded codes to reconstruct the original vector. I have added a stride parameter to the packing methods, so SIMD operations process only the bit patterns and skip the embedded factors. The codes variable is not stored after packing, so in the add() method we populate factors_storage from the codes so that the handler has access to the factors.
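
The layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Factors` struct, its two fields, the bit order, and the helper names (`bits_size`, `code_size`, `encode_one`, `read_factors`, `gather_bits`) are hypothetical, and the actual code format in the PR may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-vector factors; the real struct in the PR may differ.
struct Factors {
    float a;
    float b;
};

// Bytes used by the sign-bit pattern of a d-dimensional vector.
inline size_t bits_size(size_t d) {
    return (d + 7) / 8;
}

// Full code size: bit pattern followed by the embedded factors.
inline size_t code_size(size_t d) {
    return bits_size(d) + sizeof(Factors);
}

// Write one code: sign bits first, then the factors right after them.
inline void encode_one(
        const float* x,
        size_t d,
        const Factors& f,
        uint8_t* code) {
    std::memset(code, 0, bits_size(d));
    for (size_t i = 0; i < d; i++) {
        if (x[i] > 0) {
            code[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
        }
    }
    std::memcpy(code + bits_size(d), &f, sizeof(Factors));
}

// Read the embedded factors back from a single code.
inline Factors read_factors(const uint8_t* code, size_t d) {
    Factors f;
    std::memcpy(&f, code + bits_size(d), sizeof(Factors));
    return f;
}

// Gather only the bit patterns of n codes. The stride is the distance in
// bytes between consecutive codes, so the embedded factors are skipped.
inline std::vector<uint8_t> gather_bits(
        const uint8_t* codes,
        size_t n,
        size_t d,
        size_t stride) {
    std::vector<uint8_t> out(n * bits_size(d));
    for (size_t i = 0; i < n; i++) {
        std::memcpy(
                out.data() + i * bits_size(d),
                codes + i * stride,
                bits_size(d));
    }
    return out;
}
```

Passing `code_size(d)` as the stride lets a packing routine walk over the full codes while touching only the leading bit-pattern bytes; the factors can then be copied out separately with `read_factors`.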

size_t b,
simd16uint16 d0,
simd16uint16 d1) {
ALIGNED(32) uint16_t d32tab[32];
Contributor

@alexanderguzhva alexanderguzhva Oct 10, 2025

the approach for this function looks good


/// Query factors data array for RaBitQ (nullptr if not needed)
/// Memory management is handled by the caller
rabitq_utils::QueryFactorsData* query_factors = nullptr;
Contributor

@alexanderguzhva alexanderguzhva Oct 10, 2025

this leads back to the problem that the baseline class IndexFastScan becomes aware of rabitq. This could be solved by OOP inheritance. Just create a derived class, something like RaBitQFastScanDistancePostProcessing and override IndexFastScan functions that instantiate FastScanDistancePostProcessing.
Alternatively, please put rabitq_utils::QueryFactorsData class into a separate small independent file that does not carry complex dependencies, similar to NormTableScaler.
I leave it up to you.

@mdouze any comments?
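
A rough sketch of the inheritance option mentioned above, assuming the base index exposes a virtual factory for its post-processing context. All names here (`PostProcessing`, `RaBitQPostProcessing`, `QueryFactors`, `FastScanBase`, `RaBitQFastScan`, `make_context`) are hypothetical stand-ins, not the Faiss API, and the `scale * partial + bias` formula is a placeholder rather than the RaBitQ distance correction.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for a generic post-processing context owned by the base index.
struct PostProcessing {
    virtual ~PostProcessing() = default;
    // By default the accumulated partial distance is returned unchanged.
    virtual float apply(size_t /*query*/, float partial) const {
        return partial;
    }
};

// Hypothetical per-query factors; placeholder fields only.
struct QueryFactors {
    float scale;
    float bias;
};

// RaBitQ-specific context: only this class knows about query factors.
struct RaBitQPostProcessing : PostProcessing {
    std::vector<QueryFactors> query_factors;

    float apply(size_t query, float partial) const override {
        const QueryFactors& f = query_factors[query];
        return f.scale * partial + f.bias; // placeholder adjustment
    }
};

// Stand-in for the base index: it only sees the generic interface.
struct FastScanBase {
    virtual ~FastScanBase() = default;
    virtual std::unique_ptr<PostProcessing> make_context(size_t /*nq*/) const {
        return std::make_unique<PostProcessing>();
    }
};

// The derived index overrides the factory to supply its own context.
struct RaBitQFastScan : FastScanBase {
    std::unique_ptr<PostProcessing> make_context(size_t nq) const override {
        auto ctx = std::make_unique<RaBitQPostProcessing>();
        ctx->query_factors.assign(nq, QueryFactors{2.0f, 1.0f});
        return ctx;
    }
};
```

In this arrangement the base class header never mentions the RaBitQ types; the trade-off is an extra virtual call and a heap allocation per search, which may or may not matter here.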

alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 10, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 10, 2025
alibeklfc added a commit to alibeklfc/faiss that referenced this pull request Oct 13, 2025

// Hoist loop-invariant computations
const float* centroid_data = center.data();
const size_t bit_pattern_size = (d + 3) / 4;

Contributor

I'm not sure I follow why it is (d + 3) / 4. It is 1 bit per dimension, so it should be (d + 7) / 8, no? Otherwise, please add a comment why it is (d + 3) / 4, maybe I'm misinterpreting something

Contributor Author

You are correct, I have changed it to (d + 7) / 8

Contributor

The next question: why were this problem and similar ones not caught by unit tests? What if other (d + x) / y expressions are also incorrect?

Contributor Author

Using (d + 3) / 4 allocated more space than needed, which is why the tests passed: recall was unaffected, and the only effect was over-allocation.

`bit_pattern_size` is the size in bytes of the bit-pattern portion of each code. Since we store 1 bit per dimension, (d + 7) / 8 is the correct formula.

I have verified this experimentally: using values where x > 7 or y > 8 in (d + x) / y causes test failures, which supports (d + 7) / 8 as the correct formula.
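
The arithmetic in this exchange can be sketched as below. This is only an illustration: `ceil_div`, `bit_pattern_bytes`, and `overallocated_bytes` are hypothetical helper names, not Faiss identifiers, and the sketch assumes exactly one bit per dimension as stated in the comment above. It shows why (d + 3) / 4 still passed the tests: it never under-allocates, it just reserves roughly twice the bytes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Faiss): ceiling division for sizes.
constexpr std::size_t ceil_div(std::size_t n, std::size_t k) {
    return (n + k - 1) / k;
}

// Bytes needed to store one bit per dimension: ceil(d / 8).
constexpr std::size_t bit_pattern_bytes(std::size_t d) {
    return ceil_div(d, 8); // identical to (d + 7) / 8
}

// The formula originally under review: ceil(d / 4).
constexpr std::size_t overallocated_bytes(std::size_t d) {
    return ceil_div(d, 4); // identical to (d + 3) / 4
}

// Compile-time checks of the worked values.
static_assert(bit_pattern_bytes(128) == 16, "128 bits fit in 16 bytes");
static_assert(bit_pattern_bytes(100) == 13, "100 bits need 13 bytes");
static_assert(overallocated_bytes(128) == 32, "twice the required bytes");
static_assert(
        overallocated_bytes(100) >= bit_pattern_bytes(100),
        "over-allocation is wasteful but never too small");
```

Because the larger value is always sufficient, a recall-based test cannot distinguish the two formulas; only a check on the allocated size would.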

FAISS_ASSERT(x != nullptr);

const float inv_d_sqrt = (d == 0) ? 1.0f : (1.0f / std::sqrt((float)d));
const size_t bit_pattern_size = (d + 3) / 4;

Contributor

same. why (d + 3) / 4?

Contributor Author

You are correct, I have changed it to (d + 7) / 8

READ1(idxqfs->code_size);

// Need to initialize the FastScan base class fields
const size_t M_fastscan = (idxqfs->d + 3) / 4;

Contributor

same, why (d + 3) / 4?

Contributor Author

Here, (d + 3) / 4 is correct because it is the number of FastScan sub-quantizers. RaBitQ packs 4 dimensions (4 bits) into each sub-quantizer, so we need (d + 3) / 4 sub-quantizers.

Contributor

not (d + 1) / 2, correct?

Contributor Author

Yes, (d + 3) / 4 is correct: it is the number of FastScan sub-quantizers, and each sub-quantizer covers 4 dimensions (4 bits).
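
A minimal sketch of the grouping described here, assuming one bit per dimension and four dimensions per sub-quantizer. The names `num_subquantizers` and `to_nibbles` are hypothetical, and the bit order inside each 4-bit code is an arbitrary choice for illustration, not necessarily the layout Faiss uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of 4-bit sub-quantizers for d dimensions.
constexpr std::size_t num_subquantizers(std::size_t d) {
    return (d + 3) / 4; // ceil(d / 4)
}

// Group one-bit-per-dimension values into 4-bit codes, one per
// sub-quantizer. Each code lies in [0, 16), so it can index a
// 16-entry lookup table. Bit order within a code is an assumption.
std::vector<std::uint8_t> to_nibbles(const std::vector<bool>& bits) {
    std::vector<std::uint8_t> codes(num_subquantizers(bits.size()), 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) {
            codes[i / 4] |= static_cast<std::uint8_t>(1u << (i % 4));
        }
    }
    return codes;
}

static_assert(num_subquantizers(128) == 32, "128 dims -> 32 sub-quantizers");
static_assert(num_subquantizers(6) == 2, "last sub-quantizer is padded");
```

Under these assumptions, (d + 1) / 2 would correspond to two dimensions per sub-quantizer, which does not match the 4-bit grouping described above.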


meta-codesync bot commented Oct 14, 2025

This pull request has been merged in 01d394e.

AlSchlo pushed a commit to AlSchlo/faiss-panorama that referenced this pull request Oct 20, 2025
Summary:
Pull Request resolved: facebookresearch#4595

Reviewed By: mdouze

Differential Revision: D81787307

fbshipit-source-id: d30827b990116254a88e80df0c08a2f5e086b64c