Conversation

@marianotepper
Collaborator

@marianotepper marianotepper commented Nov 3, 2025

This PR does extensive work to bring back the Fused Graph Index (FGI). In a non-fused graph, the PQ codebook of each vector in the index is stored in memory, so the memory complexity is linear in the number of vectors. FGI significantly reduces the amount of heap memory used during search by offloading the PQ codebooks to storage. These PQ codebooks are packed and stored inline with the graph, to avoid the runtime overheads that this offload would otherwise incur.

The memory complexity has two cases now:

  • When using a non-hierarchical graph, the fused graph reduces the linear memory complexity to a small constant (the number of vectors in the graph does not change this constant).
  • When using a hierarchical graph, the upper layers of the hierarchy are kept in memory and the bottom layer is in storage. The PQ codebooks of the vectors in the upper layers are kept in memory, while the bottom layer behaves exactly like a non-hierarchical graph. Since the upper graph layers are sampled using a logarithmic distribution, we end up with logarithmic memory complexity.

These savings come with a moderate slowdown of about 15% (reduced throughput and increased latency). See the results below for an example.

In this version (and in past versions), FGI only works with PQ, through the FUSED_PQ feature. This feature used to be called FUSED_ADC; it has been renamed to highlight the link with PQ.

The routine for expanding a node (gathering its out-neighbors and computing their similarities to the query) has been pushed down to the GraphIndex views. This enables slightly different algorithms depending on the graph layout, which can be somewhat more efficient than a single implementation abstracted away in the GraphSearcher.
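To illustrate the idea, here is a toy, self-contained sketch (not the PR's actual code; all class and method names here are hypothetical) of what moving expansion into the view looks like: the view owns a default expand routine, and a layout-specific view is free to override it, e.g. to decode PQ codes stored inline with the adjacency list in a single pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Hypothetical mini-model: previously the generic searcher fetched neighbors
// and scored them itself; here each view owns its own expansion routine.
interface GraphView {
    int[] neighborsOf(int node);

    // Default expansion: gather out-neighbors, then score each one.
    default List<double[]> expand(int node, IntToDoubleFunction score) {
        List<double[]> out = new ArrayList<>();
        for (int nbr : neighborsOf(node)) {
            out.add(new double[] { nbr, score.applyAsDouble(nbr) });
        }
        return out;
    }
}

// A layout-aware view could override expand() to fuse neighbor gathering
// and similarity computation for its particular on-disk layout.
class ArrayBackedView implements GraphView {
    private final int[][] adjacency;
    ArrayBackedView(int[][] adjacency) { this.adjacency = adjacency; }
    @Override public int[] neighborsOf(int node) { return adjacency[node]; }
}

public class ExpandDemo {
    public static void main(String[] args) {
        GraphView view = new ArrayBackedView(new int[][] {{1, 2}, {0}, {0}});
        // Toy similarity: smaller node ids score higher.
        List<double[]> results = view.expand(0, nbr -> 1.0 / (1 + nbr));
        for (double[] r : results) {
            System.out.printf("node %d score %.3f%n", (int) r[0], r[1]);
        }
    }
}
```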

This PR refactors the use of SIMD instructions by FUSED_PQ:

  • The old algorithm used a transposed layout similar to Quick(er)-ADC. However, that design applied the SIMD parallelization not within each codebook, but across different codebooks (hence the transpose). This parallelization was virtually impossible to combine with skipping the computation of previously computed similarities, and it implied additional computational overhead.
  • An analysis of the number of skipped similarity computations yielded about 50% skips. Thus, the new algorithm simplifies the layout by storing the vectors in a non-transposed fashion, with no across-codebook parallelization. This makes it compatible with the visited checks. Additionally, it is more efficient than the non-fused approach because we have improved the locality of the PQ codebooks gathered when expanding a given graph node (in the non-fused graph this required random accesses to multiple rows of a very tall and skinny matrix, which exhibits poor locality).
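For readers unfamiliar with ADC, a minimal sketch (my own illustration, not the PR's implementation) of the non-transposed scheme described above: per query, one lookup table is built per PQ subspace, and each candidate's approximate distance is a sum of table entries indexed by its codes. Because a visited candidate is simply skipped before the loop, no cross-codebook SIMD transpose is needed.

```java
// Sketch of non-transposed ADC (asymmetric distance computation).
public class AdcSketch {
    // table[m][c] = partial squared distance between the query's m-th
    // sub-vector and centroid c of codebook m (precomputed once per query).
    static float distance(float[][] table, byte[] codes) {
        float sum = 0;
        for (int m = 0; m < codes.length; m++) {
            sum += table[m][codes[m] & 0xFF];  // unsigned code lookup
        }
        return sum;
    }

    public static void main(String[] args) {
        float[][] table = { {0f, 1f}, {2f, 3f} };   // 2 subspaces, 2 centroids
        byte[] codes = { 1, 0 };                    // one encoded vector
        System.out.println(distance(table, codes)); // 1.0 + 2.0 = 3.0
    }
}
```

The inner loop over `m` is what the auto-vectorizer (or the Panama Vector API) can parallelize within a single candidate, which is compatible with skipping candidates entirely.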

These SIMD changes have opened the possibility of deprecating the native vector util backend. We are not effecting this deprecation in this PR because there might be other considerations for keeping it around.

Edits:

  • To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices.

Experimental results:

Dataset: ada002-100k
Configuration:
M : 32
usePruning : true
neighborOverflow : 1.2
addHierarchy : true
efConstruction : 100

Results with topK=10

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         16118.8           285.2        1.8         0.456                0.049               0.770                307.4          12.7                       0.67        
2.00         14100.7           19.5         0.1         0.507                0.064               0.802                421.6          22.9                       0.85        
5.00         11307.2           260.8        2.3         0.640                0.111               1.093                702.5          52.9                       0.94        
10.00        8335.0            88.8         1.1         0.849                0.184               1.587                1108.2         102.1                      0.97        

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         13768.9           972.6        7.1         0.533                0.045               0.909                307.4          12.7                       0.67        
2.00         12384.5           582.9        4.7         0.586                0.057               0.966                421.6          22.9                       0.85        
5.00         9609.4            151.8        1.6         0.735                0.098               1.254                702.5          52.9                       0.94        
10.00        7135.4            162.5        2.3         0.955                0.163               1.603                1108.2         102.1                      0.97        

With the fused graph, queries per second (QPS) drop by less than 15% (14% on average) and latency increases by less than 17% (15% on average).
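As a sanity check on the quoted 14% average, the per-row QPS slowdowns can be computed directly from the two topK=10 tables above:

```java
// Back-of-envelope check of the average QPS slowdown for topK=10,
// using the Avg QPS columns of the non-fused and fused tables.
public class SlowdownCheck {
    static double averageSlowdown(double[] baseline, double[] variant) {
        double total = 0;
        for (int i = 0; i < baseline.length; i++) {
            total += (baseline[i] - variant[i]) / baseline[i];
        }
        return total / baseline.length;
    }

    public static void main(String[] args) {
        double[] nonFused = {16118.8, 14100.7, 11307.2, 8335.0};
        double[] fused    = {13768.9, 12384.5,  9609.4, 7135.4};
        System.out.printf("average QPS slowdown: %.1f%%%n",
                          100 * averageSlowdown(nonFused, fused)); // 14.0%
    }
}
```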

Results with topK=100

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         8314.4            307.9        3.7         0.862                0.187               1.570                1108.2         102.1                      0.78         
2.00         5415.1            17.0         0.3         1.294                0.302               2.293                1871.1         198.0                      0.93         

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         6744.6            303.1        4.5         0.975                0.172               1.800                1108.2         102.1                      0.78         
2.00         4550.6            113.0        2.5         1.401                0.263               2.445                1871.1         198.0                      0.93         

With the fused graph, QPS drops by 19% and 16% (overquery = 1 and 2, respectively) and latency increases by 13% and 8% (overquery = 1 and 2, respectively).

Experimental results on larger datasets

In the plots below, QPS, latency, and recall are stable (there is run-to-run variability that is intrinsic to the benchmark). Index construction time increased somewhat due to the process of fusing the graph on disk, which involves multiple random memory accesses for each node and writes more data to disk.

(Plots: Index_Construction_Time, Mean_Latency, QPS, Recall_at_10)

… format version to 6 because of new ordering of fused features.
… FusedADC to FusedPQ for clarity. Improve function signature of OnDiskGraphIndex.View.getPackedNeighbors
… additional copy of neighbors array between OnDiskGraphIndex.View and FusedADCPQDecoder.
# Conflicts:
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/ImmutableGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/OnHeapGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/AbstractGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskSequentialGraphIndexWriter.java
#	jvector-examples/src/main/java/io/github/jbellis/jvector/example/Grid.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/TestUtil.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/quantization/TestADCGraphIndex.java
@marianotepper marianotepper self-assigned this Nov 3, 2025
@marianotepper marianotepper marked this pull request as ready for review November 5, 2025 13:38
@marianotepper marianotepper requested a review from jkni November 5, 2025 14:02
Member

@michaeljmarshall michaeljmarshall left a comment

I am posting a partial review with some relatively minor suggestions. I'll revisit later today or tomorrow.

Member

@michaeljmarshall michaeljmarshall left a comment

Approving as a maintainer of a consuming application. I didn't analyze every line of the PR, but I did perform downstream tests using fused adc in CC and things appear to work as expected.

Contributor

@jshook jshook left a comment

@marianotepper It would be good to see the level of coverage of new/changed code here. Many of the changes are absolutely dependent on numerical and functional unit tests. It's non-trivial to see this overlap here.

Collaborator

@tlwillke tlwillke left a comment

LGTM. I added one comment about a possible omission of NVQ as a second-pass option in README.md. It would also be great to see some empirical data on the memory savings; the other performance data is thorough enough.

Thanks for this major contribution!

@marianotepper
Collaborator Author

marianotepper commented Nov 19, 2025

@tlwillke I did some measurements for ada002-100k, using 192 PQ segments. The dataset contains 99562 vectors.

According to the ramBytesUsed estimate in PQVectors, they should take 19.74 MB. Roughly speaking, this matches the raw byte count: 192 * 99562 / (1024 * 1024) = 18.23 MB.
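That back-of-envelope figure assumes one byte per PQ segment per vector; the gap to 19.74 MB would then come from whatever object overhead ramBytesUsed accounts for. A trivial check of the arithmetic:

```java
// Raw size of the PQ codes: one byte per segment per vector.
public class PqMemoryEstimate {
    static double megabytes(long segments, long vectors) {
        return segments * vectors / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        System.out.printf("raw PQ codes: %.2f MB%n",
                          megabytes(192, 99_562)); // 18.23 MB
    }
}
```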

I used:

System.gc();
double usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.format("Used Memory: [%.2f MB]%n", usedMemory / (1024 * 1024));

to measure the memory used before and after loading the PQVectors in memory, which the fused graph avoids.

According to this method, the memory used before loading the PQVectors is 672.02 MB, and 697.73 MB after. Thus, the PQVectors occupy 25.71 MB. This is larger than 19.74 MB, so either the estimate is a bit optimistic or the garbage collector did not actually collect everything.

When running the fused graph index, the memory consumption pre and post loading is flat, as there is no loading.

Happy to incorporate these changes in Grid. They have performance upsides and downsides in the specific setting of Grid, where we run a matrix of configurations, so the efficiency trade-off may differ from running a single configuration.
