Conversation

@marianotepper
Collaborator

@marianotepper marianotepper commented Nov 3, 2025

This PR does extensive work to bring back the Fused Graph Index (FGI). In a non-fused graph, the PQ codebook of each vector in the index is stored in memory, so the memory complexity is linear in the number of vectors. FGI significantly reduces the amount of heap memory used during search by offloading the PQ codebooks to storage. These PQ codebooks are packed and stored inline with the graph, to avoid the runtime overheads that this offload would otherwise incur.

The memory complexity has two cases now:

  • When using a non-hierarchical graph, the fused graph reduces the linear memory complexity to a small constant (the number of vectors in the graph does not change this constant).
  • When using a hierarchical graph, the upper layers of the hierarchy are kept in memory and the bottom layer is in storage. The PQ codebooks of the vectors in the upper layers are kept in memory, while the bottom layer behaves exactly like a non-hierarchical graph. Since the upper graph layers are sampled using a logarithmic distribution, we end up with logarithmic memory complexity.

These savings come with a moderate slowdown of about 15% (reduced throughput and increased latency). See the results below for an example.

In this version (and in past versions), FGI only works with PQ, through the FUSED_PQ feature. This feature used to be called FUSED_ADC; it has been renamed to highlight the link with PQ.

The routine for expanding a node (gathering its out-neighbors and computing their similarities to the query) has been pushed down to the GraphIndex views. This enables slightly different algorithms depending on the graph layout, which can be somewhat more efficient than a single implementation abstracted away in the GraphSearcher.
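To illustrate the idea, here is a toy, self-contained sketch (not the PR's actual code; all class and method names here are hypothetical) of what moving expansion into the view looks like: the view owns a default expand routine, and a layout-specific view is free to override it, e.g. to decode PQ codes stored inline with the adjacency list in a single pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Hypothetical mini-model: previously the generic searcher fetched neighbors
// and scored them itself; here each view owns its own expansion routine.
interface GraphView {
    int[] neighborsOf(int node);

    // Default expansion: gather out-neighbors, then score each one.
    default List<double[]> expand(int node, IntToDoubleFunction score) {
        List<double[]> out = new ArrayList<>();
        for (int nbr : neighborsOf(node)) {
            out.add(new double[] { nbr, score.applyAsDouble(nbr) });
        }
        return out;
    }
}

// A layout-aware view could override expand() to fuse neighbor gathering
// and similarity computation for its particular on-disk layout.
class ArrayBackedView implements GraphView {
    private final int[][] adjacency;
    ArrayBackedView(int[][] adjacency) { this.adjacency = adjacency; }
    @Override public int[] neighborsOf(int node) { return adjacency[node]; }
}

public class ExpandDemo {
    public static void main(String[] args) {
        GraphView view = new ArrayBackedView(new int[][] {{1, 2}, {0}, {0}});
        // Toy similarity: smaller node ids score higher.
        List<double[]> results = view.expand(0, nbr -> 1.0 / (1 + nbr));
        for (double[] r : results) {
            System.out.printf("node %d score %.3f%n", (int) r[0], r[1]);
        }
    }
}
```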

This PR refactors the use of SIMD instructions by FUSED_PQ:

  • The old algorithm used a transposed layout similar to Quick(er)-ADC. However, that design applied the SIMD parallelization not within each codebook, but across different codebooks (hence the transpose). This parallelization was virtually impossible to combine with skipping the computation of previously computed similarities, and it implied additional computational overhead.
  • An analysis of the number of skipped similarity computations yielded about 50% skips. Thus, the new algorithm simplifies the layout by storing the vectors in a non-transposed fashion, with no across-codebook parallelization. This makes it compatible with the visited checks. Additionally, it is more efficient than the non-fused approach because we have improved the locality of the PQ codebooks gathered when expanding a given graph node (in the non-fused graph this required random accesses to multiple rows of a very tall and skinny matrix, which exhibits poor locality).
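For readers unfamiliar with ADC, a minimal sketch (my own illustration, not the PR's implementation) of the non-transposed scheme described above: per query, one lookup table is built per PQ subspace, and each candidate's approximate distance is a sum of table entries indexed by its codes. Because a visited candidate is simply skipped before the loop, no cross-codebook SIMD transpose is needed.

```java
// Sketch of non-transposed ADC (asymmetric distance computation).
public class AdcSketch {
    // table[m][c] = partial squared distance between the query's m-th
    // sub-vector and centroid c of codebook m (precomputed once per query).
    static float distance(float[][] table, byte[] codes) {
        float sum = 0;
        for (int m = 0; m < codes.length; m++) {
            sum += table[m][codes[m] & 0xFF];  // unsigned code lookup
        }
        return sum;
    }

    public static void main(String[] args) {
        float[][] table = { {0f, 1f}, {2f, 3f} };   // 2 subspaces, 2 centroids
        byte[] codes = { 1, 0 };                    // one encoded vector
        System.out.println(distance(table, codes)); // 1.0 + 2.0 = 3.0
    }
}
```

The inner loop over `m` is what the auto-vectorizer (or the Panama Vector API) can parallelize within a single candidate, which is compatible with skipping candidates entirely.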

These SIMD changes have opened the possibility of deprecating the native vector util backend. We are not effecting this deprecation in this PR because there might be other considerations for keeping it around.

Edits:

  • To enable the FUSED_PQ feature, we introduced the new version 6 file format for our graph indices.

Experimental results:

Dataset: ada002-100k
Configuration:
M : 32
usePruning : true
neighborOverflow : 1.2
addHierarchy : true
efConstruction : 100

Results with topK=10

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         16118.8           285.2        1.8         0.456                0.049               0.770                307.4          12.7                       0.67        
2.00         14100.7           19.5         0.1         0.507                0.064               0.802                421.6          22.9                       0.85        
5.00         11307.2           260.8        2.3         0.640                0.111               1.093                702.5          52.9                       0.94        
10.00        8335.0            88.8         1.1         0.849                0.184               1.587                1108.2         102.1                      0.97        

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@10   
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         13768.9           972.6        7.1         0.533                0.045               0.909                307.4          12.7                       0.67        
2.00         12384.5           582.9        4.7         0.586                0.057               0.966                421.6          22.9                       0.85        
5.00         9609.4            151.8        1.6         0.735                0.098               1.254                702.5          52.9                       0.94        
10.00        7135.4            162.5        2.3         0.955                0.163               1.603                1108.2         102.1                      0.97        

With the fused graph, queries per second (QPS) drop by less than 15% (14% on average) and latency increases by less than 17% (15% on average).
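As a sanity check on the quoted 14% average, the per-row QPS slowdowns can be computed directly from the two topK=10 tables above:

```java
// Back-of-envelope check of the average QPS slowdown for topK=10,
// using the Avg QPS columns of the non-fused and fused tables.
public class SlowdownCheck {
    static double averageSlowdown(double[] baseline, double[] variant) {
        double total = 0;
        for (int i = 0; i < baseline.length; i++) {
            total += (baseline[i] - variant[i]) / baseline[i];
        }
        return total / baseline.length;
    }

    public static void main(String[] args) {
        double[] nonFused = {16118.8, 14100.7, 11307.2, 8335.0};
        double[] fused    = {13768.9, 12384.5,  9609.4, 7135.4};
        System.out.printf("average QPS slowdown: %.1f%%%n",
                          100 * averageSlowdown(nonFused, fused)); // 14.0%
    }
}
```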

Results with topK=100

With a non-fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         8314.4            307.9        3.7         0.862                0.187               1.570                1108.2         102.1                      0.78         
2.00         5415.1            17.0         0.3         1.294                0.302               2.293                1871.1         198.0                      0.93         

With a fused graph:

Overquery    Avg QPS (of 3)    ± Std Dev    CV %        Mean Latency (ms)    STD Latency (ms)    p999 Latency (ms)    Avg Visited    Avg Expanded Base Layer    Recall@100   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.00         6744.6            303.1        4.5         0.975                0.172               1.800                1108.2         102.1                      0.78         
2.00         4550.6            113.0        2.5         1.401                0.263               2.445                1871.1         198.0                      0.93         

With the fused graph, QPS drops by 19% and 16% (overquery = 1 and 2, respectively) and latency increases by 13% and 8% (overquery = 1 and 2, respectively).

Experimental results on larger datasets

In the plots below, QPS, latency, and recall are stable (there is run-to-run variability that is intrinsic to the benchmark). Index construction time increased somewhat due to the process of fusing the graph on disk, which involves multiple random memory accesses for each node and writes more data to disk.

(Plots: Index_Construction_Time, Mean_Latency, QPS, Recall_at_10)

… format version to 6 because of new ordering of fused features.
… FusedADC to FusedPQ for clarity. Improve function signature of OnDiskGraphIndex.View.getPackedNeighbors
… additional copy of neighbors array between OnDiskGraphIndex.View and FusedADCPQDecoder.
# Conflicts:
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/ImmutableGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/OnHeapGraphIndex.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/AbstractGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskGraphIndexWriter.java
#	jvector-base/src/main/java/io/github/jbellis/jvector/graph/disk/OnDiskSequentialGraphIndexWriter.java
#	jvector-examples/src/main/java/io/github/jbellis/jvector/example/Grid.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/TestUtil.java
#	jvector-tests/src/test/java/io/github/jbellis/jvector/quantization/TestADCGraphIndex.java
@marianotepper marianotepper self-assigned this Nov 3, 2025
@marianotepper marianotepper marked this pull request as ready for review November 5, 2025 13:38
@marianotepper marianotepper requested a review from jkni November 5, 2025 14:02
Member

@michaeljmarshall michaeljmarshall left a comment

I am posting a partial review with some relatively minor suggestions. I'll revisit later today or tomorrow.

Member

@michaeljmarshall michaeljmarshall left a comment

Approving as a maintainer of a consuming application. I didn't analyze every line of the PR, but I did perform downstream tests using fused adc in CC and things appear to work as expected.

Contributor

@jshook jshook left a comment

@marianotepper It would be good to see the level of coverage of new/changed code here. Many of the changes are absolutely dependent on numerical and functional unit tests. It's non-trivial to see this overlap here.

Collaborator

@tlwillke tlwillke left a comment

LGTM. I added one comment about a possible omission of NVQ as a second-pass option in README.md. It would also be great to see some empirical data on the memory savings; the other performance data is thorough enough.

Thanks for this major contribution!

@marianotepper
Collaborator Author

marianotepper commented Nov 19, 2025

@tlwillke I did some measurements for ada002-100k, using 192 PQ segments. The dataset contains 99562 vectors.

According to the ramBytesUsed estimate in PQVectors, they should take 19.74 MB. Roughly speaking, this matches the raw byte count: 192 * 99562 / (1024 * 1024) = 18.23 MB.
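That back-of-envelope figure assumes one byte per PQ segment per vector; the gap to 19.74 MB would then come from whatever object overhead ramBytesUsed accounts for. A trivial check of the arithmetic:

```java
// Raw size of the PQ codes: one byte per segment per vector.
public class PqMemoryEstimate {
    static double megabytes(long segments, long vectors) {
        return segments * vectors / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        System.out.printf("raw PQ codes: %.2f MB%n",
                          megabytes(192, 99_562)); // 18.23 MB
    }
}
```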

I used:

System.gc();
double usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.format("Used Memory: [%.2f MB]%n", usedMemory / (1024 * 1024));

to measure the memory used before and after loading the PQVectors in memory, which the fused graph avoids.

According to this method, the memory used before loading the PQVectors is 672.02 MB, and 697.73 MB after. Thus, the PQVectors occupy 25.71 MB. This is larger than 19.74 MB, so either the estimate is a bit optimistic or the garbage collector did not actually collect everything.

When running the fused graph index, the memory consumption pre and post loading is flat, as there is no loading.

Happy to incorporate these changes in Grid. They have performance upsides and downsides in the specific setting of Grid, where we run a matrix of configurations, so the efficiency trade-off may differ from running a single configuration.
