.Net: Vector store abstractions hybrid search ADR #10196

westey-m · 2025-01-15T15:33:58Z

Motivation and Context

Create ADR for adding hybrid search support to the VectorStore abstractions.

Description

Adding hybrid ADR document

Contribution Checklist

The code builds clean without any errors or warnings
The PR follows the SK Contribution Guidelines and the pre-submission formatting script raises no violations
All unit tests pass, and I have added new tests where possible
I didn't break anyone 😄

roji

Great stuff @westey-m, here are some thoughts.

roji · 2025-01-22T12:49:01Z

docs/decisions/00NN-hybrid-search.md

+    // The name of the property to target the text search against.
+    public string? TextPropertyName { get; init; }
+    // Allow fusion method to be configurable for dbs that support configuration. If null, a default is used.
+    public string FusionMethod { get; init; } = null;


Is it sufficient to simply allow users to choose the fusion method name, or do different fusion methods also have parameters which would need to be set? That may complicate things here, possibly requiring a hierarchy of HybridSearchOptions to support the different parameters etc...

westey-m · 2025-01-23

Good point. Milvus and MongoDB allow weights to be provided for RRF. The other DBs in the sample that support RRF don't seem to offer anything similar. Milvus also supports a setting for its Weighted method.
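To illustrate why a fusion method name alone may not be enough, here is a minimal sketch of weighted Reciprocal Rank Fusion (RRF), where each document scores the sum of `weight / (k + rank)` over the result lists it appears in. The `WeightedRrf` class, its `Fuse` method, and the default `k = 60` are assumptions made for illustration; they are not part of the ADR or of any connector, and each database applies its own variant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the ADR: fuses ranked id lists with
// weighted RRF, score(d) = sum over lists of weight_i / (k + rank_i(d)).
public static class WeightedRrf
{
    public static IReadOnlyList<KeyValuePair<string, double>> Fuse(
        IReadOnlyList<IReadOnlyList<string>> rankedLists,
        IReadOnlyList<double> weights,
        int k = 60)
    {
        if (rankedLists.Count != weights.Count)
        {
            throw new ArgumentException("One weight is required per ranked list.");
        }

        var scores = new Dictionary<string, double>();
        for (int i = 0; i < rankedLists.Count; i++)
        {
            for (int rank = 0; rank < rankedLists[i].Count; rank++)
            {
                string id = rankedLists[i][rank];
                scores.TryGetValue(id, out double current);
                // Ranks are 1-based in the formula, hence rank + 1.
                scores[id] = current + weights[i] / (k + rank + 1);
            }
        }

        // Highest fused score first.
        return scores.OrderByDescending(pair => pair.Value).ToList();
    }
}
```

Both `weights` and `k` change the ranking, which is the kind of per-method parameter that a plain `FusionMethod` string cannot carry.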

roji · 2025-01-22T12:51:52Z

docs/decisions/00NN-hybrid-search.md

+    public string FusionMethod { get; init; } = null;
+
+    public VectorSearchFilter? Filter { get; init; }
+    public int Top { get; init; } = 3;


Continuing on the above, it may make sense to have an abstract base class for the common options (e.g. Top/Skip, which would seem to be a part of any search type). Though I'm not sure, and if we haven't captured such similarities via a base class it's not the end of the world either...

westey-m · 2025-01-23

Yeah, it would help a bit with implementation coding, where we want to pass the common options around. As you say though, it's not the end of the world. It shouldn't affect users.
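A minimal sketch of the base-class idea, assuming hypothetical type names (`SearchOptionsBase`, `HybridSearchOptionsSketch`, `PagingHelper`) that do not appear in the ADR. Only `Top` (with its default of 3), `TextPropertyName`, and `FusionMethod` come from the quoted diff; `Skip` is taken from the comment above. The point is that implementation code can accept the base type without caring which search kind it serves.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical base type holding the options every search kind shares.
public abstract class SearchOptionsBase
{
    public int Top { get; init; } = 3;
    public int Skip { get; init; } = 0;
}

// Hypothetical hybrid-specific options layered on top of the shared ones.
public sealed class HybridSearchOptionsSketch : SearchOptionsBase
{
    public string? TextPropertyName { get; init; }
    public string? FusionMethod { get; init; }
}

public static class PagingHelper
{
    // Shared implementation code only needs the base type.
    public static IEnumerable<T> ApplyPaging<T>(IEnumerable<T> results, SearchOptionsBase options)
        => results.Skip(options.Skip).Take(options.Top);
}
```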

roji · 2025-01-22T12:54:43Z

docs/decisions/00NN-hybrid-search.md

+{
+    Task<VectorSearchResults<TRecord>> KeywordVectorizableHybridSearch(
+        string description,
+        string? keywords = default,


Intentionally nullable? The analogous non-vectorizable signature above has it required. Same below with KeywordVectorizableHybridSearchOptions.

westey-m · 2025-01-23

The idea here is that if you wanted to use the description for both the vector and the keywords, you don't need to pass the same string twice. Do you think it's confusing? If so, we can just make it so that the user always passes both, even if they are the same string.

roji · 2025-01-22T13:09:47Z

docs/decisions/00NN-hybrid-search.md

+SparseVectorPropertyName
+
+VectorPropertyName
+TextPropertyName


Yeah, naming is always tricky.

First, for TextPropertyName I'd even consider FullTextPropertyName, as that more closely connects the naming with the search type being done.

For SparseVectorPropertyName, ideally we'd have a naming that conveys the function/use of the property rather than its type ("sparse vector"). For example, does something like DocumentFrequencyDataProperty make sense (you know this area more than me)? In other words, I'm trying to find something that will express the data contained in the sparse vector - and what it's used for.

If we manage to do that, then the name SparseVectorPropertyName is replaced by some functional term, and there's no more ambiguity with sparse/dense; at that point it should probably be fine to just call the main vector (for vector search) VectorPropertyName, just like we call it on the non-hybrid search (this consistency should be a goal IMHO).

westey-m · 2025-01-23

I was trying to avoid FullText since I (maybe incorrectly) associate the word with more advanced search capabilities like wildcards, partial word matches and potentially even boolean logic, and these may not necessarily be supported by all the DBs. Maybe my definition of FullText is too narrow.

For SparseVectorPropertyName, with regards to using its type rather than usage, I'm following the same naming we are using elsewhere. E.g. we are not calling Vectors "Embeddings", since that assumes a usage that may not be there. I'm not sure it makes sense to put the usage in the name, when sparse vectors might be generated in a way that isn't related to text document frequency, and the database is just doing a dot product comparison on the vectors. I actually think that the search method name and options name may be wrong in this case as well, and I might need to generalize it further. Thoughts?

roji · 2025-01-22T13:13:42Z

docs/decisions/00NN-hybrid-search.md

+        IEnumerable<string> keywords,
+        KeywordVectorizedHybridSearchOptions options,
+        CancellationToken cancellationToken);
+    Task<VectorSearchResults<TRecord>> KeywordVectorizedHybridSearch(


The single-string overload could be an extension method which simply calls the multiple-string overload. This would obviate having to implement the single-string overload in each and every provider (and doing the exact same thing).

(if we only needed to support modern .NET we'd use a default interface implementation instead)
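A minimal sketch of the extension-method approach, assuming a simplified, hypothetical interface (`IKeywordHybridSearchSketch`) that returns plain strings instead of the ADR's `VectorSearchResults<TRecord>`, and that takes `ICollection<string>` rather than `IEnumerable<string>`. `EchoSearch` is a toy provider included only to show the call path.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for the ADR's interface: providers
// implement only the multi-keyword overload.
public interface IKeywordHybridSearchSketch
{
    Task<IReadOnlyList<string>> KeywordVectorizedHybridSearch(
        ReadOnlyMemory<float> vector,
        ICollection<string> keywords,
        CancellationToken cancellationToken = default);
}

public static class KeywordHybridSearchSketchExtensions
{
    // Single-keyword convenience overload, written once for all providers.
    // It wraps the keyword in an array and forwards to the interface method.
    public static Task<IReadOnlyList<string>> KeywordVectorizedHybridSearch(
        this IKeywordHybridSearchSketch search,
        ReadOnlyMemory<float> vector,
        string keyword,
        CancellationToken cancellationToken = default)
        => search.KeywordVectorizedHybridSearch(vector, new[] { keyword }, cancellationToken);
}

// Toy provider used only for illustration: it echoes the keywords it received.
public sealed class EchoSearch : IKeywordHybridSearchSketch
{
    public Task<IReadOnlyList<string>> KeywordVectorizedHybridSearch(
        ReadOnlyMemory<float> vector,
        ICollection<string> keywords,
        CancellationToken cancellationToken = default)
        => Task.FromResult<IReadOnlyList<string>>(keywords.ToList());
}
```

A call such as `new EchoSearch().KeywordVectorizedHybridSearch(vector, "apple")` binds to the extension method, because a `string` is not convertible to `ICollection<string>`, and the provider still sees a one-element collection.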

roji · 2025-01-22T13:20:33Z

docs/decisions/00NN-hybrid-search.md

+        CancellationToken cancellationToken);
+```
+
+Pros: Easier for a user to use, since they don't need to do any keyword splitting themselves.


Is there any support for doing this search for terms that contain spaces (e.g. multiple words)? I get that word frequency (and not frequencies of arbitrary strings) has been precalculated in the database, and so the keyword must be a word that would match the document frequency data.

But looking forward, if/when we add support for hybrid search where the user passes in a sparse vector, they could (at least in theory) produce a sparse vector for any strings which they find interesting - including some which may have spaces in them, no? If so, then implicitly splitting strings inside the API would make it impossible to use the API in such scenarios, right?

In any case, implicitly doing the splitting inside seems quite magical and non-.NET-ish... I think it's better to be quite explicit here and have splitting happening outside. I'm also unsure that users will always have an already concatenated string, as opposed to a list - which they would then have to join into a single string before being able to use the above API... That requires some assumptions on what consuming code does, how UIs are designed, etc. etc.

westey-m · 2025-01-29

Considering that the primary use case for this is RAG scenarios, my expectation would be that the input keywords would be coming from a natural language query from a user. The developer would need to pre-process this sentence though to extract keywords or remove noise using methods specific to the language(s) in question.
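As a rough illustration of that pre-processing step, here is a minimal sketch that strips punctuation and a tiny English stop-word list from a natural language query. The `KeywordExtractor` class and its stop-word list are assumptions made for illustration; real applications would likely use a language-specific tokenizer or library instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical pre-processing helper that lives in application code,
// outside the vector store abstraction.
public static class KeywordExtractor
{
    // Deliberately tiny, English-only stop-word list for illustration.
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "a", "an", "the", "is", "are", "of", "to", "in", "for", "what", "how" },
        StringComparer.OrdinalIgnoreCase);

    private static readonly char[] Separators = { ' ', ',', '.', '?', '!', ';', ':' };

    public static IReadOnlyList<string> Extract(string query) =>
        query.Split(Separators, StringSplitOptions.RemoveEmptyEntries)
             .Where(word => !StopWords.Contains(word))
             .Distinct(StringComparer.OrdinalIgnoreCase)
             .ToList();
}
```

The resulting list can then be passed to a keyword-collection overload, which keeps the splitting policy in the hands of the caller.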

roji · 2025-01-22T13:21:03Z

docs/decisions/00NN-hybrid-search.md

+
+### 3. Accept either in interface but throw for not supported
+
+Accept either option but throw for the one not supported by the underly DB.


I'm not sure how this is different from option 3 above... If there are databases that don't support multiple keywords, then wouldn't option 2 also have to throw if multiple keywords are provided?

westey-m · 2025-01-29

I'll update the text of number 3 to explain this better. With option 3 the idea is that the connector would combine or split the keywords depending on what is required by the underlying DB.
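A minimal sketch of what "combine or split" could look like inside a connector, assuming a hypothetical `KeywordNormalizer` helper that is not part of the ADR. Whitespace is used as the delimiter purely for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical connector-side helper; not part of the ADR.
public static class KeywordNormalizer
{
    // For a database that accepts one query string: join the keywords.
    public static string Combine(IEnumerable<string> keywords)
        => string.Join(" ", keywords);

    // For a database that accepts individual terms: split the string.
    public static IReadOnlyList<string> Split(string keywords)
        => keywords.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
}
```

Note that the round trip is lossy: combining `"new york"` and `"pizza"` and splitting again yields three terms, which is the multi-word concern raised earlier in this review.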

roji · 2025-01-22T13:24:16Z

docs/decisions/00NN-hybrid-search.md

+```csharp
+    Task<VectorSearchResults<TRecord>> KeywordVectorizedHybridSearch(
+        TVector vector,
+        IEnumerable<string> keywords,


Exposing IEnumerable<string> in abstraction APIs can be problematic, if there's any chance it would need to be enumerated twice inside. For example, if some provider needs to know the number of keywords up-front (e.g. in order to allocate some array or whatever), then they'd need to enumerate twice, which might be extremely heavy (e.g. if the IEnumerable represents some big LINQ query). The way I see it, exposing an IEnumerable parameter effectively promises the user that the argument will only be enumerated once; if that's not a reasonable promise we're sure we can keep, it's better to accept something more specific, like an ICollection, or in this case, an ISet. This forces the user to do the materialization of their LINQ query and pass the results to the API, ensuring single-enumeration.
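The double-enumeration cost can be shown with a minimal sketch. `EnumerationDemo` and its members are hypothetical names used only for illustration; the lazy sequence counts how often its source is read, standing in for an expensive LINQ query.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EnumerationDemo
{
    // Counts how many items have been pulled from the lazy source.
    public static int SourceReads;

    // Lazy sequence standing in for an expensive LINQ query.
    public static IEnumerable<string> LazyKeywords()
    {
        foreach (var word in new[] { "apple", "pie" })
        {
            SourceReads++;
            yield return word;
        }
    }

    // Enumerates the input twice: once to count, once to copy.
    public static string[] CopyNaively(IEnumerable<string> keywords)
    {
        var buffer = new string[keywords.Count()];
        int index = 0;
        foreach (var keyword in keywords)
        {
            buffer[index++] = keyword;
        }
        return buffer;
    }

    // ICollection exposes Count and CopyTo, so the caller materializes once
    // and the provider never re-runs the underlying query.
    public static string[] CopyFromCollection(ICollection<string> keywords)
    {
        var buffer = new string[keywords.Count];
        keywords.CopyTo(buffer, 0);
        return buffer;
    }
}
```

With the two-item source above, `CopyNaively(LazyKeywords())` reads the source four times, while `CopyFromCollection(LazyKeywords().ToList())` reads it twice, once per item during materialization.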

westey-m added 3 commits December 3, 2024 15:33

Add initial hybrid ADR doc.

f19e88e

Add sparse data model row.

531d220

Add cosmosdb nosql and mongo db to comparison. Add FusionMethod option. Add property naming section.

5819cfb

markwallace-microsoft added the documentation label Jan 15, 2025

westey-m marked this pull request as draft January 15, 2025 15:34

westey-m added 7 commits January 15, 2025 15:48

Fix typo

26eaa82

Fix more typos.

e6e1819

TF-IDF .net link

174cbd1

Add another decision to adr and improve formatting.

bab8740

Add more keyword param options

5f119ce

Merge branch 'main' into vector-store-hybrid-adr

c56ffb7

Add Azure AI Search implementation and common keyword hybrid tests.

165cac1

markwallace-microsoft added the .NET, kernel, and memory labels Jan 20, 2025

github-actions bot changed the title from "Vector store abstractions hybrid search ADR" to ".Net: Vector store abstractions hybrid search ADR" Jan 20, 2025

westey-m temporarily deployed to integration January 20, 2025 15:19 — with GitHub Actions Inactive

Merge branch 'main' into vector-store-hybrid-adr

92853ee

westey-m temporarily deployed to integration January 20, 2025 17:40 — with GitHub Actions Inactive

Add ability to choose text property for hybrid azure ai search.

ac95190

westey-m temporarily deployed to integration January 20, 2025 18:28 — with GitHub Actions Inactive

westey-m added 2 commits January 22, 2025 11:14

Fix namespace issue.

fee9651

Merge branch 'main' into vector-store-hybrid-adr

3d58e63

westey-m temporarily deployed to integration January 22, 2025 11:16 — with GitHub Actions Inactive

roji reviewed Jan 22, 2025

Merge branch 'main' into vector-store-hybrid-adr

a68afec

westey-m temporarily deployed to integration January 23, 2025 09:42 — with GitHub Actions Inactive

Update ADR with suggestions from pr and other improvements.

a920489

westey-m temporarily deployed to integration January 23, 2025 11:40 — with GitHub Actions Inactive

Add options around index required params.

c153aa9

westey-m temporarily deployed to integration January 23, 2025 14:54 — with GitHub Actions Inactive

Add support for azure cosmos db nosql hybrid search, without configurable full text search language

cb358fb

westey-m had a problem deploying to integration January 24, 2025 14:42 — with GitHub Actions Error

westey-m added 2 commits January 24, 2025 14:45

Fix typos.

589528a

Fix typo

f85e098

westey-m temporarily deployed to integration January 24, 2025 14:46 — with GitHub Actions Inactive

westey-m added 2 commits January 28, 2025 12:30

Add a comparison of keyword matching behaviors between different DBs.

c774af9

Add a qdrant hybrid search implementation

585405a

westey-m had a problem deploying to integration January 28, 2025 12:32 — with GitHub Actions Error

Fix typo

f7a65bc

westey-m temporarily deployed to integration January 28, 2025 12:34 — with GitHub Actions Inactive

Clarify "either" option further and fix typo.

c09e36a

westey-m temporarily deployed to integration January 29, 2025 14:29 — with GitHub Actions Inactive
