Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.Net: Vector store abstractions hybrid search ADR #10196

Draft
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

westey-m
Copy link
Contributor

Motivation and Context

Create ADR for adding hybrid search support to the VectorStore abstractions.

Description

Adding hybrid ADR document

Contribution Checklist

@westey-m westey-m marked this pull request as draft January 15, 2025 15:34
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel memory labels Jan 20, 2025
@github-actions github-actions bot changed the title Vector store abstractions hybrid search ADR .Net: Vector store abstractions hybrid search ADR Jan 20, 2025
Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff @westey-m, here are some thoughts.

docs/decisions/00NN-hybrid-search.md Outdated Show resolved Hide resolved
// The name of the property to target the text search against.
public string? TextPropertyName { get; init; }
// Allow fusion method to be configurable for dbs that support configuration. If null, a default is used.
public string FusionMethod { get; init; } = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it sufficient to simply allow users to choose the fusion method name, or do different fusion methods also have parameters which would need to be set? That may complicated things here, possibly requiring a hierarchy of HybridSearchOptions to support the different parameters etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Milvus and MongoDB allows weights to be provided for RRF. The other DBs in the sample, don't seem to support something similar for those that support RRF. Milvus also supports a setting for Weighted.

docs/decisions/00NN-hybrid-search.md Outdated Show resolved Hide resolved
public string FusionMethod { get; init; } = null;

public VectorSearchFilter? Filter { get; init; }
public int Top { get; init; } = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing on the above, it may make sense to have an abstract base class for the common options (e.g. Top/Skip, which would seem to be a part of any search type). Though I'm not sure, and if we haven't captured such similarities via a base class it's not the end of the world either...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would help a bit with implementation coding, where we want to pass the common options around. As you say though, it's not the end of the world. It shouldn't affect users.

{
Task<VectorSearchResults<TRecord>> KeywordVectorizableHybridSearch(
string description,
string? keywords = default,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally nullable? The analogous non-vectorizable signature above has it required. Same below with KeywordVectorizableHybridSearchOptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that if you wanted to use the description for both the vector and the keywords, you don't need to pass the same string twice. Do you think it's confusing? If so, we can just make it so that the user always passes both, even if they are the same string.

SparseVectorPropertyName

VectorPropertyName
TextPropertyName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, naming is always tricky.

First, for TextPropertyName I'd even consider FullTextPropertyName, as that more closely connects the naming with the search type being done.

For SparseVectorPropertyName, ideally we'd have a naming that conveys the function/use of the property rather than its type ("sparse vector"). For example, does something like DocumentFrequencyDataProperty make sense (you know this area more than me)? In other words, I'm trying to find something that will express the data contained in the sparse vector - and what its used for.

If we manage to do that, then the name SparseVectorPropertyName is replaced by some functional term, and there's no more ambiguity with sparse/dense; at that point it should probably be find to just call the main vector (for vector search) VectorPropertyName, just like we call it on the non-hybrid search (this consistency should be a goal IMHO).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid FullText since I (maybe incorrectly) associate the word with more advanced search capabilities like wildcards, partial word matches and potentially even boolean logic, and these may not necessarily be supported by all the DBs. Maybe my definition of FullText is too narrow.

For SparseVectorPropertyName, with regards to using its type rather than usage, I'm following the same naming we are using elsewhere. E.g. we are not calling Vectors, Embeddings, since it assumes a usage that may not be there. I'm not sure it makes sense to put the usage in the name, when for sparse vectors they might be generated in a way that isn't related to text document frequency, and the database is just doing a dotproduct comparison on the vectors. I actually think that the search method name and options name may be wrong in this case as well, and I might need to generalize it further. Thoughts?

IEnumerable<string> keywords,
KeywordVectorizedHybridSearchOptions options,
CancellationToken cancellationToken);
Task<VectorSearchResults<TRecord>> KeywordVectorizedHybridSearch(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single-string overload could be an extension method which simply calls the multiple-string overload. This would obviate having to implement the single-string overload in each and every provider (and doing the exact same thing).

(if we only needed to support modern .NET we'd use a default interface implementation instead)

CancellationToken cancellationToken);
```

Pros: Easier for a user to use, since they don't need to do any keyword splitting themselves.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any support for doing this search for terms that contain spaces (e.g. multiple words)? I get that word frequency (and not frequencies of arbitrary strings) arbitrary has been precalculated in the database, and so the keyword must be a word that would much the document frequency data.

But looking forward, if/when we add support for hybrid search where the user passes in a sparse vector, they could (at least in theory) produce a sparse vector for any strings which they find interesting - including some which may have spaces in them, no? If so, then implicitly splitting on strings in the API would make it impossible to use the API in such scenarios, right?

In any case, implicitly doing the splitting inside seems quite magical and non-.NET-ish... I think it's better to be quite explicit here and have splitting happening outside. I'm also unsure that users will always have an already concatenated string, as opposed to a list - which they would then have to join into a single string before being able to use the above API... That requires some assumptions on what consuming code does, how UIs are designed, etc. etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that the primary use case for this is RAG scenarios, my expectation would be that the input keywords would be coming from a natural language query from a user. The developer would need to pre-process this sentence though to extract keywords or remove noise using methods specific to the language(s) in question.


### 3. Accept either in interface but throw for not supported

Accept either option but throw for the one not supported by the underly DB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this is different from option 3 above... If there are databases that don't supported multiple keywords, then wouldn't option 2 also have to throw if multiple keywords are provided?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the text of number 3 to explain this better. With option 3 the idea is that the connector would combine or split the keywords depending on what is required by the underlying DB.

```csharp
Task<VectorSearchResults<TRecord>> KeywordVectorizedHybridSearch(
TVector vector,
IEnumerable<string> keywords,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing IEnumerable<string> in abstraction APIs can be problematic, if there's any chance it would need to be enumerated twice inside. For example, if some provider needs to know the number of keywords up-front (e.g. in order to allocate some array or whatever), then they'd need to enumerate twice, which might be extremely heavy (e.g. if the IEnumerable represents some big LINQ query). The way I see it, exposing an IEnumerable parameter effectively promises the user that the argument will only be enumerated once; if that's not a reasonable promise we're sure we can keep, it's better to accept something more specific, like an ICollection, or in this case, an ISet. This forces the user to do the materialization of their LINQ query and pass the results to the API, ensuring single-enumeration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation kernel Issues or pull requests impacting the core kernel memory .NET Issue or Pull requests regarding .NET code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants