VoyageAI embeddings support #3856
Conversation
| voyageai_truncation: bool
| """Whether to truncate inputs that exceed the model's context length.
| Defaults to True. If False, an error is raised for inputs that are too long.
With Cohere, I decided to default to False for consistency with OpenAI. Can that be the default here as well? Or do you think that was the wrong call?
Hmmm, it's hard to say; to be honest, I did not notice this in your original PR, or I would have raised it.
I would be inclined to have truncation on by default, primarily because we do not have usable, accurate tokenizers for all providers. Ollama, for example, has no tokenizer and truncates silently with no option. There is tiktoken, but that really only covers a few providers.
Having said that, since that's the default elsewhere let's keep it like that, I will adapt.
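For illustration, the agreed-on resolution (default `False`, matching the OpenAI and Cohere models) might look like the following sketch. The setting name comes from this diff; the `TypedDict` and helper are stand-ins, not the real pydantic-ai classes:

```python
from typing import TypedDict


class VoyageAIEmbeddingSettings(TypedDict, total=False):
    # Sketch of the settings shape discussed in this thread.
    voyageai_truncation: bool


def resolve_truncation(settings: VoyageAIEmbeddingSettings) -> bool:
    # Default to False for consistency with the other providers:
    # unset means "raise on over-long input" rather than silent truncation.
    return settings.get('voyageai_truncation', False)
```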
| """
| voyageai_output_dtype: Literal['float', 'int8', 'uint8', 'binary', 'ubinary']
| """The output data type for embeddings.
Our types currently require embeddings to be floats though
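To illustrate the type conflict: quantized outputs such as `int8` would need to be rescaled before they could satisfy a `list[float]` embedding type. A minimal sketch (this rescaling scheme is an assumption for illustration, not VoyageAI's):

```python
def to_float_embedding(values: list[int]) -> list[float]:
    # int8 outputs lie in [-128, 127]; rescale to [-1.0, 1.0) so they fit
    # a list[float] embedding type.
    return [v / 128.0 for v in values]
```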
| to use as defaults for this model.
| """
| self._model_name = model_name
| self._client = AsyncClient(
Please do follow the existing provider class / model class pattern for consistency.
| texts=list(inputs),
| model=self.model_name,
| input_type=voyageai_input_type,
| truncation=settings.get('voyageai_truncation', True),
See above, I think this should be False
| usage_data = {'total_tokens': total_tokens}
| response_data = {'model': model, 'usage': usage_data}
| return RequestUsage.extract(
We should only use this if the models have data in genai-prices. In this case, it's better to build a RequestUsage object manually. As you can see in the tests snapshots, this will actually fail to extract anything.
Did not notice. Will look thoroughly and adapt.
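Building the usage object manually, as suggested, could look roughly like this. The `RequestUsage` here is a minimal stand-in for pydantic-ai's class, and the field name is an assumption:

```python
from dataclasses import dataclass


@dataclass
class RequestUsage:
    # Minimal stand-in for pydantic_ai's RequestUsage; the field name is an
    # assumption for illustration.
    input_tokens: int = 0


def usage_from_voyageai(total_tokens: int) -> RequestUsage:
    # Build usage directly instead of calling RequestUsage.extract(), which
    # only yields data for providers covered by genai-prices.
    return RequestUsage(input_tokens=total_tokens)
```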
Address review comments on PR pydantic#3856:
- Create VoyageAIProvider class following Cohere pattern
- Change truncation default from True to False
- Remove voyageai_output_dtype setting (types require floats)
- Build RequestUsage manually instead of using extract()
- Update embeddings/__init__.py to pass provider to VoyageAI model
- Re-record VCR cassettes with live API
Force-pushed b934d5b to 88ab61b
Hi, with regards to the coverage issue that's blocking CI: The
I would remove the sentence-transformers case from
@ggozad The precedent for a local-model provider that doesn't actually do much is So even though it's not strictly necessary, I'd prefer to keep the
| model_kind = normalize_gateway_provider(model_kind)
| # Handle models that don't need a provider first
If we revert this change, would we get test coverage again?
Yep, you are right, reverted.
| # ALL FIELDS MUST BE `voyageai_` PREFIXED SO YOU CAN MERGE THEM WITH OTHER MODELS.
| voyageai_truncation: bool
Since multiple models now support toggling truncation, let's move this to the `EmbeddingSettings` superclass. We should keep supporting `cohere_truncate` as well, but can prioritize the main `truncate`.
I added `truncate` to `EmbeddingSettings` and kept the Cohere settings.
We don't need this field anymore then, right?
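The precedence being discussed could be sketched as follows. The `TypedDict`s are stand-ins for the real pydantic-ai settings classes, and the helper is hypothetical; only the setting names come from this thread:

```python
from typing import Literal, TypedDict


class EmbeddingSettings(TypedDict, total=False):
    # Shared setting, as proposed in this thread.
    truncate: bool


class CohereEmbeddingSettings(EmbeddingSettings, total=False):
    # Provider-specific setting kept for backwards compatibility;
    # takes priority over the shared `truncate` boolean.
    cohere_truncate: Literal['NONE', 'START', 'END']


def cohere_truncate_value(settings: CohereEmbeddingSettings) -> str:
    # cohere_truncate wins; otherwise map the shared boolean onto
    # Cohere's string values.
    if 'cohere_truncate' in settings:
        return settings['cohere_truncate']
    return 'END' if settings.get('truncate', False) else 'NONE'
```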
| assert api_key is None, 'Cannot provide both `voyageai_client` and `api_key`'
| assert base_url is None, 'Cannot provide both `voyageai_client` and `base_url`'
| assert max_retries == 0, 'Cannot provide both `voyageai_client` and `max_retries`'
| assert timeout is None, 'Cannot provide both `voyageai_client` and `timeout`'
Unless "most users" will need them, I'd prefer to not expose all the arguments on AsyncClient as arguments here: users can just pass their own voyageai_client if they want this level of control.
Done, left only api_key & voyageai_client.
| # Only pass base_url if explicitly set; otherwise use VoyageAI's default
| base_url = base_url or os.getenv('VOYAGE_BASE_URL')
| self._client = AsyncClient(
If this takes a http_client, we should use a cached version like we do in the openai provider etc.
The VoyageAI SDK does not support custom HTTP clients :(
| """The embedding model provider."""
| return self._provider.name
| async def embed(
It's a shame they don't (seem to) support counting tokens :(
Force-pushed 78e615f to 093eaa5
DouweM left a comment
@ggozad Thanks Yiorgis, a few more comments, + I just merged the Google embedding model, so you'll have a few conflicts to resolve
| """
| truncate: bool
| """Whether to truncate inputs that exceed the model's context length.
- We should specify that the default is `False`
- I think it's worth explaining that you can use the `max_input_tokens` and `count_tokens` methods to implement your own (smarter) "truncation"
| Defaults to False. If True, inputs that are too long will be truncated.
| """
| voyageai_input_type: VoyageAIEmbedInputType
Hmm, if it only supports `query` and `document` anyway, I don't think a setting is warranted. If "direct embedding without prefix" is something users would want to do, I think we should make it `'none'` instead of `None` so that the difference is clearer between this field being omitted (where we should use the default input type implied by the `embed_query`/`document` method) and it explicitly being set to none.
The None option, according to their docs, does "raw" embedding. My guess is that it's useful if, say, you want to do clustering or classification. For retrieval one would use document or query.
Will change to none as you suggested.
Hmm thinking about this more, what do you think about changing our EmbedInputType type to accept None as well, plus having Embedder.embed's input_type argument default to None? Then we would not need this custom setting at all anymore. I don't think any of the embeddings APIs require an input type, and if they do we can pick a reasonable default. OpenAI ignores the argument entirely anyway.
Then we would not need a new setting here at all, so even though it's kind of a separate task from this PR, I think it's worth trying it here so we don't introduce the new setting and then immediately deprecate it.
tests/test_embeddings.py (outdated)
| async def test_query_with_cohere_truncate(self, co_api_key: str):
| model = CohereEmbeddingModel('embed-v4.0', provider=CohereProvider(api_key=co_api_key))
| embedder = Embedder(model)
| result = await embedder.embed_query('Hello, world!', settings={'cohere_truncate': 'END'})  # pyright: ignore[reportArgumentType]
Why do we need the pyright ignore? Maybe if we use the CohereEmbeddingSettings constructor we won't need it
I added the series 4 models that were released I think yesterday. The other models I saw from series 1 & 2 are legacy and I would not include them.
# Conflicts:
#   docs/api/embeddings.md
#   tests/test_embeddings.py
…rity when raw embeddings are desired
@DouweM thank you for the thorough review, I think I addressed all your comments and merged the google embeddings changes. I also added the series 4 embedding models that were released yesterday, I think.
docs/embeddings.md (outdated)
| 'voyageai:voyage-3.5',
| settings=VoyageAIEmbeddingSettings(
|     dimensions=512,  # Reduce output dimensions
|     truncate=True,  # Truncate input if it exceeds context length
This is currently in the "VoyageAI-specific settings" section, but neither of these is actually VoyageAI-specific :) So I think we should mention `truncate` in the top-level Settings section (where we already mention `dimensions`), and change this section to be about `voyageai_input_type`, similar to the one about `google_task_type`.
| output_dimension=settings.get('dimensions'),
| input_type=cohere_input_type,
| max_tokens=settings.get('cohere_max_tokens'),
| truncate=settings.get('cohere_truncate', 'NONE'),
Let's specify in the `cohere_truncate` docstring that it overrides the `truncate` boolean.
| def __init__(self, *, voyageai_client: AsyncClient) -> None: ...
| @overload
| def __init__(self, *, api_key: str | None = None, voyageai_client: None = None) -> None: ...
This one should just accept the api key right?
With regards to the EmbedInputType:
I think the None choice in VoyageAI is a bit weird. Typically you would want to embed either for a query or for documents (when the differentiation is available). Raw embeddings are an edge case, as I mentioned, for when you want to do something other than semantic search.
This makes me think that None here is ambiguous, as it does not mean "use default behaviour" (query for most) but rather "use no prefixes in the embedding". So if we used None as the default, should we then introduce a "raw" setting?
Given we have meaningful methods for the common RAG case, i.e. embed_query() and embed_documents() and that different vendors have a mix of possibilities, I would keep input_type as vendor-specific.
But happy to discuss this, let me know what you think.
Pre-Review Checklist
- `make format` and `make typecheck`
Pre-Merge Checklist