
Conversation


@ggozad ggozad commented Dec 26, 2025

  • Add VoyageAIEmbeddingModel for VoyageAI's embedding API
  • Support all current VoyageAI models including domain-specific ones (code, finance, law)
  • Add voyageai optional dependency group

Pre-Review Checklist

  • Any AI generated code has been reviewed line-by-line by the human PR author, who stands by it.
  • No breaking changes in accordance with the version policy.
  • Linting and type checking pass per make format and make typecheck.
  • PR title is fit for the release changelog.

Pre-Merge Checklist

  • New tests for any fix or new behavior, maintaining 100% coverage.
  • Updated documentation for new features and behaviors, including docstrings for API docs.

@DouweM DouweM added labels on Jan 6, 2026: feature (New feature request, or PR implementing a feature), size: M (Medium PR, 101-500 weighted lines)
voyageai_truncation: bool
"""Whether to truncate inputs that exceed the model's context length.
Defaults to True. If False, an error is raised for inputs that are too long.
Collaborator

With Cohere, I decided to default to False for consistency with OpenAI. Can that be the default here as well? Or do you think that was the wrong call?

Author

Hmm, it's hard to say. To be honest I didn't notice this in your original PR, or I would have raised it there.
I would be inclined to have truncation as the default, primarily because we do not have usable, accurate tokenizers for all providers. Ollama, for example, has no tokenizer and truncates silently with no option to disable it. There is tiktoken, but that really only covers a few providers.

Having said that, since that's the default elsewhere, let's keep it like that; I will adapt.

"""

voyageai_output_dtype: Literal['float', 'int8', 'uint8', 'binary', 'ubinary']
"""The output data type for embeddings.
Collaborator

Our types currently require embeddings to be floats though
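To illustrate the mismatch behind this comment: if the library's embedding types only allow floats, quantized output dtypes like int8 would need an explicit dequantization step before they could satisfy that type. A minimal sketch (the scale factor and function are hypothetical, not part of the VoyageAI SDK or pydantic-ai):

```python
# Hypothetical sketch: quantized int8 embeddings don't satisfy a float-only
# embedding type without an explicit dequantization step.

def dequantize_int8(values: list[int], scale: float = 1 / 127) -> list[float]:
    """Map int8 values in [-128, 127] back to approximate floats in [-1, 1]."""
    return [v * scale for v in values]

embedding_int8 = [127, -64, 0]
embedding_float = dequantize_int8(embedding_int8)
assert all(isinstance(v, float) for v in embedding_float)
```

Supporting the non-float dtypes directly would thus mean widening the shared embedding types, which is why the setting was dropped from this PR.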

to use as defaults for this model.
"""
self._model_name = model_name
self._client = AsyncClient(
Collaborator

Please do follow the existing provider class / model class pattern for consistency.

texts=list(inputs),
model=self.model_name,
input_type=voyageai_input_type,
truncation=settings.get('voyageai_truncation', True),
Collaborator

See above, I think this should be False

usage_data = {'total_tokens': total_tokens}
response_data = {'model': model, 'usage': usage_data}

return RequestUsage.extract(
Collaborator

We should only use this if the models have data in genai-prices. In this case, it's better to build a RequestUsage object manually. As you can see in the test snapshots, this will actually fail to extract anything.
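The suggestion amounts to constructing the usage object from the fields the response actually provides, rather than running a generic extractor over a response dict. A stand-in sketch (this `RequestUsage` dataclass is a hypothetical simplification, not the real pydantic-ai class):

```python
# Stand-in sketch: build usage directly from the VoyageAI response field,
# instead of a generic extractor that finds nothing for an unlisted model.
from dataclasses import dataclass


@dataclass
class RequestUsage:  # hypothetical stand-in for the real class
    input_tokens: int = 0


def usage_from_response(total_tokens: int) -> RequestUsage:
    # Embedding requests produce no output tokens, so all usage is input.
    return RequestUsage(input_tokens=total_tokens)
```

The design point is that manual construction cannot silently return an empty usage object the way a failed extraction can.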

Author

Did not notice. Will look thoroughly and adapt.

ggozad added 5 commits January 9, 2026 12:25
Address review comments on PR pydantic#3856:
- Create VoyageAIProvider class following Cohere pattern
- Change truncation default from True to False
- Remove voyageai_output_dtype setting (types require floats)
- Build RequestUsage manually instead of using extract()
- Update embeddings/__init__.py to pass provider to VoyageAI model
- Re-record VCR cassettes with live API
@ggozad ggozad force-pushed the voyageai-embeddings branch from b934d5b to 88ab61b on January 9, 2026 13:05

ggozad commented Jan 9, 2026

Hi, with regards to the coverage issue that's blocking CI:

The sentence-transformers case in infer_provider_class() (providers/__init__.py) appears to be dead code that was introduced in PR #3252 but never covered by tests.

  • SentenceTransformerEmbeddingModel doesn't use a provider - it runs locally
  • infer_embedding_model('sentence-transformers:...') handles this case directly without calling infer_provider_class

I would remove the sentence-transformers case from infer_provider_class() and delete providers/sentence_transformers.py since it serves no purpose, but this is your decision :)


DouweM commented Jan 13, 2026

@ggozad The precedent for a local-model provider that doesn't actually do much is OutlinesProvider, which only exists because of the model_profile method, and so that OutlinesModel can have a consistent __init__ signature with other models.

So even though it's not strictly necessary, I'd prefer to keep the SentenceTransformersProvider. I could imagine us adding EmbeddingModelProfile for example, which would then need SentenceTransformersProvider.model_profile.
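The pattern being described can be sketched as follows (a minimal illustration of the OutlinesProvider-style shape; the method signature and return value here are assumptions, not the actual pydantic-ai classes):

```python
# Sketch: a provider for a local model that mostly exists for consistency,
# giving the model class a uniform __init__ and a place to hang
# model_profile if an EmbeddingModelProfile is added later.
class SentenceTransformersProvider:
    @property
    def name(self) -> str:
        return 'sentence-transformers'

    def model_profile(self, model_name: str) -> None:
        # Local models need no per-model profile today; a profile object
        # could be returned here in the future.
        return None
```

So the class carries almost no behavior now, but keeps the provider/model pattern uniform across the codebase.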


model_kind = normalize_gateway_provider(model_kind)

# Handle models that don't need a provider first
Collaborator

If we revert this change, would we get test coverage again?

Author

Yep, you are right, reverted.


# ALL FIELDS MUST BE `voyageai_` PREFIXED SO YOU CAN MERGE THEM WITH OTHER MODELS.

voyageai_truncation: bool
Collaborator

Since multiple models now support toggling truncation, let's move this to the EmbeddingSettings superclass. We should keep supporting cohere_truncate as well, and it can take priority over the main truncate.
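The precedence being proposed can be sketched like this (the TypedDict shapes and the bool-to-'END'/'NONE' mapping are assumptions for illustration, not the merged implementation):

```python
# Sketch: a generic `truncate` bool on the shared settings, with the
# Cohere-specific `cohere_truncate` string taking priority when both are set.
from typing import TypedDict


class EmbeddingSettings(TypedDict, total=False):
    truncate: bool


class CohereEmbeddingSettings(EmbeddingSettings, total=False):
    cohere_truncate: str  # 'NONE' | 'START' | 'END'


def resolve_cohere_truncate(settings: CohereEmbeddingSettings) -> str:
    specific = settings.get('cohere_truncate')
    if specific is not None:
        return specific  # provider-specific setting wins
    # Map the generic boolean onto Cohere's enum (assumed mapping).
    return 'END' if settings.get('truncate') else 'NONE'
```

This keeps `cohere_truncate` working for existing users while letting the shared `truncate` cover the common case.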

Author

I added truncate to EmbeddingSettings and kept the Cohere settings.

Collaborator

We don't need this field anymore then, right?

assert api_key is None, 'Cannot provide both `voyageai_client` and `api_key`'
assert base_url is None, 'Cannot provide both `voyageai_client` and `base_url`'
assert max_retries == 0, 'Cannot provide both `voyageai_client` and `max_retries`'
assert timeout is None, 'Cannot provide both `voyageai_client` and `timeout`'
Collaborator

Unless "most users" will need them, I'd prefer not to expose all the arguments on AsyncClient as arguments here: users can just pass their own voyageai_client if they want this level of control.

Author

Done, left only api_key & voyageai_client.


# Only pass base_url if explicitly set; otherwise use VoyageAI's default
base_url = base_url or os.getenv('VOYAGE_BASE_URL')
self._client = AsyncClient(
Collaborator

If this takes a http_client, we should use a cached version like we do in the openai provider etc.

Author

The VoyageAI SDK does not support custom HTTP clients :(

"""The embedding model provider."""
return self._provider.name

async def embed(
Collaborator

It's a shame they don't (seem to) support counting tokens :(

@ggozad ggozad force-pushed the voyageai-embeddings branch from 78e615f to 093eaa5 on January 15, 2026 11:21
Collaborator

@DouweM DouweM left a comment

@ggozad Thanks Yiorgis, a few more comments. I also just merged the Google embedding model, so you'll have a few conflicts to resolve.

"""

truncate: bool
"""Whether to truncate inputs that exceed the model's context length.
Collaborator

  • We should specify that the default is False
  • I think it's worth explaining that you can use the max_input_tokens and count_tokens methods to implement your own (smarter) "truncation"
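The "smarter truncation" idea from the second bullet can be sketched as client-side trimming: measure the input against the model's limit and shorten it yourself instead of letting the API cut it off. Both helpers below are stand-ins (a real implementation would call the model's count_tokens and read its max_input_tokens):

```python
# Sketch: client-side truncation. The whitespace tokenizer is a crude
# stand-in for a real count_tokens method.
def count_tokens(text: str) -> int:
    return len(text.split())


def truncate_to_fit(text: str, max_input_tokens: int) -> str:
    words = text.split()
    while words and count_tokens(' '.join(words)) > max_input_tokens:
        words.pop()  # drop from the end; a smarter policy could summarize
    return ' '.join(words)
```

The advantage over `truncate=True` is that the caller controls *what* gets dropped, rather than the API silently cutting the tail.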


# ALL FIELDS MUST BE `voyageai_` PREFIXED SO YOU CAN MERGE THEM WITH OTHER MODELS.

voyageai_truncation: bool
Collaborator

We don't need this field anymore then, right?

Defaults to False. If True, inputs that are too long will be truncated.
"""

voyageai_input_type: VoyageAIEmbedInputType
Collaborator

Hmm, if it only supports query and document anyway, I don't think a setting is warranted. If "direct embedding without a prefix" is something users would want to do, I think we should make it 'none' instead of None, so that the difference is clearer between this field being omitted (meaning we should use the default input type implied by the embed_query/document method) and it being explicitly set to none.

Author

The None option, according to their docs, does "raw" embedding. My guess is that that's useful, say, if you want to do clustering or classification; for retrieval one would use document or query.
Will change it to none as you suggested.

Collaborator

Hmm thinking about this more, what do you think about changing our EmbedInputType type to accept None as well, plus having Embedder.embed's input_type argument default to None? Then we would not need this custom setting at all anymore. I don't think any of the embeddings APIs require an input type, and if they do we can pick a reasonable default. OpenAI ignores the argument entirely anyway.

Then we would not need a new setting here at all, so even though it's kind of a separate task from this PR, I think it's worth trying it here so we don't introduce the new setting and then immediately deprecate it.

async def test_query_with_cohere_truncate(self, co_api_key: str):
model = CohereEmbeddingModel('embed-v4.0', provider=CohereProvider(api_key=co_api_key))
embedder = Embedder(model)
result = await embedder.embed_query('Hello, world!', settings={'cohere_truncate': 'END'}) # pyright: ignore[reportArgumentType]
Collaborator

Why do we need the pyright ignore? Maybe if we use the CohereEmbeddingSettings constructor we won't need it

Author

I added the series 4 models that I think were released yesterday. The other models I saw, from series 1 & 2, are legacy and I would not include them.


ggozad commented Jan 16, 2026

@DouweM thank you for the thorough review. I think I addressed all your comments and merged the Google embeddings changes. I also added the series 4 embedding models that I think were released yesterday.

'voyageai:voyage-3.5',
settings=VoyageAIEmbeddingSettings(
dimensions=512, # Reduce output dimensions
truncate=True, # Truncate input if it exceeds context length
Collaborator

This is currently in the "VoyageAI-specific settings" section, but neither of these is actually VoyageAI-specific :) So I think we should mention truncate in the top-level Settings section (where we already mention dimensions), and change this section to be about voyageai_input_type, similar to the one about google_task_type.

output_dimension=settings.get('dimensions'),
input_type=cohere_input_type,
max_tokens=settings.get('cohere_max_tokens'),
truncate=settings.get('cohere_truncate', 'NONE'),
Collaborator

Let's specify in the cohere_truncate docstring that it overrides the truncate boolean.


def __init__(self, *, voyageai_client: AsyncClient) -> None: ...

@overload
def __init__(self, *, api_key: str | None = None, voyageai_client: None = None) -> None: ...
Collaborator

This one should just accept the API key, right?

Author

With regards to EmbedInputType:
I think the None choice in VoyageAI is a bit weird. Typically you would want to embed either for a query or for documents (when the differentiation is available); raw embeddings are an edge case, as I mentioned, for when you want to do something other than semantic search.
This makes me think that None here is ambiguous: it does not mean "use the default behaviour" (query for most), but rather "use no prefixes in the embedding". So if we used None as the default, should we then introduce a "raw" setting?
Given that we have meaningful methods for the common RAG case, i.e. embed_query() and embed_documents(), and that different vendors support a mix of possibilities, I would keep input_type vendor-specific.
But happy to discuss this, let me know what you think.


Labels

awaiting author revision, feature (New feature request, or PR implementing a feature), size: M (Medium PR, 101-500 weighted lines)


Development

Successfully merging this pull request may close these issues.

Support for VoyageAI embeddings

3 participants