Support embeddings models #3252
Conversation
Review comment on the diff, near these lines:

```python
from pydantic_ai.models.instrumented import InstrumentationSettings
from pydantic_ai.providers import infer_provider

KnownEmbeddingModelName = TypeAliasType(
```

Add a test like this one to verify this is up to date:

```python
def test_known_model_names():  # pragma: lax no cover
```
Thanks for starting this, and please do let me know if you need help :) One thing you might want to support from the start: embedding models have a limit on how many tokens of input they can handle, and most providers will raise an error if you exceed it. I would not necessarily truncate (as some cookbooks do) and would still just raise, but I would be grateful to have the token limit and a way to count tokens available from the model side. The only difficulty I see with this is that not all providers expose their tokenizers; Ollama, for example, does not. But it would still be nice to have for the providers that do support it, as it's a crucial step when you are trying to chunk a document for embedding.

Edit: I am not suggesting that calling […]
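To make the chunking use case above concrete, here is a minimal sketch of how a caller might split a document so each chunk fits a model's token limit. The `count_tokens` callable and `max_input_tokens` value are assumptions standing in for whatever the PR ultimately exposes; the names are illustrative, not the final API:

```python
from typing import Callable


def chunk_for_embedding(
    text: str,
    count_tokens: Callable[[str], int],
    max_input_tokens: int,
) -> list[str]:
    """Split text on word boundaries into chunks that each fit the token limit."""
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        # Start a new chunk once adding the next word would exceed the limit.
        if current and count_tokens(candidate) > max_input_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy tokenizer (one token per word) so the limit is easy to reason about;
# a real caller would use the provider's tokenizer, e.g. tiktoken for OpenAI.
toy_count = lambda s: len(s.split())
print(chunk_for_embedding("one two three four five", toy_count, 2))
# → ['one two', 'three four', 'five']
```

With a real tokenizer the same loop works unchanged, which is why having `count_tokens` on the model side is useful even when the model itself still just raises on over-long input.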
gvanrossum left a comment
I would like to be able to comment on the API, but there are no tests showing how to call it.
@gvanrossum I'll make some progress on the PR today, but this is the API as it stands today:

```python
import asyncio

from pydantic_ai.embeddings import Embedder

embedder = Embedder("openai:text-embedding-3-large")


async def main():
    result = await embedder.embed("Hello, world!")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

With Azure OpenAI you currently have to create the model and provider manually, but we'll make this easier:

```python
import asyncio

from pydantic_ai.embeddings import Embedder
from pydantic_ai.embeddings.openai import OpenAIEmbeddingModel
from pydantic_ai.providers.azure import AzureProvider

model = OpenAIEmbeddingModel("text-embedding-3-large", provider=AzureProvider())
embedder = Embedder(model)


async def main():
    result = await embedder.embed("Hello, world!")
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```
Nice. Do you have a bulk API too? That's essential for typeagent.
@gvanrossum Yep, the […]
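One practical detail behind a bulk API: providers also cap how many inputs a single embeddings request may carry, so a bulk path typically batches client-side. A minimal, hedged sketch of that batching step (not the PR's implementation; the cap of 3 below is arbitrary for illustration):

```python
from typing import Iterator


def batched(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield fixed-size batches so each request stays under the provider's input cap."""
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]


# With a cap of 3 inputs per request, 7 texts become 3 requests.
batches = list(batched([f"doc-{n}" for n in range(7)], 3))
print([len(b) for b in batches])  # → [3, 3, 1]
```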
@gvanrossum In case you'd like to give it a try pre-release, I've made some progress today, including support for […]
Unfortunately I haven't managed to get to this this week. Next week should be better.
I use lancedb.
I am using something I wrote myself. Persistence is optional (currently embeddings are stored in sqlite). Here's the code: […]
Nice work on this!
Hi, does it have support for local embeddings?
Based on the documentation, I think it does.
Thanks @daikeren, implemented here and it works.
Tiktoken has a subtle footgun that bites us occasionally: its lazy loading of tokenizers. We run in an air-gapped environment, and in the best case it just fails immediately; in the worst case it hangs the process until it times out. I haven't tested this update yet, but I wanted to share this, as I don't see any particular support for offline/air-gapped environments.
Thanks @Mazyod. In my case I use it with internet access, so it downloads normally. But since you brought up the subject: how do I cache it in a Docker image so I don't have to download it again when the application needs it?
@paulocoutinhox I used this approach for a while, until it failed because the tiktoken cache was outdated and it attempted to refetch:

```dockerfile
# Cache the tiktoken encoding from the internet at build time
RUN http_proxy=http://my-proxy \
    https_proxy=http://my-proxy \
    python -c "import tiktoken; tiktoken.encoding_for_model('gpt-4o')"
```
@Mazyod Good catch about the offline environments; can you please file an issue for that? We can at the very least document a workaround like that one for Docker.
Yeah, a solution for this would be nice. 💯
Does this work with Bedrock yet? I haven't quite worked it out, or maybe something isn't implemented yet?

By way of example, on Bedrock (not using pydantic-ai yet) I'm using the embedding models "cohere.embed-english-v3" and "amazon.titan-embed-text-v2". My assumption was that I would prefix these with "bedrock:". Have the embedding models available on Bedrock been added anywhere in the codebase? So far, I get an unknown-model error using various combinations.
@stuartaxonHO The supported providers are documented under https://ai.pydantic.dev/embeddings/#providers; Bedrock is not yet one of them, but contributions are welcome!
@DouweM the word "provider" is a little overloaded. What should happen to handle embedding models from different providers on one API provider, e.g. Cohere models on Bedrock vs. the Amazon models? (The model parameters are different, and one response uses `embeddings[0]` while the other uses `embedding`.)

Also, in the code I noticed there's a list of LLM models, but I can't find one of embedding models. Would that need to be added?
@stuaxo I agree the provider/model terms are a bit overloaded; we use them as described in https://ai.pydantic.dev/models/overview/#models-and-providers. That means that in this context, "model" maps to "API format" and "provider" maps to "API client/base URL". So in this case we'd need a […]
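To make that mapping concrete: on Bedrock, the request and response shapes differ per model family, so a single Bedrock client would dispatch on the model id. A hedged sketch, not code from this PR; the field names follow the publicly documented Bedrock request formats for Cohere Embed and Titan Embed, but verify against the current AWS docs before relying on them:

```python
def build_embed_request(model_id: str, text: str) -> dict:
    """Build the per-family request body a Bedrock InvokeModel call would expect."""
    if model_id.startswith("cohere.embed"):
        # Cohere Embed on Bedrock: batched "texts", result under "embeddings".
        return {"texts": [text], "input_type": "search_document"}
    if model_id.startswith("amazon.titan-embed"):
        # Titan Embed: single "inputText", result under "embedding".
        return {"inputText": text}
    raise ValueError(f"Unknown embedding model family: {model_id}")


def extract_embedding(model_id: str, response_body: dict) -> list[float]:
    """Pull the vector out of the per-family response shape."""
    if model_id.startswith("cohere.embed"):
        return response_body["embeddings"][0]
    return response_body["embedding"]


print(build_embed_request("amazon.titan-embed-text-v2", "hello"))
# → {'inputText': 'hello'}
```

This is exactly the kind of per-family branching a Bedrock embedding model class would encapsulate behind the common `Embedder` interface.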
Started this in collaboration with @DouweM. I'd like to ensure consensus on the API design before adding the remaining providers, Logfire instrumentation, docs, and tests.
This is inspired by the approach in haiku.rag, though we adapted it to be a bit closer to how the `Agent` APIs are used (and how you can override model, settings, etc.).

Closes #58
Example:
To do:

- `Embedder.embed_sync`
- `count_tokens`
- `max_input_tokens`
- `logfire.instrument_pydantic_ai()`: Instrument Pydantic AI embedders and embedding models (logfire#1575)
- Error handling (`ModelAPIError`)