Voyageai integration update #5598
Conversation
Adding token counting and flexible batch size; extending the tests
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
VoyageAI Contextual and Multimodal Model Integration & Token Counting Support

This PR introduces comprehensive support for contextual and multimodal embedding models for VoyageAI, adding functionality for multimodal (text+image) embeddings, contextual model handling, and a robust batching/token-counting mechanism designed to operate within VoyageAI model token limits. The embedding function API is enhanced to allow flexible batch sizing, token counting, and support for a wide range of VoyageAI models with specific configuration options. The update also comes with a major expansion of test coverage across contextual, multimodal, batching, and token-counting scenarios. API compatibility and error handling are improved, and new test suites verify correct behavior for all modes, including integration with Chroma's multimodal collections.

Key Changes
- Expanded

Affected Areas
- Embedding functions and abstraction in

This summary was automatically generated by @propel-code-bot
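The batching/token-counting mechanism described above can be sketched as a generator that packs texts into batches subject to both a token budget and a batch-size cap. This is a minimal illustration under stated assumptions, not the PR's actual implementation: the helper name `build_token_limited_batches` and the injected `count_tokens` callable are hypothetical stand-ins for the real `self._client.tokenize`-based logic.

```python
from typing import Callable, Iterator, List


def build_token_limited_batches(
    texts: List[str],
    count_tokens: Callable[[List[str]], List[int]],
    max_tokens_per_batch: int,
    max_batch_size: int,
) -> Iterator[List[str]]:
    """Group texts into batches that respect both a token budget and a size cap."""
    # Count tokens for all texts up front (one tokenizer call).
    token_counts = count_tokens(texts)
    batch: List[str] = []
    batch_tokens = 0
    for text, n_tokens in zip(texts, token_counts):
        # Start a new batch if adding this text would exceed either limit.
        if batch and (
            batch_tokens + n_tokens > max_tokens_per_batch
            or len(batch) >= max_batch_size
        ):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += n_tokens
    if batch:
        yield batch
```

Note that a single text larger than the token budget is still emitted as its own batch, since a text cannot be split; a real implementation would likely raise or truncate in that case.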
chromadb/utils/embedding_functions/voyageai_embedding_function.py
```python
# Tokenize all texts in one API call
all_token_lists = self._client.tokenize(texts, model=self.model_name)
token_counts = [len(tokens) for tokens in all_token_lists]
```
[BestPractice]
Potential API call failure: the `_build_batches` method calls `self._client.tokenize(texts, model=self.model_name)`, but there is no error handling if this call fails. If the tokenize API is unavailable or returns an error, the entire embedding process will crash.
Add error handling:
Suggested Change
Suggested Change

```python
# Tokenize all texts in one API call
try:
    all_token_lists = self._client.tokenize(texts, model=self.model_name)
    token_counts = [len(tokens) for tokens in all_token_lists]
except Exception:
    # Fallback to simple batching by batch_size if tokenization fails
    if self.batch_size:
        for i in range(0, len(texts), self.batch_size):
            yield texts[i:i + self.batch_size]
    else:
        yield texts
    return
```
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
File: chromadb/utils/embedding_functions/voyageai_embedding_function.py
Line: 158
`self._client.tokenize` runs locally. Since tokenization is an important feature here, I feel that if it fails, it is better for the whole process to fail.
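The fail-fast behavior preferred here can be sketched as letting tokenization errors propagate, wrapped with context, instead of silently degrading to size-only batching. The helper below is a hypothetical illustration, assuming only that the client exposes a `tokenize(texts, model=...)` method returning a token list per text; the function name is not from the PR.

```python
from typing import Any, List


def count_tokens_or_raise(client: Any, texts: List[str], model_name: str) -> List[int]:
    """Count tokens per text; fail fast if local tokenization breaks."""
    try:
        token_lists = client.tokenize(texts, model=model_name)
    except Exception as exc:
        # Surface the failure immediately rather than falling back to
        # size-only batching, which could silently exceed model token limits.
        raise RuntimeError(f"Tokenization failed for model {model_name!r}") from exc
    return [len(tokens) for tokens in token_lists]
```

Chaining with `raise ... from exc` preserves the original traceback, so the caller still sees why tokenization failed.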
Co-authored-by: propel-code-bot[bot] <203372662+propel-code-bot[bot]@users.noreply.github.com>
Description of changes
- VoyageAI contextual and multimodal support
- Adding token counting
- Adding more tests
Test plan
How are these changes tested?
- `pytest` for Python
- `yarn test` for JS
- `cargo test` for Rust

Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
No
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?