-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[ENH] Embed query strings in search api #5599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: jai/schema-js-impl
Are you sure you want to change the base?
Conversation
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
7923660
to
ac96cfc
Compare
Embed String Queries Directly in Search API for Knn Operations This PR introduces the capability to embed string queries directly within Key Changes• Added Affected Areas• This summary was automatically generated by @propel-code-bot |
ac96cfc
to
7f43188
Compare
# Handle main embedding field | ||
if key == EMBEDDING_KEY: | ||
# Use the collection's main embedding function | ||
embedding = self._embed(input=[query_text], is_query=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Error handling gap: Multiple embedding function calls (self._embed
, self._sparse_embed
, embedding_func.embed_query
) can raise exceptions but aren't wrapped in try-catch blocks. If any embedding function fails (network issues, model errors, invalid input), the error will propagate uncaught and could crash the application.
Consider wrapping embedding calls:
try:
embedding = self._embed(input=[query_text], is_query=True)
except Exception as e:
raise ValueError(f"Failed to embed query '{query_text}': {e}") from e
Context for Agents
[**BestPractice**]
Error handling gap: Multiple embedding function calls (`self._embed`, `self._sparse_embed`, `embedding_func.embed_query`) can raise exceptions but aren't wrapped in try-catch blocks. If any embedding function fails (network issues, model errors, invalid input), the error will propagate uncaught and could crash the application.
Consider wrapping embedding calls:
```python
try:
embedding = self._embed(input=[query_text], is_query=True)
except Exception as e:
raise ValueError(f"Failed to embed query '{query_text}': {e}") from e
```
File: chromadb/api/models/CollectionCommon.py
Line: 773
f"Please provide an embedded vector or configure an embedding function." | ||
) | ||
|
||
def _embed_rank_string_queries(self, rank: Any) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Potential infinite recursion: The method _embed_rank_string_queries
recursively processes rank expressions but doesn't have a maximum depth check. For deeply nested or circular rank structures, this could cause a stack overflow.
Consider adding a depth limit:
def _embed_rank_string_queries(self, rank: Any, depth: int = 0, max_depth: int = 100) -> Any:
if depth > max_depth:
raise ValueError(f"Maximum recursion depth ({max_depth}) exceeded in rank expression")
# ... existing logic with depth + 1 passed to recursive calls
Context for Agents
[**BestPractice**]
Potential infinite recursion: The method `_embed_rank_string_queries` recursively processes rank expressions but doesn't have a maximum depth check. For deeply nested or circular rank structures, this could cause a stack overflow.
Consider adding a depth limit:
```python
def _embed_rank_string_queries(self, rank: Any, depth: int = 0, max_depth: int = 100) -> Any:
if depth > max_depth:
raise ValueError(f"Maximum recursion depth ({max_depth}) exceeded in rank expression")
# ... existing logic with depth + 1 passed to recursive calls
```
File: chromadb/api/models/CollectionCommon.py
Line: 868
7f43188
to
bc4e1cc
Compare
93a1561
to
2c673db
Compare
bc4e1cc
to
e7655bc
Compare
2c673db
to
13791bd
Compare
e7655bc
to
99144a3
Compare
13791bd
to
c51fb45
Compare
embedded_searches = [ | ||
self._embed_search_string_queries(search) for search in searches_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Potential error handling gap: If _embed_search_string_queries()
raises an exception for any search in the list comprehension, the entire operation will fail and no searches will be processed. Consider adding error handling to gracefully handle individual search embedding failures:
embedded_searches = []
for search in searches_list:
try:
embedded_searches.append(self._embed_search_string_queries(search))
except Exception as e:
logger.warning(f"Failed to embed search: {e}")
embedded_searches.append(search) # Use original search as fallback
Context for Agents
[**BestPractice**]
Potential error handling gap: If `_embed_search_string_queries()` raises an exception for any search in the list comprehension, the entire operation will fail and no searches will be processed. Consider adding error handling to gracefully handle individual search embedding failures:
```python
embedded_searches = []
for search in searches_list:
try:
embedded_searches.append(self._embed_search_string_queries(search))
except Exception as e:
logger.warning(f"Failed to embed search: {e}")
embedded_searches.append(search) # Use original search as fallback
```
File: chromadb/api/models/AsyncCollection.py
Line: 362
embedded_searches = [ | ||
self._embed_search_string_queries(search) for search in searches_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Same error handling concern applies here. If _embed_search_string_queries()
fails for any search in the list, all searches will fail to process. Consider adding individual error handling as suggested for the AsyncCollection version.
Context for Agents
[**BestPractice**]
Same error handling concern applies here. If `_embed_search_string_queries()` fails for any search in the list, all searches will fail to process. Consider adding individual error handling as suggested for the AsyncCollection version.
File: chromadb/api/models/Collection.py
Line: 366
try: | ||
embeddings = embedding_func.embed_query(input=[query_text]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Error handling issue: The try-except AttributeError
block only catches AttributeError
but the embedding function could fail with other exceptions (network errors, validation errors, etc.). Consider catching broader exceptions:
try:
embeddings = embedding_func.embed_query(input=[query_text])
except AttributeError:
# Fallback if embed_query doesn't exist
embeddings = embedding_func([query_text])
except Exception as e:
raise ValueError(
f"Failed to embed string query '{query_text}' using embedding function: {e}"
) from e
Context for Agents
[**BestPractice**]
Error handling issue: The `try-except AttributeError` block only catches `AttributeError` but the embedding function could fail with other exceptions (network errors, validation errors, etc.). Consider catching broader exceptions:
```python
try:
embeddings = embedding_func.embed_query(input=[query_text])
except AttributeError:
# Fallback if embed_query doesn't exist
embeddings = embedding_func([query_text])
except Exception as e:
raise ValueError(
f"Failed to embed string query '{query_text}' using embedding function: {e}"
) from e
```
File: chromadb/api/models/CollectionCommon.py
Line: 843
99144a3
to
4ee74c9
Compare
c51fb45
to
61c4404
Compare
22dbb01
to
a600f21
Compare
# Embed the query | ||
sparse_embedding = self._sparse_embed( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Error handling gap: Similar to the main embedding case, if self._sparse_embed()
fails, the error lacks context about which query and key failed. With multiple searches containing different sparse embedding keys, this makes debugging difficult.
Consider adding context:
Suggested Change
# Embed the query | |
sparse_embedding = self._sparse_embed( | |
# Embed the query | |
try: | |
sparse_embedding = self._sparse_embed( | |
input=[query_text], | |
sparse_embedding_function=embedding_func, | |
is_query=True, | |
) | |
except Exception as e: | |
raise ValueError( | |
f"Failed to embed string query '{query_text}' for sparse key '{key}': {e}" | |
) from e |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
Error handling gap: Similar to the main embedding case, if `self._sparse_embed()` fails, the error lacks context about which query and key failed. With multiple searches containing different sparse embedding keys, this makes debugging difficult.
Consider adding context:
<details>
<summary>Suggested Change</summary>
```suggestion
# Embed the query
try:
sparse_embedding = self._sparse_embed(
input=[query_text],
sparse_embedding_function=embedding_func,
is_query=True,
)
except Exception as e:
raise ValueError(
f"Failed to embed string query '{query_text}' for sparse key '{key}': {e}"
) from e
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: chromadb/api/models/CollectionCommon.py
Line: 812
a600f21
to
e509739
Compare
e509739
to
662e8ad
Compare
a9944bd
to
2bb4533
Compare
662e8ad
to
588a488
Compare
2bb4533
to
e346e31
Compare
588a488
to
3dd9ddc
Compare
e346e31
to
399e30b
Compare
3dd9ddc
to
1ab6fef
Compare
399e30b
to
7d4aeb3
Compare
1ab6fef
to
091aa54
Compare
7d4aeb3
to
ed69aca
Compare
091aa54
to
842d149
Compare
ed69aca
to
d567604
Compare
842d149
to
499ac84
Compare
return Knn( | ||
query=embedding[0], | ||
key=knn.key, | ||
limit=knn.limit, | ||
default=knn.default, | ||
return_rank=knn.return_rank, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
To reduce code repetition, you could use dataclasses.replace
to create new Knn
instances with an updated query
. This would make the code more concise and less prone to errors if the Knn
dataclass changes in the future.
You would need to add from dataclasses import replace
at the top of the file.
Then, this block and the similar ones at lines 824-830 and 854-860 could be simplified.
For example, this:
return Knn(
query=embedding[0],
key=knn.key,
limit=knn.limit,
default=knn.default,
return_rank=knn.return_rank,
)
becomes:
return replace(knn, query=embedding[0])
Note: dataclasses.replace()
is the standard library recommended approach for creating modified copies of dataclass instances. It's safer than manual construction because it handles __post_init__
calls correctly and is more maintainable when dataclass fields change. The codebase already imports dataclasses in multiple files, so this change aligns with existing patterns.
Context for Agents
[**BestPractice**]
To reduce code repetition, you could use `dataclasses.replace` to create new `Knn` instances with an updated `query`. This would make the code more concise and less prone to errors if the `Knn` dataclass changes in the future.
You would need to add `from dataclasses import replace` at the top of the file.
Then, this block and the similar ones at lines 824-830 and 854-860 could be simplified.
For example, this:
```python
return Knn(
query=embedding[0],
key=knn.key,
limit=knn.limit,
default=knn.default,
return_rank=knn.return_rank,
)
```
becomes:
```python
return replace(knn, query=embedding[0])
```
**Note**: `dataclasses.replace()` is the standard library recommended approach for creating modified copies of dataclass instances. It's safer than manual construction because it handles `__post_init__` calls correctly and is more maintainable when dataclass fields change. The codebase already imports dataclasses in multiple files, so this change aligns with existing patterns.
File: chromadb/api/models/CollectionCommon.py
Line: 785
const payloads = await Promise.all( | ||
items.map(async (search) => { | ||
const payload = toSearch(search).toPayload(); | ||
return this.embedSearchPayload(payload); | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Error Handling Gap: Embedding Failures
If any embedding operation fails during the Promise.all, the entire search operation will fail without meaningful error context. Consider adding error handling that identifies which specific query failed:
const payloads = await Promise.all(
items.map(async (search, index) => {
try {
const payload = toSearch(search).toPayload();
return await this.embedSearchPayload(payload);
} catch (error) {
throw new Error(
`Failed to embed search query at index ${index}: ${error.message}`
);
}
})
);
Note: Use standard Error
class instead of ChromaValueError
in TypeScript client, as error classes may differ between Python and TypeScript implementations.
Context for Agents
[**BestPractice**]
**Error Handling Gap: Embedding Failures**
If any embedding operation fails during the Promise.all, the entire search operation will fail without meaningful error context. Consider adding error handling that identifies which specific query failed:
```typescript
const payloads = await Promise.all(
items.map(async (search, index) => {
try {
const payload = toSearch(search).toPayload();
return await this.embedSearchPayload(payload);
} catch (error) {
throw new Error(
`Failed to embed search query at index ${index}: ${error.message}`
);
}
})
);
```
**Note**: Use standard `Error` class instead of `ChromaValueError` in TypeScript client, as error classes may differ between Python and TypeScript implementations.
File: clients/new-js/packages/chromadb/src/collection.ts
Line: 821
query = queryInput; | ||
} else if ( | ||
isPlainObject(queryInput) && | ||
Array.isArray((queryInput as SparseVector).indices) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[DataTypeCheck]
Type Safety Issue: Unsafe Type Assertion
The type assertion (queryInput as SparseVector)
assumes the object has the correct structure without runtime validation. If the object has indices
and values
properties but they're not arrays, this will cause runtime errors.
Suggested Change
query = queryInput; | |
} else if ( | |
isPlainObject(queryInput) && | |
Array.isArray((queryInput as SparseVector).indices) && | |
if ( | |
isPlainObject(queryInput) && | |
Array.isArray((queryInput as any).indices) && | |
Array.isArray((queryInput as any).values) && | |
(queryInput as any).indices.every((i: any) => typeof i === 'number') && | |
(queryInput as any).values.every((v: any) => typeof v === 'number') | |
) { | |
const sparse = queryInput as SparseVector; | |
query = { | |
indices: sparse.indices.slice(), | |
values: sparse.values.slice(), | |
}; |
Validation: ✅ This approach aligns with ChromaDB's internal SparseVector validation, which checks for proper array types, numeric values, and indices constraints.
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**DataTypeCheck**]
**Type Safety Issue: Unsafe Type Assertion**
The type assertion `(queryInput as SparseVector)` assumes the object has the correct structure without runtime validation. If the object has `indices` and `values` properties but they're not arrays, this will cause runtime errors.
<details>
<summary>Suggested Change</summary>
```suggestion
if (
isPlainObject(queryInput) &&
Array.isArray((queryInput as any).indices) &&
Array.isArray((queryInput as any).values) &&
(queryInput as any).indices.every((i: any) => typeof i === 'number') &&
(queryInput as any).values.every((v: any) => typeof v === 'number')
) {
const sparse = queryInput as SparseVector;
query = {
indices: sparse.indices.slice(),
values: sparse.values.slice(),
};
```
**Validation**: ✅ This approach aligns with ChromaDB's internal SparseVector validation, which checks for proper array types, numeric values, and indices constraints.
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: clients/new-js/packages/chromadb/src/execution/expression/rank.ts
Line: 378
|
||
search = Search().rank(Knn(query="hello world")) | ||
|
||
print(collection.schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
This print
statement appears to be a leftover from debugging. It should be removed to keep the test suite clean.
Context for Agents
[**BestPractice**]
This `print` statement appears to be a leftover from debugging. It should be removed to keep the test suite clean.
File: chromadb/test/api/test_schema_e2e.py
Line: 548
499ac84
to
bf5fcfb
Compare
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytest
for python,yarn test
for js,cargo test
for rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?