jairad26
Contributor

@jairad26 jairad26 commented Oct 13, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • This PR adds support for embedding string queries within Knn when a string is provided and an embedding function can be found for the given key
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Contributor Author

jairad26 commented Oct 13, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 7923660 to ac96cfc Compare October 16, 2025 17:08
@jairad26 jairad26 marked this pull request as ready for review October 16, 2025 17:09
Contributor

propel-code-bot bot commented Oct 16, 2025

Embed String Queries Directly in Search API for Knn Operations

This PR adds the ability to pass string queries directly to Knn search expressions in the Chroma API, in both the Python and JavaScript clients. When a string is provided as Knn.query, the client embeds it before the search is sent to the backend, using the collection's configured embedding function (for dense queries on the main embedding field) or a schema-defined embedding function (for sparse or other metadata fields). The embedding step recurses through arbitrarily nested rank expressions. Tests verify that string queries are replaced by embedded vectors, both in direct and nested Knn queries, and that an error is raised when no embedding function is available.

Key Changes

• Added _embed_knn_string_queries, _embed_rank_string_queries, and _embed_search_string_queries to chromadb/api/models/CollectionCommon.py to recursively embed string queries in Knn and rank/search objects
• Extended Knn operator in chromadb/execution/expression/operator.py (and TypeScript equivalent) to accept string queries, updating API documentation and type annotations
• Implemented analogous embedding logic in JavaScript client: added embedKnnLiteral, embedRankLiteral, and embedSearchPayload to clients/new-js/packages/chromadb/src/collection.ts
• Modified Collection.search and AsyncCollection.search to preprocess all searches and embed string queries before sending to backend
• Enhanced test coverage in chromadb/test/api/test_schema_e2e.py with new tests that assert string queries are embedded and intermediary hooks are invoked
• Updated docstrings and in-code documentation to make accepted Knn.query types explicit (string or vector), and improved example code throughout

Affected Areas

chromadb/api/models/CollectionCommon.py (core embedding logic)
chromadb/api/models/Collection.py, AsyncCollection.py (search preprocessing)
clients/new-js/packages/chromadb/src/collection.ts (JS embedding logic)
chromadb/execution/expression/operator.py, clients/new-js/packages/chromadb/src/execution/expression/rank.ts (updated Knn query support, type checks)
chromadb/test/api/test_schema_e2e.py (unit/integration tests)

This summary was automatically generated by @propel-code-bot
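
For illustration, the recursive preprocessing described in the summary can be sketched roughly as follows. This is a minimal, self-contained sketch and not the code in CollectionCommon.py: the `Knn` and `Sum` classes, the `embed_rank` helper, the `"#embedding"` default key, and `toy_embed` are all hypothetical stand-ins introduced for the example.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Tuple, Union

Vector = List[float]


@dataclass(frozen=True)
class Knn:
    # Hypothetical, simplified stand-in for the real Knn operator.
    query: Union[str, Vector]
    key: str = "#embedding"  # placeholder key name, an assumption


@dataclass(frozen=True)
class Sum:
    # Hypothetical composite rank expression holding child ranks.
    ranks: Tuple[Any, ...]


def embed_rank(rank: Any, embed: Callable[[str], Vector]) -> Any:
    """Return a copy of `rank` in which every string Knn.query is embedded."""
    if isinstance(rank, Knn):
        if isinstance(rank.query, str):
            return replace(rank, query=embed(rank.query))
        return rank  # already a vector, leave untouched
    if isinstance(rank, Sum):
        return Sum(tuple(embed_rank(child, embed) for child in rank.ranks))
    return rank  # constants and unknown nodes pass through


def toy_embed(text: str) -> Vector:
    # Toy "embedding": text length and number of spaces.
    return [float(len(text)), float(text.count(" "))]


nested = Sum((Knn("hello world"), Sum((Knn([0.1, 0.2]), Knn("hi")))))
embedded = embed_rank(nested, toy_embed)
```

Vector queries pass through unchanged in this sketch, so a search that already carries embeddings would not be altered by the preprocessing step.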

@jairad26 jairad26 changed the base branch from main to graphite-base/5599 October 16, 2025 18:56
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from ac96cfc to 7f43188 Compare October 16, 2025 18:56
@jairad26 jairad26 changed the base branch from graphite-base/5599 to jai/schema-e2e-tests October 16, 2025 18:56
# Handle main embedding field
if key == EMBEDDING_KEY:
    # Use the collection's main embedding function
    embedding = self._embed(input=[query_text], is_query=True)

[BestPractice]

Error handling gap: Multiple embedding function calls (self._embed, self._sparse_embed, embedding_func.embed_query) can raise exceptions but aren't wrapped in try/except blocks. If any embedding function fails (network issues, model errors, invalid input), the error will propagate uncaught and could crash the application.

Consider wrapping embedding calls:

try:
    embedding = self._embed(input=[query_text], is_query=True)
except Exception as e:
    raise ValueError(f"Failed to embed query '{query_text}': {e}") from e

File: chromadb/api/models/CollectionCommon.py
Line: 773

f"Please provide an embedded vector or configure an embedding function."
)

def _embed_rank_string_queries(self, rank: Any) -> Any:

[BestPractice]

Potential infinite recursion: The method _embed_rank_string_queries recursively processes rank expressions but doesn't have a maximum depth check. For deeply nested or circular rank structures, this could cause a stack overflow.

Consider adding a depth limit:

def _embed_rank_string_queries(self, rank: Any, depth: int = 0, max_depth: int = 100) -> Any:
    if depth > max_depth:
        raise ValueError(f"Maximum recursion depth ({max_depth}) exceeded in rank expression")
    # ... existing logic with depth + 1 passed to recursive calls

File: chromadb/api/models/CollectionCommon.py
Line: 868
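
A depth guard of the kind suggested in this comment might behave as in the sketch below. It is a toy that walks nested lists rather than real rank expressions; `count_string_queries` and `MAX_DEPTH` are hypothetical names, and the limit of 100 simply mirrors the value in the suggestion.

```python
from typing import Any

MAX_DEPTH = 100  # hypothetical limit, mirroring the suggestion


def count_string_queries(expr: Any, depth: int = 0, max_depth: int = MAX_DEPTH) -> int:
    """Count string leaves in a nested list, refusing to go deeper than max_depth."""
    if depth > max_depth:
        raise ValueError(
            f"Maximum recursion depth ({max_depth}) exceeded in rank expression"
        )
    if isinstance(expr, str):
        return 1
    if isinstance(expr, list):
        # Each level of nesting passes depth + 1 to the recursive call.
        return sum(count_string_queries(e, depth + 1, max_depth) for e in expr)
    return 0  # numbers and other leaves are not string queries


# A well-formed nested expression is processed normally.
shallow_count = count_string_queries(["a", ["b", [0.1, 0.2]]])

# A self-referencing structure trips the guard instead of hitting RecursionError.
cyclic: list = ["q"]
cyclic.append(cyclic)
try:
    count_string_queries(cyclic)
    guard_tripped = False
except ValueError:
    guard_tripped = True
```

With the guard in place, the failure is a `ValueError` carrying a readable message rather than a `RecursionError` raised from deep inside the traversal.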

@jairad26 jairad26 changed the base branch from jai/schema-e2e-tests to graphite-base/5599 October 16, 2025 19:17
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 7f43188 to bc4e1cc Compare October 16, 2025 20:12
@jairad26 jairad26 changed the base branch from graphite-base/5599 to jai/schema-js-impl October 16, 2025 20:12
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from bc4e1cc to e7655bc Compare October 16, 2025 20:18
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 16, 2025
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from e7655bc to 99144a3 Compare October 16, 2025 20:19
Comment on lines +361 to +362
embedded_searches = [
    self._embed_search_string_queries(search) for search in searches_list

[BestPractice]

Potential error handling gap: If _embed_search_string_queries() raises an exception for any search in the list comprehension, the entire operation will fail and no searches will be processed. Consider adding error handling to gracefully handle individual search embedding failures:

embedded_searches = []
for search in searches_list:
    try:
        embedded_searches.append(self._embed_search_string_queries(search))
    except Exception as e:
        logger.warning(f"Failed to embed search: {e}")
        embedded_searches.append(search)  # Use original search as fallback

File: chromadb/api/models/AsyncCollection.py
Line: 362

Comment on lines +365 to +366
embedded_searches = [
    self._embed_search_string_queries(search) for search in searches_list

[BestPractice]

Same error handling concern applies here. If _embed_search_string_queries() fails for any search in the list, all searches will fail to process. Consider adding individual error handling as suggested for the AsyncCollection version.

File: chromadb/api/models/Collection.py
Line: 366

Comment on lines +842 to +843
try:
    embeddings = embedding_func.embed_query(input=[query_text])

[BestPractice]

Error handling issue: The try-except AttributeError block only catches AttributeError but the embedding function could fail with other exceptions (network errors, validation errors, etc.). Consider catching broader exceptions:

try:
    embeddings = embedding_func.embed_query(input=[query_text])
except AttributeError:
    # Fallback if embed_query doesn't exist
    embeddings = embedding_func([query_text])
except Exception as e:
    raise ValueError(
        f"Failed to embed string query '{query_text}' using embedding function: {e}"
    ) from e

File: chromadb/api/models/CollectionCommon.py
Line: 843
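
One caveat with an `except AttributeError` fallback is that it also catches an `AttributeError` raised inside `embed_query` itself, and then silently falls back to the plain call. A possible alternative, sketched below, looks the method up first and wraps only the call. The three embedding-function classes and `embed_query_text` are hypothetical stand-ins, not Chroma classes.

```python
from typing import Any, List


class QueryAwareEF:
    # Hypothetical embedding function with a dedicated query method.
    def __call__(self, input: List[str]) -> List[List[float]]:
        return [[0.0] for _ in input]

    def embed_query(self, input: List[str]) -> List[List[float]]:
        return [[1.0] for _ in input]


class PlainEF:
    # Hypothetical embedding function without embed_query.
    def __call__(self, input: List[str]) -> List[List[float]]:
        return [[0.0] for _ in input]


class BuggyEF(PlainEF):
    # embed_query exists but fails internally with an AttributeError.
    def embed_query(self, input: List[str]) -> List[List[float]]:
        return self.missing_model.encode(input)  # attribute does not exist


def embed_query_text(ef: Any, text: str) -> List[float]:
    """Prefer embed_query when present; report any failure with context."""
    embed_query = getattr(ef, "embed_query", None)
    try:
        if callable(embed_query):
            vectors = embed_query(input=[text])
        else:
            vectors = ef([text])  # fallback for functions without embed_query
    except Exception as e:
        raise ValueError(f"Failed to embed string query {text!r}: {e}") from e
    return vectors[0]


query_aware = embed_query_text(QueryAwareEF(), "hello")
plain = embed_query_text(PlainEF(), "hello")
try:
    embed_query_text(BuggyEF(), "hello")
    buggy_error = None
except ValueError as err:
    buggy_error = err
```

In this sketch the internal bug in `BuggyEF` surfaces as a `ValueError` whose `__cause__` is the original `AttributeError`, instead of being mistaken for a missing method.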

@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 99144a3 to 4ee74c9 Compare October 16, 2025 20:28
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 22dbb01 to a600f21 Compare October 17, 2025 00:40
Comment on lines +811 to +812
# Embed the query
sparse_embedding = self._sparse_embed(

[BestPractice]

Error handling gap: Similar to the main embedding case, if self._sparse_embed() fails, the error lacks context about which query and key failed. With multiple searches containing different sparse embedding keys, this makes debugging difficult.

Consider adding context:

Suggested change

# Embed the query
try:
    sparse_embedding = self._sparse_embed(
        input=[query_text],
        sparse_embedding_function=embedding_func,
        is_query=True,
    )
except Exception as e:
    raise ValueError(
        f"Failed to embed string query '{query_text}' for sparse key '{key}': {e}"
    ) from e

File: chromadb/api/models/CollectionCommon.py
Line: 812

@jairad26 jairad26 force-pushed the jai/embed-search-query branch from a600f21 to e509739 Compare October 17, 2025 00:46
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 17, 2025
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from e509739 to 662e8ad Compare October 17, 2025 03:00
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 17, 2025
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 662e8ad to 588a488 Compare October 17, 2025 04:21
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 17, 2025
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 588a488 to 3dd9ddc Compare October 20, 2025 16:42
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 20, 2025
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 3dd9ddc to 1ab6fef Compare October 20, 2025 17:06
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 1ab6fef to 091aa54 Compare October 20, 2025 17:37
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 20, 2025
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 091aa54 to 842d149 Compare October 20, 2025 18:25
@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 842d149 to 499ac84 Compare October 20, 2025 18:26
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 20, 2025
Comment on lines +779 to +785
return Knn(
    query=embedding[0],
    key=knn.key,
    limit=knn.limit,
    default=knn.default,
    return_rank=knn.return_rank,
)

[BestPractice]

To reduce code repetition, you could use dataclasses.replace to create new Knn instances with an updated query. This would make the code more concise and less prone to errors if the Knn dataclass changes in the future.

You would need to add from dataclasses import replace at the top of the file.

Then, this block and the similar ones at lines 824-830 and 854-860 could be simplified.

For example, this:

            return Knn(
                query=embedding[0],
                key=knn.key,
                limit=knn.limit,
                default=knn.default,
                return_rank=knn.return_rank,
            )

becomes:

            return replace(knn, query=embedding[0])

Note: dataclasses.replace() is the standard library recommended approach for creating modified copies of dataclass instances. It's safer than manual construction because it handles __post_init__ calls correctly and is more maintainable when dataclass fields change. The codebase already imports dataclasses in multiple files, so this change aligns with existing patterns.

File: chromadb/api/models/CollectionCommon.py
Line: 785
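
The behavior of `dataclasses.replace` can be checked in isolation. The sketch below uses a simplified, hypothetical `Knn` dataclass (the field defaults and the `limit` check are made up for the example) to show that unspecified fields are carried over and that `__post_init__` validation runs again on the copy.

```python
from dataclasses import dataclass, replace
from typing import List, Union


@dataclass
class Knn:
    # Hypothetical, simplified version of the Knn operator for illustration.
    query: Union[str, List[float]]
    key: str = "#embedding"  # placeholder default, an assumption
    limit: int = 16
    return_rank: bool = False

    def __post_init__(self) -> None:
        # Made-up validation, used to show that replace() re-runs it.
        if self.limit <= 0:
            raise ValueError("limit must be positive")


original = Knn(query="hello world", limit=5, return_rank=True)

# Only `query` changes; key, limit and return_rank are copied from `original`.
embedded = replace(original, query=[0.1, 0.2, 0.3])

# replace() builds the copy through __init__, so __post_init__ validates it too.
try:
    replace(original, limit=0)
    validation_ran = False
except ValueError:
    validation_ran = True
```

Because the copy is built from whatever fields the dataclass currently has, adding a field to `Knn` later would not require touching the call sites that use `replace`.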

Comment on lines +817 to +821
const payloads = await Promise.all(
  items.map(async (search) => {
    const payload = toSearch(search).toPayload();
    return this.embedSearchPayload(payload);
  }),

[BestPractice]

Error Handling Gap: Embedding Failures

If any embedding operation fails during the Promise.all, the entire search operation will fail without meaningful error context. Consider adding error handling that identifies which specific query failed:

const payloads = await Promise.all(
  items.map(async (search, index) => {
    try {
      const payload = toSearch(search).toPayload();
      return await this.embedSearchPayload(payload);
    } catch (error) {
      throw new Error(
        `Failed to embed search query at index ${index}: ${error.message}`
      );
    }
  })
);

Note: Use standard Error class instead of ChromaValueError in TypeScript client, as error classes may differ between Python and TypeScript implementations.

File: clients/new-js/packages/chromadb/src/collection.ts
Line: 821

Comment on lines +375 to +378
  query = queryInput;
} else if (
  isPlainObject(queryInput) &&
  Array.isArray((queryInput as SparseVector).indices) &&

[DataTypeCheck]

Type Safety Issue: Unsafe Type Assertion

The type assertion (queryInput as SparseVector) assumes the object has the correct structure without runtime validation. If the object has indices and values properties but they're not arrays, this will cause runtime errors.

Suggested change

if (
  isPlainObject(queryInput) &&
  Array.isArray((queryInput as any).indices) &&
  Array.isArray((queryInput as any).values) &&
  (queryInput as any).indices.every((i: any) => typeof i === 'number') &&
  (queryInput as any).values.every((v: any) => typeof v === 'number')
) {
  const sparse = queryInput as SparseVector;
  query = {
    indices: sparse.indices.slice(),
    values: sparse.values.slice(),
  };

Validation: ✅ This approach aligns with ChromaDB's internal SparseVector validation, which checks for proper array types, numeric values, and indices constraints.

File: clients/new-js/packages/chromadb/src/execution/expression/rank.ts
Line: 378


search = Search().rank(Knn(query="hello world"))

print(collection.schema)

[BestPractice]

This print statement appears to be a leftover from debugging. It should be removed to keep the test suite clean.

File: chromadb/test/api/test_schema_e2e.py
Line: 548

@jairad26 jairad26 force-pushed the jai/embed-search-query branch from 499ac84 to bf5fcfb Compare October 20, 2025 18:40
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 20, 2025