-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[ENH] Add Schema to js client #5621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
c51fb45
to
61c4404
Compare
f40f1ea
to
e3a6cb8
Compare
61c4404
to
3316d46
Compare
e3a6cb8
to
5439079
Compare
3316d46
to
80453d2
Compare
fe26daa
to
ac00bdc
Compare
8bd6af6
to
be61947
Compare
ac00bdc
to
a9944bd
Compare
Add Schema Support and Serialization/Deserialization to JS Client This pull request introduces a comprehensive schema abstraction to the ChromaDB JavaScript client, aligning its capabilities with the Python implementation. It provides infrastructure for describing, configuring, serializing, and deserializing schema information related to vector and sparse indexes at both the global (defaults) and per-key levels within a collection. The changes also integrate schema-awareness into the client APIs, handle embedding function selection via schema, and introduce full parity schema testing to ensure correct behavior. Significant refactoring to the client and collection layers ensures seamless schema roundtrip between server and client, and extensive automated tests reinforce correctness and future regressions against the reference implementation. Key Changes• Introduced new Affected Areas• This summary was automatically generated by @propel-code-bot |
data.map(async (collection) => | ||
new CollectionImpl({ | ||
chromaClient: this, | ||
apiClient: this.apiClient, | ||
name: collection.name, | ||
id: collection.id, | ||
embeddingFunction: await getEmbeddingFunction( | ||
collection.name, | ||
collection.configuration_json.embedding_function ?? undefined, | ||
), | ||
configuration: collection.configuration_json, | ||
metadata: collection.metadata ?? undefined, | ||
schema: Schema.deserializeFromJSON(collection.schema ?? undefined), | ||
}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The logic for creating a CollectionImpl
instance from an API response is repeated in listCollections
, createCollection
, getCollection
, and getOrCreateCollection
. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method, for example _collectionFromResponseData(data, embeddingFunction?)
. This would centralize the construction of collection objects from API data.
ChromaDB Best Practice: Following the official ChromaDB JavaScript client patterns, API responses should be consistently transformed into Collection objects. A helper method ensures consistent handling of the API response structure and metadata across all collection operations.
Context for Agents
[**BestPractice**]
The logic for creating a `CollectionImpl` instance from an API response is repeated in `listCollections`, `createCollection`, `getCollection`, and `getOrCreateCollection`. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method, for example `_collectionFromResponseData(data, embeddingFunction?)`. This would centralize the construction of collection objects from API data.
**ChromaDB Best Practice**: Following the official ChromaDB JavaScript client patterns, API responses should be consistently transformed into Collection objects. A helper method ensures consistent handling of the API response structure and metadata across all collection operations.
File: clients/new-js/packages/chromadb/src/chroma-client.ts
Line: 232
be61947
to
9322a2c
Compare
a9944bd
to
2bb4533
Compare
private getSchemaEmbeddingFunction(): EmbeddingFunction | undefined { | ||
const schema = this._schema; | ||
if (!schema) return undefined; | ||
|
||
const schemaOverride = schema.keys[EMBEDDING_KEY]; | ||
const overrideFunction = schemaOverride?.floatList?.vectorIndex?.config | ||
.embeddingFunction; | ||
if (overrideFunction) { | ||
return overrideFunction; | ||
} | ||
|
||
const defaultFunction = schema.defaults.floatList?.vectorIndex?.config | ||
.embeddingFunction; | ||
return defaultFunction ?? undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
This function's logic for finding the embedding function can be made more concise. The current implementation with an if
statement and fallback can be simplified by using logical operators to express the preference for the override function, then the default.
Suggested Change
private getSchemaEmbeddingFunction(): EmbeddingFunction | undefined { | |
const schema = this._schema; | |
if (!schema) return undefined; | |
const schemaOverride = schema.keys[EMBEDDING_KEY]; | |
const overrideFunction = schemaOverride?.floatList?.vectorIndex?.config | |
.embeddingFunction; | |
if (overrideFunction) { | |
return overrideFunction; | |
} | |
const defaultFunction = schema.defaults.floatList?.vectorIndex?.config | |
.embeddingFunction; | |
return defaultFunction ?? undefined; | |
private getSchemaEmbeddingFunction(): EmbeddingFunction | undefined { | |
if (!this._schema) { | |
return undefined; | |
} | |
const overrideFunction = | |
this._schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config | |
.embeddingFunction; | |
const defaultFunction = | |
this._schema.defaults.floatList?.vectorIndex?.config.embeddingFunction; | |
// Use override if available, otherwise default. Coalesce to undefined if both are null. | |
return (overrideFunction || defaultFunction) ?? undefined; | |
} |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
This function's logic for finding the embedding function can be made more concise. The current implementation with an `if` statement and fallback can be simplified by using logical operators to express the preference for the override function, then the default.
<details>
<summary>Suggested Change</summary>
```suggestion
private getSchemaEmbeddingFunction(): EmbeddingFunction | undefined {
if (!this._schema) {
return undefined;
}
const overrideFunction =
this._schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config
.embeddingFunction;
const defaultFunction =
this._schema.defaults.floatList?.vectorIndex?.config.embeddingFunction;
// Use override if available, otherwise default. Coalesce to undefined if both are null.
return (overrideFunction || defaultFunction) ?? undefined;
}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: clients/new-js/packages/chromadb/src/collection.ts
Line: 339
@@ -0,0 +1,1002 @@ | |||
import type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The new schema.ts
file is quite large and contains many distinct concepts (constants, config classes, type classes, utility functions, and the main Schema
class). For better maintainability and code organization, consider splitting this file into smaller, more focused modules.
For example, you could structure it like this:
src/schema/constants.ts
src/schema/index-configs.ts
src/schema/value-types.ts
src/schema/utils.ts
src/schema/schema.ts
(for the main class)src/schema/index.ts
(to re-export everything)
This would make the code easier to navigate and understand.
Context for Agents
[**BestPractice**]
The new `schema.ts` file is quite large and contains many distinct concepts (constants, config classes, type classes, utility functions, and the main `Schema` class). For better maintainability and code organization, consider splitting this file into smaller, more focused modules.
For example, you could structure it like this:
- `src/schema/constants.ts`
- `src/schema/index-configs.ts`
- `src/schema/value-types.ts`
- `src/schema/utils.ts`
- `src/schema/schema.ts` (for the main class)
- `src/schema/index.ts` (to re-export everything)
This would make the code easier to navigate and understand.
File: clients/new-js/packages/chromadb/src/schema.ts
Line: 1
9322a2c
to
8e1d146
Compare
2bb4533
to
e346e31
Compare
private async embed(inputs: string[], isQuery: boolean): Promise<number[][]> { | ||
if (!this._embeddingFunction) { | ||
const embeddingFunction = | ||
this._embeddingFunction ?? this.getSchemaEmbeddingFunction(); | ||
|
||
if (!embeddingFunction) { | ||
throw new ChromaValueError( | ||
"Embedding function must be defined for operations requiring embeddings.", | ||
); | ||
} | ||
|
||
if (this._embeddingFunction.generateForQueries && isQuery) { | ||
return await this._embeddingFunction.generateForQueries(inputs); | ||
} else { | ||
return await this._embeddingFunction.generate(inputs); | ||
} | ||
if (isQuery && embeddingFunction.generateForQueries) { | ||
return await embeddingFunction.generateForQueries(inputs); | ||
} | ||
|
||
return await embeddingFunction.generate(inputs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The embedding function resolution logic has improved error handling, but the error message should be more descriptive about resolution options. ChromaDB 3.0+ supports embedding functions at both the collection instance level and schema level, so users need clear guidance on how to resolve missing embedding function errors.
Suggested Change
private async embed(inputs: string[], isQuery: boolean): Promise<number[][]> { | |
if (!this._embeddingFunction) { | |
const embeddingFunction = | |
this._embeddingFunction ?? this.getSchemaEmbeddingFunction(); | |
if (!embeddingFunction) { | |
throw new ChromaValueError( | |
"Embedding function must be defined for operations requiring embeddings.", | |
); | |
} | |
if (this._embeddingFunction.generateForQueries && isQuery) { | |
return await this._embeddingFunction.generateForQueries(inputs); | |
} else { | |
return await this._embeddingFunction.generate(inputs); | |
} | |
if (isQuery && embeddingFunction.generateForQueries) { | |
return await embeddingFunction.generateForQueries(inputs); | |
} | |
return await embeddingFunction.generate(inputs); | |
private async embed(inputs: string[], isQuery: boolean): Promise<number[][]> { | |
const embeddingFunction = | |
this._embeddingFunction ?? this.getSchemaEmbeddingFunction(); | |
if (!embeddingFunction) { | |
throw new ChromaValueError( | |
"Embedding function must be defined for operations requiring embeddings. " + | |
"Provide an embedding function when creating the collection or configure one in the schema. " + | |
"See https://docs.trychroma.com/docs/embeddings/embedding-functions for available options.", | |
); | |
} | |
if (isQuery && embeddingFunction.generateForQueries) { | |
return await embeddingFunction.generateForQueries(inputs); | |
} | |
return await embeddingFunction.generate(inputs); | |
} |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
The embedding function resolution logic has improved error handling, but the error message should be more descriptive about resolution options. ChromaDB 3.0+ supports embedding functions at both the collection instance level and schema level, so users need clear guidance on how to resolve missing embedding function errors.
<details>
<summary>Suggested Change</summary>
```suggestion
private async embed(inputs: string[], isQuery: boolean): Promise<number[][]> {
const embeddingFunction =
this._embeddingFunction ?? this.getSchemaEmbeddingFunction();
if (!embeddingFunction) {
throw new ChromaValueError(
"Embedding function must be defined for operations requiring embeddings. " +
"Provide an embedding function when creating the collection or configure one in the schema. " +
"See https://docs.trychroma.com/docs/embeddings/embedding-functions for available options.",
);
}
if (isQuery && embeddingFunction.generateForQueries) {
return await embeddingFunction.generateForQueries(inputs);
}
return await embeddingFunction.generate(inputs);
}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: clients/new-js/packages/chromadb/src/collection.ts
Line: 323
private validateSingleSparseVectorIndex(targetKey: string): void { | ||
for (const [existingKey, valueTypes] of Object.entries(this.keys)) { | ||
if (existingKey === targetKey) continue; | ||
const sparseIndex = valueTypes.sparseVector?.sparseVectorIndex; | ||
if (sparseIndex?.enabled) { | ||
throw new Error( | ||
`Cannot enable sparse vector index on key '${targetKey}'. A sparse vector index is already enabled on key '${existingKey}'. Only one sparse vector index is allowed per collection.`, | ||
); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
The validateSingleSparseVectorIndex
method enforces that only one sparse vector index can be enabled per collection, but the error message should provide clearer resolution guidance. Sparse vector indexing is an advanced ChromaDB feature with specific constraints that users need to understand.
Suggested Change
private validateSingleSparseVectorIndex(targetKey: string): void { | |
for (const [existingKey, valueTypes] of Object.entries(this.keys)) { | |
if (existingKey === targetKey) continue; | |
const sparseIndex = valueTypes.sparseVector?.sparseVectorIndex; | |
if (sparseIndex?.enabled) { | |
throw new Error( | |
`Cannot enable sparse vector index on key '${targetKey}'. A sparse vector index is already enabled on key '${existingKey}'. Only one sparse vector index is allowed per collection.`, | |
); | |
} | |
} | |
} | |
private validateSingleSparseVectorIndex(targetKey: string): void { | |
for (const [existingKey, valueTypes] of Object.entries(this.keys)) { | |
if (existingKey === targetKey) continue; | |
const sparseIndex = valueTypes.sparseVector?.sparseVectorIndex; | |
if (sparseIndex?.enabled) { | |
throw new Error( | |
`Cannot enable sparse vector index on key '${targetKey}'. A sparse vector index is already enabled on key '${existingKey}'. ` + | |
`Only one sparse vector index is allowed per collection. To resolve this conflict, either: ` + | |
`1) Disable the existing index using deleteIndex(new SparseVectorIndexConfig(), '${existingKey}'), or ` + | |
`2) Use a different key name for your new sparse vector index.`, | |
); | |
} | |
} | |
} |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
The `validateSingleSparseVectorIndex` method enforces that only one sparse vector index can be enabled per collection, but the error message should provide clearer resolution guidance. Sparse vector indexing is an advanced ChromaDB feature with specific constraints that users need to understand.
<details>
<summary>Suggested Change</summary>
```suggestion
private validateSingleSparseVectorIndex(targetKey: string): void {
for (const [existingKey, valueTypes] of Object.entries(this.keys)) {
if (existingKey === targetKey) continue;
const sparseIndex = valueTypes.sparseVector?.sparseVectorIndex;
if (sparseIndex?.enabled) {
throw new Error(
`Cannot enable sparse vector index on key '${targetKey}'. A sparse vector index is already enabled on key '${existingKey}'. ` +
`Only one sparse vector index is allowed per collection. To resolve this conflict, either: ` +
`1) Disable the existing index using deleteIndex(new SparseVectorIndexConfig(), '${existingKey}'), or ` +
`2) Use a different key name for your new sparse vector index.`,
);
}
}
}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: clients/new-js/packages/chromadb/src/schema.ts
Line: 633
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
added schema unit tests matching python ones
pytest
for python,yarn test
for js,cargo test
for rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?