@jairad26 jairad26 commented Oct 16, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • This PR adds schema support to the JS client, along with tests. It adds logic to embed sparse vectors using embedding functions (EFs), uses the schema for dense vectors, and adds tests to ensure serialization and deserialization work.
  • New functionality
    • ...

Test plan

How are these changes tested?

Added schema unit tests matching the Python ones.

  • [x] Tests pass locally with pytest for Python, yarn test for JS, cargo test for Rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

jairad26 commented Oct 16, 2025

@jairad26 jairad26 force-pushed the jai/schema-js-impl branch 3 times, most recently from c51fb45 to 61c4404 Compare October 16, 2025 20:28
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 16, 2025
@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch from f40f1ea to e3a6cb8 Compare October 16, 2025 20:33
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 16, 2025
@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch from e3a6cb8 to 5439079 Compare October 16, 2025 22:33
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 16, 2025
@jairad26 jairad26 force-pushed the jai/schema-js-impl branch 2 times, most recently from fe26daa to ac00bdc Compare October 16, 2025 23:35
@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch 2 times, most recently from 8bd6af6 to be61947 Compare October 17, 2025 00:40
@jairad26 jairad26 marked this pull request as ready for review October 17, 2025 00:40
propel-code-bot bot commented Oct 17, 2025

Add Schema Support and Serialization/Deserialization to JS Client

This pull request introduces a schema abstraction to the ChromaDB JavaScript client, aligning its capabilities with the Python implementation. It provides infrastructure for describing, configuring, serializing, and deserializing schema information for vector and sparse indexes, both globally (defaults) and per key within a collection. The changes also make the client APIs schema-aware, select embedding functions via the schema, and add schema tests at parity with the Python suite. Refactoring in the client and collection layers lets the schema round-trip between server and client, and the automated tests verify correctness and guard against future regressions relative to the reference implementation.

Key Changes

• Introduced new schema.ts module providing schema/value type/config classes, serialization, deserialization, and strong typing for value/index types.
• Integrated schema propagation through all collection operations (create, get, list, etc.) in chroma-client.ts and collection.ts.
• Modified CollectionImpl to support schema object, allow fallback to schema-provided embedding functions, and enable index configuration awareness.
• Ensured schema is included and processed in all create/get/getOrCreate/list operations, with correct (de)serialization hooks.
• Added 1000+ line test suite (schema.test.ts) for schema construction, schema mutations, key and global overrides, serialization roundtrips, and edge cases.
• Expanded and formalized embedding function registry types and helpers for both dense and sparse embeddings.
• Minor API/typing updates to surface schema information throughout public client types and interfaces.
• Updated entrypoints to export the new schema types.
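
To make the "serialization roundtrip" item above concrete, here is a minimal sketch. It does not reproduce schema.ts: `MiniSchema`, `MiniIndexConfig`, and the JSON shape are hypothetical stand-ins, assuming only that a schema holds global defaults plus per-key overrides and that a missing payload deserializes to no schema.

```typescript
// Hypothetical stand-ins: the real schema.ts classes and wire format are not shown in this thread.
interface MiniIndexConfig {
  enabled: boolean;
}

// Maps a value-type name (for example "floatList") to its index config.
type MiniValueTypes = Record<string, MiniIndexConfig>;

interface MiniSchemaJSON {
  defaults: MiniValueTypes;
  keys: Record<string, MiniValueTypes>;
}

class MiniSchema {
  constructor(
    public defaults: MiniValueTypes = {},
    public keys: Record<string, MiniValueTypes> = {},
  ) {}

  // Produce a plain, deep-copied object so callers cannot mutate the schema through it.
  serializeToJSON(): MiniSchemaJSON {
    const plain: MiniSchemaJSON = { defaults: this.defaults, keys: this.keys };
    return JSON.parse(JSON.stringify(plain)) as MiniSchemaJSON;
  }

  // Assumption: a null or missing payload means "no schema" rather than an error.
  static deserializeFromJSON(json?: MiniSchemaJSON | null): MiniSchema | undefined {
    if (!json) return undefined;
    const copy = JSON.parse(JSON.stringify(json)) as MiniSchemaJSON;
    return new MiniSchema(copy.defaults, copy.keys);
  }
}
```

A roundtrip test in this style serializes a schema, deserializes the result, and checks that serializing again yields the same JSON.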

Affected Areas

clients/new-js/packages/chromadb/src/schema.ts (entirely new, central schema logic)
clients/new-js/packages/chromadb/test/schema.test.ts (full schema test suite)
clients/new-js/packages/chromadb/src/chroma-client.ts (createCollection, getCollection, etc., extended for schema support)
clients/new-js/packages/chromadb/src/collection.ts (schema field, embedding function fallback, API integration)
clients/new-js/packages/chromadb/src/embedding-function.ts (typed registry, AnyEmbeddingFunction)
clients/new-js/packages/chromadb/src/index.ts (now exports schema)

This summary was automatically generated by @propel-code-bot

Comment on lines +219 to +232
data.map(async (collection) =>
  new CollectionImpl({
    chromaClient: this,
    apiClient: this.apiClient,
    name: collection.name,
    id: collection.id,
    embeddingFunction: await getEmbeddingFunction(
      collection.name,
      collection.configuration_json.embedding_function ?? undefined,
    ),
    configuration: collection.configuration_json,
    metadata: collection.metadata ?? undefined,
    schema: Schema.deserializeFromJSON(collection.schema ?? undefined),
  }),

[BestPractice]

The logic for creating a CollectionImpl instance from an API response is repeated in listCollections, createCollection, getCollection, and getOrCreateCollection. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method, for example _collectionFromResponseData(data, embeddingFunction?). This would centralize the construction of collection objects from API data.

ChromaDB Best Practice: Following the official ChromaDB JavaScript client patterns, API responses should be consistently transformed into Collection objects. A helper method ensures consistent handling of the API response structure and metadata across all collection operations.

File: clients/new-js/packages/chromadb/src/chroma-client.ts
Line: 232
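
The helper proposed in this comment could look roughly like the sketch below. Every type here is a simplified stand-in rather than the real client type, `collectionArgsFromResponse` and `ResolveEmbeddingFunction` are hypothetical names, and the rule that an explicitly passed embedding function takes precedence over the stored configuration is an assumption, not something the PR confirms.

```typescript
// All types are simplified stand-ins, not the real chromadb client types.
interface EmbeddingFunctionLike {
  generate(texts: string[]): Promise<number[][]>;
}

interface CollectionResponse {
  id: string;
  name: string;
  metadata?: Record<string, unknown> | null;
  configuration_json: { embedding_function?: { name: string } | null };
  schema?: object | null;
}

interface CollectionArgs {
  id: string;
  name: string;
  metadata?: Record<string, unknown>;
  configuration: CollectionResponse["configuration_json"];
  embeddingFunction?: EmbeddingFunctionLike;
  schema?: object;
}

// Hypothetical resolver standing in for the client's getEmbeddingFunction.
type ResolveEmbeddingFunction = (
  collectionName: string,
  efConfig?: { name: string },
) => Promise<EmbeddingFunctionLike | undefined>;

// One place that maps an API response to constructor arguments, so that
// list/create/get/getOrCreate cannot drift apart.
async function collectionArgsFromResponse(
  data: CollectionResponse,
  resolve: ResolveEmbeddingFunction,
  embeddingFunction?: EmbeddingFunctionLike,
): Promise<CollectionArgs> {
  return {
    id: data.id,
    name: data.name,
    metadata: data.metadata ?? undefined,
    configuration: data.configuration_json,
    // Assumption: an explicitly passed function wins; otherwise resolve from stored config.
    embeddingFunction:
      embeddingFunction ??
      (await resolve(data.name, data.configuration_json.embedding_function ?? undefined)),
    schema: data.schema ?? undefined,
  };
}
```

Each call site would then reduce to building a collection from the returned arguments, with `listCollections` mapping the helper over the response array.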

@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch from be61947 to 9322a2c Compare October 17, 2025 03:00
Comment on lines +326 to +339
private getSchemaEmbeddingFunction(): EmbeddingFunction | undefined {
  const schema = this._schema;
  if (!schema) return undefined;

  const schemaOverride = schema.keys[EMBEDDING_KEY];
  const overrideFunction = schemaOverride?.floatList?.vectorIndex?.config
    .embeddingFunction;
  if (overrideFunction) {
    return overrideFunction;
  }

  const defaultFunction = schema.defaults.floatList?.vectorIndex?.config
    .embeddingFunction;
  return defaultFunction ?? undefined;

[BestPractice]

This function's logic for finding the embedding function can be made more concise. The current implementation with an if statement and fallback can be simplified by using logical operators to express the preference for the override function, then the default.

<details>
<summary>Suggested Change</summary>

```suggestion
  private getSchemaEmbeddingFunction(): EmbeddingFunction | undefined {
    if (!this._schema) {
      return undefined;
    }

    const overrideFunction =
      this._schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config
        .embeddingFunction;

    const defaultFunction =
      this._schema.defaults.floatList?.vectorIndex?.config.embeddingFunction;

    // Use override if available, otherwise default. Coalesce to undefined if both are null.
    return (overrideFunction || defaultFunction) ?? undefined;
  }
```

⚡ **Committable suggestion**

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

</details>

File: clients/new-js/packages/chromadb/src/collection.ts
Line: 339
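
The precedence discussed in this comment (per-key override first, then the global default, then nothing) can be sketched outside the class as follows. The types are simplified stand-ins for the schema shape in the excerpt, and the value of `EMBEDDING_KEY` is assumed for illustration. Because a function value is always truthy, `||` and `??` behave identically for this particular fallback, so the sketch uses `??` throughout.

```typescript
// Simplified stand-ins for the schema shape used in the excerpt above.
type EmbedFn = (texts: string[]) => number[][];

interface ValueTypesLike {
  floatList?: { vectorIndex?: { config: { embeddingFunction?: EmbedFn | null } } };
}

interface SchemaLike {
  defaults: ValueTypesLike;
  keys: Record<string, ValueTypesLike | undefined>;
}

// Assumed key name, for illustration only.
const EMBEDDING_KEY = "#embedding";

function resolveSchemaEmbeddingFunction(schema?: SchemaLike): EmbedFn | undefined {
  if (!schema) return undefined;

  const overrideFunction =
    schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config.embeddingFunction;
  const defaultFunction =
    schema.defaults.floatList?.vectorIndex?.config.embeddingFunction;

  // Prefer the per-key override, then the global default; normalize null to undefined.
  return overrideFunction ?? defaultFunction ?? undefined;
}
```

Whether this reads better than the explicit `if` in the PR is a style choice; the behavior is the same for the inputs the original code handles.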

@@ -0,0 +1,1002 @@
import type {

[BestPractice]

The new schema.ts file is quite large and contains many distinct concepts (constants, config classes, type classes, utility functions, and the main Schema class). For better maintainability and code organization, consider splitting this file into smaller, more focused modules.

For example, you could structure it like this:

  • src/schema/constants.ts
  • src/schema/index-configs.ts
  • src/schema/value-types.ts
  • src/schema/utils.ts
  • src/schema/schema.ts (for the main class)
  • src/schema/index.ts (to re-export everything)

This would make the code easier to navigate and understand.

File: clients/new-js/packages/chromadb/src/schema.ts
Line: 1

@jairad26 jairad26 changed the base branch from jai/schema-e2e-tests to graphite-base/5621 October 17, 2025 04:21
@jairad26 jairad26 changed the base branch from graphite-base/5621 to main October 17, 2025 04:21
Comment on lines 309 to +323
   private async embed(inputs: string[], isQuery: boolean): Promise<number[][]> {
-    if (!this._embeddingFunction) {
+    const embeddingFunction =
+      this._embeddingFunction ?? this.getSchemaEmbeddingFunction();
+
+    if (!embeddingFunction) {
       throw new ChromaValueError(
         "Embedding function must be defined for operations requiring embeddings.",
       );
     }

-    if (this._embeddingFunction.generateForQueries && isQuery) {
-      return await this._embeddingFunction.generateForQueries(inputs);
-    } else {
-      return await this._embeddingFunction.generate(inputs);
-    }
+    if (isQuery && embeddingFunction.generateForQueries) {
+      return await embeddingFunction.generateForQueries(inputs);
+    }
+
+    return await embeddingFunction.generate(inputs);

[BestPractice]

The embedding function resolution logic has improved error handling, but the error message should be more descriptive about resolution options. ChromaDB 3.0+ supports embedding functions at both the collection instance level and schema level, so users need clear guidance on how to resolve missing embedding function errors.

<details>
<summary>Suggested Change</summary>

```suggestion
  private async embed(inputs: string[], isQuery: boolean): Promise<number[][]> {
    const embeddingFunction =
      this._embeddingFunction ?? this.getSchemaEmbeddingFunction();

    if (!embeddingFunction) {
      throw new ChromaValueError(
        "Embedding function must be defined for operations requiring embeddings. " +
        "Provide an embedding function when creating the collection or configure one in the schema. " +
        "See https://docs.trychroma.com/docs/embeddings/embedding-functions for available options.",
      );
    }

    if (isQuery && embeddingFunction.generateForQueries) {
      return await embeddingFunction.generateForQueries(inputs);
    }

    return await embeddingFunction.generate(inputs);
  }
```

⚡ **Committable suggestion**

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

</details>

File: clients/new-js/packages/chromadb/src/collection.ts
Line: 323
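
The resolution order in this hunk can be exercised in isolation with a small sketch. `EmbeddingFunctionLike` and `embedWithFallback` are hypothetical stand-ins, and a plain `Error` replaces `ChromaValueError` to keep the example self-contained; the logic mirrors the diff: the instance-level function first, then the schema-provided one, with `generateForQueries` used only for queries and only when it exists.

```typescript
// Stand-in interface; the real EmbeddingFunction type lives in the chromadb client.
interface EmbeddingFunctionLike {
  generate(texts: string[]): Promise<number[][]>;
  generateForQueries?(texts: string[]): Promise<number[][]>;
}

async function embedWithFallback(
  inputs: string[],
  isQuery: boolean,
  instanceFn?: EmbeddingFunctionLike,
  schemaFn?: EmbeddingFunctionLike,
): Promise<number[][]> {
  // The instance-level function takes precedence over the schema-provided one.
  const embeddingFunction = instanceFn ?? schemaFn;

  if (!embeddingFunction) {
    throw new Error(
      "Embedding function must be defined for operations requiring embeddings. " +
        "Provide one when creating the collection or configure one in the schema.",
    );
  }

  // Use the query-specific path only when the function implements it.
  if (isQuery && embeddingFunction.generateForQueries) {
    return embeddingFunction.generateForQueries(inputs);
  }

  return embeddingFunction.generate(inputs);
}
```

Checking `isQuery` before the optional method, as the PR does, also reads more naturally than the original order, though both orders evaluate to the same result.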

Comment on lines +623 to +633
private validateSingleSparseVectorIndex(targetKey: string): void {
  for (const [existingKey, valueTypes] of Object.entries(this.keys)) {
    if (existingKey === targetKey) continue;
    const sparseIndex = valueTypes.sparseVector?.sparseVectorIndex;
    if (sparseIndex?.enabled) {
      throw new Error(
        `Cannot enable sparse vector index on key '${targetKey}'. A sparse vector index is already enabled on key '${existingKey}'. Only one sparse vector index is allowed per collection.`,
      );
    }
  }
}

[BestPractice]

The validateSingleSparseVectorIndex method enforces that only one sparse vector index can be enabled per collection, but the error message should provide clearer resolution guidance. Sparse vector indexing is an advanced ChromaDB feature with specific constraints that users need to understand.

<details>
<summary>Suggested Change</summary>

```suggestion
  private validateSingleSparseVectorIndex(targetKey: string): void {
    for (const [existingKey, valueTypes] of Object.entries(this.keys)) {
      if (existingKey === targetKey) continue;
      const sparseIndex = valueTypes.sparseVector?.sparseVectorIndex;
      if (sparseIndex?.enabled) {
        throw new Error(
          `Cannot enable sparse vector index on key '${targetKey}'. A sparse vector index is already enabled on key '${existingKey}'. ` +
          `Only one sparse vector index is allowed per collection. To resolve this conflict, either: ` +
          `1) Disable the existing index using deleteIndex(new SparseVectorIndexConfig(), '${existingKey}'), or ` +
          `2) Use a different key name for your new sparse vector index.`,
        );
      }
    }
  }
```

⚡ **Committable suggestion**

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

</details>

File: clients/new-js/packages/chromadb/src/schema.ts
Line: 633
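
The one-sparse-index-per-collection rule from this excerpt can be illustrated with a standalone sketch. `KeyValueTypes`, `findConflictingSparseKey`, and `enableSparseIndex` are hypothetical names over a simplified key map, not the real `Schema` class; the check itself follows the loop in the PR.

```typescript
// Simplified stand-in for the per-key value types in the excerpt above.
interface KeyValueTypes {
  sparseVector?: { sparseVectorIndex?: { enabled: boolean } };
}

// Return the first other key that already has an enabled sparse vector index, if any.
function findConflictingSparseKey(
  keys: Record<string, KeyValueTypes>,
  targetKey: string,
): string | undefined {
  for (const [existingKey, valueTypes] of Object.entries(keys)) {
    if (existingKey === targetKey) continue; // re-enabling on the same key is not a conflict
    if (valueTypes.sparseVector?.sparseVectorIndex?.enabled) return existingKey;
  }
  return undefined;
}

// Validate first, then mutate, so a rejected call leaves the key map unchanged.
function enableSparseIndex(
  keys: Record<string, KeyValueTypes>,
  targetKey: string,
): void {
  const conflict = findConflictingSparseKey(keys, targetKey);
  if (conflict !== undefined) {
    throw new Error(
      `Cannot enable sparse vector index on key '${targetKey}'. ` +
        `A sparse vector index is already enabled on key '${conflict}'. ` +
        `Only one sparse vector index is allowed per collection.`,
    );
  }
  keys[targetKey] = {
    ...keys[targetKey],
    sparseVector: { sparseVectorIndex: { enabled: true } },
  };
}
```

Returning the conflicting key from a separate function keeps the check reusable, for example to build a more detailed error message at the call site.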
