- 
                Notifications
    
You must be signed in to change notification settings  - Fork 1.9k
 
[ENH] Add local support for schema #5714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
          
 This stack of pull requests is managed by Graphite. Learn more about stacking.  | 
    
          Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
 System Compatibility
 Quality
  | 
    
26fbb95    to
    50dc87a      
    Compare
  
    50dc87a    to
    4f90e30      
    Compare
  
    d88801a    to
    a1642ad      
    Compare
  
    4f90e30    to
    4170cd0      
    Compare
  
    a1642ad    to
    d9975c0      
    Compare
  
    cc49da4    to
    82c400d      
    Compare
  
    d9975c0    to
    73d6c8c      
    Compare
  
    82c400d    to
    5a6e468      
    Compare
  
    73d6c8c    to
    f60a76e      
    Compare
  
    5a6e468    to
    8e145e7      
    Compare
  
    45ca931    to
    3e6652e      
    Compare
  
    cb029e5    to
    ebae17b      
    Compare
  
    ebae17b    to
    9bc109b      
    Compare
  
    3e6652e    to
    bdf764b      
    Compare
  
    bdf764b    to
    2def8f3      
    Compare
  
    9bc109b    to
    d22f13f      
    Compare
  
    2def8f3    to
    1695d4a      
    Compare
  
    | let schema = match first_row.get::<Option<&str>, _>(7) { | ||
| Some(json_str) if !json_str.trim().is_empty() && json_str.trim() != "null" => { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Error handling inconsistency: The schema deserialization logic handles empty strings and "null" values differently across the codebase. In the SQLite implementation, it checks for both conditions, but other parts of the code may not handle these edge cases consistently.
match first_row.get::<Option<&str>, _>(7) {
    Some(json_str) if !json_str.trim().is_empty() && json_str.trim() != "null" => {
        // This logic should be centralized in a helper function
        // to ensure consistency across all schema deserialization points
    }
    // ...
}Consider creating a centralized deserialize_schema_from_db helper function to ensure consistent handling of these edge cases.
Context for Agents
[**BestPractice**]
Error handling inconsistency: The schema deserialization logic handles empty strings and "null" values differently across the codebase. In the SQLite implementation, it checks for both conditions, but other parts of the code may not handle these edge cases consistently.
```rust
match first_row.get::<Option<&str>, _>(7) {
    Some(json_str) if !json_str.trim().is_empty() && json_str.trim() != "null" => {
        // This logic should be centralized in a helper function
        // to ensure consistency across all schema deserialization points
    }
    // ...
}
```
Consider creating a centralized `deserialize_schema_from_db` helper function to ensure consistent handling of these edge cases.
File: rust/sysdb/src/sqlite.rs
Line: 84417e078b    to
    32d4b95      
    Compare
  
            
          
                rust/sqlite/src/db.rs
              
                Outdated
          
        
      | // Check if database has more applied migrations than available source migrations | ||
| if applied_migrations.len() > source_migrations.len() { | ||
| return Ok(vec![]); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Potential null database migration issue: The code checks if applied_migrations.len() > source_migrations.len() and returns empty migrations, but this could mask real migration problems. If the database has more applied migrations than source migrations, this suggests a version mismatch or corrupted migration state that should be explicitly handled.
Suggested Change
| // Check if database has more applied migrations than available source migrations | |
| if applied_migrations.len() > source_migrations.len() { | |
| return Ok(vec![]); | |
| // Check if database has more applied migrations than available source migrations | |
| if applied_migrations.len() > source_migrations.len() { | |
| return Err(SqliteError::MigrationVersionMismatch( | |
| format!( | |
| "Database has {} applied migrations but only {} source migrations available", | |
| applied_migrations.len(), | |
| source_migrations.len() | |
| ) | |
| )); | |
| } | 
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
Potential null database migration issue: The code checks `if applied_migrations.len() > source_migrations.len()` and returns empty migrations, but this could mask real migration problems. If the database has more applied migrations than source migrations, this suggests a version mismatch or corrupted migration state that should be explicitly handled.
<details>
<summary>Suggested Change</summary>
```suggestion
// Check if database has more applied migrations than available source migrations
if applied_migrations.len() > source_migrations.len() {
    return Err(SqliteError::MigrationVersionMismatch(
        format!(
            "Database has {} applied migrations but only {} source migrations available", 
            applied_migrations.len(), 
            source_migrations.len()
        )
    ));
}
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
</details>
File: rust/sqlite/src/db.rs
Line: 1598ec4278    to
    bee55b4      
    Compare
  
    | .get_hnsw_config_with_legacy_fallback(segment)? | ||
| .schema | ||
| .as_ref() | ||
| .map(|schema| schema.get_internal_hnsw_config_with_legacy_fallback(segment)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sanketkedia is it fine to not reconcile on the writer? i believe it should come through frontend, so it should reconcile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code, I think we need this here. The reconcile of schema with config happens in the handler for BackfillMessage. And then we need to reconcile this with legacy metadata here so this seems correct and necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, I'd have liked all the three reconciles to happen at one place but that's not what it is now even with collection config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i meant reconcile schema and config. yes the legacy fallback is needed everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use validate_schema from validators.rs in bindings.rs
| 
           nvm  | 
    
bee55b4    to
    296af1b      
    Compare
  
    296af1b    to
    dd3f346      
    Compare
  
    | // This is for backwards compatibility so that users who migrate to distributed | ||
| // from local don't break their code. | ||
| KnnIndex::Spann => { | ||
| let internal_config = if let Some(space) = hnsw.space { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't you reconcile with legacy metadata here before getting the space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I feel like we should remove the blanket reconcile at the top and reconcile here in various places for readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
dd3f346    to
    df8c8c8      
    Compare
  
    | 
           In general, I feel like the reconciliation logic is all over the place. But that's from before (collection config) so ok for now. But ideally once you've read from sysdb, you should assume that it has schema set and properly reconciled with both collection config and legacy metadata and just use it downstream  | 
    
## Description of changes _Summarize the changes made by this PR._ - Improvements & Bug fixes - This PR adds support for schema in sqlite sysdb, correctly reconciling with schema, legacy metadata, and supporting configuration updates. It also adds support for passing schema via bindings, to allow for local chroma support. It also updates cli usage of to allow copying of schema - New functionality - ... ## Test plan _How are these changes tested?_ expanded schema e2e tests to ensure bindings and single node all work as intended - [ x] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust ## Migration plan _Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?_ ## Observability plan _What is the plan to instrument and monitor this change?_ ## Documentation Changes _Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
expanded schema e2e tests to ensure bindings and single node all work as intended
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?