Fix #2755: Add support for custom knowledge storage with pre-existing embeddings #2756

devin-ai-integration · 2025-05-05T13:58:52Z

Fix #2755: Add support for custom knowledge storage with pre-existing embeddings

Problem

Users want to load pre-existing vector embeddings in ChromaDB without re-embedding when using CrewAI. Currently, there's no way to pass a custom KnowledgeStorage subclass directly to a Crew's knowledge_sources parameter because it expects instances of BaseKnowledgeSource, not KnowledgeStorage.

Solution

This PR adds a new CustomStorageKnowledgeSource class that:

- Inherits from BaseKnowledgeSource, so it passes the validation check in the Crew class
- Has empty implementations of the validate_content() and add() methods, since it does not need to process or add content (it uses existing embeddings)
- Works with pre-existing storage by letting the user set its storage field to a custom KnowledgeStorage instance
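
The shape described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `KnowledgeStorage` and `BaseKnowledgeSource` here are hypothetical stand-ins written so the snippet runs without CrewAI installed, and the real base classes may differ in fields and signatures.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class KnowledgeStorage:
    """Hypothetical stand-in for CrewAI's KnowledgeStorage (not the real class)."""

    def __init__(self, collection_name: Optional[str] = None) -> None:
        self.collection_name = collection_name

    def search(self, query: List[str], limit: int = 3) -> List[Dict[str, Any]]:
        # A real storage would query its vector store here.
        return []


class BaseKnowledgeSource(ABC):
    """Hypothetical stand-in for CrewAI's BaseKnowledgeSource."""

    def __init__(self, collection_name: Optional[str] = None) -> None:
        self.collection_name = collection_name
        self.storage: Optional[KnowledgeStorage] = None

    @abstractmethod
    def validate_content(self) -> Any: ...

    @abstractmethod
    def add(self) -> None: ...


class CustomStorageKnowledgeSource(BaseKnowledgeSource):
    """Knowledge source backed by a storage that already holds embeddings."""

    def validate_content(self) -> None:
        # Nothing to validate: this source loads no raw content.
        return None

    def add(self) -> None:
        # Nothing to add: the embeddings already exist in the storage.
        return None


source = CustomStorageKnowledgeSource(collection_name="docs")
source.storage = KnowledgeStorage(collection_name="docs")
source.add()  # a no-op, so nothing is re-embedded
```

Because the class is a genuine subclass of the base source type, an `isinstance` check of the kind the Crew validation performs would accept it, while the no-op `add()` leaves the stored vectors untouched.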

Example Usage

An example file is included in docs/examples/custom_storage_knowledge_source_example.py that demonstrates how to:

- Create a custom KnowledgeStorage subclass
- Initialize it with a specific persistent directory
- Create a CustomStorageKnowledgeSource and set its storage
- Use it with a Crew
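
As a rough illustration of these four steps (not the contents of the example file), the sketch below uses hypothetical stand-ins: `PrecomputedStorage` plays the role of a custom KnowledgeStorage subclass and reads vectors from a JSON file instead of ChromaDB, and `SourceWithStorage` plays the role of CustomStorageKnowledgeSource. The JSON layout and the `search` signature are assumptions made for the demo.

```python
import json
import math
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


class PrecomputedStorage:
    """Hypothetical stand-in for a custom KnowledgeStorage subclass."""

    def __init__(self, persist_directory: str, collection_name: str = "docs") -> None:
        self.path = Path(persist_directory) / f"{collection_name}.json"
        # Load vectors embedded in an earlier run; nothing is re-embedded here.
        self.records: List[Dict] = json.loads(self.path.read_text())

    def search(self, query_vector: List[float], limit: int = 1) -> List[str]:
        def cosine(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self.records,
            key=lambda record: cosine(query_vector, record["embedding"]),
            reverse=True,
        )
        return [record["text"] for record in ranked[:limit]]


class SourceWithStorage:
    """Hypothetical stand-in for CustomStorageKnowledgeSource."""

    def __init__(self, collection_name: str) -> None:
        self.collection_name = collection_name
        self.storage: Optional[PrecomputedStorage] = None


with tempfile.TemporaryDirectory() as persist_dir:
    # Simulate embeddings left behind by an earlier run.
    existing = [
        {"text": "refund policy", "embedding": [1.0, 0.0]},
        {"text": "shipping times", "embedding": [0.0, 1.0]},
    ]
    (Path(persist_dir) / "docs.json").write_text(json.dumps(existing))

    storage = PrecomputedStorage(persist_directory=persist_dir)  # steps 1 and 2
    source = SourceWithStorage("docs")                           # step 3
    source.storage = storage
    top_hit = source.storage.search([0.9, 0.1])
    # Step 4: in CrewAI the source would be passed via Crew(..., knowledge_sources=[source]).
```

The query here is already a vector to keep the demo free of any embedding model; a real storage would embed the query text with the same embedder that produced the stored vectors.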

Testing

Added comprehensive tests in tests/knowledge/custom_storage_knowledge_source_test.py that verify:

- A CustomStorageKnowledgeSource can be created with a pre-existing storage
- It can be used with a Knowledge object
- It can be used with a Crew object

All tests are passing.

Link to Devin run: https://app.devin.ai/sessions/5b9489212add47cab818588276f34883
Requested by: Joe Moura ([email protected])


devin-ai-integration · 2025-05-05T13:58:58Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

- Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
- Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


joaomdmoura · 2025-05-05T14:01:35Z

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment for PR #2756

Overview

This PR introduces support for custom knowledge storage leveraging pre-existing embeddings across three files: custom_storage_knowledge_source.py, custom_storage_knowledge_source_example.py, and custom_storage_knowledge_source_test.py. While the implementation is generally clean and well-structured, several enhancements could improve maintainability, robustness, and clarity.

Detailed Feedback

1. `custom_storage_knowledge_source.py`

Strengths:

- The minimal knowledge source class is implemented well.
- Type hints and inheritance are used effectively.
- Documentation is present, which aids understanding.

Recommendations:

Comprehensive Docstrings:

Enhance docstrings to include parameter and attribute descriptions for better clarity.

class CustomStorageKnowledgeSource(BaseKnowledgeSource):
    """A knowledge source that uses a pre-existing storage with embeddings.
    
    Args:
        collection_name (Optional[str]): Name of the collection.
    
    Attributes:
        storage (KnowledgeStorage): The underlying storage implementation.
    """

Error Handling:

Implement error handling for storage initialization to catch incorrect usage:

def validate_content(self):
    """Validates that the storage is properly initialized."""
    # A declared field always exists, so test for None rather than attribute presence.
    if getattr(self, "storage", None) is None:
        raise ValueError("Storage not initialized. Please set storage before use.")
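
To show how such a check would behave, here is a small sketch. It assumes the storage attribute defaults to `None` until the user assigns it; `SourceSketch` is a hypothetical stand-in, not the class from this PR.

```python
from typing import Any, Optional


class SourceSketch:
    """Hypothetical stand-in for a source whose storage starts out unset."""

    def __init__(self) -> None:
        self.storage: Optional[Any] = None  # assumed default, set later by the user

    def validate_content(self) -> None:
        """Fail fast when no storage has been attached."""
        if self.storage is None:
            raise ValueError("Storage not initialized. Please set storage before use.")


unset_source = SourceSketch()
try:
    unset_source.validate_content()
    validation_error = None
except ValueError as exc:
    validation_error = str(exc)

ready_source = SourceSketch()
ready_source.storage = object()  # any storage-like object, for the demo
ready_source.validate_content()  # passes silently
```

Checking for `None` matters when the attribute always exists as a declared field, since a presence test alone would never fail in that case.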

2. `custom_storage_knowledge_source_example.py`

Strengths:

- The example implementation is clear and properly documented.
- The error handling demonstrates good practice.

Recommendations:

Type Hints:
- Include type hints in the main function to clarify its purpose and return type:
```
def main() -> None:
    """An example of using custom storage with CrewAI."""
```

Robust Error Handling and Cleanup:

Enhance the error management and ensure the cleanup of resources:

from typing import Any, Optional

def get_knowledge_source_with_custom_storage(
    folder_name: str,
    embedder: Optional[Any] = None
) -> CustomStorageKnowledgeSource:
    """Create a knowledge source with a custom storage.
    
    Raises:
        Exception: If storage initialization fails.
    """
    try:
        ...
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback.
        raise Exception(f"Failed to initialize knowledge source: {e}") from e

Configuration Validations:

Validate configurations to catch misconfigurations early:

class CustomKnowledgeStorage(KnowledgeStorage):
    def __init__(self, persist_directory: str, embedder=None, collection_name=None):
        if not persist_directory:
            raise ValueError("persist_directory cannot be empty")
        ...
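
A minimal sketch of this early validation follows. `CustomKnowledgeStorage` here is a hypothetical stand-in that does not subclass the real KnowledgeStorage, and the whitespace check is an added assumption beyond the snippet above.

```python
from pathlib import Path
from typing import Optional


class CustomKnowledgeStorage:
    """Hypothetical stand-in; the real class would subclass KnowledgeStorage."""

    def __init__(self, persist_directory: str, collection_name: Optional[str] = None) -> None:
        # Reject empty or whitespace-only paths before any storage is opened.
        if not persist_directory or not persist_directory.strip():
            raise ValueError("persist_directory cannot be empty")
        self.persist_directory = Path(persist_directory)
        self.collection_name = collection_name


try:
    CustomKnowledgeStorage(persist_directory="")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

valid_storage = CustomKnowledgeStorage(persist_directory="vectors", collection_name="docs")
```

Raising in the constructor surfaces a bad path at setup time rather than at the first query.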

3. `custom_storage_knowledge_source_test.py`

Strengths:

- Test coverage is solid and clearly organized.
- Good use of pytest fixtures simplifies setup.

Recommendations:

Edge Cases and Error Testing:

Add tests for edge cases where the initialization fails:

def test_custom_storage_knowledge_source_validation():
    """Test that validation fails when storage is not initialized."""
    ...

Resource Cleanup:

Ensure resources are cleaned up post-test to avoid affecting subsequent tests:

import pytest

@pytest.fixture(autouse=True)
def cleanup():
    """Clean up test files after each test."""
    yield
    ...
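
The setup, yield, teardown pattern behind such a fixture can be sketched with the standard library alone. The example below uses `contextlib.contextmanager` rather than pytest so that it runs anywhere; `temporary_storage_dir` and the file name are hypothetical, and a pytest yield fixture would follow the same shape.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def temporary_storage_dir() -> Iterator[Path]:
    """Create a scratch directory, hand it to the caller, then remove it."""
    directory = Path(tempfile.mkdtemp(prefix="knowledge_test_"))
    try:
        yield directory
    finally:
        # Runs even if the body raises, so later tests start from a clean state.
        shutil.rmtree(directory, ignore_errors=True)


with temporary_storage_dir() as storage_dir:
    (storage_dir / "vectors.json").write_text("[]")
    existed_during_test = storage_dir.exists()

exists_after_test = storage_dir.exists()
```

In a pytest suite, the built-in `tmp_path` fixture may be a simpler choice for per-test directories, since pytest manages those directories itself.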

Integration Tests:

Incorporate tests to validate full integration with existing storage:

def test_full_integration():
    """Test the full integration with actual storage."""
    ...

General Recommendations

Implement Logging:
- Adding logging throughout the codebase will help in tracking operations and potential issues:
```
import logging
logger = logging.getLogger(__name__)
```
Input Validation Decorators:
- Use decorators for input validation to reduce repetitive code:
```
from functools import wraps
```
Configuration File Support:
- Consider adding support for configuration files (e.g., YAML) to streamline setup:
```
import yaml
```
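
For the input validation decorator suggestion, one possible shape is sketched below. The decorator name `require_non_empty` and the helper `open_storage` are hypothetical and only illustrate the idea; the sketch checks keyword arguments only.

```python
from functools import wraps
from typing import Any, Callable


def require_non_empty(*param_names: str) -> Callable:
    """Reject empty or missing keyword arguments before the function runs."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for name in param_names:
                if not kwargs.get(name):
                    raise ValueError(f"{name} cannot be empty")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_non_empty("persist_directory")
def open_storage(*, persist_directory: str) -> str:
    """Hypothetical helper that would open a storage directory."""
    return f"opened {persist_directory}"


result = open_storage(persist_directory="vectors")

try:
    open_storage(persist_directory="")
    decorator_error = None
except ValueError as exc:
    decorator_error = str(exc)
```

Making the parameter keyword-only keeps the check simple; supporting positional arguments would need `inspect.signature` to bind them to names first.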

Summary

While the implementation is robust, there are opportunities for improvement in error handling, type annotations, and testing. Implementing these suggestions will enhance code quality and maintainability, ensuring a better development experience and facilitating future enhancements.

Please consider these recommendations before finalizing the pull request for a more thorough and quality implementation.


Fix #2755: Add support for custom knowledge storage with pre-existing embeddings

51439c3

Co-Authored-By: Joe Moura <[email protected]>

devin-ai-integration bot and others added 4 commits May 5, 2025 14:01

Fix expected_output parameter in Task example

58a1206

Co-Authored-By: Joe Moura <[email protected]>

Fix import formatting to resolve lint issues

1b9cbb6

Co-Authored-By: Joe Moura <[email protected]>

Implement PR review suggestions for improved error handling, docstrings, and tests

29ebdbf

Co-Authored-By: Joe Moura <[email protected]>

Fix import sorting to resolve lint issues

1867c79

Co-Authored-By: Joe Moura <[email protected]>
