Fix #2755: Add support for custom knowledge storage with pre-existing embeddings #2756

Open
wants to merge 5 commits into
base: main

devin-ai-integration[bot]
Contributor

Fix #2755: Add support for custom knowledge storage with pre-existing embeddings

Problem

Users want to load pre-existing vector embeddings in ChromaDB without re-embedding when using CrewAI. Currently, there's no way to pass a custom KnowledgeStorage subclass directly to a Crew's knowledge_sources parameter because it expects instances of BaseKnowledgeSource, not KnowledgeStorage.

Solution

This PR adds a new CustomStorageKnowledgeSource class that:

  1. Inherits from BaseKnowledgeSource, so it passes the validation check in the Crew class
  2. Has empty implementations of validate_content() and add() methods, since it doesn't need to process or add content (it uses existing embeddings)
  3. Works with pre-existing storage by allowing the user to set its storage field to a custom KnowledgeStorage instance
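
The three points above can be illustrated without CrewAI installed. The following is a minimal sketch, not the PR's actual code: `BaseKnowledgeSource` and `KnowledgeStorage` are simplified stand-ins for the CrewAI classes of the same names, and `InMemoryStorage` with its `search` signature is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class KnowledgeStorage:
    """Stand-in for CrewAI's KnowledgeStorage (assumption: simplified API)."""

    def search(self, query: str) -> List[str]:
        raise NotImplementedError


class BaseKnowledgeSource(ABC):
    """Stand-in for CrewAI's BaseKnowledgeSource (assumption: simplified API)."""

    def __init__(self, storage: Optional[KnowledgeStorage] = None) -> None:
        self.storage = storage

    @abstractmethod
    def validate_content(self) -> None: ...

    @abstractmethod
    def add(self) -> None: ...


class CustomStorageKnowledgeSource(BaseKnowledgeSource):
    """Source that relies on storage whose embeddings already exist."""

    def validate_content(self) -> None:
        # Nothing to validate: no raw content is loaded.
        pass

    def add(self) -> None:
        # Nothing to embed: the storage is already populated.
        pass


class InMemoryStorage(KnowledgeStorage):
    """Hypothetical storage pretending to hold pre-computed embeddings."""

    def __init__(self, documents: List[str]) -> None:
        self.documents = documents

    def search(self, query: str) -> List[str]:
        # Naive substring match standing in for a vector similarity search.
        return [d for d in self.documents if query.lower() in d.lower()]


source = CustomStorageKnowledgeSource()
source.storage = InMemoryStorage(["CrewAI supports custom storage."])
source.add()  # no-op: the stored documents are left untouched
```

Because the subclass satisfies the abstract interface, a type check against `BaseKnowledgeSource` passes, while all retrieval is delegated to whatever storage the user assigns.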

Example Usage

An example file is included in docs/examples/custom_storage_knowledge_source_example.py that demonstrates how to:

  1. Create a custom KnowledgeStorage subclass
  2. Initialize it with a specific persistent directory
  3. Create a CustomStorageKnowledgeSource and set its storage
  4. Use it with a Crew

Testing

Added comprehensive tests in tests/knowledge/custom_storage_knowledge_source_test.py that verify:

  1. A CustomStorageKnowledgeSource can be created with a pre-existing storage
  2. It can be used with a Knowledge object
  3. It can be used with a Crew object

All tests are passing.

Link to Devin run: https://app.devin.ai/sessions/5b9489212add47cab818588276f34883
Requested by: Joe Moura ([email protected])

Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@joaomdmoura
Collaborator

Disclaimer: This review was made by a crew of AI Agents.

Code Review Comment for PR #2756

Overview

This PR introduces support for custom knowledge storage leveraging pre-existing embeddings across three files: custom_storage_knowledge_source.py, custom_storage_knowledge_source_example.py, and custom_storage_knowledge_source_test.py. While the implementation is generally clean and well-structured, several enhancements could improve maintainability, robustness, and clarity.

Detailed Feedback

1. custom_storage_knowledge_source.py

Strengths:

  • The implementation of a minimal knowledge source class is well done.
  • Type hints and inheritance are effectively used.
  • Documentation is present, aiding understanding.

Recommendations:

  1. Comprehensive Docstrings:

    • Enhance docstrings to include parameter and attribute descriptions for better clarity.
      class CustomStorageKnowledgeSource(BaseKnowledgeSource):
          """A knowledge source that uses a pre-existing storage with embeddings.
          
          Args:
              collection_name (Optional[str]): Name of the collection.
          
          Attributes:
              storage (KnowledgeStorage): The underlying storage implementation.
          """
  2. Error Handling:

    • Implement error handling for storage initialization to catch incorrect usage:
      def validate_content(self) -> None:
          """Validates that the storage is properly initialized."""
          # A declared field always exists, so test its value rather than hasattr().
          if getattr(self, "storage", None) is None:
              raise ValueError("Storage not initialized. Please set storage before use.")

2. custom_storage_knowledge_source_example.py

Strengths:

  • The example implementation is clear and properly documented.
  • Effective error handling demonstrates good practice.

Recommendations:

  1. Type Hints:

    • Include type hints in the main function to clarify its purpose and return type:
      def main() -> None:
          """An example of using custom storage with CrewAI."""
  2. Robust Error Handling and Cleanup:

    • Enhance the error management and ensure the cleanup of resources:
      from typing import Any, Optional

      def get_knowledge_source_with_custom_storage(
          folder_name: str,
          embedder: Optional[Any] = None
      ) -> CustomStorageKnowledgeSource:
          """Create a knowledge source with a custom storage.
          
          Raises:
              Exception: If storage initialization fails.
          """
          try:
              ...
          except Exception as e:
              # Chain the original error so its traceback is preserved.
              raise Exception(f"Failed to initialize knowledge source: {e}") from e
  3. Configuration Validations:

    • Validate configurations to catch misconfigurations early:
      class CustomKnowledgeStorage(KnowledgeStorage):
          def __init__(self, persist_directory: str, embedder=None, collection_name=None):
              if not persist_directory:
                  raise ValueError("persist_directory cannot be empty")
              ...

3. custom_storage_knowledge_source_test.py

Strengths:

  • Test coverage is solid with clear organization.
  • Good use of pytest fixtures aids in setup.

Recommendations:

  1. Edge Cases and Error Testing:

    • Add tests for edge cases where the initialization fails:
      def test_custom_storage_knowledge_source_validation():
          """Test that validation fails when storage is not initialized."""
          ...
  2. Resource Cleanup:

    • Ensure resources are cleaned up post-test to avoid affecting subsequent tests:
      @pytest.fixture(autouse=True)
      def cleanup():
          """Clean up test files after each test."""
          yield
          ...
  3. Integration Tests:

    • Incorporate tests to validate full integration with existing storage:
      def test_full_integration():
          """Test the full integration with actual storage."""
          ...
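
The three test recommendations above could look roughly like the sketch below. It uses only the standard library so that it runs without CrewAI or pytest; `FakeSource`, `temp_storage_dir`, and both `check_*` functions are hypothetical stand-ins, and a real suite would use the project's own classes and fixtures instead.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, Optional


class FakeSource:
    """Hypothetical stand-in for a knowledge source with a storage field."""

    def __init__(self, storage: Optional[str] = None) -> None:
        self.storage = storage  # here simply a directory path

    def validate_content(self) -> None:
        if self.storage is None:
            raise ValueError("Storage not initialized. Please set storage before use.")


@contextmanager
def temp_storage_dir() -> Iterator[str]:
    """Create a scratch directory and delete it afterwards (the cleanup step)."""
    with tempfile.TemporaryDirectory() as path:
        yield path


def check_validation_fails_without_storage() -> bool:
    """Edge case: validation must raise when storage is unset."""
    try:
        FakeSource().validate_content()
    except ValueError:
        return True
    return False


def check_storage_dir_is_cleaned_up() -> bool:
    """Integration-style check: the directory exists during use, not after."""
    with temp_storage_dir() as path:
        source = FakeSource(storage=path)
        source.validate_content()  # must not raise once storage is set
        existed = os.path.isdir(path)
    return existed and not os.path.exists(path)
```

Under pytest, the context manager would typically become a `yield` fixture and the two checks would use `pytest.raises` and plain `assert` statements.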

General Recommendations

  1. Implement Logging:

    • Adding logging throughout the codebase will help in tracking operations and potential issues:
      import logging
      logger = logging.getLogger(__name__)
  2. Input Validation Decorators:

    • Use decorators for input validation to reduce repetitive code:
      from functools import wraps
  3. Configuration File Support:

    • Consider adding support for configuration files (e.g., YAML) to streamline setup:
      import yaml
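
As an illustration of the first two recommendations, logging and a validation decorator can be combined as in the sketch below. This is one possible shape rather than anything from the PR: `require_non_empty` and `open_storage` are hypothetical names, and the check only covers arguments passed by keyword.

```python
import logging
from functools import wraps
from typing import Any, Callable

logger = logging.getLogger(__name__)


def require_non_empty(param: str) -> Callable:
    """Reject calls where keyword argument `param` is missing or empty."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if not kwargs.get(param):
                logger.error("%s: '%s' is missing or empty", func.__name__, param)
                raise ValueError(f"{param} cannot be empty")
            logger.debug("%s: '%s' validated", func.__name__, param)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_non_empty("persist_directory")
def open_storage(*, persist_directory: str) -> str:
    """Hypothetical helper; a real one would open the vector store."""
    return f"storage at {persist_directory}"
```

Making `persist_directory` keyword-only keeps the decorator simple, since it never has to map positional arguments back to parameter names.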

Summary

While the implementation is robust, there are opportunities for improvement in error handling, type annotations, and testing. Implementing these suggestions will enhance code quality and maintainability, ensuring a better development experience and facilitating future enhancements.

Please consider these recommendations before finalizing the pull request for a more thorough and quality implementation.

Development

Successfully merging this pull request may close these issues.

[Question] How to load pre-existing embeddings as Knowledge Source?