Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(file-based): sync file acl permissions and identities #260

Merged
merged 51 commits into from
Feb 14, 2025

Conversation

aldogonzalez8
Copy link
Contributor

@aldogonzalez8 aldogonzalez8 commented Jan 23, 2025

What

We want to introduce logic for syncing access control lists and other metadata and Identities for connectors, specifically in this implementation for File Bases connectors.

If you are wondering what we mean by identities and ACLs here are a couple of documents that should clarify the general idea:

Relates to [source-google-drive] Add ACL: permissions colmns and identities stream

Reltaes to [source-microsoft-sharepoint] Implement Identities and Permissions

How

We will leverage DefaultFileBased stream and stream reader to reuse most of the logic for scrapping files and let connectors implement the logic from the domain they handle. This initial iteration scopes introducing ACLs and Identities for source-google-drive in this PR, but in the future, we expect to add more connectors.

In the UI we will add a new Transfer Mode to Replicate Permissions ACL

Note: This image is from source-google-drive, which introduces a domain optional field; this is not part of base config mode as the domain could not be a concept for other connectors.

image

Review guide

  1. airbyte_cdk/sources/file_based/file_based_source.py: logic to introduce new PermissionsFileBasedStream and Identities streams. We depend if use_permissoins_transfer mode is selected, if not we fallback to DefaultFileBased Stream
  2. airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py: This new stream inherits from DefaultFileBasedStream as we reuse most of the logic to find Files, but we override for providing a schema and read records.
  3. airbyte_cdk/sources/file_based/stream/identities_stream.py: Simple FileIdentities stream to read Identities, again we override, from new Identities stream at global CDK lever as in the future we expect other non-file based connectors will sync Identities.
  4. airbyte_cdk/sources/specs/transfer_modes.py: added a Mode for DeliverPermissions, again we expect this is reusable for other nont-file based stream in the future.
  5. airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py: just moved logic to determine transfer modes here as we were duplicating code for Source and StreamReader.
  6. airbyte_cdk/sources/file_based/file_based_stream_reader.py: new abstract methods for file permissions and identities so streams can use those.

User Impact

We want to enable users for knowledge retrieval scenarios where based on the hierarchy of groups, users and folders in the source, can user A see file B?

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Summary by CodeRabbit

  • New Features

    • Enhanced file transfer options now support permissions replication, enabling delivery of access permissions along with identity data.
    • Introduced an additional delivery method option to mirror source permission restrictions through identity stream inclusion.
  • Tests

    • Added comprehensive tests to ensure the reliability and effectiveness of the new permissions and identity stream functionalities.

@aldogonzalez8 aldogonzalez8 self-assigned this Jan 23, 2025
@github-actions github-actions bot added the enhancement New feature or request label Jan 23, 2025
@aldogonzalez8
Copy link
Contributor Author

/format-fix

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Jan 23, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

Copy link
Contributor

coderabbitai bot commented Jan 23, 2025

📝 Walkthrough

Walkthrough

This pull request enhances the file-based data synchronization framework by adding support for a new permissions delivery method. A new class, DeliverPermissions, is introduced and integrated into the union type of delivery methods. Multiple components are updated to support permissions transfer and identities stream inclusion, including configuration validation, stream creation in FileBasedSource, and stream readers. Additionally, new stream classes are added for handling permissions and identities, along with corresponding unit tests to verify the changes.

Changes

File(s) Change Summary
.../config/abstract_file_based_spec.py, .../specs/transfer_modes.py Added new class DeliverPermissions and updated the delivery_method union type to include permissions.
.../file_based_source.py, .../config/validate_config_transfer_modes.py Updated FileBasedSource for enhanced configuration validation; added _make_identities_stream, _make_file_based_stream, and support for permissions and identities transfers.
.../file_based_stream_reader.py, .../in_memory_files_source.py, .../test_file_based_stream_reader.py Refactored stream reader methods to use external validation functions and added new methods/properties for ACL permissions and identity group management.
.../stream/__init__.py, .../stream/identities_stream.py, .../stream/permissions_file_based_stream.py Added new stream classes (FileIdentitiesStream and PermissionsFileBasedStream) to expose identities and ACL permissions streaming functionalities.
.../scenarios/csv_scenarios.py, .../stream/test_permissions_file_based_stream.py, .../stream/test_file_identities_stream.py Extended unit tests and configuration scenarios to support the new "Replicate Permissions ACL" delivery method.
.../streams/permissions/identities_stream.py Introduced new abstract IdentitiesStream class to manage identity synchronization with state management.

Sequence Diagram(s)

sequenceDiagram
    participant Source as FileBasedSource
    participant Config as ValidateConfigTransferModes
    participant Reader as AbstractFileBasedStreamReader
    participant PermStream as PermissionsFileBasedStream
    participant DefStream as DefaultFileBasedStream
    participant IdStream as FileIdentitiesStream

    Source->>Config: Validate delivery_method config
    alt use_permissions_transfer
        Source->>PermStream: Create PermissionsFileBasedStream
        PermStream->>Reader: get_file_acl_permissions(file)
    else use_file_transfer
        Source->>DefStream: Create DefaultFileBasedStream
    end
    alt include_identities_stream is True
        Source->>IdStream: Create FileIdentitiesStream
        IdStream->>Reader: load_identity_groups()
    end
Loading

Suggested reviewers

  • aaronsteers - wdyt?
  • maxi297 - wdyt?
✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)

25-30: Consider enhancing the field description for clarity?

The description could be more specific about what metadata is being synced and how it differs from file data. Maybe something like: "If enabled, streams will sync file metadata (e.g., file path, permissions, timestamps) instead of the actual file contents.", wdyt?

airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

213-227: Consider adding validation for required metadata fields?

The metadata schema defines several fields but doesn't mark any as required. Should we ensure that critical fields like id and file_path are always present in the metadata records? We could add a required field to the schema, wdyt?

     return {
         "type": "object",
         "properties": {
             "id": {"type": "string"},
             "file_path": {"type": "string"},
             "allowed_identity_remote_ids": {"type": "array", "items": "string"},
             "is_public": {"type": "boolean"},
         },
+        "required": ["id", "file_path"]
     }
airbyte_cdk/sources/file_based/file_based_source.py (1)

391-397: Consider reusing the delivery type check logic?

The _use_records_transfer method has similar logic to _use_file_transfer. Maybe we could extract the common pattern into a helper method to reduce duplication, wdyt?

+    @staticmethod
+    def _has_delivery_type(parsed_config: AbstractFileBasedSpec, delivery_type: str) -> bool:
+        return (
+            hasattr(parsed_config.delivery_method, "delivery_type")
+            and parsed_config.delivery_method.delivery_type == delivery_type
+        )

     @staticmethod
     def _use_records_transfer(parsed_config: AbstractFileBasedSpec) -> bool:
-        use_records_transfer = (
-            hasattr(parsed_config.delivery_method, "delivery_type")
-            and parsed_config.delivery_method.delivery_type == "use_records_transfer"
-        )
-        return use_records_transfer
+        return FileBasedSource._has_delivery_type(parsed_config, "use_records_transfer")
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

512-518: The sync_metadata property looks good, but should we enhance its description?

The implementation looks correct, but we could make the description more specific about what "file metadata" includes. For example, we could mention the specific metadata fields that will be synced (like file size, creation date, etc.). wdyt?

-                                        "description": "If enabled, streams will sync files metadata instead of files data.",
+                                        "description": "If enabled, streams will sync file metadata (e.g., file size, creation date, last modified date) instead of file contents.",
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec7e961 and 79c5f40.

📒 Files selected for processing (5)
  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (3 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (6 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py

[error] 206-206: Missing return statement. If the method is meant to be abstract, use @abc.abstractmethod

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

Line range hint 1-1024: Should we add test scenarios for the new sync_metadata functionality?

The file includes comprehensive test scenarios for various CSV configurations, but I notice we're missing specific test cases for the new sync_metadata feature. Consider adding these test scenarios:

  1. A test case where sync_metadata=True to verify file metadata syncing
  2. A test case with sync_metadata=False (default case) to verify file content syncing
  3. Edge cases like empty files, invalid metadata, etc.

Would you like me to help draft these test scenarios?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (4)

138-145: Hey! What do you think about reducing code duplication here? 🤔

I notice this follows the same pattern as use_file_transfer. Maybe we could extract a helper method to handle the delivery type check? Something like this, wdyt?

+    def _check_delivery_type(self, expected_type: str) -> bool:
+        if self.config:
+            return (
+                hasattr(self.config.delivery_method, "delivery_type")
+                and self.config.delivery_method.delivery_type == expected_type
+            )
+        return False
+
     def use_file_transfer(self) -> bool:
-        if self.config:
-            use_file_transfer = (
-                hasattr(self.config.delivery_method, "delivery_type")
-                and self.config.delivery_method.delivery_type == "use_file_transfer"
-            )
-            return use_file_transfer
-        return False
+        return self._check_delivery_type("use_file_transfer")

     def use_records_transfer(self) -> bool:
-        if self.config:
-            use_records_transfer = (
-                hasattr(self.config.delivery_method, "delivery_type")
-                and self.config.delivery_method.delivery_type == "use_records_transfer"
-            )
-            return use_records_transfer
-        return False
+        return self._check_delivery_type("use_records_transfer")

158-166: Would you mind adding a docstring to clarify the behavior? 📝

The method has a dependency on use_records_transfer() that might not be immediately obvious to other developers. Maybe we could add a docstring explaining this? Here's a suggestion:

     def sync_metadata(self) -> bool:
+        """
+        Check if metadata synchronization is enabled in the config.
+        
+        Returns:
+            bool: True if metadata sync is enabled and records transfer is being used,
+                 False otherwise.
+        
+        Note:
+            This feature requires records transfer to be enabled
+            (delivery_type == "use_records_transfer").
+        """
         if (
             self.config
             and self.use_records_transfer()

214-219: Tiny typo in the docstring ✍️

There's an extra quote mark in the docstring. Would you mind fixing it?

-    def get_metadata_schema(self) -> Dict[str, Any]:
-        """ "
+    def get_metadata_schema(self) -> Dict[str, Any]:
+        """

220-228: How about adding type annotations to the schema? 🤓

The schema could be more specific about the types. Here's a suggestion that adds minItems for arrays and makes the types more explicit, wdyt?

         return {
             "type": "object",
+            "required": ["id", "file_path", "allowed_identity_remote_ids", "is_public"],
             "properties": {
                 "id": {"type": "string"},
                 "file_path": {"type": "string"},
-                "allowed_identity_remote_ids": {"type": "array", "items": "string"},
+                "allowed_identity_remote_ids": {
+                    "type": "array",
+                    "items": {"type": "string"},
+                    "minItems": 1
+                },
                 "is_public": {"type": "boolean"},
             },
         }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79c5f40 and 4638f89.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

206-212: The docstring could use more details about the return format 🎯

The previous review already caught the missing @abstractmethod decorator. Additionally, would you consider enhancing the docstring to describe the expected return format? Something like:

     @abstractmethod
     def get_file_metadata(self, file: RemoteFile, logger: logging.Logger) -> Dict[str, Any]:
         """
-        This is required for connectors that will support syncing
-        metadata from files.
+        Retrieve metadata for a given file.
+        
+        Args:
+            file: The remote file to get metadata for
+            logger: Logger for recording any issues
+        
+        Returns:
+            Dict containing file metadata with at least the following keys:
+            - id (str): Unique identifier for the file
+            - file_path (str): Path to the file
+            - allowed_identity_remote_ids (List[str]): Allowed remote IDs
+            - is_public (bool): Public access flag
         """
         ...

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Jan 23, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@aaronsteers
Copy link
Contributor

@aldogonzalez8 - I'm updating from main to resolve a CI issue on the "Test Connectors" workflow.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

138-146: Consider unifying the naming convention for file transfer.

Currently, we have both use_file_transfer and use_records_transfer. Would it make sense to merge them or rename them for clarity if they serve related purposes, wdyt?


158-167: Consider consolidating conditions or clarifying control flow.

sync_metadata checks multiple conditions similar to use_records_transfer. Perhaps unify or centralize these checks to avoid duplication and simplify readability, wdyt?


213-227: Evaluate if a concrete default implementation is desired.

get_metadata_schema provides a default schema. If every connector must customize this, would marking it abstract be better to ensure an override, wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bfb8c3 and 7e6ca59.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

206-212: Revisit making this method abstract.

There's a past comment suggesting @abstractmethod here, ensuring that all concrete implementations must override get_file_metadata. Do we want to enforce that strictly, wdyt?

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Jan 26, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (5)
airbyte_cdk/sources/file_based/stream/identities_stream.py (1)

80-88: Reduce overly broad exception handling to improve debugging.

Catching Exception swallows all errors, which can mask specific issues. Would you like to narrow the scope by catching only FileBasedSourceError, IO errors, or other relevant exceptions? Wdyt?

airbyte_cdk/sources/file_based/config/permissions.py (1)

28-33: Consider adding docstrings for better maintainability.

The RemoteFilePermissions model looks good, but would you consider adding docstrings to explain the purpose of each field? For example:

 class RemoteFilePermissions(BaseModel):
+    """Model representing permissions for a remote file.
+
+    Attributes:
+        id: Unique identifier for the permission set
+        file_path: Path to the file these permissions apply to
+        allowed_identity_remote_ids: List of identity IDs with access
+        denied_identity_remote_ids: List of identity IDs explicitly denied
+        publicly_accessible: Whether the file is publicly available
+    """
     id: str
     file_path: str
     allowed_identity_remote_ids: list[str] | None = None
     denied_identity_remote_ids: list[str] | None = None
     publicly_accessible: bool = False
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

138-145: Consider reducing code duplication with use_file_transfer.

Both methods follow the same pattern. What do you think about extracting a common helper? Something like:

+    def _check_delivery_type(self, expected_type: str) -> bool:
+        if self.config:
+            return (
+                hasattr(self.config.delivery_method, "delivery_type")
+                and self.config.delivery_method.delivery_type == expected_type
+            )
+        return False
+
     def use_records_transfer(self) -> bool:
-        if self.config:
-            use_records_transfer = (
-                hasattr(self.config.delivery_method, "delivery_type")
-                and self.config.delivery_method.delivery_type == "use_records_transfer"
-            )
-            return use_records_transfer
-        return False
+        return self._check_delivery_type("use_records_transfer")
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)

512-545: Consider enhancing the property definitions for better clarity and validation?

A few suggestions to make these properties more robust:

  1. The description for sync_acl_permissions could be more specific about what "Document allowlists" means
  2. The domain field could benefit from a pattern or format validator to ensure valid domain values
  3. Consider adding examples in the description to illustrate the expected values

What do you think about these improvements?


Line range hint 1-1000: Consider adding test scenarios for the new features?

The test suite is comprehensive for CSV parsing features, but it might be good to add test scenarios for:

  1. The new sync_acl_permissions functionality
  2. The identities configuration with various domain values
  3. Error cases for invalid domain values in the identities configuration

Would you like me to help draft these test scenarios?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between edd6f69 and 35e0e68.

📒 Files selected for processing (10)
  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (2 hunks)
  • airbyte_cdk/sources/file_based/config/identities_based_stream_config.py (1 hunks)
  • airbyte_cdk/sources/file_based/config/permissions.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (7 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
  • airbyte_cdk/sources/file_based/schema_helpers.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/__init__.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (7 hunks)
  • airbyte_cdk/sources/file_based/stream/identities_stream.py (1 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/file_based/file_based_source.py
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/config/identities_based_stream_config.py

[error] 1-5: Import block is un-sorted or un-formatted

airbyte_cdk/sources/file_based/config/permissions.py

[error] 5-11: Import block is un-sorted or un-formatted

airbyte_cdk/sources/file_based/stream/identities_stream.py

[error] 5-29: Import block is un-sorted or un-formatted

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (7)
airbyte_cdk/sources/file_based/stream/__init__.py (1)

3-5: Module export looks good.

Including IdentitiesStream in the __all__ list is consistent with your design. Great job! No changes needed here.

airbyte_cdk/sources/file_based/config/permissions.py (1)

11-25: LGTM! Well-structured identity model.

The RemoteFileIdentity model has comprehensive attributes with appropriate type hints and optional fields.

airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)

28-38: LGTM! Well-configured fields with clear descriptions.

The new fields are properly configured with appropriate titles, descriptions, and default values.

airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

158-166: LGTM! Good conditional checks.

The method properly checks for both records transfer and ACL permissions configuration.


206-211: Add @AbstractMethod decorator to get_file_acl_permissions.

Since this method is required for connectors that support syncing ACL permissions, it should be marked as abstract. WDYT?

+    @abstractmethod
     def get_file_acl_permissions(self, file: RemoteFile, logger: logging.Logger) -> Dict[str, Any]:
         """
         This is required for connectors that will support syncing
         ACL Permissions from files.
         """
-        return {}

213-218: ⚠️ Potential issue

Add @AbstractMethod decorator to load_identity_groups.

Since this method is required for connectors that support syncing identities, it should be marked as abstract. WDYT?

+    @abstractmethod
     def load_identity_groups(self) -> Iterable[Dict[str, Any]]:
         """
         This is required for connectors that will support syncing
         identities.
         """
-        yield {}

Likely invalid or redundant comment.

unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

512-545: Verify the implementation of these new properties across the codebase.

Let's check how these new properties are used in the codebase:

✅ Verification successful

Properties are properly implemented across the codebase

The new properties are well-integrated with consistent implementations in the relevant files:

  • sync_acl_permissions is properly handled in stream readers, sources, and configuration
  • IdentitiesStreamConfig has all the necessary supporting implementations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of the new properties
echo "Searching for sync_acl_permissions usage..."
rg "sync_acl_permissions"

echo "Searching for identities configuration usage..."
rg "IdentitiesStreamConfig"

Length of output: 3397

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/file_based/stream/identities_stream.py (5)

30-33: Could we expand the docstring to clarify usage scenarios?

Adding a bit more detail about the intended use cases for handling identities might help new contributors. Would this be worthwhile? Wdyt?


35-35: Consider allowing is_resumable to be configurable.

Making is_resumable a dynamic property might let advanced users override it. Is this something you'd find useful? Wdyt?


53-61: Could we rename _cursor to _state for clarity?

The setter and getter methods describe “state,” but the backing field is _cursor. Renaming could reduce confusion. Wdyt?


66-90: Would you consider more specific exception handling?

Catching Exception is broad. Perhaps we could separate known errors from unexpected ones for clearer debugging. Wdyt?


98-99: Would storing a real cursor help with incremental sync?

Returning None from get_cursor means we always run a full refresh. Are there future plans for incremental logic? Wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ae4267 and 43e3ea3.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/file_based_source.py (7 hunks)
  • airbyte_cdk/sources/file_based/stream/identities_stream.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/file_based/file_based_source.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/stream/identities_stream.py (1)

5-26: Resolve import sorting to satisfy linters.

A previous comment mentioned that your imports are not strictly sorted. Would you like to apply the suggested fix from earlier or keep this arrangement? Wdyt?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/schema_helpers.py (2)

26-34: Consider adding schema constraints for better validation. WDYT?

The schema looks good! A few suggestions to make it more robust:

  • Add required field to specify which properties are mandatory
  • Add description field to document the purpose of each property
  • Consider adding pattern constraint for file_path to validate path format

Example enhancement:

 remote_file_permissions_schema = {
     "type": "object",
     "properties": {
-        "id": {"type": "string"},
+        "id": {
+            "type": "string",
+            "description": "Unique identifier for the permission record"
+        },
         "file_path": {"type": "string"},
         "allowed_identity_remote_ids": {"type": "array", "items": {"type": "string"}},
         "publicly_accessible": {"type": "boolean"},
     },
+    "required": ["id", "file_path", "allowed_identity_remote_ids", "publicly_accessible"]
 }

36-49: Enhance array type definition and date format specification. WDYT?

A couple of suggestions to improve type safety:

  1. The member_email_addresses array is missing items type specification
  2. The modified_at field could benefit from a date-time format specification

Here's a suggested enhancement:

         "email_address": {"type": ["null", "string"]},
-        "member_email_addresses": {"type": ["null", "array"]},
+        "member_email_addresses": {
+            "type": ["null", "array"],
+            "items": {"type": "string"}
+        },
         "type": {"type": "string"},
-        "modified_at": {"type": "string"},
+        "modified_at": {
+            "type": "string",
+            "format": "date-time"
+        },
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 43e3ea3 and 4aee2c9.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/schema_helpers.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py (4)

2-2: Hey there! Quick note about the copyright year 🗓️

I noticed the copyright year is set to 2025, but we're still in 2024. Should we update this to the current year? wdyt?

-# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
+# Copyright (c) 2024 Airbyte, Inc., all rights reserved.

40-53: Thoughts on enhancing the permissions schema? 🤔

I noticed that _A_RECORD includes a publicly_accessible field, but it's not defined in _A_PERMISSIONS_SCHEMA. Should we add it to ensure complete schema validation? Here's a suggestion:

 _A_PERMISSIONS_SCHEMA = {
     "type": "object",
     "properties": {
         "id": {"type": "string"},
         "remote_id": {"type": "string"},
         "parent_id": {"type": ["null", "string"]},
         "name": {"type": ["null", "string"]},
         "description": {"type": ["null", "string"]},
         "email_address": {"type": ["null", "string"]},
         "member_email_addresses": {"type": ["null", "array"]},
         "type": {"type": "string"},
         "modified_at": {"type": "string"},
+        "publicly_accessible": {"type": "boolean"},
     },
 }

83-83: Shall we clean up this commented line? 🧹

There's a commented-out line that seems unnecessary. Should we remove it to keep the code clean? wdyt?

-        # self._parser.parse_records.return_value = [self._A_RECORD]

82-91: How about adding some edge cases to strengthen the test coverage? 💪

The current test covers the happy path, but what do you think about adding tests for:

  1. Empty permissions
  2. Multiple allowed identities
  3. Error cases (e.g., when get_file_acl_permissions raises an exception)

Here's a quick example of what that might look like:

def test_when_read_records_from_slice_with_empty_permissions_then_return_empty_list(self) -> None:
    self._stream_reader.get_file_acl_permissions.return_value = {}
    messages = list(
        self._stream.read_records_from_slice(
            {"files": [RemoteFile(uri="uri", last_modified=self._NOW)]}
        )
    )
    assert len(list(messages)) == 0
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a71eb08 and a41934a.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (6)

2-2: Update the copyright year.

Hey! I noticed the copyright year is set to 2025. Should we update this to 2024? wdyt? 🤔

-# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
+# Copyright (c) 2024 Airbyte, Inc., all rights reserved.

16-19: Consider enhancing the docstring.

The current docstring is a bit brief. What do you think about adding more details about:

  • The purpose of this stream
  • What kind of permissions it handles
  • How it interacts with the stream_reader
  • Example usage or expected output format

Here's a suggestion, wdyt? 🙂

 """
-The permissions stream, stream_reader on source handles logic for schemas and ACLs permissions.
+A specialized stream for handling file-based ACL permissions.
+
+This stream works with the stream_reader to:
+1. Fetch ACL permissions for each file in the source
+2. Transform permissions into a standardized format
+3. Generate records containing permission information
+
+The stream_reader is responsible for the actual implementation of permission retrieval
+and schema definition, while this class handles the streaming interface.
 """

21-24: Add method documentation and return type hint.

Would you consider adding a docstring to explain why this method overrides the parent implementation? Also, adding a return type hint would improve type safety. Here's a suggestion:

 def _filter_schema_invalid_properties(
     self, configured_catalog_json_schema: Dict[str, Any]
-) -> Dict[str, Any]:
+) -> JsonSchema:
+    """
+    Override parent method to use the permissions schema from stream_reader instead of filtering.
+    
+    Args:
+        configured_catalog_json_schema: The schema from the configured catalog (unused)
+    Returns:
+        The file permissions schema from the stream reader
+    """
     return self.stream_reader.file_permissions_schema

31-46: Consider simplifying the permissions check flow.

The current flow using the no_permissions flag could be simplified. What do you think about this approach? 🤔

 for file in stream_slice["files"]:
-    no_permissions = False
     file_datetime_string = file.last_modified.strftime(self.DATE_TIME_FORMAT)
     try:
         permissions_record = self.stream_reader.get_file_acl_permissions(
             file, logger=self.logger
         )
         if not permissions_record:
-            no_permissions = True
-            continue
+            self.logger.warning(f"Unable to fetch permissions. stream={self.name} file={file.uri}")
+            continue
         permissions_record = self.transform_record(
             permissions_record, file, file_datetime_string
         )
         yield stream_data_to_airbyte_message(
             self.name, permissions_record, is_file_transfer_message=False
         )

47-56: Consider more specific error handling.

Would it be helpful to catch more specific exceptions? For example:

-        except Exception as e:
+        except (ValueError, AttributeError) as e:
             self.logger.error(f"Failed to retrieve permissions for file {file.uri}: {str(e)}")
             yield AirbyteMessage(
                 type=MessageType.LOG,
                 log=AirbyteLogMessage(
                     level=Level.ERROR,
                     message=f"Error retrieving files permissions: stream={self.name} file={file.uri}",
                     stack_trace=traceback.format_exc(),
                 ),
             )
+        except Exception as e:
+            # Re-raise unexpected exceptions
+            raise

67-68: Add method documentation.

Would you consider adding a docstring to explain the purpose of this method? Something like:

 def _get_raw_json_schema(self) -> JsonSchema:
+    """
+    Retrieve the raw JSON schema for file permissions from the stream reader.
+    
+    Returns:
+        The file permissions schema that defines the structure of permission records
+    """
     return self.stream_reader.file_permissions_schema
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a41934a and b764d46.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1 hunks)
  • unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (3)

2-2: Hey, noticed the copyright year is set to 2025! 🤔

The copyright year is set to 2025, which is in the future. Should we update it to the current year (2024), wdyt?

-# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
+# Copyright (c) 2024 Airbyte, Inc., all rights reserved.

34-77: Shall we simplify the logging logic? 🤔

The warning message for missing permissions is currently logged twice:

  1. First time in the try block (line 48-50)
  2. Second time in the finally block (line 69-76)

What do you think about removing the finally block since we already handle the warning in the try block? This would make the code more concise and avoid duplicate logs, wdyt?

             except Exception as e:
                 self.logger.error(f"Failed to retrieve permissions for file {file.uri}: {str(e)}")
                 yield AirbyteMessage(
                     type=MessageType.LOG,
                     log=AirbyteLogMessage(
                         level=Level.ERROR,
                         message=f"Error retrieving files permissions: stream={self.name} file={file.uri}",
                         stack_trace=traceback.format_exc(),
                     ),
                 )
-            finally:
-                if no_permissions:
-                    yield AirbyteMessage(
-                        type=MessageType.LOG,
-                        log=AirbyteLogMessage(
-                            level=Level.WARN,
-                            message=f"Unable to fetch permissions. stream={self.name} file={file.uri}",
-                        ),
-                    )

29-33: Would you mind adding a docstring here? 📝

The method could benefit from a docstring explaining its purpose and why it returns the file_permissions_schema. This would help other developers understand the method's role in schema validation, wdyt?

     def _filter_schema_invalid_properties(
         self, configured_catalog_json_schema: Dict[str, Any]
     ) -> Dict[str, Any]:
+        """
+        Filter and validate the schema for permission records.
+        
+        Args:
+            configured_catalog_json_schema: The schema from the configured catalog
+            
+        Returns:
+            The validated file permissions schema from the stream reader
+        """
         return self.stream_reader.file_permissions_schema
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b764d46 and 9566b3c.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1)

16-27: Love the clear docstring! 👏

The class structure and documentation are excellent. The docstring clearly explains the purpose and the division of responsibilities between the stream and stream_reader.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/file_based/file_based_source.py (5)

82-82: Consider using FileIdentities.IDENTITIES_STREAM_NAME instead of IDENTITIES_STREAM constant?

I notice that we have a duplicate constant definition here. Since we're already using FileIdentities.IDENTITIES_STREAM_NAME in _make_identities_stream, what do you think about removing this constant and using the one from the FileIdentities class throughout? This would help maintain a single source of truth, wdyt?

-IDENTITIES_STREAM = "identities"

174-180: Consider enhancing the identity check error message?

The error message could be more specific about what went wrong. What do you think about providing more context about why the identities couldn't be loaded? For example:

-                    errors.append(
-                        "Unable to get identities for current configuration, please check your credentials"
-                    )
+                    errors.append(
+                        "Unable to get identities for current configuration. This could be due to insufficient permissions or invalid credentials. Please verify your access rights and credentials are correct."
+                    )

185-189: Consider simplifying the availability method selection?

The condition could be more concise using the or operator. What do you think about this approach?

-                availability_method = (
-                    stream.availability_strategy.check_availability
-                    if use_file_transfer(parsed_config) or use_permissions_transfer(parsed_config)
-                    else stream.availability_strategy.check_availability_and_parsability
-                )
+                is_transfer_mode = use_file_transfer(parsed_config) or use_permissions_transfer(parsed_config)
+                availability_method = (
+                    stream.availability_strategy.check_availability if is_transfer_mode
+                    else stream.availability_strategy.check_availability_and_parsability
+                )

322-340: Consider adding return type hint to _make_default_stream?

For consistency with other methods, what do you think about adding a return type hint to this method?

-    def _make_default_stream(
+    def _make_default_stream(
         self,
         stream_config: FileBasedStreamConfig,
         cursor: Optional[AbstractFileBasedCursor],
         parsed_config: AbstractFileBasedSpec,
-    ) -> AbstractFileBasedStream:
+    ) -> DefaultFileBasedStream:

342-356: Consider adding docstrings to new stream creation methods?

The new methods could benefit from docstrings explaining their purpose and parameters. What do you think about adding them?

     def _make_permissions_stream(
         self, stream_config: FileBasedStreamConfig, cursor: Optional[AbstractFileBasedCursor]
     ) -> AbstractFileBasedStream:
+        """
+        Creates a stream for handling file permissions.
+
+        Args:
+            stream_config: Configuration for the stream
+            cursor: Optional cursor for tracking stream progress
+
+        Returns:
+            A stream instance configured for handling permissions
+        """

     def _make_identities_stream(
         self,
     ) -> Stream:
+        """
+        Creates a stream for handling identity information.
+
+        Returns:
+            A stream instance configured for handling identities
+        """

Also applies to: 372-380

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9566b3c and c94b704.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/file_based_source.py (8 hunks)
  • unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_source.py (1)

314-316:

✅ Verification successful

Verify consistency of identity stream naming.

The code looks good! Let's just verify that we're using consistent naming for the identity stream across the codebase.


🏁 Script executed:

#!/bin/bash
# Search for any references to identity/identities stream names to ensure consistency
rg -i "identit(y|ies).*stream" --type py

Length of output: 4071


Stream Naming Consistency Verified!

The identity stream naming is consistently using "identities" across the codebase—both in the permissions stream and in the file-based source. Everything looks aligned, so this part of the change can be approved. wdyt?

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Feb 13, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (3)

2-2: Update copyright year to current year.

The copyright year is set to 2025, but it should reflect the current year (2024). Wdyt about updating it?

-# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
+# Copyright (c) 2024 Airbyte, Inc., all rights reserved.

24-42: Consider using more generic test data.

The test data contains realistic-looking email addresses and UUIDs. To avoid any potential PII concerns and make tests more maintainable, would you consider using more generic test data? For example:

 _A_RECORD = {
-    "id": "923496ab-3eee-47d2-a824-b237e630082a",
-    "remote_id": "[email protected]",
-    "name": "user ond",
-    "email_address": "[email protected]",
-    "member_email_addresses": ["[email protected]", "[email protected]"],
+    "id": "test-user-id-1",
+    "remote_id": "test-user-1",
+    "name": "Test User 1",
+    "email_address": "[email protected]",
+    "member_email_addresses": ["[email protected]", "[email protected]"],
     "type": "user",
     "modified_at": "2025-02-12T23:06:45.304942+00:00",
 }

73-94: Consider adding edge case tests.

The current tests cover the happy path and basic error handling. Would you consider adding tests for edge cases? For example:

  • Empty identity groups
  • Malformed identity records
  • Different sync modes

Here's a sample test for empty groups:

def test_when_read_records_with_empty_groups_then_return_empty(self) -> None:
    self._stream_reader.load_identity_groups.return_value = []
    messages = list(self._stream.read_records(SyncMode.full_refresh))
    assert len(messages) == 0
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c94b704 and 722c7e0.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/streams/permissions/identities.py (1 hunks)
  • unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/streams/permissions/identities.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py

[warning] 5-5: Import block is un-sorted or un-formatted. Organize imports.

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (1)

18-20: Verify if MockFormat class is needed.

The MockFormat class appears to be empty and unused in the tests. Is it needed for future use or can it be removed? Wdyt?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (2)

23-73: Add docstrings to improve test documentation.

The test class and data structures are well organized! Would you consider adding docstrings to explain the test class's purpose and the significance of the test data? Something like this maybe? wdyt?

 class IdentitiesFileBasedStreamTest(unittest.TestCase):
+    """Test suite for FileIdentities stream implementation.
+    
+    Tests the functionality of reading identity records, schema validation,
+    and error handling in the FileIdentities stream.
+    """

74-96: Consider adding edge case tests.

The current test coverage looks great! 🎉 Would you consider adding a few more test cases to cover edge scenarios? Here are some ideas:

  1. Test with empty identity groups
  2. Test with malformed identity records
  3. Test with very large member_email_addresses arrays

What do you think about these additions? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54d5ec6 and afdd60a.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Build and Inspect Python Package
  • GitHub Check: preview_docs
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: MyPy Check
🔇 Additional comments (2)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (2)

5-17: Sort imports to fix linter warning.

The linter is still flagging unsorted imports. Would you like to sort them like this? wdyt?

 import unittest
 from datetime import datetime, timezone
 from unittest.mock import Mock
 
+from airbyte_protocol_dataclasses.models import SyncMode
 from airbyte_cdk.sources.file_based.discovery_policy import AbstractDiscoveryPolicy
-from airbyte_cdk.sources.file_based.exceptions import (
-    FileBasedErrorsCollector,
-)
+from airbyte_cdk.sources.file_based.exceptions import FileBasedErrorsCollector
 from airbyte_cdk.sources.file_based.file_based_stream_reader import AbstractFileBasedStreamReader
 from airbyte_cdk.sources.file_based.stream import FileIdentities
-from airbyte_protocol_dataclasses.models import SyncMode

19-21: Verify if MockFormat class is needed.

I notice this class is empty and doesn't seem to be used in the tests. Could we remove it if it's not needed? wdyt? 🤔

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
unit_tests/sources/file_based/stream/test_file_identities_stream.py (3)

23-59: Consider enhancing test data organization.

The test data looks great! Would you consider extracting the test records and schema into a separate test fixtures file for reusability? This could help if we need to use these test cases in other identity-related tests. wdyt?


60-73: Consider adding type hints for better maintainability.

The setup looks solid! Would you consider adding return type hints to the mocked objects for better code maintainability? For example:

-        self._catalog_schema = Mock()
+        self._catalog_schema: ConfiguredAirbyteCatalog = Mock()
-        self._stream_reader = Mock(spec=AbstractFileBasedStreamReader)
+        self._stream_reader: AbstractFileBasedStreamReader = Mock(spec=AbstractFileBasedStreamReader)

wdyt?


74-96: Consider adding more edge cases.

The test coverage looks good for the happy path and basic error handling! Would you consider adding tests for these scenarios:

  1. Empty identity groups list
  2. Malformed identity record
  3. Missing required fields
  4. Different identity types beyond user/group

This would help ensure robustness across various scenarios. wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between afdd60a and 7b2ffce.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/stream/test_file_identities_stream.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
unit_tests/sources/file_based/stream/test_file_identities_stream.py (1)

1-17: LGTM! Well-organized imports.

The imports are logically grouped and include all necessary dependencies for testing the FileIdentities stream.

Copy link
Contributor

@natikgadzhi natikgadzhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to get this rolling — we might revisit some pieces when we work on sharepoint and that's okay!

@aldogonzalez8 I've left some comments, but they are all non-blocking nitpicks, up to you. Feel free to merge please!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
airbyte_cdk/sources/file_based/stream/identities_stream.py (2)

16-20: Consider renaming for consistency with other stream classes.

The class name FileIdentitiesStream breaks the pattern used by other stream classes like AbstractFileBasedStream and DefaultFileBasedStream. What do you think about renaming it to FileBasedIdentitiesStream to maintain consistency? wdyt?


17-20: Enhance docstring with more details.

The docstring could be more descriptive about the stream's purpose and behavior. Would you consider adding details about:

  • What kind of identities it syncs
  • How it interacts with the stream reader
  • When it should be used
    wdyt?
airbyte_cdk/sources/streams/permissions/identitiesstream.py (1)

56-64: Consider more specific error handling.

The catch-all Exception block could mask specific errors. Would you consider:

  1. Adding specific exception handling for common errors (e.g., network issues, authentication failures)
  2. Including the error type in the log message
    wdyt?
         except AirbyteTracedException as exc:
             raise exc
-        except Exception as e:
+        except (ConnectionError, TimeoutError) as e:
             yield AirbyteMessage(
                 type=MessageType.LOG,
                 log=AirbyteLogMessage(
                     level=Level.ERROR,
-                    message=f"Error trying to read identities: {e} stream={self.name}",
+                    message=f"Network error while reading identities: {type(e).__name__}: {e} stream={self.name}",
                     stack_trace=traceback.format_exc(),
                 ),
             )
+        except Exception as e:
+            yield AirbyteMessage(
+                type=MessageType.LOG,
+                log=AirbyteLogMessage(
+                    level=Level.ERROR,
+                    message=f"Unexpected error while reading identities: {type(e).__name__}: {e} stream={self.name}",
+                    stack_trace=traceback.format_exc(),
+                ),
+            )
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

189-205: Consider adding error handling guidance in docstring.

The docstring provides a great example of implementation but doesn't cover error handling. Would you consider adding guidance on:

  • How to handle permission denied errors
  • What to do when ACL information is unavailable
  • Rate limiting considerations
    wdyt?

208-229: Consider adding pagination guidance in docstring.

The example in the docstring is helpful but doesn't address pagination. Since identity lists can be large, would you consider adding guidance on:

  • How to handle pagination
  • Recommended batch sizes
  • Rate limiting considerations
    wdyt?

232-280: Consider adding schema validation examples.

The schema docstrings are clear but could benefit from validation examples. Would you consider adding:

  • Example of schema validation
  • Common schema validation pitfalls
  • Best practices for extending schemas
    wdyt?
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b2ffce and 1b729d7.

📒 Files selected for processing (7)
  • airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (8 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
  • airbyte_cdk/sources/file_based/stream/__init__.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/identities_stream.py (1 hunks)
  • airbyte_cdk/sources/streams/permissions/identitiesstream.py (1 hunks)
  • unit_tests/sources/file_based/stream/test_file_identities_stream.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • unit_tests/sources/file_based/stream/test_file_identities_stream.py
  • airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py
  • airbyte_cdk/sources/file_based/file_based_source.py
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/stream/__init__.py (1)

3-6: LGTM!

The additions to __all__ correctly expose the new stream classes for external use.

Also applies to: 8-13

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (6)
airbyte_cdk/sources/streams/permissions/identities_stream.py (6)

5-16: How about organizing imports into standard, third-party, and local sections? wdyt?

Consider grouping imports into three sections for better readability:

  1. Standard library (traceback, abc, typing)
  2. Third-party (airbyte_protocol_dataclasses)
  3. Local (airbyte_cdk)
 import traceback
 from abc import ABC, abstractmethod
 from typing import Any, Dict, Iterable, List, Mapping, MutableMapping, Optional
 
 from airbyte_protocol_dataclasses.models import SyncMode
 
+
 from airbyte_cdk.models import AirbyteLogMessage, AirbyteMessage, Level
 from airbyte_cdk.models import Type as MessageType
 from airbyte_cdk.sources.streams import Stream
 from airbyte_cdk.sources.streams.checkpoint import Cursor
 from airbyte_cdk.sources.utils.record_helper import stream_data_to_airbyte_message
 from airbyte_cdk.utils.traced_exception import AirbyteTracedException

19-23: How about enhancing the docstring for better clarity? wdyt?

The docstring could be more descriptive and grammatically correct:

     """
-    The identities stream. A full refresh stream to sync identities from a certain domain.
-    The load_identity_groups method manage the logic to get such data.
+    A stream that synchronizes identity information from a specified domain using full refresh sync mode.
+    Subclasses must implement the load_identity_groups method to provide the logic for retrieving identity data.
     """

25-27: Would you like to add type hints and documentation for class attributes? wdyt?

Consider adding type hints and docstrings for better maintainability:

-    IDENTITIES_STREAM_NAME = "identities"
+    IDENTITIES_STREAM_NAME: str = "identities"
+    """The name of the stream used for identity synchronization."""
 
-    is_resumable = False
+    is_resumable: bool = False
+    """Indicates whether the stream supports resuming from a checkpoint."""

29-41: Shall we add type hints for the _cursor attribute? wdyt?

Consider adding type hints for better type safety:

     def __init__(self) -> None:
         super().__init__()
-        self._cursor: MutableMapping[str, Any] = {}
+        self._cursor: MutableMapping[str, Any] = {}
+        """The cursor state for tracking sync progress."""

42-65: The error handling looks good! How about making the error message more specific? wdyt?

Consider enhancing the error message to include more context:

                 log=AirbyteLogMessage(
                     level=Level.ERROR,
-                    message=f"Error trying to read identities: {e} stream={self.name}",
+                    message=f"Failed to read identity records from domain: {e}. Stream: {self.name}",
                     stack_trace=traceback.format_exc(),
                 ),

66-75: How about adding more detailed docstrings for the methods? wdyt?

Consider enhancing the docstrings to better describe the methods:

     @abstractmethod
     def load_identity_groups(self) -> Iterable[Dict[str, Any]]:
-        raise NotImplementedError("Implement this method to read identity records")
+        """
+        Load identity groups from the domain.
+        
+        Returns:
+            Iterable[Dict[str, Any]]: An iterator of identity records, where each record
+            contains the identity information in a dictionary format.
+        
+        Raises:
+            NotImplementedError: This method must be implemented by subclasses.
+        """
+        raise NotImplementedError("Subclasses must implement load_identity_groups")

     def get_cursor(self) -> Optional[Cursor]:
+        """
+        Get the cursor for this stream.
+        
+        Returns:
+            Optional[Cursor]: Always returns None as this is a full refresh stream.
+        """
         return None
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b729d7 and 7924c3d.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/stream/identities_stream.py (1 hunks)
  • airbyte_cdk/sources/streams/permissions/identities_stream.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/file_based/stream/identities_stream.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/streams/permissions/identities_stream.py (1)

1-76: Great implementation! The code looks solid and well-structured.

The implementation provides a robust foundation for identity synchronization with proper error handling, state management, and extensibility through the abstract method pattern. The code follows good practices and is maintainable.

@aldogonzalez8 aldogonzalez8 dismissed aaronsteers’s stale review February 14, 2025 16:26

I have already implemented the three-radio option that AJ suggested.

@aldogonzalez8 aldogonzalez8 merged commit 10c1085 into main Feb 14, 2025
24 checks passed
@aldogonzalez8 aldogonzalez8 deleted the aldogonzalez8/sync-metadata-records2 branch February 14, 2025 17:31
rpopov added a commit to rpopov/airbyte-python-cdk that referenced this pull request Mar 5, 2025
* main:
  fix: update cryptography package to latest version to address CVE (airbytehq#377)
  fix: (CDK) (HttpRequester) - Make the `HttpRequester.path` optional (airbytehq#370)
  feat: improved custom components handling (airbytehq#350)
  feat: add microseconds timestamp format (airbytehq#373)
  fix: Replace Unidecode with anyascii for permissive license (airbytehq#367)
  feat: add IncrementingCountCursor (airbytehq#346)
  feat: (low-code cdk)  datetime format with milliseconds (airbytehq#369)
  fix: (CDK) (AsyncRetriever) - Improve UX on variable naming and interpolation (airbytehq#368)
  fix: (CDK) (AsyncRetriever) - Add the `request` and `response` to each `async` operations (airbytehq#356)
  fix: (CDK) (ConnectorBuilder) - Add `auxiliary requests` to slice; support `TestRead` for AsyncRetriever (part 1/2) (airbytehq#355)
  feat(concurrent perpartition cursor): Add parent state updates (airbytehq#343)
  fix: update csv parser for builder compatibility (airbytehq#364)
  feat(low-code cdk): add interpolation for limit field in Rate (airbytehq#353)
  feat(low-code cdk): add AbstractStreamFacade processing as concurrent streams in declarative source (airbytehq#347)
  fix: (CDK) (CsvParser) - Fix the `\\` escaping when passing the `delimiter` from Builder's UI (airbytehq#358)
  feat: expose `str_to_datetime` jinja macro (airbytehq#351)
  fix: update CDK migration for 6.34.0 (airbytehq#348)
  feat: Removes `stream_state` interpolation from CDK (airbytehq#320)
  fix(declarative): Pass `extra_fields` in `global_substream_cursor` (airbytehq#195)
  feat(concurrent perpartition cursor): Refactor ConcurrentPerPartitionCursor (airbytehq#331)
  feat(HttpMocker): adding support for PUT requests and bytes responses (airbytehq#342)
  chore: use certified source for manifest-only test (airbytehq#338)
  feat: check for request_option mapping conflicts in individual components (airbytehq#328)
  feat(file-based): sync file acl permissions and identities (airbytehq#260)
  fix: (CDK) (Connector Builder) - refactor the `MessageGrouper` > `TestRead` (airbytehq#332)
  fix(low code): Fix missing cursor for ClientSideIncrementalRecordFilterDecorator (airbytehq#334)
  feat(low-code): Add API Budget (airbytehq#314)
  chore(decoder): clean decoders and make csvdecoder available (airbytehq#326)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants