-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(file-based): sync file acl permissions and identities #260
Conversation
/format-fix |
/autofix
|
📝 WalkthroughWalkthroughThis pull request enhances the file-based data synchronization framework by adding support for a new permissions delivery method. A new class, DeliverPermissions, is introduced and integrated into the union type of delivery methods. Multiple components are updated to support permissions transfer and identities stream inclusion, including configuration validation, stream creation in FileBasedSource, and stream readers. Additionally, new stream classes are added for handling permissions and identities, along with corresponding unit tests to verify the changes. Changes
Sequence Diagram(s)sequenceDiagram
participant Source as FileBasedSource
participant Config as ValidateConfigTransferModes
participant Reader as AbstractFileBasedStreamReader
participant PermStream as PermissionsFileBasedStream
participant DefStream as DefaultFileBasedStream
participant IdStream as FileIdentitiesStream
Source->>Config: Validate delivery_method config
alt use_permissions_transfer
Source->>PermStream: Create PermissionsFileBasedStream
PermStream->>Reader: get_file_acl_permissions(file)
else use_file_transfer
Source->>DefStream: Create DefaultFileBasedStream
end
alt include_identities_stream is True
Source->>IdStream: Create FileIdentitiesStream
IdStream->>Reader: load_identity_groups()
end
Suggested reviewers
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)
25-30
: Consider enhancing the field description for clarity?The description could be more specific about what metadata is being synced and how it differs from file data. Maybe something like: "If enabled, streams will sync file metadata (e.g., file path, permissions, timestamps) instead of the actual file contents.", wdyt?
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
213-227
: Consider adding validation for required metadata fields?The metadata schema defines several fields but doesn't mark any as required. Should we ensure that critical fields like
id
andfile_path
are always present in the metadata records? We could add arequired
field to the schema, wdyt?return { "type": "object", "properties": { "id": {"type": "string"}, "file_path": {"type": "string"}, "allowed_identity_remote_ids": {"type": "array", "items": "string"}, "is_public": {"type": "boolean"}, }, + "required": ["id", "file_path"] }
airbyte_cdk/sources/file_based/file_based_source.py (1)
391-397
: Consider reusing the delivery type check logic?The
_use_records_transfer
method has similar logic to_use_file_transfer
. Maybe we could extract the common pattern into a helper method to reduce duplication, wdyt?+ @staticmethod + def _has_delivery_type(parsed_config: AbstractFileBasedSpec, delivery_type: str) -> bool: + return ( + hasattr(parsed_config.delivery_method, "delivery_type") + and parsed_config.delivery_method.delivery_type == delivery_type + ) @staticmethod def _use_records_transfer(parsed_config: AbstractFileBasedSpec) -> bool: - use_records_transfer = ( - hasattr(parsed_config.delivery_method, "delivery_type") - and parsed_config.delivery_method.delivery_type == "use_records_transfer" - ) - return use_records_transfer + return FileBasedSource._has_delivery_type(parsed_config, "use_records_transfer")unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)
512-518
: The sync_metadata property looks good, but should we enhance its description?The implementation looks correct, but we could make the description more specific about what "file metadata" includes. For example, we could mention the specific metadata fields that will be synced (like file size, creation date, etc.). wdyt?
- "description": "If enabled, streams will sync files metadata instead of files data.", + "description": "If enabled, streams will sync file metadata (e.g., file size, creation date, last modified date) instead of file contents.",
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py
(1 hunks)airbyte_cdk/sources/file_based/file_based_source.py
(3 hunks)airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
(6 hunks)unit_tests/sources/file_based/scenarios/csv_scenarios.py
(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py
[error] 206-206: Missing return statement. If the method is meant to be abstract, use @abc.abstractmethod
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)
Line range hint
1-1024
: Should we add test scenarios for the new sync_metadata functionality?The file includes comprehensive test scenarios for various CSV configurations, but I notice we're missing specific test cases for the new
sync_metadata
feature. Consider adding these test scenarios:
- A test case where
sync_metadata=True
to verify file metadata syncing- A test case with
sync_metadata=False
(default case) to verify file content syncing- Edge cases like empty files, invalid metadata, etc.
Would you like me to help draft these test scenarios?
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (4)
138-145
: Hey! What do you think about reducing code duplication here? 🤔I notice this follows the same pattern as
use_file_transfer
. Maybe we could extract a helper method to handle the delivery type check? Something like this, wdyt?+ def _check_delivery_type(self, expected_type: str) -> bool: + if self.config: + return ( + hasattr(self.config.delivery_method, "delivery_type") + and self.config.delivery_method.delivery_type == expected_type + ) + return False + def use_file_transfer(self) -> bool: - if self.config: - use_file_transfer = ( - hasattr(self.config.delivery_method, "delivery_type") - and self.config.delivery_method.delivery_type == "use_file_transfer" - ) - return use_file_transfer - return False + return self._check_delivery_type("use_file_transfer") def use_records_transfer(self) -> bool: - if self.config: - use_records_transfer = ( - hasattr(self.config.delivery_method, "delivery_type") - and self.config.delivery_method.delivery_type == "use_records_transfer" - ) - return use_records_transfer - return False + return self._check_delivery_type("use_records_transfer")
158-166
: Would you mind adding a docstring to clarify the behavior? 📝The method has a dependency on
use_records_transfer()
that might not be immediately obvious to other developers. Maybe we could add a docstring explaining this? Here's a suggestion:def sync_metadata(self) -> bool: + """ + Check if metadata synchronization is enabled in the config. + + Returns: + bool: True if metadata sync is enabled and records transfer is being used, + False otherwise. + + Note: + This feature requires records transfer to be enabled + (delivery_type == "use_records_transfer"). + """ if ( self.config and self.use_records_transfer()
214-219
: Tiny typo in the docstring ✍️There's an extra quote mark in the docstring. Would you mind fixing it?
- def get_metadata_schema(self) -> Dict[str, Any]: - """ " + def get_metadata_schema(self) -> Dict[str, Any]: + """
220-228
: How about adding type annotations to the schema? 🤓The schema could be more specific about the types. Here's a suggestion that adds
minItems
for arrays and makes the types more explicit, wdyt?return { "type": "object", + "required": ["id", "file_path", "allowed_identity_remote_ids", "is_public"], "properties": { "id": {"type": "string"}, "file_path": {"type": "string"}, - "allowed_identity_remote_ids": {"type": "array", "items": "string"}, + "allowed_identity_remote_ids": { + "type": "array", + "items": {"type": "string"}, + "minItems": 1 + }, "is_public": {"type": "boolean"}, }, }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
206-212
: The docstring could use more details about the return format 🎯The previous review already caught the missing
@abstractmethod
decorator. Additionally, would you consider enhancing the docstring to describe the expected return format? Something like:@abstractmethod def get_file_metadata(self, file: RemoteFile, logger: logging.Logger) -> Dict[str, Any]: """ - This is required for connectors that will support syncing - metadata from files. + Retrieve metadata for a given file. + + Args: + file: The remote file to get metadata for + logger: Logger for recording any issues + + Returns: + Dict containing file metadata with at least the following keys: + - id (str): Unique identifier for the file + - file_path (str): Path to the file + - allowed_identity_remote_ids (List[str]): Allowed remote IDs + - is_public (bool): Public access flag """ ...
/autofix
|
@aldogonzalez8 - I'm updating from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
138-146
: Consider unifying the naming convention for file transfer.Currently, we have both
use_file_transfer
anduse_records_transfer
. Would it make sense to merge them or rename them for clarity if they serve related purposes, wdyt?
158-167
: Consider consolidating conditions or clarifying control flow.
sync_metadata
checks multiple conditions similar touse_records_transfer
. Perhaps unify or centralize these checks to avoid duplication and simplify readability, wdyt?
213-227
: Evaluate if a concrete default implementation is desired.
get_metadata_schema
provides a default schema. If every connector must customize this, would marking it abstract be better to ensure an override, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
206-212
: Revisit making this method abstract.There's a past comment suggesting
@abstractmethod
here, ensuring that all concrete implementations must overrideget_file_metadata
. Do we want to enforce that strictly, wdyt?
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (5)
airbyte_cdk/sources/file_based/stream/identities_stream.py (1)
80-88
: Reduce overly broad exception handling to improve debugging.Catching
Exception
swallows all errors, which can mask specific issues. Would you like to narrow the scope by catching onlyFileBasedSourceError
, IO errors, or other relevant exceptions? Wdyt?airbyte_cdk/sources/file_based/config/permissions.py (1)
28-33
: Consider adding docstrings for better maintainability.The RemoteFilePermissions model looks good, but would you consider adding docstrings to explain the purpose of each field? For example:
class RemoteFilePermissions(BaseModel): + """Model representing permissions for a remote file. + + Attributes: + id: Unique identifier for the permission set + file_path: Path to the file these permissions apply to + allowed_identity_remote_ids: List of identity IDs with access + denied_identity_remote_ids: List of identity IDs explicitly denied + publicly_accessible: Whether the file is publicly available + """ id: str file_path: str allowed_identity_remote_ids: list[str] | None = None denied_identity_remote_ids: list[str] | None = None publicly_accessible: bool = Falseairbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
138-145
: Consider reducing code duplication with use_file_transfer.Both methods follow the same pattern. What do you think about extracting a common helper? Something like:
+ def _check_delivery_type(self, expected_type: str) -> bool: + if self.config: + return ( + hasattr(self.config.delivery_method, "delivery_type") + and self.config.delivery_method.delivery_type == expected_type + ) + return False + def use_records_transfer(self) -> bool: - if self.config: - use_records_transfer = ( - hasattr(self.config.delivery_method, "delivery_type") - and self.config.delivery_method.delivery_type == "use_records_transfer" - ) - return use_records_transfer - return False + return self._check_delivery_type("use_records_transfer")unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)
512-545
: Consider enhancing the property definitions for better clarity and validation?A few suggestions to make these properties more robust:
- The description for
sync_acl_permissions
could be more specific about what "Document allowlists" means- The
domain
field could benefit from a pattern or format validator to ensure valid domain values- Consider adding examples in the description to illustrate the expected values
What do you think about these improvements?
Line range hint
1-1000
: Consider adding test scenarios for the new features?The test suite is comprehensive for CSV parsing features, but it might be good to add test scenarios for:
- The new
sync_acl_permissions
functionality- The
identities
configuration with various domain values- Error cases for invalid domain values in the identities configuration
Would you like me to help draft these test scenarios?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py
(2 hunks)airbyte_cdk/sources/file_based/config/identities_based_stream_config.py
(1 hunks)airbyte_cdk/sources/file_based/config/permissions.py
(1 hunks)airbyte_cdk/sources/file_based/file_based_source.py
(7 hunks)airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)airbyte_cdk/sources/file_based/schema_helpers.py
(1 hunks)airbyte_cdk/sources/file_based/stream/__init__.py
(1 hunks)airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
(7 hunks)airbyte_cdk/sources/file_based/stream/identities_stream.py
(1 hunks)unit_tests/sources/file_based/scenarios/csv_scenarios.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/file_based/file_based_source.py
- airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/config/identities_based_stream_config.py
[error] 1-5: Import block is un-sorted or un-formatted
airbyte_cdk/sources/file_based/config/permissions.py
[error] 5-11: Import block is un-sorted or un-formatted
airbyte_cdk/sources/file_based/stream/identities_stream.py
[error] 5-29: Import block is un-sorted or un-formatted
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (7)
airbyte_cdk/sources/file_based/stream/__init__.py (1)
3-5
: Module export looks good.Including
IdentitiesStream
in the__all__
list is consistent with your design. Great job! No changes needed here.airbyte_cdk/sources/file_based/config/permissions.py (1)
11-25
: LGTM! Well-structured identity model.The RemoteFileIdentity model has comprehensive attributes with appropriate type hints and optional fields.
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)
28-38
: LGTM! Well-configured fields with clear descriptions.The new fields are properly configured with appropriate titles, descriptions, and default values.
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
158-166
: LGTM! Good conditional checks.The method properly checks for both records transfer and ACL permissions configuration.
206-211
: Add @AbstractMethod decorator to get_file_acl_permissions.Since this method is required for connectors that support syncing ACL permissions, it should be marked as abstract. WDYT?
+ @abstractmethod def get_file_acl_permissions(self, file: RemoteFile, logger: logging.Logger) -> Dict[str, Any]: """ This is required for connectors that will support syncing ACL Permissions from files. """ - return {}
213-218
:⚠️ Potential issueAdd @AbstractMethod decorator to load_identity_groups.
Since this method is required for connectors that support syncing identities, it should be marked as abstract. WDYT?
+ @abstractmethod def load_identity_groups(self) -> Iterable[Dict[str, Any]]: """ This is required for connectors that will support syncing identities. """ - yield {}Likely invalid or redundant comment.
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)
512-545
: Verify the implementation of these new properties across the codebase.Let's check how these new properties are used in the codebase:
✅ Verification successful
Properties are properly implemented across the codebase ✅
The new properties are well-integrated with consistent implementations in the relevant files:
sync_acl_permissions
is properly handled in stream readers, sources, and configurationIdentitiesStreamConfig
has all the necessary supporting implementations🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for usage of the new properties echo "Searching for sync_acl_permissions usage..." rg "sync_acl_permissions" echo "Searching for identities configuration usage..." rg "IdentitiesStreamConfig"Length of output: 3397
airbyte_cdk/sources/file_based/config/identities_based_stream_config.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/file_based/stream/identities_stream.py (5)
30-33
: Could we expand the docstring to clarify usage scenarios?Adding a bit more detail about the intended use cases for handling identities might help new contributors. Would this be worthwhile? Wdyt?
35-35
: Consider allowingis_resumable
to be configurable.Making
is_resumable
a dynamic property might let advanced users override it. Is this something you'd find useful? Wdyt?
53-61
: Could we rename_cursor
to_state
for clarity?The setter and getter methods describe “state,” but the backing field is
_cursor
. Renaming could reduce confusion. Wdyt?
66-90
: Would you consider more specific exception handling?Catching
Exception
is broad. Perhaps we could separate known errors from unexpected ones for clearer debugging. Wdyt?
98-99
: Would storing a real cursor help with incremental sync?Returning
None
fromget_cursor
means we always run a full refresh. Are there future plans for incremental logic? Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/file_based_source.py
(7 hunks)airbyte_cdk/sources/file_based/stream/identities_stream.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/file_based/file_based_source.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/stream/identities_stream.py (1)
5-26
: Resolve import sorting to satisfy linters.A previous comment mentioned that your imports are not strictly sorted. Would you like to apply the suggested fix from earlier or keep this arrangement? Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/schema_helpers.py (2)
26-34
: Consider adding schema constraints for better validation. WDYT?The schema looks good! A few suggestions to make it more robust:
- Add
required
field to specify which properties are mandatory- Add
description
field to document the purpose of each property- Consider adding
pattern
constraint forfile_path
to validate path formatExample enhancement:
remote_file_permissions_schema = { "type": "object", "properties": { - "id": {"type": "string"}, + "id": { + "type": "string", + "description": "Unique identifier for the permission record" + }, "file_path": {"type": "string"}, "allowed_identity_remote_ids": {"type": "array", "items": {"type": "string"}}, "publicly_accessible": {"type": "boolean"}, }, + "required": ["id", "file_path", "allowed_identity_remote_ids", "publicly_accessible"] }
36-49
: Enhance array type definition and date format specification. WDYT?A couple of suggestions to improve type safety:
- The
member_email_addresses
array is missing items type specification- The
modified_at
field could benefit from a date-time format specificationHere's a suggested enhancement:
"email_address": {"type": ["null", "string"]}, - "member_email_addresses": {"type": ["null", "array"]}, + "member_email_addresses": { + "type": ["null", "array"], + "items": {"type": "string"} + }, "type": {"type": "string"}, - "modified_at": {"type": "string"}, + "modified_at": { + "type": "string", + "format": "date-time" + },
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/file_based/schema_helpers.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py (4)
2-2
: Hey there! Quick note about the copyright year 🗓️I noticed the copyright year is set to 2025, but we're still in 2024. Should we update this to the current year? wdyt?
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
40-53
: Thoughts on enhancing the permissions schema? 🤔I noticed that
_A_RECORD
includes apublicly_accessible
field, but it's not defined in_A_PERMISSIONS_SCHEMA
. Should we add it to ensure complete schema validation? Here's a suggestion:_A_PERMISSIONS_SCHEMA = { "type": "object", "properties": { "id": {"type": "string"}, "remote_id": {"type": "string"}, "parent_id": {"type": ["null", "string"]}, "name": {"type": ["null", "string"]}, "description": {"type": ["null", "string"]}, "email_address": {"type": ["null", "string"]}, "member_email_addresses": {"type": ["null", "array"]}, "type": {"type": "string"}, "modified_at": {"type": "string"}, + "publicly_accessible": {"type": "boolean"}, }, }
83-83
: Shall we clean up this commented line? 🧹There's a commented-out line that seems unnecessary. Should we remove it to keep the code clean? wdyt?
- # self._parser.parse_records.return_value = [self._A_RECORD]
82-91
: How about adding some edge cases to strengthen the test coverage? 💪The current test covers the happy path, but what do you think about adding tests for:
- Empty permissions
- Multiple allowed identities
- Error cases (e.g., when get_file_acl_permissions raises an exception)
Here's a quick example of what that might look like:
def test_when_read_records_from_slice_with_empty_permissions_then_return_empty_list(self) -> None: self._stream_reader.get_file_acl_permissions.return_value = {} messages = list( self._stream.read_records_from_slice( {"files": [RemoteFile(uri="uri", last_modified=self._NOW)]} ) ) assert len(list(messages)) == 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (6)
2-2
: Update the copyright year.Hey! I noticed the copyright year is set to 2025. Should we update this to 2024? wdyt? 🤔
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
16-19
: Consider enhancing the docstring.The current docstring is a bit brief. What do you think about adding more details about:
- The purpose of this stream
- What kind of permissions it handles
- How it interacts with the stream_reader
- Example usage or expected output format
Here's a suggestion, wdyt? 🙂
""" -The permissions stream, stream_reader on source handles logic for schemas and ACLs permissions. +A specialized stream for handling file-based ACL permissions. + +This stream works with the stream_reader to: +1. Fetch ACL permissions for each file in the source +2. Transform permissions into a standardized format +3. Generate records containing permission information + +The stream_reader is responsible for the actual implementation of permission retrieval +and schema definition, while this class handles the streaming interface. """
21-24
: Add method documentation and return type hint.Would you consider adding a docstring to explain why this method overrides the parent implementation? Also, adding a return type hint would improve type safety. Here's a suggestion:
def _filter_schema_invalid_properties( self, configured_catalog_json_schema: Dict[str, Any] -) -> Dict[str, Any]: +) -> JsonSchema: + """ + Override parent method to use the permissions schema from stream_reader instead of filtering. + + Args: + configured_catalog_json_schema: The schema from the configured catalog (unused) + Returns: + The file permissions schema from the stream reader + """ return self.stream_reader.file_permissions_schema
31-46
: Consider simplifying the permissions check flow.The current flow using the
no_permissions
flag could be simplified. What do you think about this approach? 🤔for file in stream_slice["files"]: - no_permissions = False file_datetime_string = file.last_modified.strftime(self.DATE_TIME_FORMAT) try: permissions_record = self.stream_reader.get_file_acl_permissions( file, logger=self.logger ) if not permissions_record: - no_permissions = True - continue + self.logger.warning(f"Unable to fetch permissions. stream={self.name} file={file.uri}") + continue permissions_record = self.transform_record( permissions_record, file, file_datetime_string ) yield stream_data_to_airbyte_message( self.name, permissions_record, is_file_transfer_message=False )
47-56
: Consider more specific error handling.Would it be helpful to catch more specific exceptions? For example:
- except Exception as e: + except (ValueError, AttributeError) as e: self.logger.error(f"Failed to retrieve permissions for file {file.uri}: {str(e)}") yield AirbyteMessage( type=MessageType.LOG, log=AirbyteLogMessage( level=Level.ERROR, message=f"Error retrieving files permissions: stream={self.name} file={file.uri}", stack_trace=traceback.format_exc(), ), ) + except Exception as e: + # Re-raise unexpected exceptions + raise
67-68
: Add method documentation.Would you consider adding a docstring to explain the purpose of this method? Something like:
def _get_raw_json_schema(self) -> JsonSchema: + """ + Retrieve the raw JSON schema for file permissions from the stream reader. + + Returns: + The file permissions schema that defines the structure of permission records + """ return self.stream_reader.file_permissions_schema
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py
(1 hunks)unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (3)
2-2
: Hey, noticed the copyright year is set to 2025! 🤔The copyright year is set to 2025, which is in the future. Should we update it to the current year (2024), wdyt?
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
34-77
: Shall we simplify the logging logic? 🤔The warning message for missing permissions is currently logged twice:
- First time in the try block (line 48-50)
- Second time in the finally block (line 69-76)
What do you think about removing the finally block since we already handle the warning in the try block? This would make the code more concise and avoid duplicate logs, wdyt?
except Exception as e: self.logger.error(f"Failed to retrieve permissions for file {file.uri}: {str(e)}") yield AirbyteMessage( type=MessageType.LOG, log=AirbyteLogMessage( level=Level.ERROR, message=f"Error retrieving files permissions: stream={self.name} file={file.uri}", stack_trace=traceback.format_exc(), ), ) - finally: - if no_permissions: - yield AirbyteMessage( - type=MessageType.LOG, - log=AirbyteLogMessage( - level=Level.WARN, - message=f"Unable to fetch permissions. stream={self.name} file={file.uri}", - ), - )
29-33
: Would you mind adding a docstring here? 📝The method could benefit from a docstring explaining its purpose and why it returns the file_permissions_schema. This would help other developers understand the method's role in schema validation, wdyt?
def _filter_schema_invalid_properties( self, configured_catalog_json_schema: Dict[str, Any] ) -> Dict[str, Any]: + """ + Filter and validate the schema for permission records. + + Args: + configured_catalog_json_schema: The schema from the configured catalog + + Returns: + The validated file permissions schema from the stream reader + """ return self.stream_reader.file_permissions_schema
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py
(1 hunks)airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1)
16-27
: Love the clear docstring! 👏The class structure and documentation are excellent. The docstring clearly explains the purpose and the division of responsibilities between the stream and stream_reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/file_based/file_based_source.py (5)
82-82
: Consider using FileIdentities.IDENTITIES_STREAM_NAME instead of IDENTITIES_STREAM constant?I notice that we have a duplicate constant definition here. Since we're already using
FileIdentities.IDENTITIES_STREAM_NAME
in_make_identities_stream
, what do you think about removing this constant and using the one from theFileIdentities
class throughout? This would help maintain a single source of truth, wdyt?-IDENTITIES_STREAM = "identities"
174-180
: Consider enhancing the identity check error message?The error message could be more specific about what went wrong. What do you think about providing more context about why the identities couldn't be loaded? For example:
- errors.append( - "Unable to get identities for current configuration, please check your credentials" - ) + errors.append( + "Unable to get identities for current configuration. This could be due to insufficient permissions or invalid credentials. Please verify your access rights and credentials are correct." + )
185-189
: Consider simplifying the availability method selection?The condition could be more concise using the
or
operator. What do you think about this approach?- availability_method = ( - stream.availability_strategy.check_availability - if use_file_transfer(parsed_config) or use_permissions_transfer(parsed_config) - else stream.availability_strategy.check_availability_and_parsability - ) + is_transfer_mode = use_file_transfer(parsed_config) or use_permissions_transfer(parsed_config) + availability_method = ( + stream.availability_strategy.check_availability if is_transfer_mode + else stream.availability_strategy.check_availability_and_parsability + )
322-340
: Consider adding return type hint to _make_default_stream?For consistency with other methods, what do you think about adding a return type hint to this method?
- def _make_default_stream( + def _make_default_stream( self, stream_config: FileBasedStreamConfig, cursor: Optional[AbstractFileBasedCursor], parsed_config: AbstractFileBasedSpec, - ) -> AbstractFileBasedStream: + ) -> DefaultFileBasedStream:
342-356
: Consider adding docstrings to new stream creation methods?The new methods could benefit from docstrings explaining their purpose and parameters. What do you think about adding them?
def _make_permissions_stream( self, stream_config: FileBasedStreamConfig, cursor: Optional[AbstractFileBasedCursor] ) -> AbstractFileBasedStream: + """ + Creates a stream for handling file permissions. + + Args: + stream_config: Configuration for the stream + cursor: Optional cursor for tracking stream progress + + Returns: + A stream instance configured for handling permissions + """ def _make_identities_stream( self, ) -> Stream: + """ + Creates a stream for handling identity information. + + Returns: + A stream instance configured for handling identities + """Also applies to: 372-380
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/file_based_source.py
(8 hunks)unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- unit_tests/sources/file_based/stream/test_permissions_file_based_stream.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_source.py (1)
314-316
:✅ Verification successful
Verify consistency of identity stream naming.
The code looks good! Let's just verify that we're using consistent naming for the identity stream across the codebase.
🏁 Script executed:
#!/bin/bash # Search for any references to identity/identities stream names to ensure consistency rg -i "identit(y|ies).*stream" --type pyLength of output: 4071
Stream Naming Consistency Verified!
The identity stream naming is consistently using
"identities"
across the codebase—both in the permissions stream and in the file-based source. Everything looks aligned, so this part of the change can be approved. wdyt?
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (3)
2-2
: Update copyright year to current year.The copyright year is set to 2025, but it should reflect the current year (2024). Wdyt about updating it?
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved. +# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
24-42
: Consider using more generic test data.The test data contains realistic-looking email addresses and UUIDs. To avoid any potential PII concerns and make tests more maintainable, would you consider using more generic test data? For example:
_A_RECORD = { - "id": "923496ab-3eee-47d2-a824-b237e630082a", - "remote_id": "[email protected]", - "name": "user ond", - "email_address": "[email protected]", - "member_email_addresses": ["[email protected]", "[email protected]"], + "id": "test-user-id-1", + "remote_id": "test-user-1", + "name": "Test User 1", + "email_address": "[email protected]", + "member_email_addresses": ["[email protected]", "[email protected]"], "type": "user", "modified_at": "2025-02-12T23:06:45.304942+00:00", }
73-94
: Consider adding edge case tests.The current tests cover the happy path and basic error handling. Would you consider adding tests for edge cases? For example:
- Empty identity groups
- Malformed identity records
- Different sync modes
Here's a sample test for empty groups:
def test_when_read_records_with_empty_groups_then_return_empty(self) -> None: self._stream_reader.load_identity_groups.return_value = [] messages = list(self._stream.read_records(SyncMode.full_refresh)) assert len(messages) == 0
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/streams/permissions/identities.py
(1 hunks)unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/streams/permissions/identities.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py
[warning] 5-5: Import block is un-sorted or un-formatted. Organize imports.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (1)
18-20
: Verify if MockFormat class is needed.The
MockFormat
class appears to be empty and unused in the tests. Is it needed for future use or can it be removed? Wdyt?
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (2)
23-73
: Add docstrings to improve test documentation.The test class and data structures are well organized! Would you consider adding docstrings to explain the test class's purpose and the significance of the test data? Something like this maybe? wdyt?
class IdentitiesFileBasedStreamTest(unittest.TestCase): + """Test suite for FileIdentities stream implementation. + + Tests the functionality of reading identity records, schema validation, + and error handling in the FileIdentities stream. + """
74-96
: Consider adding edge case tests.The current test coverage looks great! 🎉 Would you consider adding a few more test cases to cover edge scenarios? Here are some ideas:
- Test with empty identity groups
- Test with malformed identity records
- Test with very large member_email_addresses arrays
What do you think about these additions? 🤔
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Build and Inspect Python Package
- GitHub Check: preview_docs
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: MyPy Check
🔇 Additional comments (2)
unit_tests/sources/file_based/stream/concurrent/test_file_identities_stream.py (2)
5-17
: Sort imports to fix linter warning.The linter is still flagging unsorted imports. Would you like to sort them like this? wdyt?
import unittest from datetime import datetime, timezone from unittest.mock import Mock +from airbyte_protocol_dataclasses.models import SyncMode from airbyte_cdk.sources.file_based.discovery_policy import AbstractDiscoveryPolicy -from airbyte_cdk.sources.file_based.exceptions import ( - FileBasedErrorsCollector, -) +from airbyte_cdk.sources.file_based.exceptions import FileBasedErrorsCollector from airbyte_cdk.sources.file_based.file_based_stream_reader import AbstractFileBasedStreamReader from airbyte_cdk.sources.file_based.stream import FileIdentities -from airbyte_protocol_dataclasses.models import SyncMode
19-21
: Verify if MockFormat class is needed.I notice this class is empty and doesn't seem to be used in the tests. Could we remove it if it's not needed? wdyt? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
unit_tests/sources/file_based/stream/test_file_identities_stream.py (3)
23-59
: Consider enhancing test data organization.The test data looks great! Would you consider extracting the test records and schema into a separate test fixtures file for reusability? This could help if we need to use these test cases in other identity-related tests. wdyt?
60-73
: Consider adding type hints for better maintainability.The setup looks solid! Would you consider adding return type hints to the mocked objects for better code maintainability? For example:
- self._catalog_schema = Mock() + self._catalog_schema: ConfiguredAirbyteCatalog = Mock() - self._stream_reader = Mock(spec=AbstractFileBasedStreamReader) + self._stream_reader: AbstractFileBasedStreamReader = Mock(spec=AbstractFileBasedStreamReader)wdyt?
74-96
: Consider adding more edge cases.The test coverage looks good for the happy path and basic error handling! Would you consider adding tests for these scenarios:
- Empty identity groups list
- Malformed identity record
- Missing required fields
- Different identity types beyond user/group
This would help ensure robustness across various scenarios. wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/file_based/stream/test_file_identities_stream.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
unit_tests/sources/file_based/stream/test_file_identities_stream.py (1)
1-17
: LGTM! Well-organized imports.The imports are logically grouped and include all necessary dependencies for testing the FileIdentities stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to get this rolling — we might revisit some pieces when we work on sharepoint and that's okay!
@aldogonzalez8 I've left some comments, but they are all non-blocking nitpicks, up to you. Feel free to merge please!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
airbyte_cdk/sources/file_based/stream/identities_stream.py (2)
16-20
: Consider renaming for consistency with other stream classes.The class name
FileIdentitiesStream
breaks the pattern used by other stream classes likeAbstractFileBasedStream
andDefaultFileBasedStream
. What do you think about renaming it toFileBasedIdentitiesStream
to maintain consistency? wdyt?
17-20
: Enhance docstring with more details.The docstring could be more descriptive about the stream's purpose and behavior. Would you consider adding details about:
- What kind of identities it syncs
- How it interacts with the stream reader
- When it should be used
wdyt?airbyte_cdk/sources/streams/permissions/identitiesstream.py (1)
56-64
: Consider more specific error handling.The catch-all
Exception
block could mask specific errors. Would you consider:
- Adding specific exception handling for common errors (e.g., network issues, authentication failures)
- Including the error type in the log message
wdyt?except AirbyteTracedException as exc: raise exc - except Exception as e: + except (ConnectionError, TimeoutError) as e: yield AirbyteMessage( type=MessageType.LOG, log=AirbyteLogMessage( level=Level.ERROR, - message=f"Error trying to read identities: {e} stream={self.name}", + message=f"Network error while reading identities: {type(e).__name__}: {e} stream={self.name}", stack_trace=traceback.format_exc(), ), ) + except Exception as e: + yield AirbyteMessage( + type=MessageType.LOG, + log=AirbyteLogMessage( + level=Level.ERROR, + message=f"Unexpected error while reading identities: {type(e).__name__}: {e} stream={self.name}", + stack_trace=traceback.format_exc(), + ), + )airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
189-205
: Consider adding error handling guidance in docstring.The docstring provides a great example of implementation but doesn't cover error handling. Would you consider adding guidance on:
- How to handle permission denied errors
- What to do when ACL information is unavailable
- Rate limiting considerations
wdyt?
208-229
: Consider adding pagination guidance in docstring.The example in the docstring is helpful but doesn't address pagination. Since identity lists can be large, would you consider adding guidance on:
- How to handle pagination
- Recommended batch sizes
- Rate limiting considerations
wdyt?
232-280
: Consider adding schema validation examples.The schema docstrings are clear but could benefit from validation examples. Would you consider adding:
- Example of schema validation
- Common schema validation pitfalls
- Best practices for extending schemas
wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py
(1 hunks)airbyte_cdk/sources/file_based/file_based_source.py
(8 hunks)airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)airbyte_cdk/sources/file_based/stream/__init__.py
(1 hunks)airbyte_cdk/sources/file_based/stream/identities_stream.py
(1 hunks)airbyte_cdk/sources/streams/permissions/identitiesstream.py
(1 hunks)unit_tests/sources/file_based/stream/test_file_identities_stream.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- unit_tests/sources/file_based/stream/test_file_identities_stream.py
- airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py
- airbyte_cdk/sources/file_based/file_based_source.py
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/stream/__init__.py (1)
3-6
: LGTM!The additions to
__all__
correctly expose the new stream classes for external use.Also applies to: 8-13
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
airbyte_cdk/sources/streams/permissions/identities_stream.py (6)
5-16
: How about organizing imports into standard, third-party, and local sections? wdyt?Consider grouping imports into three sections for better readability:
- Standard library (
traceback
,abc
,typing
)- Third-party (
airbyte_protocol_dataclasses
)- Local (
airbyte_cdk
)import traceback from abc import ABC, abstractmethod from typing import Any, Dict, Iterable, List, Mapping, MutableMapping, Optional from airbyte_protocol_dataclasses.models import SyncMode + from airbyte_cdk.models import AirbyteLogMessage, AirbyteMessage, Level from airbyte_cdk.models import Type as MessageType from airbyte_cdk.sources.streams import Stream from airbyte_cdk.sources.streams.checkpoint import Cursor from airbyte_cdk.sources.utils.record_helper import stream_data_to_airbyte_message from airbyte_cdk.utils.traced_exception import AirbyteTracedException
19-23
: How about enhancing the docstring for better clarity? wdyt?The docstring could be more descriptive and grammatically correct:
""" - The identities stream. A full refresh stream to sync identities from a certain domain. - The load_identity_groups method manage the logic to get such data. + A stream that synchronizes identity information from a specified domain using full refresh sync mode. + Subclasses must implement the load_identity_groups method to provide the logic for retrieving identity data. """
25-27
: Would you like to add type hints and documentation for class attributes? wdyt?Consider adding type hints and docstrings for better maintainability:
- IDENTITIES_STREAM_NAME = "identities" + IDENTITIES_STREAM_NAME: str = "identities" + """The name of the stream used for identity synchronization.""" - is_resumable = False + is_resumable: bool = False + """Indicates whether the stream supports resuming from a checkpoint."""
29-41
: Shall we add type hints for the _cursor attribute? wdyt?Consider adding type hints for better type safety:
def __init__(self) -> None: super().__init__() - self._cursor: MutableMapping[str, Any] = {} + self._cursor: MutableMapping[str, Any] = {} + """The cursor state for tracking sync progress."""
42-65
: The error handling looks good! How about making the error message more specific? wdyt?Consider enhancing the error message to include more context:
log=AirbyteLogMessage( level=Level.ERROR, - message=f"Error trying to read identities: {e} stream={self.name}", + message=f"Failed to read identity records from domain: {e}. Stream: {self.name}", stack_trace=traceback.format_exc(), ),
66-75
: How about adding more detailed docstrings for the methods? wdyt?Consider enhancing the docstrings to better describe the methods:
@abstractmethod def load_identity_groups(self) -> Iterable[Dict[str, Any]]: - raise NotImplementedError("Implement this method to read identity records") + """ + Load identity groups from the domain. + + Returns: + Iterable[Dict[str, Any]]: An iterator of identity records, where each record + contains the identity information in a dictionary format. + + Raises: + NotImplementedError: This method must be implemented by subclasses. + """ + raise NotImplementedError("Subclasses must implement load_identity_groups") def get_cursor(self) -> Optional[Cursor]: + """ + Get the cursor for this stream. + + Returns: + Optional[Cursor]: Always returns None as this is a full refresh stream. + """ return None
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/stream/identities_stream.py
(1 hunks)airbyte_cdk/sources/streams/permissions/identities_stream.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/file_based/stream/identities_stream.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/streams/permissions/identities_stream.py (1)
1-76
: Great implementation! The code looks solid and well-structured.The implementation provides a robust foundation for identity synchronization with proper error handling, state management, and extensibility through the abstract method pattern. The code follows good practices and is maintainable.
I have already implemented the three-radio option that AJ suggested.
* main: fix: update cryptography package to latest version to address CVE (airbytehq#377) fix: (CDK) (HttpRequester) - Make the `HttpRequester.path` optional (airbytehq#370) feat: improved custom components handling (airbytehq#350) feat: add microseconds timestamp format (airbytehq#373) fix: Replace Unidecode with anyascii for permissive license (airbytehq#367) feat: add IncrementingCountCursor (airbytehq#346) feat: (low-code cdk) datetime format with milliseconds (airbytehq#369) fix: (CDK) (AsyncRetriever) - Improve UX on variable naming and interpolation (airbytehq#368) fix: (CDK) (AsyncRetriever) - Add the `request` and `response` to each `async` operations (airbytehq#356) fix: (CDK) (ConnectorBuilder) - Add `auxiliary requests` to slice; support `TestRead` for AsyncRetriever (part 1/2) (airbytehq#355) feat(concurrent perpartition cursor): Add parent state updates (airbytehq#343) fix: update csv parser for builder compatibility (airbytehq#364) feat(low-code cdk): add interpolation for limit field in Rate (airbytehq#353) feat(low-code cdk): add AbstractStreamFacade processing as concurrent streams in declarative source (airbytehq#347) fix: (CDK) (CsvParser) - Fix the `\\` escaping when passing the `delimiter` from Builder's UI (airbytehq#358) feat: expose `str_to_datetime` jinja macro (airbytehq#351) fix: update CDK migration for 6.34.0 (airbytehq#348) feat: Removes `stream_state` interpolation from CDK (airbytehq#320) fix(declarative): Pass `extra_fields` in `global_substream_cursor` (airbytehq#195) feat(concurrent perpartition cursor): Refactor ConcurrentPerPartitionCursor (airbytehq#331) feat(HttpMocker): adding support for PUT requests and bytes responses (airbytehq#342) chore: use certified source for manifest-only test (airbytehq#338) feat: check for request_option mapping conflicts in individual components (airbytehq#328) feat(file-based): sync file acl permissions and identities (airbytehq#260) fix: (CDK) (Connector Builder) - refactor the `MessageGrouper` > `TestRead` (airbytehq#332) fix(low code): Fix missing cursor for ClientSideIncrementalRecordFilterDecorator (airbytehq#334) feat(low-code): Add API Budget (airbytehq#314) chore(decoder): clean decoders and make csvdecoder available (airbytehq#326)
What
We want to introduce logic for syncing access control lists and other metadata and Identities for connectors, specifically in this implementation for File Bases connectors.
If you are wondering what we mean by identities and ACLs here are a couple of documents that should clarify the general idea:
Relates to [source-google-drive] Add ACL: permissions colmns and identities stream
Reltaes to [source-microsoft-sharepoint] Implement Identities and Permissions
How
We will leverage DefaultFileBased stream and stream reader to reuse most of the logic for scrapping files and let connectors implement the logic from the domain they handle. This initial iteration scopes introducing ACLs and Identities for source-google-drive in this PR, but in the future, we expect to add more connectors.
In the UI we will add a new Transfer Mode to Replicate Permissions ACL
Note: This image is from source-google-drive, which introduces a domain optional field; this is not part of base config mode as the domain could not be a concept for other connectors.
Review guide
airbyte_cdk/sources/file_based/file_based_source.py
: logic to introduce new PermissionsFileBasedStream and Identities streams. We depend if use_permissoins_transfer mode is selected, if not we fallback to DefaultFileBased Streamairbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py
: This new stream inherits from DefaultFileBasedStream as we reuse most of the logic to find Files, but we override for providing a schema and read records.airbyte_cdk/sources/file_based/stream/identities_stream.py
: Simple FileIdentities stream to read Identities, again we override, from new Identities stream at global CDK lever as in the future we expect other non-file based connectors will sync Identities.airbyte_cdk/sources/specs/transfer_modes.py
: added a Mode for DeliverPermissions, again we expect this is reusable for other nont-file based stream in the future.airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py
: just moved logic to determine transfer modes here as we were duplicating code for Source and StreamReader.airbyte_cdk/sources/file_based/file_based_stream_reader.py
: new abstract methods for file permissions and identities so streams can use those.User Impact
We want to enable users for knowledge retrieval scenarios where based on the hierarchy of groups, users and folders in the source, can user A see file B?
Can this PR be safely reverted and rolled back?
Summary by CodeRabbit
New Features
Tests