
chore(decoder): clean decoders and make csvdecoder available #326

Merged
merged 11 commits on Feb 12, 2025

Conversation

maxi297
Contributor

@maxi297 maxi297 commented Feb 7, 2025

What

https://github.com/airbytehq/airbyte-internal-issues/issues/11616

This is a breaking change, but it only affects an experimental component and one that is used solely in source-amplitude, so I'm fine keeping this a minor version bump.

Note that this means we will parse each response twice instead of relying on the cached in-memory value of response.json() from the requests library. However, we expect parsing with orjson to be roughly twice as fast as the standard library, so we don't anticipate a performance hit even with the double parsing.
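As a rough sketch of the non-streamed path described above (using the stdlib `json` module here for portability; the CDK itself parses with orjson):

```python
import io
import json  # the CDK uses orjson here; stdlib json keeps this sketch portable

def decode_non_streamed(content: bytes):
    """Re-parse raw response bytes on every call instead of relying on the
    value cached by requests' response.json()."""
    # Wrapping the bytes in BytesIO mirrors how the non-streamed decoder
    # feeds content to its parser; each call parses from scratch.
    buffer = io.BytesIO(content)
    yield json.loads(buffer.read())

# Parsing twice is safe because the bytes are re-wrapped on each call.
first = list(decode_non_streamed(b'{"id": 1}'))
second = list(decode_non_streamed(b'{"id": 1}'))
```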

Summary by CodeRabbit

  • New Features

    • Introduced a unified decoding framework supporting multiple data types (e.g., CSV, JSON, and compressed formats) for enhanced flexibility.
    • Added new decoders for CSV and Gzip formats.
  • Refactor

    • Streamlined the data extraction process by consolidating redundant decoding components and improving handling for both streamed and non-streamed responses.
  • Tests

    • Expanded test coverage to validate improved response processing and enhanced memory usage efficiency under various conditions.
    • Added new test cases for the CompositeRawDecoder to ensure correct behavior with consumed and non-streamed responses.
    • Updated tests to reflect changes in decoder implementations and removed obsolete tests.

@github-actions github-actions bot added the chore label Feb 7, 2025
Contributor

coderabbitai bot commented Feb 7, 2025

📝 Walkthrough

Walkthrough

This PR refactors the decoding and parsing architecture. It removes several deprecated decoders and parsers (e.g., GzipJsonDecoder, JsonParser, JsonLineParser, CsvParser) and introduces a unified approach with a new GzipDecoder and renamed CsvDecoder. The CompositeRawDecoder now supports configurable streaming via a new stream_response flag, and the JsonDecoder has been restructured to delegate to it. Updates are applied in component schemas, the ModelToComponentFactory, and multiple test files to align with the new decoder interface.

Changes

File(s) Change Summary
airbyte_cdk/.../declarative_component_schema.yaml, airbyte_cdk/.../declarative_component_schema.py Removed obsolete decoder/parser components (GzipJsonDecoder, JsonParser, JsonLineParser, CsvParser) and introduced new ones (GzipDecoder, CsvDecoder). Updated ZipfileDecoder, SimpleRetriever, AsyncRetriever, and SessionTokenAuthenticator to reference the new decoder properties.
airbyte_cdk/.../decoders/{__init__.py, composite_raw_decoder.py, json_decoder.py} Removed deprecated decoders from the public API; added a stream_response flag to CompositeRawDecoder; restructured JsonDecoder by removing the @dataclass decorator, defining an explicit constructor, delegating streaming logic, and simplifying error handling.
airbyte_cdk/.../parsers/model_to_component_factory.py Consolidated decoder creation methods: introduced create_csv_decoder and updated create_json_decoder and create_zipfile_decoder to use the new unified decoder interface.
unit_tests/.../(auth/test_token_provider.py, decoders/{test_composite_decoder.py, test_decoders_memory_usage.py, test_json_decoder.py}, extractors/test_dpath_extractor.py) Updated tests to reflect the new decoding architecture: replaced deprecated decoders with CompositeRawDecoder where applicable, adjusted response handling (using json.dumps() for content), and removed tests for obsolete gzip decoding functionality.
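The test-side change noted above (setting response content to encoded JSON instead of stubbing response.json()) could be sketched as follows; the access_token field name is illustrative, not taken from the actual tests:

```python
import json
from unittest.mock import MagicMock

# Build a mock response the way the updated tests do: put encoded JSON on
# .content so the decoder re-parses bytes, rather than stubbing .json().
response = MagicMock()
response.content = json.dumps({"access_token": "abc"}).encode("utf-8")

# The decoder now reads raw bytes, so round-tripping through json.loads
# exercises the same path the production code takes.
parsed = json.loads(response.content)
```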

Sequence Diagram(s)

sequenceDiagram
    participant C as Caller
    participant CRD as CompositeRawDecoder
    participant P as Parser

    C->>CRD: decode(response)
    alt stream_response is True
       CRD->>P: parse(response.raw)
    else stream_response is False
       CRD->>CRD: wrap response.content in BytesIO
       CRD->>P: parse(wrapped content)
    end
    P-->>CRD: return parsed data
    CRD-->>C: yield decoded data
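The branching in the diagram can be sketched in Python. This is a simplified stand-in for the real CompositeRawDecoder and JsonLineParser, not the CDK implementation:

```python
import io
import json
from dataclasses import dataclass
from typing import IO, Any, Dict, Generator

class JsonLineParser:
    """Minimal stand-in parser: one JSON document per line."""
    def parse(self, data: IO[bytes]) -> Generator[Dict[str, Any], None, None]:
        for line in data:
            if line.strip():
                yield json.loads(line)

class FakeResponse:
    """Stand-in for requests.Response exposing .raw and .content."""
    def __init__(self, content: bytes) -> None:
        self.content = content
        self.raw = io.BytesIO(content)  # single-use stream, like a live response

@dataclass
class CompositeRawDecoderSketch:
    parser: JsonLineParser
    stream_response: bool = True

    def decode(self, response: FakeResponse) -> Generator[Dict[str, Any], None, None]:
        if self.stream_response:
            # Streamed: single pass over response.raw, memory efficient.
            yield from self.parser.parse(response.raw)
        else:
            # Non-streamed: wrap .content so the response can be decoded again.
            yield from self.parser.parse(io.BytesIO(response.content))

decoder = CompositeRawDecoderSketch(parser=JsonLineParser(), stream_response=False)
resp = FakeResponse(b'{"a": 1}\n{"a": 2}\n')
once = list(decoder.decode(resp))
twice = list(decoder.decode(resp))  # works only because stream_response=False
```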

Suggested labels

enhancement

Suggested reviewers

  • artem1205

How does this updated setup look to you? Any tweaks you'd like to make?


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (15)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)

2035-2036: Consider leveraging the model parameters or removing the unused argument.

Right now, this method always returns a new JsonDecoder with empty parameters, ignoring the passed-in model. Would it make sense to incorporate model parameters or drop the unused argument to avoid confusion, wdyt?


2039-2040: Make stream_response configurable or confirm it’s always false.

Here, you set stream_response=False for CSV. Are you certain that no streaming scenario is needed for CSV data, or would making it configurable benefit some use cases, wdyt?


2061-2061: Check for ZipfileDecoder parameters.

Currently, the created ZipfileDecoder ignores additional parameters in model.decoder or model.parameters. Do you want to forward them to the parser, or is this intentional, wdyt?


2064-2077: Consider exposing parameter checks & fallback for decoders.

  1. The _get_parser method doesn't incorporate model.parameters. If additional settings (like encoding) are required, you might unify that logic here.
  2. There's a potential for infinitely nested GzipParser if user misconfigures the inner_decoder repeatedly. A recursion limit or check might help.
    Wdyt about adding these safeguards?
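A hypothetical recursion safeguard along the lines of the second point might look like this; MAX_DECODER_DEPTH and the dict-based model shape are assumptions for illustration, not CDK code:

```python
MAX_DECODER_DEPTH = 10  # assumed limit, not part of the CDK

def build_parser(decoder_model: dict, depth: int = 0):
    """Resolve a (possibly nested) decoder model into a parser tuple,
    refusing pathologically deep nesting."""
    if depth > MAX_DECODER_DEPTH:
        raise ValueError(
            f"Decoder nesting exceeds {MAX_DECODER_DEPTH} levels; "
            "check for a mis-configured inner_decoder."
        )
    if decoder_model["type"] == "GzipDecoder":
        # Recurse into the inner decoder, tracking depth.
        inner = build_parser(decoder_model["inner_decoder"], depth + 1)
        return ("GzipParser", inner)
    return (decoder_model["type"] + "Parser", None)

nested = {"type": "GzipDecoder", "inner_decoder": {"type": "Json"}}
parser = build_parser(nested)
```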
airbyte_cdk/sources/declarative/decoders/json_decoder.py (2)

24-25: Consider making 'stream_response' a parameter.
It's currently hardcoded to False. Would you like to introduce a parameter to toggle streaming for future flexibility, wdyt?


36-41: Catching broad exceptions.
Catching Exception might mask unexpected errors. Would you like to handle a more specific exception type, wdyt?

unit_tests/sources/declarative/decoders/test_json_decoder.py (2)

11-13: Great alignment with the new composite decoders!
This import approach looks consistent. Would you consider adding more test coverage to verify interplay between CompositeRawDecoder and JsonDecoder, wdyt?


44-45: Testing partial streaming scenarios?
We now set stream=True. Would you like to add tests confirming that partial lines or chunked responses are handled gracefully, wdyt?

unit_tests/sources/declarative/auth/test_token_provider.py (1)

58-60: Testing updated token response.
This properly simulates a new token. Maybe we could also test invalid JSON scenarios to ensure robustness, wdyt?

unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

24-24: Consider adding a comment explaining the stream_response flag.

The initialization looks good, but since this is a test file, it might be helpful to add a comment explaining why stream_response=True is needed here, wdyt?

-decoder_jsonl = CompositeRawDecoder(parser=JsonLineParser(), stream_response=True)
+# stream_response=True is required for JSONL parsing to handle streaming responses correctly
+decoder_jsonl = CompositeRawDecoder(parser=JsonLineParser(), stream_response=True)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)

142-145: Consider adding docstring for the decode method.

The implementation looks good, but since this is a significant change in behavior, would you consider adding a docstring explaining the difference between streaming and non-streaming modes, wdyt?

 def decode(
     self, response: requests.Response
 ) -> Generator[MutableMapping[str, Any], None, None]:
+    """Decode the response based on stream_response setting.
+    
+    When stream_response is True:
+      - Uses response.raw for streaming parsing
+      - Suitable for large responses or JSONL format
+    When stream_response is False:
+      - Uses response.content with BytesIO
+      - Suitable for responses that need to be parsed multiple times
+    """
     if self.is_stream_response():
         yield from self.parser.parse(data=response.raw)  # type: ignore[arg-type]
     else:
         yield from self.parser.parse(data=io.BytesIO(response.content))

134-134: Nice addition of streaming control! Consider adding docstring?

The new stream_response flag and its implementation look good. Would you consider adding a docstring to explain when to use each mode? For example:

 stream_response: bool = True
+    """
+    Controls how responses are processed:
+    - True: Streams response.raw directly (memory efficient for large responses)
+    - False: Loads response.content into memory (allows multiple iterations)
+    """

Also applies to: 136-137, 142-145

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1268-1272: Consider adding docstring for CsvDecoder.

The implementation looks good, but would you consider adding a docstring explaining the purpose and configuration options of the CSV decoder, wdyt?

 class CsvDecoder(BaseModel):
     type: Literal["CsvDecoder"]
+    """Decoder for CSV formatted data.
+    
+    Attributes:
+        encoding: The character encoding to use (default: utf-8)
+        delimiter: The character used to separate fields (default: comma)
+    """
     encoding: Optional[str] = "utf-8"
     delimiter: Optional[str] = ","
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

3012-3025: CsvDecoder – Making CSV decoding available
Introducing the CsvDecoder with clear defaults (utf-8 encoding and a comma delimiter) is a clean and welcome addition. It looks like it accomplishes the PR objective to make CSV decoding available while cleaning up the decoders. Would you be open to adding some tests for different CSV configurations to ensure robustness? wdyt?
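Tests over different configurations could follow this shape, using the stdlib csv module as a stand-in for the decoder's parser; the defaults match the schema (utf-8 encoding, comma delimiter):

```python
import csv
import io

def parse_csv(data: bytes, encoding: str = "utf-8", delimiter: str = ","):
    """Parse CSV bytes into dict rows, mirroring the CsvDecoder defaults."""
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding)
    yield from csv.DictReader(text, delimiter=delimiter)

# Cover the default delimiter, a non-default delimiter, and a non-UTF-8 encoding.
cases = [
    (b"a,b\n1,2\n", "utf-8", ",", [{"a": "1", "b": "2"}]),
    (b"a;b\n1;2\n", "utf-8", ";", [{"a": "1", "b": "2"}]),
    ("a,b\n\u00e4,\u00f6\n".encode("iso-8859-1"), "iso-8859-1", ",",
     [{"a": "\u00e4", "b": "\u00f6"}]),
]
results = [list(parse_csv(data, enc, delim)) for data, enc, delim, _ in cases]
```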

unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

203-213: Great test for stream consumption! Consider adding error message check?

The test for streamed response consumption looks good. Would you consider also asserting the specific error message to ensure the right error is being raised? Something like:

-    with pytest.raises(Exception):
+    with pytest.raises(Exception, match="Response body has already been consumed"):
         list(composite_raw_decoder.decode(response))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6260248 and 6e79ecf.

📒 Files selected for processing (11)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
  • airbyte_cdk/sources/declarative/decoders/__init__.py (0 hunks)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2 hunks)
  • airbyte_cdk/sources/declarative/decoders/json_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (8 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)
  • unit_tests/sources/declarative/auth/test_token_provider.py (3 hunks)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py (0 hunks)
  • unit_tests/sources/declarative/decoders/test_json_decoder.py (2 hunks)
  • unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1 hunks)
💤 Files with no reviewable changes (2)
  • airbyte_cdk/sources/declarative/decoders/__init__.py
  • unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/decoders/test_json_decoder.py

[warning] 1-1: Code would be reformatted to adhere to style guidelines.

unit_tests/sources/declarative/auth/test_token_provider.py

[warning] 1-1: Code would be reformatted to adhere to style guidelines.

unit_tests/sources/declarative/decoders/test_composite_decoder.py

[warning] 1-1: Code would be reformatted to adhere to style guidelines.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

[warning] 1-1: Code would be reformatted to adhere to style guidelines.

airbyte_cdk/sources/declarative/decoders/json_decoder.py

[warning] 1-1: Code would be reformatted to adhere to style guidelines.

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (21)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

152-152: Looks good!

The added import for CsvDecoderModel synchronizes well with the rest of the codebase. No issues found, wdyt?


227-227: Nice addition for GzipDecoderModel.

This import appears consistent with your usage in _get_parser. No concerns, wdyt?


2045-2046: Validate streaming approach for JSONL.

Here, the method sets stream_response=True for JSONL. This is likely correct given JSON lines are commonly processed in a streaming manner. Have you tested large JSONL data with this approach, wdyt?

airbyte_cdk/sources/declarative/decoders/json_decoder.py (3)

13-13: Thank you for adopting the composite approach.
This new import ensures we unify decoding logic with JsonParser. Would you like to verify usage in other parts of the codebase for consistency, wdyt?


28-28: Pass-through of 'is_stream_response' looks good.
No issues here!


44-44: Verify empty response behavior.
We yield an empty dict when nothing was decoded. Are we certain we want a single empty mapping rather than not yielding at all or returning an empty list, wdyt?

unit_tests/sources/declarative/auth/test_token_provider.py (2)

4-4: Importing 'json' is good.
This helps us easily create mock responses. No concerns here!


21-21: Switching to '.content' is more realistic.
Setting the token via encoded JSON simulates real response behavior. Would you like to confirm that bytes-to-JSON decoding logic is correctly handled in production, wdyt?

unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)

12-13: LGTM! Clean import changes.

The imports are correctly updated to use the new decoder architecture.


12-13: LGTM! Nice refactoring of the JsonlDecoder.

The change to use CompositeRawDecoder with JsonLineParser looks good and aligns with the decoder cleanup objectives. The test cases continue to pass with the new implementation.

Also applies to: 24-24

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)

134-134: LGTM! Good default value choice.

Setting stream_response=True as default maintains backward compatibility while allowing opt-out when needed.


3-3: LGTM! Added required import.

The addition of io import is necessary for using BytesIO in the non-streaming mode.

unit_tests/sources/declarative/decoders/test_composite_decoder.py (3)

203-213: LGTM! Good test for consumed stream behavior.

The test verifies that attempting to decode an already consumed stream raises an exception, which is the expected behavior.


215-223: LGTM! Good test for non-streaming mode.

The test verifies that non-streaming mode allows multiple decodes of the same response.



airbyte_cdk/sources/declarative/models/declarative_component_schema.py (4)

1664-1666: LGTM! Good design for GzipDecoder.

The GzipDecoder with inner_decoder support allows for flexible composition of decoders.


1268-1272: LGTM! Clean CsvDecoder implementation.

The CsvDecoder class looks good with appropriate default values for encoding and delimiter.




1912-1914: LGTM! Simplified decoder options.

The update to SessionTokenAuthenticator's decoder field to only allow JsonDecoder and XmlDecoder makes sense.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

2141-2155: ZipfileDecoder – Updated property from "parser" to "decoder"
The new changes now require a "decoder" property (instead of the old "parser") and correctly reference the unified decoders (CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder). Would you consider renaming the inner title from "Parser" to "Decoder" for consistency with the rest of the schema? wdyt?


2997-3011: GzipDecoder – Flexible inner_decoder implementation
The new GzipDecoder correctly requires an inner_decoder that can be one of CsvDecoder, GzipDecoder, JsonDecoder, or JsonlDecoder, which aligns with the unified decoding strategy. Have you thought about potential recursion issues if a GzipDecoder is nested inside another? It might be useful to add a note or safeguard if deep-nested gzip decoding isn’t expected. wdyt?

@maxi297
Contributor Author

maxi297 commented Feb 7, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (2)

24-25: Consider documenting the stream_response=False choice?

The initialization looks good! Would it be helpful to add a comment explaining why stream_response=False is the default here? This could help future maintainers understand the design decision, wdyt?


30-45: Consider more specific error handling?

The code looks good overall! A few thoughts:

  1. The generic Exception catch might hide specific issues. Would it be helpful to catch and log specific exceptions like JSONDecodeError separately, wdyt?
  2. The empty dict fallback is a nice safety net, but should we log a warning when this happens to help with debugging?
     try:
         for element in self._decoder.decode(response):
             yield element
             has_yielded = True
-    except Exception:
+    except json.JSONDecodeError as e:
+        logger.warning(f"Failed to decode JSON response: {e}")
+        yield {}
+    except Exception as e:
+        logger.warning(f"Unexpected error while decoding response: {e}")
         yield {}
unit_tests/sources/declarative/decoders/test_json_decoder.py (1)

44-48: Consider adding error case tests?

The happy path tests look good! Would it be valuable to add some error case tests, wdyt? For example:

  1. Malformed JSON lines
  2. Mixed valid/invalid JSON lines
  3. Empty lines between valid JSON
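Those cases could be sketched as below. This is a stand-in lenient line parser, not the CDK's JsonLineParser, and the empty-dict fallback mirrors the legacy behavior discussed elsewhere in this review:

```python
import json

def parse_json_lines_leniently(lines):
    """Skip blank lines, pass valid documents through, and fall back to an
    empty mapping for malformed lines."""
    for line in lines:
        if not line.strip():
            continue  # empty lines between valid JSON
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            yield {}  # malformed line: legacy empty-dict fallback

mixed = ['{"a": 1}', "", "not json", '{"b": 2}']
records = list(parse_json_lines_leniently(mixed))
```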
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

2035-2048: LGTM! Consider adding docstrings for better maintainability?

The implementation looks good. The stream_response flag is correctly set based on the decoder type. Would you consider adding docstrings to explain the purpose and behavior of each decoder method? This could help future maintainers understand the differences between them, wdyt?

Example docstring for create_csv_decoder:

def create_csv_decoder(model: CsvDecoderModel, config: Config, **kwargs: Any) -> Decoder:
    """Creates a CSV decoder using CompositeRawDecoder with CsvParser.
    
    Args:
        model: The CSV decoder model containing encoding and delimiter settings.
        config: The connector configuration.
        **kwargs: Additional keyword arguments.
        
    Returns:
        A CompositeRawDecoder instance configured for CSV parsing.
    """

2066-2083: LGTM! Consider enhancing error messages?

The implementation is clean and handles all decoder types appropriately. Would you consider making the error messages more specific by including the list of supported decoders in the error message? This could help users quickly understand what decoders are available, wdyt?

Example enhanced error message:

-        raise ValueError(f"Decoder type {model} does not have parser associated to it")
+        raise ValueError(f"Decoder type {model} does not have parser associated to it. Supported decoders are: JsonDecoder, JsonlDecoder, CsvDecoder, and GzipDecoder")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e79ecf and 555646d.

📒 Files selected for processing (5)
  • airbyte_cdk/sources/declarative/decoders/json_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)
  • unit_tests/sources/declarative/auth/test_token_provider.py (3 hunks)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (1 hunks)
  • unit_tests/sources/declarative/decoders/test_json_decoder.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py
  • unit_tests/sources/declarative/auth/test_token_provider.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)

13-13: LGTM! Nice simplification of the decoder structure.

The removal of the dataclass decorator and delegation to CompositeRawDecoder makes the code more maintainable and follows the composition over inheritance principle.

Also applies to: 19-22

unit_tests/sources/declarative/decoders/test_json_decoder.py (1)

11-13: LGTM! Clean import updates.

The imports are properly aligned with the new decoder structure.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

152-152: LGTM! Clean addition of CsvDecoderModel.

The import and mapping follow the established pattern in the codebase.

Also applies to: 521-521


227-227: LGTM! Clean addition of GzipDecoderModel.

The import follows the established pattern in the codebase.


2063-2063: LGTM! Clean refactor of create_zipfile_decoder.

The change nicely leverages the new _get_parser method, making the code more maintainable and consistent.

Contributor

@natikgadzhi natikgadzhi left a comment


I haven't looked into what that would mean for our connectors and published manifests. I think the only concern is around GzipJsonDecoder (not used?) and CompositeRawDecoder, which I believe is used in a few spots, but it's a very simple manifest change to update, right?

Contributor Author

@maxi297 maxi297 left a comment


Adding comments on the code review

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)

43-44: Would you consider being more specific with error handling? 🤔

Currently, we're catching all exceptions and returning an empty dict. Maybe we could catch specific exceptions (e.g., orjson.JSONDecodeError) to avoid masking unexpected errors, wdyt?

-        except Exception:
+        except (orjson.JSONDecodeError, UnicodeDecodeError) as e:
+            logger.debug(f"Failed to decode response: {e}")
             yield {}
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1268-1272: Should we add validation for encoding and delimiter fields?

The CsvDecoder class looks good, but we could enhance it by adding:

  1. Field descriptions and examples
  2. Validation for supported encodings
  3. Common delimiter options

What do you think about adding these improvements? They would make the schema more user-friendly and help prevent configuration errors. wdyt?

 class CsvDecoder(BaseModel):
     type: Literal["CsvDecoder"]
-    encoding: Optional[str] = "utf-8"
-    delimiter: Optional[str] = ","
+    encoding: Optional[str] = Field(
+        "utf-8",
+        description="Character encoding to use when reading CSV files.",
+        examples=["utf-8", "ascii", "iso-8859-1"],
+        title="Character Encoding",
+    )
+    delimiter: Optional[str] = Field(
+        ",",
+        description="Character used to separate fields in the CSV file.",
+        examples=[",", ";", "\t"],
+        title="Field Delimiter",
+    )
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2066-2083: Consider enhancing error messages and documentation?

The parser selection logic is well-structured, but what do you think about these potential improvements? wdyt?

  1. The error message could be more specific about which decoder types are supported:
-            raise ValueError(f"Decoder type {model} does not have parser associated to it")
+            raise ValueError(f"Decoder type {model} does not support parsing. Supported decoders: JsonDecoder, JsonlDecoder, CsvDecoder, GzipDecoder")
  1. The comment about JsonDecoder logic could be expanded to explain the specific error cases:
-            # Note that the logic is a bit different from the JsonDecoder as there is some legacy that is maintained to return {} on error cases
+            # Note: JsonParser differs from JsonDecoder in error handling:
+            # - JsonParser returns {} on parsing errors to maintain backward compatibility
+            # - JsonDecoder raises exceptions for better error visibility
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2142-2155: ZipfileDecoder Update – New "decoder" Field and References
The ZipfileDecoder component now requires a "decoder" field (instead of the old "parser") and its properties section has been updated accordingly. The "anyOf" list now includes references to CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder. Could you please confirm that including all these decoders (especially the inclusion of GzipDecoder within ZipfileDecoder) is intentional for handling decompressed zipfile data? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 555646d and 29f5529.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
  • airbyte_cdk/sources/declarative/decoders/json_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (8 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (12)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (3)

22-23: Great job documenting the historical context! 🎉

The documentation clearly explains the rationale behind using JsonDecoder instead of CompositeRawDecoder, which will be super helpful for future maintainers.


26-27: Nice refactor using composition! 👍

The initialization is clean and follows the Single Responsibility Principle by delegating to CompositeRawDecoder.


38-47: Love the robust implementation! ✨

The has_yielded flag ensures we maintain the contract of always yielding at least one item, even when the decoder returns nothing. This is a great defensive programming practice!
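The guard can be reduced to a minimal generator sketch (`decode_with_fallback` is a hypothetical name illustrating the pattern, not the CDK's implementation):

```python
from typing import Any, Iterator

def decode_with_fallback(parsed: Iterator[Any]) -> Iterator[Any]:
    # Track whether the underlying decoder produced anything
    has_yielded = False
    for item in parsed:
        has_yielded = True
        yield item
    # Preserve the contract of always yielding at least one item
    if not has_yielded:
        yield {}

print(list(decode_with_fallback(iter([]))))   # → [{}]
print(list(decode_with_fallback(iter([1]))))  # → [1]
```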

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (4)

1664-1666: LGTM! The GzipDecoder implementation looks good.

The recursive decoder pattern allows for flexible handling of nested formats, and the naming is consistent with other decoders.


1704-1708: The field name change from parser to decoder looks good!

This change aligns with the previous review comment about naming consistency between GzipDecoder.inner_decoder and ZipfileDecoder.decoder.


1912-1914: LGTM! The SessionTokenAuthenticator decoder field update is correct.

The change simplifies the decoder options to just JsonDecoder and XmlDecoder, which makes sense for session token responses.


2109-2123: LGTM! The decoder field updates in SimpleRetriever and AsyncRetriever are consistent.

The changes consistently use the new CsvDecoder across both retrievers, maintaining uniformity in the codebase.

Also applies to: 2186-2215

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

152-152: LGTM! Clean addition of CsvDecoderModel.

The import and constructor mapping follow the established pattern in the codebase.

Also applies to: 521-521


227-227: LGTM! Clean addition of GzipDecoderModel.

The import follows the established pattern in the codebase.


2035-2048: LGTM! Clean implementation of decoder creation methods.

The methods follow a consistent pattern using CompositeRawDecoder with appropriate parsers. Nice job on keeping the implementations concise and similar in structure.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

2997-3011: GzipDecoder Enhancements – Recursive Decoder Reference Check
The new GzipDecoder now requires a "decoder" field and its "anyOf" list includes references to CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder. I noticed that GzipDecoder is allowed to reference itself, which could enable chaining of gzip decoders. Is this recursive configuration intentional and aligned with your design objectives? wdyt?
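As a sanity check on the chaining idea, stacked gzip layers do compose cleanly; this sketch (`decompress_chain` is a hypothetical helper, not part of the CDK) strips one layer per configured decoder:

```python
import gzip

def decompress_chain(raw: bytes, depth: int) -> bytes:
    # Each nested GzipDecoder would strip one layer of compression
    for _ in range(depth):
        raw = gzip.decompress(raw)
    return raw

double_compressed = gzip.compress(gzip.compress(b"payload"))
print(decompress_chain(double_compressed, depth=2))  # → b'payload'
```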


3012-3025: CsvDecoder Definition Verification
The CsvDecoder component now explicitly sets its type with enum [CsvDecoder] and provides default values for encoding and delimiter. This looks clear and consistent with the unified decoder design. Does this meet your requirements for CSV parsing? wdyt?
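For reference, the defaults map onto the standard library like this (`parse_csv` is a hypothetical helper mirroring the schema's `encoding` and `delimiter` fields, not the CDK's parser):

```python
import csv
import io

def parse_csv(raw: bytes, encoding: str = "utf-8", delimiter: str = ",") -> list:
    # Decode the raw bytes with the configured encoding, then split on the delimiter
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding)
    return list(csv.DictReader(text, delimiter=delimiter))

rows = parse_csv(b"id,name\n1,alpha\n2,beta\n")
print(rows)  # → [{'id': '1', 'name': 'alpha'}, {'id': '2', 'name': 'beta'}]
```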

Contributor

@pnilan pnilan left a comment


Approved -- just want to confirm whether we should add GzipDecoder as a "top level" decoder in the SimpleRetriever and AsyncRetriever.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1664-1666: Consider making the decoder field name consistent with other decoders, wdyt?

Based on the past review comments about naming consistency, should we rename decoder to inner_decoder to match other similar components? This would help maintain a consistent naming convention across the codebase.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

2141-2155: Refactor in ZipfileDecoder’s "decoder" Field
It looks like you’ve replaced the old “parser” property with a unified “decoder” property in the ZipfileDecoder definition. This change now allows the use of CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder. Would you consider adding some inline examples or extra documentation here to clarify for users how to leverage these options? wdyt?


2997-3012: Addition of the GzipDecoder Component
You’ve introduced a new GzipDecoder that requires both a “type” and a nested “decoder” field to define how to interpret the decompressed data. I did notice that within its “decoder” field, you also allow a reference to a GzipDecoder via "#/definitions/GzipDecoder." Is this recursive configuration intentional (for chained decompression) or might it lead to unexpected recursion? Perhaps a comment or guard would help clarify its intended use. wdyt?


3013-3026: Introduction of the CsvDecoder Component
The CsvDecoder definition is straightforward, with reasonable defaults for encoding ("utf-8") and delimiter (","). Would you consider including one or two usage examples (or references to documentation) directly within the schema to help users understand how to correctly configure CSV decoding in practice? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29f5529 and 47d2036.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5)

1268-1272: LGTM! The CsvDecoder class looks well-defined.

The class has sensible defaults for encoding (utf-8) and delimiter (,).


1704-1708: LGTM! The ZipfileDecoder's decoder field update looks good.

The field has been updated to use the new decoder types (CsvDecoder, GzipDecoder, JsonDecoder, JsonlDecoder) consistently.


1912-1914: LGTM! The SessionTokenAuthenticator's decoder field update is correct.

The field has been correctly restricted to only JsonDecoder and XmlDecoder, which aligns with the typical response formats for session token authentication.


2109-2124: LGTM! The SimpleRetriever's decoder field update is comprehensive.

The field now includes all available decoders (CsvDecoder, GzipDecoder, JsonDecoder, etc.) with proper documentation.


2187-2218: LGTM! The AsyncRetriever's decoder field updates are thorough.

Both decoder and download_decoder fields have been updated consistently to include all available decoders.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2037-2056: Consider extracting common decoder creation logic?

The create_csv_decoder, create_jsonl_decoder, and create_gzip_decoder methods follow a similar pattern of creating a CompositeRawDecoder with a parser. What do you think about extracting this common logic into a private helper method to reduce code duplication? Something like:

@staticmethod
def _create_composite_decoder(model: BaseModel, config: Config, stream_response: bool) -> Decoder:
    return CompositeRawDecoder(
        parser=ModelToComponentFactory._get_parser(model, config),
        stream_response=stream_response
    )

This would make the code more DRY and easier to maintain, wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47d2036 and 2ed8636.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

105-105: LGTM!

The GzipDecoder import is correctly placed in alphabetical order.


522-522: LGTM!

The new decoder models are correctly mapped to their creation methods in the PYDANTIC_MODEL_TO_CONSTRUCTOR dictionary.

Also applies to: 552-552


2074-2092: Verify encoding handling in GzipParser with inner parsers

Based on a past review comment, there was an issue where GzipDecoder passes bytes to the inner_parser, which caused problems with non-standard (e.g. UTF-16) encodings. Could you verify that this is now handled correctly, especially for cases like:

GzipDecoder(decoder=JsonDecoder())  # GzipParser passes bytes to JsonParser
GzipDecoder(decoder=CsvDecoder(encoding='utf-16'))  # Non-standard encoding
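The encoding concern boils down to where the bytes-to-text decode happens; this standalone sketch (`gzip_then_csv` is a hypothetical helper, not the CDK's GzipParser) shows the inner parser applying the configured encoding to the decompressed bytes:

```python
import csv
import gzip
import io

def gzip_then_csv(raw: bytes, encoding: str = "utf-16") -> list:
    # Outer step: gzip decompression yields raw bytes, not text
    decompressed = gzip.decompress(raw)
    # Inner step: the CSV layer must apply its own encoding here;
    # assuming utf-8 at this point is what breaks utf-16 payloads
    return list(csv.DictReader(io.StringIO(decompressed.decode(encoding))))

payload = gzip.compress("id,name\n1,alpha\n".encode("utf-16"))
print(gzip_then_csv(payload))  # → [{'id': '1', 'name': 'alpha'}]
```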

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2041-2044: Consider extracting common decoder creation logic?

The CSV, JSONL, and GZIP decoder implementations share the same pattern of creating a CompositeRawDecoder with stream_response=True. Would extracting this into a helper method make sense to reduce duplication, wdyt?

+    @staticmethod
+    def _create_composite_decoder(model: BaseModel, config: Config) -> Decoder:
+        return CompositeRawDecoder(
+            parser=ModelToComponentFactory._get_parser(model, config), 
+            stream_response=True
+        )

     @staticmethod
     def create_csv_decoder(model: CsvDecoderModel, config: Config, **kwargs: Any) -> Decoder:
-        return CompositeRawDecoder(
-            parser=ModelToComponentFactory._get_parser(model, config), stream_response=True
-        )
+        return ModelToComponentFactory._create_composite_decoder(model, config)

     @staticmethod
     def create_jsonl_decoder(model: JsonlDecoderModel, config: Config, **kwargs: Any) -> Decoder:
-        return CompositeRawDecoder(
-            parser=ModelToComponentFactory._get_parser(model, config), stream_response=True
-        )
+        return ModelToComponentFactory._create_composite_decoder(model, config)

     @staticmethod
     def create_gzip_decoder(model: GzipDecoderModel, config: Config, **kwargs: Any) -> Decoder:
-        return CompositeRawDecoder(
-            parser=ModelToComponentFactory._get_parser(model, config), stream_response=True
-        )
+        return ModelToComponentFactory._create_composite_decoder(model, config)

Also applies to: 2047-2050, 2053-2056

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ed8636 and 643d950.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)

105-105: LGTM!

The GzipDecoder import is correctly placed in alphabetical order.


2037-2038: LGTM!

The JSON decoder implementation is clean and straightforward.


2074-2091: LGTM! The parser selection logic is well-structured.

The implementation:

  • Handles each decoder type appropriately
  • Provides clear error messages for unsupported decoders
  • Correctly wraps inner parsers for GzipParser

Based on the past review comments, I see that you've already addressed the issue with GzipParser and JsonLineParser that was discussed between @artem1205 and @maxi297. The current implementation looks good.


2071-2071: LGTM!

The ZipfileDecoder now uses the centralized parser selection logic, maintaining consistency with other decoders.

Contributor

@artem1205 artem1205 left a comment


LGTM!

@maxi297 maxi297 merged commit cb5a921 into main Feb 12, 2025
23 checks passed
@maxi297 maxi297 deleted the maxi297/clean-decoders-set-stream-response branch February 12, 2025 04:10
@ChristoGrab
Collaborator

Adding a link to the issue to add this new functionality to the builder, just for x-referencing purposes:

https://github.com/airbytehq/airbyte-internal-issues/issues/11679


5 participants