fix(low-code): Add parent state migration from global state #322

Merged
merged 2 commits into from
Feb 7, 2025

Conversation

tolik0
Contributor

@tolik0 tolik0 commented Feb 7, 2025

Summary by CodeRabbit

  • Refactor
    • Improved state migration during incremental synchronization to ensure only essential state data is retained, enhancing overall consistency.
  • Tests
    • Expanded the test suite with new scenarios and parameterized cases to validate incremental state handling and robust error management.
    • Updated expected outputs in existing tests to align with new state management logic.

@github-actions github-actions bot added the bug Something isn't working label Feb 7, 2025
Contributor

coderabbitai bot commented Feb 7, 2025

📝 Walkthrough

The changes update the state migration logic within the SubstreamPartitionRouter’s _migrate_child_state_to_parent_state method. The method’s internal documentation and implementation have been refined to extract only global state (ignoring per-partition state) when converting child stream state for parent streams with incremental syncing. Additionally, the test suite for ConcurrentPerPartitionCursor has been expanded with new parameterized tests and modifications to existing ones to validate various incremental state scenarios.
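The behavior described in the walkthrough can be sketched as a standalone function. This is a minimal illustration, not the CDK implementation: `ParentConfig` and the free function `migrate_child_state_to_parent_state` are hypothetical stand-ins for the router's parent stream configs and its private method, and the handling of edge cases is an assumption based on the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ParentConfig:
    """Hypothetical stand-in for a parent stream config; not the CDK class."""

    stream_name: str
    cursor_field: str
    incremental_dependency: bool = True


def migrate_child_state_to_parent_state(
    stream_state: Dict[str, Any], parents: List[ParentConfig]
) -> Dict[str, Dict[str, Any]]:
    """Map a legacy child state to {parent_name: {cursor_field: cursor_value}}."""
    values = list(stream_state.values())
    cursor_value = values[0] if values else None

    # Anything other than a single scalar value is per-partition or invalid.
    if len(values) != 1 or isinstance(cursor_value, (list, dict)):
        global_state = stream_state.get("state")
        if isinstance(global_state, dict) and global_state:
            # Use the first value of the global state as the cursor value.
            cursor_value = next(iter(global_state.values()))
        else:
            return {}

    if not cursor_value:
        return {}

    # Only parents with an incremental dependency receive a migrated state.
    return {
        p.stream_name: {p.cursor_field: cursor_value}
        for p in parents
        if p.incremental_dependency
    }
```

With a single incremental parent whose cursor field is `parent_stream_cursor`, both `{"updated_at": "2023-05-27T00:00:00Z"}` and `{"state": {"updated_at": "2023-05-27T00:00:00Z"}}` map to the same parent state in this sketch, while a state that holds only a per-partition `"states"` list maps to `{}`.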

Changes

| File Path | Summary |
| --- | --- |
| airbyte_cdk/.../substream_partition_router.py | Updated _migrate_child_state_to_parent_state to filter out per-partition states, use global state if available, and improve documentation with examples. |
| unit_tests/.../test_concurrent_perpartitioncursor.py | Added new test cases and updated existing ones to cover incremental state migration scenarios, including error handling and refined parameterization. |
| unit_tests/.../test_substream_partition_router.py | Updated expected outputs in tests to align with new state management logic regarding cursor handling in SubstreamPartitionRouter. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant SubstreamPartitionRouter
    participant StateInput

    Caller ->> SubstreamPartitionRouter: _migrate_child_state_to_parent_state(stream_state)
    Note right of SubstreamPartitionRouter: Check for key "state" and extract global state
    SubstreamPartitionRouter ->> StateInput: Evaluate input state details
    StateInput -->> SubstreamPartitionRouter: Return global state (ignore per-partition data)
    SubstreamPartitionRouter -->> Caller: Return mapped parent state with cursor values

Suggested reviewers

  • maxi297 - would you like to add them as a reviewer? wdyt?
  • aaronsteers - would you like to add them as a reviewer? wdyt?
  • darynaishchenko - would you like to add them as a reviewer? wdyt?

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a444a5 and 3e37a72.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/partition_routers/test_substream_partition_router.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/declarative/partition_routers/test_substream_partition_router.py (1)

418-418: LGTM! The test cases correctly validate parent state migration.

The changes ensure that both regular and declarative global cursor states are properly migrated to the parent stream's cursor field. The test cases provide good coverage for this behavior.

Let's verify if we have test coverage for all state migration scenarios:

Also applies to: 474-474

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check test coverage for state migration scenarios

# Search for test cases related to state migration
rg -A 5 'test_set_initial_state.*\[' unit_tests/sources/declarative/partition_routers/test_substream_partition_router.py

Length of output: 122


I'll run a couple of additional commands to gather more context on the changes. In particular, I'd like to inspect the definition of the test_set_initial_state function and review the context around line 418 where the parent state is asserted. This will help verify that the modifications correctly handle state migration. wdyt?


🏁 Script executed:

#!/bin/bash
# Display the definition of test_set_initial_state to see all test cases.
rg -n 'def test_set_initial_state' unit_tests/sources/declarative/partition_routers/test_substream_partition_router.py

# Display a snippet around line 418 to verify the assertion with the parent_stream_cursor.
sed -n '410,430p' unit_tests/sources/declarative/partition_routers/test_substream_partition_router.py

Length of output: 1177


Test state migration verified!

The snippets confirm that both the "updated_at" migration and global "state" migration scenarios are now correctly asserting the parent state with the "parent_stream_cursor". It looks like the changes effectively cover the intended state migration behavior. wdyt?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (2)

302-316: Great improvements to the docstring! Consider adding validation examples?

The docstring now clearly explains the state migration behavior. Would it be helpful to add examples of invalid formats that are ignored during migration, wdyt?

 This method converts the child stream state—or, if present, the global state—into a format that is
 compatible with parent streams that use incremental synchronization. The migration occurs only for
 parent streams with incremental dependencies. It filters out per-partition states and retains only the
 global state in the form {cursor_field: cursor_value}.

 The method supports multiple input formats:
   - A simple global state, e.g.:
         {"updated_at": "2023-05-27T00:00:00Z"}
   - A state object that contains a "state" key (which is assumed to hold the global state), e.g.:
         {"state": {"updated_at": "2023-05-27T00:00:00Z"}, ...}
     In this case, the migration uses the first value from the "state" dictionary.
   - Any per-partition state formats or other non-simple structures are ignored during migration.
+
+ Examples of ignored formats:
+   - Per-partition state:
+         {"states": [{"partition": {"id": 1}, "cursor": {"updated_at": "2023-05-27T00:00:00Z"}}]}
+   - Invalid state structure:
+         {"updated_at": ["2023-05-27T00:00:00Z"]}

339-345: Consider adding validation for empty state dictionary.

The code now handles global state under the "state" key, but what if the state dictionary is empty? Should we add a check for that case, wdyt?

 # Ignore per-partition states or invalid formats.
 if isinstance(substream_state, (list, dict)) or len(substream_state_values) != 1:
     # If a global state is present under the key "state", use its first value.
-    if "state" in stream_state and isinstance(stream_state["state"], dict):
+    if "state" in stream_state and isinstance(stream_state["state"], dict) and stream_state["state"]:
         substream_state = list(stream_state["state"].values())[0]
     else:
         return {}
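To make the empty-dictionary concern concrete, here is a small sketch comparing the two checks in isolation. The helper names `first_value_unguarded` and `first_value_guarded` are hypothetical and exist only for this illustration; they isolate the condition from the diff above rather than reproduce the full method.

```python
from typing import Any, Dict


def first_value_unguarded(stream_state: Dict[str, Any]) -> Any:
    # Mirrors the check before the suggestion: only membership and type are tested.
    if "state" in stream_state and isinstance(stream_state["state"], dict):
        return list(stream_state["state"].values())[0]  # IndexError on {"state": {}}
    return None


def first_value_guarded(stream_state: Dict[str, Any]) -> Any:
    # Mirrors the suggested check: an empty "state" dict is treated as absent.
    if (
        "state" in stream_state
        and isinstance(stream_state["state"], dict)
        and stream_state["state"]
    ):
        return list(stream_state["state"].values())[0]
    return None
```

In this sketch, `first_value_unguarded({"state": {}})` raises `IndexError`, while `first_value_guarded({"state": {}})` returns `None`, which the caller can map to an empty parent state.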
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e38f914 and 0a444a5.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (2 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: MyPy Check
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Ruff Lint Check
  • GitHub Check: Build and Inspect Python Package
  • GitHub Check: preview_docs
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Ruff Format Check
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1)

1453-1465: LGTM! Great test coverage for state migration.

The new test case for global state without parent state helps ensure robust state migration. The descriptive test IDs also make it easier to understand test failures.

Contributor

@maxi297 maxi297 left a comment


I fear that this might not be 100% true in the following case:

  • T0: HTTP request for the parent stream. To be synced are parent_1 and parent_2
  • T1: The parent stream emits parent_1
  • T2: HTTP request for the child stream on slice parent_1
  • T3: parent_3 is also updated, which would now make it available in the parent stream HTTP request
  • T4: A new record is added on the child stream for parent_2
  • T5: Finished processing parent_1 at the child level
  • T6: Process the child stream for slice parent_2

In this case, we would set the parent state to T4 and the update on parent_3 would not be picked up. That being said, the {<cursor field>: <cursor value>} format was not safe for very similar reasons, so I'm good with this as a migration.
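The timeline above can be reproduced with a toy simulation. It is only a sketch under stated assumptions: timestamps are modeled as integers, the parent request is assumed to filter on `updated_at > cursor`, and all names (`parent_updated_at`, `parents_fetched_next_sync`) are invented for the illustration.

```python
from typing import Dict, List

# Logical timestamps taken from the scenario (T0, T3, T4 as integers).
T0, T3, T4 = 0, 3, 4

# When each parent record was last updated. parent_1 and parent_2 changed
# before T0; parent_3 changes at T3, after the parent request was issued.
parent_updated_at: Dict[str, int] = {"parent_1": -1, "parent_2": -1, "parent_3": T3}

# The child cursor ends at T4 because a child record for parent_2 was added then.
child_cursor = T4

# The migration copies the child cursor value into the parent state.
migrated_parent_cursor = child_cursor


def parents_fetched_next_sync(cursor: int) -> List[str]:
    """Parents an incremental request would return, assuming updated_at > cursor."""
    return sorted(name for name, ts in parent_updated_at.items() if ts > cursor)
```

Here `parents_fetched_next_sync(migrated_parent_cursor)` returns an empty list, so the update to parent_3 at T3 is skipped, whereas a cursor left at T0 would have returned `["parent_3"]`.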

@tolik0
Contributor Author

tolik0 commented Feb 7, 2025

@maxi297 I agree. Ideally, we should apply the lookback window to the parent stream as well. However, this migration is only meant for transitioning from the legacy format, so there’s no need to overcomplicate the logic since the lookback window won’t be present in that case.
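For illustration only, applying a lookback window to a migrated parent cursor could look like the sketch below. This PR does not implement it; the function `apply_lookback`, the one-hour window, and the assumed timestamp format (taken from the docstring examples) are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Timestamp format assumed from the docstring examples, e.g. "2023-05-27T00:00:00Z".
CURSOR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def apply_lookback(cursor_value: str, lookback: timedelta) -> str:
    """Move a cursor value back by `lookback` so late parent updates are re-read."""
    parsed = datetime.strptime(cursor_value, CURSOR_FORMAT).replace(tzinfo=timezone.utc)
    return (parsed - lookback).strftime(CURSOR_FORMAT)
```

For example, `apply_lookback("2023-05-27T00:00:00Z", timedelta(hours=1))` yields `"2023-05-26T23:00:00Z"`. The trade-off is that some parent records are read again on the next sync in exchange for not missing updates that landed during the previous one.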

@tolik0 tolik0 merged commit f396439 into main Feb 7, 2025
24 checks passed
@tolik0 tolik0 deleted the tolik0/add-parent-state-migration branch February 7, 2025 14:35
Labels
bug Something isn't working
2 participants