
fix(property chunking): Switch the ordering of page iteration and property chunking to process chunks first instead of pages first #487

Merged: 6 commits into main, Apr 17, 2025

Conversation


@brianjlai brianjlai commented Apr 16, 2025

What

While implementing the 3 Hubspot property history streams, which require property chunking, I noticed some strange behavior: whenever property chunking was enabled, there was a very long pause between the start of the sync and the first record emitted.

This change refactors the SimpleRetriever so that we process all property chunks of the current page before advancing to the next page (horizontal traversal), instead of paginating through every page for each chunk (vertical traversal).

How

As I had originally written the code, the property chunk was the outer loop and pagination was the inner loop: we took the first property chunk and paginated all the way to the end, then moved to the second property chunk and paginated all the way to the end again, and so on for every chunk. As a result, we effectively could not emit any merged records until we reached the last property chunk.

This change reworks the order in which we process pages and chunks: we start with the current page, fetch each property chunk for that page, emit the complete merged records, and only then continue to the next page. This is done by moving the property chunking logic from SimpleRetriever.read_records() into SimpleRetriever._read_pages().
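The reordering can be illustrated with a small sketch. This is not the actual CDK code; `fetch()` and the toy data are hypothetical stand-ins. The old ordering exhausts pagination per chunk, so no complete record appears until the last chunk has been fetched, while the new ordering merges all chunks for one page and emits right away:

```python
def fetch(page, chunk):
    # Toy stand-in for one API request: returns a record per id on the page,
    # containing only the properties in this chunk.
    return [{"id": i, **{p: f"{p}-{i}" for p in chunk}} for i in page]


def read_chunks_then_pages(pages, property_chunks):
    # Old ordering: for each chunk, paginate to the end. Nothing complete
    # can be emitted until the final chunk has been fetched.
    merged = {}
    for chunk in property_chunks:
        for page in pages:
            for record in fetch(page, chunk):
                merged.setdefault(record["id"], {}).update(record)
    yield from merged.values()


def read_pages_then_chunks(pages, property_chunks):
    # New ordering: for each page, fetch every chunk, merge, and emit the
    # complete records before moving on to the next page.
    for page in pages:
        merged = {}
        for chunk in property_chunks:
            for record in fetch(page, chunk):
                merged.setdefault(record["id"], {}).update(record)
        yield from merged.values()
```

Both orderings yield the same merged records in the end; only the time to the first emitted record changes.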

Note:
You'll see there are no changes to tests: I had already written test cases for the property chunking and no-chunking scenarios, and since this is a refactor, the intent is that there be no functional impact on which records are emitted.

Summary by CodeRabbit

  • New Features
    • Enhanced support for additional query property chunking to improve data retrieval flexibility.
  • Refactor
    • Simplified record reading logic by centralizing property chunking and merging, improving efficiency and maintainability.
    • Adjusted property chunk size calculation to better handle character limits by accounting for delimiters.
  • Tests
    • Updated tests to include nested dictionary and list structures for more comprehensive data handling validation.
    • Modified property chunking tests to reflect updated chunking behavior with character limits.
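The delimiter accounting mentioned above can be sketched with a simplified stand-in for the chunk-size calculation. This is an illustration only, assuming a one-character delimiter such as a comma; the CDK's `get_request_property_chunks` differs in detail:

```python
def chunk_properties(properties, char_limit):
    # Group property names into chunks whose joined length stays within
    # char_limit, counting +1 character per property for the delimiter
    # that joins them in the final request parameter.
    chunks, current, current_size = [], [], 0
    for prop in properties:
        prop_size = len(prop) + 1  # +1 for the delimiter
        if current and current_size + prop_size > char_limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(prop)
        current_size += prop_size
    if current:
        chunks.append(current)
    return chunks
```

Without the +1 per property, a chunk could join to a string that exceeds the API's character limit once delimiters are inserted.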

@brianjlai brianjlai changed the title fix(property chunking) Switch the ordering page iteration and property chunking process chunks first instead of pages first fix(property chunking): Switch the ordering page iteration and property chunking process chunks first instead of pages first Apr 16, 2025
Copy link
Contributor

coderabbitai bot commented Apr 16, 2025

📝 Walkthrough

Walkthrough

The changes focus on the SimpleRetriever class within the Airbyte CDK. The _read_pages method has been updated to handle chunking of additional query properties and to merge records based on a specified merge key when necessary. The read_records method has been refactored to remove redundant property chunking and merging logic, delegating these responsibilities to _read_pages. Additionally, read_records now includes logic to handle early termination when using a ResumableFullRefreshCursor. Overall, the property chunking and merging logic is now centralized, and record reading is streamlined.

Changes

File-by-file change summary:
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py: Refactored _read_pages to support additional query property chunking and merging; simplified read_records to delegate chunking and merging to _read_pages; added early termination for the full refresh cursor; added a _deep_merge helper function.
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py: Updated test data in test_simple_retriever_with_additional_query_properties to include nested dictionary and list fields in records, reflecting changes in record merging logic.
  • airbyte_cdk/sources/declarative/requesters/query_properties/property_chunking.py: Adjusted property size calculation in get_request_property_chunks to add 1 character per property when limiting by characters, accounting for delimiters.
  • unit_tests/sources/declarative/requesters/query_properties/test_property_chunking.py: Modified the expected output of the character-limit chunking test to reflect the updated limits and delimiter handling; added a new test case for delimiter impact on chunking.
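The `_deep_merge` helper referenced above is not shown in this thread; a plausible sketch, assuming nested mappings are merged recursively while lists and scalars are overwritten by the later chunk's value, might look like:

```python
from collections.abc import MutableMapping
from typing import Any


def deep_merge(
    target: MutableMapping[str, Any], source: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    # Merge source into target in place: recurse into nested mappings,
    # overwrite lists and scalar values with the value from source.
    for key, value in source.items():
        if (
            key in target
            and isinstance(target[key], MutableMapping)
            and isinstance(value, MutableMapping)
        ):
            deep_merge(target[key], value)
        else:
            target[key] = value
    return target
```

This matches why the updated tests add nested dictionary and list structures: nested dicts from different property chunks should combine key by key, while lists are replaced wholesale.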

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant SimpleRetriever
    participant Cursor

    User->>SimpleRetriever: read_records()
    alt Using ResumableFullRefreshCursor
        SimpleRetriever->>Cursor: is_full_refresh_sync_complete()
        alt Sync complete
            SimpleRetriever-->>User: return (no records)
        else Not complete
            SimpleRetriever->>SimpleRetriever: _read_pages() (single page)
            SimpleRetriever-->>User: yield records
        end
    else Standard Mode
        SimpleRetriever->>SimpleRetriever: _read_pages()
        loop For each page
            SimpleRetriever->>Cursor: observe(record)
            SimpleRetriever-->>User: yield record
        end
        SimpleRetriever->>Cursor: close_slice()
    end
sequenceDiagram
    participant SimpleRetriever
    participant PropertyChunker
    participant PageFetcher

    SimpleRetriever->>PropertyChunker: get property chunks
    loop For each property chunk
        PropertyChunker->>PageFetcher: fetch page with chunk
        PageFetcher-->>SimpleRetriever: return records
        alt Merge key enabled
            SimpleRetriever->>SimpleRetriever: merge records by key (using _deep_merge)
        else No merge key
            SimpleRetriever-->>User: yield records
        end
    end
    alt Merge key enabled
        SimpleRetriever-->>User: yield merged records
    end

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • tolik0
  • maxi297

Does this updated summary and diagrams look good to you, or would you like me to emphasize any part more? Wdyt?



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (2)

387-394: Clarify handling of stream_slice for each property chunk.

Here, we reassign stream_slice if properties is present. If the original slice is needed elsewhere, a fresh slice instance might be clearer. Would you be open to creating a new variable for the updated slice to avoid overshadowing the original? wdyt?


514-536: Check property chunking coverage with ResumableFullRefreshCursor.

These lines cleanly handle the RFR scenario. However, would you consider adding a test case ensuring property chunking also works when resuming a full refresh? wdyt?

🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py

[error] 378-378: mypy: List item 0 has incompatible type "None"; expected "list[str]". (list-item)


[error] 382-382: mypy: Need type annotation for "records_without_merge_key" (hint: "records_without_merge_key: list[] = ..."). (var-annotated)

🔇 Additional comments (4)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (4)

395-418: Verify usage of records_without_merge_key.

We instantiate records_without_merge_key but never add any records to it. Would you consider either removing it or integrating it into the record merging logic to store records lacking a merge key? wdyt?


420-430: Confirm aggregation of merged records.

The logic for yielding merged records looks good. One small question: could partial merges occur if a record with the same merge key is split across multiple pages? If so, do we need to persist partial merges across pages? wdyt?


505-512: No concerns with setting default _slice.

Defining _slice when stream_slice is None and partially applying _parse_records is a clean solution. Great job!


540-540: No further changes needed.

This return statement appears to be a standard final exit. Looks good!

@github-actions github-actions bot added the bug Something isn't working label Apr 16, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)

382-382: Would you consider adding a type annotation for merged_records?

Mypy might flag this in the future. Adding a type annotation would improve code clarity and help with static analysis.

- merged_records: MutableMapping[str, Any] = defaultdict(dict)
+ merged_records: MutableMapping[str, MutableMapping[str, Any]] = defaultdict(dict)
🔇 Additional comments (8)
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (8)

378-379: Good fix for the type issue.

Your change from [None] to [] correctly addresses the mypy error flagged in the previous review. This ensures proper typing compatibility for the property chunks.


370-380: Improved property chunks initialization approach.

The refactored property chunking logic looks good. By moving this initialization to the beginning of the pagination loop, you're setting up the horizontal processing pattern that improves record emission timing.


386-417: Well-implemented property chunking loop.

The horizontal chunking approach (processing all chunks for the current page before moving to the next page) is well implemented. I like how you:

  1. Properly update the stream_slice with the current properties
  2. Handle records with and without merge keys differently
  3. Yield records without merge keys immediately while accumulating those with merge keys

This change achieves the PR's goal of allowing earlier record emission.


418-429: Good handling of merged records after chunk processing.

This code properly handles the emission of merged records after all chunks for a page have been processed. This ensures records with the same merge key are properly combined before moving to the next page.


505-512: Nice refactoring of record parsing logic.

Creating a partial function for record parsing makes the code cleaner and removes duplication. Good improvement!


513-522: Good addition of early termination check for ResumableFullRefreshCursor.

The addition of the early termination check for streams with a ResumableFullRefreshCursor is a nice enhancement that prevents unnecessary processing when a sync is already complete.


523-535: Clean refactoring of the main record processing logic.

The refactored code is cleaner and more focused. By moving the property chunking and merging logic to _read_pages, you've simplified this method while maintaining the same functionality.


370-429: Implementation aligns perfectly with PR objective.

Your refactoring successfully changes the property chunking approach from vertical (all pages for a chunk) to horizontal (all chunks for a page), which should significantly improve the time to first record emission. Nice work!

@brianjlai brianjlai merged commit 58a89d6 into main Apr 17, 2025
26 checks passed
@brianjlai brianjlai deleted the brian/property_chunking_merge_fix_page_read_ordering branch April 17, 2025 22:16