
fix(source-postgres): use REPEATABLE READ snapshot for CTID batch queries to prevent silent row drops #74062

Draft
devin-ai-integration[bot] wants to merge 7 commits into master from
devin/1772110320-fix-ctid-snapshot-isolation

Conversation

devin-ai-integration bot (Contributor) commented Feb 26, 2026

What

Resolves https://github.com/airbytehq/oncall/issues/11440:

The Postgres source connector's CTID-based full refresh scan silently drops rows that are updated by concurrent transactions during multi-batch scans. Each batch query previously opened a new connection (and therefore a new MVCC snapshot) via database.unsafeQuery(). Under PostgreSQL's default READ COMMITTED isolation, a row updated mid-scan could become invisible to all batches: the old CTID is dead in later snapshots, and the new CTID may fall in a range already scanned.
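To make the failure mode above concrete, here is a toy simulation. It is not connector code: the table is modeled as a map from row id to the page that holds the row's current tuple version, and the class and method names are invented for illustration. It assumes the UPDATE places the new tuple version on an earlier page that has free space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a two-batch CTID range scan (illustrative only). */
final class CtidScanSimulation {

    /** Returns the row ids whose page falls in [fromPage, toPage] in the given snapshot. */
    static List<String> scanBatch(Map<String, Integer> snapshot, int fromPage, int toPage) {
        List<String> seen = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : snapshot.entrySet()) {
            if (entry.getValue() >= fromPage && entry.getValue() <= toPage) {
                seen.add(entry.getKey());
            }
        }
        return seen;
    }

    /** Scans pages 0-2, then pages 3-5, with an UPDATE committing between the batches. */
    static List<String> scan(boolean freshSnapshotPerBatch) {
        // Live table state: row id -> page of its current tuple version.
        Map<String, Integer> table = new HashMap<>();
        table.put("a", 0);
        table.put("b", 5);

        Map<String, Integer> firstSnapshot = new HashMap<>(table);
        List<String> result = new ArrayList<>(scanBatch(firstSnapshot, 0, 2));

        // Concurrent UPDATE: the new version of "b" lands on page 1 (already scanned),
        // and the old version on page 5 is dead for any snapshot taken afterwards.
        table.put("b", 1);

        Map<String, Integer> secondSnapshot =
                freshSnapshotPerBatch ? new HashMap<>(table) : firstSnapshot;
        result.addAll(scanBatch(secondSnapshot, 3, 5));
        return result;
    }

    public static void main(String[] args) {
        System.out.println("snapshot per batch: " + scan(true));
        System.out.println("single snapshot:    " + scan(false));
    }
}
```

With a fresh snapshot per batch, row "b" is never returned; with one shared snapshot, the second batch still sees the old version on page 5.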

How

Core fix: InitialSyncCtidIterator now opens a single dedicated connection with REPEATABLE READ isolation before the first batch, and reuses it for all subsequent batch queries. This guarantees all batches see the same consistent MVCC snapshot.
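A minimal sketch of what opening and closing such a snapshot connection can look like with plain JDBC. The class and method names here are hypothetical; the PR's own openSnapshotConnection() / closeSnapshotConnection() may differ in detail. Note that in PostgreSQL the REPEATABLE READ snapshot is taken by the first query in the transaction, not when the connection is configured.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

/** Hypothetical helper, not the PR's actual implementation. */
final class SnapshotConnections {

    /** Opens a connection whose statements all run in one REPEATABLE READ transaction. */
    static Connection open(DataSource dataSource) throws SQLException {
        Connection connection = dataSource.getConnection();
        try {
            // Set the isolation level before any statement has started a transaction.
            connection.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
            // With auto-commit off, every batch query joins the same transaction.
            connection.setAutoCommit(false);
            return connection;
        } catch (SQLException e) {
            connection.close();
            throw e;
        }
    }

    /** Ends the read-only transaction and releases the connection. */
    static void close(Connection connection) throws SQLException {
        if (connection == null) {
            return;
        }
        try {
            connection.rollback();
        } finally {
            connection.close();
        }
    }
}
```

Rolling back on close is a design choice for this sketch: the scan only reads, so there is nothing to commit, and ending the transaction promptly releases the snapshot.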

DataSource threading: The DataSource is passed explicitly as a constructor parameter through the chain: PostgresSource → CtidUtils.createInitialLoader → PostgresCtidHandler → InitialSyncCtidIterator. The iterator uses dataSource.getConnection() directly to obtain the snapshot connection, avoiding any need to cast or access CDK internals.
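The threading pattern can be sketched with stand-in classes. The names below (HandlerSketch, IteratorSketch, and the simplified createInitialLoader signature) are placeholders, not the connector's real signatures; the only point is that the same DataSource instance is handed down through constructors rather than read from a CDK field.

```java
import java.util.Objects;
import javax.sql.DataSource;

/** Stand-ins for the constructor chain; all names and signatures are hypothetical. */
final class DataSourceThreadingSketch {

    /** Plays the role of the iterator that later calls dataSource.getConnection(). */
    static final class IteratorSketch {
        private final DataSource dataSource;

        IteratorSketch(DataSource dataSource) {
            this.dataSource = Objects.requireNonNull(dataSource);
        }

        DataSource dataSource() {
            return dataSource;
        }
    }

    /** Plays the role of the handler that creates iterators. */
    static final class HandlerSketch {
        private final DataSource dataSource;

        HandlerSketch(DataSource dataSource) {
            this.dataSource = Objects.requireNonNull(dataSource);
        }

        IteratorSketch newIterator() {
            return new IteratorSketch(dataSource);
        }
    }

    /** Plays the role of the factory method that receives the DataSource from the source. */
    static HandlerSketch createInitialLoader(DataSource dataSource) {
        return new HandlerSketch(dataSource);
    }
}
```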

CDK change: ⚠️ The diff also contains a one-line CDK visibility change (protected val dataSource → val dataSource in DefaultJdbcDatabase.kt) from an earlier commit. This change is now redundant since the DataSource is threaded via constructor params instead. Reviewer should decide whether to keep or revert this.

VACUUM handling: On VACUUM-triggered resets (resetSubQueries), the snapshot connection is closed and a fresh one is opened, since VACUUM invalidates the physical page layout anyway.

Streaming: A custom toStream() replaces database.unsafeQuery() to stream ResultSet rows without closing the shared connection on stream completion. A fixed fetch size of 10,000 is used.
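A minimal sketch of a ResultSet-backed stream that leaves the shared connection open, assuming a hypothetical RowMapper interface. Unlike the PR's toStream() as described in the review guide, this variant also closes the ResultSet and its Statement when the stream is closed, which is one way to avoid accumulating statements across many batches.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

/** Hypothetical helper, not the PR's actual toStream(). */
final class ResultSetStreams {

    /** Maps the current row to a value; may throw SQLException. */
    interface RowMapper<T> {
        T map(ResultSet row) throws SQLException;
    }

    /** Streams rows lazily; closing the stream never touches the connection. */
    static <T> Stream<T> toStream(ResultSet resultSet, RowMapper<T> mapper) {
        Spliterator<T> rows =
                new Spliterators.AbstractSpliterator<T>(Long.MAX_VALUE, Spliterator.ORDERED) {
                    @Override
                    public boolean tryAdvance(Consumer<? super T> action) {
                        try {
                            if (!resultSet.next()) {
                                return false;
                            }
                            action.accept(mapper.map(resultSet));
                            return true;
                        } catch (SQLException e) {
                            throw new RuntimeException(e);
                        }
                    }
                };
        return StreamSupport.stream(rows, false).onClose(() -> {
            try {
                // Fetch the statement before closing the result set.
                Statement statement = resultSet.getStatement();
                resultSet.close();
                if (statement != null) {
                    statement.close();
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        });
    }
}
```

Callers would wrap the stream in try-with-resources per batch so the onClose handler runs while the snapshot connection stays open for the next batch.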

Version: Bumped to 3.7.3-rc.1 (PATCH, with -rc.1 suffix for progressive rollout).

Review guide

  1. DefaultJdbcDatabase.kt: ⚠️ Redundant change. Makes dataSource public. This was part of an earlier approach but is no longer needed since DataSource is now threaded via constructor params. Consider reverting this change to avoid expanding CDK API surface unnecessarily.

  2. PostgresSource.java — Stores the most recently created DataSource in currentDataSource field and passes it to all createInitialLoader call sites (3 locations). Note: This field is not synchronized; if createDatabase() is called concurrently, race conditions could occur (though unlikely in practice).

  3. CtidUtils.java — Adds DataSource parameter to createInitialLoader() and passes it to PostgresCtidHandler constructor.

  4. PostgresCtidHandler.java — Adds DataSource field and constructor parameter, passes it to InitialSyncCtidIterator constructor.

  5. PostgresCdcCtidInitializer.java — Adds DataSource parameter to cdcCtidIteratorsCombined() and passes it to createInitialLoader().

  6. InitialSyncCtidIterator.java — The main fix. Key areas:

    • openSnapshotConnection() / closeSnapshotConnection() — Connection lifecycle management. Uses dataSource.getConnection() directly (no more instanceof cast).
    • getStream() — Now uses the shared snapshotConnection directly instead of database.unsafeQuery().
    • toStream() — Custom spliterator. Note: PreparedStatements are not explicitly closed per-batch; they are cleaned up when the snapshot connection closes. Verify this is acceptable for scans with many batches (potential memory accumulation).
    • SNAPSHOT_FETCH_SIZE = 10_000: Replaces adaptive fetch size tuning from AdaptiveStreamingQueryConfig. For tables with very wide rows this could increase memory usage vs. the previous adaptive approach.
    • initSubQueries() and resetSubQueries() — Snapshot connection is opened at scan start and reopened on VACUUM resets. Both now throw SQLException.
    • close() — Properly closes both the current iterator and the snapshot connection.
  7. metadata.yaml / postgres.md — Version bump to 3.7.3-rc.1 and changelog entry.

⚠️ Items for human reviewer

  • Redundant CDK change: The DefaultJdbcDatabase.kt change is no longer needed. Should it be reverted to avoid expanding CDK API surface?
  • No new tests: This PR does not add unit/integration tests for the snapshot connection behavior. Reproduction requires concurrent writes during a multi-batch CTID scan, which is difficult to test in isolation. Consider whether tests should be added.
  • Long-running REPEATABLE READ transactions: For very large tables, the snapshot connection may be held open for minutes to hours, preventing VACUUM from cleaning dead tuples created during the scan. This is a known trade-off for correctness but could cause table bloat.
  • Thread safety: PostgresSource.currentDataSource is not synchronized. Verify that createDatabase() is never called concurrently in production.

User Impact

Positive: Eliminates silent data loss for full refresh syncs on tables with concurrent writes. Rows updated during the scan will no longer be dropped.

Negative:

  • Long-running transactions: The REPEATABLE READ transaction is held open for the entire CTID scan (potentially minutes to hours for large tables). This prevents PostgreSQL from VACUUMing dead tuples created during the scan, which could cause table bloat if scans are frequent and long-running. This is a known trade-off for correctness.
  • Fixed fetch size: Loss of adaptive fetch size tuning may increase memory usage for tables with very wide rows compared to the previous implementation.
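A rough back-of-the-envelope sketch of the fetch-size trade-off. Only SNAPSHOT_FETCH_SIZE = 10_000 comes from this PR; the budget-based helper is a hypothetical illustration and is not how AdaptiveStreamingQueryConfig works. Average row sizes are assumed inputs.

```java
/** Illustrative arithmetic only; the sizing helper is hypothetical. */
final class FetchSizeEstimate {

    /** Fixed fetch size used by the snapshot connection in this PR. */
    static final int SNAPSHOT_FETCH_SIZE = 10_000;

    /** Approximate bytes buffered by one fetch of fetchSize rows. */
    static long bufferedBytes(int fetchSize, long avgRowBytes) {
        return (long) fetchSize * avgRowBytes;
    }

    /** Largest fetch size that fits the memory budget, clamped to [1, maxFetchSize]. */
    static int fetchSizeForBudget(long budgetBytes, long avgRowBytes, int maxFetchSize) {
        long rows = budgetBytes / Math.max(1L, avgRowBytes);
        return (int) Math.max(1L, Math.min((long) maxFetchSize, rows));
    }

    public static void main(String[] args) {
        long oneKiB = 1024L;
        long oneMiB = 1024L * 1024L;
        long budget = 256L * oneMiB;

        System.out.println("1 KiB rows buffer bytes: " + bufferedBytes(SNAPSHOT_FETCH_SIZE, oneKiB));
        System.out.println("1 MiB rows buffer bytes: " + bufferedBytes(SNAPSHOT_FETCH_SIZE, oneMiB));
        System.out.println("fetch size for 1 MiB rows in 256 MiB: "
                + fetchSizeForBudget(budget, oneMiB, SNAPSHOT_FETCH_SIZE));
    }
}
```

With 1 KiB rows a 10,000-row fetch buffers about 10 MB, while 1 MiB rows would need roughly 10 GB, which is the scenario the wide-row caveat above is about.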

Can this PR be safely reverted and rolled back?

  • YES 💚

The change is isolated to CTID scanning logic. Reverting would restore the previous behavior (separate snapshots per batch, with the silent row drop bug).


Session: https://app.devin.ai/sessions/be8cc3db060a4bc3b6e8d4b5b271879a
Requested by: bot_apk (apk@cognition.ai)

…ries

Previously, each CTID batch query in InitialSyncCtidIterator opened a new
database connection via database.unsafeQuery(), which meant each batch got
its own MVCC snapshot under PostgreSQL's default READ COMMITTED isolation.
When rows were updated by concurrent transactions during a multi-batch scan,
the old CTID became dead in later snapshots and the new CTID could land in
a range already scanned — causing rows to be silently dropped.

This fix opens a single dedicated connection with REPEATABLE READ isolation
at the start of the CTID scan and reuses it for all batch queries. This
ensures all batches see the same consistent snapshot, eliminating the
window where concurrent updates can cause rows to be missed.

Key changes:
- Add getDataSource() to DefaultJdbcDatabase for connection access
- Open a REPEATABLE READ snapshot connection in initSubQueries()
- Reuse the snapshot connection for all CTID batch queries
- Properly close and reopen on VACUUM resets
- Close the snapshot connection on iterator close

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


@github-actions
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.


PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • 🛠️ Quick Fixes
    • /format-fix - Fixes most formatting issues.
    • /bump-version - Bumps connector versions, scraping changelog description from the PR title.
  • ❇️ AI Testing and Review (internal link: AI-SDLC Docs):
    • /ai-prove-fix - Runs prerelease readiness checks, including testing against customer connections.
    • /ai-canary-prerelease - Rolls out prerelease to 5-10 connections for canary testing.
    • /ai-review - AI-powered PR review for connector safety and quality gates.
  • 🚀 Connector Releases:
    • /publish-connectors-prerelease - Publishes pre-release connector builds (tagged as {version}-preview.{git-sha}) for all modified connectors in the PR.
    • /bump-progressive-rollout-version - Bumps connector version with an RC suffix (2.16.10-rc.1) for progressive rollouts (enableProgressiveRollout: true).
      • Example: /bump-progressive-rollout-version changelog="Add new feature for progressive rollout"
  • ☕️ JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
    • /bump-bulk-cdk-version bump=patch changelog='foo' - Bump the Bulk CDK's version. bump can be major/minor/patch.
  • 🐍 Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.
  • ⚙️ Admin commands:
    • /force-merge reason="<REASON>" - Force merges the PR using admin privileges, bypassing CI checks. Requires a reason.
      Example: /force-merge reason="CI is flaky, tests pass locally"

@github-actions
Contributor

github-actions bot commented Feb 26, 2026

source-postgres Connector Test Results

331 tests: 330 passed ✅, 1 skipped 💤, 0 failed ❌ (38 suites, 38 files, 13m 26s ⏱️)

Results for commit 27f2f6f.

♻️ This comment has been updated with latest results.

devin-ai-integration bot and others added 3 commits February 26, 2026 13:01
- Change dataSource from protected to public in DefaultJdbcDatabase.kt
  (Kotlin auto-generates getDataSource() from val property; explicit
  getDataSource() method caused JVM signature clash)
- Apply Google Java Format conventions to InitialSyncCtidIterator.java
  (line wrapping, Javadoc formatting)

Co-Authored-By: bot_apk <apk@cognition.ai>
Version bump for CTID snapshot isolation fix. Uses -rc.1 suffix
for progressive rollout.

Co-Authored-By: bot_apk <apk@cognition.ai>
Co-Authored-By: bot_apk <apk@cognition.ai>
@github-actions
Contributor

github-actions bot commented Feb 26, 2026

Deploy preview for airbyte-docs ready!

✅ Preview
https://airbyte-docs-eh3o6q9k3-airbyte-growth.vercel.app

Built with commit 27f2f6f.
This pull request is being automatically deployed with vercel-action

devin-ai-integration bot and others added 3 commits February 26, 2026 13:09
…pply)

Co-Authored-By: bot_apk <apk@cognition.ai>
…otected field

Thread DataSource through the constructor chain:
PostgresSource → CtidUtils.createInitialLoader → PostgresCtidHandler → InitialSyncCtidIterator

This avoids the CDK dependency issue where the connector CI builds against
published CDK artifacts that don't have the getDataSource() method public.

Co-Authored-By: bot_apk <apk@cognition.ai>
The DataSource is now threaded via constructor params, so the CDK
visibility change is no longer needed.

Co-Authored-By: bot_apk <apk@cognition.ai>
@octavia-bot
Contributor

octavia-bot bot commented Feb 26, 2026

🔍 AI Prove Fix session starting... Running readiness checks and testing against customer connections. View playbook

Devin AI session created successfully!

@devin-ai-integration
Contributor Author

↪️ Triggering /ai-prove-fix per Hands-Free AI Triage Project triage next step.

Reason: Draft PR with CI passing (36/36 checks; 331 tests, of which 330 passed and 1 was skipped). Fix uses REPEATABLE READ snapshot isolation for CTID batch queries to prevent silent row drops during concurrent writes.
https://github.com/airbytehq/oncall/issues/11440

Devin session

@devin-ai-integration
Contributor Author

devin-ai-integration bot commented Feb 26, 2026

Fix Validation Evidence

Outcome: Could not Run Tests (live connection testing blocked on approval; regression tests passed)

Evidence Summary

Regression tests ran the pre-release version (3.7.3-rc.1-preview.27f2f6f) against the current production version in comparison mode. All 4 commands passed (SPEC, CHECK, DISCOVER, READ), confirming no regressions introduced by the fix.

Live connection testing on 2 qualified internal canary connections was blocked because the set_cloud_connector_version_override tool requires an approval comment from a GitHub user with a public @airbyte.io email, and no such comment exists on the PR or linked oncall issue. An approval request has been posted to the oncall issue.

Next Steps

  1. To unblock live testing: An Airbyte team member with a public @airbyte.io email should reply to oncall issue 11440 approving the evidence plan.
  2. Re-run /ai-prove-fix on this PR after approval is posted — the pre-release image and qualified connections are ready.
  3. Alternatively, for broader validation, run /ai-canary-prerelease to test on additional connections.
  4. The weekly /ai-release-manager will automatically monitor the release rollout after merge.

Connector & PR Details

Connector: source-postgres
PR: #74062
Pre-release Version Tested: 3.7.3-rc.1-preview.27f2f6f
Pre-release Publish: https://github.com/airbytehq/airbyte/actions/runs/22447948102 (completed successfully)
Detailed Results: https://github.com/airbytehq/oncall/issues/11440#issuecomment-3967330020

Evidence Plan

Proving Criteria

  1. Regression test passes in comparison mode (target vs control) — MET
  2. Live connection sync succeeds on internal canary — NOT TESTED (blocked on approval)
  3. Sync logs show REPEATABLE READ snapshot connection — NOT TESTED (blocked on approval)

Disproving Criteria

  1. Regression test fails or shows different output — NOT triggered
  2. Live sync fails with new errors — NOT TESTED
  3. Missing snapshot log message — NOT TESTED

Cases Attempted

  • Regression Tests: All 4 commands passed (SPEC, CHECK, DISCOVER, READ) — workflow
  • Case 1 (Internal AWS canary): Qualified, ready for pinning — blocked on approval
  • Case 2 (Internal EU canary): Qualified, ready for pinning — blocked on approval

Pre-flight Checks

  • Viability: Fix addresses the reported issue — opens single REPEATABLE READ snapshot connection for all CTID batches instead of per-batch READ COMMITTED connections
  • Safety: No malicious code or dangerous patterns — changes scoped to CTID scanning logic and DataSource threading
  • Breaking Change: No breaking changes detected — PATCH version bump (3.7.2-rc.1 → 3.7.3-rc.1), no schema/spec/state/stream changes
  • Reversibility: Can be safely downgraded/reverted — no state or config format changes

Design Intent Note: The current behavior (READ COMMITTED per batch) is clearly unintentional. No existing code uses REPEATABLE READ or snapshot isolation for CTID scans.

Detailed Evidence Log

Regression Tests (2026-02-26T15:10–15:17 UTC)

Command  | Target (pre-release) | Control (production) | Comparison
SPEC     | success              | success              | PASS
CHECK    | success              | success              | PASS
DISCOVER | success              | success              | PASS
READ     | success              | success              | PASS

Workflow: https://github.com/airbytehq/airbyte-ops-mcp/actions/runs/22448037363

Live Connection Tests

Not executed — blocked on approval comment from @airbyte.io user. Two internal canary connections qualified and ready. See oncall issue for details.


Devin session

@github-actions
Contributor

github-actions bot commented Feb 26, 2026

Pre-release Connector Publish Started

Publishing pre-release build for connector source-postgres.
PR: #74062

Pre-release versions will be tagged as {version}-preview.27f2f6f
and are available for version pinning via the scoped_configuration API.

View workflow run
Pre-release Publish: SUCCESS

Docker image (pre-release):
airbyte/source-postgres:3.7.3-rc.1-preview.27f2f6f

Docker Hub: https://hub.docker.com/layers/airbyte/source-postgres/3.7.3-rc.1-preview.27f2f6f

Registry JSON:
