feat: Adding pg_legacy_replication verified source using decoderbufs #589
Open — neuromantik33 wants to merge 88 commits into dlt-hub:master from neuromantik33:pg-legacy-replcation
Changes from all commits (88 commits):
59e7557
fix: finally got pg_replication tests working as is
79220b7
feat: got decoderbufs to run and compile in docker
9de0835
chore: updated protobuf to latest compatible version
75a0f7f
chore: copying all files from pg_replication; format-lint is reformat…
73704af
wip: saving work
7d1b8e7
wip: saving work
ecbf98d
wip: saving work
3ed14da
wip: removed all references to publications
9fe0301
fix: applied suggested changes mentioned here https://github.com/dlt-…
197ba82
wip: saving work
c897ee0
wip: finally got snapshot to work
d303c04
chore: simply cleaning up
6566fe4
chore: need to find a better way to clean up the underlying engine
70d40a0
wip: handling begin/commit
f703431
wip: saving work
f001633
wip: saving work
c0df7c9
wip: saving work
aa464d5
wip: saving work
db09568
wip: making progress
c3c0518
wip: saving work
a5b1a87
refactor: some test refactoring
7fad621
wip: saving work
fbc65bc
wip: saving work
1299b60
wip: cleaning up + refactor
f44853b
wip: cleaning up + refactor
46200ca
wip: cleaning up + refactor
f0f0146
wip: slowly progressing
cd8d906
wip: all tests pass now to update docs and cleanup
02851f4
wip: still trying to get it work with all versions of dlt
beef6ea
wip
77242e8
wip: changing signature
f9cdf78
wip: finally got rid of those errors
327b44c
wip: correcting failing tests
a9a9bb7
wip: fixed working examples
37acc35
wip: more refactoring now docs... -_-
a90acee
wip: cleaning up
f927f13
wip: cleaning up
cc7ad61
wip: attempting to refactor to use dlt resources
f9c7694
wip: second test passing
927ae03
wip: all tests pass again now for refactoring
3f31752
wip: init_replication is now a dlt source
637a6e9
wip: more refactoring
8a8134b
wip: saving work until I can get hinting to work
ee3cb9c
wip: finally got something somewhat working
1727456
wip: done with coding now docs
81fdce8
fix: various performance improvements
8fbfc62
fix: minor corrections to handle old versions of postgres
fd4638b
fix: small type corrections for pg9.6
526eff3
fix: exposing table options for later arrow support
2f5ad15
wip: saving work for arrow
32063e2
wip: first test with arrow passing
28f463d
wip: almost done passing all tests
385e8a6
wip: some arrow tests are still not passing
a291b69
fix: done with pyarrow; too many issues with duckdb atm
ba23505
wip: some bug fixes
5993fb4
wip: small refactoring
6db693a
wip: duckdb needs patching, trying out new max_lsn
c53c9f9
wip: some refactoring of options to make certain features togglable
ba1c3fc
wip: lsn and deleted ts are optional
6b960df
feat: added optional transaction id
9fa9d98
feat: added optional commit timestamp
1947029
fix: never handled missing type and added text oid mapping
7a7ba30
fix: added some logging and bug fixes
a752581
chore: basic refactoring
4184ca9
fix: minor corrections
3c7232f
chore: reverting back to prev state
c8f1ad2
chore: rebasing 1.x branch onto my own
7024ce7
fix: corrected bug regarding column names
63b1de0
chore: minor fixes
e8b2a0c
chore: small perf fixes and aligning with more adt
4c33129
chore: refactoring and cleaning
0b7c151
chore: finished docstrings
ec72e36
bugfix: misuse of defaultdict
ecc6089
Finally done with docs
dd5a63b
fix: wasn't able to execute local tests without these settings
d377423
feat: added basic support for scalar array types
acdf446
chore: slight perf improvments for pg_arrays
a3dc99d
fix: it turns out pg_arrays are annoying found temp workaround
c9c5bcb
refactor: all sqlalchemy event code is done at engine configuration
d695afb
chore: bumped python to 3.9; small refactorings
8f45283
refactor: init_replication is now in pkg ns
41f8ded
fix: corrected bugs regarding inferring nullability wrong; refactored…
129b18a
fix: rolling back on managing conn lifecycle using context mgrs: it d…
9083611
fix: corrected regression with occasional datum_missinng values
864b746
fix: add support for ordinary json pg_type
a591618
fix: various fixes of bugs encountered during production
5d6790e
fix: various fixes related to pyarrow backends
f276b4a
chore: updating poetry + dlt
# Postgres legacy replication

[Postgres](https://www.postgresql.org/) is one of the most popular relational database management systems. This verified source uses Postgres' replication functionality to efficiently process changes in tables (a process often referred to as _Change Data Capture_ or CDC). It uses [logical decoding](https://www.postgresql.org/docs/current/logicaldecoding.html) and the `decoderbufs` [output plugin](https://github.com/debezium/postgres-decoderbufs), a shared library that must be built and enabled on the server.

| Source             | Description                                     |
|--------------------|-------------------------------------------------|
| replication_source | Load published messages from a replication slot |
## Install decoderbufs

Build instructions can be found [here](https://github.com/debezium/postgres-decoderbufs?tab=readme-ov-file#building).
Below is an example installation in a Docker image:

```Dockerfile
FROM postgres:14

# Install dependencies required to build decoderbufs
RUN apt-get update
RUN apt-get install -f -y \
    software-properties-common \
    build-essential \
    pkg-config \
    git

RUN apt-get install -f -y \
    postgresql-server-dev-14 \
    libprotobuf-c-dev && \
    rm -rf /var/lib/apt/lists/*

ARG decoderbufs_version=v1.7.0.Final
RUN git clone https://github.com/debezium/postgres-decoderbufs -b $decoderbufs_version --single-branch && \
    cd postgres-decoderbufs && \
    make && make install && \
    cd .. && \
    rm -rf postgres-decoderbufs
```
## Initialize the pipeline

```bash
$ dlt init pg_legacy_replication duckdb
```

This uses `duckdb` as the destination, but you can choose any of the supported [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/).
## Set up user

The Postgres user needs the `LOGIN` and `REPLICATION` attributes:

```sql
CREATE ROLE replication_user WITH LOGIN REPLICATION;
```

It also needs read-only privileges on the replicated schema. First connect to the database, then grant:

```sql
\connect dlt_data
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;
```
## Add credentials

1. Open `.dlt/secrets.toml`.
2. Enter your Postgres credentials:

```toml
[sources.pg_legacy_replication]
credentials="postgresql://replication_user:<<password>>@localhost:5432/dlt_data"
```

3. Enter credentials for your chosen destination as per the [docs](https://dlthub.com/docs/dlt-ecosystem/destinations/).
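If the password contains special characters such as `@` or `/`, it must be percent-encoded inside the URI or the credentials string will not parse. A small stdlib-only sketch of building a safe URI (the helper name is hypothetical, not part of this source):

```python
from urllib.parse import quote, urlsplit

def build_pg_uri(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a postgresql:// credentials URI, percent-escaping the password.

    Hypothetical helper for illustration; not part of pg_legacy_replication.
    """
    return f"postgresql://{user}:{quote(password, safe='')}@{host}:{port}/{database}"

# "@" and "/" in the password are escaped so the URI round-trips cleanly
uri = build_pg_uri("replication_user", "p@ss/word", "localhost", 5432, "dlt_data")
parts = urlsplit(uri)
```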
## Run the pipeline

1. Install the necessary dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Run the pipeline:

   ```bash
   python pg_legacy_replication_pipeline.py
   ```

3. To make sure that everything loaded as expected, use the command:

   ```bash
   dlt pipeline pg_legacy_replication_pipeline show
   ```
# Differences between `pg_legacy_replication` and `pg_replication`

## Overview

`pg_legacy_replication` is a fork of the verified `pg_replication` source. The primary goal of this fork is to provide logical replication capabilities for Postgres instances running versions earlier than 10, before the `pgoutput` plugin became available. This fork draws inspiration from the original `pg_replication` source and the `decoderbufs` library, which is actively maintained by Debezium.
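The version cutoff above can be made concrete: Postgres reports its version as `server_version_num`, where 10.0 corresponds to 100000. A minimal illustration (an assumption for this sketch, not code from this source) of choosing a decoding plugin by server version:

```python
def pick_plugin(server_version_num: int) -> str:
    """Choose a logical decoding plugin by Postgres server version.

    Illustrative only: pgoutput ships with Postgres 10+ (server_version_num
    >= 100000); older servers need an external plugin such as decoderbufs.
    """
    return "pgoutput" if server_version_num >= 100000 else "decoderbufs"

print(pick_plugin(90624))  # Postgres 9.6.24 -> decoderbufs
```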
## Key Differences from `pg_replication`

### Replication User Ownership Requirements
One of the limitations of native Postgres replication is that the replication user must **own** the tables in order to add them to a **publication**. Additionally, once a table is added to a publication, it cannot be removed; a new replication slot must be created instead, which loses any state tracking.

### Limitations in `pg_replication`
The current `pg_replication` implementation has several limitations:
- It supports only a single initial snapshot of the data.
- It requires `CREATE` access to the source database in order to perform the initial snapshot.
- **Superuser** access is required to replicate entire Postgres schemas. While the `pg_legacy_replication` source theoretically reads the entire WAL across all schemas, the current implementation using dlt transformers restricts this functionality. In practice, this has not been a common use case.
- The implementation is opinionated in its approach to data transfer. Specifically, when updates or deletes are required, it defaults to a `merge` write disposition, which replicates live data without tracking changes over time.
### Features of `pg_legacy_replication`

This fork of `pg_replication` addresses the aforementioned limitations and introduces the following improvements:
- Adheres to the dlt philosophy by treating the WAL as an upstream resource. This replication stream is then transformed into various dlt resources, with customizable options for write disposition, file formats, type hints, etc., specified at the resource level rather than at the source level.
- Supports an initial snapshot of all tables using the transaction slot isolation level. Additionally, ad-hoc snapshots can be performed using the serializable deferred isolation level, similar to `pg_dump`.
- Emphasizes the use of `pyarrow` and parquet formats for efficient data storage and transfer. A dedicated backend has been implemented to support these formats.
- Replication messages are decoded using Protocol Buffers (protobufs) in C, rather than relying on native Python byte buffer parsing, for greater efficiency and performance.
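A replication client like this one tracks its position in the WAL via log sequence numbers (LSNs), which Postgres renders as strings like `16/B374D848` (two hex halves). A minimal stdlib sketch of converting them to integers for comparison — illustrative only, not the source's actual implementation:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN string ("hi/lo", both hex) to a 64-bit int.

    Illustrative helper; pg_legacy_replication's real LSN handling may differ.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# A numerically higher LSN means a position further along in the WAL
assert lsn_to_int("16/B374D848") > lsn_to_int("0/FFFFFFFF")
```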
## Next steps

- Add support for the [wal2json](https://github.com/eulerto/wal2json) replication plugin. This is particularly important for environments such as **Amazon RDS**, which supports `wal2json`, as opposed to on-premise or Google Cloud SQL instances, which support `decoderbufs`.
> **Review comment:** What does it mean that `decoderbufs` is optional? If not present, do we decode on the client?
> **Author reply:** The heavy lifting is done by the decoderbufs extension, which must be added if using a managed Postgres like Cloud SQL, or compiled and installed on a self-hosted Postgres installation. Detailed instructions can be found here: https://debezium.io/documentation/reference/stable/postgres-plugins.html#logical-decoding-output-plugin-installation. FYI, decoderbufs is the default logical replication plugin used by Debezium.