Loading partitioned dataset from S3 to Redshift #1712
Unanswered
dhatch-niv asked this question in Q&A
Replies: 1 comment
-
Instead of using wr.redshift.copy_from_files, did you consider `wr.redshift.copy`?
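For reference, a minimal sketch of the `wr.redshift.copy` route with an illustrative DataFrame; the bucket, connection name, table, schema, and IAM role below are placeholders:

```python
import awswrangler as wr
import pandas as pd

# Illustrative frame standing in for one new period of data (placeholder values).
df = pd.DataFrame({"period": ["2022-10", "2022-10"], "id": [1, 2], "value": [1.5, 2.5]})

# Placeholder Glue Catalog connection name.
con = wr.redshift.connect("my-redshift-connection")

# copy() stages the DataFrame as Parquet under `path` and then issues a COPY,
# so the partition column never has to survive a separate to_parquet step.
wr.redshift.copy(
    df=df,
    path="s3://my-bucket/redshift-staging/",  # temporary staging prefix (placeholder)
    con=con,
    table="my_table",
    schema="public",
    mode="append",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",  # placeholder
)
con.close()
```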
-
Hi all -
I have a dataset that I am looking to incrementally extract to Redshift. For each new period of data I have broken this into two steps: first write the period to S3 as a partitioned Parquet dataset with `wr.s3.to_parquet`, then load those files into Redshift with `wr.redshift.copy_from_files`. The implementation roughly looks like the sketch below.
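A minimal sketch of those two steps (the DataFrame, bucket, partition column, connection name, table, schema, and IAM role are placeholders):

```python
import awswrangler as wr
import pandas as pd

# Illustrative frame standing in for one new period of data (placeholder values).
df = pd.DataFrame({"period": ["2022-10", "2022-10"], "id": [1, 2], "value": [1.5, 2.5]})

# Step 1: write the new period to S3 as a Hive-partitioned Parquet dataset.
# The partition value ends up in the object key (period=2022-10/...), and the
# "period" column is dropped from the Parquet files themselves.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-dataset/",  # placeholder prefix
    dataset=True,
    mode="append",
    partition_cols=["period"],
)

# Step 2: COPY the newly written files into Redshift.
con = wr.redshift.connect("my-redshift-connection")  # placeholder connection name
wr.redshift.copy_from_files(
    path="s3://my-bucket/my-dataset/period=2022-10/",
    con=con,
    table="my_table",
    schema="public",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",  # placeholder
)
con.close()
```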
The `copy_from_files` function fails because Redshift does not understand the partitions created by `to_parquet`, and `wr.s3.to_parquet` strips the partition column from the data files when writing to S3.

A possible solution would be to add a keyword argument to `to_parquet`, something like `preserve_partition_columns=True`, which would keep the partition columns in the Parquet file instead of dropping them (sketched below). Is there a better way to achieve this without changes to the library?
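For concreteness, the proposed call might look like this (note that `preserve_partition_columns` does not exist in awswrangler today; it is only the keyword suggested here, continuing from the sketch above):

```python
# Hypothetical: preserve_partition_columns is the *proposed* keyword,
# not an existing awswrangler parameter. Same imports, df, and placeholder
# path as in the sketch above.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-dataset/",
    dataset=True,
    mode="append",
    partition_cols=["period"],
    preserve_partition_columns=True,  # keep "period" inside each Parquet file as well
)
```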
Thanks.