-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[copy_from] AWS source #31144
[copy_from] AWS source #31144
Conversation
7defd19
to
58f62b2
Compare
58f62b2
to
b9715b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I think we should add tests for this.
src/aws-util/src/s3.rs
Outdated
use aws_types::sdk_config::SdkConfig; | ||
|
||
pub use aws_sdk_s3::Client; | ||
use bytes::Bytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is meant to go in the other import block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yeah I should move this up, thanks!
* add a new OneshotSource implementation * support the FILES and PATTERN options for COPY FROM
b9715b2
to
4f8ab12
Compare
I have a bunch of tests in a stacked PR, will get those up for review! |
_Stacked on top of_ #31144 This PR refactors ranged get requests so the logic for forming the header value is shared for both the HTTP and S3 source. It also fixes a bug where sometimes `HEAD` requests (which we use to get metadata for a file) are not supported, so we fallback to a `GET` request but quickly drop the body. ### Motivation Fix a bug in ranged HTTP requests I found in the demo today ### Tips for reviewer review only the final commit, the one titled "start, fix ranged requests" ### Checklist - [x] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/)) - [x] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design. <!-- Reference the design in the description. --> - [x] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label. - [x] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)). <!-- Ask in #team-cloud on Slack if you need help preparing the cloud PR. --> - [x] If this PR includes major [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note), I have pinged the relevant PM to schedule a changelog post.
_Stacked on top of_ #31144 This PR adds a new implementation of a `OneshotFormat` that supports reading in Parquet files. The decoding is built on top of the `ArrowReader` implemented in #30958. The strategy we use for reading and decoding Parquet files is the "split work" stage of a oneshot source will read the footer metadata from a Parquet file to determine the [Row Group](https://parquet.apache.org/docs/concepts/) boundaries. The Row Groups are then distributed among timely works for fetching and eventual decoding. Note: Through experimentation I found that Row Groups seem to typically be 10s of MB large, which makes them a pretty good unit of parallelization. ### Motivation Fixes MaterializeInc/database-issues#8853 ### Tips for reviewer Review on the final commit, the one titled "start, support Parquet for COPY FROM" ### Checklist - [x] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/)) - [x] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design. <!-- Reference the design in the description. --> - [x] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label. - [x] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)). <!-- Ask in #team-cloud on Slack if you need help preparing the cloud PR. --> - [x] If this PR includes major [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note), I have pinged the relevant PM to schedule a changelog post.
Stacked on top of: #30956
This PR implements a new AwsS3
OneshotSource
that allows copying in files from S3, e.g.Along with
FILES = [<files>]
we also support aPATTERN = <glob>
option which allows copying multiple files all at once.Motivation
Fixes https://github.com/MaterializeInc/database-issues/issues/8860
Fixes https://github.com/MaterializeInc/database-issues/issues/8855
Tips for reviewer
Review only the final commit, the one titled "start, implementation of an S3 oneshot source"
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.