Parquet Modular decryption support #6637
Currently this is a rough rebase of work done by @ggershinsky. As |
@rok let me know if you want any help shoehorning this into |
Is there any help, input or contribution needed here? |
Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push.
@etseidl could you please do a quick pass to say if this makes sense in respect to |
Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!
Co-authored-by: Adam Reeve <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
Using a feature flag to gate signature changes should mean this isn't a breaking change, but generally I would consider it a bit of a code smell. Generally feature flags should be additive, and not change method signatures. I've not looked closely at this PR, and tbh am unlikely to have time in the near future, but I would imagine there are ways to avoid this.
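To illustrate what "additive" could mean here, a minimal sketch follows. It is not code from this PR: `Metadata`, `decode_encrypted_metadata` and the `key` parameter are hypothetical stand-ins, and the decryption step itself is omitted. The point is only that the plaintext function keeps one signature regardless of the feature, while the feature contributes a new item instead of a new parameter.

```rust
/// Hypothetical stand-in for the decoded footer; the real type is far richer.
#[derive(Debug, PartialEq)]
pub struct Metadata {
    pub num_bytes: usize,
    pub decrypted: bool,
}

/// Plaintext path: the signature is identical with or without the feature.
pub fn decode_metadata(buf: &[u8]) -> Result<Metadata, String> {
    if buf.is_empty() {
        return Err("empty footer".to_string());
    }
    Ok(Metadata {
        num_bytes: buf.len(),
        decrypted: false,
    })
}

/// Additive path: this function only exists when the `encryption` feature is
/// enabled, so turning the feature on cannot break existing callers.
/// The name and the `key` parameter are assumptions made for illustration.
#[cfg(feature = "encryption")]
pub fn decode_encrypted_metadata(buf: &[u8], key: &[u8]) -> Result<Metadata, String> {
    if key.is_empty() {
        return Err("missing footer key".to_string());
    }
    // Real decryption is omitted; this sketch only marks the result.
    decode_metadata(buf).map(|m| Metadata {
        decrypted: true,
        ..m
    })
}
```

A crate that depends on this code without the feature and another that enables it can then be unified by Cargo without either one failing to compile.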
@tustvold we could use more |
Thank you @rok @etseidl and @tustvold -- it is pretty amazing to see this feature being added after so many years
I am somewhat concerned about the API -- I realize the current structure of the code doesn't lend itself well to adding new features easily, so what this PR does is understandable
However, I think that if we really want to roll this out in a way that is easy to use and won't require non-trivial API rework in the future, we should encapsulate the footer decoding logic more.
I can try to help with this in a few weeks' time, but realistically I won't be able to for several weeks.
Let me know if this doesn't make sense
@@ -125,6 +128,8 @@ sysinfo = ["dep:sysinfo"]
crc = ["dep:crc32fast"]
# Enable SIMD UTF-8 validation
simdutf8 = ["dep:simdutf8"]
# Enable Parquet modular encryption support
Can we please also document this new flag here
https://github.com/apache/arrow-rs/tree/main/parquet#feature-flags
Maybe we should update the feature support matrix as well
@@ -127,13 +128,26 @@ impl<F: MetadataFetch> MetadataLoader<F> {
let (metadata, remainder) = if length > suffix_len - FOOTER_SIZE {
let metadata_start = file_size - length - FOOTER_SIZE;
let meta = fetch.fetch(metadata_start..file_size - FOOTER_SIZE).await?;
(ParquetMetaDataReader::decode_metadata(&meta)?, None)
(
ParquetMetaDataReader::decode_metadata(
I agree with @tustvold and @etseidl here -- using `#cfg(...)` to change function signatures seems uncommon (I haven't run across it before in Rust) and I don't think there are any precedents in this crate either
I think a more standard API would be to make a structure that has fields which could be controlled. Given there is now more state / options needed to decode the footer we could put the details into the struct
let decoder = ParquetMetaDataDecoder::new(&footer)
.read_meta(fetch)
...
Or something 🤔
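One way to read the suggestion above is a small options-carrying decoder, sketched below. None of these names come from the crate: `MetadataDecoder`, `DecodedFooter`, `with_encrypted_footer` and `with_footer_key` are all hypothetical, and no actual Thrift decoding or decryption happens. The sketch only shows how state that would otherwise become extra function parameters can live in struct fields.

```rust
/// Hypothetical options holder; not the crate's actual API.
#[derive(Debug, Default, Clone)]
pub struct MetadataDecoder {
    encrypted_footer: bool,
    footer_key: Option<Vec<u8>>,
}

/// Hypothetical stand-in for the decoded result.
#[derive(Debug, PartialEq)]
pub struct DecodedFooter {
    pub len: usize,
    pub was_encrypted: bool,
}

impl MetadataDecoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// New options become new builder methods, not new parameters on `decode`.
    pub fn with_encrypted_footer(mut self, encrypted: bool) -> Self {
        self.encrypted_footer = encrypted;
        self
    }

    pub fn with_footer_key(mut self, key: Vec<u8>) -> Self {
        self.footer_key = Some(key);
        self
    }

    /// The signature stays the same however many options are added later.
    pub fn decode(&self, buf: &[u8]) -> Result<DecodedFooter, String> {
        if self.encrypted_footer && self.footer_key.is_none() {
            return Err("encrypted footer requires a key".to_string());
        }
        // Real decoding (and decryption) is omitted in this sketch.
        Ok(DecodedFooter {
            len: buf.len(),
            was_encrypted: self.encrypted_footer,
        })
    }
}
```

Under these assumptions a plaintext caller writes `MetadataDecoder::new().decode(&buf)`, while an encrypted caller chains the two `with_*` calls first; neither call site changes when further options are introduced.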
fn get_metadata<'a>(
&'a mut self,
#[cfg(feature = "encryption")] file_decryption_properties: Option<
&'a FileDecryptionProperties,
It would make more sense to me if `FileDecryptionProperties` was a field on the reader.
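A rough sketch of that idea, under simplifying assumptions: the trait here is synchronous and returns a `String`, unlike the async `get_metadata` in the diff, and `DecryptionProps`, `MetadataSource` and `InMemoryReader` are hypothetical names used only for illustration. What it tries to show is that the trait method never mentions decryption, because the reader carries the properties itself.

```rust
/// Hypothetical stand-in for `FileDecryptionProperties`.
#[derive(Debug, Clone, PartialEq)]
pub struct DecryptionProps {
    pub footer_key: Vec<u8>,
}

/// Simplified trait: the method signature has no decryption parameter.
pub trait MetadataSource {
    fn get_metadata(&mut self) -> Result<String, String>;
}

/// Hypothetical reader that owns its optional decryption properties.
pub struct InMemoryReader {
    bytes: Vec<u8>,
    decryption: Option<DecryptionProps>,
}

impl InMemoryReader {
    pub fn new(bytes: Vec<u8>) -> Self {
        Self {
            bytes,
            decryption: None,
        }
    }

    /// Decryption is configured once on the reader, not passed on every call.
    pub fn with_decryption(mut self, props: DecryptionProps) -> Self {
        self.decryption = Some(props);
        self
    }
}

impl MetadataSource for InMemoryReader {
    fn get_metadata(&mut self) -> Result<String, String> {
        // Real decoding is omitted; the sketch only reports which path ran.
        match &self.decryption {
            Some(props) => Ok(format!(
                "{} bytes, decrypting with a {}-byte key",
                self.bytes.len(),
                props.footer_key.len()
            )),
            None => Ok(format!("{} bytes, plaintext", self.bytes.len())),
        }
    }
}
```

Existing implementors of such a trait would not need to change, since only readers that opt in gain the extra field.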
pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
pub fn decode_metadata(
buf: &[u8],
encrypted_footer: bool,
Technically speaking, `encrypted_footer` here is a new parameter, and thus this would be a breaking API change.
As I mentioned above, since the decoder / decoding is becoming more stateful, I think it is probably time to wrap the decoding logic into a more encapsulated structure, which would likely also reduce the number of distinct APIs / #cfgs needed
Which issue does this PR close?
This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.
Rationale for this change
See #3511.
What changes are included in this PR?
This introduces AesGcmV1 cipher decryption to `ArrowReaderMetadata` and `ParquetRecordBatchReader`. Introduced classes and functions are tested on sample files from `parquet-dataset`.
Are there any user-facing changes?
Several new classes and method parameters are introduced. If the project is compiled without the `encryption` flag, the changes are not breaking. If the `encryption` flag is on, some methods and constructors (e.g. `ParquetMetaData::new`) will require new parameters, which would be a breaking change.