Parquet Modular decryption support #6637

Open
wants to merge 70 commits into
base: main

rok
Member

@rok rok commented Oct 28, 2024

Which issue does this PR close?

This PR is based on branch and an internal patch and aims to provide basic modular decryption support. Partially closes #3511. We decided to split encryption work into a separate PR.

Rationale for this change

See #3511.

What changes are included in this PR?

This introduces AesGcmV1 cipher decryption to ArrowReaderMetadata and ParquetRecordBatchReader. The new classes and functions are tested on sample files from parquet-dataset.

Are there any user-facing changes?

Several new classes and method parameters are introduced. If the project is compiled without the encryption flag, the changes are not breaking. If the encryption flag is on, some methods and constructors (e.g. ParquetMetaData::new) will require new parameters, which would be a breaking change.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 28, 2024
@rok
Member Author

rok commented Oct 28, 2024

Currently this is a rough rebase of work done by @ggershinsky. As ParquetMetaDataReader is now available some refactoring will be required.

@etseidl
Contributor

etseidl commented Oct 28, 2024

> As ParquetMetaDataReader is now available some refactoring will be required.

@rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@brainslush

Is there any help, input or contribution needed here?

@rok
Member Author

rok commented Nov 21, 2024

Thanks for the offer @etseidl & @brainslush! I'm making some progress and would definitely appreciate a review! I'll ping once I push.

@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from fe488b3 to d263510 Compare November 23, 2024 23:06
@rok
Member Author

rok commented Dec 4, 2024

> > As ParquetMetaDataReader is now available some refactoring will be required.
>
> @rok let me know if you want any help shoehorning this into ParquetMetaDataReader.

@etseidl could you please do a quick pass and say whether this makes sense with respect to ParquetMetaDataReader?
I'll continue with data decryption.

Contributor

@etseidl etseidl left a comment

Only looking at the metadata bits for now...looks good to me so far. Just a few minor nits. Thanks @rok!

@rok rok force-pushed the decryption-basics-fork branch from f90d8b4 to 29d55eb Compare December 16, 2024 23:51
@adamreeve adamreeve force-pushed the decryption-basics-fork branch from 27e77ad to 7db06cc Compare December 20, 2024 02:15
@rok rok force-pushed the decryption-basics-fork branch from a4105d5 to 3e7646d Compare January 9, 2025 21:59
@rok rok force-pushed the decryption-basics-fork branch 4 times, most recently from deedba9 to 951f2fa Compare January 21, 2025 20:35
@rok rok changed the title Parquet Modular Encryption support Parquet Modular decryption support Jan 21, 2025
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from f6b9e88 to 23375d1 Compare January 23, 2025 18:17
@adamreeve adamreeve force-pushed the decryption-basics-fork branch 3 times, most recently from 7f94e39 to 177d826 Compare January 24, 2025 02:46
@rok rok force-pushed the decryption-basics-fork branch 2 times, most recently from ac4ac21 to 3241425 Compare February 5, 2025 10:01
@rok rok force-pushed the decryption-basics-fork branch from 7872490 to d3df0ab Compare February 17, 2025 17:42
@etseidl
Contributor

etseidl commented Feb 18, 2025

It seems like this is ready. @alamb and @tustvold do you agree that this does not constitute a breaking change?

@tustvold
Contributor

tustvold commented Feb 18, 2025

Using a feature flag to gate signature changes should mean this isn't a breaking change, but generally I would consider it a bit of a code smell. Generally feature flags should be additive, and not change method signatures.

I've not looked closely at this PR, and tbh am unlikely to have time in the near future, but I would imagine there are ways to avoid this.

@rok
Member Author

rok commented Feb 18, 2025

@tustvold we could use more with_file_decryption_properties-like methods, I suppose. Changing signatures does indeed seem like a bad idea.
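
A minimal sketch of what an additive flag could look like, assuming a hypothetical `MetadataReaderOptions` type and a stand-in `FileDecryptionProperties` (neither is the crate's actual definition). The `encryption` feature only adds a builder method and a field; every other signature is identical in both builds.

```rust
/// Hypothetical stand-in for the PR's `FileDecryptionProperties`.
#[derive(Debug, Clone, PartialEq)]
pub struct FileDecryptionProperties {
    pub footer_key: Vec<u8>,
}

/// Hypothetical options type, used here only for illustration.
#[derive(Debug, Default)]
pub struct MetadataReaderOptions {
    prefetch_hint: Option<usize>,
    // This field only exists when the feature is enabled.
    #[cfg(feature = "encryption")]
    file_decryption_properties: Option<FileDecryptionProperties>,
}

impl MetadataReaderOptions {
    /// Same signature with or without the feature.
    pub fn new() -> Self {
        Self::default()
    }

    /// Same signature with or without the feature.
    pub fn with_prefetch_hint(mut self, hint: usize) -> Self {
        self.prefetch_hint = Some(hint);
        self
    }

    /// Only compiled with `encryption`; enabling the feature adds this
    /// method but does not change any existing one.
    #[cfg(feature = "encryption")]
    pub fn with_file_decryption_properties(mut self, props: FileDecryptionProperties) -> Self {
        self.file_decryption_properties = Some(props);
        self
    }

    pub fn prefetch_hint(&self) -> Option<usize> {
        self.prefetch_hint
    }

    /// Reports whether decryption properties were supplied.
    #[cfg(feature = "encryption")]
    pub fn is_encryption_configured(&self) -> bool {
        self.file_decryption_properties.is_some()
    }

    /// Without the feature there is nothing to configure.
    #[cfg(not(feature = "encryption"))]
    pub fn is_encryption_configured(&self) -> bool {
        false
    }
}
```

Callers that never touch encryption compile unchanged in both configurations, e.g. `MetadataReaderOptions::new().with_prefetch_hint(64)`, while callers that need decryption opt in by chaining the extra method.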

Contributor

@alamb alamb left a comment

Thank you @rok @etseidl and @tustvold -- it is pretty amazing to see this feature being added after so many years

I am somewhat concerned about the API -- I realize the current structure of the code doesn't lend itself well to adding new features easily, so what this PR does is understandable

However, I think that if we really want to roll this out in a way that is easy to use and won't require non-trivial API rework in the future, we should encapsulate the footer decoding logic more.

I can try to help with this in a few weeks' time, but realistically I won't be able to for several weeks.

Let me know if this doesn't make sense

@@ -125,6 +128,8 @@ sysinfo = ["dep:sysinfo"]
 crc = ["dep:crc32fast"]
 # Enable SIMD UTF-8 validation
 simdutf8 = ["dep:simdutf8"]
+# Enable Parquet modular encryption support
Contributor

Can we please also document this new flag here
https://github.com/apache/arrow-rs/tree/main/parquet#feature-flags

Maybe we should update the feature support matrix as well

@@ -127,13 +128,26 @@ impl<F: MetadataFetch> MetadataLoader<F> {
         let (metadata, remainder) = if length > suffix_len - FOOTER_SIZE {
             let metadata_start = file_size - length - FOOTER_SIZE;
             let meta = fetch.fetch(metadata_start..file_size - FOOTER_SIZE).await?;
-            (ParquetMetaDataReader::decode_metadata(&meta)?, None)
+            (
+                ParquetMetaDataReader::decode_metadata(
Contributor

I agree with @tustvold and @etseidl here -- using #[cfg(...)] to change function signatures seems uncommon (I haven't run across it before in Rust) and I don't think there are any precedents in this crate either.

I think a more standard API would be a structure whose fields control the behavior. Given that more state / options are now needed to decode the footer, we could put those details into the struct

let decoder = ParquetMetaDataDecoder::new(&footer)
  .read_meta(fetch)
  ...

Or something 🤔
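
To make the suggestion more concrete, here is a rough sketch, not the PR's implementation. `ParquetMetaDataDecoder`, `FooterTail`, `FooterError` and the stand-in `FileDecryptionProperties` are hypothetical names. It relies only on the Parquet file tail layout (a 4-byte little-endian metadata length followed by the magic `PAR1`, or `PARE` when the footer is encrypted), so whether the footer is encrypted is discovered by the decoder rather than passed in as an extra argument.

```rust
/// Parquet file tail: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;

#[derive(Debug, PartialEq)]
pub enum FooterError {
    TooShort,
    BadMagic([u8; 4]),
    MissingDecryptionProperties,
}

/// Hypothetical stand-in for the PR's `FileDecryptionProperties`.
#[derive(Debug, Clone)]
pub struct FileDecryptionProperties {
    pub footer_key: Vec<u8>,
}

/// What the decoder learns from the last 8 bytes of the file.
#[derive(Debug, PartialEq)]
pub struct FooterTail {
    pub metadata_len: usize,
    pub encrypted_footer: bool,
}

/// Hypothetical decoder that owns all state needed to read the footer.
#[derive(Debug, Default)]
pub struct ParquetMetaDataDecoder {
    decryption_properties: Option<FileDecryptionProperties>,
}

impl ParquetMetaDataDecoder {
    pub fn new() -> Self {
        Self::default()
    }

    /// Optional state is supplied through builder methods, not parameters.
    pub fn with_file_decryption_properties(mut self, props: FileDecryptionProperties) -> Self {
        self.decryption_properties = Some(props);
        self
    }

    /// Parses the file tail and detects an encrypted footer from the magic.
    pub fn decode_footer_tail(&self, buf: &[u8]) -> Result<FooterTail, FooterError> {
        if buf.len() < FOOTER_SIZE {
            return Err(FooterError::TooShort);
        }
        let tail = &buf[buf.len() - FOOTER_SIZE..];
        let magic = [tail[4], tail[5], tail[6], tail[7]];
        let encrypted_footer = match &magic {
            b"PAR1" => false,
            b"PARE" => true,
            _ => return Err(FooterError::BadMagic(magic)),
        };
        // An encrypted footer cannot be read without keys.
        if encrypted_footer && self.decryption_properties.is_none() {
            return Err(FooterError::MissingDecryptionProperties);
        }
        let metadata_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as usize;
        Ok(FooterTail { metadata_len, encrypted_footer })
    }
}
```

With this shape, adding another option later means adding another `with_*` method; the `decode_*` signatures stay as they are.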

fn get_metadata<'a>(
&'a mut self,
#[cfg(feature = "encryption")] file_decryption_properties: Option<
&'a FileDecryptionProperties,
Contributor

It would make more sense to me if FileDecryptionProperties was a field on the reader.

-    pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
+    pub fn decode_metadata(
+        buf: &[u8],
+        encrypted_footer: bool,
Contributor

Technically speaking, encrypted_footer here is a new parameter, and thus this would be a breaking API change.

As I mentioned above, since the decoder / decoding is becoming more stateful, I think it is probably time to wrap the decoding logic into a more encapsulated structure, which would likely also reduce the number of distinct APIs / #[cfg]s needed

Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet Modular Encryption support
8 participants