Remove redundant locations when constructing access policies #2149


Open · wants to merge 17 commits into base: main
Conversation

@eric-maynard (Contributor) commented Jul 21, 2025

Iceberg tables can technically store data across any number of paths, but Polaris currently uses 3 different locations for credential vending:

  1. The table's base location
  2. The table's write.data.path, if set
  3. The table's write.metadata.path, if set

This was intended to capture scenarios where, e.g., (2) is not a child path of (1), so that the vended credentials remain valid for reading the entire table. However, some systems appear to always set (2) and (3), for example:

  1. s3:/my-bucket/base/iceberg
  2. s3:/my-bucket/base/iceberg/data
  3. s3:/my-bucket/base/iceberg/metadata

In such cases the extra paths (e.g. extra resources in the AWS policy) are redundant. In one such case, these redundant paths caused the policy to exceed the maximum allowed 2048 characters.

This PR removes redundant paths -- those that are the child of another path -- from the list of accessible locations tracked for a given table and does some slight refactoring to consolidate the logic for extracting these paths from a TableMetadata.
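The pruning described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes locations are already normalized strings where a child path starts with its parent (e.g. all paths end with '/'), and uses plain string prefix checks instead of Polaris's StorageLocation type.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RedundantLocations {

  /**
   * Removes locations that are children of another location in the set.
   * Sketch only; assumes normalized, '/'-terminated location strings.
   */
  static Set<String> removeRedundantLocations(Set<String> locations) {
    Set<String> result = new HashSet<>();
    for (String candidate : locations) {
      boolean redundant = false;
      for (String other : locations) {
        // candidate is redundant if some *other* location is a prefix of it
        if (!other.equals(candidate) && candidate.startsWith(other)) {
          redundant = true;
          break;
        }
      }
      if (!redundant) {
        result.add(candidate);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> locations = new HashSet<>(Arrays.asList(
        "s3:/my-bucket/base/iceberg/",
        "s3:/my-bucket/base/iceberg/data/",
        "s3:/my-bucket/base/iceberg/metadata/"));
    // Only the base location survives; data/ and metadata/ are children of it.
    System.out.println(removeRedundantLocations(locations));
  }
}
```

With the example paths from the description, only the base location (1) remains, which is exactly the set of resources the AWS policy needs.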


/** Removes "redundant" locations, so {/a/b/, /a/b/c, /a/b/d} will be reduced to just {/a/b/} */
private static @Nonnull Set<String> removeRedundantLocations(Set<String> locationStrings) {
  HashSet<String> result = new HashSet<>(locationStrings);
Member:
Since this is a new collection, it can be produced by

locationStrings.stream()
  .filter(Objects::nonNull)
  .map(StorageLocation::of)
  .collect(Collectors.toCollection(HashSet::new));

@eric-maynard (Author):

That would remove duplicate locations, but not redundant locations like we want to. We'd still need to loop over the collection.

Member:

Yes, but you'd save the exponential instantiation of StorageLocation objects.

@snazy (Member), Aug 1, 2025:

Actually you could save the inner loop with a sorted collection, if the locations end with a /.
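The sorted-collection idea can be sketched like this. A hedged illustration, not proposed code for the PR: under the stated assumption that every location ends with '/', a child sorts immediately inside its parent's lexicographic range, so one pass over a TreeSet only needs to compare each location against the last one that was kept.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

public class SortedLocationPruning {

  /**
   * One-pass pruning over a lexicographically sorted set.
   * Assumes every location ends with '/', so all descendants of a kept
   * location form a contiguous run right after it in sorted order.
   */
  static Set<String> removeRedundantLocations(Set<String> locations) {
    Set<String> result = new LinkedHashSet<>();
    String lastKept = null;
    for (String loc : new TreeSet<>(locations)) {
      // loc is redundant iff it is a child of the most recently kept location
      if (lastKept == null || !loc.startsWith(lastKept)) {
        result.add(loc);
        lastKept = loc;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(removeRedundantLocations(new LinkedHashSet<>(Arrays.asList(
        "s3:/my-bucket/base/iceberg/data/",
        "s3:/my-bucket/base/iceberg/",
        "s3:/my-bucket/base/iceberg/metadata/",
        "s3:/other-bucket/tables/"))));
  }
}
```

The trailing '/' requirement matters: without it, "/a/bc" would falsely match as a child of "/a/b" under a plain prefix check.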

@@ -2612,16 +2572,6 @@ protected FileIO loadFileIO(String ioImpl, Map<String, String> properties) {
callContext, ioImpl, properties, identifier, locations, storageActions, resolvedPath);
}

private void blockedUserSpecifiedWriteLocation(Map<String, String> properties) {
Member:

loadFileIO is also unused and should be removed.

@eric-maynard (Author):

Agreed that it can be removed, but unlike blockedUserSpecifiedWriteLocation it's not related to this PR, so I think we should do that separately.

@eric-maynard eric-maynard requested a review from snazy July 30, 2025 12:02