-
Notifications
You must be signed in to change notification settings - Fork 338
Add additional storage config for PolarisCredentialProperty #701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open Question: The PolarisMetaStoreManager can set any string as ADDITIONAL_STORAGE_CONFIG, and FileIO will handle it. However, in most cases, this string is expected to be a JSON object. Should we consider deserializing ADDITIONAL_STORAGE_CONFIG into a string map and merging it with other properties within the convertToMapOfString method?
4447b90 to
02c79e2
Compare
dimas-b
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@XJDKC : Thanks for your contribution. It is great to see Polaris being used and customized :)
| ADDITIONAL_STORAGE_CONFIG( | ||
| String.class, | ||
| "additional-storage-config", | ||
| "the additional storage config for the cloud storage"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a fairly generic property... why does it need to be part of "credential" properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the only place where we can dynamically override some storage config used by the FileIO. The key idea is that the getScopedCredentials response should be capable of carrying additional storage config settings that can be interpreted later in a context-specific way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully support the idea of allowing custom (configurable) additional properties to be sent to Iceberg clients for their FileIO configuration. However, it might be preferable to refactor Polaris code to avoid mixing "credentials" and general properties. IIRC, the "load table" and "config" responses in the Iceberg REST API are not limited to just "credential" properties. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed @dimas-b -- it makes sense for the catalog to be able to inject additional properties, but I think these may not necessarily be credentials and that it's better to avoid baking that assumption into the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to expose these additional properties to the iceberg client, this is for BasePolarisCatalog to access the storage (doCommit, doRefresh, registerTable). The PolarisMetaStoreManager we implements will provide some additional storage config for custom FileIO to access the storage.
BasePolarisCatalog.java#L1607-L1631
But thanks for pointing this out, it seems that these properties will be also passed to the client. Maybe we should filter this property when we return the creds in the loadTable response.
For combining the additional properties with creds, the getSubscopedCredentials api actually currently represents a request for a dynamic connection-configuration, which happens to mostly be credentials right now. We should treat it as an api that we can use to get storage config so the Polaris's FileIO can use these info to access the storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I misunderstood the intent of this PR.
I'm not sure about exposing general FileIO properties for users to configure the internal Polaris access path to table storage. Polaris may not always use Iceberg FileIO for its internal access, but if general properties are exposed, refactoring that area will be much harder.
@XJDKC : Would you be able to share some more details about your intended use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context
getSubscopedCreds returns an EnumMap<PolarisCredentialProperty, String>, it serves two purposes:
- For Polaris: Supplies credentials to initialize FileIO, enabling storage access for Polaris's operations such as
doCommit,doRefresh, andregisterTable - For REST Client: Vends credentials to the REST Client, enabling it to access storage.
Our Intention
The PolarisMetaStoreManager we implement may need to dynamically select AWS S3 or Azure storage endpoints at runtime based on routing requirements (e.g., proxy, PrivateLink). It interprets these details contextually. This requires updating the routing logic to ensure Polaris can access the storage by sending requests to the correct Hostnames/IPs, so we implement our own FileIO.
To achieve this, we need to pass additional information along with credentials to Polaris (From PolarisMetaStoreManager to FileIO). We consider getSubscopedCreds a suitable mechanism for providing these details, as it effectively informs Polaris how to access storage. But right now, this api happens to mostly be credentials.
Problem
However, as you mentioned, including additional props in getSubscopedCreds also exposes them to the REST client, which uses the same API for a different purpose. Since the client (query engine) operates in the user’s environment, these info is unnecessary and should not be shared with the REST client due to security and relevance concerns.
Proposed Solution
To resolve this, we can filter out the additional storage configuration from the loadTable response, ensuring that it is only utilized internally by Polaris and not sent to the REST client.
cc: @dimas-b: Hope this answers your question!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, @XJDKC !
The use case makes total sense from my POV. However, I'm still not sure about overloading credential properties with this information.
Do you think a larger refactoring could be beneficial to allow customizing storage access inside Polaris, without affecting credentials specifically? Perhaps with such a refactoring having to excluding the new information from REST client responses will not be necessary... WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#724 certainly looks preferable from my POV. Let's continue the discussion there.
...ore/src/test/java/org/apache/polaris/core/storage/cache/StorageCredentialCacheEntryTest.java
Outdated
Show resolved
Hide resolved
|
Closed this PR as we have another proposal: #724 |
Context
For customized
PolarisMetaStoreManager, sometimes we need to add additional storage config along with theScopedCredentialsResult. This is for BasePolarisCatalog to access the storage (doCommit,doRefresh,registerTable).These additional properties will be used / handled by Polaris's
FileIOto implement some custom logic.This PR introduces a new property called
ADDITIONAL_STORAGE_CONFIGto store these properties.