Skip to content

Conversation

@XJDKC
Copy link
Member

@XJDKC XJDKC commented Jan 10, 2025

Context

For customized PolarisMetaStoreManager, sometimes we need to add additional storage config along with the ScopedCredentialsResult. This is for BasePolarisCatalog to access the storage (doCommit, doRefresh, registerTable).
These additional properties will be used / handled by Polaris's FileIO to implement some custom logic.

This PR introduces a new property called ADDITIONAL_STORAGE_CONFIG to store these properties.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open Question: The PolarisMetaStoreManager can set any string as ADDITIONAL_STORAGE_CONFIG, and FileIO will handle it. However, in most cases, this string is expected to be a JSON object. Should we consider deserializing ADDITIONAL_STORAGE_CONFIG into a string map and merging it with other properties within the convertToMapOfString method?

@XJDKC XJDKC force-pushed the xjdkc/additional-storage-conf branch from 4447b90 to 02c79e2 Compare January 10, 2025 23:48
Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XJDKC : Thanks for your contribution. It is great to see Polaris being used and customized :)

ADDITIONAL_STORAGE_CONFIG(
String.class,
"additional-storage-config",
"the additional storage config for the cloud storage"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a fairly generic property... why does it need to be part of "credential" properties?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the only place where we can dynamically override some storage config used by the FileIO. The key idea is that the getScopedCredentials response should be capable of carrying additional storage config settings that can be interpreted later in a context-specific way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully support the idea of allowing custom (configurable) additional properties to be sent to Iceberg clients for their FileIO configuration. However, it might be preferable to refactor Polaris code to avoid mixing "credentials" and general properties. IIRC, the "load table" and "config" responses in the Iceberg REST API are not limited to just "credential" properties. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed @dimas-b -- it makes sense for the catalog to be able to inject additional properties, but I think these may not necessarily be credentials and that it's better to avoid baking that assumption into the code.

Copy link
Member Author

@XJDKC XJDKC Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to expose these additional properties to the iceberg client, this is for BasePolarisCatalog to access the storage (doCommit, doRefresh, registerTable). The PolarisMetaStoreManager we implements will provide some additional storage config for custom FileIO to access the storage.
BasePolarisCatalog.java#L1607-L1631

But thanks for pointing this out, it seems that these properties will be also passed to the client. Maybe we should filter this property when we return the creds in the loadTable response.

For combining the additional properties with creds, the getSubscopedCredentials api actually currently represents a request for a dynamic connection-configuration, which happens to mostly be credentials right now. We should treat it as an api that we can use to get storage config so the Polaris's FileIO can use these info to access the storage.

Copy link
Contributor

@dimas-b dimas-b Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I misunderstood the intent of this PR.

I'm not sure about exposing general FileIO properties for users to configure the internal Polaris access path to table storage. Polaris may not always use Iceberg FileIO for its internal access, but if general properties are exposed, refactoring that area will be much harder.

@XJDKC : Would you be able to share some more details about your intended use case?

Copy link
Member Author

@XJDKC XJDKC Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context
getSubscopedCreds returns an EnumMap<PolarisCredentialProperty, String>, it serves two purposes:

  1. For Polaris: Supplies credentials to initialize FileIO, enabling storage access for Polaris's operations such as doCommit, doRefresh, and registerTable
  2. For REST Client: Vends credentials to the REST Client, enabling it to access storage.

Our Intention
The PolarisMetaStoreManager we implement may need to dynamically select AWS S3 or Azure storage endpoints at runtime based on routing requirements (e.g., proxy, PrivateLink). It interprets these details contextually. This requires updating the routing logic to ensure Polaris can access the storage by sending requests to the correct Hostnames/IPs, so we implement our own FileIO.

To achieve this, we need to pass additional information along with credentials to Polaris (From PolarisMetaStoreManager to FileIO). We consider getSubscopedCreds a suitable mechanism for providing these details, as it effectively informs Polaris how to access storage. But right now, this api happens to mostly be credentials.

Problem
However, as you mentioned, including additional props in getSubscopedCreds also exposes them to the REST client, which uses the same API for a different purpose. Since the client (query engine) operates in the user’s environment, these info is unnecessary and should not be shared with the REST client due to security and relevance concerns.

Proposed Solution
To resolve this, we can filter out the additional storage configuration from the loadTable response, ensuring that it is only utilized internally by Polaris and not sent to the REST client.

cc: @dimas-b: Hope this answers your question!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, @XJDKC !

The use case makes total sense from my POV. However, I'm still not sure about overloading credential properties with this information.

Do you think a larger refactoring could be beneficial to allow customizing storage access inside Polaris, without affecting credentials specifically? Perhaps with such a refactoring having to excluding the new information from REST client responses will not be necessary... WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @dimas-b

We come up with another solution to achieve that, details can be found in this PR: #724
Can you pls take a look? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#724 certainly looks preferable from my POV. Let's continue the discussion there.

@XJDKC
Copy link
Member Author

XJDKC commented Jan 29, 2025

Closed this PR as we have another proposal: #724

@XJDKC XJDKC closed this Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants