Amazon S3 Tables Integration #577

Open
flyrain opened this issue Dec 17, 2024 · 8 comments
Labels
enhancement New feature or request


flyrain commented Dec 17, 2024

Polaris is designed to act as a REST facade for S3 tables, enabling both read and write operations by interacting with the S3 Tables API. Polaris registers an S3 table using its metadata location. A flag will be needed to mark the new Iceberg table as an S3 table. Below is a summary of the proposed approach:

Read Path

  • The LoadTable endpoint in Polaris will call the S3 Tables API get_table_metadata_location to locate the table's current metadata.json file and fetch it.
  • Polaris will then serialize the content of the metadata.json into a LoadTableResponse to return to the client.
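The read path above can be sketched roughly as follows. This is a minimal illustration, not Polaris code: `load_table`, `read_text`, and `StubS3Tables` are hypothetical names, the stub stands in for a real S3 Tables client, and the keyword arguments and response keys (`tableBucketARN`, `metadataLocation`, `versionToken`) are assumed to match the boto3 `s3tables` client. The response keys follow the `metadata-location` and `metadata` fields of the Iceberg REST load-table result.

```python
import json
from typing import Any, Callable, Dict


def load_table(s3tables: Any, read_text: Callable[[str], str],
               bucket_arn: str, namespace: str, name: str) -> Dict[str, Any]:
    """Resolve the current metadata.json and wrap it as a LoadTable-style response."""
    # Step 1: ask S3 Tables where the current metadata file lives.
    pointer = s3tables.get_table_metadata_location(
        tableBucketARN=bucket_arn, namespace=namespace, name=name)
    location = pointer["metadataLocation"]
    # Step 2: read and parse the metadata.json stored at that location.
    metadata = json.loads(read_text(location))
    # Step 3: return it under the field names used by the Iceberg REST response.
    return {"metadata-location": location, "metadata": metadata}


class StubS3Tables:
    """In-memory stand-in for an S3 Tables client (illustration only)."""

    def get_table_metadata_location(self, tableBucketARN: str,
                                    namespace: str, name: str) -> Dict[str, str]:
        return {"metadataLocation": "s3://demo-bucket/metadata/00001.metadata.json",
                "versionToken": "v1"}


# A dict plays the role of object storage in this sketch.
files = {"s3://demo-bucket/metadata/00001.metadata.json":
         json.dumps({"format-version": 2, "table-uuid": "demo-uuid"})}

response = load_table(StubS3Tables(), files.__getitem__,
                      "arn:aws:s3tables:us-east-1:111122223333:bucket/demo",
                      "ns", "events")
print(response["metadata-location"])
```

Passing the client and the file reader in as arguments keeps the sketch runnable without AWS credentials; a real implementation would presumably use the service client and an Iceberg FileIO instead.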

Write Path

When the table is updated, Polaris will:

  1. Gather the changes and generate a new metadata.json.
  2. Use the S3 table API update-table-metadata-location to commit the new metadata.
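A rough sketch of these two steps, under stated assumptions: `commit_table`, `write_text`, `CommitConflict`, and `StubS3Tables` are hypothetical, and the stub imitates what the `update_table_metadata_location` call is assumed to do in the boto3 `s3tables` client (accept a `versionToken` and reject the update if it is stale).

```python
import json
from typing import Any, Callable, Dict

DEMO_ARN = "arn:aws:s3tables:us-east-1:111122223333:bucket/demo"


class CommitConflict(Exception):
    """Hypothetical error: the table changed since the base metadata was read."""


def commit_table(s3tables: Any, write_text: Callable[[str, str], None],
                 bucket_arn: str, namespace: str, name: str,
                 base_version_token: str, new_location: str,
                 new_metadata: Dict[str, Any]) -> str:
    """Write the new metadata.json, then swap the table pointer to it."""
    # Step 1: persist the new metadata file before touching the pointer.
    write_text(new_location, json.dumps(new_metadata))
    # Step 2: commit by updating the metadata location. The version token
    # lets the service reject the swap if another writer committed first.
    result = s3tables.update_table_metadata_location(
        tableBucketARN=bucket_arn, namespace=namespace, name=name,
        versionToken=base_version_token, metadataLocation=new_location)
    return result["versionToken"]


class StubS3Tables:
    """In-memory stand-in for an S3 Tables client (illustration only)."""

    def __init__(self) -> None:
        self.version = 1
        self.location = "s3://demo-bucket/metadata/00001.metadata.json"

    def update_table_metadata_location(self, tableBucketARN: str, namespace: str,
                                       name: str, versionToken: str,
                                       metadataLocation: str) -> Dict[str, str]:
        if versionToken != f"v{self.version}":
            raise CommitConflict("stale version token")
        self.version += 1
        self.location = metadataLocation
        return {"versionToken": f"v{self.version}",
                "metadataLocation": metadataLocation}


files: Dict[str, str] = {}
stub = StubS3Tables()
new_token = commit_table(stub, files.__setitem__, DEMO_ARN, "ns", "events",
                         "v1", "s3://demo-bucket/metadata/00002.metadata.json",
                         {"format-version": 2})
print(new_token)
```

Because the file is written before the pointer is swapped, a rejected commit leaves the new metadata.json behind as an orphan while the table itself still points at the previous version.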

AuthZ/AuthN

We need to ensure that the AWS role used for creating the Polaris catalog has read and write privileges on the S3 table.
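As an illustration of what such a role policy might contain, here is a small sketch that builds an IAM policy document. The action names are my understanding of the S3 Tables IAM actions behind the read and write paths and should be checked against the AWS documentation; the helper name, the `Sid`, and the demo resource ARN are hypothetical.

```python
import json
from typing import Any, Dict

# Assumed S3 Tables IAM actions; verify against the AWS documentation.
READ_ACTIONS = ["s3tables:GetTableMetadataLocation", "s3tables:GetTableData"]
WRITE_ACTIONS = ["s3tables:UpdateTableMetadataLocation", "s3tables:PutTableData"]


def table_access_policy(table_resource_arn: str) -> Dict[str, Any]:
    """Build an IAM policy document granting read and write access to tables."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PolarisS3TableReadWrite",  # hypothetical statement id
            "Effect": "Allow",
            "Action": READ_ACTIONS + WRITE_ACTIONS,
            "Resource": table_resource_arn,
        }],
    }


# Illustrative resource ARN covering every table in a demo table bucket.
policy = table_access_policy(
    "arn:aws:s3tables:us-east-1:111122223333:bucket/demo/table/*")
print(json.dumps(policy, indent=2))
```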

Describe alternatives you've considered

No response

@flyrain flyrain added the enhancement New feature or request label Dec 17, 2024

puchengy commented Dec 18, 2024

@flyrain Thank you for starting the issue. I have two questions:

  1. Is it right that, with this integration, users will continue to have the option to own the metastore service (which holds the source-of-truth data)?
  2. Should it be a concern if "update-table-metadata-location" fails and leaves the table metadata out of sync, which could mislead S3 Tables into cleaning up the missing metadata (and its data files) as orphan files?

BTW, the link for get_table_metadata_location seems wrong. It should point to the S3 Tables API docs.


flyrain commented Dec 18, 2024

Is it right that, with this integration, users will continue to have the option to own the metastore service (which holds the source-of-truth data)?

That's right. This integration won't change the source of truth (the S3 table in this case), and other tools or pipelines that work against the source catalog should still work as is.

Should it be a concern if "update-table-metadata-location" fails and leaves the table metadata out of sync, which could mislead S3 Tables into cleaning up the missing metadata (and its data files) as orphan files?

The Polaris client will get the failure message in that case; it can then retry or simply fail. The table is still consistent. The failure may leave orphan files, which is fine because other clients also leave orphan files when they fail.
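The retry-or-fail behavior described here could look something like the sketch below on the client side. Everything in it is hypothetical (`commit_with_retry`, `CommitFailed`, `FlakyTable`); the point is only that each attempt reloads the current table state before rebuilding the update, so a failed attempt never leaves the table pointer half-updated.

```python
from typing import Any, Callable, Dict, Tuple


class CommitFailed(Exception):
    """Hypothetical error raised when a commit attempt is rejected."""


def commit_with_retry(load: Callable[[], Tuple[int, Dict[str, Any]]],
                      build_update: Callable[[Dict[str, Any]], Dict[str, Any]],
                      commit: Callable[[int, Dict[str, Any]], int],
                      max_attempts: int = 3) -> int:
    """Reload, rebuild, and retry the commit until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        base_token, metadata = load()           # always start from current state
        new_metadata = build_update(metadata)   # reapply the change on top of it
        try:
            return commit(base_token, new_metadata)
        except CommitFailed:
            if attempt == max_attempts:
                raise                           # give up: the client fails itself


class FlakyTable:
    """In-memory table whose first commit attempt fails (illustration only)."""

    def __init__(self) -> None:
        self.token = 1
        self.metadata: Dict[str, Any] = {"snapshots": 0}
        self.failures_left = 1

    def load(self) -> Tuple[int, Dict[str, Any]]:
        return self.token, dict(self.metadata)

    def commit(self, base_token: int, new_metadata: Dict[str, Any]) -> int:
        if self.failures_left > 0:
            self.failures_left -= 1
            raise CommitFailed("simulated failure")
        if base_token != self.token:
            raise CommitFailed("stale base")
        self.token += 1
        self.metadata = new_metadata
        return self.token


table = FlakyTable()
final_token = commit_with_retry(
    table.load, lambda m: {**m, "snapshots": m["snapshots"] + 1}, table.commit)
print(final_token, table.metadata)
```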

Thanks for pointing out the wrong link. Updated.

@puchengy

@flyrain

The Polaris client will get the failure message in that case; it can then retry or simply fail. The table is still consistent.

Do you mean that "update-table-metadata-location" will be part of the commit to the source catalog (e.g. HMS)?


flyrain commented Dec 18, 2024

Yes, you could consider Polaris a proxy when the source of truth is a remote catalog. There are two commit paths, depending on the client type:

# REST clients commit path
REST client -> Polaris -> Remote catalog (e.g. HMS, S3 table)

# Other clients commit path
Non-REST clients -> Remote catalog (e.g. HMS, S3 table)
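A toy model of the two paths, assuming the remote catalog enforces a compare-and-swap on commit. `RemoteCatalog`, `PolarisProxy`, and `StaleCommit` are hypothetical names used only for illustration; the sketch shows that both paths end at the same commit check, so a stale commit is rejected whichever path it takes.

```python
class StaleCommit(Exception):
    """Hypothetical error: the commit was based on an outdated table version."""


class RemoteCatalog:
    """Source of truth (e.g. HMS or an S3 table), modeled as a versioned pointer."""

    def __init__(self) -> None:
        self.version = 0
        self.location = "metadata/v0.json"

    def commit(self, base_version: int, new_location: str) -> int:
        # Compare-and-swap: only accept commits built on the current version.
        if base_version != self.version:
            raise StaleCommit(f"expected {self.version}, got {base_version}")
        self.version += 1
        self.location = new_location
        return self.version


class PolarisProxy:
    """REST-facing proxy that forwards commits to the remote catalog."""

    def __init__(self, remote: RemoteCatalog) -> None:
        self.remote = remote

    def commit(self, base_version: int, new_location: str) -> int:
        return self.remote.commit(base_version, new_location)


remote = RemoteCatalog()
polaris = PolarisProxy(remote)

v1 = polaris.commit(0, "metadata/v1.json")  # REST client -> Polaris -> remote
v2 = remote.commit(v1, "metadata/v2.json")  # non-REST client -> remote
print(v1, v2, remote.location)
```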


puchengy commented Dec 18, 2024

@flyrain Got it, so in this proposal, HMS and S3 Tables are mutually exclusive?

Sorry for being unclear. The reason I asked the original question is to see whether it is possible to operate HMS and S3 Tables at the same time, using HMS as the source of truth (because we want to keep our source-of-truth data in HMS) while leveraging some of S3 Tables' additional features.


flyrain commented Dec 18, 2024

Yes, they are mutually exclusive. I think it's better to have only one source of truth. In the case of S3 tables, the S3 service is the source of truth for sure. An HMS integration would look like:

HMS clients -> HMS -> S3 table

HMS would have to invoke the S3 API update-table-metadata-location to commit; otherwise there would be consistency issues.

It's another topic anyway. We may discuss it elsewhere.

@flyrain flyrain added this to the 1.1.0 milestone Dec 19, 2024
@soumilshah1995

I wanted to kindly check whether there are any updates on when these items are expected to be released.

@jackye1995

Thank you for bringing this up, @flyrain! In general, I agree with the implementation details you described for the read and write paths.

I think the challenge here is less about implementation and more about what federation looks like for Polaris. We probably want to reach consensus on the federation proposal first, before proceeding further:

https://docs.google.com/document/d/1Q6eEytxb0btpOPcL8RtkULskOlYUCo_3FLvFRnHkzBY/edit?pli=1&tab=t.0#heading=h.dr4s0lqru4mo

This matters because there could be the option to either mount an entire table bucket directly or mount individual S3 tables.
