Allow AWS S3 Access Points #1826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
crabba opened this issue Dec 7, 2020 · 9 comments
@crabba
crabba commented Dec 7, 2020

New feature: Allow AWS S3 Access Points

AWS S3 Access Points are unique hostnames attached to an S3 bucket, each with its own dedicated access policy. This allows large-scale access control to be delegated to multiple APs, each dedicated to providing access to one user, rather than combining all access control in one large bucket policy. Larger-scale users are increasingly using APs to simplify bucket access control.

Usage scenario

Allow S3 APs to be used as a parameter, for example as input files: "--reads", "s3://arn:aws:s3:<region>:<account-id>:accesspoint/<ap-name>/my-fastq-data/*_{1,2}.fastq.gz"

Suggested implementation

The AWS S3 CLI, SDKs, and REST API all support Access Points. Currently, using an S3 AP ARN in place of a bucket name results in the error 'The specified bucket is not valid', so it seems the ARN is being used as a literal bucket name. Enabling this feature would involve recognising the AP ARN and using it appropriately in CLI or API calls.
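As a rough sketch of that recognition step (in Python for illustration; Nextflow's S3 client is Java), the function name and regex below are hypothetical. Only the ARN format arn:aws:s3:<region>:<account-id>:accesspoint/<name> is taken from the AWS documentation:

```python
import re

# Matches an Access Point ARN embedded in an s3:// URL, assuming the
# documented format arn:aws:s3:<region>:<account-id>:accesspoint/<name>,
# optionally followed by an object key.
AP_ARN_RE = re.compile(
    r"^arn:aws:s3:(?P<region>[a-z0-9-]+):(?P<account>\d{12})"
    r":accesspoint[/:](?P<name>[a-zA-Z0-9-]+)(?P<key>/.*)?$"
)

def parse_access_point_url(url: str):
    """Return (arn, key) if url is an s3:// URL backed by an Access Point
    ARN, else None (i.e. it is a plain bucket URL or not S3 at all)."""
    if not url.startswith("s3://"):
        return None
    m = AP_ARN_RE.match(url[len("s3://"):])
    if not m:
        return None
    arn = f"arn:aws:s3:{m['region']}:{m['account']}:accesspoint/{m['name']}"
    key = (m["key"] or "/").lstrip("/")
    return arn, key
```

A client could then pass the reconstructed ARN where SDK calls accept an Access Point ARN in place of a bucket name, and use the remainder as the object key, instead of treating the whole ARN as a literal bucket name.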

@pditommaso
Member

I'm not sure the underlying library implementing the AWS S3 support is able to handle this. Have you tried using the access point in the S3 URL instead of the bucket name?

@crabba
Author

crabba commented Dec 9, 2020

Using the access point in the S3 URL submitted to Nextflow results in an 'InvalidBucketName' error from the S3 service. However, the S3 CLI supports S3 Access Points as a direct replacement for bucket names; the following are equivalent:

$ aws s3 ls s3://mybucket/mydata/
$ aws s3 ls s3://arn:aws:s3:us-east-1:012345678901:accesspoint/my-ap/mydata

In the log file it looks like the S3 AP ARN does not match an S3 pattern, so it is not being treated as an S3 path:

[PathVisitor-5] DEBUG nextflow.file.PathVisitor - files for syntax: glob; folder: /arn:aws:s3:us-east-1:012345678901:accesspoint/testnextflow-ap-00/data/small/; pattern: 00*_{1,2}.fastq.gz; options: [:]
[PathVisitor-5] DEBUG nextflow.file.FileHelper - Path matcher not defined by 'S3FileSystem' file system -- using default default strategy
[PathVisitor-5] ERROR nextflow.Channel - The specified bucket is not valid. (Service: Amazon S3; Status Code: 400; Error Code: InvalidBucketName; Request ID: xxx; S3 Extended Request ID: xxx)
com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket is not valid. (Service: Amazon S3; Status Code: 400; Error Code: InvalidBucketName; Request ID: xxx; S3 Extended Request ID: xxx)

Log file: nextflow.log.df00f000-9180-4fae-8ead-30649b935077.1.zip

@pditommaso
Member

pditommaso commented Dec 9, 2020

I agree that this feature would be useful. We need some AWS guru to get their hands into the S3 client used by NF, or we need to wait for the official AWS implementation supporting NIO.

In any case, a pull request supporting this feature is welcome.

@crabba
Author

crabba commented Dec 9, 2020

If this feature were available in s3fs-nio, would it be possible for you to use that library?

@pditommaso
Member

We may switch to it when it's stable.

@stale

stale bot commented May 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 9, 2021
@pditommaso
Member

@crabba any idea if this allows bypassing S3 rate limits? I mean https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling

@stale stale bot removed the stale label May 11, 2021
@crabba
Author

crabba commented May 11, 2021

I don't think that the use of S3 Access Points would change any rate limits on the underlying bucket, as the access point is just a different way of implementing S3 access policies.

@markjschreiber

All S3 Access Points have an access point alias, which is not an ARN and is (usually) compatible with Java URI paths (unlike an ARN). This is probably what should be used here, as it is in frameworks like Spark and with other NIO libraries. https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-alias.html

The general pattern would be s3://my-ap-alias-abcdef/path/to/object

I have done some testing of this with https://github.com/awslabs/aws-java-nio-spi-for-s3 and it seems to work, so it will probably work with Nextflow's implementation of the s3-nio library.
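To make the alias route concrete, here is a minimal sketch (Python for illustration; both helper names are hypothetical) of how a framework could let alias-style names flow through bucket-name code paths. It assumes the reserved '-s3alias' suffix described in the AWS docs linked above:

```python
def looks_like_access_point_alias(bucket: str) -> bool:
    """Heuristic: Access Point aliases are bucket-name-compatible strings
    ending in the reserved '-s3alias' suffix (per AWS docs), so they can
    pass through code paths that expect a plain bucket name."""
    return 3 <= len(bucket) <= 63 and bucket.endswith("-s3alias")

def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket-or-alias, key). Because an alias
    contains no ':' or '/', no special ARN handling is needed."""
    bucket, _, key = url[len("s3://"):].partition("/")
    return bucket, key
```

The point of the alias approach is that `split_s3_url("s3://my-ap-alias-abcdef/path/to/object")` needs no ARN-specific parsing at all, which is why it tends to work unchanged with URI-based NIO libraries.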
