Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serializing s3-compatible endpoint url within path strings #334

Open
Koncopd opened this issue Jan 30, 2025 · 3 comments
Open

Serializing s3-compatible endpoint url within path strings #334

Koncopd opened this issue Jan 30, 2025 · 3 comments
Labels
enhancement 🚀 New feature or request

Comments

@Koncopd
Copy link
Contributor

Koncopd commented Jan 30, 2025

I want to discuss this.
We manage a lot of different paths and need also to support paths in s3-compatible endpoint urls along with normal s3 paths. I understand that an endpoint url can be passed easily via UPath("s3://bucket/key", endpoint_url=endpoint_url).
However it really feels that the endpoint url is a part of the path itself, and it would be very useful to have something like UPath("s3://bucket?endpoint_url/key") or UPath("s3://bucket/key?endpoint_url"), also storing such path strings are easier, and differentiating them from normal s3 paths. What do you think?

@ap--
Copy link
Collaborator

ap-- commented Jan 31, 2025

Hi @Koncopd,

Thanks for opening the issue ❤

It's possible to do what you want by subclassing S3FileSystem and S3Path and registering both in your code. I'll provide a code example later today.

I also understand that sometimes it's just more convenient to have a single string representation for a path. It makes passing around everything just a bit easier.

In general though, I would discourage it. If you want to be as flexible as possible for supporting filesystem_spec filesystems, you need to accept that a path is clearly defined via three items:

the filesystem

  1. protocol (indirectly defines the filesystem_spec AbstractFileSystem subclass)
  2. storage_options (defines how the filesystem is instantiated, and potentially contains chained filesystems)

and the path on the filesystem

  1. path (i.e. some string that points to a resource)

So it's better and more flexible if your application is designed in a way to store the three pieces of information of a unique path. The fact that for some filesystems, some of the storage_options can be serialized into the path, is convenient, but it's easy to construct cases where specific storage_options either can't easily be serialized to a string, or where a let's say query-parameter serialization would introduce ambiguity because the options collide.

If we could collect a few examples, that require a better way of serializing protocol storage_options and path then we could potentially work out a better solution.

Some time ago, when I had a similar issue, I experimented with serializing to json and just storing the json object in my application code: https://github.com/Bayer-Group/pado/blob/635d7b8b57e527254d6302730100a6dab5a2095f/pado/io/files.py#L351-L386 but that is not universal either, because there are (edge) cases where your only option is to pickle the storage_options...

P.S.: there are storage_args too, but so far I think I haven't seen an example where you couldn't map them to keyword arguments.

@ap-- ap-- added the enhancement 🚀 New feature or request label Jan 31, 2025
@ap--
Copy link
Collaborator

ap-- commented Feb 2, 2025

Here's an example how this could be implemented. Serializing the endpoint_url as a query_parameter didn't need subclassing S3FileSystem. That might be needed if you want to add the endpoint_url to the bucket somehow.

For final use this would need a bunch of more tests of course, since there could be a few strange edge cases dependent on the different path operations.

And as I said, I would recommend just designing the application to be able to store storage_options with the url.

from urllib.parse import parse_qs
from urllib.parse import urlencode
from urllib.parse import urlsplit


from upath.implementations.cloud import S3Path
from upath.registry import register_implementation



class S3EndpointPath(S3Path):
    
    @classmethod
    def _transform_init_args(cls, args, protocol, storage_options):
        args, protocol, storage_options = super(S3EndpointPath, cls)._transform_init_args(args, protocol, storage_options)
         
        sr = urlsplit(str(args[0]))
        query_params = parse_qs(sr.query)
        if ep := query_params.pop("endpoint_url", []):
            if len(ep) == 1:
                if "endpoint_url" in storage_options and ep[0] != storage_options["endpoint_url"]:
                    raise ValueError("incompatible endpoint_urls")
                storage_options.setdefault("endpoint_url", ep[0])
            elif len(ep) > 1:
                raise ValueError("multiple endpoint_url query parameter")
        args0 = sr._replace(query=urlencode(query_params, doseq=True)).geturl()

        return (args0, *args[1:]), protocol, storage_options

    def __str__(self):
        urlpath = super().__str__()
        if endpoint_url := self._storage_options.get("endpoint_url"):
            qs = urlencode({"endpoint_url": endpoint_url})
            urlpath = f"{urlpath}?{qs}"
        return urlpath


register_implementation("s3", S3EndpointPath, clobber=True)


if __name__ == "__main__":
    from upath import UPath


    p0 = UPath("s3://bucket/abc")
    assert isinstance(p0, S3EndpointPath)
    assert str(p0) == "s3://bucket/abc"
    assert p0.storage_options == {}
    assert p0.parts == ("bucket/", "abc")

    p1 = UPath("s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com")
    assert str(p1) == "s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com"
    assert p1.storage_options == {"endpoint_url": "https://abc.com"}
    assert p1.parts == ("bucket/", "abc")

    p2 = p1.joinpath("efg")
    assert isinstance(p2, S3EndpointPath)
    assert str(p2) == "s3://bucket/abc/efg?endpoint_url=https%3A%2F%2Fabc.com", str(p2)
    assert p2.storage_options == {"endpoint_url": "https://abc.com"}
    assert p2.parts == ("bucket/", "abc", "efg")

@ap-- ap-- changed the title [feature request] A way to store s3-compatible endpoint url within path strings Serializing s3-compatible endpoint url within path strings Feb 2, 2025
@Koncopd
Copy link
Contributor Author

Koncopd commented Feb 3, 2025

Hi, @ap-- , thank you very much for your answer. Yes, i agree with your points about generality.
And Thank you also for the subclass example, very useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement 🚀 New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants