-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Serializing s3-compatible endpoint url within path strings #334
Comments
Hi @Koncopd, Thanks for opening the issue ❤ It's possible to do what you want by subclassing I also understand that sometimes it's just more convenient to have a single string representation for a path. It makes passing around everything just a bit easier. In general though, I would discourage it. If you want to be as flexible as possible for supporting filesystem_spec filesystems, you need to accept that a path is clearly defined via three items: the filesystem
and the path on the filesystem
So it's better and more flexible if your application is designed in a way to store the three pieces of information of a unique path. The fact that for some filesystems, some of the storage_options can be serialized into the path, is convenient, but it's easy to construct cases where specific storage_options either can't easily be serialized to a string, or where a let's say query-parameter serialization would introduce ambiguity because the options collide. If we could collect a few examples, that require a better way of serializing protocol storage_options and path then we could potentially work out a better solution. Some time ago, when I had a similar issue, I experimented with serializing to json and just storing the json object in my application code: https://github.com/Bayer-Group/pado/blob/635d7b8b57e527254d6302730100a6dab5a2095f/pado/io/files.py#L351-L386 but that is not universal either, because there are (edge) cases where your only option is to pickle the storage_options... P.S.: there are |
Here's an example how this could be implemented. Serializing the endpoint_url as a query_parameter didn't need subclassing S3FileSystem. That might be needed if you want to add the endpoint_url to the bucket somehow. For final use this would need a bunch of more tests of course, since there could be a few strange edge cases dependent on the different path operations. And as I said, I would recommend just designing the application to be able to store storage_options with the url. from urllib.parse import parse_qs
from urllib.parse import urlencode
from urllib.parse import urlsplit
from upath.implementations.cloud import S3Path
from upath.registry import register_implementation
class S3EndpointPath(S3Path):
@classmethod
def _transform_init_args(cls, args, protocol, storage_options):
args, protocol, storage_options = super(S3EndpointPath, cls)._transform_init_args(args, protocol, storage_options)
sr = urlsplit(str(args[0]))
query_params = parse_qs(sr.query)
if ep := query_params.pop("endpoint_url", []):
if len(ep) == 1:
if "endpoint_url" in storage_options and ep[0] != storage_options["endpoint_url"]:
raise ValueError("incompatible endpoint_urls")
storage_options.setdefault("endpoint_url", ep[0])
elif len(ep) > 1:
raise ValueError("multiple endpoint_url query parameter")
args0 = sr._replace(query=urlencode(query_params, doseq=True)).geturl()
return (args0, *args[1:]), protocol, storage_options
def __str__(self):
urlpath = super().__str__()
if endpoint_url := self._storage_options.get("endpoint_url"):
qs = urlencode({"endpoint_url": endpoint_url})
urlpath = f"{urlpath}?{qs}"
return urlpath
register_implementation("s3", S3EndpointPath, clobber=True)
if __name__ == "__main__":
from upath import UPath
p0 = UPath("s3://bucket/abc")
assert isinstance(p0, S3EndpointPath)
assert str(p0) == "s3://bucket/abc"
assert p0.storage_options == {}
assert p0.parts == ("bucket/", "abc")
p1 = UPath("s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com")
assert str(p1) == "s3://bucket/abc?endpoint_url=https%3A%2F%2Fabc.com"
assert p1.storage_options == {"endpoint_url": "https://abc.com"}
assert p1.parts == ("bucket/", "abc")
p2 = p1.joinpath("efg")
assert isinstance(p2, S3EndpointPath)
assert str(p2) == "s3://bucket/abc/efg?endpoint_url=https%3A%2F%2Fabc.com", str(p2)
assert p2.storage_options == {"endpoint_url": "https://abc.com"}
assert p2.parts == ("bucket/", "abc", "efg") |
Hi, @ap-- , thank you very much for your answer. Yes, i agree with your points about generality. |
I want to discuss this.
We manage a lot of different paths and need also to support paths in s3-compatible endpoint urls along with normal s3 paths. I understand that an endpoint url can be passed easily via
UPath("s3://bucket/key", endpoint_url=endpoint_url)
.However it really feels that the endpoint url is a part of the path itself, and it would be very useful to have something like
UPath("s3://bucket?endpoint_url/key")
orUPath("s3://bucket/key?endpoint_url")
, also storing such path strings are easier, and differentiating them from normal s3 paths. What do you think?The text was updated successfully, but these errors were encountered: