[Spike] Investigate why get_protocol_and_path does not correctly parse paths with # in them #3196
Comments
Thanks for reporting this @jasonmhite. I confirm:
In case it helps, I did some digging on the history of that function: it was implemented in https://github.com/quantumblacklabs/private-kedro/pull/545 (internal link) and the given rationale is described in https://jira.quantumblack.com/browse/KED-1544 (also internal):
Digging more into it, it looks like the issue might be the use of |
Also,
Man, I had forgotten how much of a pain arbitrary URLs/paths can be; this seems to be a pretty tough problem to solve in general. I took a stab at kludging the Kedro code to special-case ? and # but haven't gotten it right.
Ooof, good catch... I wonder if we should take this conversation upstream @jasonmhite, since they have much better context on the potential corner cases. Do you want to open an issue in https://github.com/fsspec/filesystem_spec/issues yourself? Otherwise I'll do it, no fuss.
@astrojuanlu It might be better for you to do it, you're probably more familiar with fsspec than I am. I really have a hard time navigating their docs :'(.
I've been investigating this a bit more. Yes, HTTP(S) seems to be getting a special treatment, and for non-HTTP(S) paths Those changes were introduced in fsspec/filesystem_spec#64 without much context. About the special characters, from https://tools.ietf.org/html/rfc1738 (URLs) (idea from fsspec/filesystem_spec#171 (comment)):
And from https://tools.ietf.org/html/rfc3986 (URIs):
So, it's my understanding that filenames containing
@jasonmhite Can you test if percent-encoding reserved characters in remote paths in the Kedro catalog works?
@astrojuanlu So for the case where I found this, I don't actually have the problematic filepath in my catalog entry; it's in a subdirectory that gets loaded up by the Dataset class. I point my class to a top-level directory in the catalog and it builds up a (customized) PartitionedDataset from the subdirectory names. That said, I can map the problematic directory directly to the appropriate Dataset class. But it doesn't work: S3 doesn't properly interpret the %-encoded path. It just treats it as literal text and then raises FileNotFoundError, since that file doesn't exist.
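For anyone following along, here is a minimal sketch of what the percent-encoding experiment amounts to. It uses only the standard library's urllib.parse; the key name is the dummy one from the issue description, and nothing here calls Kedro, fsspec, or S3. It only shows that the encoded form survives URL splitting, and that nothing decodes it again unless some layer does so explicitly, which would be consistent with the FileNotFoundError reported above.

```python
from urllib.parse import quote, unquote, urlsplit

# Dummy object key taken from the issue description.
key = "some/dummy#filename"

# quote() keeps '/' by default and percent-encodes '#'.
encoded = quote(key)
print(encoded)

# Splitting the encoded URL no longer sees a fragment delimiter...
parts = urlsplit("s3://" + encoded)
path = parts.netloc + parts.path
print(path, repr(parts.fragment))

# ...but the path still contains the literal text '%23'. A backend that
# does not decode it will look for a key with that literal name.
print(path == key)           # encoded path differs from the real key
print(unquote(path) == key)  # an explicit decode step recovers it
```

Whether decoding belongs in Kedro, in fsspec, or nowhere at all is exactly the open question in this thread; the sketch only illustrates the mismatch.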
Thanks for testing it @jasonmhite. There's a chance then that certain filenames are unreachable on S3 through fsspec. I'll give this one more round of investigation with a local MinIO server.
@astrojuanlu Just to be clear, this issue is in Kedro's parsing of paths for |
get_protocol_and_path does not correctly parse paths with # in them
We will first investigate why this happens.
Source of the issueIn the In the implementation of the Line 877 in 396a1f5
Suggested fix
See my comment on #4409. The problem @ElenaKhaustova describes as the root cause is correct, but I think the proposed fix makes some invalid assumptions.
Description
Dataset implementations rely on get_protocol_and_path to split the protocol and the path. However, paths with a # in them (which is valid) are truncated. # is a valid character on most filesystems/platforms as far as I know. Best I can tell, Kedro will not load any Datasets from such a path.
Context
The actual problem is in kedro.io.core._parse_filepath, which is what does the parsing. I am going to refer to that in steps to reproduce. It looks like the regexes are probably incomplete, but it's hard to unravel. There may be other valid characters that are missed, too, but I happened to have a file with a # in it.
Steps to Reproduce
Expected Result
{'protocol': 's3', 'path': 'some/dummy#filename'}
Actual Result
{'protocol': 's3', 'path': 'some/dummy'}
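A standalone sketch of how this kind of truncation can arise, assuming the parser splits the string with the standard library's urllib.parse.urlsplit (an assumption for illustration; this does not import or reproduce Kedro's _parse_filepath). Both helper functions below are hypothetical. Under generic URI rules, everything after # is a fragment and everything after ? is a query, so neither ends up in the path component.

```python
from urllib.parse import urlsplit

URL = "s3://some/dummy#filename"

# Generic URI splitting: '#' starts the fragment, for any scheme.
parts = urlsplit(URL)
print(parts.scheme, parts.netloc, parts.path, parts.fragment)


def naive_parse(filepath: str) -> dict:
    """Hypothetical helper: keeps only netloc + path, dropping query and fragment."""
    p = urlsplit(filepath)
    return {"protocol": p.scheme, "path": p.netloc + p.path}


def fragment_aware_parse(filepath: str) -> dict:
    """Hypothetical helper: re-attaches query and fragment to the path."""
    p = urlsplit(filepath)
    path = p.netloc + p.path
    if p.query:
        path += "?" + p.query
    if p.fragment:
        path += "#" + p.fragment
    return {"protocol": p.scheme, "path": path}


print(naive_parse(URL))           # matches the "Actual Result" above
print(fragment_aware_parse(URL))  # matches the "Expected Result" above
```

Re-attaching the query and fragment is only one possible direction, not a vetted fix. It is probably wrong for HTTP(S) URLs, where ? and # carry real meaning, and it cannot recover a trailing # with nothing after it.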
Your Environment
MacOS and Linux, 0.18.8 but I think the code in question is the same on current git main. Python 3.10.11.