Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spike] Investigate why get_protocol_and_path does not correctly parse paths with # in them #3196

Closed
jasonmhite opened this issue Oct 18, 2023 · 12 comments
Assignees
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed

Comments

@jasonmhite
Copy link
Contributor

Description

Dataset implementations rely on get_protocol_and_path to split the protocol and the path. However, paths with a # in them (which is valid) are truncated. # is a valid character on most filesystems/platforms as far as I know. Best I can tell Kedro will not load any Datasets from such a path.

Context

The actual problem is in kedro.io.core._parse_filepath, which is what does the parsing. I am going to refer to that in steps to reproduce. It looks like the regex's are probably incomplete, but it's hard to unravel. There may be other valid characters that are missed, too, but I happened to have a file with a # in it.

Steps to Reproduce

from kedro.io.core import _parse_filepath

print(_parse_filepath("s3://some/dummy#filename"))

Expected Result

{'protocol': 's3', 'path': 'some/dummy#filename'}

Actual Result

{'protocol': 's3', 'path': 'some/dummy'}

Your Environment

MacOS and Linux, 0.18.8 but I think the code in question is the same on current git main. Python 3.10.11.

@jasonmhite jasonmhite changed the title get_protocol_and_path does not correclty parse paths with # in them get_protocol_and_path does not correctly parse paths with # in them Oct 18, 2023
@astrojuanlu
Copy link
Member

Thanks for reporting this @jasonmhite. I confirm:

In [1]: from kedro.io.core import get_protocol_and_path

In [2]: get_protocol_and_path("s3://some/dummy#filename")
Out[2]: ('s3', 'some/dummy')

fsspec seems to do the right thing in this particular case:

In [1]: from fsspec.utils import infer_storage_options

In [2]: infer_storage_options("s3://some/dummy#filename")
Out[2]: 
{'protocol': 's3',
 'path': 'some/dummy#filename',
 'host': 'some',
 'url_fragment': 'filename'}

In case it helps, I did some digging on the history of that function: it was implemented in https://github.com/quantumblacklabs/private-kedro/pull/545 (internal link) and the given rationale is described in https://jira.quantumblack.com/browse/KED-1544 (also internal):

We need to split filepath without using infer_storage_options() from fsspec since it returns different results for different protocols that doesn't satisfy our needs (we are already making exceptions for HTTP protocol, we should avoid piling on exceptions e.g. for Azure).

@astrojuanlu astrojuanlu added the Issue: Bug Report 🐞 Bug that needs to be fixed label Oct 18, 2023
@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Oct 18, 2023
@jasonmhite
Copy link
Contributor Author

jasonmhite commented Oct 18, 2023

Digging more into it, it looks like the issue might be the use of urlsplit and it handling # as if it was a URL. Based on that I noticed that paths with ? in them have the same problem.

@jasonmhite
Copy link
Contributor Author

jasonmhite commented Oct 19, 2023

Also, infer_storage_options from fsspec does the right thing for # but not ?.

from fsspec.utils import infer_storage_options

infer_storage_options("s3://blah/some?path")

>>> {'protocol': 's3', 'path': 'blah/some', 'host': 'blah', 'url_query': 'path'}

The host there is wrong, too (at least for s3) but I don't think that matters.

Man, I had forgotten how much of a pain arbitrary URLs/paths can be, this seems to be a pretty tough problem to solve in general. I took a stab at kludging the Kedro code to special case ? and # but haven't gotten it right.

@astrojuanlu
Copy link
Member

Ooof, good catch... I wonder if we should take this conversation upstream @jasonmhite, they have a much better context of the potential corner cases. Do you want to open an issue in https://github.com/fsspec/filesystem_spec/issues yourself? Otherwise I'll do it, no fuss

@jasonmhite
Copy link
Contributor Author

@astrojuanlu It might be better for you to do it, you're probably more familiar with fsspec than I am. I really have a hard time navigating their docs :'(.

# and ? are only special for HTTP URLs that I can think of, right? HTTP already gets some special treatment, what about just handle HTTP as a special case there and then for other storage paths split on :// (after inserting file:// if it's omitted for a local path)? Or something along those lines, at least.

@astrojuanlu
Copy link
Member

I've been investigating this a bit more. Yes, HTTP(S) seems to be getting a special treatment, and for non-HTTP(S) paths ? and # are treated differently:

https://github.com/fsspec/filesystem_spec/blob/096e9a186ffdb39b26b6afdf38c610d092c85f95/fsspec/utils.py#L87-L89

Those changes were introduced in fsspec/filesystem_spec#64 without much context.

About the special characters, from https://tools.ietf.org/html/rfc1738 (URLs) (idea from fsspec/filesystem_spec#171 (comment)):

All unsafe characters must always be encoded within a URL. For example, the character "#" must be encoded within URLs

And from https://tools.ietf.org/html/rfc3986 (URIs):

If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

So, it's my understanding that filenames containing # or ? should have those characters percent-encoded.

@jasonmhite Can you test if percent-encoding reserved characters in remote paths in the Kedro catalog works?

@jasonmhite
Copy link
Contributor Author

jasonmhite commented Oct 31, 2023

@astrojuanlu So for the case I found this, I don't actually have the the problematic filepath in my catalog entry, it's in a subdirectory that gets loaded up by the Dataset class. I point my class to a top level directory in the catalog and it builds up a (customized) PartitionedDataset from the subdirectory names.

That said I can directly map the problematic directory directly to the appropriate Dataset class. But it doesn't work, S3 doesn't properly interpret the % encoded path. It just treats it as literal text and then FileNotFoundError since that file doesn't exist.

@astrojuanlu
Copy link
Member

Thanks for testing it @jasonmhite. There's a chance then that certain filenames are unreachable on S3 through fsspec, I'll give this one more round of investigation with a local MinIO server.

@jasonmhite
Copy link
Contributor Author

jasonmhite commented Nov 3, 2023

@astrojuanlu Just to be clear, this issue is in Kedro's parsing of paths for get_protocol_and_path and the use of urlsplit and happens before fsspec ever gets involved. It's also not limited to S3, it will parse local filesystem paths with # or ? in them (which are perfectly valid) incorrectly as well.

@merelcht merelcht moved this to To Do in Kedro Framework Aug 6, 2024
@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Nov 1, 2024
@merelcht merelcht moved this to To Do in Kedro Framework Nov 26, 2024
@astrojuanlu astrojuanlu changed the title get_protocol_and_path does not correctly parse paths with # in them [Spike] Investigate why get_protocol_and_path does not correctly parse paths with # in them Dec 2, 2024
@astrojuanlu
Copy link
Member

We will first investigate why this happens.

@ElenaKhaustova ElenaKhaustova self-assigned this Jan 8, 2025
@ElenaKhaustova ElenaKhaustova moved this from To Do to In Progress in Kedro Framework Jan 8, 2025
@ElenaKhaustova
Copy link
Contributor

Source of the issue

In the _parse_filepath method we use urllib.parse.urlsplit return - SplitResult to split protocol from path for URLs. urlsplit parses a URL into 5 components: <scheme>://<netloc>/<path>?<query>#<fragment>.

In the implementation of the _parse_filepath we only used scheme, netloc and path components to retrieve the resulting protocol and path. So query and fragment were always omitted.

def _parse_filepath(filepath: str) -> dict[str, str]:

Suggested fix

#4409

@jasonmhite
Copy link
Contributor Author

See my comment on #4409 , the problem @ElenaKhaustova is describing as the root cause is correct, but I think the proposed fix makes some invalid assumptions.

@ElenaKhaustova ElenaKhaustova moved this from In Progress to In Review in Kedro Framework Jan 10, 2025
@merelcht merelcht moved this from In Review to To Do in Kedro Framework Jan 13, 2025
@merelcht merelcht moved this from To Do to In Review in Kedro Framework Jan 13, 2025
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Bug Report 🐞 Bug that needs to be fixed
Projects
Status: Done
Development

No branches or pull requests

4 participants