HDFS URI support does not support network location #162

@vvaten

It seems that the "hdfs://host:port/path/file.txt" URL form is not currently supported by smart_open. As a result, smart_open can only access the local HDFS filesystem through the hdfs:// protocol. The URI is parsed correctly by urlsplit, but the network location is always ignored.

The hdfs dfs command supports host:port in the URI out of the box, so adding this would be very easy. However, a URI such as "hdfs://tmp/test.txt" would then no longer work, because 'tmp' would be interpreted as the network location, which is what it really is if you read the URI literally according to the specification.
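To make the ambiguity concrete, here is a small sketch using only the standard library's urlsplit. The helper name `split_hdfs_uri` and the port 8020 are illustrative assumptions, not part of smart_open.

```python
from urllib.parse import urlsplit


def split_hdfs_uri(uri):
    """Return (netloc, path) for a URI (hypothetical helper, for illustration)."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


# Port 8020 is only an example value.
for uri in (
    "hdfs://host:8020/path/file.txt",  # explicit host and port
    "hdfs:///path/file.txt",           # empty network location
    "hdfs://tmp/test.txt",             # 'tmp' lands in the network location
):
    netloc, path = split_hdfs_uri(uri)
    print(f"{uri}: netloc={netloc!r}, path={path!r}")
```

Read literally, the last URI names a host called 'tmp' and the path '/test.txt', which is why honoring the network location would change how that form is interpreted.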

I would expect the following results:

  • ParseUri('hdfs://host:port/path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "hdfs://host:port/path/file.txt"]
  • ParseUri('hdfs:///path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "/path/file.txt"]
  • ParseUri('hdfs://host/path/file.txt', 'wb') ==> ["hdfs", "dfs", "-text", "hdfs://host/path/file.txt"]
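A minimal sketch of the mapping I have in mind, assuming the command is built as a list for a subprocess call. The function name `hdfs_text_command` is hypothetical and is not smart_open's actual API; the port 8020 is only an example value.

```python
from urllib.parse import urlsplit


def hdfs_text_command(uri):
    """Build an 'hdfs dfs -text' argument list for an hdfs:// URI.

    Hypothetical helper: if the URI carries a network location, the full
    URI is passed through to the hdfs CLI; otherwise only the path is used.
    """
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an hdfs URI: {uri!r}")
    # Keep the whole URI when a host (and optional port) is present.
    target = uri if parts.netloc else parts.path
    return ["hdfs", "dfs", "-text", target]


print(hdfs_text_command("hdfs://host:8020/path/file.txt"))
print(hdfs_text_command("hdfs:///path/file.txt"))
print(hdfs_text_command("hdfs://host/path/file.txt"))
```

With this approach, "hdfs://tmp/test.txt" would be passed to the CLI unchanged, with 'tmp' treated as the host.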

Let me know what you think.
