Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support loading from URLs in Python #1566

Open
hyanwong opened this issue Jul 10, 2021 · 28 comments
Open

Support loading from URLs in Python #1566

hyanwong opened this issue Jul 10, 2021 · 28 comments
Labels
enhancement New feature or request future Issues that are closed as they are not planned in the medium-term, but which are still desirable. Python API Issue is about the Python API

Comments

@hyanwong
Copy link
Member

For teaching purposes, and possibly for other reasons too, I think it would be really useful to be able to load a tree sequence from a URL. IMO the nicest way to make this easy for a user would be to have a url argument to tskit.load:

tskit.load(url="https://tskit.dev/tutorials/data/basics.trees")

I would hope that this would mainly be used for small tree sequences, although I suppose it could be an easy way for someone to load up larger ones e.g. from zenodo - depends how long that would take I guess.

I suspect implementing this using the urllib.request library would be quite easy, although I don't know how we would unit test it - probably mock the urllib.request.urlopen function when testing, or something.

@hyanwong
Copy link
Member Author

NB: this would mean that content in e.g. jupyterbooks could simply be pasted into the user's jupyter session and (as long as the right libraries were installed), the examples should all run. It's less error-prone and fiddly than trying to run the jupyterbook e.g. in binder (see tskit-dev/tutorials#114)

@jeromekelleher jeromekelleher added enhancement New feature or request future Issues that are closed as they are not planned in the medium-term, but which are still desirable. labels Jul 10, 2021
@jeromekelleher
Copy link
Member

The right way to do this I think would be to make the string form of the file input support URLs, which we intercept in Python. BUT this would be quite a lot of work to do well (it's simple to do badly) so I'm not super keen on it right now.

@jeromekelleher jeromekelleher added the Python API Issue is about the Python API label Jul 10, 2021
@jeromekelleher jeromekelleher changed the title Add url param to tskit.load Support loading from URLs in Python Jul 10, 2021
@hyanwong
Copy link
Member Author

hyanwong commented Jul 10, 2021

Just to say that I have wanted to do this for a while, for other purposes too (not just the tutorials), so I am keen to get something like this done. But maybe others could weigh in if they have needed to do this in the past (as it could just be me).

Out of interest, why would it be a lot of work to do well? I would have thought the standard URL libraries would do most of the hard lifting?

@hyanwong
Copy link
Member Author

Presumably the workaround for the moment is simply e.g.

import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)

Is there any reason that we shouldn't be doing this as a workaround?

@hyanwong
Copy link
Member Author

hyanwong commented Jul 10, 2021

I've been thinking about this, and actually I do think that the form load(url="http://...") is worth considering, for two reasons. Firstly the silly one, that you can still load a file starting with "http" in this case. Secondly, perhaps a more sensible one, that it makes it much easier to document and code without complications, as you can simply pass the url parameter to urllib.request.urlopen. In fact, I think that makes it about a 4 line addition to tskit, and to unit test it you can simply try it with a file:/// url, which should be pretty simple. Then we simply say "anything passed as a url parameter is passed on to urllib.request to open" (or something like that. I think this would tick the box of "doing it well" without having to check on the format of the string, which, as @jeromekelleher points out, is a bit icky.

@jeromekelleher
Copy link
Member

It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.

@jeromekelleher
Copy link
Member

import urllib.request
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)

This seems like a perfectly reasonable thing to do to me (assuming it works).

@hyanwong
Copy link
Member Author

It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now.

Right, that sounds reasonable. I thought that we could simply say that we are handing all this stuff over to urllib.request.urlopen and therefore if it doesn't work (e.g. problems with proxies etc) it's not something we are responsible for. But maybe that's a cop out.

@benjeffery
Copy link
Member

I'd be much more motivated to do this if tskit had any sense of being able to use part of a file, as zarr for example does. tskit needs the whole file to do anything, so you have to download it all anyway. The tutorials should also be using the most common way of loading a .trees - from disk. Adding url arguments at that point adds confusion.

Pointing out that urlopen can be used in the load docs seems a good idea though.

@hyanwong
Copy link
Member Author

hyanwong commented Jul 19, 2021

Annoyingly I find that urlopen doesn't actually work for http(s) urls:

import urllib.request
file = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
tskit.load(file)

Gives FileFormatError: File not in KAS format, presumably because, although file.fileno() is valid here, which I think is the requirement tested in KAS, it's not actually stored on disk quite like a normal file, and the part of the tskit C library that loads from a file doesn't know what to make of it. So yes, as Jerome says, this is more complex than I had hoped, and can be kicked down the line, I guess. Shame.

Note that I presume this means that there's no way to load in tree sequence from a stream of data: we require a file on disk, not e.g. some data spooled to us by a DB or equivalent.

@benjeffery
Copy link
Member

@hyanwong
Copy link
Member Author

Oh, useful, thanks. Maybe it's because I'm on OS X. Weird that OS X doesn't do file descriptors properly though.

@hyanwong
Copy link
Member Author

Although actually I still get the _tskit.FileFormatError: File not in KAS format on holly, so I guess I'm doing something wrong here.

@hyanwong
Copy link
Member Author

Digging a little more. I would have thought that if I pass a fileno to tskit.load it should "just work", regardless of where the fileno comes from, but tskit seems to distinguish a fileno from a local file with one from urllib:

f = open("tmp.trees", "rb")
ts = tskit.load(f.fileno())  # works

f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
ts = tskit.load(f.fileno())  # fails

The failure is:

---------------------------------------------------------------------------
FileFormatError                           Traceback (most recent call last)
/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
   3480             ts = _tskit.TreeSequence()
-> 3481             ts.load(file)
   3482             return TreeSequence(ts)

FileFormatError: File not in KAS format

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-131-02f25af848d6> in <module>
      3 
      4 f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
----> 5 ts = tskit.load(f.fileno())  # fails

/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(file)
   2829     :rtype: :class:`tskit.TreeSequence`
   2830     """
-> 2831     return TreeSequence.load(file)
   2832 
   2833 

/usr/local/lib/python3.9/site-packages/tskit/trees.py in load(cls, file_or_path)
   3483         except exceptions.FileFormatError as e:
   3484             # TODO Fix this for new file semantics
-> 3485             formats.raise_hdf5_format_error(file_or_path, e)
   3486         finally:
   3487             if local_file:

/usr/local/lib/python3.9/site-packages/tskit/formats.py in raise_hdf5_format_error(filename, original_exception)
    271     h5py = get_h5py()
    272     try:
--> 273         with h5py.File(filename, "r") as root:
    274             version = tuple(root.attrs["format_version"])
    275             raise exceptions.VersionTooOldError(

/usr/local/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, **kwds)
    410                 name = repr(name).encode('ASCII', 'replace')
    411             else:
--> 412                 name = filename_encode(name)
    413 
    414             if track_order is None:

/usr/local/lib/python3.9/site-packages/h5py/_hl/compat.py in filename_encode(filename)
     17     filenames in h5py for more information.
     18     """
---> 19     filename = fspath(filename)
     20     if sys.platform == "win32":
     21         if isinstance(filename, str):

TypeError: expected str, bytes or os.PathLike object, not int

@benjeffery
Copy link
Member

benjeffery commented Jul 20, 2021

The HTTPResponse doesn't support seek, so won't work. Not all fileno's are equal.

EDIT: This was very wrong.

@hyanwong
Copy link
Member Author

Ah, that explains it. Didn't know we used seek when reading in, thanks.

@grahamgower
Copy link
Member

grahamgower commented Jul 20, 2021

It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.

@hyanwong
Copy link
Member Author

It looks like that file is a hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore.

I'm pretty sure it's a kastore file:

>>> f = urllib.request.urlopen("https://raw.github.com/tskit-dev/tutorials/main/data/basics.trees")
>>> f.read()[0:10]
b'\x89KAS\r\n\x1a\n\x01\x00'

(same as you get when read()ing from an msprime-generated tree sequence.

@grahamgower
Copy link
Member

Well something funny is going on, because seek() shouldn't be needed. File streaming support depends on avoiding seek calls.

@grahamgower
Copy link
Member

stdin doesn't support seek() for example, and this works just fine

$ python -c "import tskit, sys; tskit.load(sys.stdin)" < basics.trees

@benjeffery
Copy link
Member

Ah, I understand now. After failing to read the kastore, tskit tries to load as h5py, and that's when it tries to seek. So yes, something weird here.

@benjeffery
Copy link
Member

I've tracked this down a bit - as tskit's python interface is taking the file descriptor of the HTTPResponse and then reopening it in the C API, it is getting the encrypted data! Fixing this in the general case is interesting - maybe creating a proxy file descriptor that is passed to the C API?

@jeromekelleher
Copy link
Member

Not a high priority I think, but nicely done tracking it down. This stuff is always more complicated than it seems...

@benjeffery
Copy link
Member

No, not planning to fix soon. Tracked it down as unexplained behaviour is always concerning!

@hyanwong
Copy link
Member Author

Nicely triaged, @benjeffery, thanks. Agree about low priority.

@hyanwong
Copy link
Member Author

hyanwong commented Jun 7, 2022

Just coming back to this as a result of workshops and pyodide. It's more than likely that online tree sequences will be posted in .tsz format (e.g. the unified genealogies on zenodo), so perhaps it's more of a priority to allow tszip.decompress(url="https://...") I don't know if that's any easier to sort than tskit.load(url=...): I assume not.

For reference, here's what I'm doing as a workaround:

import tszip
import urllib.request
temporary_filename, _ = urllib.request.urlretrieve(
    "https://zenodo.org/record/5495535/files/hgdp_tgp_sgdp_chr22_q.dated.trees.tsz"
)
ts = tszip.decompress(temporary_filename)
urllib.request.urlcleanup() # should remove temporary_filename

@jeromekelleher
Copy link
Member

Adding url support to tszip sounds like a great idea - zarr already supports some forms of url access so we should hopefully be able to tap into that.

@benjeffery
Copy link
Member

tskit-dev/tszip#66

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request future Issues that are closed as they are not planned in the medium-term, but which are still desirable. Python API Issue is about the Python API
Projects
None yet
Development

No branches or pull requests

4 participants