Support loading from URLs in Python #1566
Comments
NB: this would mean that content in e.g. jupyterbooks could simply be pasted into the user's jupyter session and (as long as the right libraries were installed), the examples should all run. It's less error-prone and fiddly than trying to run the jupyterbook e.g. in binder (see tskit-dev/tutorials#114) |
The right way to do this I think would be to make the string form of the file input support URLs, which we intercept in Python. BUT this would be quite a lot of work to do well (it's simple to do badly) so I'm not super keen on it right now. |
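As a rough illustration of why intercepting URLs in the string form is "simple to do badly": a naive scheme check would treat a Windows path such as `C:\data\basics.trees` as a URL, because the drive letter parses as a scheme. The sketch below uses an explicit allow-list instead. `looks_like_url` and `URL_SCHEMES` are hypothetical names, not part of tskit.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of schemes to treat as remote (or file) URLs.
URL_SCHEMES = {"http", "https", "ftp", "file"}


def looks_like_url(file):
    """Return True if ``file`` is a string with an allow-listed URL scheme.

    Hypothetical helper, not part of tskit. Anything that is not a ``str``
    (path objects, bytes, open file objects) is never treated as a URL.
    """
    if not isinstance(file, str):
        return False
    return urlparse(file).scheme.lower() in URL_SCHEMES
```

A loader could call this first and fall through to the existing path and file-object handling when it returns `False`.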
Just to say that I have wanted to do this for a while, for other purposes too (not just the tutorials), so I am keen to get something like this done. But maybe others could weigh in if they have needed to do this in the past (as it could just be me). Out of interest, why would it be a lot of work to do well? I would have thought the standard URL libraries would do most of the heavy lifting? |
Presumably the workaround for the moment is simply e.g.
Is there any reason that we shouldn't be doing this as a workaround? |
I've been thinking about this, and actually I do think that the form |
It's about much more than just the parameters @hyanwong - once you start interacting with the network all sorts of extra complications come in, like http proxies, error handling, retries, etc etc etc. The nominal case is trivial - it's the error and edge cases that are really tricky. These are hard to implement and especially hard to test. I've done it several times and don't particularly want to spend the time doing it now. |
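To give a flavour of the error handling mentioned here, this is a minimal sketch of a download with timeouts and retries, using only the standard library. The function name, the retry policy and the `opener` parameter (there so the network call can be swapped out in tests) are all assumptions made for illustration. It does not cover everything listed above; proxy handling, for instance, is left to whatever `urllib.request` does by default.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, attempts=3, timeout=10.0, backoff=0.5,
                       opener=urllib.request.urlopen):
    """Return the bytes at ``url``, retrying failures that may be transient.

    Hypothetical helper. ``opener`` defaults to ``urllib.request.urlopen``
    and can be replaced with a stand-in when testing.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # HTTPError subclasses URLError, so it must be caught first.
            # A 4xx response is unlikely to change on retry: give up now.
            if 400 <= err.code < 500:
                raise
            last_error = err
        except (urllib.error.URLError, TimeoutError) as err:
            last_error = err
        if attempt < attempts - 1:
            # Exponential backoff between attempts.
            time.sleep(backoff * 2 ** attempt)
    raise last_error
```

Even this leaves open questions (partial downloads, very large files, which status codes are worth retrying), which is roughly the point being made about edge cases.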
import urllib.request
import tskit
file = urllib.request.urlopen('file:///Users/yan/Documents/GitHub/tutorials/data/basics.trees')
ts = tskit.load(file)
This seems like a perfectly reasonable thing to do to me (assuming it works). |
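A more conservative variant of the same workaround, in case passing the response object straight through does not work on a given platform, is to spool the download to a temporary file and load from its path. This is only a sketch: `load_from_url` is a hypothetical helper, and it assumes the loader has finished reading the file by the time it returns, since the temporary file is deleted afterwards.

```python
import os
import shutil
import tempfile
import urllib.request


def load_from_url(url, loader, timeout=30.0):
    """Download ``url`` to a temporary file, then call ``loader`` on its path.

    Hypothetical helper, not part of tskit. ``loader`` would typically be
    ``tskit.load``; it is passed in so the sketch has no hard dependency.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with tempfile.NamedTemporaryFile(suffix=".trees", delete=False) as tmp:
            # Stream the body to disk rather than holding it all in memory.
            shutil.copyfileobj(response, tmp)
            tmp_path = tmp.name
    try:
        return loader(tmp_path)
    finally:
        # Assumes the loader no longer needs the file once it has returned.
        os.remove(tmp_path)
```

Usage would then be something like `ts = load_from_url(url, tskit.load)`, with `url` pointing at a `.trees` file.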
Right, that sounds reasonable. I thought that we could simply say that we are handing all this stuff over to |
I'd be much more motivated to do this if tskit had any sense of being able to use part of a file, as Pointing out that |
Annoyingly I find that
Gives Note that I presume this means that there's no way to load in tree sequence from a stream of data: we require a file on disk, not e.g. some data spooled to us by a DB or equivalent. |
We have pre-existing tests for sockets: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L289 and streams: https://github.com/tskit-dev/tskit/blob/main/python/tests/test_fileobj.py#L246 |
Oh, useful, thanks. Maybe it's because I'm on OS X. Weird that OS X doesn't do file descriptors properly though. |
Although actually I still get the |
Digging a little more. I would have thought that if I pass a fileno to
The failure is:
|
EDIT: This was very wrong. |
Ah, that explains it. Didn't know we used |
It looks like that file is an hdf5 file, not a kastore file. Non-seek file objects should work just fine through kastore. |
I'm pretty sure it's a kastore file:
(same as you get when |
Well something funny is going on, because |
stdin doesn't support
|
Ah, I understand now. After failing to read the kastore, tskit tries to load as h5py, and that's when it tries to seek. So yes, something weird here. |
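For anyone wanting to see the underlying limitation in isolation: a pipe (which is what stdin usually is when data is spooled in) reports itself as non-seekable, and any attempt to seek raises an error. The sketch below shows only that general Python behaviour, not tskit's actual code path; `pipe_seek_demo` is a hypothetical name.

```python
import os


def pipe_seek_demo():
    """Return (seekable, error) for a file object wrapped around a pipe.

    Illustration only: shows that a fallback reader which calls ``seek``
    cannot work on pipe-like input.
    """
    read_fd, write_fd = os.pipe()
    with os.fdopen(write_fd, "wb") as writer:
        writer.write(b"some bytes")
    with os.fdopen(read_fd, "rb") as reader:
        seekable = reader.seekable()
        try:
            reader.seek(0)
            error = None
        except OSError as err:
            # io.UnsupportedOperation is a subclass of OSError.
            error = err
    return seekable, error
```

Checking `seekable()` before falling back to a format that needs random access would at least allow a clearer error message.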
I've tracked this down a bit - as tskit's python interface is taking the file descriptor of the |
Not a high priority I think, but nicely done tracking it down. This stuff is always more complicated than it seems... |
No, not planning to fix soon. Tracked it down as unexplained behaviour is always concerning! |
Nicely triaged, @benjeffery, thanks. Agree about low priority. |
Just coming back to this as a result of workshops and pyodide. It's more than likely that online tree sequences will be posted in For reference, here's what I'm doing as a workaround:
|
Adding url support to tszip sounds like a great idea - zarr already supports some forms of url access so we should hopefully be able to tap into that. |
For teaching purposes, and possibly for other reasons too, I think it would be really useful to be able to load a tree sequence from a URL. IMO the nicest way to make this easy for a user would be to have a `url` argument to `tskit.load`:
I would hope that this would mainly be used for small tree sequences, although I suppose it could be an easy way for someone to load up larger ones e.g. from zenodo - depends how long that would take I guess.
I suspect implementing this using the `urllib.request` library would be quite easy, although I don't know how we would unit test it - probably mock the `urllib.request.urlopen` function when testing, or something.
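On the unit-testing question, mocking `urllib.request.urlopen` could look roughly like the sketch below. `fetch_bytes` is a hypothetical stand-in for whatever function ends up doing the download, and the example.com URL is a placeholder. The mock returns an `io.BytesIO`, which supports both `read()` and the context-manager protocol, so no network access happens.

```python
import io
import urllib.request
from unittest import mock


def fetch_bytes(url):
    """Hypothetical helper: return the raw bytes served at ``url``."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def check_fetch_with_mocked_urlopen():
    """Call fetch_bytes with urlopen patched out; return what it read."""
    fake_payload = b"not a real tree sequence"
    url = "https://example.com/basics.trees"  # placeholder, never contacted
    with mock.patch("urllib.request.urlopen",
                    return_value=io.BytesIO(fake_payload)) as fake_urlopen:
        data = fetch_bytes(url)
    # The code under test should have requested exactly this URL, once.
    fake_urlopen.assert_called_once_with(url)
    return data
```

This only exercises the nominal path; simulating failures would mean setting `side_effect` on the mock to raise, for example, `urllib.error.URLError`.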