| 1 | +# Remote files |
| 2 | + |
| 3 | +So far, we have only dealt with local files in the tutorials and guides. But there are
| 4 | +many use cases for working with remote files.
| 5 | + |
| 6 | +- You distribute the workflow without the data and want to make it easy for others to |
| 7 | + get started. So, some tasks reference remote files instead of local files. |
| 8 | +- You store the workflow results in remote storage to save and distribute them. |
| 9 | + |
| 10 | +pytask uses [universal_pathlib](https://github.com/fsspec/universal_pathlib) to work |
| 11 | +with remote files. The package provides a {mod}`pathlib`-like interface, making it very |
| 12 | +easy to interact with files from an HTTP(S)-, Dropbox-, S3-, GCP-, Azure-based |
| 13 | +filesystem, and many more. |
| 14 | + |
| 15 | +:::{warning} |
| 16 | +universal_pathlib does not currently support Python 3.12. To track progress, see [this
| 17 | +PR](https://github.com/fsspec/universal_pathlib/pull/152) and check the [releases
| 18 | +`>0.1.4`](https://github.com/fsspec/universal_pathlib/releases).
| 19 | +::: |
| 20 | + |
| 21 | +## HTTP(S)-based filesystem |
| 22 | + |
| 23 | +As an example of working with an HTTP(S)-based filesystem, we will download the iris
| 24 | +data set and save it as a CSV file. |
| 25 | + |
| 26 | +```{literalinclude} ../../../docs_src/how_to_guides/remote_files/https.py |
| 27 | +``` |
| 28 | + |
| 29 | +## Other filesystems |
| 30 | + |
| 31 | +universal_pathlib supports Azure Storage, Dropbox, Google Cloud Storage, AWS S3, and |
| 32 | +[many more filesystems](https://github.com/fsspec/universal_pathlib#currently-supported-filesystems-and-schemes). |
| 33 | + |
| 34 | +For example, let us try accessing a file in an S3 bucket. We pass `anon=True` to |
| 35 | +{class}`~upath.UPath` since no credentials are required. |
| 36 | + |
| 37 | +```pycon |
| 38 | +>>> from upath import UPath |
| 39 | +>>> path = UPath("s3://upath-aws-example/iris.data", anon=True) |
| 40 | +>>> path.stat() |
| 41 | +ModuleNotFoundError |
| 42 | +... |
| 43 | +ImportError: Install s3fs to access S3 |
| 44 | +``` |
| 45 | + |
| 46 | +Some filesystems are supported |
| 47 | +[out-of-the-box](https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations). |
| 48 | +[Others](https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations) |
| 49 | +are available as plugins and require additional packages. |
| 50 | + |
| 51 | +After installing s3fs, rerun the command. |
| 52 | + |
| 53 | +```pycon |
| 54 | +>>> path.stat() |
| 55 | +{'ETag': '"42615765a885ddf54427f12c34a0a070"', |
| 56 | + 'LastModified': datetime.datetime(2023, 12, 11, 23, 50, 3, tzinfo=tzutc()), |
| 57 | + 'size': 4551, |
| 58 | + 'name': 'upath-aws-example/iris.data', |
| 59 | + 'type': 'file', |
| 60 | + 'StorageClass': 'STANDARD', |
| 61 | + 'VersionId': None, |
| 62 | + 'ContentType': 'binary/octet-stream'} |
| 63 | +``` |
| 64 | + |
| 65 | +Usually, you will need credentials to access files. For information on
| 66 | +authentication, see
| 67 | +[fsspec's documentation](https://filesystem-spec.readthedocs.io/en/latest) or the plugin's
| 68 | +documentation, in this case [s3fs](https://s3fs.readthedocs.io/en/latest/#credentials). One way is
| 69 | +to set the environment variables `AWS_ACCESS_KEY_ID` and
| 70 | +`AWS_SECRET_ACCESS_KEY`.
| 71 | + |
| 72 | +## Detecting changes in remote files |
| 73 | + |
| 74 | +pytask uses the [entity tag (ETag)](https://en.wikipedia.org/wiki/HTTP_ETag) to detect |
| 75 | +changes in remote files. The ETag is an optional header field that can signal a file has |
| 76 | +changed. For example, |
| 77 | +[AWS S3 uses an MD5 digest](https://teppen.io/2018/06/23/aws_s3_etags/) of the uploaded |
| 78 | +file as the ETag. If the file changes, so does the ETag, and pytask will detect it. |
| 79 | + |
| 80 | +Many files on the web, like this version of the iris dataset, do not provide an ETag.
| 81 | + |
| 82 | +```pycon |
| 83 | +>>> import requests |
| 84 | +>>> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" |
| 85 | +>>> r = requests.head(url) |
| 86 | +>>> r.headers |
| 87 | +{'Server': 'nginx/1.25.3', 'Date': 'Sun, 10 Dec 2023 23:59:21 GMT', 'Connection': 'keep-alive'} |
| 88 | +``` |
| 89 | + |
| 90 | +In these instances, pytask cannot detect whether the file has changed and reruns the
| 91 | +task only for other reasons, for example, when the product is missing or the task
| 92 | +module has changed.
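
The ETag comparison described in this section can be sketched as follows. This is an illustration of the idea only, not pytask's actual implementation; the helper names `etag_from_headers` and `has_changed` are hypothetical. The sample values are the S3 ETag and the HTTP headers shown earlier in this guide.

```python
from __future__ import annotations


def etag_from_headers(headers: dict[str, str]) -> str | None:
    """Return the ETag header value, matching the header name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == "etag":
            return value
    return None


def has_changed(stored_etag: str | None, headers: dict[str, str]) -> bool | None:
    """Compare a stored ETag with the one in the current response headers.

    Returns True or False when both ETags are known, and None when either
    is missing, because then no decision can be made from the ETag alone.
    """
    current = etag_from_headers(headers)
    if stored_etag is None or current is None:
        return None
    return current != stored_etag


# ETag from the S3 example and the header-only response from the iris URL.
s3_etag = '"42615765a885ddf54427f12c34a0a070"'
headers_without_etag = {
    "Server": "nginx/1.25.3",
    "Date": "Sun, 10 Dec 2023 23:59:21 GMT",
    "Connection": "keep-alive",
}

unchanged = has_changed(s3_etag, {"ETag": s3_etag})
modified = has_changed(s3_etag, {"ETag": '"a-different-digest"'})
unknown = has_changed(s3_etag, headers_without_etag)
```

Returning `None` for the missing-ETag case keeps "unchanged" and "cannot tell" apart, which mirrors the situation above where other conditions have to decide whether the task reruns.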