Commit c3b5596

Improve docs.
1 parent 43486d7 commit c3b5596

2 files changed: 35 additions and 21 deletions

docs/source/how_to_guides/provisional_nodes_and_task_generators.md

Lines changed: 19 additions & 13 deletions
@@ -8,29 +8,32 @@ pytask's execution model can usually be separated into three phases.
 
 But, in some situations, pytask needs to be more flexible.
 
-Imagine you want to download files from an online storage, but the total number of files
-and their filenames is unknown before the task has started. How can you still describe
-the files as products of the task?
+Imagine you want to download a folder with files from an online storage. Before the task
+is completed, you do not know the total number of files or their filenames. How can you
+still describe the files as products of the task?
 
 And how would you define another task that depends on these files?
 
 The following sections will explain how you use pytask in these situations.
 
 ## Producing provisional nodes
 
-Let us start with a task that downloads all files without an extension from the root
-folder of the pytask repository and stores them on disk in a folder called `downloads`.
+As an example of this scenario, let us write a task that downloads all
+files without a file extension from the root folder of the pytask GitHub repository. The
+files are stored in a folder called `downloads`. Because `downloads` is a relative path,
+it is resolved relative to the folder of the task module.
 
 ```{literalinclude} ../../../docs_src/how_to_guides/provisional_products.py
 ---
-emphasize-lines: 4, 11
+emphasize-lines: 4, 22
 ---
 ```
 
 Since the names of the files are not known when pytask is started, we need to use a
-{class}`~pytask.DirectoryNode`. With a {class}`~pytask.DirectoryNode` we can specify
-where pytask can find the files. The files are described with a path (default is the
-directory of the task module) and a glob pattern (default is `*`).
+{class}`~pytask.DirectoryNode` to define the task's product. With a
+{class}`~pytask.DirectoryNode` we can specify where pytask can find the files. The files
+are described with a root path (default is the directory of the task module) and a glob
+pattern (default is `*`).
 
 When we use the {class}`~pytask.DirectoryNode` as a product annotation, we get access to
 the `root_dir` as a {class}`~pathlib.Path` object inside the function, which allows us
@@ -49,16 +52,19 @@ actual nodes. A {class}`~pytask.DirectoryNode`, for example, returns
 In the next step, we want to define a task that consumes and merges all previously
 downloaded files into one file.
 
+The difficulty is that we need to reference the files before they have been
+downloaded.
+
 ```{literalinclude} ../../../docs_src/how_to_guides/provisional_task.py
 ---
 emphasize-lines: 9
 ---
 ```
 
-Here, the {class}`~pytask.DirectoryNode` is a dependency because we do not know the
-names of the downloaded files. Before the task is executed, the list of files in the
-folder defined by the root path and the pattern are automatically collected and passed
-to the task.
+To reference the files that will be downloaded, we use the
+{class}`~pytask.DirectoryNode` as a dependency. Before the task is executed, the list of
+files in the folder defined by the root path and the pattern is automatically collected
+and passed to the task.
 
 If we use a {class}`~pytask.DirectoryNode` with the same `root_dir` and `pattern` in
 both tasks, pytask will automatically recognize that the second task depends on the
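
Note on the hunk above: the phrase "the list of files in the folder defined by the root path and the pattern" can be illustrated with plain `pathlib`. The following is only a sketch of the idea, not pytask's implementation. `collect_files` and `merge_files` are hypothetical helper names, and a temporary directory stands in for the `downloads` folder.

```python
import tempfile
from pathlib import Path


def collect_files(root_dir: Path, pattern: str = "*") -> list[Path]:
    """Return the files under root_dir that match pattern, sorted by path."""
    return sorted(p for p in root_dir.glob(pattern) if p.is_file())


def merge_files(paths: list[Path]) -> str:
    """Concatenate the text of all files, separated by newlines."""
    return "\n".join(path.read_text() for path in paths)


with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / "downloads"
    downloads.mkdir()
    # Stand-ins for files that an earlier download task would have written.
    (downloads / "CITATION").write_text("cite me")
    (downloads / "LICENSE").write_text("MIT")

    # The consumer only sees concrete paths once the folder has been filled.
    paths = collect_files(downloads, "*")
    names = [p.name for p in paths]
    merged = merge_files(paths)
```

The sketch also shows why the lookup has to happen late: running `collect_files` before the folder is populated would return an empty list.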
Lines changed: 16 additions & 8 deletions
@@ -1,25 +1,33 @@
 from pathlib import Path
 
-import requests
+import httpx
 from pytask import DirectoryNode
 from pytask import Product
 from typing_extensions import Annotated
 
 
+def get_files_without_file_extensions_from_repo() -> list[str]:
+    url = "https://api.github.com/repos/pytask-dev/pytask/git/trees/main"
+    response = httpx.get(url)
+    elements = response.json()["tree"]
+    return [
+        e["path"]
+        for e in elements
+        if e["type"] == "blob" and Path(e["path"]).suffix == ""
+    ]
+
+
 def task_download_files(
     download_folder: Annotated[
         Path, DirectoryNode(root_dir=Path("downloads"), pattern="*"), Product
     ],
 ) -> None:
     """Download files."""
-    # Scrape list of files without file extension from
-    # https://github.com/pytask-dev/pytask. (We skip this part for simplicity.)
-    files_to_download = ("CITATION", "LICENSE")
+    # Contains names like CITATION or LICENSE.
+    files_to_download = get_files_without_file_extensions_from_repo()
 
-    # Download them.
     for file_ in files_to_download:
-        response = requests.get(
-            url=f"raw.githubusercontent.com/pytask-dev/pytask/main/{file_}", timeout=5
-        )
+        url = "https://raw.githubusercontent.com/pytask-dev/pytask/main"
+        response = httpx.get(url=f"{url}/{file_}", timeout=5)
         content = response.text
         download_folder.joinpath(file_).write_text(content)
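
Note on the filter in `get_files_without_file_extensions_from_repo`: it can be checked without network access by running the same list comprehension over a hand-written payload. The sketch below assumes the response's `"tree"` entries carry `"path"` and `"type"` keys, as the diff does. `filter_files_without_extension` is a hypothetical name and `sample_tree` is made-up data, not an actual API response.

```python
from pathlib import Path


def filter_files_without_extension(tree: list[dict[str, str]]) -> list[str]:
    """Keep blob entries whose path has no file suffix (same test as the diff)."""
    return [
        e["path"]
        for e in tree
        if e["type"] == "blob" and Path(e["path"]).suffix == ""
    ]


# Made-up entries shaped like the "tree" list the diff reads from the response.
sample_tree = [
    {"path": "CITATION", "type": "blob"},
    {"path": "LICENSE", "type": "blob"},
    {"path": "README.md", "type": "blob"},
    {"path": "docs", "type": "tree"},
    {"path": ".gitignore", "type": "blob"},
]

selected = filter_files_without_extension(sample_tree)
```

One consequence worth knowing: `pathlib` reports an empty suffix for dotfiles, so an entry such as `.gitignore` passes the filter alongside `CITATION` and `LICENSE`, while directories (`"type": "tree"`) and `README.md` are dropped.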
