Skip to content

Commit b8b7f9f

Browse files
authored
Implement task generators and provisional nodes. (#487)
1 parent a5ac789 commit b8b7f9f

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

47 files changed

+1209
-145
lines changed

docs/source/changes.md

+1
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
7070
- {pull}`485` adds missing steps to unconfigure pytask after the job is done, which
7171
caused flaky tests.
7272
- {pull}`486` adds default names to {class}`~pytask.PPathNode`.
73+
- {pull}`487` implements task generators and provisional nodes.
7374
- {pull}`488` raises an error when an invalid value is used in a return annotation.
7475
- {pull}`489` and {pull}`491` simplifies parsing products and does not raise an error
7576
when a product annotation is used with the argument name `produces`. And allow

docs/source/how_to_guides/index.md

+1
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@ capture_warnings
1919
how_to_influence_build_order
2020
hashing_inputs_of_tasks
2121
using_task_returns
22+
provisional_nodes_and_task_generators
2223
writing_custom_nodes
2324
extending_pytask
2425
the_data_catalog
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
# Provisional nodes and task generators
2+
3+
pytask's execution model can usually be separated into three phases.
4+
5+
1. Collection of tasks, dependencies, and products.
6+
1. Building the DAG.
7+
1. Executing the tasks.
8+
9+
But, in some situations, pytask needs to be more flexible.
10+
11+
Imagine you want to download a folder with files from an online storage. Before the task
12+
is completed you do not know the total number of files or their filenames. How can you
13+
still describe the files as products of the task?
14+
15+
And how would you define another task that depends on these files?
16+
17+
The following sections will explain how you use pytask in these situations.
18+
19+
## Producing provisional nodes
20+
21+
As an example for the aforementioned scenario, let us write a task that downloads all
22+
files without a file extension from the root folder of the pytask GitHub repository. The
23+
files are downloaded to a folder called `downloads`. `downloads` is in the same folder
24+
as the task module because it is a relative path.
25+
26+
```{literalinclude} ../../../docs_src/how_to_guides/provisional_products.py
27+
---
28+
emphasize-lines: 4, 22
29+
---
30+
```
31+
32+
Since the names of the files are not known when pytask is started, we need to use a
33+
{class}`~pytask.DirectoryNode` to define the task's product. With a
34+
{class}`~pytask.DirectoryNode` we can specify where pytask can find the files. The files
35+
are described with a root path (default is the directory of the task module) and a glob
36+
pattern (default is `*`).
37+
38+
When we use the {class}`~pytask.DirectoryNode` as a product annotation, we get access to
39+
the `root_dir` as a {class}`~pathlib.Path` object inside the function, which allows us
40+
to store the files.
41+
42+
```{note}
43+
The {class}`~pytask.DirectoryNode` is a provisional node that implements
44+
{class}`~pytask.PProvisionalNode`. A provisional node is not a {class}`~pytask.PNode`,
45+
but when its {meth}`~pytask.PProvisionalNode.collect` method is called, it returns
46+
actual nodes. A {class}`~pytask.DirectoryNode`, for example, returns
47+
{class}`~pytask.PathNode`.
48+
```
49+
50+
## Depending on provisional nodes
51+
52+
In the next step, we want to define a task that consumes and merges all previously
53+
downloaded files into one file.
54+
55+
The difficulty here is how can we reference the downloaded files before they have been
56+
downloaded.
57+
58+
```{literalinclude} ../../../docs_src/how_to_guides/provisional_task.py
59+
---
60+
emphasize-lines: 9
61+
---
62+
```
63+
64+
To reference the files that will be downloaded, we use the
65+
{class}`~pytask.DirectoryNode` is a dependency. Before the task is executed, the list of
66+
files in the folder defined by the root path and the pattern are automatically collected
67+
and passed to the task.
68+
69+
If we use a {class}`~pytask.DirectoryNode` with the same `root_dir` and `pattern` in
70+
both tasks, pytask will automatically recognize that the second task depends on the
71+
first. If that is not true, you might need to make this dependency more explicit by
72+
using {func}`@task(after=...) <pytask.task>`, which is explained {ref}`here <after>`.
73+
74+
## Task generators
75+
76+
What if we wanted to process each downloaded file separately instead of dealing with
77+
them in one task?
78+
79+
For that, we have to write a task generator to define an unknown number of tasks for an
80+
unknown number of downloaded files.
81+
82+
A task generator is a task function in which we define more tasks, just as if we were
83+
writing functions in a task module.
84+
85+
The code snippet shows each task takes one of the downloaded files and copies its
86+
content to a `.txt` file.
87+
88+
```{literalinclude} ../../../docs_src/how_to_guides/provisional_task_generator.py
89+
```
90+
91+
```{important}
92+
The generated tasks need to be decoratored with {func}`@task <pytask.task>` to be
93+
collected.
94+
```

docs/source/reference_guides/api.md

+4
Original file line numberDiff line numberDiff line change
@@ -190,6 +190,8 @@ Protocols define how tasks and nodes for dependencies and products have to be se
190190
:show-inheritance:
191191
.. autoprotocol:: pytask.PTaskWithPath
192192
:show-inheritance:
193+
.. autoprotocol:: pytask.PProvisionalNode
194+
:show-inheritance:
193195
```
194196

195197
## Nodes
@@ -203,6 +205,8 @@ Nodes are the interface for different kinds of dependencies or products.
203205
:members:
204206
.. autoclass:: pytask.PythonNode
205207
:members:
208+
.. autoclass:: pytask.DirectoryNode
209+
:members:
206210
```
207211

208212
To parse dependencies and products from nodes, use the following functions.

docs/source/tutorials/skipping_tasks.md

+3-3
Original file line numberDiff line numberDiff line change
@@ -13,18 +13,18 @@ skip tasks during development that take too much time to compute right now.
1313
```{literalinclude} ../../../docs_src/tutorials/skipping_tasks_example_1.py
1414
```
1515

16-
Not only will this task be skipped, but all tasks that depend on
16+
Not only will this task be skipped, but all tasks depending on
1717
`time_intensive_product.pkl`.
1818

1919
## Conditional skipping
2020

2121
In large projects, you may have many long-running tasks that you only want to execute on
22-
a remote server but not when you are not working locally.
22+
a remote server, but not when you are not working locally.
2323

2424
In this case, use the {func}`@pytask.mark.skipif <pytask.mark.skipif>` decorator, which
2525
requires a condition and a reason as arguments.
2626

27-
Place the condition variable in a different module than the task, so you can change it
27+
Place the condition variable in a module different from the task so you can change it
2828
without causing a rerun of the task.
2929

3030
```python
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
from pathlib import Path
2+
3+
import httpx
4+
from pytask import DirectoryNode
5+
from pytask import Product
6+
from typing_extensions import Annotated
7+
8+
9+
def get_files_without_file_extensions_from_repo() -> list[str]:
10+
url = "https://api.github.com/repos/pytask-dev/pytask/git/trees/main"
11+
response = httpx.get(url)
12+
elements = response.json()["tree"]
13+
return [
14+
e["path"]
15+
for e in elements
16+
if e["type"] == "blob" and Path(e["path"]).suffix == ""
17+
]
18+
19+
20+
def task_download_files(
21+
download_folder: Annotated[
22+
Path, DirectoryNode(root_dir=Path("downloads"), pattern="*"), Product
23+
],
24+
) -> None:
25+
"""Download files."""
26+
# Contains names like CITATION or LICENSE.
27+
files_to_download = get_files_without_file_extensions_from_repo()
28+
29+
for file_ in files_to_download:
30+
url = "raw.githubusercontent.com/pytask-dev/pytask/main"
31+
response = httpx.get(url=f"{url}/{file_}", timeout=5)
32+
content = response.text
33+
download_folder.joinpath(file_).write_text(content)
+14
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
from pathlib import Path
2+
3+
from pytask import DirectoryNode
4+
from typing_extensions import Annotated
5+
6+
7+
def task_merge_files(
8+
paths: Annotated[
9+
list[Path], DirectoryNode(root_dir=Path("downloads"), pattern="*")
10+
],
11+
) -> Annotated[str, Path("all_text.txt")]:
12+
"""Merge files."""
13+
contents = [path.read_text() for path in paths]
14+
return "\n".join(contents)
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
from pathlib import Path
2+
3+
from pytask import DirectoryNode
4+
from pytask import task
5+
from typing_extensions import Annotated
6+
7+
8+
@task(is_generator=True)
9+
def task_copy_files(
10+
paths: Annotated[
11+
list[Path], DirectoryNode(root_dir=Path("downloads"), pattern="*")
12+
],
13+
) -> None:
14+
"""Create tasks to copy each file to a ``.txt`` file."""
15+
for path in paths:
16+
# The path of the copy will be CITATION.txt, for example.
17+
path_to_copy = path.with_suffix(".txt")
18+
19+
@task
20+
def copy_file(path: Annotated[Path, path]) -> Annotated[str, path_to_copy]:
21+
return path.read_text()

src/_pytask/collect.py

+60-26
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,9 @@
3030
from _pytask.mark_utils import has_mark
3131
from _pytask.node_protocols import PNode
3232
from _pytask.node_protocols import PPathNode
33+
from _pytask.node_protocols import PProvisionalNode
3334
from _pytask.node_protocols import PTask
35+
from _pytask.nodes import DirectoryNode
3436
from _pytask.nodes import PathNode
3537
from _pytask.nodes import PythonNode
3638
from _pytask.nodes import Task
@@ -299,6 +301,8 @@ def pytask_collect_task(
299301
raise ValueError(msg)
300302

301303
path_nodes = Path.cwd() if path is None else path.parent
304+
305+
# Collect dependencies and products.
302306
dependencies = parse_dependencies_from_task_function(
303307
session, path, name, path_nodes, obj
304308
)
@@ -309,6 +313,9 @@ def pytask_collect_task(
309313
markers = get_all_marks(obj)
310314
collection_id = obj.pytask_meta._id if hasattr(obj, "pytask_meta") else None
311315
after = obj.pytask_meta.after if hasattr(obj, "pytask_meta") else []
316+
is_generator = (
317+
obj.pytask_meta.is_generator if hasattr(obj, "pytask_meta") else False
318+
)
312319

313320
# Get the underlying function to avoid having different states of the function,
314321
# e.g. due to pytask_meta, in different layers of the wrapping.
@@ -321,7 +328,11 @@ def pytask_collect_task(
321328
depends_on=dependencies,
322329
produces=products,
323330
markers=markers,
324-
attributes={"collection_id": collection_id, "after": after},
331+
attributes={
332+
"collection_id": collection_id,
333+
"after": after,
334+
"is_generator": is_generator,
335+
},
325336
)
326337
return Task(
327338
base_name=name,
@@ -330,41 +341,25 @@ def pytask_collect_task(
330341
depends_on=dependencies,
331342
produces=products,
332343
markers=markers,
333-
attributes={"collection_id": collection_id, "after": after},
344+
attributes={
345+
"collection_id": collection_id,
346+
"after": after,
347+
"is_generator": is_generator,
348+
},
334349
)
335350
if isinstance(obj, PTask):
336351
return obj
337352
return None
338353

339354

340-
_TEMPLATE_ERROR: str = """\
341-
The provided path of the dependency/product is
342-
343-
{}
344-
345-
, but the path of the file on disk is
346-
347-
{}
348-
349-
Case-sensitive file systems would raise an error because the upper and lower case \
350-
format of the paths does not match.
351-
352-
Please, align the names to ensure reproducibility on case-sensitive file systems \
353-
(often Linux or macOS) or disable this error with 'check_casing_of_paths = false' in \
354-
the pyproject.toml file.
355-
356-
Hint: If parts of the path preceding your project directory are not properly \
357-
formatted, check whether you need to call `.resolve()` on `SRC`, `BLD` or other paths \
358-
created from the `__file__` attribute of a module.
359-
"""
360-
361-
362355
_TEMPLATE_ERROR_DIRECTORY: str = """\
363356
The path '{path}' points to a directory, although only files are allowed."""
364357

365358

366359
@hookimpl(trylast=True)
367-
def pytask_collect_node(session: Session, path: Path, node_info: NodeInfo) -> PNode: # noqa: C901, PLR0912
360+
def pytask_collect_node( # noqa: C901, PLR0912
361+
session: Session, path: Path, node_info: NodeInfo
362+
) -> PNode | PProvisionalNode:
368363
"""Collect a node of a task as a :class:`pytask.PNode`.
369364
370365
Strings are assumed to be paths. This might be a strict assumption, but since this
@@ -384,6 +379,21 @@ def pytask_collect_node(session: Session, path: Path, node_info: NodeInfo) -> PN
384379
"""
385380
node = node_info.value
386381

382+
if isinstance(node, DirectoryNode):
383+
if node.root_dir is None:
384+
node.root_dir = path
385+
if (
386+
not node.name
387+
or node.name == node.root_dir.joinpath(node.pattern).as_posix()
388+
):
389+
short_root_dir = shorten_path(
390+
node.root_dir, session.config["paths"] or (session.config["root"],)
391+
)
392+
node.name = Path(short_root_dir, node.pattern).as_posix()
393+
394+
if isinstance(node, PProvisionalNode):
395+
return node
396+
387397
if isinstance(node, PythonNode):
388398
node.node_info = node_info
389399
if not node.name:
@@ -418,9 +428,11 @@ def pytask_collect_node(session: Session, path: Path, node_info: NodeInfo) -> PN
418428
raise ValueError(_TEMPLATE_ERROR_DIRECTORY.format(path=node.path))
419429

420430
if isinstance(node, PNode):
431+
if not node.name:
432+
node.name = create_name_of_python_node(node_info)
421433
return node
422434

423-
if isinstance(node, UPath):
435+
if isinstance(node, UPath): # pragma: no cover
424436
if not node.protocol:
425437
node = Path(node)
426438
else:
@@ -459,6 +471,28 @@ def pytask_collect_node(session: Session, path: Path, node_info: NodeInfo) -> PN
459471
return PythonNode(value=node, name=node_name, node_info=node_info)
460472

461473

474+
_TEMPLATE_ERROR: str = """\
475+
The provided path of the dependency/product is
476+
477+
{}
478+
479+
, but the path of the file on disk is
480+
481+
{}
482+
483+
Case-sensitive file systems would raise an error because the upper and lower case \
484+
format of the paths does not match.
485+
486+
Please, align the names to ensure reproducibility on case-sensitive file systems \
487+
(often Linux or macOS) or disable this error with 'check_casing_of_paths = false' in \
488+
your pytask configuration file.
489+
490+
Hint: If parts of the path preceding your project directory are not properly \
491+
formatted, check whether you need to call `.resolve()` on `SRC`, `BLD` or other paths \
492+
created from the `__file__` attribute of a module.
493+
"""
494+
495+
462496
def _raise_error_if_casing_of_path_is_wrong(
463497
path: Path, check_casing_of_paths: bool
464498
) -> None:

0 commit comments

Comments
 (0)