Commit ba055a6

Follow-up on #616. (#632)
1 parent 982d2c7 commit ba055a6

5 files changed

+110
-27
lines changed

docs/source/changes.md

+2-2
Original file line numberDiff line numberDiff line change
@@ -5,9 +5,9 @@ chronological order. Releases follow [semantic versioning](https://semver.org/)
55
releases are available on [PyPI](https://pypi.org/project/pytask) and
66
[Anaconda.org](https://anaconda.org/conda-forge/pytask).
77

8-
## 0.5.1 - 2024-xx-xx
8+
## 0.5.1 - 2024-07-19
99

10-
- {pull}`616` redesigns the guide on "Scaling Tasks".
10+
- {pull}`616` and {pull}`632` redesign the guide on "Scaling Tasks".
1111
- {pull}`617` fixes an interaction with provisional nodes and `@mark.persist`.
1212
- {pull}`618` ensures that `root_dir` of `DirectoryNode` is created before the task is
1313
executed.

docs/source/how_to_guides/bp_complex_task_repetitions.md

+62-20
Original file line numberDiff line numberDiff line change
@@ -32,27 +32,35 @@ are growing over time and you run into these problems.
3232
## Solution
3333

3434
The main idea for the solution is quickly explained. We will, first, formalize
35-
dimensions into objects and, secondly, combine them in one object such that we only have
36-
to iterate over instances of this object in a single loop.
37-
38-
We will start by defining the dimensions using {class}`~typing.NamedTuple` or
35+
dimensions into objects using {class}`~typing.NamedTuple` or
3936
{func}`~dataclasses.dataclass`.
4037

41-
Then, we will define the object that holds both pieces of information together and for
42-
the lack of a better name, we will call it an experiment.
38+
Secondly, we will combine dimensions into one multi-dimensional object such that we only have
39+
to iterate over instances of this object in a single loop. For lack of a
40+
better name, we will call this object an experiment.
41+
42+
Lastly, we will use the {class}`~pytask.DataCatalog` so that we do not have to
43+
define paths ourselves.
4344

44-
```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
45+
```{seealso}
46+
If you have not learned about the {class}`~pytask.DataCatalog` yet, start with the
47+
{doc}`tutorial <../tutorials/using_a_data_catalog>` and continue with the
48+
{doc}`how-to guide <the_data_catalog>`.
49+
```
50+
51+
```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/config.py
4552
---
4653
caption: config.py
4754
---
4855
```
4956

5057
There are some things to be said.
5158

52-
- The names on each dimension need to be unique and ensure that by combining them for
53-
the name of the experiment, we get a unique and descriptive id.
54-
- Dimensions might need more attributes than just a name, like paths, or other arguments
55-
for the task. Add them.
59+
- The `.name` attributes on each dimension need to return unique names so that
60+
combining them for the name of the experiment yields a unique and descriptive
61+
id.
62+
- Dimensions might need more attributes than just a name, like paths, keys for the data
63+
catalog, or other arguments for the task.
5664

5765
Next, we will use these newly defined data structures and see how our tasks change when
5866
we use them.
@@ -63,21 +71,55 @@ caption: task_example.py
6371
---
6472
```
6573

66-
As you see, we replaced
74+
As you see, we lost a level of indentation and moved the generation of names and
75+
paths into the dimensions and multi-dimensional objects.
6776

68-
## Using the `DataCatalog`
77+
## Adding another level
6978

70-
## Adding another dimension
79+
Extending a dimension by another level is usually quick. For example, if we have
80+
another model that we want to fit to the data, we extend `MODELS`, which
81+
automatically creates all downstream tasks.
7182

72-
## Adding another level
83+
```{code-block} python
84+
---
85+
caption: config.py
86+
---
87+
...
88+
MODELS = [Model("ols"), Model("logit"), Model("linear_prob"), Model("new_model")]
89+
...
90+
```
91+
92+
Of course, you might need to alter `task_fit_model` because the task needs to handle the
93+
new model as well as the others. Here is where it pays off if you are using high-level
94+
interfaces in your code that handle all of the models with a simple
95+
`fitted_model = fit_model(data=data, model_name=model_name)` call and also return fitted
96+
models that are similar objects.
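
A minimal sketch of such a high-level interface follows. It is not part of this commit: the registry `_FITTERS`, the helper functions `_fit_ols` and `_fit_logit`, and the dictionary they return are hypothetical placeholders, and a real project would call its statistics library inside them. The point is only that `fit_model` dispatches on `model_name` and every fitter returns the same kind of object.

```python
from collections.abc import Callable
from typing import Any


# Hypothetical fitters: each returns a result with the same structure so that
# downstream tasks can treat all fitted models alike.
def _fit_ols(data: list[float]) -> dict[str, Any]:
    return {"model": "ols", "n_obs": len(data)}


def _fit_logit(data: list[float]) -> dict[str, Any]:
    return {"model": "logit", "n_obs": len(data)}


# Registry mapping model names to fitters. Supporting a new model means adding
# one entry here; the task function itself stays unchanged.
_FITTERS: dict[str, Callable[[list[float]], dict[str, Any]]] = {
    "ols": _fit_ols,
    "logit": _fit_logit,
}


def fit_model(data: list[float], model_name: str) -> dict[str, Any]:
    """Fit the model registered under ``model_name`` to ``data``."""
    try:
        fitter = _FITTERS[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name!r}") from None
    return fitter(data)


fitted_model = fit_model(data=[1.0, 2.0, 3.0], model_name="ols")
```

With this shape, a name in `MODELS` that has no registered fitter fails early with a clear error instead of deep inside a task.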
7397

7498
## Executing a subset
7599

76-
## Grouping and aggregating
100+
What if you want to execute a subset of tasks, for example, all tasks related to a model
101+
or a dataset?
102+
103+
When you are using the `.name` attributes of the dimensions and multi-dimensional
104+
objects like in the example above, you ensure that the names of dimensions are included
105+
in the ids of all downstream tasks.
106+
107+
Thus, you can simply call pytask with the following expression to execute all tasks
108+
related to the logit model.
109+
110+
```console
111+
pytask -k logit
112+
```
113+
114+
```{seealso}
115+
Expressions and markers for selecting tasks are explained in
116+
{doc}`../tutorials/selecting_tasks`.
117+
```
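
To see why this selection works, the sketch below builds ids in the same `model-dataset` pattern as `Experiment.name` and filters them by substring. This is a simplified illustration and not pytask's implementation: the `-k` option accepts richer expressions, and the variables `ids` and `selected` exist only in this example.

```python
# Ids in the "<model>-<dataset>" pattern used by Experiment.name in config.py.
ids = [
    f"{model}-{dataset}"
    for dataset in ("a", "b", "c")
    for model in ("ols", "logit", "linear_prob")
]

# A plain substring check stands in for the expression "logit".
selected = [task_id for task_id in ids if "logit" in task_id]
print(selected)
```

Because every id embeds the model name, one keyword is enough to pick all tasks of that model across the datasets.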
77118

78119
## Extending repetitions
79120

80-
Some parametrized tasks are costly to run - costly in terms of computing power, memory,
81-
or time. Users often extend repetitions triggering all repetitions to be rerun. Thus,
82-
use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which is explained
83-
in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
121+
Some repeated tasks are costly to run - costly in terms of computing power, memory, or
122+
runtime. If you change a task module, you might accidentally trigger all other tasks in
123+
the module to be rerun. Use the {func}`@pytask.mark.persist <pytask.mark.persist>`
124+
decorator, which is explained in more detail in this
125+
{doc}`tutorial <../tutorials/making_tasks_persist>`.
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,41 @@
1+
from pathlib import Path
2+
from typing import NamedTuple
3+
4+
from pytask import DataCatalog
5+
6+
SRC = Path(__file__).parent
7+
BLD = SRC / "bld"
8+
9+
data_catalog = DataCatalog()
10+
11+
12+
class Dataset(NamedTuple):
13+
name: str
14+
15+
@property
16+
def path(self) -> Path:
17+
return SRC / f"{self.name}.pkl"
18+
19+
20+
class Model(NamedTuple):
21+
name: str
22+
23+
24+
DATASETS = [Dataset("a"), Dataset("b"), Dataset("c")]
25+
MODELS = [Model("ols"), Model("logit"), Model("linear_prob")]
26+
27+
28+
class Experiment(NamedTuple):
29+
dataset: Dataset
30+
model: Model
31+
32+
@property
33+
def name(self) -> str:
34+
return f"{self.model.name}-{self.dataset.name}"
35+
36+
@property
37+
def fitted_model_name(self) -> str:
38+
return f"{self.name}-fitted-model"
39+
40+
41+
EXPERIMENTS = [Experiment(dataset, model) for dataset in DATASETS for model in MODELS]
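
To make the naming scheme concrete, the following sketch repeats the dimensions from the file above without the pytask-specific parts (no `DataCatalog`, no paths) and collects the ids that the list comprehension produces. It is a reduced illustration. The variable `ids` is introduced here only for demonstration.

```python
from typing import NamedTuple


class Dataset(NamedTuple):
    name: str


class Model(NamedTuple):
    name: str


class Experiment(NamedTuple):
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        # Combine the names of both dimensions into one id.
        return f"{self.model.name}-{self.dataset.name}"


DATASETS = [Dataset("a"), Dataset("b"), Dataset("c")]
MODELS = [Model("ols"), Model("logit"), Model("linear_prob")]

# One flat list replaces the nested loop over datasets and models.
EXPERIMENTS = [Experiment(dataset, model) for dataset in DATASETS for model in MODELS]

ids = [experiment.name for experiment in EXPERIMENTS]
print(ids)
```

Three datasets and three models give nine experiments, and the ids stay unique as long as the names within each dimension are unique.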
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,13 @@
1-
from pathlib import Path
21
from typing import Annotated
2+
from typing import Any
33

44
from myproject.config import EXPERIMENTS
5-
from pytask import Product
5+
from myproject.config import data_catalog
66
from pytask import task
77

88
for experiment in EXPERIMENTS:
99

1010
@task(id=experiment.name)
1111
def task_fit_model(
1212
path_to_data: experiment.dataset.path,
13-
path_to_model: Annotated[Path, Product] = experiment.path,
14-
) -> None: ...
13+
) -> Annotated[Any, data_catalog[experiment.fitted_model_name]]: ...

pyproject.toml

+2-1
Original file line numberDiff line numberDiff line change
@@ -72,7 +72,7 @@ test = [
7272
"aiohttp", # For HTTPPath tests.
7373
"coiled",
7474
]
75-
typing = ["mypy>=1.9.0", "nbqa[mypy]>=1.8.5"]
75+
typing = ["mypy>=1.9.0,<1.11", "nbqa[mypy]>=1.8.5"]
7676

7777
[project.urls]
7878
Changelog = "https://pytask-dev.readthedocs.io/en/stable/changes.html"
@@ -186,6 +186,7 @@ disallow_untyped_defs = true
186186
no_implicit_optional = true
187187
warn_redundant_casts = true
188188
warn_unused_ignores = true
189+
disable_error_code = ["import-untyped"]
189190

190191
[[tool.mypy.overrides]]
191192
module = "tests.*"
