Commit 982d2c7

Redesign the scaling tasks guide. (#616)
1 parent 7f9a787 commit 982d2c7

13 files changed, +158 -217 lines changed

docs/source/changes.md

+1
@@ -7,6 +7,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and

 ## 0.5.1 - 2024-xx-xx

+- {pull}`616` redesigns the guide on "Scaling Tasks".
 - {pull}`617` fixes an interaction with provisional nodes and `@mark.persist`.
 - {pull}`618` ensures that `root_dir` of `DirectoryNode` is created before the task is
 executed.
@@ -0,0 +1,83 @@
# Complex task repetitions

{doc}`Task repetitions <../tutorials/repeating_tasks_with_different_inputs>` let you
execute many tasks without repeating yourself in code.

In any bigger project, however, repetitions can become hard to maintain because there
are multiple layers or dimensions of repetition.

Here you find some tips on how to set up your project so that adding dimensions and
extending existing ones becomes much easier.

## Example

You can write multiple loops around a task function where each loop stands for a
different dimension. A dimension might represent different datasets or the model
specifications used to analyze them, as in the following example. The task arguments
are derived from the dimensions.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example.py
---
caption: task_example.py
---
```

There is nothing wrong with using nested loops for simpler projects. But projects often
grow over time, and you run into these problems.

- When you add a new task, you need to duplicate the nested loops in another module.
- When you add a dimension, you need to touch multiple files in your project and add
  another loop and level of indentation.

## Solution

The main idea behind the solution is simple. First, we formalize dimensions into
objects. Second, we combine them in one object so that we only have to iterate over
instances of this object in a single loop.

We start by defining the dimensions using {class}`~typing.NamedTuple` or
{func}`~dataclasses.dataclass`.

Then, we define the object that holds both pieces of information together. For lack of
a better name, we call it an experiment.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/experiment.py
---
caption: config.py
---
```
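
The included file uses `NamedTuple`. As a rough sketch of the `dataclass` alternative
mentioned above, the same structure could look like the following. The frozen
dataclasses, the use of `itertools.product`, and the relative `SRC` directory are
assumptions made for illustration and are not part of the guide's source files.

```python
from dataclasses import dataclass
from itertools import product
from pathlib import Path

# Assumption: a relative placeholder directory instead of Path(__file__).parent.
SRC = Path("src")
BLD = SRC / "bld"


@dataclass(frozen=True)
class Dataset:
    name: str

    @property
    def path(self) -> Path:
        return SRC / f"{self.name}.pkl"


@dataclass(frozen=True)
class Model:
    name: str


@dataclass(frozen=True)
class Experiment:
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        # Combine the dimension names into the id of the experiment.
        return f"{self.model.name}-{self.dataset.name}"

    @property
    def path(self) -> Path:
        return BLD / f"{self.name}.pkl"


DATASETS = [Dataset(name) for name in ("a", "b", "c")]
MODELS = [Model(name) for name in ("ols", "logit", "linear_prob")]

# product() yields every (dataset, model) pair, so one flat list replaces the
# nested loops.
EXPERIMENTS = [Experiment(d, m) for d, m in product(DATASETS, MODELS)]
```

With `frozen=True`, instances are immutable and hashable, which is close to how a
`NamedTuple` behaves. Either choice should work for the single loop over experiments.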

Two remarks on this setup:

- The names within each dimension need to be unique so that combining them into the
  name of the experiment yields a unique and descriptive id.
- Dimensions might need more attributes than just a name, like paths or other arguments
  for the task. Add them.
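
The uniqueness requirement from the first point can be checked at import time. The
following is a minimal sketch, assuming the same `NamedTuple` layout as in `config.py`.
The helper `find_duplicate_names` is hypothetical and not part of pytask.

```python
from collections import Counter
from typing import NamedTuple


class Dataset(NamedTuple):
    name: str


class Model(NamedTuple):
    name: str


class Experiment(NamedTuple):
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        return f"{self.model.name}-{self.dataset.name}"


def find_duplicate_names(experiments: list[Experiment]) -> list[str]:
    """Return experiment names that occur more than once (hypothetical helper)."""
    counts = Counter(experiment.name for experiment in experiments)
    return sorted(name for name, count in counts.items() if count > 1)


# Unique names on each dimension lead to unique experiment names.
good = [
    Experiment(Dataset(d), Model(m)) for d in ("a", "b") for m in ("ols", "logit")
]
assert find_duplicate_names(good) == []

# Names that contain the separator can collide even if they differ per dimension.
bad = [
    Experiment(Dataset("c"), Model("a-b")),
    Experiment(Dataset("b-c"), Model("a")),
]
duplicates = find_duplicate_names(bad)
```

In the second case, both experiments are named `a-b-c`, so it may be worth avoiding the
separator character inside dimension names.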

Next, we use these newly defined data structures and see how our tasks change.

```{literalinclude} ../../../docs_src/how_to_guides/bp_complex_task_repetitions/example_improved.py
---
caption: task_example.py
---
```

As you see, we replaced the nested loops with a single loop over the experiments.

## Using the `DataCatalog`

## Adding another dimension

## Adding another level

## Executing a subset

## Grouping and aggregating

## Extending repetitions

Some parametrized tasks are costly to run in terms of computing power, memory, or time.
Users often extend repetitions, which triggers all repetitions to be rerun. To avoid
this, use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which is
explained in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.

docs/source/how_to_guides/bp_scaling_tasks.md

-101
This file was deleted.

docs/source/how_to_guides/bp_structure_of_task_files.md

+1-1
@@ -14,7 +14,7 @@ are looking for orientation or inspiration, here are some tips.
 module is for.

 ```{seealso}
-The only exception might be for {doc}`repetitions <bp_scaling_tasks>`.
+The only exception might be for {doc}`repetitions <bp_complex_task_repetitions>`.
 ```

 - The purpose of the task function is to handle IO operations like loading and saving

docs/source/how_to_guides/index.md

+1-1
@@ -42,5 +42,5 @@ maxdepth: 1
 bp_structure_of_a_research_project
 bp_structure_of_task_files
 bp_templates_and_projects
-bp_scaling_tasks
+bp_complex_task_repetitions
 ```

docs/source/tutorials/repeating_tasks_with_different_inputs.md

+2-1
@@ -291,7 +291,8 @@ for id_, kwargs in ID_TO_KWARGS.items():
 def task_create_random_data(i, produces): ...
 ```

-The {doc}`best-practices guide on parametrizations <../how_to_guides/bp_scaling_tasks>`
+The
+{doc}`best-practices guide on parametrizations <../how_to_guides/bp_complex_task_repetitions>`
 goes into even more detail on how to scale parametrizations.

 ## A warning on globals
@@ -0,0 +1,19 @@
from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task

SRC = Path(__file__).parent
BLD = SRC / "bld"


for data_name in ("a", "b", "c"):
    for model_name in ("ols", "logit", "linear_prob"):

        @task(id=f"{model_name}-{data_name}")
        def task_fit_model(
            path_to_data: Path = SRC / f"{data_name}.pkl",
            path_to_model: Annotated[Path, Product] = BLD
            / f"{data_name}-{model_name}.pkl",
        ) -> None: ...
@@ -0,0 +1,14 @@
from pathlib import Path
from typing import Annotated

from myproject.config import EXPERIMENTS
from pytask import Product
from pytask import task

for experiment in EXPERIMENTS:

    @task(id=experiment.name)
    def task_fit_model(
        path_to_data: Path = experiment.dataset.path,
        path_to_model: Annotated[Path, Product] = experiment.path,
    ) -> None: ...
@@ -0,0 +1,37 @@
from pathlib import Path
from typing import NamedTuple

SRC = Path(__file__).parent
BLD = SRC / "bld"


class Dataset(NamedTuple):
    name: str

    @property
    def path(self) -> Path:
        return SRC / f"{self.name}.pkl"


class Model(NamedTuple):
    name: str


DATASETS = [Dataset("a"), Dataset("b"), Dataset("c")]
MODELS = [Model("ols"), Model("logit"), Model("linear_prob")]


class Experiment(NamedTuple):
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        return f"{self.model.name}-{self.dataset.name}"

    @property
    def path(self) -> Path:
        return BLD / f"{self.name}.pkl"


EXPERIMENTS = [Experiment(dataset, model) for dataset in DATASETS for model in MODELS]

docs_src/how_to_guides/bp_scaling_tasks_1.py

-20
This file was deleted.

docs_src/how_to_guides/bp_scaling_tasks_2.py

-39
This file was deleted.

docs_src/how_to_guides/bp_scaling_tasks_3.py

-18
This file was deleted.
