Commit b191559

Update more parts of the documentation. (#441)
1 parent fbda956 commit b191559

18 files changed: +178 −169 lines changed

docs/source/changes.md

+2 −1

@@ -5,7 +5,7 @@ chronological order. Releases follow [semantic versioning](https://semver.org/)
 releases are available on [PyPI](https://pypi.org/project/pytask) and
 [Anaconda.org](https://anaconda.org/conda-forge/pytask).
 
-## 0.4.0 - 2023-xx-xx
+## 0.4.0 - 2023-10-07
 
 - {pull}`323` remove Python 3.7 support and use a new Github action to provide mamba.
 - {pull}`384` allows to parse dependencies from every function argument if `depends_on`
@@ -56,6 +56,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
   {func}`pytask.is_task_function`.
 - {pull}`438` clarifies some types.
 - {pull}`440` refines more types.
+- {pull}`441` updates more parts of the documentation.
 - {pull}`442` allows users to import `from pytask import mark` and use `@mark.skip`.
 
 ## 0.3.2 - 2023-06-07

docs/source/how_to_guides/bp_scalable_repetitions_of_tasks.md

−112
This file was deleted.

docs/source/how_to_guides/bp_scaling_tasks.md

+103

@@ -0,0 +1,103 @@
+# Scaling tasks
+
+In any bigger project you quickly come to the point where you stack multiple repetitions
+of tasks on top of each other.
+
+For example, you have one dataset, four different ways to prepare it, and three
+statistical models to analyze the data. The Cartesian product of all steps combined
+comprises twelve differently fitted models.
+
+Here you find some tips on how to set up your tasks such that you can easily modify the
+Cartesian product of steps.
+
+## Scalability
+
+Let us dive right into the aforementioned example. We start with one dataset `data.csv`.
+Then, we will create four different specifications of the data and, finally, fit three
+different models to each specification.
+
+This is the structure of the project.
+
+```
+my_project
+├───pyproject.toml
+│
+├───src
+│   └───my_project
+│       ├────config.py
+│       │
+│       ├───data
+│       │   └────data.csv
+│       │
+│       ├───data_preparation
+│       │   ├────__init__.py
+│       │   ├────config.py
+│       │   └────task_prepare_data.py
+│       │
+│       └───estimation
+│           ├────__init__.py
+│           ├────config.py
+│           └────task_estimate_models.py
+│
+│
+├───setup.py
+│
+├───.pytask.sqlite3
+│
+└───bld
+```
+
+The folder structure, the main `config.py` which holds `SRC` and `BLD`, and the tasks
+follow the same structure advocated throughout the tutorials.
+
+New are the local configuration files in each subfolder of `my_project`, which contain
+objects shared across tasks. For example, `config.py` holds the paths to the processed
+data and the names of the data sets.
+
+```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_1.py
+```
+
+The task file `task_prepare_data.py` uses these objects to build the repetitions.
+
+```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_2.py
+```
+
+All arguments for the loop and the {func}`@task <pytask.task>` decorator are built
+within a function to keep the logic in one place and the module's namespace clean.
+
+Ids are used to make the task {ref}`ids <ids>` more descriptive and to simplify their
+selection with {ref}`expressions <expressions>`. Here is an example of the task ids with
+an explicit id.
+
+```
+# With id
+.../my_project/data_preparation/task_prepare_data.py::task_prepare_data[data_0]
+```
+
+Next, we move to the estimation to see how we can build another repetition on top.
+
+```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_3.py
+```
+
+In the local configuration, we define `ESTIMATIONS`, which combines the information on
+data and model. The dictionary's key can be used as a task id whenever the estimation is
+involved. It allows triggering all tasks related to one estimation - estimation,
+figures, tables - with one command.
+
+```console
+pytask -k linear_probability_data_0
+```
+
+And here is the task file.
+
+```{literalinclude} ../../../docs_src/how_to_guides/bp_scaling_tasks_4.py
+```
+
+Replicating this pattern across a project provides a clean way to define repetitions.
+
+## Extending repetitions
+
+Some parametrized tasks are costly to run - costly in terms of computing power, memory,
+or time. Users often extend repetitions, triggering all repetitions to be rerun. Thus,
+use the {func}`@pytask.mark.persist <pytask.mark.persist>` decorator, which is explained
+in more detail in this {doc}`tutorial <../tutorials/making_tasks_persist>`.
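
The estimation config and task file (`bp_scaling_tasks_3.py`, `bp_scaling_tasks_4.py`)
are pulled in via `literalinclude` and are not part of this diff. As a rough sketch of
the pattern the guide describes, with `ESTIMATIONS` combining data and model and keys
such as `linear_probability_data_0` doubling as task ids, the estimation `config.py`
might look as follows. `MODELS` and `path_to_estimation_result` are assumed names, not
taken from the diff.

```python
from pathlib import Path

from my_project.config import BLD
from my_project.data_preparation.config import DATA

# Hypothetical list of models; "linear_probability" is the only name implied
# by the guide's example command.
MODELS = ["linear_probability", "logistic_model", "decision_tree"]

# Keys such as "linear_probability_data_0" double as task ids, so
# `pytask -k linear_probability_data_0` selects every task of one estimation.
ESTIMATIONS = {
    f"{model}_{data_name}": {"model": model, "data": data_name}
    for model in MODELS
    for data_name in DATA
}


def path_to_estimation_result(name: str) -> Path:
    return BLD / "estimation" / f"estimation_{name}.pkl"
```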

docs/source/how_to_guides/bp_structure_of_task_files.md

+22 −11
@@ -1,16 +1,20 @@
 # Structure of task files
 
-This section provides advice on how to structure task files.
+This guide presents some best-practices for structuring your task files. You do not have
+to follow them to use pytask or to create a reproducible research project. But, if you
+are looking for orientation or inspiration, here are some tips.
 
 ## TL;DR
 
-- There might be multiple task functions in a task module, but only if the code is still
-  readable and not too complex and if runtime for all tasks is low.
+- Use task modules to separate task functions from one another. Separating tasks by the
+  stages in a research project, like data management, analysis, and plotting, is a good
+  start. Separate further when task modules become crowded.
 
-- A task function should be the first function in a task module.
+- Task functions should be at the top of a task module to easily identify what the
+  module is for.
 
   :::{seealso}
-  The only exception might be for {doc}`repetitions <bp_scalable_repetitions_of_tasks>`.
+  The only exception might be for {doc}`repetitions <bp_scaling_tasks>`.
   :::
 
 - The purpose of the task function is to handle IO operations like loading and saving
@@ -20,25 +24,32 @@ This section provides advice on how to structure task files.
 - Non-task functions in the task module are {term}`private functions <private function>`
   and only used within this task module. The functions should not have side-effects.
 
-- Functions used to accomplish tasks in multiple task modules should have their own
-  module.
+- It should never be necessary to import from task modules. So if you need a function in
+  multiple task modules, put it in a separate module (which does not start with
+  `task_`).
 
 ## Best Practices
 
 ### Number of tasks in a module
 
 There are two reasons to split tasks across several modules.
 
-The first reason concerns readability and complexity. Multiple tasks deal with
-(slightly) different concepts and, thus, should be split content-wise. Even if tasks
-deal with the same concept, they might be very complex on its own and separate modules
-help the reader (most likely you or your colleagues) to focus on one thing.
+The first reason concerns readability and complexity. Tasks deal with different concepts
+and, thus, should be split. Even if tasks deal with the same concept, they might become
+very complex and separate modules help the reader (most likely you or your colleagues)
+to focus on one thing.
 
 The second reason is about runtime. If a task module is changed, all tasks within the
 module are re-run. If the runtime of all tasks in the module is high, you wait longer
 for your tasks to finish or until an error occurs which prolongs your feedback loops and
 hurts your productivity.
 
+:::{seealso}
+Use {func}`@pytask.mark.persist <pytask.mark.persist>` if you want to avoid accidentally
+triggering an expensive task. It is also explained in [this
+tutorial](../tutorials/making_tasks_persist).
+:::
+
 ### Structure of the module
 
 For the following example, let us assume that the task module contains one task.
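
The guide's example continues beyond this hunk. To illustrate the layout it advocates, a
task function at the top that handles IO and a private, side-effect-free helper below
it, here is a minimal sketch; the module, function names, and paths are hypothetical.

```python
from pathlib import Path

import pandas as pd
from pytask import Product
from typing_extensions import Annotated

from my_project.config import BLD
from my_project.config import SRC


def task_compute_summary(
    path_to_data: Path = SRC / "data" / "data.csv",
    path_to_summary: Annotated[Path, Product] = BLD / "summary.csv",
) -> None:
    """Load the data, delegate the computation, and store the result."""
    df = pd.read_csv(path_to_data)
    summary = _compute_summary(df)
    summary.to_csv(path_to_summary)


def _compute_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Compute summary statistics without any IO or other side-effects."""
    return df.describe()
```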

docs/source/how_to_guides/index.md

+1 −1
@@ -38,5 +38,5 @@ maxdepth: 1
 bp_structure_of_a_research_project
 bp_structure_of_task_files
 bp_templates_and_projects
-bp_scalable_repetitions_of_tasks
+bp_scaling_tasks
 ```

docs/source/how_to_guides/writing_custom_nodes.md

+3 −3
@@ -10,8 +10,8 @@ your own to improve your workflows.
 ## Use-case
 
 A typical task operation is to load data like a {class}`pandas.DataFrame` from a pickle
-file, transform it, and store it on disk. The usual way would be to use paths to point to
-inputs and outputs and call {func}`pandas.read_pickle` and
+file, transform it, and store it on disk. The usual way would be to use paths to point
+to inputs and outputs and call {func}`pandas.read_pickle` and
 {meth}`pandas.DataFrame.to_pickle`.
 
 ```{literalinclude} ../../../docs_src/how_to_guides/writing_custom_nodes_example_1.py
@@ -54,7 +54,7 @@ A custom node needs to follow an interface so that pytask can perform several ac
 - Load and save values when tasks are executed.
 
 This interface is defined by protocols [^structural-subtyping]. A custom node must
-follow at least the protocol {class}`pytask.Node` or, even better,
+follow at least the protocol {class}`pytask.PNode` or, even better,
 {class}`pytask.PPathNode` if it is based on a path. The common node for paths,
 {class}`pytask.PathNode`, follows the protocol {class}`pytask.PPathNode`.
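
For orientation, here is a rough sketch of such a custom node. It assumes the protocol
requires roughly a `name` attribute plus `state`, `load`, and `save` methods; consult
{class}`pytask.PNode` for the authoritative definition. `SimplePickleNode` is
illustrative and not pytask's own pickle node.

```python
from __future__ import annotations

import hashlib
import pickle
from pathlib import Path
from typing import Any


class SimplePickleNode:
    """A path-based node that stores and retrieves values as pickles."""

    def __init__(self, name: str, path: Path) -> None:
        self.name = name  # A unique name identifying the node.
        self.path = path

    def state(self) -> str | None:
        # pytask compares states across sessions to detect changes; None
        # signals that the node does not exist yet.
        if not self.path.exists():
            return None
        return hashlib.sha256(self.path.read_bytes()).hexdigest()

    def load(self) -> Any:
        # Called to inject the stored value into the task function.
        return pickle.loads(self.path.read_bytes())

    def save(self, value: Any) -> None:
        # Called with the value assigned to the node when it is a product.
        self.path.write_bytes(pickle.dumps(value))
```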

docs/source/tutorials/repeating_tasks_with_different_inputs.md

+3 −3
@@ -8,8 +8,8 @@ We reuse the task from the previous {doc}`tutorial <write_a_task>`, which genera
 random data and repeat the same operation over several seeds to receive multiple,
 reproducible samples.
 
-Apply the {func}`@task <pytask.task>` decorator, loop over the function
-and supply different seeds and output paths as default arguments of the function.
+Apply the {func}`@task <pytask.task>` decorator, loop over the function and supply
+different seeds and output paths as default arguments of the function.
 
 ::::{tab-set}
 
@@ -355,7 +355,7 @@ for id_, kwargs in ID_TO_KWARGS.items():
 ```
 
 The
-{doc}`best-practices guide on parametrizations <../how_to_guides/bp_scalable_repetitions_of_tasks>`
+{doc}`best-practices guide on parametrizations <../how_to_guides/bp_scaling_tasks>`
 goes into even more detail on how to scale parametrizations.
 
 ## A warning on globals
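
The tutorial's tabbed code examples are not reproduced in this diff. A condensed sketch
of the pattern it describes, looping over the task function and passing the seed and the
output path as default arguments, could look like this; the `BLD` path and the task body
are assumptions.

```python
import pickle
from pathlib import Path

import numpy as np
from pytask import Product
from pytask import task
from typing_extensions import Annotated

BLD = Path(__file__).parent / "bld"  # Assumed build directory.

for seed in range(3):

    @task(id=str(seed))
    def task_create_random_data(
        seed: int = seed,
        path_to_data: Annotated[Path, Product] = BLD / f"data_{seed}.pkl",
    ) -> None:
        # Each repetition draws a reproducible sample from its own seed.
        rng = np.random.default_rng(seed)
        sample = rng.normal(size=100)
        path_to_data.write_bytes(pickle.dumps(sample))
```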

docs_src/how_to_guides/bp_scalable_repetitions_of_tasks_1.py → docs_src/how_to_guides/bp_scaling_tasks_1.py

+7 −2
@@ -5,11 +5,16 @@
 from my_project.config import SRC
 
 
-DATA = ["data_0", "data_1", "data_2", "data_3"]
+DATA = {
+    "data_0": {"subset": "subset_1"},
+    "data_1": {"subset": "subset_2"},
+    "data_2": {"subset": "subset_3"},
+    "data_3": {"subset": "subset_4"},
+}
 
 
 def path_to_input_data(name: str) -> Path:
-    return SRC / "data" / f"{name}.csv"
+    return SRC / "data" / "data.csv"
 
 
 def path_to_processed_data(name: str) -> Path:
docs_src/how_to_guides/bp_scalable_repetitions_of_tasks_2.py → docs_src/how_to_guides/bp_scaling_tasks_2.py

+9 −2
@@ -4,17 +4,19 @@
 from my_project.data_preparation.config import DATA
 from my_project.data_preparation.config import path_to_input_data
 from my_project.data_preparation.config import path_to_processed_data
+import pandas as pd
 from pytask import Product
 from pytask import task
 from typing_extensions import Annotated
 
 
 def _create_parametrization(data: list[str]) -> dict[str, Path]:
     id_to_kwargs = {}
-    for data_name in data:
+    for data_name, kwargs in data.items():
         id_to_kwargs[data_name] = {
             "path_to_input_data": path_to_input_data(data_name),
             "path_to_processed_data": path_to_processed_data(data_name),
+            **kwargs,
         }
 
     return id_to_kwargs
@@ -27,6 +29,11 @@ def _create_parametrization(data: list[str]) -> dict[str, Path]:
 
     @task(id=id_, kwargs=kwargs)
     def task_prepare_data(
-        path_to_input_data: Path, path_to_processed_data: Annotated[Path, Product]
+        path_to_input_data: Path,
+        subset: str,
+        path_to_processed_data: Annotated[Path, Product],
     ) -> None:
+        df = pd.read_csv(path_to_input_data)
         ...
+        subset = df.loc[df["subset"].eq(subset)]
+        subset.to_pickle(path_to_processed_data)
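
The loop that applies the decorator sits between the two hunks and is not shown. Based
on the hunk context and the tutorial quoted above, it presumably looks like the sketch
below, which reuses the module's imports; the filtered frame is bound to `data` here so
it does not shadow the `subset` parameter.

```python
ID_TO_KWARGS = _create_parametrization(DATA)

for id_, kwargs in ID_TO_KWARGS.items():

    @task(id=id_, kwargs=kwargs)
    def task_prepare_data(
        path_to_input_data: Path,
        subset: str,
        path_to_processed_data: Annotated[Path, Product],
    ) -> None:
        df = pd.read_csv(path_to_input_data)
        # Keep only the rows that belong to this repetition's subset.
        data = df.loc[df["subset"].eq(subset)]
        data.to_pickle(path_to_processed_data)
```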
