
Commit 8038fcf

Improve documentation for data catalogs. (#606)
1 parent 8067e43 commit 8038fcf

13 files changed

+175
-73
lines changed

.gitignore

+1
Original file line number | Diff line number | Diff line change
@@ -26,3 +26,4 @@ tests/test_jupyter/*.txt
2626
.pytest_cache
2727
.ruff_cache
2828
.venv
29+
docs/jupyter_execute

docs/source/changes.md

+1
@@ -47,6 +47,7 @@ releases are available on [PyPI](https://pypi.org/project/pytask) and
4747
- {pull}`603` fixes an example in the documentation about capturing warnings.
4848
- {pull}`604` fixes some examples with `PythonNode`s in the documentation.
4949
- {pull}`605` improves checks and CI.
50+
- {pull}`606` improves the documentation for data catalogs.
5051
- {pull}`609` allows a pending status for tasks. Useful for async backends implemented
5152
in pytask-parallel.
5253
- {pull}`611` removes the initial task execution status from

docs/source/conf.py

+1-2
@@ -51,8 +51,7 @@
5151
"sphinx_copybutton",
5252
"sphinx_click",
5353
"sphinx_toolbox.more_autodoc.autoprotocol",
54-
"nbsphinx",
55-
"myst_parser",
54+
"myst_nb",
5655
"sphinx_design",
5756
]
5857

docs/source/how_to_guides/bp_scaling_tasks.md

-3
@@ -39,9 +39,6 @@ my_project
3939
│ ├────config.py
4040
│ └────task_estimate_models.py
4141
42-
43-
├───setup.py
44-
4542
├───.pytask
4643
│ └────...
4744

docs/source/how_to_guides/the_data_catalog.md

+71-12
@@ -1,9 +1,8 @@
11
# The `DataCatalog` - Revisited
22

3-
An introduction to the data catalog can be found in the
4-
[tutorial](../tutorials/using_a_data_catalog.md).
5-
6-
This guide explains some details that were left out of the tutorial.
3+
This guide explains more details about the {class}`~pytask.DataCatalog` that were left
4+
out of the [tutorial](../tutorials/using_a_data_catalog.md). Please read the tutorial
5+
first for a basic understanding.
76

87
## Changing the default node
98

@@ -15,54 +14,64 @@ For example, use the {class}`~pytask.PythonNode` as the default.
1514

1615
```python
1716
from pytask import PythonNode
17+
from pytask import DataCatalog
1818

1919

2020
data_catalog = DataCatalog(default_node=PythonNode)
2121
```
2222

23-
Or, learn to write your own node by reading {doc}`writing_custom_nodes`.
23+
Or, learn to write your node by reading {doc}`writing_custom_nodes`.
2424

25-
Here, is an example for a `PickleNode` that uses cloudpickle instead of the normal
26-
`pickle` module.
25+
Here is an example of a {class}`~pytask.PickleNode` that uses cloudpickle instead of
26+
the normal {mod}`pickle` module.
2727

2828
```{literalinclude} ../../../docs_src/how_to_guides/the_data_catalog.py
2929
```
3030

3131
## Changing the name and the default path
3232

33-
By default, the data catalogs store their data in a directory `.pytask/data_catalogs`.
34-
If you use a `pyproject.toml` with a `[tool.pytask.ini_options]` section, then the
33+
By default, data catalogs store their data in a directory `.pytask/data_catalogs`. If
34+
you use a `pyproject.toml` with a `[tool.pytask.ini_options]` section, then the
3535
`.pytask` folder is in the same folder as the configuration file.
3636

3737
The default name for a catalog is `"default"` and so you will find its data in
3838
`.pytask/data_catalogs/default`. If you assign a different name like
3939
`"data_management"`, you will find the data in `.pytask/data_catalogs/data_management`.
4040

4141
```python
42+
from pytask import DataCatalog
43+
44+
4245
data_catalog = DataCatalog(name="data_management")
4346
```
4447

48+
```{note}
49+
The name of a data catalog is restricted to letters, numbers, hyphens and underscores.
50+
```
51+
4552
You can also change the path where the data catalogs will be stored by changing the
4653
`path` attribute. Here, we store the data catalog's data next to the module where the
4754
data catalog is defined in `.data`.
4855

4956
```python
5057
from pathlib import Path
58+
from pytask import DataCatalog
5159

5260

5361
data_catalog = DataCatalog(path=Path(__file__).parent / ".data")
5462
```
5563

5664
## Multiple data catalogs
5765

58-
You can use multiple data catalogs when you want to separate your datasets across
59-
multiple catalogs or when you want to use the same names multiple times (although it is
60-
not recommended!).
66+
You can use multiple data catalogs when you want to separate your datasets or to avoid
67+
name collisions of data catalog entries.
6168

6269
Make sure you assign different names to the data catalogs so that their data is stored
6370
in different directories.
6471

6572
```python
73+
from pytask import DataCatalog
74+
6675
# Stored in .pytask/data_catalogs/a
6776
data_catalog_a = DataCatalog(name="a")
6877

@@ -71,3 +80,53 @@ data_catalog_b = DataCatalog(name="b")
7180
```
7281

7382
Or, use different paths as explained above.
83+
84+
## Nested data catalogs
85+
86+
Name collisions can also occur when you are using multiple levels of repetition, for
87+
example, when you are fitting multiple models to multiple data sets.
88+
89+
You can structure your data catalogs like this.
90+
91+
```python
92+
from pytask import DataCatalog
93+
94+
95+
MODEL_NAMES = ("ols", "logistic_regression")
96+
DATA_NAMES = ("data_1", "data_2")
97+
98+
99+
nested_data_catalogs = {
100+
    model_name: {
101+
        data_name: DataCatalog(name=f"{model_name}-{data_name}")
102+
        for data_name in DATA_NAMES
103+
    }
104+
    for model_name in MODEL_NAMES
105+
}
106+
```
107+
108+
The task could look like this.
109+
110+
```python
111+
from pathlib import Path
112+
from pytask import task
113+
from typing_extensions import Annotated
114+
115+
from my_project.config import DATA_NAMES
116+
from my_project.config import MODEL_NAMES
117+
from my_project.config import nested_data_catalogs
118+
119+
120+
for model_name in MODEL_NAMES:
121+
    for data_name in DATA_NAMES:
122+
123+
        @task
124+
        def fit_model(
125+
            path: Path = Path("...", data_name)
126+
        ) -> Annotated[
127+
            Any, nested_data_catalogs[model_name][data_name]["fitted_model"]
128+
        ]:
129+
            data = ...
130+
            fitted_model = ...
131+
            return fitted_model
132+
```
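
The naming rule from the `{note}` added in this file (letters, numbers, hyphens and underscores) can be checked before a catalog is constructed. The following is a minimal sketch that uses only the standard library. `is_valid_catalog_name` is a hypothetical helper and not part of pytask; it assumes that "letters" and "numbers" mean ASCII characters and that an empty name is rejected. It also checks the `f"{model_name}-{data_name}"` names from the nested example against that rule.

```python
import re

# Assumption: ASCII letters, digits, hyphens and underscores; at least one character.
_CATALOG_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_catalog_name(name: str) -> bool:
    """Return True if ``name`` uses only the characters the note allows."""
    return _CATALOG_NAME_PATTERN.fullmatch(name) is not None


MODEL_NAMES = ("ols", "logistic_regression")
DATA_NAMES = ("data_1", "data_2")

# Build the same names as the nested data catalogs and check each one.
nested_names = [f"{model}-{data}" for model in MODEL_NAMES for data in DATA_NAMES]
all_valid = all(is_valid_catalog_name(name) for name in nested_names)
```

Joining the two levels with a hyphen keeps every generated name inside the allowed character set, whereas a separator such as `/` would not pass this check.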

docs/source/reference_guides/api.md

+5
@@ -228,7 +228,9 @@ Task are currently represented by the following classes:
228228

229229
```{eval-rst}
230230
.. autoclass:: pytask.Task
231+
   :members:
231232
.. autoclass:: pytask.TaskWithoutPath
233+
   :members:
232234
```
233235

234236
Currently, there are no different types of tasks since changing the `.function`
@@ -325,6 +327,9 @@ resolution and execution.
325327
326328
An indicator to mark arguments of tasks as products.
327329
330+
>>> from pathlib import Path
331+
>>> from pytask import Product
332+
>>> from typing_extensions import Annotated
328333
>>> def task_example(path: Annotated[Path, Product]) -> None:
329334
...     path.write_text("Hello, World!")
330335

docs/source/tutorials/using_a_data_catalog.md

+71-28
@@ -10,14 +10,14 @@ Two things will quickly become a nuisance in bigger projects.
1010
they are just intermediate representations.
1111

1212
As a solution, pytask offers a {class}`~pytask.DataCatalog` which is a purely optional
13-
feature. The tutorial focuses on the main features. To learn about all features, read
14-
the [how-to guide](../how_to_guides/the_data_catalog.md).
13+
feature. The tutorial focuses on the main features. To learn about all the features,
14+
read the [how-to guide](../how_to_guides/the_data_catalog.md).
1515

1616
Let us focus on the previous example and see how the {class}`~pytask.DataCatalog` helps
1717
us.
1818

19-
The project structure is the same as in the previous example with the exception of the
20-
`.pytask` folder and the missing `data.pkl` in `bld`.
19+
The project structure is the same as in the previous example except for the `.pytask`
20+
folder and the missing `data.pkl` in `bld`.
2121

2222
```text
2323
my_project
@@ -44,15 +44,51 @@ At first, we define the data catalog in `config.py`.
4444
```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_1.py
4545
```
4646

47-
## `task_data_preparation`
47+
## `task_create_random_data`
4848

49-
Next, we will use the data catalog to save the product of the task in
50-
`task_data_preparation.py`.
49+
Next, we look at the module `task_data_preparation.py` and its task
50+
`task_create_random_data`. The task creates a dataframe with simulated data that should
51+
be stored on the disk.
5152

52-
Instead of using a path, we set the location of the product in the data catalog with
53-
`data_catalog["data"]`. If the key does not exist, the data catalog will automatically
54-
create a {class}`~pytask.PickleNode` that allows you to save any Python object to a
55-
`pickle` file. The `pickle` file is stored within the `.pytask` folder.
53+
In the previous tutorial, we learned to use {class}`~pathlib.Path`s to define products
54+
of our tasks. Here is the signature of the task function again.
55+
56+
`````{tab-set}
57+
58+
````{tab-item} Python 3.10+
59+
:sync: python310plus
60+
61+
```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_products_py310.py
62+
:lines: 10-12
63+
```
64+
````
65+
66+
````{tab-item} Python 3.8+
67+
:sync: python38plus
68+
69+
```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_products_py38.py
70+
:lines: 10-12
71+
```
72+
````
73+
74+
````{tab-item} produces
75+
:sync: produces
76+
77+
```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_products_produces.py
78+
:lines: 8
79+
```
80+
````
81+
`````
82+
83+
When we want to use the data catalog, we replace `BLD / "data.pkl"` with an entry of the
84+
data catalog like `data_catalog["data"]`. If there is no entry yet with the name
85+
`"data"`, the data catalog will automatically create a {class}`~pytask.PickleNode`. The
86+
node allows you to save any Python object to a `pickle` file.
87+
88+
You probably noticed that we did not need to define a path. That is because the data
89+
catalog takes care of that and stores the `pickle` file in the `.pytask` folder.
90+
91+
Using `data_catalog["data"]` is thus equivalent to using `PickleNode(path=Path(...))`.
5692

5793
The following tabs show you how to use the data catalog given the interface you prefer.
5894

@@ -125,10 +161,6 @@ Following one of the interfaces gives you immediate access to the
125161
````{tab-item} Python 3.10+
126162
:sync: python310plus
127163
128-
Use `data_catalog["data"]` as an default argument to access the
129-
{class}`~pytask.PickleNode` within the task. When you are done transforming your
130-
{class}`~pandas.DataFrame`, save it with {meth}`~pytask.PickleNode.save`.
131-
132164
```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_3_py310.py
133165
:emphasize-lines: 12
134166
```
@@ -138,10 +170,6 @@ Use `data_catalog["data"]` as an default argument to access the
138170
````{tab-item} Python 3.8+
139171
:sync: python38plus
140172
141-
Use `data_catalog["data"]` as an default argument to access the
142-
{class}`~pytask.PickleNode` within the task. When you are done transforming your
143-
{class}`~pandas.DataFrame`, save it with {meth}`~pytask.PickleNode.save`.
144-
145173
```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_3_py38.py
146174
:emphasize-lines: 12
147175
```
@@ -160,7 +188,8 @@ In most projects, you have other data sets that you would like to access via the
160188
catalog. To add them, call the {meth}`~pytask.DataCatalog.add` method and supply a name
161189
and a path.
162190

163-
Let's add `file.csv` to the data catalog.
191+
Let's add `file.csv` with the name `"csv"` to the data catalog and use it to create
192+
`data_catalog["transformed_csv"]`.
164193

165194
```text
166195
my_project
@@ -174,8 +203,6 @@ my_project
174203
│ ├────task_data_preparation.py
175204
│ └────task_plot_data.py
176205
177-
├───setup.py
178-
179206
├───.pytask
180207
│ └────...
181208
@@ -184,13 +211,24 @@ my_project
184211
└────plot.png
185212
```
186213

187-
The path can be absolute or relative to the module of the data catalog.
214+
We can use a relative or an absolute path to define the location of the file. A relative
215+
path means the location is relative to the module of the data catalog.
188216

189217
```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_4.py
190218
```
191219

192-
You can now use the data catalog as in previous example and use the
193-
{class}`~~pathlib.Path` in the task.
220+
You can now use the data catalog as in the previous example and use the
221+
{class}`~pathlib.Path` in the task.
222+
223+
```{note}
224+
The value of `data_catalog["csv"]` inside the task becomes a
225+
{class}`~pathlib.Path`. This is because a {class}`~pathlib.Path` passed to
226+
{meth}`~pytask.DataCatalog.add` is not parsed to a {class}`~pytask.PickleNode` but to a
227+
{class}`~pytask.PathNode`.
228+
229+
Read {doc}`../how_to_guides/writing_custom_nodes` for more information about the
230+
different node types, which are not relevant for now.
231+
```
194232

195233
`````{tab-set}
196234
@@ -224,9 +262,14 @@ You can now use the data catalog as in previous example and use the
224262

225263
## Developing with the `DataCatalog`
226264

227-
You can also use the data catalog in a Jupyter notebook or in the terminal in the Python
228-
interpreter. Simply import the data catalog, select a node and call the
229-
{meth}`~pytask.PNode.load` method of a node to access its value.
265+
You can also use the data catalog in a Jupyter Notebook or in the Python interpreter in
266+
a terminal. This is especially helpful when you develop tasks interactively in a Jupyter
267+
Notebook.
268+
269+
Simply import the data catalog, select a node and call the {meth}`~pytask.PNode.load`
270+
method of a node to access its value.
271+
272+
Here is an example with a terminal.
230273

231274
```pycon
232275
>>> from my_project.config import data_catalog

docs_src/tutorials/using_a_data_catalog_4.py

-1
@@ -10,4 +10,3 @@
1010

1111
# Use either a relative or an absolute path.
1212
data_catalog.add("csv", Path("file.csv"))
13-
data_catalog.add("transformed_csv", BLD / "file.pkl")
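
The tutorial states that a relative path is interpreted relative to the module of the data catalog. That rule can be sketched as follows. `resolve_catalog_path` and the directory `/home/user/my_project/src/my_project` are hypothetical and only illustrate the rule; this is not pytask's implementation. `PurePosixPath` is used so the sketch behaves the same on every platform.

```python
from pathlib import PurePosixPath


def resolve_catalog_path(
    path: PurePosixPath, module_dir: PurePosixPath
) -> PurePosixPath:
    """Anchor a relative path at the directory of the module defining the catalog."""
    if path.is_absolute():
        # Absolute paths are used unchanged.
        return path
    return module_dir / path


# Hypothetical location of the module that defines the data catalog (config.py).
module_dir = PurePosixPath("/home/user/my_project/src/my_project")

relative = resolve_catalog_path(PurePosixPath("file.csv"), module_dir)
absolute = resolve_catalog_path(PurePosixPath("/data/file.csv"), module_dir)
```

Under these assumptions, `Path("file.csv")` in the `add` call above would refer to a file next to `config.py`, while an absolute path would be used unchanged.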

pyproject.toml

+5-1
@@ -52,7 +52,7 @@ docs = [
5252
"ipython",
5353
"matplotlib",
5454
"myst-parser",
55-
"nbsphinx",
55+
"myst-nb",
5656
"sphinx",
5757
"sphinx-click",
5858
"sphinx-copybutton",
@@ -92,6 +92,10 @@ build-backend = "hatchling.build"
9292
managed = true
9393
dev-dependencies = ["tox-uv>=1.7.0"]
9494

95+
[tool.rye.scripts]
96+
clean-docs = { cmd = "rm -rf docs/build" }
97+
build-docs = { cmd = "sphinx-build -b html docs/source docs/build" }
98+
9599
[tool.hatch.build.hooks.vcs]
96100
version-file = "src/_pytask/_version.py"
97101
