@@ -10,14 +10,14 @@ Two things will quickly become a nuisance in bigger projects.
they are just intermediate representations.

As a solution, pytask offers a {class}`~pytask.DataCatalog` which is a purely optional
- feature. The tutorial focuses on the main features. To learn about all features, read
- the [how-to guide](../how_to_guides/the_data_catalog.md).
+ feature. The tutorial focuses on the main features. To learn about all the features,
+ read the [how-to guide](../how_to_guides/the_data_catalog.md).

Let us focus on the previous example and see how the {class}`~pytask.DataCatalog` helps
us.

- The project structure is the same as in the previous example with the exception of the
- `.pytask` folder and the missing `data.pkl` in `bld`.
+ The project structure is the same as in the previous example except the `.pytask` folder
+ and the missing `data.pkl` in `bld`.

```text
my_project
@@ -44,15 +44,51 @@ At first, we define the data catalog in `config.py`.
```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_1.py
```

- ## `task_data_preparation`
+ ## `task_create_random_data`

- Next, we will use the data catalog to save the product of the task in
- `task_data_preparation.py`.
+ Next, we look at the module `task_data_preparation.py` and its task
+ `task_create_random_data`. The task creates a dataframe with simulated data that should
+ be stored on the disk.

- Instead of using a path, we set the location of the product in the data catalog with
- `data_catalog["data"]`. If the key does not exist, the data catalog will automatically
- create a {class}`~pytask.PickleNode` that allows you to save any Python object to a
- `pickle` file. The `pickle` file is stored within the `.pytask` folder.
+ In the previous tutorial, we learned to use {class}`~pathlib.Path`s to define products
+ of our tasks. Here we see again the signature of the task function.
+
+ `````{tab-set}
+
+ ````{tab-item} Python 3.10+
+ :sync: python310plus
+
+ ```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_products_py310.py
+ :lines: 10-12
+ ```
+ ````
+
+ ````{tab-item} Python 3.8+
+ :sync: python38plus
+
+ ```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_products_py38.py
+ :lines: 10-12
+ ```
+ ````
+
+ ````{tab-item} produces
+ :sync: produces
+
+ ```{literalinclude} ../../../docs_src/tutorials/defining_dependencies_products_products_produces.py
+ :lines: 8
+ ```
+ ````
+ `````
+
+ When we want to use the data catalog, we replace `BLD / "data.pkl"` with an entry of the
+ data catalog like `data_catalog["data"]`. If there is yet no entry with the name
+ `"data"`, the data catalog will automatically create a {class}`~pytask.PickleNode`. The
+ node allows you to save any Python object to a `pickle` file.
+
+ You probably noticed that we did not need to define a path. That is because the data
+ catalog takes care of that and stores the `pickle` file in the `.pytask` folder.
+
+ Using `data_catalog["data"]` is thus equivalent to using `PickleNode(path=Path(...))`.

The following tabs show you how to use the data catalog given the interface you prefer.

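The behaviour described in the added lines above (an unknown name yields a pickle-backed node, and the catalog rather than the caller picks the file location) can be illustrated with a minimal standard-library sketch. `ToyPickleNode`, `ToyCatalog`, and the `<name>.pkl` file naming are hypothetical stand-ins chosen for illustration; they are not pytask classes and do not reflect how pytask names or stores its files.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict


class ToyPickleNode:
    """Hypothetical stand-in: stores one Python object as a pickle file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def save(self, value: Any) -> None:
        # Create the storage folder on first use, then serialize the object.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("wb") as f:
            pickle.dump(value, f)

    def load(self) -> Any:
        with self.path.open("rb") as f:
            return pickle.load(f)


class ToyCatalog:
    """Hypothetical catalog: creates a pickle-backed node for unknown names."""

    def __init__(self, storage_dir: Path) -> None:
        self.storage_dir = storage_dir
        self._entries: Dict[str, ToyPickleNode] = {}

    def __getitem__(self, name: str) -> ToyPickleNode:
        if name not in self._entries:
            # Assumption of this sketch: the file is simply called <name>.pkl.
            self._entries[name] = ToyPickleNode(self.storage_dir / f"{name}.pkl")
        return self._entries[name]


def demo() -> Any:
    """Save an object through the catalog and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        catalog = ToyCatalog(Path(tmp) / ".pytask")
        catalog["data"].save({"x": [1, 2, 3]})
        return catalog["data"].load()
```

The caller only ever names the entry; the path is an internal detail of the catalog, which is the point the tutorial text makes about `data_catalog["data"]`.
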
@@ -125,10 +161,6 @@ Following one of the interfaces gives you immediate access to the
````{tab-item} Python 3.10+
:sync: python310plus

- Use `data_catalog["data"]` as an default argument to access the
- {class}`~pytask.PickleNode` within the task. When you are done transforming your
- {class}`~pandas.DataFrame`, save it with {meth}`~pytask.PickleNode.save`.
-
```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_3_py310.py
:emphasize-lines: 12
```
@@ -138,10 +170,6 @@ Use `data_catalog["data"]` as an default argument to access the
````{tab-item} Python 3.8+
:sync: python38plus

- Use `data_catalog["data"]` as an default argument to access the
- {class}`~pytask.PickleNode` within the task. When you are done transforming your
- {class}`~pandas.DataFrame`, save it with {meth}`~pytask.PickleNode.save`.
-
```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_3_py38.py
:emphasize-lines: 12
```
@@ -160,7 +188,8 @@ In most projects, you have other data sets that you would like to access via the
catalog. To add them, call the {meth}`~pytask.DataCatalog.add` method and supply a name
and a path.

- Let's add `file.csv` to the data catalog.
+ Let's add `file.csv` with the name `"csv"` to the data catalog and use it to create
+ `data["transformed_csv"]`.

```text
my_project
@@ -174,8 +203,6 @@ my_project
│ ├────task_data_preparation.py
│ └────task_plot_data.py
│
- ├───setup.py
- │
├───.pytask
│ └────...
│
@@ -184,13 +211,24 @@ my_project
└────plot.png
```

- The path can be absolute or relative to the module of the data catalog.
+ We can use a relative or an absolute path to define the location of the file. A relative
+ path means the location is relative to the module of the data catalog.

```{literalinclude} ../../../docs_src/tutorials/using_a_data_catalog_4.py
```

- You can now use the data catalog as in previous example and use the
- {class}`~~pathlib.Path` in the task.
+ You can now use the data catalog as in the previous example and use the
+ {class}`~pathlib.Path` in the task.
+
+ ```{note}
+ Note that the value of `data_catalog["csv"]` inside the task becomes a
+ {class}`~pathlib.Path`. It is because a {class}`~pathlib.Path` in
+ {meth}`~pytask.DataCatalog.add` is not parsed to a {class}`~pytask.PickleNode` but a
+ {class}`~pytask.PathNode`.
+
+ Read {doc}`../how_to_guides/writing_custom_nodes` for more information about
+ different node types which is not relevant now.
+ ```

`````{tab-set}

@@ -224,9 +262,14 @@ You can now use the data catalog as in previous example and use the

## Developing with the `DataCatalog`

- You can also use the data catalog in a Jupyter notebook or in the terminal in the Python
- interpreter. Simply import the data catalog, select a node and call the
- {meth}`~pytask.PNode.load` method of a node to access its value.
+ You can also use the data catalog in a Jupyter Notebook or the terminal in the Python
+ interpreter. This can be super helpful when you develop tasks interactively in a Jupyter
+ Notebook.
+
+ Simply import the data catalog, select a node and call the {meth}`~pytask.PNode.load`
+ method of a node to access its value.
+
+ Here is an example with a terminal.

```pycon
>>> from myproject.config import data_catalog