Conversation

@felixschmitz (Collaborator) commented Jul 31, 2025

See Issue #46


Note

Refactors the pipeline to explicit module-based clean/combine tasks, adds YAML metadata mapping, updates dataset merging to use it, and improves hwealth/pkal cleaning with config/tooling tweaks.

  • Pipeline/Tasks:
    • Introduces clean_modules/task.py and new combine_modules/* with explicit combine(...) functions; removes combine_variables/*.
    • Updates config.py (adds MODULE_STRUCTURE, BLD, renames DataCatalogs to cleaned_modules/combined_modules) and adapts convert_stata_to_pandas/task.py.
  • Metadata:
    • Adds per-module metadata creation (index/variable dtypes, survey-year availability) and writes variable_to_metadata_mapping.yaml; copies to src/create_metadata/.
  • Dataset merging:
    • helper.py now consumes the YAML mapping (map_variable_to_module), improves validation, supports PNode types; example task loads YAML; tests updated.
  • Data cleaning fixes:
    • hwealth: treat -8 as missing for vehicle/wealth fields.
    • pkal: refactors monthly employment handling, adds number_of_months_employed and employed_in_at_least_one_month.
  • Config/Tooling:
    • pyproject.toml: switch to Pixi workspace and add pytask-parallel.
    • .gitignore: ignore *.yaml; add data/V38/.gitkeep.

Written by Cursor Bugbot for commit cab979e.

@felixschmitz requested a review from hmgaudecker July 31, 2025 12:20
@felixschmitz self-assigned this Jul 31, 2025
@hmgaudecker (Collaborator) left a comment

Thanks!

I am just not sure where map_variable_to_data_file comes from. As discussed, I think we should

  1. Have a task creating a yaml in bld (see the sketch below this list).
  2. Copy it to the source folder and read from there.
  3. Change the task in 1. so that it fails if the newly-created version is different from the version in source.
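A minimal sketch of step 1, assuming pytask >= 0.4 and PyYAML; the BLD path and the placeholder mapping content are illustrative, not the project's real values:

    from pathlib import Path
    from typing import Annotated

    import yaml
    from pytask import Product

    BLD = Path("bld")  # hypothetical; the project defines BLD in config.py


    def task_write_mapping_to_bld(
        out_path: Annotated[Path, Product] = BLD / "variable_to_metadata_mapping.yaml",
    ) -> None:
        """Step 1: write the freshly created mapping to BLD only."""
        mapping = {"birth_year": {"module": "biobirth"}}  # placeholder content
        out_path.write_text(yaml.safe_dump(mapping, sort_keys=True))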

@felixschmitz (Collaborator, Author)

Thanks for the input!
I envisioned this PR as somewhat separate from PR #45, but given your comment the two are closely linked. So I suggest merging #47 into #45 after resolving your comment above.

@felixschmitz (Collaborator, Author) left a comment

Two things: one on where map_variable_to_data_file is "coming from", the other on how to make sure changes are tracked by pytask.

First, I think most of the confusion stems from the fact that the mapping is produced inside create_metadata/task.py by task_create_variable_to_metadata_name_mapping and then stored in DATA_CATALOGS["metadata"]["mapping"].
In dataset_merging/helper.py, I then read the mapping, and it is called map_variable_to_data_file, which is not completely spot on (since it references not only "data files", e.g., biobirth, but also data from combined variables, e.g., medical_variables).
Any suggestions about naming things?

Second, the code suggestions introduced in this PR allow for pytask to track changes in the data, see the comment below.
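For illustration, this is roughly the job I have in mind for the helper; group_variables_by_module and the mapping literal below are simplified for this sketch, not the PR's exact code:

    from collections import defaultdict


    def group_variables_by_module(variables: list[str], mapping: dict) -> dict:
        """Group requested variables by the module that provides them."""
        grouped = defaultdict(list)
        for variable in variables:
            grouped[mapping[variable]["module"]].append(variable)
        return dict(grouped)


    mapping = {
        "birth_year": {"module": "biobirth"},
        "number_of_doctor_visits": {"module": "medical_variables"},
    }
    group_variables_by_module(["birth_year", "number_of_doctor_visits"], mapping)
    # -> {"biobirth": ["birth_year"], "medical_variables": ["number_of_doctor_visits"]}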

@felixschmitz (Collaborator, Author)

I started on the steps you outlined above.
There is now:

  1. A task creating map_variable_to_metadata_name (task_create_variable_to_metadata_name_mapping)
  2. Tasks to create a yaml stored in BLD and SRC.

I tried to give sensible names to things, but had a hard time, mainly because we handle variables from single data files (e.g., biobirth) and variables combined from multiple data files (e.g., soep_sample). Later on, this leads to awkward names in the mapping.

Do you have any ideas on this?
Once you have approved the general structure of the yaml, the next step is to adapt the dataset merging to read information from this file.
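To make the discussion concrete, a hypothetical excerpt of the yaml; the variable names and exact keys here are invented, only the module/dtype/survey-year information follows the PR description:

    # variable_to_metadata_mapping.yaml -- hypothetical excerpt
    birth_year:
      module: biobirth            # variable from a single data file
      dtype: Int64
      survey_years: [1984, 2020]
    soep_sample:
      module: soep_sample         # variable combined from multiple data files
      dtype: category
      survey_years: [1984, 2020]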

@hmgaudecker (Collaborator)

Tasks to create a yaml stored in BLD and SRC.

Can't look right now, but the task should only create a file in BLD. This will be copied to SRC manually (could add a helper script if you want).

Then have a check to compare the newly-created with the existing file. This tells us whether we need to update the one in SRC before moving on with the merging.
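A sketch of that check, assuming PyYAML; the task name and paths are made up, and the SRC copy would be the manually committed one:

    from pathlib import Path

    import yaml

    BLD_MAPPING = Path("bld/variable_to_metadata_mapping.yaml")
    SRC_MAPPING = Path("src/soep_preparation/create_metadata/variable_to_metadata_mapping.yaml")


    def task_fail_if_src_mapping_is_stale(
        bld_path: Path = BLD_MAPPING,
        src_path: Path = SRC_MAPPING,
    ) -> None:
        """Fail if the committed SRC mapping differs from the freshly built one."""
        if yaml.safe_load(bld_path.read_text()) != yaml.safe_load(src_path.read_text()):
            msg = "Mapping changed; copy the BLD version to SRC before merging."
            raise ValueError(msg)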

@felixschmitz linked an issue Nov 14, 2025 that may be closed by this pull request
@hmgaudecker (Collaborator) left a comment

Thanks a lot, great work! Getting closer...

*.pickle

# ignore yaml files
*.yaml

Why's that?

@hmgaudecker (Collaborator) commented Nov 14, 2025:

Misunderstanding, see below. Please revert.

TypeError: If input data files or function is not of expected type.
ValueError: If number of dataframes is not as expected.
"""
_error_handling_derived_variables(

Suggested change:
-    _error_handling_derived_variables(
+    fail_if_input_has_invalid_type(input_=mapping, expected_dtypes=["dict"])
+    _fail_if_too_many_or_too_few_dataframes(dataframes=mapping, expected_entries=2)
+    fail_if_input_has_invalid_type(input_=function_, expected_dtypes=["function"])

No reason to define an extra function here.

But why do we care about the number of dataframes? Could I not write a function that depends on 3?
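If we want a check at all, the expected count could be derived from the combine function instead of hard-coding 2; a sketch using only the standard library (function name is mine):

    import inspect


    def fail_if_dataframe_count_does_not_match_signature(mapping: dict, function_) -> None:
        """Require exactly as many input dataframes as `function_` has parameters."""
        expected = len(inspect.signature(function_).parameters)
        if len(mapping) != expected:
            msg = f"{function_.__name__} expects {expected} dataframes, got {len(mapping)}."
            raise ValueError(msg)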

)

@task(id=variable_name)
def task_create_combined_modules(

Can you explain why we go down to the variable level here? Why can't we just do what we do in clean_modules, i.e., call the modules pkal_pl etc. and do the combination of all contents in the body of the main function there? I think it's much clearer to stay at the (SOEP) module level everywhere. (but I only understand this now, ofc)
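For comparison, a module-level loop might look roughly like this; the module names and the stub combine functions are illustrative, and the real task bodies would persist their results in a DataCatalog:

    import pandas as pd
    from pytask import task

    # Hypothetical: one combine function per combined module.
    COMBINE_FUNCTIONS = {
        "pkal_pl": lambda: pd.DataFrame(),
        "medical_variables": lambda: pd.DataFrame(),
    }

    for module_name, combine in COMBINE_FUNCTIONS.items():

        @task(id=module_name)
        def task_create_combined_module(combine=combine) -> None:
            """One task per SOEP module; all combination happens in one body."""
            combine()  # the real task would store the combined dataframe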

def _create_variable_to_metadata_mapping(
    map_module_to_metadata: dict,
) -> dict[str, dict]:
    """Create a mapping of variable to metadata.

Suggested change:
-    """Create a mapping of variable to metadata.
+    """Return a mapping of variables to their metadata.

        A mapping of variable to metadata.
    """
    mapping = {}
    for module_name, metadata in map_module_to_metadata.items():

Suggested change:
-    for module_name, metadata in map_module_to_metadata.items():
+    for module_name, variables_to_metadata in map_modules_to_variables_and_their_metadata.items():

"""
mapping = {}
for module_name, metadata in map_module_to_metadata.items():
for variable_name, variable_metadata in metadata["variable_metadata"].items():

Suggested change:
-        for variable_name, variable_metadata in metadata["variable_metadata"].items():
+        for name, metadata in metadata["variables_to_metadata"].items():

    mapping = {}
    for module_name, metadata in map_module_to_metadata.items():
        for variable_name, variable_metadata in metadata["variable_metadata"].items():
            mapping[variable_name] = {"module": module_name} | variable_metadata

Suggested change:
-            mapping[variable_name] = {"module": module_name} | variable_metadata
+            mapping[name] = {"module": module_name} | metadata

shutil.copy(in_path, out_path)


def _error_handling_mapping_task(modules_metadata: Any) -> None:

Error handling is great, if necessary. But do we gain a lot by these messages relative to the case where something fails in the usual pipeline? It also creates an extra maintenance burden.

@hmgaudecker (Collaborator) left a comment

Looks good, thanks!

src/soep_preparation/create_metadata/task.py still seems to need work, else this looks pretty much good to go!

        Combined medical variables.
        Combined pequiv and pl modules.
    """
    out = pd.DataFrame(index=pequiv.index)

Suggested change:
-    out = pd.DataFrame(index=pequiv.index)
+    out = pd.DataFrame(index=merged.index)

I am not sure what happens in the outer merge if the indexes of the two dataframes differ. Better to be explicit about the behaviour.

Same probably in all other files in this folder?
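A tiny demonstration of the pitfall (column names invented):

    import pandas as pd

    pequiv = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
    pl = pd.DataFrame({"b": [3, 4]}, index=[1, 2])
    merged = pequiv.merge(pl, left_index=True, right_index=True, how="outer")

    # merged.index is the union [0, 1, 2]; building `out` on pequiv.index
    # would silently drop index 2, which only exists in pl.
    out = pd.DataFrame(index=merged.index)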



-    if not data_file_names:
+    if not MODULE_STRUCTURE["cleaned_modules"]:

We should now check this upon creation of MODULE_STRUCTURE["cleaned_modules"]
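E.g., a sketch of such a check next to where MODULE_STRUCTURE is built in config.py (the function name and message text are mine):

    def _fail_if_no_cleaned_modules(module_structure: dict) -> None:
        """Fail fast at config time instead of deep inside a task."""
        if not module_structure["cleaned_modules"]:
            msg = "MODULE_STRUCTURE['cleaned_modules'] is empty; check the data directory."
            raise FileNotFoundError(msg)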

        raise FileNotFoundError(msg)


def get_variable_names_in_script(script: Any) -> list[str]:  # noqa: ANN401

This is a bit confusing for me. For one, as an ex-Stata user, "variable_names" makes me think of "column names" in a data project. So maybe "global_object_names"? Second, the function name is very general, but then it only returns those objects that start with derive_?

        The variable names in the script.
    """
    return [
        variable_name.split("derive_")[-1]

Suggested change:
-        variable_name.split("derive_")[-1]
+        variable_name[7:]

?
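For what it's worth, the two only differ on names that contain the prefix more than once; str.removeprefix (Python 3.9+) avoids both the magic number and the split pitfall:

    name = "derive_employed_in_at_least_one_month"
    name.split("derive_")[-1]     # "employed_in_at_least_one_month"
    name[7:]                      # same result, but 7 is a magic number
    name.removeprefix("derive_")  # same result, explicit about intent

    # split breaks if the prefix re-appears inside the name:
    "derive_rederive_x".split("derive_")[-1]  # "x", not "rederive_x"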

@cursor (bot) left a comment

This PR is being reviewed by Cursor Bugbot


*.pickle

# ignore yaml files
*.yaml

Bug: YAML files incorrectly ignored by git (Bugbot Rules)

Adding *.yaml to .gitignore prevents version control of YAML files that need to be tracked, specifically variable_to_metadata_mapping.yaml in SRC/. The PR discussion indicates this file should be manually copied to SRC/ and version-controlled to track changes. The maintainer explicitly requested reverting this change. The pattern also conflicts with existing YAML configuration files like .yamllint.yml that are already tracked.


Successfully merging this pull request may close these issues:

  • BUG: Make task to combine variable from multiple modules static
  • ENH: Implement metadata dtype categorical
  • BUG: Adapt metadata data_file information