Conversation

@felixschmitz (Collaborator) commented Jul 31, 2025

See Issue #46


Note

Refactors the pipeline to explicit module-based clean/combine tasks, adds YAML metadata mapping, updates dataset merging to use it, and improves hwealth/pkal cleaning with config/tooling tweaks.

  • Pipeline/Tasks:
    • Introduces clean_modules/task.py and new combine_modules/* with explicit combine(...) functions; removes combine_variables/*.
    • Updates config.py (adds MODULE_STRUCTURE, BLD, renames DataCatalogs to cleaned_modules/combined_modules) and adapts convert_stata_to_pandas/task.py.
  • Metadata:
    • Adds per-module metadata creation (index/variable dtypes, survey-year availability) and writes variable_to_metadata_mapping.yaml; copies to src/create_metadata/.
  • Dataset merging:
    • helper.py now consumes the YAML mapping (map_variable_to_module), improves validation, supports PNode types; example task loads YAML; tests updated.
  • Data cleaning fixes:
    • hwealth: treat -8 as missing for vehicle/wealth fields.
    • pkal: refactors monthly employment handling, adds number_of_months_employed and employed_in_at_least_one_month.
  • Config/Tooling:
    • pyproject.toml: switch to Pixi workspace and add pytask-parallel.
    • .gitignore: ignore *.yaml; add data/V38/.gitkeep.

Written by Cursor Bugbot for commit cab979e.

@felixschmitz requested a review from hmgaudecker July 31, 2025 12:20
@felixschmitz self-assigned this Jul 31, 2025
@hmgaudecker (Collaborator) left a comment

Thanks!

I am just not sure where map_variable_to_data_file comes from. As discussed, I think we should

  1. Have a task creating a yaml in bld (see the sketch below this list).
  2. Copy it to the source folder and read from there.
  3. Change the task in 1. so that it fails if the newly-created version is different from the version in source.
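A minimal sketch of step 1, assuming pytask >= 0.4 and PyYAML; the BLD path and the placeholder mapping content are illustrative, not the project's real values:

    from pathlib import Path
    from typing import Annotated

    import yaml
    from pytask import Product

    BLD = Path("bld")  # hypothetical; the project defines BLD in config.py


    def task_write_mapping_to_bld(
        out_path: Annotated[Path, Product] = BLD / "variable_to_metadata_mapping.yaml",
    ) -> None:
        """Step 1: write the freshly created mapping to BLD only."""
        mapping = {"birth_year": {"module": "biobirth"}}  # placeholder content
        out_path.write_text(yaml.safe_dump(mapping, sort_keys=True))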

@felixschmitz (Collaborator, Author)

Thanks for the input!
I envisioned this PR as somewhat separate from PR #45, but given your comment the two are closely linked. So I suggest merging #47 into #45 after resolving your comment above.

@felixschmitz (Collaborator, Author) left a comment

Two things: one on where map_variable_to_data_file is "coming from", the other on how to make sure changes are tracked by pytask.

First, I think most of the confusion stems from the fact that the mapping is produced inside create_metadata/task.py by task_create_variable_to_metadata_name_mapping and then stored in DATA_CATALOGS["metadata"]["mapping"].
In dataset_merging/helper.py, I then read the mapping, and it is called map_variable_to_data_file, which is not completely spot on (since it references not only "data files", e.g., biobirth, but also data from combined variables, e.g., medical_variables).
Any suggestions about naming things?

Second, the code suggestions introduced in this PR allow for pytask to track changes in the data, see the comment below.
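For illustration, this is roughly the job I have in mind for the helper; group_variables_by_module and the mapping literal below are simplified for this sketch, not the PR's exact code:

    from collections import defaultdict


    def group_variables_by_module(variables: list[str], mapping: dict) -> dict:
        """Group requested variables by the module that provides them."""
        grouped = defaultdict(list)
        for variable in variables:
            grouped[mapping[variable]["module"]].append(variable)
        return dict(grouped)


    mapping = {
        "birth_year": {"module": "biobirth"},
        "number_of_doctor_visits": {"module": "medical_variables"},
    }
    group_variables_by_module(["birth_year", "number_of_doctor_visits"], mapping)
    # -> {"biobirth": ["birth_year"], "medical_variables": ["number_of_doctor_visits"]}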

@felixschmitz (Collaborator, Author)

I started on the steps you outlined above.
There is now:

  1. A task creating map_variable_to_metadata_name (task_create_variable_to_metadata_name_mapping)
  2. Tasks to create a yaml stored in BLD and SRC.

I tried to give sensible names to things, but had a hard time, mainly because we handle variables from single data files (e.g., biobirth) and variables combined from multiple data files (e.g., soep_sample). Later on, this leads to awkward names in the mapping.

Do you have any ideas on this?
Once you have approved the general structure of the yaml, the next step is to adapt the dataset merging to read information from this file.
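To make the discussion concrete, a hypothetical excerpt of the yaml; the variable names and exact keys here are invented, only the module/dtype/survey-year information follows the PR description:

    # variable_to_metadata_mapping.yaml -- hypothetical excerpt
    birth_year:
      module: biobirth            # variable from a single data file
      dtype: Int64
      survey_years: [1984, 2020]
    soep_sample:
      module: soep_sample         # variable combined from multiple data files
      dtype: category
      survey_years: [1984, 2020]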

@hmgaudecker (Collaborator)

Tasks to create a yaml stored in BLD and SRC.

Can't look right now, but the task should only create a file in BLD. This will be copied to SRC manually (could add a helper script if you want).

Then have a check to compare the newly-created with the existing file. This tells us whether we need to update the one in SRC before moving on with the merging.
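A sketch of that check, assuming PyYAML; the task name and paths are made up, and the SRC copy would be the manually committed one:

    from pathlib import Path

    import yaml

    BLD_MAPPING = Path("bld/variable_to_metadata_mapping.yaml")
    SRC_MAPPING = Path("src/soep_preparation/create_metadata/variable_to_metadata_mapping.yaml")


    def task_fail_if_src_mapping_is_stale(
        bld_path: Path = BLD_MAPPING,
        src_path: Path = SRC_MAPPING,
    ) -> None:
        """Fail if the committed SRC mapping differs from the freshly built one."""
        if yaml.safe_load(bld_path.read_text()) != yaml.safe_load(src_path.read_text()):
            msg = "Mapping changed; copy the BLD version to SRC before merging."
            raise ValueError(msg)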

@felixschmitz linked an issue Nov 14, 2025 that may be closed by this pull request
@hmgaudecker (Collaborator) left a comment

Thanks a lot, great work! Getting closer...

*.pickle

# ignore yaml files
*.yaml

Why's that?

@hmgaudecker (Collaborator) commented Nov 14, 2025:

Misunderstanding, see below. Please revert.

TypeError: If input data files or function is not of expected type.
ValueError: If number of dataframes is not as expected.
"""
_error_handling_derived_variables(

Suggested change:
-    _error_handling_derived_variables(
+    fail_if_input_has_invalid_type(input_=mapping, expected_dtypes=["dict"])
+    _fail_if_too_many_or_too_few_dataframes(dataframes=mapping, expected_entries=2)
+    fail_if_input_has_invalid_type(input_=function_, expected_dtypes=["function"])

No reason to define an extra function here.

But why do we care about the number of dataframes? Could I not write a function that depends on 3?
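If we want a check at all, the expected count could be derived from the combine function instead of hard-coding 2; a sketch using only the standard library (function name is mine):

    import inspect


    def fail_if_dataframe_count_does_not_match_signature(mapping: dict, function_) -> None:
        """Require exactly as many input dataframes as `function_` has parameters."""
        expected = len(inspect.signature(function_).parameters)
        if len(mapping) != expected:
            msg = f"{function_.__name__} expects {expected} dataframes, got {len(mapping)}."
            raise ValueError(msg)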

)

@task(id=variable_name)
def task_create_combined_modules(

Can you explain why we go down to the variable level here? Why can't we just do what we do in clean_modules, i.e., call the modules pkal_pl etc. and do the combination of all contents in the body of the main function there? I think it's much clearer to stay at the (SOEP) module level everywhere. (but I only understand this now, ofc)
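For comparison, a module-level loop might look roughly like this; the module names and the stub combine functions are illustrative, and the real task bodies would persist their results in a DataCatalog:

    import pandas as pd
    from pytask import task

    # Hypothetical: one combine function per combined module.
    COMBINE_FUNCTIONS = {
        "pkal_pl": lambda: pd.DataFrame(),
        "medical_variables": lambda: pd.DataFrame(),
    }

    for module_name, combine in COMBINE_FUNCTIONS.items():

        @task(id=module_name)
        def task_create_combined_module(combine=combine) -> None:
            """One task per SOEP module; all combination happens in one body."""
            combine()  # the real task would store the combined dataframe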

def _create_variable_to_metadata_mapping(
    map_module_to_metadata: dict,
) -> dict[str, dict]:
    """Create a mapping of variable to metadata.

Suggested change:
-    """Create a mapping of variable to metadata.
+    """Return a mapping of variables to their metadata.

        A mapping of variable to metadata.
    """
    mapping = {}
    for module_name, metadata in map_module_to_metadata.items():

Suggested change:
-    for module_name, metadata in map_module_to_metadata.items():
+    for module_name, variables_to_metadata in map_modules_to_variables_and_their_metadata.items():

"""
mapping = {}
for module_name, metadata in map_module_to_metadata.items():
for variable_name, variable_metadata in metadata["variable_metadata"].items():

Suggested change:
-        for variable_name, variable_metadata in metadata["variable_metadata"].items():
+        for name, metadata in metadata["variables_to_metadata"].items():

    mapping = {}
    for module_name, metadata in map_module_to_metadata.items():
        for variable_name, variable_metadata in metadata["variable_metadata"].items():
            mapping[variable_name] = {"module": module_name} | variable_metadata

Suggested change:
-            mapping[variable_name] = {"module": module_name} | variable_metadata
+            mapping[name] = {"module": module_name} | metadata

shutil.copy(in_path, out_path)


def _error_handling_mapping_task(modules_metadata: Any) -> None:

Error handling is great, if necessary. But do we gain a lot by these messages relative to the case where something fails in the usual pipeline? It also creates an extra maintenance burden.

@hmgaudecker (Collaborator) left a comment

Looks good, thanks!

src/soep_preparation/create_metadata/task.py still seems to need work, else this looks pretty much good to go!

        Combined medical variables.
        Combined pequiv and pl modules.
    """
    out = pd.DataFrame(index=pequiv.index)

Suggested change:
-    out = pd.DataFrame(index=pequiv.index)
+    out = pd.DataFrame(index=merged.index)

I am not sure what happens in the outer merge if the indexes of the two dataframes differ. Better to be explicit about the behaviour.

Same probably in all other files in this folder?
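A tiny demonstration of the pitfall (column names invented):

    import pandas as pd

    pequiv = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
    pl = pd.DataFrame({"b": [3, 4]}, index=[1, 2])
    merged = pequiv.merge(pl, left_index=True, right_index=True, how="outer")

    # merged.index is the union [0, 1, 2]; building `out` on pequiv.index
    # would silently drop index 2, which only exists in pl.
    out = pd.DataFrame(index=merged.index)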



-    if not data_file_names:
+    if not MODULE_STRUCTURE["cleaned_modules"]:

We should now check this upon creation of MODULE_STRUCTURE["cleaned_modules"]
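E.g., a sketch of such a check next to where MODULE_STRUCTURE is built in config.py (the function name and message text are mine):

    def _fail_if_no_cleaned_modules(module_structure: dict) -> None:
        """Fail fast at config time instead of deep inside a task."""
        if not module_structure["cleaned_modules"]:
            msg = "MODULE_STRUCTURE['cleaned_modules'] is empty; check the data directory."
            raise FileNotFoundError(msg)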

        raise FileNotFoundError(msg)


def get_variable_names_in_script(script: Any) -> list[str]:  # noqa: ANN401

This is a bit confusing for me. For one, as an ex-Stata user, "variable_names" makes me think of "column names" in a data project. So maybe "global_object_names"? Second, the function name is very general, but then it only returns those objects that start with derive_?

        The variable names in the script.
    """
    return [
        variable_name.split("derive_")[-1]

Suggested change:
-        variable_name.split("derive_")[-1]
+        variable_name[7:]

?
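For what it's worth, the two only differ on names that contain the prefix more than once; str.removeprefix (Python 3.9+) avoids both the magic number and the split pitfall:

    name = "derive_employed_in_at_least_one_month"
    name.split("derive_")[-1]     # "employed_in_at_least_one_month"
    name[7:]                      # same result, but 7 is a magic number
    name.removeprefix("derive_")  # same result, explicit about intent

    # split breaks if the prefix re-appears inside the name:
    "derive_rederive_x".split("derive_")[-1]  # "x", not "rederive_x"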

@cursor (bot) left a comment

This PR is being reviewed by Cursor Bugbot


*.pickle

# ignore yaml files
*.yaml

Bug: YAML files incorrectly ignored by git (Bugbot Rules)

Adding *.yaml to .gitignore prevents version control of YAML files that need to be tracked, specifically variable_to_metadata_mapping.yaml in SRC/. The PR discussion indicates this file should be manually copied to SRC/ and version-controlled to track changes. The maintainer explicitly requested reverting this change. The pattern also conflicts with existing YAML configuration files like .yamllint.yml that are already tracked.


Successfully merging this pull request may close these issues:

  • BUG: Make task to combine variable from multiple modules static
  • ENH: Implement metadata dtype categorical
  • BUG: Adapt metadata data_file information