Make task dependencies explicit #47
Conversation
hmgaudecker
left a comment
Thanks!
I am just not sure where map_variable_to_data_file comes from. As discussed, I think we should
1. Have a task creating a yaml file in bld.
2. Copy it to the source folder and read from there.
3. Change the task in 1. so that it fails if the newly-created version is different from the version in source (rough sketch below).
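A minimal pytask-style sketch of these three steps, assuming placeholder paths BLD and SRC and a hypothetical _create_mapping helper (none of this is the project's actual code):

```python
# Sketch only: BLD, SRC, and _create_mapping are placeholders, not the
# project's real definitions.
from pathlib import Path
from typing import Annotated

import yaml
from pytask import Product

BLD = Path(__file__).parent / "bld"  # assumed build directory
SRC = Path(__file__).parent / "src"  # assumed source directory


def _create_mapping() -> dict:
    # placeholder for however the mapping is actually built
    return {"some_variable": {"module": "some_module"}}


def task_write_mapping_to_bld(
    path: Annotated[Path, Product] = BLD / "variable_to_metadata_mapping.yaml",
) -> None:
    """Step 1: create the mapping in bld only."""
    path.write_text(yaml.safe_dump(_create_mapping()))


def task_check_mapping_is_up_to_date(
    bld_path: Path = BLD / "variable_to_metadata_mapping.yaml",
    src_path: Path = SRC / "variable_to_metadata_mapping.yaml",
) -> None:
    """Step 3: fail if the newly created mapping differs from the one in source."""
    if yaml.safe_load(bld_path.read_text()) != yaml.safe_load(src_path.read_text()):
        msg = "Mapping in source is outdated; copy it over from bld and commit."
        raise ValueError(msg)
```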
felixschmitz
left a comment
Two things: one on where map_variable_to_data_file is "coming from", the other on how to make sure changes are tracked by pytask.
First, I think most of the confusion stems from the fact that the mapping is produced inside create_metadata/task.py by task_create_variable_to_metadata_name_mapping and then stored in DATA_CATALOGS["metadata"]["mapping"].
In dataset_merging/helper.py, I then read the mapping, and it is called map_variable_to_data_file, which is not completely spot on (since it references not only "data files", e.g., biobirth, but also data from combined variables, e.g., medical_variables).
Any suggestions about naming things?
Second, the code suggestions introduced in this PR allow for pytask to track changes in the data, see the comment below.
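On the first point, this is roughly the flow, written out as a compact, purely illustrative sketch — the catalog name and keys mirror the ones above, everything else is made up:

```python
# Illustrative sketch of the data flow; DATA_CATALOGS and the task bodies are
# simplified stand-ins for the project's actual code.
from typing import Annotated

from pytask import DataCatalog

DATA_CATALOGS = {"metadata": DataCatalog(name="metadata")}  # assumed to live in config.py


# create_metadata/task.py: produce the mapping and store it in the catalog.
def task_create_variable_to_metadata_name_mapping() -> Annotated[
    dict, DATA_CATALOGS["metadata"]["mapping"]
]:
    return {
        "some_biobirth_variable": {"module": "biobirth"},
        "some_medical_variable": {"module": "medical_variables"},
    }


# dataset_merging: the mapping is read back from the catalog (in the real code
# this happens via helper.py), where it is currently called map_variable_to_data_file.
def task_merge_datasets(
    map_variable_to_data_file: dict = DATA_CATALOGS["metadata"]["mapping"],
) -> None:
    ...
```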
I started on the steps you outlined above.
I tried to give sensible names to things, but had a hard time (mainly because we handle variables from single data files, e.g., biobirth, as well as variables combined from several modules, e.g., medical_variables). Do you have any ideas on this?
Can't look right now, but the task should only create a file in bld. Then have a check to compare the newly-created file with the existing one. This tells us whether we need to update the one in src.
hmgaudecker
left a comment
Thanks a lot, great work! Getting closer...
    *.pickle

    # ignore yaml files
    *.yaml
Why's that?
Misunderstanding, see below. Please revert.
            TypeError: If input data files or function is not of expected type.
            ValueError: If number of dataframes is not as expected.
        """
        _error_handling_derived_variables(
Suggested change:
    -    _error_handling_derived_variables(
    +    fail_if_input_has_invalid_type(input_=mapping, expected_dtypes=["dict"])
    +    _fail_if_too_many_or_too_few_dataframes(dataframes=mapping, expected_entries=2)
    +    fail_if_input_has_invalid_type(input_=function_, expected_dtypes=["function"])
No reason to define an extra function here.
But why do we care about the number of dataframes? Could I not write a function that depends on 3?
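If the worry is only that the inputs and the combining function stay in sync, one could derive the expected count from the function's signature instead of hard-coding it — a sketch with a made-up helper name:

```python
# Hypothetical check: no fixed "2", the expected number of inputs comes from
# the combining function's own signature.
import inspect
from collections.abc import Callable


def _fail_if_inputs_do_not_match_signature(dataframes: dict, function_: Callable) -> None:
    n_expected = len(inspect.signature(function_).parameters)
    if len(dataframes) != n_expected:
        msg = (
            f"{function_.__name__} expects {n_expected} inputs, "
            f"but {len(dataframes)} were provided."
        )
        raise ValueError(msg)
```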
    )

    @task(id=variable_name)
    def task_create_combined_modules(
Can you explain why we go down to the variable level here? Why can't we just do what we do in clean_modules, i.e., call the modules pkal_pl etc. and do the combination of all contents in the body of the main function there? I think it's much clearer to stay at the (SOEP) module level everywhere. (but I only understand this now, ofc)
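For illustration, staying at the module level could look roughly like this — catalog names follow the PR (cleaned_modules / combined_modules), everything else is a placeholder:

```python
# Hedged sketch: one task per (SOEP) module, with all variable combinations
# done inside that module's combine function. Not the project's actual code.
from typing import Annotated

import pandas as pd
from pytask import DataCatalog, task

DATA_CATALOGS = {
    "cleaned_modules": DataCatalog(name="cleaned_modules"),
    "combined_modules": DataCatalog(name="combined_modules"),
}


def combine(pequiv: pd.DataFrame, pl: pd.DataFrame) -> pd.DataFrame:
    """Combine every derived variable of the pequiv/pl module in one place."""
    merged = pequiv.join(pl, how="outer")
    out = pd.DataFrame(index=merged.index)
    # ... add all combined variables of this module to `out` here ...
    return out


@task(id="pequiv_pl")
def task_create_combined_module(
    pequiv: pd.DataFrame = DATA_CATALOGS["cleaned_modules"]["pequiv"],
    pl: pd.DataFrame = DATA_CATALOGS["cleaned_modules"]["pl"],
) -> Annotated[pd.DataFrame, DATA_CATALOGS["combined_modules"]["pequiv_pl"]]:
    return combine(pequiv=pequiv, pl=pl)
```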
    def _create_variable_to_metadata_mapping(
        map_module_to_metadata: dict,
    ) -> dict[str, dict]:
        """Create a mapping of variable to metadata.
| """Create a mapping of variable to metadata. | |
| """Return a mapping of variables to their metadata. |
            A mapping of variable to metadata.
        """
        mapping = {}
        for module_name, metadata in map_module_to_metadata.items():
Suggested change:
    -    for module_name, metadata in map_module_to_metadata.items():
    +    for module_name, variables_to_metadata in map_modules_to_variables_and_their_metadata.items():
| """ | ||
| mapping = {} | ||
| for module_name, metadata in map_module_to_metadata.items(): | ||
| for variable_name, variable_metadata in metadata["variable_metadata"].items(): |
Suggested change:
    -        for variable_name, variable_metadata in metadata["variable_metadata"].items():
    +        for name, metadata in metadata["variables_to_metadata"].items():
        mapping = {}
        for module_name, metadata in map_module_to_metadata.items():
            for variable_name, variable_metadata in metadata["variable_metadata"].items():
                mapping[variable_name] = {"module": module_name} | variable_metadata
Suggested change:
    -            mapping[variable_name] = {"module": module_name} | variable_metadata
    +            mapping[name] = {"module": module_name} | metadata
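Putting the three suggestions together, the whole function could read like this (a sketch that assumes the outer dict's values are already the per-module variable-to-metadata dicts; adjust if the structure is nested differently):

```python
def _create_variable_to_metadata_mapping(
    map_modules_to_variables_and_their_metadata: dict,
) -> dict[str, dict]:
    """Return a mapping of variables to their metadata."""
    mapping = {}
    for (
        module_name,
        variables_to_metadata,
    ) in map_modules_to_variables_and_their_metadata.items():
        for name, metadata in variables_to_metadata.items():
            mapping[name] = {"module": module_name} | metadata
    return mapping
```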
        shutil.copy(in_path, out_path)


    def _error_handling_mapping_task(modules_metadata: Any) -> None:
Error handling is great, if necessary. But do we gain a lot by these messages relative to the case where something fails in the usual pipeline? It also creates an extra maintenance burden.
hmgaudecker
left a comment
Looks good, thanks!
src/soep_preparation/create_metadata/task.py still seems to need work, else this looks pretty much good to go!
        Combined medical variables.
        Combined pequiv and pl modules.
        """
        out = pd.DataFrame(index=pequiv.index)
Suggested change:
    -    out = pd.DataFrame(index=pequiv.index)
    +    out = pd.DataFrame(index=merged.index)
I am not sure what happens if the indexes of the two dataframes for the outer merge are different. Better be explicit about the behaviour.
Same probably in all other files in this folder?
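A tiny, made-up example of the concern: with an outer join, the result's index is the union of both inputs' indexes, so basing out on only one of them silently drops rows that exist only in the other dataframe.

```python
import pandas as pd

pequiv = pd.DataFrame({"a": [1, 2]}, index=[101, 102])
pl = pd.DataFrame({"b": [3, 4]}, index=[102, 103])

merged = pequiv.join(pl, how="outer")

len(pequiv.index)  # 2
len(merged.index)  # 3 -- index 103 exists only in pl

out = pd.DataFrame(index=merged.index)  # keeps every row of the outer merge
```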
    -    if not data_file_names:
    +    if not MODULE_STRUCTURE["cleaned_modules"]:
We should now check this upon creation of MODULE_STRUCTURE["cleaned_modules"]
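Something along these lines in config.py, perhaps (the discovery helper and paths are hypothetical):

```python
# Hedged sketch: fail fast when MODULE_STRUCTURE["cleaned_modules"] is built,
# instead of inside the downstream task.
from pathlib import Path

SRC = Path(__file__).parent  # assumed


def _discover_cleaned_modules() -> list[str]:
    # hypothetical discovery: one cleaning script per module
    return sorted(p.stem for p in (SRC / "clean_modules").glob("*.py"))


MODULE_STRUCTURE = {"cleaned_modules": _discover_cleaned_modules()}

if not MODULE_STRUCTURE["cleaned_modules"]:
    msg = "No cleaned modules found - is clean_modules populated?"
    raise FileNotFoundError(msg)
```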
        raise FileNotFoundError(msg)


    def get_variable_names_in_script(script: Any) -> list[str]:  # noqa: ANN401
This is a bit confusing for me. For one, as an ex-Stata user, "variable_names" makes me think of "column names" in a data project. So maybe "global_object_names"? Second, the function name is very general, but then it only returns those objects that start with derive_?
            The variable names in the script.
        """
        return [
            variable_name.split("derive_")[-1]
Suggested change:
    -            variable_name.split("derive_")[-1]
    +            variable_name[7:]
?
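For comparison, the candidate spellings agree on names that actually carry the prefix but differ otherwise (removeprefix is just a third option, not something the code uses today):

```python
name = "derive_number_of_months_employed"

assert name.split("derive_")[-1] == "number_of_months_employed"
assert name[7:] == "number_of_months_employed"  # len("derive_") == 7
assert name.removeprefix("derive_") == "number_of_months_employed"  # Python >= 3.9

other = "helper"  # no derive_ prefix
assert other.split("derive_")[-1] == "helper"
assert other.removeprefix("derive_") == "helper"
assert other[7:] == ""  # silently truncates
```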
    *.pickle

    # ignore yaml files
    *.yaml
Bug: YAML files incorrectly ignored by git (Bugbot Rules)
Adding *.yaml to .gitignore prevents version control of YAML files that need to be tracked, specifically variable_to_metadata_mapping.yaml in SRC/. The PR discussion indicates this file should be manually copied to SRC/ and version-controlled to track changes. The maintainer explicitly requested reverting this change. The pattern also conflicts with existing YAML configuration files like .yamllint.yml that are already tracked.
See Issue #46
Note
Refactors the pipeline to explicit module-based clean/combine tasks, adds YAML metadata mapping, updates dataset merging to use it, and improves hwealth/pkal cleaning with config/tooling tweaks.
- clean_modules/task.py and new combine_modules/* with explicit combine(...) functions; removes combine_variables/*.
- config.py (adds MODULE_STRUCTURE, BLD, renames DataCatalogs to cleaned_modules/combined_modules) and adapts convert_stata_to_pandas/task.py.
- variable_to_metadata_mapping.yaml; copies to src/create_metadata/.
- helper.py now consumes the YAML mapping (map_variable_to_module), improves validation, supports PNode types; example task loads YAML; tests updated.
- hwealth: treat -8 as missing for vehicle/wealth fields.
- pkal: refactors monthly employment handling, adds number_of_months_employed and employed_in_at_least_one_month.
- pyproject.toml: switch to Pixi workspace and add pytask-parallel.
- .gitignore: ignore *.yaml; add data/V38/.gitkeep.

Written by Cursor Bugbot for commit cab979e.