24 commits
badeaaf
Refactor config, raw data task and metadata tasks to have explicit de…
felixschmitz Jul 31, 2025
26cff1b
Refactor dataset merging dependencies.
felixschmitz Jul 31, 2025
741ae92
Fix error handling in dataset merging.
felixschmitz Jul 31, 2025
851dd66
Move to .
felixschmitz Jul 31, 2025
166d02f
Rename mapping-dictionaries in metadata and dataset creation.
felixschmitz Aug 4, 2025
15cf8ec
Addendum to renaming.
felixschmitz Aug 4, 2025
5583763
Fix tests.
felixschmitz Aug 4, 2025
dfbbc54
Merge branch 'main' into make-task-dependencies-explicit
felixschmitz Aug 5, 2025
7064d98
Move error handling of empty inputs into utilities directory.
felixschmitz Aug 5, 2025
bcd246c
Move get_variable_names_in_module function.
felixschmitz Aug 7, 2025
1409733
Add available survey years to variable metadata.
felixschmitz Aug 7, 2025
fbc3372
Re-Add pytask-parallel.
felixschmitz Aug 7, 2025
43d8b09
Fix cleaning variables.
felixschmitz Aug 7, 2025
89ad982
Store merging information in yaml.
felixschmitz Aug 7, 2025
6afdec7
Rename variables and helper functions of metadata tasks.
felixschmitz Aug 7, 2025
d9d9157
Create helper function to move yml from bld to src.
felixschmitz Aug 7, 2025
3a615e2
Introduce module as umbrella-term for DataFrames (containing multiple…
felixschmitz Nov 14, 2025
fb81460
Provisional fix of dataset merging based on metadata mapping in yaml.
felixschmitz Nov 14, 2025
95364ba
Generalize error handling of missing function for script.
felixschmitz Nov 17, 2025
87194c7
Refactor combining variables from multiple modules to take place on m…
felixschmitz Nov 17, 2025
d30e6ac
Keep directory structure around.
hmgaudecker Nov 17, 2025
8858c44
Introduce module structure inside config to loop over tasks in a stat…
felixschmitz Nov 17, 2025
fd74f19
Merge branch 'make-task-dependencies-explicit' of github.com:OpenSour…
hmgaudecker Nov 19, 2025
cab979e
Get rid of pixi deprecation warning.
hmgaudecker Nov 19, 2025
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -45,3 +45,6 @@ sandbox/

# ignore pickle files
*.pickle

# ignore yaml files
*.yaml
Collaborator

Why's that?

@hmgaudecker (Collaborator), Nov 14, 2025

Misunderstanding, see below. Please revert.

Bug: YAML files incorrectly ignored by git (Bugbot Rules)

Adding *.yaml to .gitignore prevents newly created YAML files from being tracked, specifically variable_to_metadata_mapping.yaml in SRC/. The PR discussion indicates that this file should be copied manually to SRC/ and version-controlled so that changes to it can be tracked. The maintainer explicitly requested reverting this change. Existing configuration files such as .yamllint.yml are not affected: they are already tracked, and the pattern matches only the .yaml extension, not .yml.
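
To make the reach of the `*.yaml` pattern concrete, here is a minimal sketch that approximates how a slash-free .gitignore pattern is matched against file names. It uses Python's `fnmatch` as a stand-in for git's own matcher, which is only an approximation. The helper name and the example paths are hypothetical. Keep in mind that .gitignore only affects files that are not yet tracked.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches_basename_pattern(path: str, pattern: str) -> bool:
    """Approximate a slash-free .gitignore pattern by matching the file name only."""
    return fnmatchcase(PurePosixPath(path).name, pattern)


pattern = "*.yaml"
paths = [
    "src/variable_to_metadata_mapping.yaml",  # hypothetical location below SRC/
    ".yamllint.yml",
    "bld/some_module.pickle",  # hypothetical build artifact
]

# Map each path to whether the pattern would cover it.
matches = {path: matches_basename_pattern(path, pattern) for path in paths}

for path, is_matched in matches.items():
    print(f"{path}: {'matched' if is_matched else 'not matched'}")
```

Under this approximation the mapping file is covered by the pattern, while `.yamllint.yml` is not, because its extension is `.yml`.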

Empty file added data/V38/.gitkeep
Empty file.
1,325 changes: 583 additions & 742 deletions pixi.lock

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ allow-direct-references = true
# Pixi configuration
# ======================================================================================

[tool.pixi.project]
[tool.pixi.workspace]
channels = ["conda-forge"]
platforms = ["linux-64", "osx-64", "osx-arm64", "win-64"]

Expand All @@ -78,6 +78,7 @@ pre-commit = "*"
pygraphviz = "*"
pyarrow = ">=21.0.0,<22"
pandas = ">=2.3.1,<3"
pytask-parallel = ">=0.5.1,<0.6"

[tool.pixi.pypi-dependencies]
soep_preparation = {path = ".", editable = true}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -60,42 +60,51 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
out["hh_net_overall_wealth_d"] = apply_smallest_float_dtype(raw_data["w011hd"])
out["hh_net_overall_wealth_e"] = apply_smallest_float_dtype(raw_data["w011he"])

out["hh_vehicles_value_a"] = apply_smallest_float_dtype(raw_data["v010ha"])
out["hh_vehicles_value_b"] = apply_smallest_float_dtype(raw_data["v010hb"])
out["hh_vehicles_value_c"] = apply_smallest_float_dtype(raw_data["v010hc"])
out["hh_vehicles_value_d"] = apply_smallest_float_dtype(raw_data["v010hd"])
out["hh_vehicles_value_e"] = apply_smallest_float_dtype(raw_data["v010he"])
out["hh_vehicles_value_a"] = apply_smallest_float_dtype(
raw_data["v010ha"].replace({-8: pd.NA})
)
out["hh_vehicles_value_b"] = apply_smallest_float_dtype(
raw_data["v010hb"].replace({-8: pd.NA})
)
out["hh_vehicles_value_c"] = apply_smallest_float_dtype(
raw_data["v010hc"].replace({-8: pd.NA})
)
out["hh_vehicles_value_d"] = apply_smallest_float_dtype(
raw_data["v010hd"].replace({-8: pd.NA})
)
out["hh_vehicles_value_e"] = apply_smallest_float_dtype(
raw_data["v010he"].replace({-8: pd.NA})
)

out["hh_gross_overall_wealth_including_vehicles_a"] = apply_smallest_float_dtype(
raw_data["n010ha"]
raw_data["n010ha"].replace({-8: pd.NA})
)
out["hh_gross_overall_wealth_including_vehicles_b"] = apply_smallest_float_dtype(
raw_data["n010hb"]
raw_data["n010hb"].replace({-8: pd.NA})
)
out["hh_gross_overall_wealth_including_vehicles_c"] = apply_smallest_float_dtype(
raw_data["n010hc"]
raw_data["n010hc"].replace({-8: pd.NA})
)
out["hh_gross_overall_wealth_including_vehicles_d"] = apply_smallest_float_dtype(
raw_data["n010hd"]
raw_data["n010hd"].replace({-8: pd.NA})
)
out["hh_gross_overall_wealth_including_vehicles_e"] = apply_smallest_float_dtype(
raw_data["n010he"]
raw_data["n010he"].replace({-8: pd.NA})
)

out["hh_net_overall_wealth_including_vehicles_and_student_loans_a"] = (
apply_smallest_float_dtype(raw_data["n011ha"])
apply_smallest_float_dtype(raw_data["n011ha"].replace({-8: pd.NA}))
)
out["hh_net_overall_wealth_including_vehicles_and_student_loans_b"] = (
apply_smallest_float_dtype(raw_data["n011hb"])
apply_smallest_float_dtype(raw_data["n011hb"].replace({-8: pd.NA}))
)
out["hh_net_overall_wealth_including_vehicles_and_student_loans_c"] = (
apply_smallest_float_dtype(raw_data["n011hc"])
apply_smallest_float_dtype(raw_data["n011hc"].replace({-8: pd.NA}))
)
out["hh_net_overall_wealth_including_vehicles_and_student_loans_d"] = (
apply_smallest_float_dtype(raw_data["n011hd"])
apply_smallest_float_dtype(raw_data["n011hd"].replace({-8: pd.NA}))
)
out["hh_net_overall_wealth_including_vehicles_and_student_loans_e"] = (
apply_smallest_float_dtype(raw_data["n011he"])
apply_smallest_float_dtype(raw_data["n011he"].replace({-8: pd.NA}))
)

out[
Expand Down
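
The hunk above replaces the code -8 with `pd.NA` before the dtype conversion. A minimal sketch of the effect, assuming the raw columns hold plain integers: `to_nullable_float` is a hypothetical stand-in for `apply_smallest_float_dtype`, whose implementation is not shown in this diff, and the values are made up.

```python
import pandas as pd

MISSING_CODE = -8  # the code replaced with pd.NA in the hunk above


def to_nullable_float(series: pd.Series) -> pd.Series:
    """Hypothetical stand-in for apply_smallest_float_dtype (not the project's code)."""
    return series.astype("Float64")


# Made-up values; "v010ha" is one of the raw column names from the diff.
raw_data = pd.DataFrame({"v010ha": [12000, -8, 3500]})

out = pd.DataFrame()
out["hh_vehicles_value_a"] = to_nullable_float(
    raw_data["v010ha"].replace({MISSING_CODE: pd.NA})
)

# The former -8 entry is now missing and no longer enters aggregates.
print(out["hh_vehicles_value_a"].isna().tolist())
print(out["hh_vehicles_value_a"].sum())
```

Without the replacement, -8 would be carried along as an ordinary wealth value and would distort sums and means.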
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,7 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
The processed pkal data.
"""
out = pd.DataFrame()
tmp = pd.DataFrame()

out["p_id"] = apply_smallest_int_dtype(raw_data["pid"])
out["hh_id"] = apply_smallest_int_dtype(raw_data["hid"])
Expand All @@ -102,20 +103,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
# the second the timeframe 1998 until 2022
# individual employment status by month
# Month 1 - Jan
out["tmp_ft_employed_m_v1_1"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_1"] = object_to_str_categorical(
series=raw_data["kal1a001_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_1"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_1"] = object_to_str_categorical(
series=raw_data["kal1a001_v2"],
renaming={
"[1] Jan Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Jan Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_1"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_1"],
series_2=out["tmp_ft_employed_m_v2_1"],
series_1=tmp["tmp_ft_employed_m_v1_1"],
series_2=tmp["tmp_ft_employed_m_v2_1"],
ordered=False,
)
out["pt_employed_m_1"] = object_to_str_categorical(
Expand All @@ -131,20 +132,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 2 - Feb
out["tmp_ft_employed_m_v1_2"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_2"] = object_to_str_categorical(
series=raw_data["kal1a002_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_2"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_2"] = object_to_str_categorical(
series=raw_data["kal1a002_v2"],
renaming={
"[1] Feb Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Feb Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_2"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_2"],
series_2=out["tmp_ft_employed_m_v2_2"],
series_1=tmp["tmp_ft_employed_m_v1_2"],
series_2=tmp["tmp_ft_employed_m_v2_2"],
ordered=False,
)
out["pt_employed_m_2"] = object_to_str_categorical(
Expand All @@ -160,20 +161,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 3 - Mrz
out["tmp_ft_employed_m_v1_3"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_3"] = object_to_str_categorical(
series=raw_data["kal1a003_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_3"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_3"] = object_to_str_categorical(
series=raw_data["kal1a003_v2"],
renaming={
"[1] Mrz Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Mrz Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_3"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_3"],
series_2=out["tmp_ft_employed_m_v2_3"],
series_1=tmp["tmp_ft_employed_m_v1_3"],
series_2=tmp["tmp_ft_employed_m_v2_3"],
ordered=False,
)
out["pt_employed_m_3"] = object_to_str_categorical(
Expand All @@ -189,20 +190,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 4 - Apr
out["tmp_ft_employed_m_v1_4"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_4"] = object_to_str_categorical(
series=raw_data["kal1a004_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_4"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_4"] = object_to_str_categorical(
series=raw_data["kal1a004_v2"],
renaming={
"[1] Apr Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Apr Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_4"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_4"],
series_2=out["tmp_ft_employed_m_v2_4"],
series_1=tmp["tmp_ft_employed_m_v1_4"],
series_2=tmp["tmp_ft_employed_m_v2_4"],
ordered=False,
)
out["pt_employed_m_4"] = object_to_str_categorical(
Expand All @@ -218,20 +219,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 5 - Mai
out["tmp_ft_employed_m_v1_5"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_5"] = object_to_str_categorical(
series=raw_data["kal1a005_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_5"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_5"] = object_to_str_categorical(
series=raw_data["kal1a005_v2"],
renaming={
"[1] Mai Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Mai Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_5"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_5"],
series_2=out["tmp_ft_employed_m_v2_5"],
series_1=tmp["tmp_ft_employed_m_v1_5"],
series_2=tmp["tmp_ft_employed_m_v2_5"],
ordered=False,
)
out["pt_employed_m_5"] = object_to_str_categorical(
Expand All @@ -247,20 +248,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 6 - Jun
out["tmp_ft_employed_m_v1_6"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_6"] = object_to_str_categorical(
series=raw_data["kal1a006_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_6"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_6"] = object_to_str_categorical(
series=raw_data["kal1a006_v2"],
renaming={
"[1] Jun Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Jun Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_6"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_6"],
series_2=out["tmp_ft_employed_m_v2_6"],
series_1=tmp["tmp_ft_employed_m_v1_6"],
series_2=tmp["tmp_ft_employed_m_v2_6"],
ordered=False,
)
out["pt_employed_m_6"] = object_to_str_categorical(
Expand All @@ -276,20 +277,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 7 - Jul
out["tmp_ft_employed_m_v1_7"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_7"] = object_to_str_categorical(
series=raw_data["kal1a007_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_7"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_7"] = object_to_str_categorical(
series=raw_data["kal1a007_v2"],
renaming={
"[1] Jul Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Jul Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_7"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_7"],
series_2=out["tmp_ft_employed_m_v2_7"],
series_1=tmp["tmp_ft_employed_m_v1_7"],
series_2=tmp["tmp_ft_employed_m_v2_7"],
ordered=False,
)
out["pt_employed_m_7"] = object_to_str_categorical(
Expand All @@ -305,20 +306,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 8 - Aug
out["tmp_ft_employed_m_v1_8"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_8"] = object_to_str_categorical(
series=raw_data["kal1a008_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_8"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_8"] = object_to_str_categorical(
series=raw_data["kal1a008_v2"],
renaming={
"[1] Aug Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Aug Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_8"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_8"],
series_2=out["tmp_ft_employed_m_v2_8"],
series_1=tmp["tmp_ft_employed_m_v1_8"],
series_2=tmp["tmp_ft_employed_m_v2_8"],
ordered=False,
)
out["pt_employed_m_8"] = object_to_str_categorical(
Expand All @@ -334,20 +335,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 9 - Sep
out["tmp_ft_employed_m_v1_9"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_9"] = object_to_str_categorical(
series=raw_data["kal1a009_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_9"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_9"] = object_to_str_categorical(
series=raw_data["kal1a009_v2"],
renaming={
"[1] Sep Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Sep Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_9"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_9"],
series_2=out["tmp_ft_employed_m_v2_9"],
series_1=tmp["tmp_ft_employed_m_v1_9"],
series_2=tmp["tmp_ft_employed_m_v2_9"],
ordered=False,
)
out["pt_employed_m_9"] = object_to_str_categorical(
Expand All @@ -363,20 +364,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 10 - Okt
out["tmp_ft_employed_m_v1_10"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_10"] = object_to_str_categorical(
series=raw_data["kal1a010_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_10"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_10"] = object_to_str_categorical(
series=raw_data["kal1a010_v2"],
renaming={
"[1] Okt Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Okt Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_10"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_10"],
series_2=out["tmp_ft_employed_m_v2_10"],
series_1=tmp["tmp_ft_employed_m_v1_10"],
series_2=tmp["tmp_ft_employed_m_v2_10"],
ordered=False,
)
out["pt_employed_m_10"] = object_to_str_categorical(
Expand All @@ -392,20 +393,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 11 - Nov
out["tmp_ft_employed_m_v1_11"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_11"] = object_to_str_categorical(
series=raw_data["kal1a011_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_11"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_11"] = object_to_str_categorical(
series=raw_data["kal1a011_v2"],
renaming={
"[1] Nov Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Nov Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_11"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_11"],
series_2=out["tmp_ft_employed_m_v2_11"],
series_1=tmp["tmp_ft_employed_m_v1_11"],
series_2=tmp["tmp_ft_employed_m_v2_11"],
ordered=False,
)
out["pt_employed_m_11"] = object_to_str_categorical(
Expand All @@ -421,20 +422,20 @@ def clean(raw_data: pd.DataFrame) -> pd.DataFrame:
)

# Month 12 - Dez
out["tmp_ft_employed_m_v1_12"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v1_12"] = object_to_str_categorical(
series=raw_data["kal1a012_v1"],
renaming={"[1] Ja": "Vollzeit erwerbstätig"},
)
out["tmp_ft_employed_m_v2_12"] = object_to_str_categorical(
tmp["tmp_ft_employed_m_v2_12"] = object_to_str_categorical(
series=raw_data["kal1a012_v2"],
renaming={
"[1] Dez Vollzeit erwerbst.": "Vollzeit erwerbstätig",
"[8] Dez Werkstatt fuer behinderte Menschen": "Werkstatt für behinderte Menschen", # noqa: E501
},
)
out["ft_employed_m_12"] = combine_first_and_make_categorical(
series_1=out["tmp_ft_employed_m_v1_12"],
series_2=out["tmp_ft_employed_m_v2_12"],
series_1=tmp["tmp_ft_employed_m_v1_12"],
series_2=tmp["tmp_ft_employed_m_v2_12"],
ordered=False,
)
out["pt_employed_m_12"] = object_to_str_categorical(
Expand Down
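
The change repeated for every month in this file moves the intermediate `tmp_*` columns from `out` into a separate `tmp` DataFrame, so they no longer appear in the returned data. A minimal sketch of the pattern for one month, under stated assumptions: the values are made up, and the `combine_first` call followed by a categorical cast is only a rough stand-in for the project's `combine_first_and_make_categorical` helper, whose implementation is not part of this diff.

```python
import pandas as pd

# Made-up raw data: two versions of the same monthly variable,
# each filled for a different survey period.
raw_data = pd.DataFrame(
    {
        "kal1a001_v1": ["Vollzeit erwerbstätig", None, None],
        "kal1a001_v2": [None, "Vollzeit erwerbstätig", None],
    }
)

out = pd.DataFrame()  # returned to the caller
tmp = pd.DataFrame()  # scratch space for intermediate columns, never returned

tmp["tmp_ft_employed_m_v1_1"] = raw_data["kal1a001_v1"].astype("category")
tmp["tmp_ft_employed_m_v2_1"] = raw_data["kal1a001_v2"].astype("category")

# Rough stand-in for combine_first_and_make_categorical(..., ordered=False):
# take version 1 where available, otherwise fall back to version 2.
combined = (
    tmp["tmp_ft_employed_m_v1_1"]
    .astype(object)
    .combine_first(tmp["tmp_ft_employed_m_v2_1"].astype(object))
)
out["ft_employed_m_1"] = combined.astype("category")

print(list(out.columns))
```

Only `ft_employed_m_1` ends up in `out`; the two helper columns stay in `tmp`.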