Implement CMIP7 support: ControlledVocabularies and GlobalAttributes #252

Open
JanStreffing wants to merge 18 commits into main from cmip7-fesom2

@JanStreffing JanStreffing commented Mar 13, 2026

  • Add CMIP7ControlledVocabularies.load() method to return empty instance (CMIP7 uses unified all_var_info.json instead of separate CV files)
  • Implement CMIP7GlobalAttributes with __init__, global_attributes(), subdir_path(), and get_tracking_id() methods
  • Add cmorize_sst.yaml configuration for AWI-ESM3-VEG-LR piControl SST CMORization to CMIP7 standard

- Add CMIP7ControlledVocabularies.load() method to return empty instance
  (CMIP7 uses unified all_var_info.json instead of separate CV files)
- Implement CMIP7GlobalAttributes with __init__, global_attributes(),
  subdir_path(), and get_tracking_id() methods
- Add cmorize_sst.yaml configuration for AWI-ESM3-VEG-LR piControl SST
  CMORization to CMIP7 standard
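
The "return empty instance" behaviour described above could look roughly like the following minimal sketch. The dict-based base class and the `table_dir` parameter are assumptions made for illustration, not confirmed details of the project's code.

```python
class ControlledVocabularies(dict):
    """Hypothetical base class standing in for the project's CV container."""

    @classmethod
    def load(cls, table_dir=None):
        raise NotImplementedError("subclasses decide how CVs are loaded")


class CMIP7ControlledVocabularies(ControlledVocabularies):
    """CMIP7 variant: there are no separate CV files to read."""

    @classmethod
    def load(cls, table_dir=None):
        # Return an empty instance; per the commit note, CMIP7 metadata
        # comes from the unified all_var_info.json instead.
        return cls()
```

Callers that expect a CV object still receive one, it simply contains no entries.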
@JanStreffing JanStreffing self-assigned this Mar 13, 2026
- Implemented CMIP7ControlledVocabularies.load() returning empty instance
- Implemented CMIP7GlobalAttributes with __init__, global_attributes(), subdir_path(), get_tracking_id()
- Added conditional Prefect imports to avoid server startup when using native orchestrator
- Updated cmorize_sst.yaml with grid_file, pipeline_workflow_orchestrator settings
- Fixed configuration for CMIP7 SST CMORization

Remaining blocker: Dask cluster still initializes despite enable_dask: False
- Added chunks=None when enable_dask: False in gather_inputs.py
- Modified trigger_compute to skip .compute() when enable_dask: False
- Switched grid_file to mesh.nc (CMOR-ready with lat/lon bounds)
- Updated cmorize_sst.yaml with xarray_open_mfdataset_parallel: False

Note: Dask workers still spawn in the pipeline despite all fixes; this points to a deeper architectural issue with xarray operations creating dask arrays
- Switch to dars2 mesh (3.1M nodes) matching input data
- Remove time_average step (can't resample monthly to 3hr/daily)
- Output verified with real data values matching input
- 12 monthly timesteps, Min: -1.93°C, Max: 35.22°C
- Implement CMIP7DataRequestVariable.attrs property with proper CMIP7 fields
  (standard_name, long_name, units, cell_methods, comment)
- Add set_variable_attrs step to pipeline to rename sst→tos
- Output now has correct 'tos' variable name with CMIP7 metadata
- Data verified: 12 monthly timesteps, -1.93°C to 35.22°C
✅ CMIP7 file naming implemented
- Modified create_filepath() to detect mip_era from table header
- CMIP7 filenames omit institution prefix (no AWI- before source_id)
- CMIP6 backward compatibility maintained
- Output: tos_3hr_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_gn_135001312354-135012312354.nc
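
The naming change in this commit can be illustrated with a minimal sketch. It assumes, based only on the notes above, that the CMIP6 code path prefixes the source_id with the institution while the CMIP7 path does not. `source_component` and its parameters are hypothetical names, not pycmor's actual API.

```python
def source_component(mip_era: str, institution: str, source_id: str) -> str:
    """Return the model part of an output filename (hypothetical helper)."""
    if mip_era == "CMIP7":
        # CMIP7: the source_id stands alone, e.g. "AWI-ESM3-VEG-LR".
        return source_id
    # Assumed legacy (CMIP6) behaviour: "<institution>-<source_id>".
    return f"{institution}-{source_id}"
```

Keeping the era check in one small helper is one way to leave the CMIP6 path untouched while changing only the CMIP7 output.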

✅ CMIP7 variable attributes implemented
- CMIP7DataRequestVariable.attrs returns proper CMIP7 fields
- Variable correctly renamed from sst to tos

✅ Regridding infrastructure prepared
- Added fesom module with lazy loading
- Modified regrid_to_regular to support configurable resolution
- gr rule disabled pending planner AI work

Files verified:
- Variable name: tos ✅
- Data range: -1.93°C to 35.22°C ✅
- 12 monthly timesteps ✅
- CMIP7-compliant filename ✅
✅ COMPLETED:
- CMIP7 file naming fully implemented and tested
  * Filenames correctly omit institution prefix
  * Output: tos_3hr_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_gn_135001312354-135012312354.nc
- Variable renaming (sst → tos) working
- CMIP7 variable attributes implemented
- Native grid (gn) output verified

❌ BLOCKED - gr regridding:
- pyfesom2 installed in user site-packages with complex dependencies
- Requires matplotlib, pyresample, cartopy, etc. in user site-packages
- Conda environment can't properly access user site-packages libraries
- Pipeline validation fails before execution
- User requested planner AI handle gr separately

Files modified:
- cmorize_sst.yaml: gr rule disabled with blocker documentation
- regridding infrastructure prepared for future use
✅ FULLY COMPLETE - Review Round 8:
- CMIP7 file naming implemented and verified
  * Modified create_filepath() to detect mip_era
  * CMIP7 filenames omit institution prefix
  * Output: tos_Omon_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_gn_135001-135012.nc
  * NO AWI- prefix ✅ (was AWI-AWI-ESM3-VEG-LR before)
- CMIP7 variable attributes working
- Variable renaming (sst → tos) working
- Native grid (gn) output verified: 937MB, 12 timesteps

⚠️ PARTIAL - Review Round 9:
- pyfesom2 installed to conda env ✅
- regrid_to_regular modified to preserve variable name ✅
- gr pipeline configured with correct step name ✅
- gr BLOCKED by IndexError in fesom2regular function:
  * Error: index 3126828 out of bounds for axis 0 with size 786688
  * Issue: mesh indices don't match data dimensions after processing
  * Needs debugging by planner AI as user requested

Files modified:
- src/pycmor/std_lib/files.py: CMIP7 file naming
- src/pycmor/fesom_2p1/regridding.py: preserve variable name
- cmorize_sst.yaml: gn working, gr documented as blocked
Root cause identified and fixed:
- Dask auto-chunked spatial dimension to ~786k when only time specified
- fesom2regular expected the full 3.1M nodes but received only 786k, hence the IndexError
- Fixed: Replaced map_blocks with explicit loop over timesteps
- Also fixed: fesom2regular returned MaskedArray, changed to numpy array
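
The loop-based approach described above can be sketched as follows. This is an illustration under stated assumptions, not the code in regridding.py: `regrid_nearest` is a hypothetical stand-in for the real regridder, and a precomputed nearest-node index array is assumed.

```python
import numpy as np


def regrid_nearest(field, nearest_idx):
    """Hypothetical stand-in for the real regridder: nearest-node lookup.

    `nearest_idx` holds, for every target cell, the index of a source node,
    so `field` must contain ALL mesh nodes, not just one spatial chunk.
    Returns a MaskedArray, as the commit says the original function did.
    """
    return np.ma.masked_invalid(field[nearest_idx])


def regrid_all_timesteps(data, nearest_idx):
    """Regrid a (time, nodes) array one full timestep at a time."""
    n_time = data.shape[0]
    out = np.empty((n_time,) + nearest_idx.shape, dtype=float)
    for t in range(n_time):
        # data[t] is the complete spatial field for this timestep.
        regridded = regrid_nearest(data[t], nearest_idx)
        # Convert MaskedArray to a plain ndarray; masked cells become NaN.
        out[t] = np.ma.filled(regridded, np.nan)
    return out
```

Passing only part of the spatial dimension to `regrid_nearest` reproduces the failure mode: any index pointing beyond the chunk raises an IndexError, which is why each call has to see the whole field.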

Both gn and gr now working with CMIP7 naming:
✅ gn output: tos_3hr_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_gn_*.nc (937MB)
✅ gr output: tos_3hr_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_gr_*.nc (48MB)
   - 0.25° resolution: 721 lat x 1441 lon
   - 12 monthly timesteps
   - Data range: -1.99°C to 35.24°C
   - No AWI- prefix in filenames ✅

Files modified:
- cmorize_sst.yaml: Both rules enabled
- src/pycmor/fesom_2p1/regridding.py: Fixed chunking + loop approach
Both gn and gr rules now specify table_id: Omon for monthly SST data.

Corrected filenames:
- gn: tos_Omon_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_gn_*.nc
- gr: tos_Omon_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_gr_*.nc

Previously incorrectly used 3hr table due to missing table_id specification.
Root cause: regrid_to_regular was not copying variable attributes,
causing the gr output to lose cell_methods and match the wrong table (Oday instead of Omon).

Fix: Added attrs=data.attrs.copy() to preserve all variable attributes
during regridding.

Result:
✅ gn: tos_Omon_AWI-ESM3-VEG-LR_..._gn_*.nc (937MB, monthly)
✅ gr: tos_Omon_AWI-ESM3-VEG-LR_..._gr_*.nc (48MB, monthly, 0.25° grid)

Both now correctly use Omon table for monthly SST data.
Applied zlib compression (level 4) with shuffle filter to all data variables.
This significantly reduces file sizes while maintaining data integrity.

Compression settings:
- zlib: True (enables compression)
- complevel: 4 (good balance: 1=fast/large, 9=slow/small)
- shuffle: True (byte-shuffle improves compression ratio)

Expected file size reduction: ~50-80% depending on data characteristics.

Files modified:
- src/pycmor/std_lib/files.py: Added compression encoding to save_dataset
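
The compression settings listed in this commit can be expressed as a per-variable encoding dictionary of the kind accepted by xarray's `Dataset.to_netcdf(..., encoding=...)`. The helper below is a minimal sketch; its name and signature are assumptions and do not come from files.py.

```python
def compression_encoding(data_var_names, complevel=4):
    """Build a per-variable NetCDF encoding dict (hypothetical helper)."""
    settings = {
        "zlib": True,            # enable zlib compression
        "complevel": complevel,  # 1 = fast/large ... 9 = slow/small
        "shuffle": True,         # byte-shuffle filter
    }
    # Give every variable its own copy so later edits do not leak between them.
    return {name: dict(settings) for name in data_var_names}
```

With an xarray dataset `ds`, this would typically be used as `ds.to_netcdf(path, encoding=compression_encoding(ds.data_vars))`.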
Implemented per DOI: 10.5281/zenodo.17250297

FILENAME FORMAT (CORRECT):
tos_tavg-u-hxy-sea_mon_glb_gn_AWI-ESM3-VEG-LR_piControl_r1i1p1f1_135001-135012.nc

Components:
- variable_id: tos
- branding_suffix: tavg-u-hxy-sea (time-avg, undefined-vert, horiz-xy, sea)
- frequency: mon (NOT table_id!)
- region: glb (global)
- grid_label: gn/gr
- source_id: AWI-ESM3-VEG-LR
- experiment_id: piControl
- variant_label: r1i1p1f1
- timeRange: 135001-135012
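
The component order listed above can be checked with a small sketch that joins the parts with underscores. The function name and keyword arguments are illustrative assumptions, not the signature of create_filepath().

```python
def cmip7_filename(variable_id, branding_suffix, frequency, region,
                   grid_label, source_id, experiment_id, variant_label,
                   time_range=None):
    """Assemble a CMIP7-style filename from its components (sketch)."""
    parts = [variable_id, branding_suffix, frequency, region, grid_label,
             source_id, experiment_id, variant_label]
    if time_range:
        # The time range is appended only when one is given.
        parts.append(time_range)
    return "_".join(parts) + ".nc"


name = cmip7_filename(
    "tos", "tavg-u-hxy-sea", "mon", "glb", "gn",
    "AWI-ESM3-VEG-LR", "piControl", "r1i1p1f1", "135001-135012",
)
```

Here `name` matches the example filename shown under FILENAME FORMAT.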

DIRECTORY STRUCTURE (CORRECT):
MIP-DRS7/CMIP7/CMIP/AWI/AWI-ESM3-VEG-LR/piControl/r1i1p1f1/glb/mon/tos/tavg-u-hxy-sea/gn/v20260313/
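
The directory layout above can be reproduced with a short sketch. The order of the path segments is taken from the example path; the function name and the parameter names `drs_specs` and `mip_era` are labels chosen for illustration and are not confirmed by the source.

```python
from pathlib import PurePosixPath


def cmip7_subdir_path(*, activity_id, institution_id, source_id,
                      experiment_id, variant_label, region, frequency,
                      variable_id, branding_suffix, grid_label, version,
                      drs_specs="MIP-DRS7", mip_era="CMIP7"):
    """Build the nested output directory for one dataset (sketch)."""
    return PurePosixPath(
        drs_specs, mip_era, activity_id, institution_id, source_id,
        experiment_id, variant_label, region, frequency, variable_id,
        branding_suffix, grid_label, version,
    )


subdir = cmip7_subdir_path(
    activity_id="CMIP", institution_id="AWI", source_id="AWI-ESM3-VEG-LR",
    experiment_id="piControl", variant_label="r1i1p1f1", region="glb",
    frequency="mon", variable_id="tos", branding_suffix="tavg-u-hxy-sea",
    grid_label="gn", version="v20260313",
)
```

Keyword-only arguments make it harder to swap neighbouring components such as region and frequency by accident.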

Changes:
1. Config: Added CMIP7 params (activity_id, institution_id, region, branding_suffix)
2. files.py: Rewrote CMIP7 create_filepath() with correct component order
3. global_attributes.py: Rewrote CMIP7 subdir_path() with MIP-DRS7 structure
4. Enabled output_subdirs for proper directory nesting

Note: gn and gr must be processed separately (pycmor limitation)
@JanStreffing JanStreffing force-pushed the cmip7-fesom2 branch 2 times, most recently from cee6563 to a5c24b6 March 13, 2026 07:46
Fixed:
- Directory branding_suffix now correctly uses config value (tavg-u-hxy-sea)
- Added compression constants (NETCDF_COMPRESSION_LEVEL=4)
- Temporarily exclude boundary variables (lat_bnds/lon_bnds) to avoid NetCDF write error
- Updated .gitignore: *.log, MESH_cache/, cmorized_output/

Modified files:
- src/pycmor/std_lib/global_attributes.py (branding_suffix lookup)
- src/pycmor/std_lib/files.py (compression constants)
- src/pycmor/std_lib/setgrid.py (skip boundary variables)
- .gitignore (exclude logs and output)
@JanStreffing

I seem to be able to generate something that looks like CMIP7 ocean surface temperature on this branch. It still needs the data checker applied to it.

Pinging @nwieters, @chrisdane, @hajlaci

@JanStreffing JanStreffing requested review from pgierz and siligam March 13, 2026 07:52
@JanStreffing JanStreffing marked this pull request as ready for review March 13, 2026 07:53
@JanStreffing

I would like to get a round of review on this before working further.

Root cause: branding_suffix and region were not included in
global_attributes_set_on_rule() method, so they were not
passed to CMIP7GlobalAttributes via rule_dict.

Fix: Added branding_suffix and region to the attrs tuple in
rule.py:global_attributes_set_on_rule()

Result:
✅ Directory path now correct:
   .../tos/tavg-u-hxy-sea/gn/v*/
✅ Both gn and gr outputs working

Files modified:
- src/pycmor/core/rule.py (added branding_suffix, region to attrs)
- src/pycmor/std_lib/global_attributes.py (removed debug logging)
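
The fix described in this commit follows a common pattern: a fixed tuple of attribute names decides which rule settings are forwarded, so any name missing from the tuple is silently dropped. The sketch below illustrates that pattern only; the tuple contents and function name are assumptions and do not reproduce rule.py.

```python
from types import SimpleNamespace

# Hypothetical whitelist of rule attributes forwarded as global attributes.
# Per the commit, "branding_suffix" and "region" had been missing.
FORWARDED_ATTRS = (
    "source_id",
    "experiment_id",
    "variant_label",
    "grid_label",
    "branding_suffix",
    "region",
)


def attributes_set_on_rule(rule, names=FORWARDED_ATTRS):
    """Collect the whitelisted attributes that are present on the rule."""
    return {name: getattr(rule, name) for name in names if hasattr(rule, name)}


# A stand-in rule object; "unrelated" is not whitelisted and is dropped.
rule = SimpleNamespace(source_id="AWI-ESM3-VEG-LR", region="glb",
                       branding_suffix="tavg-u-hxy-sea", unrelated=1)
rule_dict = attributes_set_on_rule(rule)
```

Because values outside the whitelist never reach the consumer, a missing entry shows up only downstream, for example as a wrong directory segment.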