Reproducibility issues in HEMCO_CESM and investigation #31

@jimmielin

This issue thread serves to note the reproducibility issues in HEMCO within CESM2 which should eventually be fixed for: ESCOMP/CAM#856

For debugging HEMCO_CESM, use CAM-chem compsets (e.g., FCnudged, FC2010climo, ...), because CAM-chem is known to be bit-for-bit (b4b) reproducible and GEOS-Chem compsets likely are not. The goal of this issue is to ensure that the physics buffer and history fields (e.g., HCO_NO, HCO_NH3, HCO_CO, ...) match bit-for-bit across restarts, different MPI decompositions, and different OpenMP thread counts.

Test/debug workflow

This setup will help debug the issues.

  • Check out https://github.com/ESCOMP/CESM (ESCOMP/CESM). cesm2_3_alpha17c was used here, but any release with HEMCO (post-cam6_3_118) should do.
  • ./manage_externals/checkout_externals
  • Using the hplin/debug_parallel branch of jimmielin/HEMCO_CESM (https://github.com/jimmielin/HEMCO_CESM/tree/hplin/debug_parallel) for components/cam/src/hemco may be useful, as it adds debug printouts that appear in cesm.log.
  • Create a case: ./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported --mach derecho --project UHAR0022. The f10_f10_mg37 resolution is 10x15 degrees, coarse enough to run on 1 core. I suggest using FC2010climo or another compset that is not FCnudged, so that configuring nudging / met fields in user_nl_cam can be avoided.
  • cd to the case directory, then ./xmlchange NTASKS=1 for a single core, NTASKS=2 for two cores, etc. For the 10x15 case, use NTHRDS=1 (I have not successfully run with more than 1 thread on this grid).
  • ./case.setup --reset, then fill user_nl_cam with:
hemco_config_file = '/glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc',

cam_physics_mesh = '/glade/campaign/cesm/cesmdata/inputdata/share/meshes/10x15_nomask_c110308_ESMFmesh.nc'
hemco_grid_xdim = 24,
hemco_grid_ydim = 19,

fincl1 = 'T', 'HCO_CO', 'HCO_NO', 'HCO_NH3', 'CO', 'O3', 'NO', 'HCO_EDGAR_TODNOX'
mfilt = 1,
nhtfrq = 1,

The /glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc test config file only contains CEDS emissions of NO, CO, and NH3, with NO having a 1x1 gridded scale factor. This makes it easier to debug and much quicker to run.

<directive> -l select={{ num_nodes }}:ncpus=128:mpiprocs={{ tasks_per_node }}:ompthreads={{ thread_count }}:mem=230GB</directive>
  • Change env_run.xml: RUN_STARTDATE=2016-01-01, STOP_OPTION=nhours, STOP_N=3 (shorter may not work due to coupling intervals)
  • Submit the case
  • Create separate case directories for 1 core, 2 cores, etc., because a clean recompile is needed to change the core configuration.
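
The steps above can be scripted so that each core count gets its own case directory. The following is only a rough sketch, not a tested workflow: CESM_ROOT, CASE_ROOT, the case naming scheme, and the DRY_RUN switch are hypothetical names introduced here, the location of create_newcase under cime/scripts is assumed from a standard CESM checkout, and the ./case.build step is implied rather than stated in the list. The compset, resolution, machine, project, and namelist values are copied from this issue. By default it only prints the commands.

```shell
# Sketch: one case directory per MPI task count, following the steps above.
# Hypothetical placeholders: CESM_ROOT, CASE_ROOT, and the case naming scheme.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
CESM_ROOT="${CESM_ROOT:-$HOME/CESM}"
CASE_ROOT="${CASE_ROOT:-$HOME/2403_dev_hco_2.3}"
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or just echo it in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Case directory for a given task count (the naming scheme is an assumption).
case_dir() {
  echo "${CASE_ROOT}/2403_dev_hco_2.3-f10_ntasks$1"
}

# Append the namelist settings from this issue to the given user_nl_cam file.
write_user_nl() {
  cat >> "$1" <<'EOF'
hemco_config_file = '/glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc',

cam_physics_mesh = '/glade/campaign/cesm/cesmdata/inputdata/share/meshes/10x15_nomask_c110308_ESMFmesh.nc'
hemco_grid_xdim = 24,
hemco_grid_ydim = 19,

fincl1 = 'T', 'HCO_CO', 'HCO_NO', 'HCO_NH3', 'CO', 'O3', 'NO', 'HCO_EDGAR_TODNOX'
mfilt = 1,
nhtfrq = 1,
EOF
}

# Create, configure, build, and submit one case. The body runs in a subshell
# so that the cd does not leak into the caller.
setup_case() (
  ntasks="$1"
  dir="$(case_dir "$ntasks")"
  run "${CESM_ROOT}/cime/scripts/create_newcase" --case "$dir" \
    --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported \
    --mach derecho --project UHAR0022
  if [ "$DRY_RUN" != "1" ]; then cd "$dir" || exit 1; fi
  run ./xmlchange "NTASKS=${ntasks}"
  run ./xmlchange NTHRDS=1
  run ./case.setup --reset
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ append namelist settings to ${dir}/user_nl_cam"
  else
    write_user_nl user_nl_cam
  fi
  run ./xmlchange RUN_STARTDATE=2016-01-01
  run ./xmlchange STOP_OPTION=nhours
  run ./xmlchange STOP_N=3
  run ./case.build
  run ./case.submit
)

# One case for 1 core and one for 2 cores.
for n in 1 2; do
  setup_case "$n"
done
```

Setting DRY_RUN=0 would execute the commands for real. Keeping one directory per task count avoids the clean recompile that changing the core configuration of an existing case would require.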

Debugging output is in cesm.log.* and organized per CPU.

The cprnc tool is very useful for checking whether two netCDF files match bit-for-bit. I use this alias in my .zshrc:

alias cprnc="/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc"

Usage: cprnc <file1> <file2>
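
To compare history files from two cases without reading the full report, the cprnc output can be reduced to a single verdict. This is a minimal sketch under an assumption: that cprnc ends its report with a line containing "files seem to be IDENTICAL" when all compared fields match. The helper names cprnc_verdict and is_b4b and the CPRNC variable are hypothetical, introduced here for illustration.

```shell
# Sketch: decide whether two history files match bit-for-bit using cprnc.
# Assumption: the cprnc report contains "files seem to be IDENTICAL"
# when every compared field matches.
CPRNC="${CPRNC:-/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc}"

# Read a cprnc report on stdin and print "identical" or "different".
cprnc_verdict() {
  if grep -q "files seem to be IDENTICAL"; then
    echo "identical"
  else
    echo "different"
  fi
}

# Compare two netCDF files; returns 0 only for a bit-for-bit match.
# A missing or failing cprnc is reported as "different".
is_b4b() {
  verdict="$("$CPRNC" "$1" "$2" 2>&1 | cprnc_verdict)"
  echo "$1 vs $2: $verdict"
  [ "$verdict" = "identical" ]
}
```

Usage: is_b4b <file1> <file2>, for example with the history file from the 1-core case and the one from the 2-core case. The exit status makes it easy to use in a loop over several case pairs.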
