This issue thread serves to note the reproducibility issues in HEMCO within CESM2 which should eventually be fixed for: ESCOMP/CAM#856
For debugging HEMCO_CESM, I suggest using CAM-chem compsets (e.g., FCnudged, FC2010climo, ...) because CAM-chem is known to be bit-for-bit (b4b) reproducible and GEOS-Chem compsets likely are not. The goal of this issue is to ensure that the physics buffer and history fields (e.g., HCO_NO, HCO_NH3, HCO_CO, ...) match bit-for-bit across restarts, different MPI decompositions, and different OpenMP threading configurations.
Test/debug workflow
This setup will help debug the issues.
- Check out https://github.com/ESCOMP/CESM (ESCOMP/CESM). cesm2_3_alpha17c was used here, but any release with HEMCO (post-cam6_3_118) should do.
./manage_externals/checkout_externals
- The hplin/debug_parallel branch of jimmielin/HEMCO_CESM (https://github.com/jimmielin/HEMCO_CESM/tree/hplin/debug_parallel) may be useful for components/cam/src/hemco, as it has some debug printouts that appear in cesm.log.
- Create a case:
./create_newcase --case ~/2403_dev_hco_2.3/2403_dev_hco_2.3-f10_singlecore --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported --mach derecho --project UHAR0022
The f10_f10_mg37 resolution is 10x15 degrees and coarse enough to run on 1 core. I suggest using FC2010climo or another compset that is not FCnudged, so that configuring nudging / met fields in user_nl_cam can be avoided.
- cd to the case directory and run ./xmlchange NTASKS=1 for a single core, NTASKS=2 for two cores, etc. For the 10x15 case, use NTHRDS=1 (I have not successfully run with more than 1 thread on this grid).
- Run ./case.setup --reset, then fill user_nl_cam with:
hemco_config_file = '/glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc',
cam_physics_mesh = '/glade/campaign/cesm/cesmdata/inputdata/share/meshes/10x15_nomask_c110308_ESMFmesh.nc'
hemco_grid_xdim = 24,
hemco_grid_ydim = 19,
fincl1 = 'T', 'HCO_CO', 'HCO_NO', 'HCO_NH3', 'CO', 'O3', 'NO', 'HCO_EDGAR_TODNOX'
mfilt = 1,
nhtfrq = 1,
The /glade/u/home/hplin/2403_dev_hco_2.3/HEMCO_Config.CC.TestOnly.c240331.rc test config file only includes CEDS emissions of NO, CO, and NH3, with NO having a 1x1 gridded scale factor. This makes it easier to debug and much quicker to run.
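- ./case.build -v
- To work around the numactl issue (binding to certain cores returning invalid on 5x5_amazon resolution, NCAR/mpibind#5), edit env_batch.xml and change the command in <directive gpu_enabled="false"> to always request 128 cores from the scheduler (change {{ max_tasks_per_node }} to 128):
<directive> -l select={{ num_nodes }}:ncpus=128:mpiprocs={{ tasks_per_node }}:ompthreads={{ thread_count }}:mem=230GB</directive>
- Change env_run.xml: RUN_STARTDATE=2016-01-01, STOP_OPTION=nhours, STOP_N=3 (shorter runs may not work due to coupling intervals)
- Submit the case
- Create separate case directories for 1 core, 2 cores, etc., because a clean recompile is needed to change the core configuration.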
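Since every core count needs its own case directory, the steps above can be sketched as a small script that prints the command sequence for each case, so it can be reviewed before anything is run. This is only a sketch under assumptions: CESM_ROOT (the checkout location), the cime/scripts location of create_newcase, the "-f10_ntasksN" case naming, and the use of ./case.submit for the submit step are not from this thread and should be adjusted to your setup.

```shell
#!/bin/sh
# Sketch: print the setup commands for one case per MPI task count.
# CESM_ROOT and the "-f10_ntasksN" case naming are hypothetical placeholders.
CESM_ROOT="${CESM_ROOT:-$HOME/CESM}"
CASE_ROOT="${CASE_ROOT:-$HOME/2403_dev_hco_2.3}"
PROJECT="${PROJECT:-UHAR0022}"

# Print the command sequence for a single case with $1 MPI tasks.
print_case_commands() {
    ntasks="$1"
    casedir="$CASE_ROOT/2403_dev_hco_2.3-f10_ntasks${ntasks}"
    cat <<EOF
cd $CESM_ROOT/cime/scripts
./create_newcase --case $casedir --compset FC2010climo_HCO --res f10_f10_mg37 --run-unsupported --mach derecho --project $PROJECT
cd $casedir
./xmlchange NTASKS=$ntasks
./xmlchange NTHRDS=1
./case.setup --reset
# (manual) fill user_nl_cam with the namelist settings listed earlier
./case.build -v
# (manual) edit env_batch.xml to request 128 cores
./xmlchange RUN_STARTDATE=2016-01-01,STOP_OPTION=nhours,STOP_N=3
./case.submit
EOF
}

# One independent case directory per task count.
for n in 1 2 4; do
    print_case_commands "$n"
    echo
done
```

Printing rather than executing keeps the two manual edits (user_nl_cam and env_batch.xml) visible at the point where they belong in the sequence.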
Debugging output is in cesm.log.* and organized per CPU.
The cprnc tool is very useful for comparing two netCDF files for bit-for-bit matches. I use this alias in my .zshrc:
alias cprnc="/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc"
Usage: cprnc <file1> <file2>
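For scripted comparisons (for example, the 1-core history file against the 2-core one), a wrapper that turns the cprnc result into an exit status may be handy. The sketch below assumes that the cprnc summary contains the phrase "files seem to be IDENTICAL" when the two files match; check this against the output of your cprnc build. The function name files_b4b is made up for illustration.

```shell
#!/bin/sh
# Path taken from the alias above; override CPRNC if your copy lives elsewhere.
CPRNC="${CPRNC:-/glade/campaign/cesm/cesmdata/cseg/tools/cime/tools/cprnc/cprnc}"

# Compare two netCDF files and return 0 only if cprnc reports them identical.
# Assumption: the cprnc summary says "files seem to be IDENTICAL" on a match.
files_b4b() {
    if "$CPRNC" "$1" "$2" | grep -q "files seem to be IDENTICAL"; then
        echo "MATCH $1 $2"
    else
        echo "DIFF $1 $2"
        return 1
    fi
}
```

Usage would look like files_b4b <file1> <file2>, with the two history files taken from the run directories of the cases being compared.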