
Fix 06z gfs_sfcanl jobs failing due to a bug in CCPP #4389

Merged
DavidHuber-NOAA merged 18 commits into NOAA-EMC:develop from DavidHuber-NOAA:fix/gfs_sfcanl
Jan 20, 2026

Conversation

@DavidHuber-NOAA
Contributor

@DavidHuber-NOAA DavidHuber-NOAA commented Dec 29, 2025

Description

This fixes a bug in CCPP that prevented the 06Z gfs_sfcanl job from running global_cycle on the 03Z IAU time. The code change now allows the input hour to be any integer between 0 and 23.
Resolves #4364
Refs #4408 (partially resolves but a full investigation is needed)
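
The actual fix lives in CCPP Fortran, but the validation it describes can be sketched abstractly in shell (the function name here is hypothetical, purely for illustration): accept any integer hour in [0, 23], such as the 03Z IAU time used by the 06Z cycle.

```shell
# Illustrative only (the real fix is in CCPP Fortran): validate that an
# input hour is an integer in [0, 23]. Function name is hypothetical.
valid_hour() {
  local hh=$1
  # Force base-10 so zero-padded hours like "03" or "09" parse correctly.
  [[ ${hh} =~ ^[0-9]+$ ]] && (( 10#${hh} >= 0 && 10#${hh} <= 23 ))
}
```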

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

How has this been tested?

  • C96_atm3DVar_extended test on WCOSS2
  • UFS_Utils regression tests on Ursa (no change to baseline)
  • UFS regression tests
  • Full suite of GW tests on all platforms (when the UFS model hash is ready)

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added

@DavidHuber-NOAA DavidHuber-NOAA added the GFS Change This PR, if merged, will change results for the GFS. label Dec 29, 2025
@DavidHuber-NOAA DavidHuber-NOAA changed the title Fix/gfs sfcanl Fix 06z gfs_sfcanl jobs failing due to a bug in CCPP Dec 29, 2025
Comment on lines +72 to +73
local MOM6_OUTPUT_DIR="${MOM6_OUTPUT_DIR:-./MOM6_OUTPUT}"
local MOM6_RESTART_DIR="${MOM6_RESTART_DIR:-./MOM6_RESTART}"
Contributor

Suggested change
- local MOM6_OUTPUT_DIR="${MOM6_OUTPUT_DIR:-./MOM6_OUTPUT}"
- local MOM6_RESTART_DIR="${MOM6_RESTART_DIR:-./MOM6_RESTART}"
+ local MOM6_OUTPUT_DIR="./MOM6_OUTPUT"
+ local MOM6_RESTART_DIR="./MOM6_RESTART"

This isn't an option. The workflow will always write to this space. The other instance where this variable is set in this manner is in ush/parsing_namelists_FV3.sh.
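
For reference, the `${VAR:-default}` form being debated only supplies the default when the variable is unset or empty, which is why leaving it configurable lets the environment silently move the directory. A minimal sketch:

```shell
# Demonstrates the ${VAR:-default} expansion under discussion: the default
# is used only when the variable is unset or empty, so an exported value
# from the environment overrides the hard-coded path.
unset MOM6_OUTPUT_DIR
dir_default="${MOM6_OUTPUT_DIR:-./MOM6_OUTPUT}"   # default applies

MOM6_OUTPUT_DIR="/some/other/place"
dir_override="${MOM6_OUTPUT_DIR:-./MOM6_OUTPUT}"  # environment value wins

echo "${dir_default} ${dir_override}"
```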

Contributor

I agree; I don't think this should be configurable. This is assumed in other places.

I also see that in ufs-community/ufs-weather-model@6f22f57...f8b0802 we have a MOM6_OUTPUT_FH, which is defined in forecast_predet.

I'm assuming those are consistent definitions?

aerorahul previously approved these changes Jan 6, 2026
Contributor

@aerorahul left a comment
just one suggestion, looks good.

Contributor

@JessicaMeixner-NOAA left a comment

We should run an S2S test of some sort to make sure ocean output looks okay given the changes. I didn't follow exactly when those changes went in. @jiandewang or @dpsarmie might know more about the MOM6_OUTPUT_FH to make sure it is as expected here now.


@dpsarmie
Contributor

dpsarmie commented Jan 6, 2026

FHOUT_OCN_GFS / FHOUT_OCN is the variable that controls the MOM6 output frequency in GW and MOM6_OUTPUT_FH is just used to store an array of output times (in GW), correct? If so, then there might be an issue.

@JessicaMeixner-NOAA
Contributor

> FHOUT_OCN_GFS / FHOUT_OCN is the variable that controls the MOM6 output frequency in GW and MOM6_OUTPUT_FH is just used to store an array of output times (in GW), correct? If so, then there might be an issue.

Yes. And looking a bit further, this variable should be "FHOUT_OCN" and not MOM6_OUTPUT_FH.
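
As a rough illustration of the relationship described above (variable names are taken from the discussion; the derivation itself is a hypothetical sketch, not the actual forecast_predet implementation), an output-hour array can be generated from the output frequency:

```shell
# Hypothetical sketch: build an array of ocean output forecast hours from
# an output frequency (FHOUT_OCN) and a forecast length (FHMAX). This is
# not the real forecast_predet logic, only the relationship described.
FHOUT_OCN=6
FHMAX=24
mapfile -t MOM6_OUTPUT_FH < <(seq "${FHOUT_OCN}" "${FHOUT_OCN}" "${FHMAX}")
echo "${MOM6_OUTPUT_FH[*]}"
```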

Contributor

@ClaraDraper-NOAA left a comment

I haven't tested it, but can confirm that this PR brings the needed update to CCPP/physics into UFS_UTILS to fix the gfs_sfcanl issue. It also updates the CCPP/physics used by the model to the same hash.

Note that updating the ufs_model and ufs_utils brings in many changes beyond the gfs_sfcanl fix.

@DavidHuber-NOAA
Contributor Author

All tests passed on Ursa. I will run a develop C96C48mx500 case to verify the MOM6 output, then launch CI on all platforms.

@DavidHuber-NOAA
Contributor Author

I verified the 6-hour MOM6 forecast for the C96C48mx500_S2SW_cyc_gfs case was identical to last night's nightly run of the same case and also verified that MOM6 output was written every 6 hours as expected:

> cmp /scratch3/NCEPDEV/stmp/David.Huber/rt_4389/COMROOT/C96C48mx500_S2SW_cyc_gfs_4389/gfs.20211220/18/model/ocean/history/gfs.t18z.6hr_avg.f120.nc /scratch3/NCEPDEV/global/role.glopara/GFS_CI_CD/URSA/BUILDS/GITLAB/nightly_6ada5183_010726/RUNTESTS/COMROOT/C96C48mx500_S2SW_cyc_gfs_6ada5183-6840/gfs.20211220/18/model/ocean/history/gfs.t18z.6hr_avg.f120.nc
> echo $?
0

Proceeding with CI testing.
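
The bitwise check above generalizes to any pair of history files: `cmp -s` prints nothing and returns 0 only when the files are identical. A small sketch (the wrapper function and file names are hypothetical):

```shell
# Compare two files bit-for-bit; cmp -s is silent and returns 0 only when
# the contents are identical. The wrapper name is made up for illustration.
compare_history() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "differs"
  fi
}
```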

@emcbot emcbot removed the CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress label Jan 15, 2026
@DavidHuber-NOAA
Contributor Author

The gaea c6 failure was caused by a build stalling on a known-bad head node. I will relaunch tests there now that CI is running there again smoothly.

The SFS test failure on Ursa is known and that test has since been removed from the CI matrix.

The Hera failure is the only concerning issue. /scratch3/NCEPDEV/global/role.glopara/GFS_CI_CD/HERA/BUILDS/GITLAB/pr_cases_4389_f6105b9e_7163/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_f6105b9e-7163/logs/2021032500/gfs_arch_tar_gfswave.log reports a failure on line 1231:

FileNotFoundError: FATAL ERROR: Required file, directory, or glob gfs.20210325/00/products/wave/gridded/global.0p50/gfs.t00z.*.*.f017.grib2 not found!

The global.0p50 directory does not exist; instead, it is global.0p16. This is likely a grid change that came in with the latest UFS model upgrade, which includes a WW3 update, and a namelist variable change is probably required.
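
The archiver's check behaves like a required-glob assertion: fail hard when nothing matches the expected pattern. A shell sketch of the same idea (the helper name is hypothetical, not the actual archive code):

```shell
# Sketch of a required-glob check like the one behind the FATAL ERROR
# above: fail when nothing matches the pattern. Helper name is made up.
require_glob() {
  local pattern=$1
  # compgen -G expands the glob and returns nonzero when nothing matches.
  if ! compgen -G "${pattern}" > /dev/null; then
    echo "FATAL ERROR: Required file, directory, or glob ${pattern} not found!" >&2
    return 1
  fi
}
```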

@DavidHuber-NOAA
Contributor Author

@JessicaMeixner-NOAA @dpsarmie is there a required WW3 update needed with this update? It appears that the wave grid changed unexpectedly from 0p50 to 0p16.

@DavidHuber-NOAA
Contributor Author

DavidHuber-NOAA commented Jan 16, 2026

@JessicaMeixner-NOAA @dpsarmie actually, this does not seem to be related to the model update. Last night's nightly also ran WW3 at 0p16 and archived correctly. I will continue investigating.

@DavidHuber-NOAA
Contributor Author

I see now what happened. The gfs_arch_tar_gfswave job launched before the gfs_wavepostsbs_f015-f017 job completed. This reflects another archiving dependency that is missing (#4408). I will add this dependency as well as the gfs_waveawipsgridded dependency (also documented in #4408) and relaunch tests on Hera, C6, Ursa, and WCOSS2.

@DavidHuber-NOAA
Contributor Author

I am still wrong. The wavepostsbs metatask is already a dependency of the archiving jobs. Instead, we have a silent failure. This is in /scratch3/NCEPDEV/global/role.glopara/GFS_CI_CD/HERA/BUILDS/GITLAB/pr_cases_4389_f6105b9e_7163/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_f6105b9e-7163/logs/2021032500/gfs_wavepostsbs_f015-f017.log during an MPMD section of the job at line 11403:

3: + wave_grib2_sbs.sh[88]/scratch3/NCEPDEV/global/role.glopara/GFS_CI_CD/HERA/BUILDS/GITLAB/pr_cases_4389_f6105b9e_7163/global-workflow/exec/gfs_ww3_grib.x
3: + wave_grib2_sbs.sh[89]export err=126
3: + wave_grib2_sbs.sh[89]err=126
3: + wave_grib2_sbs.sh[90][[ 126 -ne 0 ]]
3: + wave_grib2_sbs.sh[91]echo 'FATAL ERROR: gfs_ww3_grib.x returned non-zero status: 126; exiting!'
3: FATAL ERROR: gfs_ww3_grib.x returned non-zero status: 126; exiting!
3: + wave_grib2_sbs.sh[92]exit 126

Going to the end of the MPMD job at line 9891:

+ run_mpmd.sh[66]IFS=
+ run_mpmd.sh[66]read -r line
+ run_mpmd.sh[71]unset_strict
+ preamble.sh[45]set +eu
+ preamble.sh[46]set +o pipefail
+ run_mpmd.sh[73]srun -l --export=ALL --hint=nomultithread --multi-prog --output=mpmd.%j.%t.out -n 7 /scratch3/NCEPDEV/stmp/role.glopara/RUNDIRS/C48mx500_3DVarAOWCDA_f6105b9e-7163/gfs.2021032500/wavepostsbs_f017.82108/mpmd_cmdfile
+ run_mpmd.sh[74]err=0
+ run_mpmd.sh[75]set_strict
+ preamble.sh[35][[ YES == \Y\E\S ]]
+ preamble.sh[37]set -eu
+ preamble.sh[39]set -o pipefail
+ run_mpmd.sh[101][[ 0 -eq 0 ]]

It appears that run_mpmd.sh is not correctly catching failures of individual jobs.
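
One way to guard against this class of silent failure (a sketch, not a patch to run_mpmd.sh; the helper name is hypothetical) is to scan each task's output file for the fatal-error marker in addition to checking the launcher's return code:

```shell
# Hypothetical helper: return nonzero if any per-task MPMD output file
# contains a fatal-error marker, regardless of what the launcher reported.
check_mpmd_outputs() {
  local rc=0 out
  for out in "$@"; do
    [[ -e ${out} ]] || continue
    if grep -q "FATAL ERROR" "${out}"; then
      echo "MPMD task output ${out} reports a fatal error" >&2
      rc=1
    fi
  done
  return "${rc}"
}

# Usage sketch after the srun call in run_mpmd.sh:
#   srun ... --output=mpmd.%j.%t.out -n 7 "${cmdfile}" || err=$?
#   check_mpmd_outputs mpmd.*.out || err=1
```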

@DavidHuber-NOAA DavidHuber-NOAA removed CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed CI-Gaeac6-Failed **Bot use only** CI testing on Gaea C6 for this PR has failed CI-Ursa-Failed **Bot use only** CI testing on Ursa for this PR has failed labels Jan 16, 2026
@emcbot emcbot added CI-Gaeac6-Ready **CM use only** PR is ready for CI testing on Gaea C6 CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Ursa-Ready **CM use only** PR is ready for CI testing on Ursa CI-Ursa-Building **Bot use only** CI testing is cloning/building on Ursa CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Gaeac6-Building **Bot use only** CI testing is cloning/building on Gaea C6 CI-Ursa-Running **Bot use only** CI testing on Ursa for this PR is in-progress and removed CI-Ursa-Ready **CM use only** PR is ready for CI testing on Ursa CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Gaeac6-Ready **CM use only** PR is ready for CI testing on Gaea C6 CI-Ursa-Building **Bot use only** CI testing is cloning/building on Ursa CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera labels Jan 16, 2026
@DavidHuber-NOAA
Contributor Author

All tests passed on all platforms. Requesting final approvals to merge.

dep_dict = {'type': 'task', 'name': f'{self.run}_fbwind'}
deps.append(rocoto.add_dependency(dep_dict))

if self.options['do_wave']:
Contributor

This is okay to add, but to my knowledge we don't actually archive any of the AWIPS files for waves, so I don't think it is actually needed.

Contributor Author

Good to know! I'll open an issue to remove this tarball/job.

Contributor Author

Opened issue #4454 to retire AWIPS waves archiving.


Labels

CI-Gaeac6-Passed **Bot use only** CI testing on Gaea C6 for this PR has completed successfully CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully CI-Hercules-Passed **Bot use only** CI testing on Hercules for this PR has completed successfully CI-Orion-Passed **Bot use only** CI testing on Orion for this PR has completed successfully CI-Ursa-Passed **Bot use only** CI testing on Ursa for this PR has completed successfully CI-Wcoss2-Passed CI testing on WCOSS for this PR has completed successfully GFS Change This PR, if merged, will change results for the GFS.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

global_cycle reports a fatal error on 06Z gfs cycles

7 participants