Description
In our realtime/retro ensemble runs, we hit intermittent getkf crashes that were troublesome and difficult to diagnose. This issue was reported in NOAA-EMC/rrfs-workflow#1395.
The root cause was traced to partially written mpasout files, left behind when a forecast terminated unexpectedly while writing the file. This can occur for various reasons, such as system glitches or slow disk performance preventing complete output within the allocated walltime.
A key contributing factor is that it is currently difficult to reliably determine whether an mpasout file has been fully written.
We added a new stream attribute, output_done_marker, which generates a 'done' file once writing of the stream has completed.
RRFSx#20
This greatly streamlines our workflow, eliminating crashes and reducing the need for complex validation logic.
NOAA-EMC/rrfs-workflow#1401
Additionally, we no longer need to wait several minutes (typically around 5 minutes) for a new file to become “old”, which was how we made sure writing had completed, before triggering downstream tasks. Given the large number of ensembles and cycles we run, this results in substantial cumulative time savings.
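To illustrate the difference on the workflow side, here is a minimal Python sketch contrasting the old file-age heuristic with a done-marker check. It is not the rrfs-workflow code: the function names, the `.done` suffix, and the assumption that the marker sits next to the output file are all hypothetical and chosen only for illustration.

```python
import os
import time
from pathlib import Path


def ready_by_age(path: Path, min_age_s: float = 300.0) -> bool:
    """Old heuristic: treat the file as complete once it has not been
    modified for at least min_age_s seconds (about 5 minutes here)."""
    if not path.exists():
        return False
    return (time.time() - path.stat().st_mtime) >= min_age_s


def ready_by_marker(path: Path, marker_suffix: str = ".done") -> bool:
    """Marker check: the file is ready as soon as both it and its marker
    exist. The marker name (file name + suffix) is an assumption."""
    marker = path.with_name(path.name + marker_suffix)
    return path.exists() and marker.exists()
```

With the marker check, a downstream task can start as soon as the marker appears. A partially written file never passes, because the marker is only created after the write completes.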
This change depends on the merge of PR #217, where the newly added stream_mgr_set_property_c(...) function provides a convenient way to set stream attributes.