improve the efficiency of parallel restart handling in MOM6 #637
minghangli-uni merged 2 commits into dev-MC_25km_jra_ryf
|
This PR should be merged after payu-org/payu#601 has been merged. |
Thanks @minghangli-uni.
As I mentioned here, I don't think MOM6 can currently restart from uncollated restart files when using the NUOPC cap: restarting from uncollated restarts seems to only be supported when using restart files that have been internally named by MOM, but with NUOPC the restart filenames are set in the cap.
I ran two years of the 25km configuration in this PR with and without PARALLEL_RESTARTFILES = True:
| Repeat | PARALLEL_RESTARTFILES = True | PARALLEL_RESTARTFILES = False |
|---|---|---|
| 1 | 3hrs 34mins | 3hrs 37mins |
| 2 | 3hrs 34mins | 3hrs 36mins |
So not a big impact on walltime (at 25km). Possibly still worth doing at some point, but given that it's not straightforward it might have to wait until after the 25km 1.0-beta release?
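To put a number on "not a big impact", the walltimes in the table above can be converted to minutes and compared. This is only a minimal sketch: the helper names `to_minutes` and `percent_faster` are hypothetical, and the durations are simply copied from the table.

```python
import re


def to_minutes(text: str) -> int:
    """Parse a duration such as '3hrs 34mins' into whole minutes."""
    match = re.fullmatch(r"(\d+)\s*hrs?\s*(\d+)\s*mins?", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def percent_faster(parallel: str, serial: str) -> float:
    """Walltime saved with PARALLEL_RESTARTFILES = True, as a % of the False run."""
    t_par, t_ser = to_minutes(parallel), to_minutes(serial)
    return 100.0 * (t_ser - t_par) / t_ser


# Walltimes copied from the table: repeat -> (True, False).
runs = {
    1: ("3hrs 34mins", "3hr 37mins"),
    2: ("3hrs 34mins", "3hr 36mins"),
}

for repeat, (parallel, serial) in runs.items():
    print(f"repeat {repeat}: {percent_faster(parallel, serial):.1f}% faster")
```

For these two repeats the difference works out to roughly 1% of the walltime.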
|
When |
|
Hi @dougiesquire, which timing were you referring to: the one in job.yaml or the one from the PBS output? In my test with PARALLEL_RESTARTFILES on, the durations are 3hrs 31mins and 3hrs 42mins respectively. |
|
I was reporting the |
ebfe11b to 9f7217f
9f7217f to 8ba8925
|
Using 2025.08.000 build, two consecutive runs with
The runtime is around 2.7% faster. |
|
Currently this feature can only be enabled using 2025.08.000, which includes fixes from both the MOM6 source (ACCESS-NRI/MOM6#23) and payu (payu-org/payu#601). However, we are still seeing an error message in the log (it does not crash the simulation), tracked in ACCESS-NRI/ACCESS-OM3#143, so this needs to wait until that issue is resolved. In the meantime, would it make sense to cut a new payu release so we can start using |
|
Thanks @minghangli-uni, my reading of the discussion above also aligns with Dougie's:
So not worth worrying about for the time being but perhaps worth considering for the upcoming heavier configurations? |
|
There's no harm in adding it, since everything will be in place once we release a new build and a new release of Payu. |
|
So @minghangli-uni and I had a better look at past chat on this (can't see that issue @minghangli-uni ?) and @micaeljtoliveira found substantial speed-ups (on the order of 40 minutes) for the panan 01 and even more for the 1/20th, so it would be worth doing a test for @claireyung's configuration. That would need a build update as well, but that was planned in any case. It is likely particularly relevant there, as Claire (I think) is currently doing a lot of stopping and starting. Also, given that it doesn't make the 25km worse, I'm open to doing it there too. |
|
Here is the issue and the comment COSIMA/mom6-panan#29 (comment) |
8ba8925 to 298912c
|
!test repro |
|
✅ The Bitwise Reproducibility Check Succeeded ✅
When comparing:
Further information
The experiment can be found on Gadi at
The checksums generated by this
The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/81a708018986dc87df1fd7061629643f0e6b68fe/testing/checksum
Test summary: |
|
@chrisb13 @dougiesquire would you like to review this? |
dougiesquire left a comment
Thanks @minghangli-uni. Looks good.
I pushed a very minor change to MOM_input so that it is formatted the same way as MOM_parameter_doc.short. I suggest you "Squash and merge" so that this is squashed into your commit.
I tested/checked:
- successful run across restart
- payu restart collation
- restart reproducibility
|
Thanks @dougiesquire |
* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time
* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: dougiesquire <dougiesquire@gmail.com>
|
Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_100km_jra_ryf was successful. The new pull request can be reviewed and approved here. |
|
Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_100km_jra_ryf+wombatlite was successful. The new pull request can be reviewed and approved here. |
|
Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_25km_jra_ryf+wombatlite was successful. The new pull request can be reviewed and approved here. |
|
!cherry-pick 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan |
|
Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan was successful. The new pull request can be reviewed and approved here. |
|
!cherry-pick 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan+isf |
|
Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan+isf was successful. The new pull request can be reviewed and approved here. |
1. Summary:
What has changed?
Add PARALLEL_RESTARTFILES to MOM_input.
Why was this done?
To improve the efficiency of parallel restart handling in MOM6.
2. Issues Addressed:
Closes #592
3. Dependencies (e.g. on payu, model or om3-scripts)
This change requires changes to (note required version where true):
4. Ad-hoc Testing
What ad-hoc testing was done? How are you convinced this change is correct (plots are good)?
5. CI Testing
!test repro has been run
6. Reproducibility
Is this reproducible with the previous commit? (If not, why not?)
!test repro commit has been run.
7. Documentation
The docs folder has been updated with output from running the model?
A PR has been created for updating the documentation?
8. Formatting
Changes to MOM_input have been copied from model output in docs/MOM_parameter_docs.short?
9. Merge Strategy