improve the efficiency of parallel restart handling in MOM6 #637

Merged
minghangli-uni merged 2 commits into dev-MC_25km_jra_ryf from 592-parallel-restart
Dec 8, 2025

@minghangli-uni
Collaborator

@minghangli-uni minghangli-uni commented Jul 11, 2025

1. Summary:

What has changed?
Add PARALLEL_RESTARTFILES to MOM_input

Why was this done?
To improve the efficiency of parallel restart handling in MOM6.
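
A minimal sketch of how one might check that the flag is set, assuming MOM_input uses `NAME = value ! comment` lines. The helper name `read_bool_param` and the sample text are hypothetical and are not taken from this PR's diff:

```python
def read_bool_param(text, name):
    """Return True/False for a `NAME = True|False` line, or None if absent."""
    for raw in text.splitlines():
        # Drop anything after "!" (assumed to start a comment)
        line = raw.split("!", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == name:
            return value.lower() == "true"
    return None


# Hypothetical excerpt, not copied from the actual MOM_input in this PR
sample = """
! Restart handling
PARALLEL_RESTARTFILES = True    ! [Boolean]
"""

print(read_bool_param(sample, "PARALLEL_RESTARTFILES"))
```

The function returns None when the parameter is absent, so a caller can tell "not set" apart from an explicit False.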

2. Issues Addressed:

Closes #592

3. Dependencies (e.g. on payu, model or om3-scripts)

This change requires changes to (note required version where true):

  • payu:
  • access-om3:
  • om3-scripts:

4. Ad-hoc Testing

What ad-hoc testing was done? How are you convinced this change is correct (plots are good)?

5. CI Testing

  • !test repro has been run

6. Reproducibility

Is this reproducible with the previous commit? (If not, why not?)

  • Yes
  • No - !test repro commit has been run.

7. Documentation

The docs folder has been updated with output from running the model?

  • Yes
  • N/A

A PR has been created for updating the documentation?

  • Yes:
  • N/A

8. Formatting

Changes to MOM_input have been copied from model output in docs/MOM_parameter_docs.short?

  • Yes
  • N/A

9. Merge Strategy

  • Merge commit
  • Rebase and merge
  • Squash

@minghangli-uni minghangli-uni self-assigned this Jul 11, 2025
@minghangli-uni minghangli-uni marked this pull request as draft July 11, 2025 04:36
@minghangli-uni
Collaborator Author

This PR should be merged after payu-org/payu#601 has been merged.

@minghangli-uni minghangli-uni changed the title Enable PARALLEL_RESTARTFILES to allow better initialisation and termi… improve the efficiency of parallel restart handling in MOM6 Jul 11, 2025
Collaborator

@dougiesquire dougiesquire left a comment

Thanks @minghangli-uni.

As I mentioned here, I don't think MOM6 can currently restart from uncollated restart files when using the NUOPC cap: restarting from uncollated restarts seems to only be supported when using restart files that have been internally named by MOM, but with NUOPC the restart filenames are set in the cap.

I ran two years of the 25km configuration in this PR with and without PARALLEL_RESTARTFILES = True:

| Repeat | PARALLEL_RESTARTFILES = True | PARALLEL_RESTARTFILES = False |
| --- | --- | --- |
| 1 | 3 hr 34 min | 3 hr 37 min |
| 2 | 3 hr 34 min | 3 hr 36 min |

So not a big impact on walltime (at 25km). Possibly still worth doing at some point, but given that it's not straightforward it might have to wait until after the 25km 1.0-beta release?

@minghangli-uni
Collaborator Author

When the config starts from scratch, enabling PARALLEL_RESTARTFILES makes no difference to initialization time. However, the final ocean timestep takes ~170 seconds without parallel restart files, compared to just 5 seconds with them. This 165 second difference aligns with the timing reported by @dougiesquire here. So for the 25 km config, parallel restarts have minimal impact on the overall runtime.
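
As a rough check on "minimal impact", the sketch below relates the final-timestep saving to a full-run walltime. It assumes the 3 hr 37 min walltime reported above for PARALLEL_RESTARTFILES = False; the variable names are illustrative only:

```python
# Approximate final ocean timestep durations quoted in this comment (seconds)
final_step_without_parallel_s = 170
final_step_with_parallel_s = 5

# Walltime of one run without parallel restarts, assumed to be 3 hr 37 min
walltime_s = 3 * 3600 + 37 * 60

saving_s = final_step_without_parallel_s - final_step_with_parallel_s
share_pct = 100 * saving_s / walltime_s

print(f"{saving_s} s saved, about {share_pct:.1f}% of walltime")
```

Under these assumptions the saving is a little over one percent of the run, in line with the roughly 3 minute differences reported above.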

@minghangli-uni
Collaborator Author

minghangli-uni commented Jul 15, 2025

Hi @dougiesquire, which timing were you referring to: the one in job.yaml or the one from the PBS output? In my test with PARALLEL_RESTARTFILES on, the durations are 3 hrs 31 mins and 3 hrs 42 mins respectively.

@dougiesquire
Collaborator

I was reporting the PAYU_WALLTIME in job.yaml

Base automatically changed from dev-MC_25km_jra_ryf to release-MC_25km_jra_ryf July 24, 2025 07:01
@minghangli-uni minghangli-uni changed the base branch from release-MC_25km_jra_ryf to dev-MC_25km_jra_ryf September 1, 2025 23:35
@minghangli-uni
Collaborator Author

Using the 2025.08.000 build, two consecutive runs with PARALLEL_RESTARTFILES enabled and disabled:

| Param | 1st year | 2nd year |
| --- | --- | --- |
| PARALLEL_RESTARTFILES on | 03:46:29 | 03:45:16 |
| PARALLEL_RESTARTFILES off | 03:51:34 | 03:52:39 |

With PARALLEL_RESTARTFILES on, the runtime is around 2.7% shorter.
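
One way to arrive at a figure of about 2.7% is to compare total walltime over both years. This is a sketch only: the comment does not say how the percentage was aggregated, and the helper `to_seconds` is hypothetical:

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# Walltimes from the two consecutive runs reported in this comment
runs = {
    "on": ["03:46:29", "03:45:16"],
    "off": ["03:51:34", "03:52:39"],
}

# Sum both years for each setting, then compare against the "off" total
total = {key: sum(to_seconds(t) for t in times) for key, times in runs.items()}
reduction_pct = 100 * (total["off"] - total["on"]) / total["off"]

print(f"{reduction_pct:.1f}% shorter with PARALLEL_RESTARTFILES on")
```

Taken year by year, the same data gives roughly 2.2% for the first year and 3.2% for the second, so the combined figure sits between the two.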

@minghangli-uni
Collaborator Author

Currently this feature can only be enabled using 2025.08.000, which includes fixes from both the MOM6 source (ACCESS-NRI/MOM6#23) and payu (payu-org/payu#601).

However, we are still seeing an error message in the log (it does not crash the simulation), tracked in ACCESS-NRI/ACCESS-OM3#143, so this needs to wait until that issue is resolved. In the meantime, would it make sense to cut a new payu release so we can start using PARALLEL_RESTARTFILES and take advantage of the runtime improvements? @jo-basevi @dougiesquire @chrisb13

@chrisb13
Collaborator

Thanks @minghangli-uni, my reading of the discussion above also aligns with Dougie:

So not a big impact on walltime (at 25km).

So not worth worrying about for the time being but perhaps worth considering for the upcoming heavier configurations?

@minghangli-uni
Collaborator Author

There's no harm in adding it, since everything will be in place once we release a new build and a new Payu release.

@chrisb13
Collaborator

chrisb13 commented Sep 15, 2025

@minghangli-uni and I had a closer look at the past discussion on this (can't see that issue, @minghangli-uni?). @micaeljtoliveira found substantial speed-ups (on the order of 40 minutes) for the panan 01 and even more for the 1/20th, so it would be worth doing a test for @claireyung's configuration. We would need to do a build update as well, but that was planned in any case. It's likely particularly relevant there, as Claire (I think) is currently doing a lot of stopping and starting.

Also, given that it doesn't make the 25km worse, open to doing it there too.

@minghangli-uni
Collaborator Author

Here is the issue and the comment COSIMA/mom6-panan#29 (comment)

@minghangli-uni
Collaborator Author

!test repro

@github-actions

github-actions bot commented Dec 8, 2025

✅ The Bitwise Reproducibility Check Succeeded ✅

When comparing:

  • 592-parallel-restart (checksums created using commit 298912c), against
  • dev-MC_25km_jra_ryf (checksums in commit 81a7080)
Further information

The experiment can be found on Gadi at /scratch/tm70/repro-ci/experiments/access-om3-configs/298912cea5dcb146c9b33977b16d5e05d7bfdf82, and the test results at https://github.com/ACCESS-NRI/access-om3-configs/runs/57386736068.

The checksums generated by this !test command are found in the testing/checksum directory of https://github.com/ACCESS-NRI/access-om3-configs/actions/runs/20013312417/artifacts/4792386247.

The checksums compared against are found here https://github.com/ACCESS-NRI/access-om3-configs/tree/81a708018986dc87df1fd7061629643f0e6b68fe/testing/checksum

Test summary:
test_repro_historical

@minghangli-uni minghangli-uni marked this pull request as ready for review December 8, 2025 01:11
@minghangli-uni
Collaborator Author

@chrisb13 @dougiesquire would you like to review this?

Collaborator

@dougiesquire dougiesquire left a comment

Thanks @minghangli-uni. Looks good.

I pushed a very minor change to MOM_input so that it is formatted the same way as MOM_parameter_doc.short. I suggest you "Squash and merge" so that this is squashed into your commit.

I tested/checked:

  • successful run across restart
  • payu restart collation
  • restart reproducibility

@minghangli-uni
Collaborator Author

Thanks @dougiesquire

github-actions bot pushed a commit that referenced this pull request Dec 8, 2025
* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: dougiesquire <dougiesquire@gmail.com>
@github-actions

github-actions bot commented Dec 8, 2025

Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_100km_jra_ryf was successful.

The new pull request can be reviewed and approved here.

github-actions bot pushed a commit that referenced this pull request Dec 8, 2025
* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: dougiesquire <dougiesquire@gmail.com>
@github-actions

github-actions bot commented Dec 8, 2025

Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_100km_jra_ryf+wombatlite was successful.

The new pull request can be reviewed and approved here.

github-actions bot pushed a commit that referenced this pull request Dec 8, 2025
* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: dougiesquire <dougiesquire@gmail.com>
@github-actions

github-actions bot commented Dec 8, 2025

Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_25km_jra_ryf+wombatlite was successful.

The new pull request can be reviewed and approved here.

@ACCESS-NRI ACCESS-NRI deleted a comment from github-actions bot Dec 8, 2025
@minghangli-uni
Collaborator Author

!cherry-pick 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan

github-actions bot pushed a commit that referenced this pull request Dec 8, 2025
* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: dougiesquire <dougiesquire@gmail.com>
@github-actions

github-actions bot commented Dec 8, 2025

Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan was successful.

The new pull request can be reviewed and approved here.

@minghangli-uni
Collaborator Author

!cherry-pick 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan+isf

github-actions bot pushed a commit that referenced this pull request Dec 8, 2025
* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: dougiesquire <dougiesquire@gmail.com>
@github-actions

github-actions bot commented Dec 8, 2025

Automatic Git cherry-picking of commit(s) 13ee6e1 into dev-MC_4km_jra_ryf+regionalpanan+isf was successful.

The new pull request can be reviewed and approved here.

minghangli-uni added a commit that referenced this pull request Dec 8, 2025
)

* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: minghang.li <24727729+minghangli-uni@users.noreply.github.com>
Co-authored-by: dougiesquire <dougiesquire@gmail.com>
ezhilsabareesh8 pushed a commit that referenced this pull request Dec 9, 2025
)

* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: minghang.li <24727729+minghangli-uni@users.noreply.github.com>
Co-authored-by: dougiesquire <dougiesquire@gmail.com>
ezhilsabareesh8 pushed a commit that referenced this pull request Dec 19, 2025
)

* Enable PARALLEL_RESTARTFILES to allow better initialisation and termination time

* Format MOM_input like MOM_parameter_doc.short

---------

Co-authored-by: minghang.li <24727729+minghangli-uni@users.noreply.github.com>
Co-authored-by: dougiesquire <dougiesquire@gmail.com>