
For pm-gpu: add gpu affinity flag to srun and add pelayouts for coupled ne256 cases#7962

Closed
ndkeen wants to merge 1 commit into master from
ndk/machinefiles/pm-gpu-affinity-bind-and-xstrid-pelayouts

Conversation

@ndkeen
Contributor

@ndkeen ndkeen commented Dec 25, 2025

For pm-gpu, add an option to srun that sets GPU affinity via a new shell script, but only when there are 64 or more MPI ranks per node.
That may only happen with certain pelayouts designed to use the new xstrid option, such as those in this PR, which
adds S/M/L pelayouts for ne256-wcyclxx.
Also remove the special GPU bind case for MMF compsets, which are no longer used.
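As a rough illustration only (this is a sketch, not the actual script added in this PR; the file name, rank-to-GPU mapping, and 4-GPU-per-node assumption are all mine), a per-rank GPU-affinity wrapper for srun might look like:

```shell
#!/bin/bash
# Hypothetical per-rank GPU-affinity wrapper for Slurm, e.g. invoked as:
#   srun -n 64 ./set_gpu_affinity.sh ./e3sm.exe
# Assumes 4 GPUs per node (as on Perlmutter GPU nodes).
NGPUS_PER_NODE=4
LOCAL_RANK=${SLURM_LOCALID:-0}   # node-local MPI rank, set by Slurm per task
# Pin this rank to one GPU by restricting which device it can see.
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % NGPUS_PER_NODE ))
exec "$@"                        # hand off to the real executable
```

With 64 ranks per node, each GPU would be shared by 16 ranks under this round-robin mapping.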

These are the changes from #7818.

BFB

@ndkeen ndkeen self-assigned this Dec 25, 2025
@ndkeen ndkeen added Machine Files BFB PR leaves answers BFB pm-gpu Perlmutter machine at NERSC (GPU nodes) labels Dec 25, 2025
@ndkeen
Contributor Author

ndkeen commented Dec 25, 2025

I verified we still get the same performance with the GPU affinity setting even when it is always used (i.e., not just with 64 or more MPI ranks per node). However, with MPS, I had a case that was a solid 2% slower when using GPU affinity. We can use fewer than 64 MPI ranks per node with MPS, or find another way to turn it off if that is ever needed. Currently, MPS is only used for testing.

I also tested the branch with all GPU suites I could think of:

e3sm_eamxx_v1, e3sm_eamxx_large, e3sm_eamxx_extra_large, e3sm_gpuacc, e3sm_gpucxx

Member

@rljacob rljacob left a comment


These look like the same changes as in #7818, so redo this by either cherry-picking the commits from there or using the --author option on git commit to add Az's authorship.
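For reference, a minimal demonstration of `git commit --author` in a throwaway repository (the names and emails below are placeholders, not anyone's real identity):

```shell
# Demo: the recorded author can differ from the committer.
# All identities here are placeholders.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Committer"
git config user.email "committer@example.com"
echo change > file.txt
git add file.txt
git commit -q --author="Az Placeholder <az@example.com>" -m "redo change"
git log -1 --format='author=%an committer=%cn'
# prints: author=Az Placeholder committer=Committer
```

`git cherry-pick` achieves the same thing automatically, since it carries the original commit's author metadata over to the new commit.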

@ndkeen
Contributor Author

ndkeen commented Jan 5, 2026

Yes, these are the same changes as in the other PR, but I was running into issues testing on the older branch. It may only need a rebase, but this was the easiest/quickest path for me.

Contributor Author

@ndkeen ndkeen left a comment


How best to proceed?

@rljacob
Member

rljacob commented Jan 9, 2026

Close this and redo the branch using --author on the commit so Az is the author. Or Az can do it. Or just rebase Az's original branch and merge that. What exactly was the problem you had working with it?

@ndkeen
Contributor Author

ndkeen commented Jan 9, 2026

I don't recall the issues now, but it was messy dealing with the branch.
Can Az rebase?

@ndkeen ndkeen closed this Jan 9, 2026
