
For pm-gpu: add gpu affinity flag to srun and add pelayouts for coupled ne256 cases#7962

Closed
ndkeen wants to merge 1 commit into master from
ndk/machinefiles/pm-gpu-affinity-bind-and-xstrid-pelayouts

Conversation

@ndkeen
Contributor

@ndkeen ndkeen commented Dec 25, 2025

For pm-gpu, add an option to srun that sets GPU affinity via a new shell script, but only when there are 64 or more MPI ranks per node.
That may only happen with certain pelayouts designed to use the new xstrid option, such as those in this PR, which
adds S/M/L pelayouts for ne256-wcyclxx.
Also remove the special GPU bind case for MMF compsets, which are no longer used.
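As a rough illustration only (this is a sketch, not the actual script added in this PR; the file name, rank-to-GPU mapping, and 4-GPU-per-node assumption are all mine), a per-rank GPU-affinity wrapper for srun might look like:

```shell
#!/bin/bash
# Hypothetical per-rank GPU-affinity wrapper for Slurm, e.g. invoked as:
#   srun -n 64 ./set_gpu_affinity.sh ./e3sm.exe
# Assumes 4 GPUs per node (as on Perlmutter GPU nodes).
NGPUS_PER_NODE=4
LOCAL_RANK=${SLURM_LOCALID:-0}   # node-local MPI rank, set by Slurm per task
# Pin this rank to one GPU by restricting which device it can see.
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % NGPUS_PER_NODE ))
exec "$@"                        # hand off to the real executable
```

With 64 ranks per node, each GPU would be shared by 16 ranks under this round-robin mapping.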

These are the changes from #7818.

BFB

@ndkeen ndkeen self-assigned this Dec 25, 2025
@ndkeen ndkeen added Machine Files BFB PR leaves answers BFB pm-gpu Perlmutter machine at NERSC (GPU nodes) labels Dec 25, 2025
@ndkeen
Contributor Author

ndkeen commented Dec 25, 2025

I verified we still get the same performance with the GPU affinity setting even when it is always used (i.e., not just with 64 or more MPI ranks per node). However, with MPS, I had a case that was a solid 2% slower when using GPU affinity. We can use fewer than 64 MPI ranks per node with MPS, or find another way to turn it off if that is ever needed. Currently, MPS is only used for testing.

I also tested the branch with all GPU suites I could think of:

e3sm_eamxx_v1, e3sm_eamxx_large, e3sm_eamxx_extra_large, e3sm_gpuacc, e3sm_gpucxx

Member

@rljacob rljacob left a comment


These look like the same changes as in #7818, so redo this by either cherry-picking the commits from there or using the --author option on git commit to add Az's authorship.
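For reference, a minimal demonstration of `git commit --author` in a throwaway repository (the names and emails below are placeholders, not anyone's real identity):

```shell
# Demo: the recorded author can differ from the committer.
# All identities here are placeholders.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Committer"
git config user.email "committer@example.com"
echo change > file.txt
git add file.txt
git commit -q --author="Az Placeholder <az@example.com>" -m "redo change"
git log -1 --format='author=%an committer=%cn'
# prints: author=Az Placeholder committer=Committer
```

`git cherry-pick` achieves the same thing automatically, since it carries the original commit's author metadata over to the new commit.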

@ndkeen
Contributor Author

ndkeen commented Jan 5, 2026

Yes, these are the same changes as in the other PR, but I was running into issues testing on the older branch. It may only need a rebase, but this was the easiest/quickest path for me.

Contributor Author

@ndkeen ndkeen left a comment


How best to proceed?

@rljacob
Member

rljacob commented Jan 9, 2026

Close this and redo the branch using --author on the commit so Az is the author. Or Az can do it. Or just rebase Az's original branch and merge that. What exactly was the problem you had working with it?

@ndkeen
Contributor Author

ndkeen commented Jan 9, 2026

I don't recall the issues now, but it was messy dealing with the branch.
Can Az rebase?

@ndkeen ndkeen closed this Jan 9, 2026
