Conversation
Testing:
Example on 128 nodes:
I think we want to keep those settings. You may need to find a conditional way to build if you find other settings are better for certain cases. I do notice the correct syntax for srun is:
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu_bind=cores"; else echo "--cpu_bind=threads";fi;} </arg>
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu-bind=cores"; else echo "--cpu-bind=threads";fi;} </arg>
<arg name="placement"> -m plane=$SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
<arg name="gpu-bind"> /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh $SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
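A rough sketch of how that `$SHELL{...}` binding conditional resolves at runtime, with `./xmlquery` stubbed out (the stub value of 64 tasks per node is an assumption for illustration):

```shell
# Stand-in for ./xmlquery --value MAX_MPITASKS_PER_NODE (assumed value, for illustration only).
xmlquery() { echo 64; }

tpn=$(xmlquery --value MAX_MPITASKS_PER_NODE)
if [ 64 -ge "$tpn" ]; then
  binding="--cpu-bind=cores"    # at most one task per physical core: bind to cores
else
  binding="--cpu-bind=threads"  # more tasks than cores: bind to hardware threads
fi
echo "$binding"                 # -> --cpu-bind=cores
```

With more than 64 tasks per node the same test falls through to `--cpu-bind=threads`.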
What does this set_affinity_npergpu.sh file do?
Shouldn't set_affinity_npergpu.sh be part of our e3sm repo itself?
I thought the same thing. I'm slowly working through how best to integrate it.
Place in tools directory?
Oh, I didn't realize that set_affinity_npergpu.sh is getting added to the srun command, something like this:
srun --label -n 512 -N 8 -c 2 --cpu-bind=cores -m plane=64 /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh 64 /pscratch/sd/g/gbisht/e3sm_scratch/pm-gpu/Dlnd.pm-gpu.gpu.NorthAmerica1km.GRFR.aa57ca56c8.NTASKS_32.gnugpu.2025-12-05/bld/e3sm.exe >> e3sm.log.$LID 2>&1
It exports CUDA_VISIBLE_DEVICES for each rank, e.g. with 64 tasks per node. More info at https://docs.nersc.gov/jobs/affinity/ .
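A minimal sketch of that kind of wrapper, assuming the pm-gpu layout of 4 GPUs per node and the (MAX_MPITASKS_PER_NODE / num_gpus) tasks-per-GPU rule discussed later in this thread. The function name and details here are illustrative, not the actual set_affinity_npergpu.sh:

```shell
# Hypothetical sketch of a per-GPU affinity wrapper (NOT the real
# set_affinity_npergpu.sh): map each on-node rank to one GPU so that
# (tasks_per_node / num_gpus) ranks share each device.
set_affinity_sketch() {
  local tpn=$1; shift
  local num_gpus=4                      # pm-gpu nodes have 4 GPUs
  local local_rank=${SLURM_LOCALID:-0}  # on-node rank index, set by srun
  local per_gpu=$(( tpn / num_gpus ))   # ranks sharing each GPU
  export CUDA_VISIBLE_DEVICES=$(( local_rank / per_gpu ))
  "$@"                                  # exec the real executable with the binding applied
}

export SLURM_LOCALID=33   # pretend srun gave us on-node rank 33
set_affinity_sketch 64 true
echo "$CUDA_VISIBLE_DEVICES"   # -> 2  (33 / 16 = 2 with 64 tasks/node)
```

Each rank then sees exactly one device as GPU 0, which is what the exclusive binding in this PR relies on.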
Status: need to run benchmarks to make sure performance is not degraded.
I have been looking at this.
Note: @ndkeen is still testing.
First of all, do we now have the change mentioned above in CIME? Can suggestions be made for tests/scripts to use to test those new pelayouts? I already made one of the changes in this PR (in #7851), so we can remove it here. For the MAX_TASKS_PER_NODE change to 128 (from 256), I now realize we do want this change -- 256 is for pm-cpu. Ideally, I'd like to only use this for those compsets that need it.
Moved the affinity script into the repo. Documented a comparison of various pe-layouts (stacked, strided, xstrided, disjoint) -- up to 4x speedup stacked-vs-xstrided: https://e3sm.atlassian.net/wiki/spaces/DOC/pages/5698617345/Case+v3.ne256.wcycl+on+PM-GPU . All runs with 4 tasks per node will continue to run at the same throughput as before: previously the binding was implicit/omitted, now it is explicit -- also recommended by NERSC at https://docs.nersc.gov/jobs/affinity/ .
Yes, I can share some results as well, but basically I'm not seeing any major red flags yet. One thing I wanted to test was using MPS. This is still ongoing, but it looks to have the same performance as before when using the gpu bind script. As I've already merged some PRs that handle other aspects of this PR, the only changes we need are adding the gpu bind command to srun and updating the pelayouts. It may be better if I do that in another PR. However, I really think we should have the CIME change mentioned above to do pelayout testing.
I spoke too soon about the performance difference. With MPS on 1 pm-gpu node running an ne30 scream case, I'm seeing that using this new binding is 2% slower than without. I think a change as core as this should maybe only be done for the cases where it is needed. I suggested one way to do this above, which is to find a way to only use this arg on the srun line for certain cases (I can make a stab). Another hack, which is how I've been testing myself, is to conditionally use the flag based on an env variable: if I set the variable, the binding is used.
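A sketch of that env-variable gating. The variable name E3SM_USE_GPU_BIND is made up for illustration; the thread does not name the actual variable:

```shell
# Hypothetical gating: only emit the gpu-bind wrapper prefix for the srun line
# when E3SM_USE_GPU_BIND is set (variable name is an assumption, not from the PR).
build_gpu_bind_prefix() {
  if [ -n "${E3SM_USE_GPU_BIND:-}" ]; then
    echo "/global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh 64"
  fi
}

unset E3SM_USE_GPU_BIND
echo "srun -n 512 $(build_gpu_bind_prefix) ./e3sm.exe"   # no wrapper in the command

export E3SM_USE_GPU_BIND=1
echo "srun -n 512 $(build_gpu_bind_prefix) ./e3sm.exe"   # wrapper included
```

This keeps the default srun line untouched while letting individual tests opt in to the binding.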
2% might be expected due to run-to-run diffs.
Yes, I'm familiar with run-to-run diffs, which is why I've been running several tests, both with binding and without. I had not noticed your new change. Ah! Yes, that could work, nice find. I will try this. Also, I modified your affinity script to remove the odd characters. Any update on the CIME change needed?
CIME update in #7947 |
Cases that should work with the new S/M/L PE-layouts on 64/128/256 nodes:
OK, I tried that.
Yes, I got that.
I've done more testing.
Yes, it's fine as long as the original semantics is not modified: (MAX_MPITASKS_PER_NODE / num_gpus) tasks per GPU.
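That rule works out as follows, assuming the 4 GPUs per pm-gpu node discussed elsewhere in this thread (a small illustrative helper, not project code):

```shell
# Illustrative helper: ranks sharing each GPU under the rule
# tasks_per_gpu = MAX_MPITASKS_PER_NODE / num_gpus, with 4 GPUs per node.
tasks_per_gpu() { echo $(( $1 / 4 )); }

for tpn in 4 16 64 128; do
  echo "$tpn tasks/node -> $(tasks_per_gpu "$tpn") tasks per GPU"
done
# At 64 tasks/node that is 16 tasks per GPU, matching the
# "every 16th on-node mpi-process" striding in the PR description.
```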
I had trouble working with this branch as-is, and decided to redo the changes into a new branch with a new PR #7962 |
It sounds like Rob wants to re-open this. Also, please include the minor change I had made to PR #7962 |
Also:
- remove --gpu-bind options
- set max mpi+omp to 128
- fix spelling of --cpu-bind
- clean up OMP env-vars from MPI-only runs
but it may be fine to just hardcode a value of 4 here
Ok, done: rebased onto latest 2026-Jan-9 master and addressed the MMF comment. |
You said you rebased, but why do the file diffs show the OMP changes? Those were done in #7932.
pm-cpu on master still has omp lines here. |
I may have missed those in my PR. I think this change should be in a different PR than this one though, right? What about #7980?
Ok, done |
OK thanks for that Az. I have this branch running some final tests and all look good so far. Will merge to next asap. |
Update gpu affinity on pm-gpu (and muller-gpu, alvarez-gpu) for 64+ ppn such that 16+ processes per node see only one gpu (with CUDA_VISIBLE_DEVICES). Also, add S/M/L PE-layouts for ne256-wcyclxx on pm-gpu/muller-gpu/alvarez-gpu with exclusive-process-striding such that every 16th on-node mpi-process is exclusively reserved for EAMxx with all other on-node procs running other comps. [BFB]
merged to next
More info on striding and benchmarking is at https://e3sm.atlassian.net/wiki/spaces/DOC/pages/5698617345/Case+v3.ne256.wcycl+on+PM-GPU
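The exclusive-process striding can be pictured with a small sketch, assuming a stride of 16 (i.e. 64 tasks per node); the helper is illustrative, not project code:

```shell
# Illustrative: with exclusive-process striding at 64 ranks/node, every 16th
# on-node rank is reserved for EAMxx; all other ranks run the other components.
comp_for_rank() {
  if [ $(( $1 % 16 )) -eq 0 ]; then echo "EAMxx"; else echo "other"; fi
}

for r in 0 1 15 16 17 32; do
  echo "on-node rank $r -> $(comp_for_rank "$r")"
done
# on-node rank 0 -> EAMxx
# on-node rank 1 -> other
# ...
# on-node rank 16 -> EAMxx
```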
To get XML settings for EXCL_STRIDE, you need to check out cime master (or a pending PR #7947).

This PR also had:
- set max mpi+omp to 128 (done in PR #7932)
- fix spelling of --cpu-bind (done in PR #7851)
- clean up OMP env-vars from MPI-only runs (done in PR #7932)