
Update gpu affinity on pm-gpu #7818

Merged
ndkeen merged 15 commits into master from azamat/pes/add-xstrid-ne256-wcycl on Jan 15, 2026
Conversation

@amametjanov commented Oct 21, 2025

Update gpu affinity on pm-gpu (and muller-gpu, alvarez-gpu) for 64+ ppn
such that 16+ processes per node see only one gpu
(with CUDA_VISIBLE_DEVICES).

Also, add S/M/L PE-layouts for ne256-wcyclxx on pm-gpu/muller-gpu/alvarez-gpu
with exclusive-process-striding such that every 16th on-node mpi-process
is exclusively reserved for EAMxx with all other on-node procs running other comps.

[BFB]


More info on striding and benchmarking is at https://e3sm.atlassian.net/wiki/spaces/DOC/pages/5698617345/Case+v3.ne256.wcycl+on+PM-GPU
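As an illustrative sketch (not project code; the helper name is hypothetical), the exclusive-process striding described above can be modeled directly: with a stride of 16 and root_pe 0, EAMxx occupies every 16th global MPI rank, and all other ranks run the remaining components.

```shell
#!/bin/bash
# List the global MPI ranks reserved exclusively for ATM (EAMxx) under
# an exclusive stride, starting at root_pe 0. Hypothetical helper for
# reasoning about the layout, not part of the PR.
atm_ranks() {
    local total=$1 stride=${2:-16}
    seq 0 "$stride" $(( total - 1 ))
}

# 128 nodes x 64 tasks/node = 8192 ranks -> 512 ATM ranks (0, 16, ..., 8176)
atm_ranks 8192 | wc -l
```

With 64 tasks per node this reserves exactly one rank per 16-rank block for EAMxx, i.e. 4 ATM ranks per node, one per GPU.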

To get the XML settings for EXCL_STRIDE, you need to check out the latest cime master (or the pending CIME PR #7947):

$ cd cime
$ git fetch && git checkout master

This PR also had:

  • set max mpi+omp to 128 (done in PR 7932)
  • fix spelling of --cpu-bind (done in PR 7851)
  • clean-up omp env-vars from mpi-only runs (done in PR 7932)

@amametjanov added the Machine Files and pm-gpu (Perlmutter machine at NERSC, GPU nodes) labels Oct 21, 2025
@amametjanov commented Oct 21, 2025

Testing:

  • 64 nodes: base 0.18 sypd, this-branch 0.715 sypd -- 3.93x speedup
  • 128 nodes: base 0.34 sypd, this-branch 1.18 sypd -- 3.47x speedup
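The sypd figures in the timer output below follow from a standard conversion, assuming a 365-day model year (which matches the e3sm_timing numbers): simulated years per wall day = 86400 / (365 × seconds per model day). A quick sketch:

```shell
#!/bin/bash
# Convert seconds per simulated model day to simulated years per wall day,
# assuming a 365-day model year. Helper name is illustrative only.
sypd() { awk -v s="$1" 'BEGIN { printf "%.2f\n", 86400 / (365 * s) }'; }

sypd 694.880   # base 128-node run  -> 0.34
sypd 200.377   # this branch        -> 1.18
```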

Example on 128 nodes:

  • base:
  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        512         0        512    x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (1     )
  lnd = elm        512         0        512    x 1       1      (1     )
  ice = mpassi     512         0        512    x 1       1      (1     )
  ocn = mpaso      512         0        512    x 1       1      (1     )
  rof = mosart     512         0        512    x 1       1      (1     )
  glc = sglc       512         0        512    x 1       1      (1     )
  wav = swav       512         0        512    x 1       1      (1     )
  iac = siac       512         0        512    x 1       1      (1     )
  esp = sesp       512         0        512    x 1       1      (1     )

    TOT Run Time:     694.880 seconds      694.880 seconds/mday         0.34 myears/wday
    CPL Run Time:      21.708 seconds       21.708 seconds/mday        10.90 myears/wday
    ATM Run Time:     120.832 seconds      120.832 seconds/mday         1.96 myears/wday
    LND Run Time:       9.515 seconds        9.515 seconds/mday        24.88 myears/wday
    ICE Run Time:     185.903 seconds      185.903 seconds/mday         1.27 myears/wday
    OCN Run Time:     365.385 seconds      365.385 seconds/mday         0.65 myears/wday
    ROF Run Time:       0.628 seconds        0.628 seconds/mday       376.93 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     17.182 seconds       17.182 seconds/mday        13.78 myears/wday
  • test:
 component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        8192        0        8192   x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (16    )
  lnd = elm        8192        0        8192   x 1       1      (1     )
  ice = mpassi     8192        0        8192   x 1       1      (1     )
  ocn = mpaso      8192        0        8192   x 1       1      (1     )
  rof = mosart     8192        0        8192   x 1       1      (1     )
  glc = sglc       1           1        1      x 1       1      (1     )
  wav = swav       1           1        1      x 1       1      (1     )
  iac = siac       1           1        1      x 1       1      (1     )
  esp = sesp       1           1        1      x 1       1      (1     )

    TOT Run Time:     200.377 seconds      200.377 seconds/mday         1.18 myears/wday
    CPL Run Time:      12.038 seconds       12.038 seconds/mday        19.66 myears/wday
    ATM Run Time:     125.552 seconds      125.552 seconds/mday         1.89 myears/wday
    LND Run Time:      12.020 seconds       12.020 seconds/mday        19.69 myears/wday
    ICE Run Time:      39.554 seconds       39.554 seconds/mday         5.98 myears/wday
    OCN Run Time:      99.807 seconds       99.807 seconds/mday         2.37 myears/wday
    ROF Run Time:       1.231 seconds        1.231 seconds/mday       192.29 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     70.586 seconds       70.586 seconds/mday         3.35 myears/wday

@ndkeen commented Oct 21, 2025

I think we want to keep those settings. You may need to find a conditional way to build if you are finding other settings are better for certain cases.

I do notice the correct syntax to srun is --cpu-bind, not --cpu_bind as I had it; it may be that srun simply ignores the unknown option rather than erroring out. Testing this change now.

@amametjanov added the BFB (PR leaves answers BFB) and Performance labels Oct 21, 2025
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu_bind=cores"; else echo "--cpu_bind=threads";fi;} </arg>
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu-bind=cores"; else echo "--cpu-bind=threads";fi;} </arg>
<arg name="placement"> -m plane=$SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
<arg name="gpu-bind"> /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh $SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
Review comment (Member):

What does this set_affinity_npergpu.sh file do?

Review comment (Contributor):

Shouldn't set_affinity_npergpu.sh be part of our e3sm repo itself?

@ndkeen commented Dec 5, 2025

I thought the same thing. I'm slowly working thru how to best integrate.
Place in tools directory?

Review comment (Contributor):

Oh, I didn't realize that set_affinity_npergpu.sh is getting added to the srun command, something like this:

srun  --label  -n 512 -N 8 -c 2  --cpu-bind=cores   -m plane=64  /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh 64 /pscratch/sd/g/gbisht/e3sm_scratch/pm-gpu/Dlnd.pm-gpu.gpu.NorthAmerica1km.GRFR.aa57ca56c8.NTASKS_32.gnugpu.2025-12-05/bld/e3sm.exe >> e3sm.log.$LID 2>&1

@amametjanov (Member Author):

It exports CUDA_VISIBLE_DEVICES=[0|1|2|3] at run-time depending on node-local mpi task id:

  • if 4 tasks per node, then 1 task per gpu: each task sees only 1 gpu (either of 0,1,2,3), like before
  • if 64 tpn, then 16 tasks per gpu: first 16 tasks on gpu 0, next 16 on gpu 1 etc.
  • without this, prior behavior is round-robin task 0 on gpu 0, task 1 on gpu 1: e.g. task 0 and 16 on gpu 0 with pstrid=16 leads to out-of-memory errors.
> cat /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh
#!/bin/bash
#num_gpus=$(nvidia-smi -L | wc -l)
tasks_per_node=$1
tasks_per_gpu=$(( ${tasks_per_node} / 4 ))
gpu=$(( (${SLURM_LOCALID} / ${tasks_per_gpu}) % 4 ))
export CUDA_VISIBLE_DEVICES=$gpu
echo “RANK= ${SLURM_PROCID} LOCAL_RANK= ${SLURM_LOCALID} gpu= ${gpu}”
shift
"$@"

e.g. with 64 tpn:

   0: “RANK= 0 LOCAL_RANK= 0 gpu= 0”
   1: “RANK= 1 LOCAL_RANK= 1 gpu= 0”
  15: “RANK= 15 LOCAL_RANK= 15 gpu= 0”
  16: “RANK= 16 LOCAL_RANK= 16 gpu= 1”
  63: “RANK= 63 LOCAL_RANK= 63 gpu= 3”
  64: “RANK= 64 LOCAL_RANK= 0 gpu= 0”
...
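The rank-to-GPU mapping above reduces to integer arithmetic on the node-local rank; a minimal stand-alone sketch of the same expression (for checking the layout outside Slurm; the function name is hypothetical):

```shell
#!/bin/bash
# Blocked GPU assignment: the first tasks_per_node/4 local ranks share
# GPU 0, the next block GPU 1, and so on. Mirrors the script's
# gpu=$(( (SLURM_LOCALID / tasks_per_gpu) % 4 )) expression.
gpu_for_rank() {
    local localid=$1 tpn=$2 num_gpus=4
    echo $(( (localid / (tpn / num_gpus)) % num_gpus ))
}

gpu_for_rank 15 64   # -> 0
gpu_for_rank 16 64   # -> 1
gpu_for_rank 63 64   # -> 3
gpu_for_rank 2 4     # 4 tasks/node: 1 task per gpu -> 2
```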

More info at https://docs.nersc.gov/jobs/affinity/ . But --gpu-bind there doesn't work for us, because of direct gpu-to-gpu comms with MPICH_GPU_SUPPORT_ENABLED=1: --gpu-bind leads to IPC cuIpcOpenMemHandle errors like

118: gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
118: (GTL DEBUG: 118) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 360

@rljacob commented Nov 20, 2025

Status: need to run benchmarks to make sure performance not degraded.

@ndkeen commented Nov 20, 2025

I have been looking at this.

@rljacob commented Dec 11, 2025

Note: @ndkeen is still testing.

@ndkeen commented Dec 11, 2025

First of all, do we now have the change mentioned above in CIME (azamat/pes/add-xstrid-to-xml)?
I would prefer we not merge this until that's in CIME.

Can suggestions be made for tests/scripts to use to test those new pelayouts?

I already made one of the changes in this PR (in #7851), so we can remove it here (regarding the --cpu-bind flag syntax).
I reverted this change in this branch.

For the MAX_TASKS_PER_NODE change to 128 (from 256), I now realize we do want this change -- 256 is for pm-cpu.
For pm-cpu, with 2 CPUs, the max is 256 (with hyper-threading).
But pm-gpu has only 1 CPU, so the max is 128.
It will not make a practical difference -- i.e., we could set this to a very large number -- but it documents the hardware and prevents accidentally using too many tasks.
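The 128-vs-256 limits follow directly from the node topology as stated above (two 64-core CPUs with 2-way hyper-threading on pm-cpu, one on pm-gpu): max tasks = sockets × cores × hardware threads. A trivial check (helper name is illustrative):

```shell
#!/bin/bash
# Upper bound on MPI tasks (or tasks*threads) per node:
# sockets * cores_per_socket * 2 hardware threads per core.
max_tasks() { echo $(( $1 * $2 * 2 )); }

max_tasks 2 64   # pm-cpu: 2 sockets x 64 cores -> 256
max_tasks 1 64   # pm-gpu: 1 socket  x 64 cores -> 128
```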
Will also make the change for muller/alvarez, but maybe in a different PR.
I created #7932 to make some of the safe/unrelated changes.
Then I could create a fresh PR to do the small/important change in this PR.

We want the /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh file to be in the repo, I think. Maybe in tools.
Well, maybe current location is ok for now?

Ideally, I'd like to only use this for those compsets that need it.
I have some initial performance testing for eamxx cases with/without, and while I don't see a difference, I would want to run more tests, especially at scale. So we can still merge the main concepts in the PR and work toward improving in another PR.

@amametjanov (Member Author):

Moved the affinity script to

cime_config/machines/scripts/pm-gpu_set_affinity_npergpu.sh

Documented a comparison of various pe-layouts (stacked, strided, xstrided, disjoint) -- up to 4x speedup stacked-vs-xstrided: https://e3sm.atlassian.net/wiki/spaces/DOC/pages/5698617345/Case+v3.ne256.wcycl+on+PM-GPU .

All runs with 4 tasks per node will continue to run at the same throughput as before: previously implicit/omitted binding, now explicit -- also recommended by NERSC at https://docs.nersc.gov/jobs/affinity/ (grep for CUDA_VISIBLE_DEVICES).

@ndkeen commented Dec 15, 2025

Yes I can share some results as well, but basically I'm not seeing any major red flags yet. One thing I wanted to test was using MPS. That testing is still ongoing, but it looks to have the same performance as before when using the gpu bind script.
I also wanted to test the NESAP 512-node benchmark.

As I've already merged some PRs that handle other aspects of this PR, the only changes we need are adding the gpu bind command to srun and the updates to pelayouts. It may be better if I do that in another PR.

However, I really think we should have the CIME change mentioned above to do pelayout testing.

@ndkeen commented Dec 16, 2025

I spoke too soon about the performance difference. With MPS on 1 pm-gpu node running a ne30 scream case, I'm seeing that using this new binding is 2% slower than without.
I can continue testing.

I think something as core as this should maybe only be done for the cases where it is needed.

I suggested one way to do this above, which is to find a way to only use this arg on the srun line for certain cases (I can make a stab at it). Another hack, which is how I've been testing myself, is to conditionally use the flag based on an env variable:

<arg name="gpu-bind">$SHELL{if [ -n "$GPUBINDTEST" ]; then echo "/global/cfs/cdirs/e3sm/tools/s.sh $(./xmlquery --value MAX_MPITASKS_PER_NODE)"; else echo ""; fi}</arg>

with this, if I set GPUBINDTEST=1, it will use it, otherwise it will not.
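The env-var gate can be exercised outside CIME; here is a minimal bash sketch of the same conditional (the script path and GPUBINDTEST variable are taken from the XML snippet above; the function name is hypothetical):

```shell
#!/bin/bash
# Emit the gpu-bind wrapper prefix only when GPUBINDTEST is set
# (non-empty), mirroring the $SHELL{...} conditional in the XML arg.
gpu_bind_arg() {
    local ppn=$1
    if [ -n "$GPUBINDTEST" ]; then
        echo "/global/cfs/cdirs/e3sm/tools/s.sh ${ppn}"
    else
        echo ""
    fi
}

unset GPUBINDTEST
gpu_bind_arg 64      # prints an empty line: no wrapper added to srun

GPUBINDTEST=1
gpu_bind_arg 64      # prints the wrapper invocation with ppn=64
```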

@amametjanov (Member Author):

2% might be expected due to run-to-run diffs.
Also, with MPS there is more than 1 task per GPU (over-subscribed), so some slow-down is expected.
Anyway, with the latest commit, binding of MPI tasks to GPUs is applied only when there are 64+ MPI processes per node. With fewer, there is no explicit binding and all 4 GPUs are visible to all on-node tasks.

@ndkeen commented Dec 16, 2025

Yes I'm familiar with run-to-run diffs, which is why I've been running several tests.
With MPS, depending on the problem and number of nodes used, etc, I might see 9-20% improvement.
So MPS is not slowing things down but improving them -- though maybe you meant that some slowdown is expected with this affinity change. I'm still hoping to implement MPS better, but for now I'm still testing.

With binding:

ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262800.251215-134544:    TOT Run Time:     146.435 seconds       29.287 seconds/mday         8.08 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262806.251215-135605:    TOT Run Time:     146.695 seconds       29.339 seconds/mday         8.07 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262829.251215-162031:    TOT Run Time:     146.359 seconds       29.272 seconds/mday         8.09 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262830.251215-162431:    TOT Run Time:     146.448 seconds       29.290 seconds/mday         8.08 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262831.251215-162831:    TOT Run Time:     146.476 seconds       29.295 seconds/mday         8.08 myears/wday 

without:

ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262733.251215-125321:    TOT Run Time:     143.361 seconds       28.672 seconds/mday         8.26 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262819.251215-152845:    TOT Run Time:     143.656 seconds       28.731 seconds/mday         8.24 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262823.251215-161228:    TOT Run Time:     143.295 seconds       28.659 seconds/mday         8.26 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262824.251215-161631:    TOT Run Time:     143.405 seconds       28.681 seconds/mday         8.25 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262825.251215-162031:    TOT Run Time:     143.772 seconds       28.754 seconds/mday         8.23 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262826.251215-162431:    TOT Run Time:     143.311 seconds       28.662 seconds/mday         8.26 myears/wday 

I had not noticed your new change:

<arg name="gpu-bind"> $SHELL{ppn=`./xmlquery --value MAX_MPITASKS_PER_NODE`; if [ 64 -le $ppn ]; then echo $CIMEROOT/../cime_config/machines/scripts/pm-gpu_set_affinity_npergpu.sh $ppn; fi;} </arg>

Ah! Yes, that could work, nice find. I will try this.

Also, I modified your affinity script to remove the odd characters that prevent cat from working:

#!/bin/bash
tasks_per_node=$1
tasks_per_gpu=$(( ${tasks_per_node} / 4 ))
gpu=$(( (${SLURM_LOCALID} / ${tasks_per_gpu}) % 4 ))
export CUDA_VISIBLE_DEVICES=$gpu

# Print the rank-to-gpu mapping for this task (plain ASCII, trailing newline)
printf 'RANK= %s LOCAL_RANK= %s gpu= %s\n' ${SLURM_PROCID} ${SLURM_LOCALID} ${gpu}

shift
"$@"

Any update on the CIME change needed?

@rljacob commented Dec 18, 2025

CIME update in #7947

@amametjanov (Member Author):

Cases that should work with the new S/M/L PE-layouts on 64/128/256 nodes (on the next branch today):

./cime/scripts/create_test --machine pm-gpu SMS_PS.ne256pg2_r025_RRSwISC6to18E3r5.WCYCLXX2010

@ndkeen commented Dec 19, 2025

OK I tried the PS case on 64 nodes. Does this performance look to be what you might expect?

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        4096        0        4096   x 1       1      (1     ) 
  atm = scream     256         0        256    x 1       1      (16    ) 
  lnd = elm        4096        0        4096   x 1       1      (1     ) 
  ice = mpassi     4096        0        4096   x 1       1      (1     ) 
  ocn = mpaso      4096        0        4096   x 1       1      (1     ) 
  rof = mosart     4096        0        4096   x 1       1      (1     ) 
  glc = sglc       1           1        1      x 1       1      (1     ) 
  wav = swav       1           1        1      x 1       1      (1     ) 
  iac = siac       1           1        1      x 1       1      (1     ) 
  esp = sesp       1           1        1      x 1       1      (1     ) 

  total pes active           : 8192 
  mpi tasks per node         : 64 
  pe count for cost estimate : 4096 

  Overall Metrics: 
    Model Cost:          126288.39   pe-hrs/simulated_year 
    Model Throughput:         0.78   simulated_years/day 

    Init Time   :     486.497 seconds 
    Run Time    :    1520.488 seconds      304.098 seconds/day 
    Final Time  :       1.218 seconds 

    Actual Ocn Init Wait Time     :       0.665 seconds 
    Estimated Ocn Init Run Time   :       1.371 seconds 
    Estimated Run Time Correction :       0.706 seconds 
      (This correction has been applied to the ocean and total run times) 

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:    1520.488 seconds      304.098 seconds/mday         0.78 myears/wday 
    CPL Run Time:     112.250 seconds       22.450 seconds/mday        10.54 myears/wday 
    ATM Run Time:     904.835 seconds      180.967 seconds/mday         1.31 myears/wday 
    LND Run Time:      19.147 seconds        3.829 seconds/mday        61.81 myears/wday 
    ICE Run Time:     351.409 seconds       70.282 seconds/mday         3.37 myears/wday 
    OCN Run Time:     987.462 seconds      197.492 seconds/mday         1.20 myears/wday 
    ROF Run Time:       3.226 seconds        0.645 seconds/mday       366.88 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:    539.398 seconds      107.880 seconds/mday         2.19 myears/wday 

and then the PS case with 128 nodes:

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        8192        0        8192   x 1       1      (1     ) 
  atm = scream     512         0        512    x 1       1      (16    ) 
  lnd = elm        8192        0        8192   x 1       1      (1     ) 
  ice = mpassi     8192        0        8192   x 1       1      (1     ) 
  ocn = mpaso      8192        0        8192   x 1       1      (1     ) 
  rof = mosart     8192        0        8192   x 1       1      (1     ) 
  glc = sglc       1           1        1      x 1       1      (1     ) 
  wav = swav       1           1        1      x 1       1      (1     ) 
  iac = siac       1           1        1      x 1       1      (1     ) 
  esp = sesp       1           1        1      x 1       1      (1     ) 

  total pes active           : 16384 
  mpi tasks per node         : 64 
  pe count for cost estimate : 8192 

  Overall Metrics: 
    Model Cost:          153449.20   pe-hrs/simulated_year 
    Model Throughput:         1.28   simulated_years/day 

    Init Time   :     443.389 seconds 
    Run Time    :     369.500 seconds      184.750 seconds/day 
    Final Time  :       0.272 seconds 

    Actual Ocn Init Wait Time     :       0.345 seconds 
    Estimated Ocn Init Run Time   :       0.700 seconds 
    Estimated Run Time Correction :       0.355 seconds 
      (This correction has been applied to the ocean and total run times) 

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:     369.500 seconds      184.750 seconds/mday         1.28 myears/wday 
    CPL Run Time:      39.466 seconds       19.733 seconds/mday        12.00 myears/wday 
    ATM Run Time:     206.630 seconds      103.315 seconds/mday         2.29 myears/wday 
    LND Run Time:       7.954 seconds        3.977 seconds/mday        59.52 myears/wday 
    ICE Run Time:      92.894 seconds       46.447 seconds/mday         5.10 myears/wday 
    OCN Run Time:     201.567 seconds      100.783 seconds/mday         2.35 myears/wday 
    ROF Run Time:       1.076 seconds        0.538 seconds/mday       439.99 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:    130.617 seconds       65.308 seconds/mday         3.62 myears/wday 

@ndkeen commented Dec 22, 2025

I've done more testing.
Az: are you ok with the minor changes I made to the gpu affinity script?
I'm also fine going back to hardcoding 4 GPUs per node.

@amametjanov (Member Author):

Yes, it's fine as long as the original semantics aren't modified: (MAX_MPITASKS_PER_NODE / num_gpus) tasks per gpu.

@ndkeen commented Dec 25, 2025

I had trouble working with this branch as-is, and decided to redo the changes into a new branch with a new PR #7962

@ndkeen ndkeen closed this Dec 25, 2025
@ndkeen commented Jan 9, 2026

It sounds like Rob wants to re-open this.
Can someone rebase it?
Also, I don't think the removal of the MMF section in config_batch.xml is related here.
Just like the other pieces I moved to other (now merged) PRs, we should split that change out as well.

Also, please include the minor change I had made in PR #7962, so that the same changes made for pm-gpu are done for muller/alvarez.

@amametjanov amametjanov reopened this Jan 10, 2026
@amametjanov force-pushed the azamat/pes/add-xstrid-ne256-wcycl branch from f1ba8c6 to 3138334 on January 10, 2026 22:08
@amametjanov force-pushed the azamat/pes/add-xstrid-ne256-wcycl branch from ee73e6a to 59f90dd on January 10, 2026 23:41
@amametjanov (Member Author):

Ok, done: rebased onto latest 2026-Jan-9 master and addressed the MMF comment.

@ndkeen commented Jan 11, 2026

You said you rebased, but why do the file diffs still show the OMP changes? Those were done in #7932.

@amametjanov (Member Author):

pm-cpu on master still has omp lines here.
Just a clean-up, which can be removed.

@ndkeen commented Jan 11, 2026

I may have missed those in my PR. I think this change should be in a different PR than this one though, right?

What about #7980?

@amametjanov (Member Author):

Ok, done

@ndkeen commented Jan 14, 2026

OK thanks for that Az. I have this branch running some final tests and all look good so far. Will merge to next asap.

ndkeen added a commit that referenced this pull request Jan 14, 2026
Update gpu affinity on pm-gpu (and muller-gpu, alvarez-gpu) for 64+ ppn
such that 16+ processes per node see only one gpu
(with CUDA_VISIBLE_DEVICES).

Also, add S/M/L PE-layouts for ne256-wcyclxx on pm-gpu/muller-gpu/alvarez-gpu
with exclusive-process-striding such that every 16th on-node mpi-process
is exclusively reserved for EAMxx with all other on-node procs running other comps.

[BFB]
@ndkeen commented Jan 14, 2026

merged to next

e3sm_eamxx_v1 tests passed and SMS_PS.ne256pg2_r025_RRSwISC6to18E3r5.WCYCLXX2010.pm-gpu_gnugpu had same performance as reported above.

@ndkeen ndkeen merged commit a6d19ca into master Jan 15, 2026
5 of 6 checks passed
@ndkeen ndkeen deleted the azamat/pes/add-xstrid-ne256-wcycl branch January 15, 2026 23:34
odiazib pushed a commit to odiazib/E3SM that referenced this pull request Feb 5, 2026

Labels

BFB (PR leaves answers BFB), Machine Files, Performance, pm-gpu (Perlmutter machine at NERSC, GPU nodes)
