
Update gpu affinity on pm-gpu #7818

Merged
ndkeen merged 15 commits into master from azamat/pes/add-xstrid-ne256-wcycl on Jan 15, 2026
Conversation

@amametjanov commented Oct 21, 2025

Update gpu affinity on pm-gpu (and muller-gpu, alvarez-gpu) for 64+ ppn
such that 16+ processes per node see only one gpu
(with CUDA_VISIBLE_DEVICES).

Also, add S/M/L PE-layouts for ne256-wcyclxx on pm-gpu/muller-gpu/alvarez-gpu
with exclusive-process-striding such that every 16th on-node mpi-process
is exclusively reserved for EAMxx with all other on-node procs running other comps.

[BFB]


More info on striding and benchmarking is at https://e3sm.atlassian.net/wiki/spaces/DOC/pages/5698617345/Case+v3.ne256.wcycl+on+PM-GPU
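As an illustrative sketch (not project code; the helper name is hypothetical), the exclusive-process striding described above can be modeled directly: with a stride of 16 and root_pe 0, EAMxx occupies every 16th global MPI rank, and all other ranks run the remaining components.

```shell
#!/bin/bash
# List the global MPI ranks reserved exclusively for ATM (EAMxx) under
# an exclusive stride, starting at root_pe 0. Hypothetical helper for
# reasoning about the layout, not part of the PR.
atm_ranks() {
    local total=$1 stride=${2:-16}
    seq 0 "$stride" $(( total - 1 ))
}

# 128 nodes x 64 tasks/node = 8192 ranks -> 512 ATM ranks (0, 16, ..., 8176)
atm_ranks 8192 | wc -l
```

With 64 tasks per node this reserves exactly one rank per 16-rank block for EAMxx, i.e. 4 ATM ranks per node, one per GPU.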

To get the XML settings for EXCL_STRIDE, you need to check out the latest cime master (or the pending CIME PR #7947):

$ cd cime
$ git fetch && git checkout master

This PR also had:

  • set max mpi+omp to 128 (done in PR 7932)
  • fix spelling of --cpu-bind (done in PR 7851)
  • clean-up omp env-vars from mpi-only runs (done in PR 7932)

@amametjanov added the Machine Files and pm-gpu (Perlmutter machine at NERSC, GPU nodes) labels Oct 21, 2025
@amametjanov commented Oct 21, 2025

Testing:

  • 64 nodes: base 0.18 sypd, this-branch 0.715 sypd -- 3.93x speedup
  • 128 nodes: base 0.34 sypd, this-branch 1.18 sypd -- 3.47x speedup
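The sypd figures in the timer output below follow from a standard conversion, assuming a 365-day model year (which matches the e3sm_timing numbers): simulated years per wall day = 86400 / (365 × seconds per model day). A quick sketch:

```shell
#!/bin/bash
# Convert seconds per simulated model day to simulated years per wall day,
# assuming a 365-day model year. Helper name is illustrative only.
sypd() { awk -v s="$1" 'BEGIN { printf "%.2f\n", 86400 / (365 * s) }'; }

sypd 694.880   # base 128-node run  -> 0.34
sypd 200.377   # this branch        -> 1.18
```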

Example on 128 nodes:

  • base:
  component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        512         0        512    x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (1     )
  lnd = elm        512         0        512    x 1       1      (1     )
  ice = mpassi     512         0        512    x 1       1      (1     )
  ocn = mpaso      512         0        512    x 1       1      (1     )
  rof = mosart     512         0        512    x 1       1      (1     )
  glc = sglc       512         0        512    x 1       1      (1     )
  wav = swav       512         0        512    x 1       1      (1     )
  iac = siac       512         0        512    x 1       1      (1     )
  esp = sesp       512         0        512    x 1       1      (1     )

    TOT Run Time:     694.880 seconds      694.880 seconds/mday         0.34 myears/wday
    CPL Run Time:      21.708 seconds       21.708 seconds/mday        10.90 myears/wday
    ATM Run Time:     120.832 seconds      120.832 seconds/mday         1.96 myears/wday
    LND Run Time:       9.515 seconds        9.515 seconds/mday        24.88 myears/wday
    ICE Run Time:     185.903 seconds      185.903 seconds/mday         1.27 myears/wday
    OCN Run Time:     365.385 seconds      365.385 seconds/mday         0.65 myears/wday
    ROF Run Time:       0.628 seconds        0.628 seconds/mday       376.93 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     17.182 seconds       17.182 seconds/mday        13.78 myears/wday
  • test:
 component       comp_pes    root_pe   tasks  x threads instances (stride)
  ---------        ------     -------   ------   ------  ---------  ------
  cpl = cpl        8192        0        8192   x 1       1      (1     )
  atm = scream     512         0        512    x 1       1      (16    )
  lnd = elm        8192        0        8192   x 1       1      (1     )
  ice = mpassi     8192        0        8192   x 1       1      (1     )
  ocn = mpaso      8192        0        8192   x 1       1      (1     )
  rof = mosart     8192        0        8192   x 1       1      (1     )
  glc = sglc       1           1        1      x 1       1      (1     )
  wav = swav       1           1        1      x 1       1      (1     )
  iac = siac       1           1        1      x 1       1      (1     )
  esp = sesp       1           1        1      x 1       1      (1     )

    TOT Run Time:     200.377 seconds      200.377 seconds/mday         1.18 myears/wday
    CPL Run Time:      12.038 seconds       12.038 seconds/mday        19.66 myears/wday
    ATM Run Time:     125.552 seconds      125.552 seconds/mday         1.89 myears/wday
    LND Run Time:      12.020 seconds       12.020 seconds/mday        19.69 myears/wday
    ICE Run Time:      39.554 seconds       39.554 seconds/mday         5.98 myears/wday
    OCN Run Time:      99.807 seconds       99.807 seconds/mday         2.37 myears/wday
    ROF Run Time:       1.231 seconds        1.231 seconds/mday       192.29 myears/wday
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday
    CPL COMM Time:     70.586 seconds       70.586 seconds/mday         3.35 myears/wday

@ndkeen commented Oct 21, 2025

I think we want to keep those settings. You may need to find a conditional way to build if you are finding other settings are better for certain cases.

I do notice the correct syntax to srun is --cpu-bind, not --cpu_bind as I had it; it may be that srun simply ignores the unknown option rather than erroring out. Testing this change now.

@amametjanov added the BFB (PR leaves answers BFB) and Performance labels Oct 21, 2025
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu_bind=cores"; else echo "--cpu_bind=threads";fi;} </arg>
<arg name="binding"> $SHELL{if [ 64 -ge `./xmlquery --value MAX_MPITASKS_PER_NODE` ]; then echo "--cpu-bind=cores"; else echo "--cpu-bind=threads";fi;} </arg>
<arg name="placement"> -m plane=$SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
<arg name="gpu-bind"> /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh $SHELL{echo `./xmlquery --value MAX_MPITASKS_PER_NODE`}</arg>
Review comment (Member):

What does this set_affinity_npergpu.sh file do?

Review comment (Contributor):

Shouldn't set_affinity_npergpu.sh be part of our e3sm repo itself?

@ndkeen commented Dec 5, 2025

I thought the same thing. I'm slowly working thru how to best integrate.
Place in tools directory?

Review comment (Contributor):

Oh, I didn't realize that set_affinity_npergpu.sh is getting added to the srun command, something like this:

srun  --label  -n 512 -N 8 -c 2  --cpu-bind=cores   -m plane=64  /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh 64 /pscratch/sd/g/gbisht/e3sm_scratch/pm-gpu/Dlnd.pm-gpu.gpu.NorthAmerica1km.GRFR.aa57ca56c8.NTASKS_32.gnugpu.2025-12-05/bld/e3sm.exe >> e3sm.log.$LID 2>&1

@amametjanov (Member Author):

It exports CUDA_VISIBLE_DEVICES=[0|1|2|3] at run-time depending on node-local mpi task id:

  • if 4 tasks per node, then 1 task per gpu: each task sees only 1 gpu (either of 0,1,2,3), like before
  • if 64 tpn, then 16 tasks per gpu: first 16 tasks on gpu 0, next 16 on gpu 1 etc.
  • without this, prior behavior is round-robin task 0 on gpu 0, task 1 on gpu 1: e.g. task 0 and 16 on gpu 0 with pstrid=16 leads to out-of-memory errors.
> cat /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh
#!/bin/bash
#num_gpus=$(nvidia-smi -L | wc -l)
tasks_per_node=$1
tasks_per_gpu=$(( ${tasks_per_node} / 4 ))
gpu=$(( (${SLURM_LOCALID} / ${tasks_per_gpu}) % 4 ))
export CUDA_VISIBLE_DEVICES=$gpu
echo “RANK= ${SLURM_PROCID} LOCAL_RANK= ${SLURM_LOCALID} gpu= ${gpu}”
shift
"$@"

e.g. with 64 tpn:

   0: “RANK= 0 LOCAL_RANK= 0 gpu= 0”
   1: “RANK= 1 LOCAL_RANK= 1 gpu= 0”
  15: “RANK= 15 LOCAL_RANK= 15 gpu= 0”
  16: “RANK= 16 LOCAL_RANK= 16 gpu= 1”
  63: “RANK= 63 LOCAL_RANK= 63 gpu= 3”
  64: “RANK= 64 LOCAL_RANK= 0 gpu= 0”
...
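The rank-to-GPU mapping above reduces to integer arithmetic on the node-local rank; a minimal stand-alone sketch of the same expression (for checking the layout outside Slurm; the function name is hypothetical):

```shell
#!/bin/bash
# Blocked GPU assignment: the first tasks_per_node/4 local ranks share
# GPU 0, the next block GPU 1, and so on. Mirrors the script's
# gpu=$(( (SLURM_LOCALID / tasks_per_gpu) % 4 )) expression.
gpu_for_rank() {
    local localid=$1 tpn=$2 num_gpus=4
    echo $(( (localid / (tpn / num_gpus)) % num_gpus ))
}

gpu_for_rank 15 64   # -> 0
gpu_for_rank 16 64   # -> 1
gpu_for_rank 63 64   # -> 3
gpu_for_rank 2 4     # 4 tasks/node: 1 task per gpu -> 2
```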

More info at https://docs.nersc.gov/jobs/affinity/ . But --gpu-bind there doesn't work for us, because of direct gpu-to-gpu comms with MPICH_GPU_SUPPORT_ENABLED=1: --gpu-bind leads to IPC cuIpcOpenMemHandle errors like

118: gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
118: (GTL DEBUG: 118) cuIpcOpenMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 360

@rljacob commented Nov 20, 2025

Status: need to run benchmarks to make sure performance not degraded.

@ndkeen commented Nov 20, 2025

I have been looking at this.

@rljacob commented Dec 11, 2025

Note: @ndkeen is still testing.

@ndkeen commented Dec 11, 2025

First of all, do we now have the change mentioned above in CIME (azamat/pes/add-xstrid-to-xml)?
I would prefer we not merge this until that's in CIME.

Can suggestions be made for tests/scripts to use to test those new pelayouts?

I already made one of the changes in this PR (in #7851), so we can remove it here (regarding the --cpu-bind flag syntax).
I reverted this change in this branch.

For the MAX_TASKS_PER_NODE change to 128 (from 256), I now realize we do want this change -- 256 is for pm-cpu.
For pm-cpu, with 2 CPUs, the max is 256 (with hyper-threading).
But pm-gpu has only 1 CPU, so the max is 128.
It will not make a practical difference -- i.e., we could set this to a very large number -- but it documents the hardware and prevents accidentally using too many tasks.
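The 128-vs-256 limits follow directly from the node topology as stated above (two 64-core CPUs with 2-way hyper-threading on pm-cpu, one on pm-gpu): max tasks = sockets × cores × hardware threads. A trivial check (helper name is illustrative):

```shell
#!/bin/bash
# Upper bound on MPI tasks (or tasks*threads) per node:
# sockets * cores_per_socket * 2 hardware threads per core.
max_tasks() { echo $(( $1 * $2 * 2 )); }

max_tasks 2 64   # pm-cpu: 2 sockets x 64 cores -> 256
max_tasks 1 64   # pm-gpu: 1 socket  x 64 cores -> 128
```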
Will also make the change for muller/alvarez, but maybe in a different PR.
I created #7932 to make some of the safe/unrelated changes.
Then I could create a fresh PR to do the small/important change in this PR.

We want the /global/cfs/cdirs/e3sm/tools/set_affinity_npergpu.sh file to be in the repo, I think. Maybe in tools.
Well, maybe current location is ok for now?

Ideally, I'd like to only use this for those compsets that need it.
I have some initial performance testing for eamxx cases with/without, and while I don't see a difference, I would want to run more tests, especially at scale. So we can still merge the main concepts in the PR and work toward improving in another PR.

@amametjanov (Member Author):

Moved the affinity script to

cime_config/machines/scripts/pm-gpu_set_affinity_npergpu.sh

Documented a comparison of various pe-layouts (stacked, strided, xstrided, disjoint) -- up to 4x speedup stacked-vs-xstrided: https://e3sm.atlassian.net/wiki/spaces/DOC/pages/5698617345/Case+v3.ne256.wcycl+on+PM-GPU .

All runs with 4 tasks per node will continue to run at the same throughput as before: previously implicit/omitted binding, now explicit -- also recommended by NERSC at https://docs.nersc.gov/jobs/affinity/ (grep for CUDA_VISIBLE_DEVICES).

@ndkeen commented Dec 15, 2025

Yes I can share some results as well, but basically I'm not seeing any major red flags yet. One thing I wanted to test was using MPS. That testing is still ongoing, but it looks to have the same performance as before when using the gpu bind script.
I also wanted to test the NESAP 512-node benchmark.

As I've already merged some PRs that handle other aspects of this PR, the only changes we need are adding the gpu bind command to srun and the updates to pelayouts. It may be better if I do that in another PR.

However, I really think we should have the CIME change mentioned above to do pelayout testing.

@ndkeen commented Dec 16, 2025

I spoke too soon about the performance difference. With MPS on 1 pm-gpu node running a ne30 scream case, I'm seeing that using this new binding is 2% slower than without.
I can continue testing.

I think something as core as this should maybe only be done for the cases where it is needed.

I suggested one way to do this above, which is to find a way to only use this arg on the srun line for certain cases (I can make a stab at it). Another hack, which is how I've been testing myself, is to conditionally use the flag based on an env variable:

<arg name="gpu-bind">$SHELL{if [ -n "$GPUBINDTEST" ]; then echo "/global/cfs/cdirs/e3sm/tools/s.sh $(./xmlquery --value MAX_MPITASKS_PER_NODE)"; else echo ""; fi}</arg>

with this, if I set GPUBINDTEST=1, it will use it, otherwise it will not.
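The env-var gate can be exercised outside CIME; here is a minimal bash sketch of the same conditional (the script path and GPUBINDTEST variable are taken from the XML snippet above; the function name is hypothetical):

```shell
#!/bin/bash
# Emit the gpu-bind wrapper prefix only when GPUBINDTEST is set
# (non-empty), mirroring the $SHELL{...} conditional in the XML arg.
gpu_bind_arg() {
    local ppn=$1
    if [ -n "$GPUBINDTEST" ]; then
        echo "/global/cfs/cdirs/e3sm/tools/s.sh ${ppn}"
    else
        echo ""
    fi
}

unset GPUBINDTEST
gpu_bind_arg 64      # prints an empty line: no wrapper added to srun

GPUBINDTEST=1
gpu_bind_arg 64      # prints the wrapper invocation with ppn=64
```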

@amametjanov (Member Author):

2% might be expected due to run-to-run diffs.
Also, with MPS there is more than 1 task per GPU (over-subscribed), so some slow-down is expected.
Anyway, with the latest commit, binding of MPI tasks to GPUs is applied only when there are 64+ MPI processes per node. With fewer, there is no explicit binding and all 4 GPUs are visible to all on-node tasks.

@ndkeen commented Dec 16, 2025

Yes I'm familiar with run-to-run diffs, which is why I've been running several tests.
With MPS, depending on the problem and number of nodes used, etc, I might see 9-20% improvement.
So MPS is not slowing things down but improving them -- though maybe you meant that some slowdown is expected with this affinity change. I'm still hoping to implement MPS better, but for now I'm still testing.

With binding:

ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262800.251215-134544:    TOT Run Time:     146.435 seconds       29.287 seconds/mday         8.08 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262806.251215-135605:    TOT Run Time:     146.695 seconds       29.339 seconds/mday         8.07 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262829.251215-162031:    TOT Run Time:     146.359 seconds       29.272 seconds/mday         8.09 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262830.251215-162431:    TOT Run Time:     146.448 seconds       29.290 seconds/mday         8.08 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.gbind.1262831.251215-162831:    TOT Run Time:     146.476 seconds       29.295 seconds/mday         8.08 myears/wday 

without:

ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262733.251215-125321:    TOT Run Time:     143.361 seconds       28.672 seconds/mday         8.26 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262819.251215-152845:    TOT Run Time:     143.656 seconds       28.731 seconds/mday         8.24 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262823.251215-161228:    TOT Run Time:     143.295 seconds       28.659 seconds/mday         8.26 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262824.251215-161631:    TOT Run Time:     143.405 seconds       28.681 seconds/mday         8.25 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262825.251215-162031:    TOT Run Time:     143.772 seconds       28.754 seconds/mday         8.23 myears/wday 
ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps/case_scripts/timing/e3sm_timing.ne30pg2_ne30pg2.F2010-SCREAMv1.hbc2.default.5d.n01.16x4.mps.1262826.251215-162431:    TOT Run Time:     143.311 seconds       28.662 seconds/mday         8.26 myears/wday 

I had not noticed your new change:

<arg name="gpu-bind"> $SHELL{ppn=`./xmlquery --value MAX_MPITASKS_PER_NODE`; if [ 64 -le $ppn ]; then echo $CIMEROOT/../cime_config/machines/scripts/pm-gpu_set_affinity_npergpu.sh $ppn; fi;} </arg>

Ah! Yes, that could work, nice find. I will try this.

Also, I modified your affinity script to remove the odd characters that prevent cat from working:

#!/bin/bash
tasks_per_node=$1
tasks_per_gpu=$(( ${tasks_per_node} / 4 ))
gpu=$(( (${SLURM_LOCALID} / ${tasks_per_gpu}) % 4 ))
export CUDA_VISIBLE_DEVICES=$gpu

# Print the rank-to-gpu mapping for this task (plain ASCII, trailing newline)
printf 'RANK= %s LOCAL_RANK= %s gpu= %s\n' ${SLURM_PROCID} ${SLURM_LOCALID} ${gpu}

shift
"$@"

Any update on the CIME change needed?

@rljacob commented Dec 18, 2025

CIME update in #7947

@amametjanov (Member Author):

Cases that should work with the new S/M/L PE-layouts on 64/128/256 nodes (on the next branch today):

./cime/scripts/create_test --machine pm-gpu SMS_PS.ne256pg2_r025_RRSwISC6to18E3r5.WCYCLXX2010

@ndkeen commented Dec 19, 2025

OK I tried the PS case on 64 nodes. Does this performance look to be what you might expect?

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        4096        0        4096   x 1       1      (1     ) 
  atm = scream     256         0        256    x 1       1      (16    ) 
  lnd = elm        4096        0        4096   x 1       1      (1     ) 
  ice = mpassi     4096        0        4096   x 1       1      (1     ) 
  ocn = mpaso      4096        0        4096   x 1       1      (1     ) 
  rof = mosart     4096        0        4096   x 1       1      (1     ) 
  glc = sglc       1           1        1      x 1       1      (1     ) 
  wav = swav       1           1        1      x 1       1      (1     ) 
  iac = siac       1           1        1      x 1       1      (1     ) 
  esp = sesp       1           1        1      x 1       1      (1     ) 

  total pes active           : 8192 
  mpi tasks per node         : 64 
  pe count for cost estimate : 4096 

  Overall Metrics: 
    Model Cost:          126288.39   pe-hrs/simulated_year 
    Model Throughput:         0.78   simulated_years/day 

    Init Time   :     486.497 seconds 
    Run Time    :    1520.488 seconds      304.098 seconds/day 
    Final Time  :       1.218 seconds 

    Actual Ocn Init Wait Time     :       0.665 seconds 
    Estimated Ocn Init Run Time   :       1.371 seconds 
    Estimated Run Time Correction :       0.706 seconds 
      (This correction has been applied to the ocean and total run times) 

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:    1520.488 seconds      304.098 seconds/mday         0.78 myears/wday 
    CPL Run Time:     112.250 seconds       22.450 seconds/mday        10.54 myears/wday 
    ATM Run Time:     904.835 seconds      180.967 seconds/mday         1.31 myears/wday 
    LND Run Time:      19.147 seconds        3.829 seconds/mday        61.81 myears/wday 
    ICE Run Time:     351.409 seconds       70.282 seconds/mday         3.37 myears/wday 
    OCN Run Time:     987.462 seconds      197.492 seconds/mday         1.20 myears/wday 
    ROF Run Time:       3.226 seconds        0.645 seconds/mday       366.88 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:    539.398 seconds      107.880 seconds/mday         2.19 myears/wday 

and then the PS case with 128 nodes:

  component       comp_pes    root_pe   tasks  x threads instances (stride) 
  ---------        ------     -------   ------   ------  ---------  ------  
  cpl = cpl        8192        0        8192   x 1       1      (1     ) 
  atm = scream     512         0        512    x 1       1      (16    ) 
  lnd = elm        8192        0        8192   x 1       1      (1     ) 
  ice = mpassi     8192        0        8192   x 1       1      (1     ) 
  ocn = mpaso      8192        0        8192   x 1       1      (1     ) 
  rof = mosart     8192        0        8192   x 1       1      (1     ) 
  glc = sglc       1           1        1      x 1       1      (1     ) 
  wav = swav       1           1        1      x 1       1      (1     ) 
  iac = siac       1           1        1      x 1       1      (1     ) 
  esp = sesp       1           1        1      x 1       1      (1     ) 

  total pes active           : 16384 
  mpi tasks per node         : 64 
  pe count for cost estimate : 8192 

  Overall Metrics: 
    Model Cost:          153449.20   pe-hrs/simulated_year 
    Model Throughput:         1.28   simulated_years/day 

    Init Time   :     443.389 seconds 
    Run Time    :     369.500 seconds      184.750 seconds/day 
    Final Time  :       0.272 seconds 

    Actual Ocn Init Wait Time     :       0.345 seconds 
    Estimated Ocn Init Run Time   :       0.700 seconds 
    Estimated Run Time Correction :       0.355 seconds 
      (This correction has been applied to the ocean and total run times) 

Runs Time in total seconds, seconds/model-day, and model-years/wall-day 
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components 

    TOT Run Time:     369.500 seconds      184.750 seconds/mday         1.28 myears/wday 
    CPL Run Time:      39.466 seconds       19.733 seconds/mday        12.00 myears/wday 
    ATM Run Time:     206.630 seconds      103.315 seconds/mday         2.29 myears/wday 
    LND Run Time:       7.954 seconds        3.977 seconds/mday        59.52 myears/wday 
    ICE Run Time:      92.894 seconds       46.447 seconds/mday         5.10 myears/wday 
    OCN Run Time:     201.567 seconds      100.783 seconds/mday         2.35 myears/wday 
    ROF Run Time:       1.076 seconds        0.538 seconds/mday       439.99 myears/wday 
    GLC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    WAV Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    IAC Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    ESP Run Time:       0.000 seconds        0.000 seconds/mday         0.00 myears/wday 
    CPL COMM Time:    130.617 seconds       65.308 seconds/mday         3.62 myears/wday 

@ndkeen commented Dec 22, 2025

I've done more testing.
Az: are you ok with the minor changes I made to the gpu affinity script?
I'm also fine going back to hardcoding 4 GPUs per node.

@amametjanov (Member Author):

Yes, it's fine as long as the original semantics aren't modified: (MAX_MPITASKS_PER_NODE / num_gpus) tasks per gpu.

@ndkeen commented Dec 25, 2025

I had trouble working with this branch as-is, and decided to redo the changes into a new branch with a new PR #7962

@ndkeen ndkeen closed this Dec 25, 2025
@ndkeen commented Jan 9, 2026

It sounds like Rob wants to re-open this.
Can someone rebase it?
Also, I don't think the removal of the MMF section in config_batch.xml is related here.
Just like the other pieces I moved to other (now merged) PRs, we should split that change out as well.

Also, please include the minor change I had made in PR #7962, so that the same changes made for pm-gpu are done for muller/alvarez.

@amametjanov amametjanov reopened this Jan 10, 2026
@amametjanov force-pushed the azamat/pes/add-xstrid-ne256-wcycl branch from f1ba8c6 to 3138334 on January 10, 2026 22:08
@amametjanov force-pushed the azamat/pes/add-xstrid-ne256-wcycl branch from ee73e6a to 59f90dd on January 10, 2026 23:41
@amametjanov (Member Author):

Ok, done: rebased onto latest 2026-Jan-9 master and addressed the MMF comment.

@ndkeen commented Jan 11, 2026

You said you rebased, but why do the file diffs still show the OMP changes? Those were done in #7932.

@amametjanov (Member Author):

pm-cpu on master still has omp lines here.
Just a clean-up, which can be removed.

@ndkeen commented Jan 11, 2026

I may have missed those in my PR. I think this change should be in a different PR than this one though, right?

What about #7980?

@amametjanov (Member Author):

Ok, done

@ndkeen commented Jan 14, 2026

OK thanks for that Az. I have this branch running some final tests and all look good so far. Will merge to next asap.

ndkeen added a commit that referenced this pull request Jan 14, 2026
Update gpu affinity on pm-gpu (and muller-gpu, alvarez-gpu) for 64+ ppn
such that 16+ processes per node see only one gpu
(with CUDA_VISIBLE_DEVICES).

Also, add S/M/L PE-layouts for ne256-wcyclxx on pm-gpu/muller-gpu/alvarez-gpu
with exclusive-process-striding such that every 16th on-node mpi-process
is exclusively reserved for EAMxx with all other on-node procs running other comps.

[BFB]
@ndkeen commented Jan 14, 2026

merged to next

e3sm_eamxx_v1 tests passed and SMS_PS.ne256pg2_r025_RRSwISC6to18E3r5.WCYCLXX2010.pm-gpu_gnugpu had same performance as reported above.

@ndkeen ndkeen merged commit a6d19ca into master Jan 15, 2026
5 of 6 checks passed
@ndkeen ndkeen deleted the azamat/pes/add-xstrid-ne256-wcycl branch January 15, 2026 23:34
odiazib pushed a commit to odiazib/E3SM that referenced this pull request Feb 5, 2026

Labels

BFB (PR leaves answers BFB), Machine Files, Performance, pm-gpu (Perlmutter machine at NERSC, GPU nodes)
