Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] aurora config, clean up #6916

Draft
wants to merge 36 commits into
base: master
Choose a base branch
from
Draft

Conversation

oksanaguba
Copy link
Contributor

No description provided.

@oksanaguba oksanaguba mentioned this pull request Jan 19, 2025
@oksanaguba
Copy link
Contributor Author

this commit is running on sunspot

~/ess/v1-jan19/run/atm.log.11365537.amn-0001.250119-204314 

@oksanaguba
Copy link
Contributor Author

oksanaguba commented Jan 19, 2025

i should have not committed to here turning off compose and mam, will remove once this PR is finalized.

@ambrad ambrad requested a review from bartgol January 19, 2025 21:29
@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

Looks good. A couple of points:

  1. In addition to any usual pre-merge testing, I strongly recommend running e3sm_scream_v1_medres with this PR on Frontier, pm-gpu, and Chrysalis, as well as homme_integration on Chrysalis.
  2. I'm thinking the EKAT update isn't right. No, it's fine. I looked more closely. I'm seeing some commits from Luca that diff, but I suspect that's intended. Edit 2: Ok, maybe my initial analysis was right: there seems to be some unintended diffs in ekat.

if (Kokkos_ENABLE_SYCL)
#enable_language(SYCL)
set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE)
set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)

@@ -23,7 +23,8 @@ set(BUILD_HOMME_PREQX_KOKKOS OFF CACHE BOOL "")
set(BUILD_HOMME_PESE OFF CACHE BOOL "")
set(BUILD_HOMME_SWIM OFF CACHE BOOL "")
set(BUILD_HOMME_PRIM OFF CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
#set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting here this is what needs to be removed from this PR.


#<<<<<<< HEAD
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.

@oksanaguba
Copy link
Contributor Author

i id not understant EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. so i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master, but its kokkos points not to e3sm kokkos, but to kokkos develop branch.

this is another issue to sort out (but i would not know how).

@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

i id not understant EKAT comment

Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383

@bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM?

If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch.

More generally, testing across machines will help to flush out issues, if there are any, including any with ekat.

@oksanaguba
Copy link
Contributor Author

there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks.

@bartgol
Copy link
Contributor

bartgol commented Jan 27, 2025

@ambrad I'm assuming you are compiling with Kokkos::Serial as a backend, right? The CI fails when configuring eamxx are exposing an old bug in ekat (and eamxx) that shows up when openmp is on. Oksana pulled in the commit that fixes things in ekat, but we need a fix in eamxx.

@oksanaguba I can push the fix straight to your branch, if you want. Or I can send you the patch.

@rljacob
Copy link
Member

rljacob commented Jan 27, 2025

What is the purpose of introducing 4 new machines? sunspot-pvc, sunspotcpu, sunspot-gen, and auroracpu ?

@ambrad
Copy link
Member

ambrad commented Jan 28, 2025

@bartgol do you have a branch with your fix in it? That said, I'm confident the fix I made to compose is needed: I see the change in behavior necessitating this fix was made in Kokkos 4.4.

@bartgol
Copy link
Contributor

bartgol commented Jan 28, 2025

@bartgol do you have a branch with your fix in it? That said, I'm confident the fix I made to compose is needed: I see the change in behavior necessitating this fix was made in Kokkos 4.4.

I don't. The fix depends on certain stranded ekat commit that got cherry picked in the ekat-kokkos-4.5 branch, so I kind of need that ekat branch to include them. I didn't want to create multiple branches where ekat is updated, so I may just piggy-back a fix to this branch. Again, the fix only shows up when Kokkos_ENABLE_OPENMP=TRUE AND a commit from ekat is present. This was a case of double-negative: two bugs canceling each other out by luck...

I haven't looked at your changes, so I can't comment. All I'm saying is that the fact that you were able to build makes me believe that Kokkos_ENABLE_OPENMP=FALSE in your build (I think in cime builds you get that to be ON if you ask for compile_threaded, not sure what the cime syntax is for asking that option)

@ambrad
Copy link
Member

ambrad commented Jan 28, 2025

@bartgol one thing that might be confusing here is I'm fixing the F90 dycore used with EAM, not the C++ dycore used with EAMxx. So EKAT is irrelevant in my case.

@oksanaguba
Copy link
Contributor Author

@ambrad yes, please push the fix, this branch is useless without 4.5.01 kokkos. regarding diffs -- if you gave a guess, it would be great. if not, i was going to revert a few changes to try to narrow it down, prob using standalone homme.

regarding your conversation above with Luca -- ii do not follow. but one test that uses hommexx dycore, ERS_D.ne4pg2_oQU480.F2010.chrysalis_intel.eam-hommexx , also freezes at the same spot in compose. i assume it is still not related to Luca's issue. i will try to re-read messages tomorrow to understand Luca's point better.

@oksanaguba
Copy link
Contributor Author

@rljacob Re: "What is the purpose of introducing 4 new machines? sunspot-pvc, sunspotcpu, sunspot-gen, and auroracpu ?" it is much easier to debug setups early on when they are separated. this is related to the issue that Az raised, that this should be cleaned up eventually.

@ambrad
Copy link
Member

ambrad commented Jan 28, 2025

@oksanaguba I've pushed the commits. I'm somewhat sure that the diffs occur only in the SL code; I'm currently hunting for the exact cause.

Update 1: At this point I have provisional evidence that at least the non-_D tests in e3sm_atm_developer will pass when the branch is rebased on current master. I'm testing that guess now. For Chrysalis, that will leave HOMMEBFB in homme_integration to figure out; the other test, HOMME, is already OK.

Update 2: Yes, e3sm_atm_developer passes against baselines with current master merged in (including a _D test), except for eam-hommexx. I'm going to tackle HOMMEBFB now and then come back to eam-hommexx if it doesn't get fixed along the way.

Update 3: I believe I have isolated the HOMMEBFB diff source. I will fix that, then retest everything on Chrysalis. If all is well, I will push the commit to this branch. Next, I'll move to pm-gpu and Frontier to address a GPU-specific issue.

Update 4: homme_integration passes on Chrysalis. Now moving on to GPU stuff. Also running e3sm_atm_integration against baselines on Chrysalis with upstream/master merged into this branch; will report back when that's done.

Update 5: homme_integration and e3sm_atm_integration (with master merged to branch) pass against baselines on Chrysalis. So I think Chrysalis is done.

Update 6: I found another way to handle the view of views in compose that avoids the hang while not requiring SequentialHostInit. Chrysalis is still fine, and the dycore now builds on Frontier, but there are issues it looks like at the EAMxx level that I'll look into next.

Update 7: With Jim's PR https://github.com/E3SM-Project/E3SM/pull/6879/commits, we can get past the spurious -x that was leading to a build error on Frontier (crayclang-scream). With SCREAM_ENABLE_MAM=Off, an ne30pg2 ERS test passes. With SCREAM_ENABLE_MAM=On, the build fails as reported here: #6916 (comment)

Update 8: ne30pg2 ERS passes with frontier-scream-gpu_craygnuamdgpu, and SCREAM_ENABLE_MAM=On is OK.

option(SCREAM_ENABLE_MAM "Whether to enable MAM aerosol support" ON)
if (Kokkos_ENABLE_SYCL)
option(SCREAM_ENABLE_MAM "Whether to enable MAM aerosol support" OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will work as you intend. You may have to simply do set(SCREAM_ENABLE_MAM OFF)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, but that one unlike SET worked.

//Whether to enable MAM aerosol support
SCREAM_ENABLE_MAM:BOOL=OFF

if you have a preference i will change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm puzzled. This

option(BLAH "" ON) 
message ("BLAH: ${BLAH}")
option(BLAH "" OFF)
message ("BLAH: ${BLAH}")

prints

BLAH: ON
BLAH: ON

Are you sure you did not have another line before those that set SCREAM_ENABLE_MAM to OFF? Maybe in the mach file? Also, was it a clean build?

My preference is for using set(... CACHE ..) (or option(...)) only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this in Professional CMake:

An important difference between normal and cache variables is that the set() command will only
overwrite a cache variable if the FORCE keyword is present, unlike normal variables where the set()
command will always overwrite a pre-existing value. The set() command acts more like set-if-not-
set when used to define cache variables, as does the option() command (which has no FORCE
capability). The main reason for this is that cache variables are primarily intended as a
customization point for developers.

which makes me think something else may have caused SCREAM_ENABLE_MAM=OFF in your cache file...

@ambrad
Copy link
Member

ambrad commented Jan 30, 2025

@bartgol did you say you have a fix for

CMake Error at /gpfs/fs1/home/ac.ambradl/e3smdev1/externals/ekat/cmake/EkatSetCompilerFlags.cmake:294 (message):
  [SetOmpFlags] You need to specify at least one language

This is on Chrysalis, where I'm trying to build an EAMxx medres test with this branch. If you have a fix, feel free to push it to this PR branch.

@ambrad
Copy link
Member

ambrad commented Jan 30, 2025

On Frontier, it seems that flags.make for certain EAMxx interface files gets a spurious -x that leads to

clang-15: error: language not recognized: '-std=gnu++17'

One sees it in flags.make here: -x hip -fno-gpu-rdc -x --offload-arch=gfx900. The second -x shouldn't be there. I haven't found what's injecting it. An example is bld/cmake-bld/eamxx/src/physics/rrtmgp/CMakeFiles/scream_rrtmgp.dir/flags.make. Another: bld/cmake-bld/eamxx/src/mct_coupling/CMakeFiles/atm.dir/flags.make.

@tcclevenger
Copy link
Contributor

tcclevenger commented Jan 30, 2025

On Frontier, it seems that bld/cmake-bld/eamxx/src/physics/rrtmgp/CMakeFiles/scream_rrtmgp.dir/flags.make gets a spurious -x that leads to

clang-15: error: language not recognized: '-std=gnu++17'

One sees it in flags.make here: -x hip -fno-gpu-rdc -x --offload-arch=gfx900. The second -x shouldn't be there. I haven't found what's injecting it.

I was getting language not recognized: '-Md'. Also coming from rrtmgp.

@ambrad
Copy link
Member

ambrad commented Jan 30, 2025

@tcclevenger I just updated my comment with another example. rrtmgp itself has nothing to do with it; it just comes first. As for your -Md case, check the build line for a spurious -x, possibly right before the -Md. As far as I can tell, the language not recognized message is meaningless and seems to get emitted due to bailing on a -x with no actual argument to -x.

Later: rrtmgp did have something to do with it. I don't know the exact issue, but I recorded the fix above.

@bartgol
Copy link
Contributor

bartgol commented Jan 30, 2025

After fixing the CMake issues, I do get an error building eamxx, which I am not sure why you don't get building via CIME:

/storage/workdir/e3sm/e3sm-src/master/externals/ekat/src/ekat/ekat_session.cpp:65:6: error: 'class Kokkos::InitializationSettings' has no member named 'set_num_devices'
   65 |     .set_num_devices(nd)

It is a file that is always compiled (by eamxx, not standalone homme), regardless of CIME-vs-standalone. It needs to be fixed in ekat.

Edit: ah, I am disabling deprecated code 4 in kokkos, and set_num_devices is deprecated. I highly recommend we start adding option (Kokkoe_ENABLE_DEPRECATED_CODE_4 "" OFF) in all our mach files, and get rid of deprecated code now, in view of Kokkos 5.0 (which will likely remove these deprecations).

@@ -212,8 +225,8 @@ endif()
# #cmakedefine RRTMGP_EXPENSIVE_CHECKS
option (SCREAM_RRTMGP_DEBUG "Turn on extra debug checks in RRTMGP" ${SCREAM_DEBUG})

option(SCREAM_RRTMGP_ENABLE_YAKL "Use YAKL under rrtmgp" TRUE)
option(SCREAM_RRTMGP_ENABLE_KOKKOS "Use Kokkos under rrtmgp" FALSE)
option(SCREAM_RRTMGP_ENABLE_YAKL "Use YAKL under rrtmgp" FALSE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be an important change. I'm checking whether reverting this fixes Frontier. FYI, @tcclevenger @jgfouca

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this must be the issue. I have a branch coming where Kokkos will be fully supported and have good performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No luck. Still get the same issue. Will try something else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgfouca @tcclevenger the above commit does the trick. I got an ne30pg2 ERS test to run and pass with this and a commit (pushed to this branch) to fix a line of F90.

@ambrad
Copy link
Member

ambrad commented Jan 31, 2025

Once #6879 is merged, let's rebase this branch on master so Frontier builds work.

@ambrad
Copy link
Member

ambrad commented Jan 31, 2025

On Frontier, with SCREAM_ENABLE_MAM=On and crayclang-scream, eamxx_mam_microphysics_process_interface.cpp fails to build with error: unhandled SGPR spill to memory. I need to check with this PR rebased on master. With craygnuamdgpu, On is fine.

Update: Similar error for the same file, eamxx_mam_microphysics_process_interface.cpp, when updated to current master: error: ran out of registers during register allocation.

@bartgol
Copy link
Contributor

bartgol commented Jan 31, 2025

@oksanaguba I pushed a fix on the ekat branch for the kokkos initialization setting error. When we are ready to merge this, I recommend we first issue a PR in ekat, to get stuff integrated in ekat/master, and then we update this PR, to point to an ekat/master version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Machine Files wip work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants