WeeklyTelcon_20200512
Geoffrey Paulsen edited this page May 13, 2020 · 1 revision
      
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
- Austen Lauria (IBM)
 - Christoph Niethammer (HLRS)
 - David Bernhold (ORNL)
 - Edgar Gabriel (UH)
 - Geoffrey Paulsen (IBM)
 - George Bosilca (UTK)
 - Harumi Kuno (HPE)
 - Howard Pritchard (LANL)
 - Jeff Squyres (Cisco)
 - Joseph Schuchart
 - Josh Hursey (IBM)
 - Joshua Ladd (nVidia/Mellanox)
 - Matthew Dosanjh (Sandia)
 - Michael Heinz (Intel)
 - Naughton III, Thomas (ORNL)
 - Ralph Castain (Intel)
 - Todd Kordenbrock (Sandia)
 
- Barrett, Brian (AWS)
 - Brendan Cunningham (Intel)
 - William Zhang (AWS)
 - Akshay Venkatesh (NVIDIA)
 - Artem Polyakov (nVidia/Mellanox)
 - Brandon Yates (Intel)
 - Charles Shereda (LLNL)
 - Erik Zeiske
 - Geoffroy Vallee (ARM)
 - Mark Allen (IBM)
 - Matias Cabral (Intel)
 - Nathan Hjelm (Google)
 - Noah Evans (Sandia)
 - Scott Breyer (Sandia?)
 - Shintaro Iwasaki
 - Xin Zhao (nVidia/Mellanox)
 - mohan (AWS)
 
- If you change your MTT setup to start PRRTE at the beginning of the session and then just use prun:
 - Run times can be cut in half or more.
 - This is good, but we also need to test the mpirun wrapper.
 - Cisco is converting half of its MPI installs to use PRRTE/prun.
 
- OMPI master submodule pointers are set up to track PMIx and PRRTE master.
 
Blockers All Open Blockers
Review v4.0.x Milestones v4.0.4
- v4.0.4rc1 - available this last weekend (see: https://www.open-mpi.org/software/ompi/v4.0/)
- 7616 - ABI break introduced in OMPI v4.0.3 for some f08 symbols.
 - Got feedback - It hangs on launch.
- Mailing list devel.
 
 - Pull Request checker - Something is going on with SLES 12 / AWS automation
- Blocking ALL PRs.
 
 - Probably do an RC2
 
 - Discuss if we want to take https://github.com/open-mpi/ompi/pull/7698 to v4.0.4?
- NEWS-worthy.
 - Doesn't break backward or forward guarantees.
 - Yes, let's take it.
 - History: libevent changed its library names to libevent_core / libevent_pthread.
- libevent is the sum of libevent_core and libevent_extra.
 - libevent_core is the code OMPI uses, and libevent_extra is other functionality that OMPI doesn't use.
 
 - Why on v4.0.x before master?
 - On v4.0.x this was a "complete" solution, but on master the fix needs to be split up into ompi, pmix, and prrte pieces.
 
 
Review v5.0.0 Milestones v5.0.0
- Schedule:
- Can't fork until configure changes are in.
- PRTE is still chugging along.
 - Slipping to at least May 22nd.
 - Taking it week-by-week.
 
 - Feature Freeze: May 14 (slipped from April 30)
- Please post a PR ASAP as a placeholder.
 
 - Release: End of June
 - Pandoc - got a little pushback on Open HPC
- Not all MTT systems have pandoc - Absoft, AWS, HLRS.
 
 
 - PMIx v4.0.0 - on track
- Schedule: Still a number of issues, but probably not blocking
 
 - Hwloc - Are we still going to support the older 1.x?
- Issue: master build failure on Ubuntu, because its hwloc is too old.
 - Distros won't use the embedded hwloc 2.x
 - If Open MPI doesn't support hwloc 1.x, Open-MPI
 - What's the effort to support hwloc 1.x?
- Coding effort: we have to build against the older version and adjust accordingly.
 
 - Sounds like we don't have a choice but to do it.
 - PRRTE could handle it, but not true with new binding stuff (in Ralph's branch)
 - Also HWLOC ABI break between hwloc 2.1 and 2.2.
 - Need to drop in a major release.
 - Master MPICH dropped support for hwloc 1.x
 - A lot of supported distros still at hwloc: https://github.com/open-mpi/ompi/wiki/OMPI-Third-Party-Packages
 - Need to test both versions of hwloc.
 - Which 1.x version? 1.8 or 1.10?
 
 - PRRTE v2.0 -
- Went through issues to discuss remaining issues.
 - MCA usage is very different in PRRTE than in ORTE.
- ORTE was a "one-shot" launcher, but PRRTE is persistent.
 - When launching PRRTE you can set "defaults" for the daemon.
 - Individual pruns override these defaults via command line parameters, not MCA parameters.
 
 - A lot of change.
 - There are now two MCA users in the job, OPAL and PRRTE. A parameter set in the wrong one gets ignored, which is confusing.
 - There will be a lot of MCA param files that won't do what people expect them to do.
- Might want to open some issues on the OMPI side to do some docs.
 
 - Report-bindings doesn't make sense as a "default setting" in PRRTE, so it is always set on a per-job basis.
 - RC1 blockers: things to get done before RC1 (maybe 2-3 weeks?)
- Need to get user-facing stuff done to reduce user confusion.
 - Binding reporting should be done (confusing) 523 - Needs thinking/careful work.
 - A bunch of gnarly issues in here:
- Call tomorrow?
 - Socket -> Package name change - Should we do now or later?
- Already a lot of change, but hwloc has already moved onto the new name.
 
 - Also what to do with NUMA? - Doesn't even make sense anymore on some architectures.
 
 - Depends on what versions of hwloc we support. Will be tricky (or more expensive) to support both hwloc 1 and 2.
 - Is there a list of distros and hwloc versions? Brian will put together list.
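
The default-versus-per-job behaviour described for PRRTE above can be sketched in a few lines of Python. This is only an illustration of the override order, not PRRTE code: the option names, the `PER_JOB_ONLY` set, and both helper functions are hypothetical.

```python
from collections import ChainMap

# Hypothetical: options that only make sense per job (the notes give
# binding reports as an example), so they are refused as daemon defaults.
PER_JOB_ONLY = {"report-bindings"}


def make_daemon_defaults(defaults):
    """Validate the defaults chosen when the persistent daemon starts."""
    rejected = PER_JOB_ONLY.intersection(defaults)
    if rejected:
        raise ValueError(f"per-job only, cannot be a default: {sorted(rejected)}")
    return dict(defaults)


def effective_job_settings(daemon_defaults, job_cli_options):
    """Merge settings for one job; command line options win over defaults."""
    # ChainMap searches its first mapping first, so job options take priority.
    return dict(ChainMap(job_cli_options, daemon_defaults))


defaults = make_daemon_defaults({"bind-to": "core", "map-by": "slot"})
job = effective_job_settings(defaults, {"bind-to": "none", "report-bindings": True})
print(job["bind-to"], job["map-by"], job["report-bindings"])
```

A second job launched with no options would simply see the daemon defaults, which is the practical difference from a one-shot launcher.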
 
 
 - Discussing Features on google sheets document
- (https://docs.google.com/spreadsheets/d/1OXxoxT9P_YLtepHg6vsW3-vp4pdzGQgyknNbkzenYvw/edit#gid=0) which were taken from the face to face wiki.
 - Please post PRs ASAP even if not quite ready yet.
 
 - Please send collective tuning data to AWS to help select new defaults.
 - Today with libevent, we default to preferring the external libevent if its version is newer (we redistribute 2.0.22).
- Still a bunch of distros that ship a libevent not newer than 2.0.22, but that works.
 
 - For v5.0 we're continuing down the path of NOT encouraging users to use the internal libs.
- So probably should just use external if it's found, as long as it's newer than 2.0.21 (trusted version)
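
As a rough sketch of the v5.0 selection rule discussed above (use an external libevent when one is found and it is newer than the trusted 2.0.21, otherwise fall back to the internal copy). The real decision would live in configure; the `choose_libevent` helper and the assumption that version strings start with three dot-separated numbers are illustrative only.

```python
import re

# Trusted version named in the notes; anything newer is accepted.
TRUSTED_VERSION = (2, 0, 21)


def parse_version(text):
    """Return the leading major.minor.patch of a version string as a tuple."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def choose_libevent(external_version):
    """Pick 'external' or 'internal'; None means no external copy was found."""
    if external_version is None:
        return "internal"
    if parse_version(external_version) > TRUSTED_VERSION:
        return "external"
    return "internal"


for found in (None, "2.0.21", "2.0.22", "2.1.8-stable"):
    print(found, "->", choose_libevent(found))
```

Tuple comparison keeps the ordering correct for multi-digit components, which a plain string comparison would not.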
 
 - Issues not tracked on spreadsheet.
- libopal isn't slurped into Open-MPI correctly (related to 7560)
- Jeff and Brian will meet Friday
 
 
 
- Hierarchical collectives
- If someone wants to do this, PMIx has much of this information already.
 - Not too hard to do, and they're much faster. Will be in next version of competitor MPI
 - Probably not for v5.0
 
 - SLURM PMIx plugin has been locked on PMIx v2 for some time.
- There are some NEW PMIx calls that SHOULD be added to bring it up.
- Ralph has started a PR, but needs help.
 
 - PR???
 - So for now, there's some optional info that won't be passed correctly.
- No OMPI_INFO for now.
 - Ralph gets pinged occasionally.
 
 - Not sure priority of this.
 
 - MTT on master is looking pretty good.
 
- Deferred.
 
- scale-testing, PRs have to opt-into it.
 
Review Master Master Pull Requests
- CI testing only tests build and did it run, but doesn't test HOW it ran.
- Environment setup can be a bit different.
 - For example, no permissions in /tmp: a test might pass on one machine and fail on another that lacks /tmp permissions.
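
One way a test harness could surface this kind of environment difference up front is to probe the temporary directory before running anything. A minimal sketch, assuming a hypothetical `tmp_is_usable` helper that is not part of any existing CI or MTT setup:

```python
import tempfile


def tmp_is_usable(directory=None):
    """Return True if a file can be created and written in the directory."""
    # Fall back to the platform default temp dir (usually /tmp on Linux).
    directory = directory or tempfile.gettempdir()
    try:
        # The file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
        return True
    except OSError:
        # Covers missing directories and permission errors alike.
        return False


if not tmp_is_usable():
    print("warning: temp directory is not writable; results may differ")
```

Reporting the probe result alongside the test output would make it easier to tell an environment failure from a real regression.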