WeeklyTelcon_20180724
        Geoffrey Paulsen edited this page Jan 15, 2019 
        ·
        1 revision
      
    - Dialup Info: (Do not post to public mailing list or public wiki)
 
- Geoff Paulsen
 - Josh Hursey
 - Joshua Ladd
 - Brian
 - akshay
 - Geoffroy Vallee
 - Nathan Hjelm
 - Peter Gottesman (Cisco)
 - Todd Kordenbrock
 - Xin Zhao
 
- Matias Cabral
 - Howard.
 - Edgar Gabriel
 - Jeff Squyres
 - Ralph Castain
 - Thomas Naughton
 - Akvenkatesh (nVidia)
 - Matthew Dosanjh
 - Howard Pritchard
 - Dan Topa (LANL)
 - David Bernholdt
 
- 
default to external for v4.0
- should this be a blanket statement, or is there a version limit? For example, if someone has a v1.1 version of PMIx installed in a default location, do we really want to use that versus the internal v3.0?
- if external support is found and is "compatible"... keep it fuzzy.
- If we know it won't work, just hardcode versioning.
 - If it's compatible, but lower than internal, we'd emit a warning, but still use external one.
 
 - no issue with PMIx v3.0 vs 2.x, since we haven't implemented new features anyway.
 - A little worried about PMIx, since older Slurm installations are still PMIx v1.x based.
 - If they're using PMIx v1.2.5, OMPI needs to use v1.2.5, since an older PMIx won't work with a newer PMIx.
 - Does PMIx website have this compatibility chart somewhere? We should point to it.
 - We like checking at configure time, since it's a nice early failure, but this will have to be runtime failures.
 - configure summary at end. opal_summary_add m4.
 
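The selection policy discussed above can be sketched as follows. This is only an illustration of the decision logic in Python, not Open MPI's configure code: the function name `choose_pmix`, the `Choice` tuple, and in particular the `MIN_COMPATIBLE` cutoff are assumptions, since the notes deliberately leave "compatible" fuzzy.

```python
from typing import NamedTuple, Optional, Tuple

Version = Tuple[int, int]

INTERNAL_PMIX: Version = (3, 0)   # internal (bundled) version named in the notes
MIN_COMPATIBLE: Version = (1, 2)  # ASSUMPTION: illustrative cutoff, not a real limit


class Choice(NamedTuple):
    use_external: bool
    warning: Optional[str]


def choose_pmix(external: Optional[Version]) -> Choice:
    """Decide between an external PMIx install and the internal copy."""
    if external is None:
        return Choice(False, None)   # nothing found: use internal
    if external < MIN_COMPATIBLE:
        return Choice(False, None)   # known not to work: hardcoded version limit
    if external < INTERNAL_PMIX:
        msg = "external PMIx v%d.%d is older than internal v%d.%d" % (
            external + INTERNAL_PMIX)
        return Choice(True, msg)     # compatible but older: warn, still use it
    return Choice(True, None)        # compatible and at least as new as internal
```

With these assumed values, `choose_pmix((1, 1))` falls back to the internal copy, while `choose_pmix((2, 1))` selects the external install and carries a warning. As the notes point out, a real mismatch may only show up at run time, so a configure-time check like this can only catch the cases that are known in advance.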
 - 
Comm_Spawn issue:
- George has raised a comm_spawn issue about inheriting MCA params https://github.com/open-mpi/ompi/issues/5376
- The child job is launched using the same MCA params as the first node. There isn't an easy way to override those, and some can't be overridden at all (other than by not using them).
 - He proposed that MCA parameters should not be inherited by child jobs. If a parameter is set on the command line, as opposed to in an MCA param file, then it shouldn't be passed to child jobs.
 
 - for Open MPI v4.0
- Just a couple of lines of code to NOT propagate the MCA params related to launch.
 - map-by option - can't turn off. (perhaps we need a sentinel meaning "default")
 
 
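The "do not propagate launch-related MCA params" idea for v4.0 can be modeled as a filter over the environment handed to the child job. This is a sketch in Python, not the actual ORTE change: MCA parameters can be passed as `OMPI_MCA_<name>` environment variables, but the function name `child_env` and the choice of which framework prefixes count as "launch-related" are assumptions made for illustration.

```python
MCA_ENV_PREFIX = "OMPI_MCA_"

# ASSUMPTION: frameworks treated as "launch-related" for this sketch only.
LAUNCH_RELATED_PREFIXES = ("rmaps_", "plm_", "ras_")


def child_env(parent_env, blocked=LAUNCH_RELATED_PREFIXES):
    """Return a copy of parent_env without launch-related MCA params."""
    result = {}
    for key, value in parent_env.items():
        if key.startswith(MCA_ENV_PREFIX):
            param = key[len(MCA_ENV_PREFIX):]
            if param.startswith(tuple(blocked)):
                continue  # drop: the child job should use its own default
        result[key] = value
    return result


parent = {
    "OMPI_MCA_rmaps_base_mapping_policy": "node",  # launch-related, dropped
    "OMPI_MCA_btl": "tcp,self",                    # not launch-related, kept
    "PATH": "/usr/bin",                            # not an MCA param, kept
}
spawned = child_env(parent)
```

Dropping a parameter entirely is one way to get the "default" behavior that the map-by item asks for; a sentinel value would be the alternative when the parameter has to stay present.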
 - 
nVidia - update for UCX CUDA support
- ompi_opal converter does not need any changes (verified by Akshay); the changes previously thought to be needed turned out not to be.
 - See more on UCX and --with-cuda below.
 
 
Review All Open Blockers
Review v2.x Milestones v2.1.4
- v2.1.4 - Final release on v2.1.x
 - Aug 10th is release date.
 - Targeting RC1 for v2.1.4 next week. - Howard will drive as Jeff is on vacation.
 - Will need PMIx fix from Issue 5336
- Still needed as of Jul 24.
 
 
Review v3.0.x Milestones v3.0.3
- Schedule:
 - v3.0.3 - targeting Sept 1st
- Anticipate RC1 after the Aug 10th release of v2.1.4.
 - Cisco is seeing some weirdness in v3.0 and v3.1
- Haven't nailed it down, and haven't reported it yet. PMIx / runtime.
 - Ralph wants to see what happens when upgrade to PMIx v2.2, but probably a problem in ORTE.
- resolved?
 
 
 
 - Growing Pile of PRs that don't have reviews.
 
Review v3.1.x Milestones v3.1.0
- v3.1.2 release process, starts after Sept 1st release of v3.0.3
 - Lots of PRs
 - Power9 hang in make check
- Nathan fixed in https://github.com/open-mpi/ompi/pull/5374
 - --disable builtin works around.
 - Worth fixing on v3.0 and v3.1
 - reported on gcc v7.2
 
 - Issue 5363 - teardown in shmem that tries to unlink a file that was already unlinked.
- Fixed in master. Cherry-picked to other branches.
 
 - Issue 5336 - Brian merged PR that will print a bit more info.  PMIx + libevent issue hit in cisco.
- Gilles tracked it down last night, and Ralph will PR it. Will need to go into v2.1.x also.
 
 - PMIx v2.1.2 in Open MPI v3.1 and v3.0 - Ralph is about to release PMIx v2.1.2; does Open MPI want to embed it?
- Ralph
 
 
- 
Schedule: branch: July 18. release: Sept 17
- Date for all MTT testing online - July 22? -
 - Date for first RC - Aug 13 (after sunset of 2.1.4)
 
 - 
Targeting UCX v1.4 to support CUDA buffers.
- May be changes in UCX PML and/or datatype converter.
 - Will have more info by next week.
 
 - 
Create the branch, then after that's created, then create nightly tarballs, and only after that can MTT be setup.
- ACTION - send notes to brian to pre-emptively tell us if there will be issues.
 
 - 
Cuda support - cudasm, and openib
- Still a couple of steps away from being on par in UCX regarding CUDA support.
 - Does nVidia want openib included by default when --with-cuda is given?
- Yes, because at this moment UCX is not on par, but still want to migrate to UCX CUDA.
 - Warning message will mention deprecated openib vs UCX
 
 
 - 
NEWS - Deprecate MPIR message for NEWS - Ralph can help with this.
 - 
Sent email to ompi-packagers list with schedule and info on
- Debian (Allister)
 - Jeff has a half-typed-out reply.  Allister asked: do we really need to change the major .so version?
- His point is that they have more and more packages compiled against Open MPI.
 - The real question is: do we mark the version as backwards incompatible?
 - The idea is that you have to have the same version of Open MPI everywhere.
 - Nathan would say (watching what's gone in), that we should be okay, but that we should look.
 
 - In v3.1 --mpi-compat was on by default.  In v4.0 there's a flag where the ABI didn't change.
- MPI_UB was #defined to &ompi_datatypeT
 - Could put the symbols back so they're there.
 - verification - enable cxx - Would Paul Hargrove help with .so testing?
 
 - Fork in the road here... A couple of options:
- Set the CRA (.so versions) (assuming --mpi-compat is NOT enabled). An app compiled against v3.1 won't run with v4.0.
 - Dynamically set the CRA values (.so versions) based on the --mpi-compat flag: a bunch of maintenance work.
- Really HAIRY!
 
 - NICE: Could make --mpi-compat affect only mpi.h, and not affect the symbols in the back end.
- As long as they go away eventually - no point in removing from standard if we don't eventually remove them.
 - Raise C by 10, AND raise A by 10 - Make sure we get it right (do it like a minor release bump)
- use rules from v3.0.x to v3.0.x+1 -
 
 
 
- Don't know how this affects Fortran (separate MPI_UB and MPI_LB). It's the same as in C: fails to compile without --mpi-compat.
 
 - Symbols WILL go away in v5.0
 - Geoff and Howard will build test suites with v3.1.x and run with master/v4.0 to see if anything breaks.
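The "raise C by 10 and raise A by 10" option can be checked with a little arithmetic. On Linux, libtool derives the numeric suffix of a shared library from its `current:revision:age` triple as `(current - age).age.revision`, so raising C and A together leaves the leading (soname) number unchanged. The starting values and function names below are made up for illustration and are not Open MPI's actual .so versions.

```python
def soname_major(cra):
    """Leading number of the .so name on Linux: current - age."""
    current, _revision, age = cra
    return current - age


def linux_suffix(cra):
    """Full suffix libtool uses on Linux: (current - age).age.revision."""
    current, revision, age = cra
    return "%d.%d.%d" % (current - age, age, revision)


def compatible_bump(cra, step=10):
    """Raise C and A by the same step; revision resets to 0."""
    current, _revision, age = cra
    return (current + step, 0, age + step)


def incompatible_bump(cra):
    """Interfaces removed: C goes up, A and revision reset to 0."""
    current, _revision, _age = cra
    return (current + 1, 0, 0)


before = (40, 3, 20)                  # HYPOTHETICAL c:r:a, for illustration
kept = compatible_bump(before)        # (50, 0, 30)
broken = incompatible_bump(before)    # (41, 0, 0)
```

Here `soname_major(before)` and `soname_major(kept)` are both 20, so an application linked against the old library would still resolve the new one, whereas `soname_major(broken)` is 41 and the old application would not load it. That is the difference between the two forks described above.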
 
 - 
Still at risk features for branch of v4.0 on July 15.
- UTK ULFM - Fault Tolerant - Geoff JUST emailed George after meeting.
 - external preferences configury - Gilles did libevent.
- Ralph will update PMIx.
 - leaves hwloc - Jeff will review Gilles' code, and see if it can be easily translated to hwloc
- potentially miss v4.0
 
 
 
 
- Ralph merged in some PMIx v3.0
 - Overall Runtime Discussion (talking v5.0 timeframe, 2019)
- TODAY - Geoff Paulsen will send out doodle for next week to devel-core.
 - 10am Central July 18
 
 
- From last week:
- MTT License discussion - MTT needs to be de-GPL-ified.
- All go try the python. - All the GPL is in the perl modules (using python works around that).
 
 - Last week Brian had an interesting proposal to remove all of the perl out, or the python out?
 - Schedule - Would like resolution by end of July.
- What does this look like? Run our MTT python client.
 
 
 
- Will discuss this in a separate call 2nd week in July.
 - Two Options:
- Keep going on our current path, and taking updates to ORTE, etc.
 - Shuffle our code a bit (new ompi_rte framework merged with orte_pmix framework, moved down and renamed)
- Opal used to be single process abstraction, but not as true anymore.
 - API of foo, looks pretty much like PMIx API.
- Still have PMIx v2.0, PMI2 or other components (all retooled for new framework to use PMIx)
 
 - to call just call opal_foo.spawn(), etc then you get whatever component is underneath.
 - what about mpirun? Well, PRTE comes in, it's the server side of the PMIx stuff.
 - Could use their prun and wrap in a new mpirun wrapper
 - PRTE doesn't just replace ORTE.   PRTE and OMPI layer don't really interact with each other, they both call the same OPAL layer (which contains PMIx, and other OPAL stuff).
- prun has a lam-boot looking approach.
 
 - Build system about opal, etc. Code shuffling, retooling of components.
 - We want to leverage the work the PMIx community is doing correctly.
 
 
 - If we do this, we still need people to do runtime work over in PRTE.
- In some ways it might be harder to get resources from management for yet another project.
 - Nice to have a componentized interface, without moving runtime to a 3rd party project.
 - Need to think about it.
 
 - Concerns with the work of adding ORTE PMIx integration.
 - Want to know the state of SLURM PMIx Plugin with PMIx v3.x
- It should build, and work with v3. They only implemented about 5 interfaces, and they haven't changed.
 
 - A few related to OMPIx project, talking about how much to contribute to this effort.
- How to factor in requirements of OSHMEM (who use our runtimes), and already doing things to adapt.
 - Would be nice to support both groups with a straight forward component to handle both of these.
 
 - Thinking about how much effort this will be, and how to manage these tasks in a timely manner.
 - Testing, will need to discuss how to best test all of this.
 - ACTION:  Let's go off and reflect and discuss at next week's Web-Ex.
- We aren't going to do this before v4.0 branches in mid-July.
 - Need to be thinking about the Schedule, action items, and owners.
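The second option above (callers use `opal_foo.spawn()` and get whatever component is underneath) is a priority-based component selection. The following Python sketch only illustrates that pattern; the real frameworks are written in C, and every class name, priority value, and return string here is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import List


class RuntimeComponent(ABC):
    """Interface every runtime component implements (hypothetical)."""
    name = "base"
    priority = 0

    @abstractmethod
    def available(self) -> bool: ...

    @abstractmethod
    def spawn(self, command: List[str]) -> str: ...


class PMIxComponent(RuntimeComponent):
    name = "pmix"
    priority = 50

    def __init__(self, present: bool = True):
        self._present = present

    def available(self) -> bool:
        return self._present

    def spawn(self, command: List[str]) -> str:
        return "pmix spawn: " + " ".join(command)


class PMI2Component(RuntimeComponent):
    name = "pmi2"
    priority = 20

    def available(self) -> bool:
        return True

    def spawn(self, command: List[str]) -> str:
        return "pmi2 spawn: " + " ".join(command)


def select(components: List[RuntimeComponent]) -> RuntimeComponent:
    """Pick the highest-priority component that reports itself available."""
    usable = [c for c in components if c.available()]
    if not usable:
        raise RuntimeError("no runtime component available")
    return max(usable, key=lambda c: c.priority)
```

The caller only ever sees the selected object, for example `select([PMIxComponent(), PMI2Component()]).spawn(["a.out"])`, which is what lets the layer above stay unaware of which runtime is actually underneath.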
 
 
Review Master Pull Requests
- PR for setting VERSION on master. Have we broken any VERSIONs?
 
Review Master MTT testing
- 
Annual committer by July 18 -
- AMD, ARM, Ructers, Mellanox, nVidia - Please go do Annual Committer Reviews.
 
 - 
Hope to have better Cisco MTT in a week or two
- Peter is going through, and he found a few failures, which some have been posted.
- one-sided - nathan's looking at.
 - some more coming.
 
 - osc_pt2pt will exclude itself in an MT run.
- One of Cisco's MTT runs uses an env setting to turn every MPI_Init into MPI_Init_thread (even though it is a single-threaded run).
- Now that osc_pt2pt is ineligible, many tests fail.
 - on Master, this will fix itself 'soon'
 - BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
 - Probably an issue on v3.x also.
 
 - Did this for release branches, Nathan's not sure if on Master. - v4.0.x has RMA capable vader. Once
 
 - 
OSHMEM v1.4 - cleanup work
- How do we look for test coverage of this? Right now just basic API tests.
 
 - 
Next Face to Face?
- When? Early fall Septemberish.
 - Where? San Jose - Cisco, Albuquerque - Sandia
 - Geoff will Doodle this.
 
 
- Mellanox, Sandia, Intel
 - LANL, Houston, IBM, Fujitsu
 - Amazon,
 - Cisco, ORNL, UTK, NVIDIA