UCC Virtual F2F Meeting Information
Please fill in the form here
Meeting Notes
Time | Topic | Telecon |
---|---|---|
7:00 - 7:30 AM PT | Kickoff and Opening Remarks (Gilad Shainer) | |
7:30 - 8:15 AM PT | Highlights of UCC API (Review) (Manju) | |
8:15 - 8:30 AM PT | Break | |
8:30 - 9:30 AM PT | Teams API (Manju; All/Discussion) | |
9:30 - 9:45 AM PT | Break | |
9:45 - 11:00 AM PT | Endpoints / Collective Operations (Manju; All/Discussion) | |
- Manjunath Gorentla Venkata
- Alex Margolin
- Sergey Lebedev
- Valentin Petrov
- Rami Nudelman
- Baker, Matthew
- Tony
- Gilad Shainer
- James S Dinan
- Chambreau, Chris
- Gil Bloch
- Dmitry Gladkov
- Arturo
- Pavel Shamis
- Ravi, Naveen
- Raffenetti, Kenneth J.
- Akshay Venkatesh
Initialization
- Have a flexible infrastructure for initialization and selection of library functionality
- Discuss final options during component arch discussion
- UCC config interface to follow UCS config.
- Rename ucc_config to ucc_params to reflect UCX style
Context
- Do we need sync model config on the context create?
  - Yes, for enabling RDMA-based implementations
  - The drawback: might have to create more contexts (sync and non-sync)
    - Yes, might require multiple objects but not necessarily multiple resources
- Explore explicit device abstraction and the ability to express affinity, and propose to the WG
Team Creation
- Need to revisit endpoints (as this seems to be implementation specific) after presentation from Alex
- Can we hide endpoints from the interface and enable an implementation-agnostic way of creating teams?
Collective Operations
- Need to define the mapping of programming model (src, dst) to UCC (src, dst) for cases like MPI broadcast, which has only one set of buffers.
- Is there a need for multiple outstanding persistent collective operations of the same type? No use case yet.
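One possible mapping for the broadcast case can be sketched as follows. This is only an illustration under the assumption that the root contributes a source buffer and every other rank contributes a destination buffer; `BufferPair` and `map_bcast_buffers` are hypothetical names and are not part of any UCC proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BufferPair:
    """Hypothetical (src, dst) pair handed to the collective layer."""
    src: Optional[bytearray]
    dst: Optional[bytearray]


def map_bcast_buffers(buffer: bytearray, rank: int, root: int) -> BufferPair:
    """Map the single MPI-style broadcast buffer onto a (src, dst) pair.

    Assumption: the root only sends, so its buffer becomes the source;
    every other rank only receives, so its buffer becomes the destination.
    """
    if rank == root:
        return BufferPair(src=buffer, dst=None)
    return BufferPair(src=None, dst=buffer)
```

Other mappings (for example, passing the same buffer as both src and dst on every rank) are equally plausible; the notes leave the choice open.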
Time | Topic | Telecon |
---|---|---|
7:00 - 7:45 AM PT | Topology Aware Collectives (Sameh) | |
7:45 - 8:00 AM PT | Break | |
8:00 - 8:45 AM PT | Collectives API - the Reactive alternative (Alex) | |
8:45 - 9:00 AM PT | Break | |
9:00 - 11:00 AM PT | Task and Plan API Discussion | |
- Manjunath Gorentla Venkata
- Richard Graham
- Sameh
- Gil Bloch
- Ravi, Naveen
- Alex Margolin
- Tony
- Raffenetti, Kenneth J.
- Sergey Lebedev
- Rami Nudelman
- Arturo
- James Dinan
- Pavel Shamis
- Geoffroy
- Valentin Petrov
WG to sync with Sameh (IBM) about topology definition as we abstract topology, device, and affinity
- Option 1: Standardize ucc and ucc_mpi interfaces
- Option 2: Standardize only ucc interfaces

Discussion on UCC base, UCC MPI
- For now focus on UCC base and continue the discussion on UCC MPI in the working group
- Option for UCC MPI (driver) - provide as a part of UCC project (example contrib directory)
- (Alex correct this if needed)
Task API is useful (feedback from the WG)
- To be considered for a later version of the API (not the first version)
- It is useful for addressing use cases that include
  - Computation + communication
  - Pipelined protocols
- Provide a use case for bundled collectives
- Propose Task API to the working group
What topology information to abstract and what to pass?
- Capture distance between various processes/threads that forms the team/groups
- Capture distance between context (resource) and devices (GPU/CPU)
- Where to pass this information team creation or init?
- AI for the working group: Propose an API that covers the above requirements
- Endpoint in UCC is member_index in UCG
- Move the endpoint to the team_config structure
- Make endpoint an input
- If no input is provided, the library will create the endpoints and make them available via the get_attrib interface
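The endpoint decision above can be sketched as follows. This is a minimal model, not the UCC interface: `TeamConfig`, `Team`, and the `library_assigned_endpoint` argument are hypothetical stand-ins for the team_config structure, the team object, and whatever value the library would generate internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeamConfig:
    """Hypothetical stand-in for the team_config structure."""
    endpoint: Optional[int] = None  # None means "let the library choose"


class Team:
    """Hypothetical team object; not the actual UCC team type."""

    def __init__(self, config: TeamConfig, library_assigned_endpoint: int):
        # Use the caller's endpoint when one is given; otherwise fall back
        # to the value the library generated for this member.
        if config.endpoint is not None:
            self._endpoint = config.endpoint
        else:
            self._endpoint = library_assigned_endpoint

    def get_attrib(self) -> dict:
        """Return team attributes, including the endpoint in effect."""
        return {"endpoint": self._endpoint}
```

Either way, the caller reads the endpoint back through the same query path, so code written against a user-supplied endpoint also works when the library picks one.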
Time | Topic | Telecon |
---|---|---|
7:00 - 8:00 AM PT | GPUs/DL (NVIDIA/IBM/All) | |
8:00 - 8:45 AM PT | Multirail Discussion (Sergey; All) | |
8:45 - 9:00 AM PT | Break | |
9:00 - 9:30 AM PT | Algorithm Selection Models (All) | |
9:30 - 10:00 AM PT | Memory registration and Global Symmetric Memory (All) | |
10:00 - 11:00 AM PT | Document on differences and plan to converge | |
- Manjunath Gorentla Venkata
- Sameh
- Arturo
- Valentin Petrov
- Devendar Bureddy
- Sergey Lebedev
- Rami Nudelman
- Alex Margolin
- James Dinan
- Sreeram Potluri
- Pavel Shamis
- Raffenetti, Kenneth
- Geoffroy Vallee
- Gil Bloch
Goals
- UCC should support GPU-aware MPI collectives
- UCC should be cognizant of DL/AI requirements and should design interfaces for it
- (participants were in consensus)
Relevant use cases/interfaces besides MPI and OpenSHMEM
- Single process/thread utilizing multiple GPUs
- Aggregate or bundled collectives - the motivation is to reduce the launch overhead.
- A series of collectives launched
- NCCL addresses this with ncclGroupStart/End interfaces
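The bundling idea can be illustrated with a small sketch in which collectives posted inside a group are deferred and launched together when the group closes. The `CollectiveGroup` class and `group()` helper are hypothetical; they model the start/end pattern only and do not reproduce the NCCL or UCC interfaces.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


class CollectiveGroup:
    """Hypothetical bundle that defers collectives until the group closes."""

    def __init__(self) -> None:
        self.pending: List[Callable[[], None]] = []
        self.launches = 0  # how many times work was actually launched

    def post(self, op: Callable[[], None]) -> None:
        """Queue a collective instead of launching it right away."""
        self.pending.append(op)

    def launch(self) -> None:
        """Run every queued collective as one batch (a single launch)."""
        for op in self.pending:
            op()
        self.pending.clear()
        self.launches += 1


@contextmanager
def group() -> Iterator[CollectiveGroup]:
    """Open a bundle on entry and launch it once on normal exit."""
    bundle = CollectiveGroup()
    yield bundle
    bundle.launch()
```

In this model a series of collectives pays the launch cost once per group rather than once per operation, which is the motivation stated above.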
Missing abstractions from the UCC interface proposals
- Memory type: The library should know the type of memory passed to the collective operation.
  - Host memory, device memory
  - Where to abstract this information?
    - Passing this information to the team creation operation should be enough. The user might have to create a team that is specific to a memory type.
    - Passing this information to each invocation is useful, but there is no use case yet.
  - The abstraction should support other accelerators and memory types (CUDA, ROCm, SmartNIC, DRAM, HBM)
- Device abstraction and affinity
  - How do you handle the GPU device context?
    - Can this be abstracted onto the UCC context?
  - How do you handle CUDA streams?
Next steps / Questions
- Design for missing abstractions
- Ping AMD and IBM
- Error handling / Managing asynchronous errors
- More details required
Goal
- UCC should support multirail collectives (participants were in consensus)
Lessons from Sergey’s implementation
- Multirail support can be implemented “easily” if we have basic collectives expressed as components and these components can be composed to implement the UCC API.
- Hierarchical collectives are implemented like this in XCCL
- Missing abstractions
- The team create operation should accept multiple UCC contexts (resources)
- The information about the distance between the contexts (assuming contexts are mapped as one context per HCA)
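The composition idea can be sketched as follows: a message is split into one chunk per rail, and each rail runs a basic collective on its chunk. This is an illustrative model only. It assumes equally fast rails and represents each rail as a plain callable; `split_across_rails` and `multirail_collective` are hypothetical names, not XCCL or UCC functions.

```python
from typing import Callable, List, Tuple


def split_across_rails(nbytes: int, num_rails: int) -> List[Tuple[int, int]]:
    """Return one (offset, length) chunk per rail, covering nbytes exactly.

    Assumption: rails are equally fast, so chunks differ by at most one byte.
    """
    if num_rails <= 0:
        raise ValueError("num_rails must be positive")
    base, extra = divmod(nbytes, num_rails)
    chunks = []
    offset = 0
    for rail in range(num_rails):
        # The first `extra` rails take one additional byte each.
        length = base + (1 if rail < extra else 0)
        chunks.append((offset, length))
        offset += length
    return chunks


def multirail_collective(data: bytes,
                         rails: List[Callable[[bytes], bytes]]) -> bytes:
    """Compose per-rail basic collectives into one multirail collective."""
    chunks = split_across_rails(len(data), len(rails))
    parts = [rail(data[off:off + length])
             for rail, (off, length) in zip(rails, chunks)]
    return b"".join(parts)
```

With distance information between contexts available, the even split could be replaced by a weighted one; that refinement is left out here.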
The topology information is needed for multirail, UCG’s group create operation, and GPU-aware collectives
What topology information is needed?
- Distance between the participants of the team in the team create operation
- Distance between the network resources (HCAs) and the thread invoking the team create operation
- Distance between the GPUs and the thread invoking the team create operation
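The three kinds of distance listed above could be carried in a structure along these lines. This is a sketch under assumptions: distances are modeled as small integers (for example, hop counts), and the `TeamTopology` class, its field names, and the device labels used with it are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TeamTopology:
    """Hypothetical topology record passed to a team create operation."""
    peer_distance: Dict[int, int]  # team member rank -> distance from caller
    hca_distance: Dict[str, int]   # HCA label -> distance from calling thread
    gpu_distance: Dict[str, int]   # GPU label -> distance from calling thread

    def nearest_hca(self) -> str:
        """Return the HCA closest to the calling thread."""
        return min(self.hca_distance, key=self.hca_distance.get)

    def nearest_gpu(self) -> str:
        """Return the GPU closest to the calling thread."""
        return min(self.gpu_distance, key=self.gpu_distance.get)
```

Whether such a record is filled in by UCC itself or by an external library is the open question discussed next in the notes.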
Who should implement it? UCC or an external library?
- Can we pass this information from the external libraries (hwloc, ompi)? If so, how to abstract it?
- Can the library implement it?
- This is an expensive operation and a huge undertaking.
Next steps:
- Prototype interfaces and work with IBM to understand the pitfalls.
- HCOLL model
- libcoll/Intel model
- User query model
- Adaptive model
A common thread for all the models is the selection attributes. The selection attributes can include algorithm type, message range, collective implementation type (XCCL, XUCG, hardware), and more.
- Next steps:
- Define the selection attributes.
- In version 1.0, keep the selection interfaces internal rather than external. Gather experience and then make them public.
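A selection table keyed on these attributes might look like the sketch below. It is illustrative only: `SelectionEntry` and `select` are hypothetical names, the first-match policy is an assumption, and any algorithm names placed in the table are example values rather than agreed attributes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SelectionEntry:
    """Hypothetical selection attributes for one table row."""
    collective: str      # e.g. "allreduce"
    algorithm: str       # algorithm type
    implementation: str  # e.g. "XCCL", "XUCG", "hardware"
    min_bytes: int       # inclusive lower bound of the message range
    max_bytes: int       # exclusive upper bound of the message range


def select(entries: List[SelectionEntry], collective: str,
           nbytes: int) -> Optional[SelectionEntry]:
    """Return the first entry whose collective and message range match."""
    for entry in entries:
        if (entry.collective == collective
                and entry.min_bytes <= nbytes < entry.max_bytes):
            return entry
    return None
```

Each of the four models listed above could populate such a table differently (statically, from a user query, or adaptively at run time) while sharing the same attribute set.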
Time | Topic | Telecon |
---|---|---|
7:00 - 7:45 AM PT | OMPI-X / ADAPT (George Bosilca/Talk) | |
7:45 - 8:00 AM PT | Break | |
8:00 - 9:00 AM PT | Component Architecture (Review for non-WG participants) (Alex/Val/Discussion) | |
9:00 - 9:30 AM PT | Memory registration and symmetric memory API (Manju; All; Discussion) | |
9:30 - 9:45 AM PT | Break | |
9:45 - 10:30 AM PT | Library initialization parameters | |
10:30 - 11:00 AM PT | Documentation / Code Structure | |
- Manjunath Gorentla Venkata
- George
- Arturo
- Valentin Petrov
- Sergey Lebedev
- Rami Nudelman
- Alex Margolin
- Pavel Shamis
- Raffenetti, Kenneth
- Geoffroy Vallee
- Tony
Component Architecture Overview
Abstractions
- Collective layer with multiple collective implementations (XCCL, XUCG, Hardware)
- Basic collective layer with primitive collectives (p2p_collectives, SHARP)
- P2P layer
- Services layer
Resolves:
- It addresses the majority of the component architecture requirements identified in the previous iteration of the component architecture, such as
- Avoiding circular dependencies
- Ability to provide a thin layer over hardware collectives
To address
- Ability to share resources between multiple implementations. For example, sharing p2p (or SHARP) resources between XCCL and XUCG
- Ability to choose multiple collective components (e.g., allreduce from XCCL and a2a from XUCG). Add a selection component that encompasses multiple collective implementations.
- Ability to share and reuse code at the fine-grained level.
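The first two open items can be pictured with a toy model: two collective implementations hold the same p2p resource, and a selection component routes each collective type to one of them. All class names here (`SharedP2P`, `CollectiveComponent`, `SelectionComponent`) are hypothetical and only illustrate the shape of the problem, not the XCCL or XUCG code.

```python
from typing import Dict, List


class SharedP2P:
    """Hypothetical p2p resource shared by several implementations."""

    def __init__(self) -> None:
        self.users: List[str] = []  # names of components attached to it


class CollectiveComponent:
    """Hypothetical collective implementation built on a p2p resource."""

    def __init__(self, name: str, p2p: SharedP2P) -> None:
        self.name = name
        self.p2p = p2p
        p2p.users.append(name)  # record that this component uses the resource

    def run(self, collective: str) -> str:
        return f"{collective} via {self.name}"


class SelectionComponent:
    """Routes each collective type to the component chosen for it."""

    def __init__(self, routes: Dict[str, CollectiveComponent]) -> None:
        self.routes = routes

    def run(self, collective: str) -> str:
        return self.routes[collective].run(collective)
```

In this model the selection component is the only object the caller sees, so swapping the implementation behind one collective does not affect the others.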
Next Steps
- Develop fine-grained component architecture for XCCL and XUCG
- Identify the components that can be shared
- Identify a way to share resources between different implementations
Time | Topic | Telecon |
---|---|---|
7:00 - 8:30 AM PT | Flesh out the component architecture | |
8:30 - 8:45 AM PT | | |
8:45 - 10:30 AM PT | Review and flesh out the spec document | |
10:30 - 11:00 AM PT | Next Steps | |
(Laundry List)
- Kickoff (Gilad)
- Highlights of UCC API (Review for non-WG participants) (Manju)
- OMPI-X / ADAPT (George Bosilca/Talk)
- Requirements from the AI Users/Deep Learning/GPUs (NVIDIA; All)
- API Discussion (in case not completed in WG)
- Library Initialization
- Resource Abstraction (Contexts)
- Teams API (Manju; All/Discussion)
- Endpoints (Manju; All/Discussion)
- Collective Operations (Manju; All/Discussion)
- Task API (Manju; All/Discussion)
- Alternative Control-path API (Initialization and communicator creation) (Alex; All/Discussion)
- Alternative Data-path API (Starting and progressing collectives) (Alex; All/Discussion)
- Component Architecture (Review for non-WG participants)(Alex/Val/Discussion)
- Flesh out UCC.H Header (All)
- Unit tests and CI infrastructure (?)
- Documentation (doxygen ?)(?)
- Multirail Support (Sergey)
- Topology-aware collectives (Sameh/Talk)
- Memory registration (Discussion)
- Algorithm selection (Discussion)