UCF 2023 Schedule
Date | Time | Topic | Speaker/Moderator |
---|---|---|---|
12/5 | 09:00-09:15 | **Opening Remarks and UCF**: Unified Communication Framework (UCF) is a collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications. In this talk we will present recent advances in the development of UCF projects, including Open UCX and Apache Spark UCX, as well as incubation projects in SmartNIC programming, benchmarking, and other areas of accelerated compute. | **Gilad Shainer, NVIDIA**: Gilad Shainer serves as senior vice-president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and the InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortiums, a member of IBTA, and a contributor to the PCISIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds an MSc degree and a BSc degree in Electrical Engineering from the Technion Institute of Technology in Israel. |
12/5 | 09:15-10:00 | **Recent Advances in UCX for AMD GPUs**: This talk will focus on recent developments in UCX to support AMD GPUs and the ROCm software stack. The presentation will go over some of the most relevant enhancements to the ROCm components in UCX over the last year, including: 1) enhancements to the zero-copy functionality of the uct/rocm-copy component to improve device-to-host and host-to-device transfers: by allowing the zero-copy operations to run asynchronously, device-to-host and host-to-device transfers can overlap the various stages of the rendezvous protocol, leading to performance improvements of up to 30% in our measurements; 2) support for dma-buf based memory registration for ROCm devices: the Linux kernel's dma-buf mechanism is a portable way to share device buffers across multiple devices by creating a dma-buf handle at the source and importing the handle at the consumer side. ROCm 5.6 introduced the runtime functionality to export a user-space dma-buf handle of GPU device memory, and support has been added to the ROCm memory domain in UCX starting from release 1.15.0; 3) updates required to support the ROCm versions released during the year (ROCm 5.4, 5.5, 5.6). The presentation will also include some details on the ongoing work to take advantage of new interfaces, available starting from ROCm 5.7, that make it possible to explicitly control and select which DMA engine(s) to use for inter-process device-to-device data transfer operations. | Edgar Gabriel, AMD |
12/5 | 10:00-11:00 | **UCX Backend for Realm: Design, Benefits, and Feature Gaps**: Realm is a fully asynchronous low-level runtime for heterogeneous distributed memory machines. It is a key part of the LLR software stack (Legate/Legion/Realm), which provides the foundation for constructing a composable software ecosystem that can transparently scale to multi-GPU and multinode systems. Realm enables users to express a parallel application in terms of an explicit and dynamic task graph that is managed by the runtime system. With direct access to the task graph, Realm takes responsibility for all synchronization and scheduling, which not only removes the burden from programmers’ shoulders, but can also yield higher performance as well as better performance portability. Being a dynamic, explicit task graph management system, Realm must manage the task graph as it is generated online at runtime. Thus, it is essential to lower the runtime system overheads, including the additional cost of communication in distributed memory machines. In this talk, we will present the design and implementation of the UCX network module for Realm. More specifically, we describe how we implement Realm’s active message API over UCX and discuss the advantages of using UCX compared to the existing GASNet-EX backend. We also point out the challenges and feature gaps that we face in implementing Realm’s network module over UCX, the workarounds we use to avoid them, and potential ways to address them within UCX itself. | Hessam Mirsadeghi, NVIDIA; Akshay Venkatesh, NVIDIA; Jim Dinan, NVIDIA; Sreeram Potluri, NVIDIA; Nishank Chandawala, NVIDIA |
12/5 | 11:00-11:30 | Lunch | |
12/5 | 11:30-12:15 | **Use In-Chip Memory for RDMA Operations**: Some modern RDMA devices contain fast on-chip memory that can be accessed without crossing the PCI bus. This hardware capability can be leveraged to improve the performance of SHMEM atomic operations at large scale. In this work, we extend the UCX API to support allocating on-chip memory, use it in the OpenSHMEM layer to implement shmem_malloc_with_hints, and demonstrate performance improvement in existing atomic benchmarks. | Roie Danino, NVIDIA |
12/5 | 12:15-13:00 | **Low-Latency MPI RMA: Implementation and Challenges**: Many applications rely on active and local completion semantics, such as point-to-point or active MPI-RMA. The point-to-point approach has been heavily optimized over time in most MPI distributions. However, the MPI-RMA semantics have been overlooked over the past few years and therefore suffer from inefficiencies in current implementations, especially for GPU-to-GPU communication. In this talk, we will present our current effort toward low-latency MPI-RMA semantics. While MPI-RMA offers low latency, especially when considering local completion, it can easily be overwhelmed by the cost of synchronization and notification. In this work, we will investigate different strategies for a local completion mechanism similar to the existing PSCW flavor. We will first detail those strategies and their implementations. Then, we will present our results comparing the different approaches and identify gaps in the interface that could be addressed as part of the MPI-5 standard. | Thomas Gillis, ANL; Ken Raffenetti, ANL; Yanfei Guo, ANL |
12/5 | 13:00-13:45 | **UCX Protocols for NVIDIA Grace Hopper**: NVIDIA Grace Hopper provides developer productivity features such as hardware-accelerated CPU-GPU memory coherence and the ability to perform inter-process communication across OS instances over NVLINK in the presence of NVSwitches. Developers can pass the same malloc memory to GPU kernels and communication routines on the Grace Hopper platform, but because pages belonging to this memory can migrate between CPU and GPU, there are performance tradeoffs associated with communication cost based on memory allocation choices for multi-GPU workloads. Also, inter-process communication across OS instances over NVLINK is available for specific memory types. In this talk, we will discuss: 1. the roadmap for protocol choices that the UCX communication library can make on the Grace Hopper platform to take advantage of features such as on-demand paging (ODP), multinode NVLINK, page residence queries, and other techniques; 2. expected performance from using different communication paths at the UCT level; 3. potential options for UCP protocols v2 in selecting communication paths that the UCT layer will expose; 4. how protocol choice affects execution time and subsequent potential page migrations; 5. how the application layer can help the communication layer by using memory binding, hints with the allocation API, and more to avoid common overheads. | Akshay Venkatesh, NVIDIA; Hessam Mirsadeghi, NVIDIA; Jim Dinan, NVIDIA; Nishank Chandawala, NVIDIA; Sreeram Potluri, NVIDIA |
12/5 | 13:45-14:00 | Adjourn | |
12/6 | 09:00-09:15 | Day 2 Open and Recap | **Pavel Shamis (Pasha), NVIDIA**: Pavel Shamis is a Principal Research Engineer at Arm. His work is focused on co-design of software and hardware building blocks for high-performance interconnect technologies, development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development of multiple projects in high-performance communication domains, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, OpenMPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI. |
12/6 | 09:15-10:00 | **An Implementation of LCI Backend Using UCX**: The Lightweight Communication Interface (LCI) is a communication library and research tool aiming for efficient support of multithreaded, irregular communication. LCI improves the performance of multithreaded communication over classical MPI by supporting a flexible API, implementing specifically designed concurrent data structures, and using atomic operations and fine-grained nonblocking locks. Prior to this project, LCI ran on two network backends, libfabric and libibverbs. In this project, we develop a new UCX backend for LCI to investigate how efficiently UCX can support the LCI mechanisms. The implementation is based on UCP layer operations, including tagged send, receive, and RDMA. Key aspects of the implementation include out-of-band initialization of UCP workers using the process management interface (PMI), memory pre-registration for send/receive and RDMA buffers to enable faster performance, callback functions that signal the completion of operations, and a single completion queue managed by a dedicated progress thread to reduce the need for locks in a multithreaded environment. We will further discuss several design decisions we made when developing the UCX backend and compare the performance of the UCX backend with the other two LCI network backends using a set of microbenchmarks and mini-apps. | Weixuan Zheng, University of Illinois; Jiakun Yan, University of Illinois; Omri Mor, University of Illinois; Marc Snir, University of Illinois |
12/6 | 10:00-11:00 | **Spark Shuffle Offload on DPU**: This work aims to improve the performance of the shuffle procedure on Spark clusters by using a DPU, NVMe storage, and UCX to offload the data transfer burden from the CPU. Spark shuffle offload is a service that runs on the DPU and responds to fetch-block requests from shuffle clients without context switches and without consuming cycles on the host CPU. The shuffle data is stored on an NVMe storage device that is accessed directly from the DPU. | Ofir Farjon, NVIDIA; Mikhail Brinskii, NVIDIA; Artemy Kovalyov, NVIDIA; Leonid Genkin, NVIDIA |
12/6 | 11:00-11:30 | Lunch | |
12/6 | 11:30-12:00 | **Cross-GVMI UMR Mkey Pool**: This talk will present modifications made to UCX to be able to use the XGVMI capabilities for BlueField offloaded support. | Yong Qin, NVIDIA |
12/6 | 12:00-12:30 | **Enhanced Deferment for Aggregation Contexts in OpenSHMEM**: OpenSHMEM is a one-sided communication PGAS API that excels in low-latency RDMA operations but can be sensitive to the small-message rates generated by the irregular access patterns many applications employ. Recent work has introduced a novel solution to this problem: a message aggregation strategy that automatically defers small messages and sends them later in bulk batches. These aggregation contexts were shown to allow applications with small and irregular access patterns to dramatically improve network performance while maintaining their algorithmic simplicity; however, the implications of their use in large-scale settings have yet to be explored. | Aaron Welch; Oscar Hernandez, ORNL; Steve Poole, LANL |
12/6 | 12:30-13:15 | **Wire Compatibility in UCX**: Wire compatibility is an important concept that ensures peers running different releases of UCX can communicate seamlessly. It matters when performing software upgrades across clusters with many nodes, and for frameworks where different components run different software stacks (e.g. management and compute nodes in high-performance environment clusters). In this presentation I'll describe the current state of wire-compatibility support in UCX, the latest updates we made in UCX version 1.16 (including new tests in Continuous Integration), and further plans. | Mikhail Brinskii, NVIDIA |
12/6 | 13:15-14:00 | **Dynamic Transport Selection**: This work aims to achieve a better balance between scalability and performance by dynamically allocating transport resources to connections. Optimal efficiency is reached by detecting traffic-intensive endpoints according to usage statistics collected at runtime and providing them with high-performance transports. For example, in RDMA networks, the RC transport is used for the highest-priority connections, while the remaining ones use DC. | Shachar Hasson, NVIDIA |
12/6 | 14:00 | Adjourn | |
12/7 | 09:00-09:15 | Day 3 Open and Recap | Gilad Shainer, NVIDIA |
12/7 | 09:15-10:15 | **Symmetric Remote Key with UCX**: In partitioned global address space (PGAS) applications, all ranks create the same mirrored memory segments. Those memory segments are all registered for remote memory access. One rank can then use any remote memory key of a remote rank to read or write the corresponding remote segment. In the general PGAS case, every rank needs to hold the remote keys for every segment of every other rank. Considering that remote keys can consume more than 100 bytes, this can lead to significant memory consumption at scale (6GB per node, for 128 ppn, 2 segments and 1k nodes). The symmetric remote key functionality achieves full memory saving by allowing the systematic reuse of remote keys. In this presentation we'll cover a method to use common keys for different remote segments and an implementation for the OpenSHMEM case. | Thomas Vegas, NVIDIA; Artem Polyakov, NVIDIA |
12/7 | 10:15-11:00 | **Extending OpenSHMEM for Rust-on-RISC-V, Python, Nim, and HPX**: The OpenSHMEM specification and its implementations are living documents. Expanding adoption of OpenSHMEM drives requirements, features, and further development of the specification. Expanding adoption of OpenSHMEM to support new and different user communities, applications, and workloads increases the benefits that can be derived from the Partitioned Global Address Space (PGAS) model, the OpenSHMEM specification, and the wider HPC community. Increasing adoption of OpenSHMEM requires a continuing effort to demonstrate the technology's adaptability and benefits with respect to existing software solutions, to increase the breadth of programming languages that can use OpenSHMEM, and to bring OpenSHMEM to new hardware architectures. This presentation will provide a look at the successful bring-up of the Rusty2 OpenSHMEM bindings, using the OSSS-UCX OpenSHMEM implementation, on two RISC-V hardware environments. The presentation will provide an overview of how OpenSHMEM was successfully integrated into both Exaloop's Codon LLVM-based Python compiler and the Nim programming language. Finally, the presentation will show how OpenSHMEM was successfully integrated into the STELLAR Group's asynchronous many-task runtime system, HPX. | Christopher Taylor, Tactical Computing Laboratories |
12/7 | 11:00-11:30 | Lunch | |
12/7 | 11:30-12:15 | **Offloading Tag Matching Logic to the Data Path Accelerator (DPA)**: MPI is the de facto standard for parallel programming in distributed memory systems. Tag matching is a fundamental concept in MPI, where senders transmit messages with specific tags, allowing their peers to selectively receive messages based on these tags. Various approaches are known for optimizing the matching process, including tag caching, message queue management algorithms, and hardware offloading, among others. In this presentation, we will introduce another optimization approach based on offloading the matching logic to the Data Path Accelerator (DPA). DPA is an NVIDIA® BlueField® embedded subsystem designed to accelerate workloads that require high-performance access to NIC engines in specific packet and I/O processing tasks. Unlike other programmable embedded technologies, such as FPGAs, DPA offers a high degree of programmability using the C programming model, multi-process support, toolchains such as compilers and debuggers, SDKs, dynamic application loading, and management. During the presentation, we will cover the basics of the DPA infrastructure, give hints for writing DPA applications and frameworks, discuss some tag matching optimization techniques, and present preliminary results obtained on a cluster equipped with NVIDIA® BlueField®-3 DPUs. | Salvatore Di Girolamo, NVIDIA; Mikhail Brinskii, NVIDIA |
12/7 | 12:15-13:00 | **UCX 2023: Latest and Greatest**: In this talk we will present, at a high level, the major new features and improvements, including new APIs, protocols v2 and GPU memory support, new RDMA offloads, system topology detection, and more. | Yossi Itigin, NVIDIA |
12/7 | 13:00-13:05 | Adjourn | |
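
The memory figure quoted in the Symmetric Remote Key abstract (12/7, 09:15-10:15) can be checked with a short calculation. The sketch below is not taken from the talk: the helper name and the 192-byte key size are assumptions (the abstract only says keys can consume more than 100 bytes), and "1k nodes" is read as 1024. It assumes every process stores one key per segment for every rank in the job.

```python
def rkey_bytes_per_node(nodes, ppn, segments, key_bytes):
    """Bytes of remote keys held on one node (hypothetical helper).

    Assumes each process keeps one key per segment for every rank.
    The rank's own keys are counted too, which is negligible at scale.
    """
    total_ranks = nodes * ppn
    keys_per_process = total_ranks * segments
    return keys_per_process * key_bytes * ppn


GIB = 2 ** 30

# Assumed key size of 192 bytes; the abstract only says "more than 100 bytes".
assumed = rkey_bytes_per_node(nodes=1024, ppn=128, segments=2, key_bytes=192)
# Lower bound using exactly 100 bytes per key.
lower = rkey_bytes_per_node(nodes=1024, ppn=128, segments=2, key_bytes=100)

print(f"{assumed / GIB:.3f} GiB per node at 192 bytes per key")
print(f"{lower / GIB:.3f} GiB per node at 100 bytes per key")
```

Under these assumptions, 100-byte keys already cost about 3 GiB per node, and the quoted 6 GB corresponds to keys of roughly 192 bytes each.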