Skip to content

Build and run ROCm UCX OpenMPI

Pak Lui edited this page Oct 10, 2022 · 4 revisions

Preparation

Device

The LargeBar feature must be enabled on the AMD GPU, like the AMD Radeon Instinct (MI series) of the GPU cards:

The physical address of all system memory and PCI exported GPU device memory must be under (1<<44) for GFX9 GPUs like Vega10 (MI25) and Vega20 (MI50 or MI60). This can be set in the system BIOS according to this page:

Sanity Check for Large BAR setting

You can use the code below as a sanity check for BAR setting. If the Large BAR is not enabled, running this example would cause a segmentation fault. The setting is usually enabled in the BIOS. If it is not set up correctly, segmentation fault in ROCm UCT would show up in UCX (or when UCX is used in Open MPI).

#include <stdio.h>
#include "hip/hip_runtime.h"

int main(int argc, char ** argv) {
  int * buf;
  hipMalloc((void**)& buf, 100);
  printf("address buf %p \n", buf);
  printf("Buf[0] = %d\n", buf[0]);
  buf[0] = 1;
  printf("Buf[0] = %d\n", buf[0]);
  return 0;
}

Compile the check_large_bar.c example:

$ /opt/rocm/bin/hipcc $(/opt/rocm/bin/hipconfig --cpp_config) -L/opt/rocm/lib/ -lamdhip64 check_large_bar.c -o check_large_bar 

For example, output on a system with LargeBar enabled (no issue):

$ ./check_large_bar
address buf 0x14642b400000
Buf[0] = -1094795586
Buf[0] = 1

For example, output on a system with BIOS that is not set up for LargeBar, you would get a segmentation fault:

$ ./check_large_bar
address buf 0x7efa41c00000
Segmentation fault

BIOS settings

To resolve the BAR memory issue mentioned above, you should check the following BIOS settings.

  • For typical scenario relates to BAR, please visit BAR Memory Overview in the ROCm documentation
  • In the System BIOS, look for the "PCIe/PCI/PnP Configuration"
    • For example on the Supermicro server (SYS-4029GP-TRT2), you should look into the following:
Advanced->PCIe/PCI/PnP configuration-> Above 4G Decoding = Enabled
Advanced->PCIe/PCI/PnP Configuration->MMIOH Base = 512G
Advanced->PCIe/PCI/PnP Configuration->MMIO High Size = 256G

For descriptions to all relevant BIOS settings for AMD EPYC Rome CPUs:

IOMMU and AMD I/O Virtualization Technology

There are occasion where we notice system hang when IOMMU is enabled. If such case happens, you can try to following workarounds:

  • There is a BIOS option should be called the “AMD I/O Virtualization Technology”, you can set the option to disable to disable IOMMU. You may get a prompt that shows disabling AMD I/O Virtualization Technology will disable both SMT and set x2apic to auto, which is fine.

  • The above BIOS option is more preferred, but if you do have a use case that requires IOMMU to be enabled, the alternate approach would be the use the boot params:

    iommu=pt amd_iommu=on

Append the above to GRUB_CMDLINE_LINUX in /etc/default/grub and "update-grub2" to boot with the params, and use "cat /proc/cmdline" to verify the params were being used to boot.

Install ROCm

Install instruction is here: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation_new.html Note: For multi-node with InfiniBand, please install MLNX_OFED first, reboot, then install ROCm next.

Build

Install Pre-requisites for Open MPI on Ubuntu

sudo apt install git m4 autoconf automake libtool flex

Set environment variable

export INSTALL_DIR=/path/to/install
export UCX_DIR=$INSTALL_DIR/ucx
export OMPI_DIR=$INSTALL_DIR/ompi

UCX

git clone https://github.com/openucx/ucx.git
cd ucx
# git checkout v1.10.x # optional to use v1.10.x branch
./autogen.sh
mkdir build
cd build
../contrib/configure-opt --prefix=$UCX_DIR --with-rocm=/opt/rocm --without-knem --without-cuda --enable-gtest --enable-examples 
make
make install

Open MPI

git clone --recursive -b v4.1.x https://github.com/open-mpi/ompi.git
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --without-verbs
make
make install

ROCm enabled OSU

The OSU Micro Benchmarks v5.7 (12/11/2020) (link) added to evaluate the performance of various primitives with AMD GPU device and ROCm support. This functionality is exposed when configured with --enable-rocm option. We can use the following steps to compile OMB:


wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.7.tar.gz
tar xvf osu-micro-benchmarks-5.7.tar.gz
mv osu-micro-benchmarks-5.7 osu
cd osu
./configure --enable-rocm --with-rocm=/opt/rocm CC=$OMPI_DIR/bin/mpicc CXX=$OMPI_DIR/bin/mpicxx LDFLAGS="-L$OMPI_DIR/lib/ -lmpi -L/opt/rocm/lib/ $(hipconfig -C) -lamdhip64" CPPFLAGS="-std=c++11"

Run

Run Intranode

Use UCX_RNDV_THRESH environment variable to lower the rendezvous threshold from the 8KB default to 128B to improve the midrange performance.

cd osu
$OMPI_DIR/bin/mpirun -np 2 --mca btl '^openib' -x UCX_TLS=sm,self,rocm_copy,rocm_ipc --mca pml ucx -x UCX_RNDV_THRESH=128 osu_bw -d rocm D D

For GPUs that are connected via PCIe (and not XGMI), use the UCX_RNDV_PIPELINE_SEND_THRESH and UCX_RNDV_FRAG_SIZE environment variables to improve the large messages performance.

Note: UCX v1.12 introduced some new rendezvous protocols that may cause some issues for PCIe (non-XGMI) systems.

  • Setting this UCX rendezvous scheme to put_zcopy with this environment variable to mpirun with -x UCX_RNDV_SCHEME=put_zcopy could avoid such issue.
  • The UCX_RNDV_FRAG_SIZE option was changed from UCX_RNDV_FRAG_SIZE=4m to UCX_RNDV_FRAG_SIZE=rocm:4m in UCX v1.12.
cd osu
$OMPI_DIR/bin/mpirun -np 2 -mca btl '^openib' --mca pml ucx -x UCX_RNDV_SCHEME=put_zcopy -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m mpi/pt2pt/osu_bw -d rocm D D

osu_bw example for GPUs to communicate over XGMI:

The environment variable HSA_ENABLE_SDMA=0 is used for XGMI interconnect to show the effective transfer bandwidth for inter-die data transfer between GPU device 2 and 3 (same MI250/MI250X OAM). For messages larger than 64MB, an effective utilization of about 150GB/s is achieved, which corresponds to 75% of the peak transfer bandwidth of 200GB/s for that connection. See GPU-Enabled MPI in ROCm docs for reference.

mpirun -np 2 \
--mca osc ucx -mca pml ucx -x UCX_TLS=sm,self,rocm_copy,rocm_ipc \
-x HSA_ENABLE_SDMA=0 -x UCX_RNDV_THRESH=256k -x LD_LIBRARY_PATH \
-x ROCR_VISIBLE_DEVICES=0,1 \
osu_bw -m 2:268435456 -d rocm D D
# OSU MPI-ROCM Bandwidth Test v6.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
2                       0.97
4                       1.96
8                       3.93
16                      7.92
32                     15.78
64                     26.90
128                    61.11
256                    92.35
512                   160.11
1024                  167.99
2048                  217.86
4096                  240.78
8192                  317.69
16384                1472.15
32768                2824.13
65536                5545.13
131072              10417.34
262144              19003.88
524288              31896.64
1048576             51322.15
2097152             75204.86
4194304            114297.43
8388608            129588.08
16777216           140224.55
33554432           142330.08
67108864           149904.54
134217728          151245.19
268435456          151106.14

osu_bw example for GPUs to communicate over PCIe (non-XGMI):

/opt/mpi/ompi/bin/mpirun -np 2 -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m -x UCX_RNDV_THRESH=128 --mca osc ucx --mca spml ucx -x LD_LIBRARY_PATH -x UCX_LOG_LEVEL=TRACE_DATA --allow-run-as-root -mca pml ucx -x UCX_TLS=sm,self,rocm_copy,rocm_ipc osu/mpi/pt2pt/osu_bw -d rocm D D
# OSU MPI-ROCM Bandwidth Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.74
2                       1.49
4                       2.98
8                       5.96
16                      9.41
32                     10.14
64                     11.98
128                    14.03
256                    28.42
512                    58.70
1024                  117.59
2048                  234.75
4096                  463.15
8192                  866.08
16384                1575.19
32768                3130.02
65536                3625.22
131072               4729.94
262144              15873.00
524288              20361.94
1048576             22413.33
2097152             25365.81
4194304             25456.85

osu_latency example for GPUs to communicate over PCIe (non-XGMI):

/opt/mpi/ompi/bin/mpirun -np 2 -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m -x UCX_RNDV_THRESH=128 --mca osc ucx --mca spml ucx -x LD_LIBRARY_PATH -x UCX_LOG_LEVEL=TRACE_DATA --allow-run-as-root -mca pml ucx -x UCX_TLS=sm,self,rocm_copy,rocm_ipc osu/mpi/pt2pt/osu_latency -d rocm D D
# OSU MPI-ROCM Latency Test v5.7
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       0.19
1                       1.76
2                       1.75
4                       1.76
8                       1.78
16                      2.18
32                      3.65
64                      5.82
128                     8.41
256                     8.34
512                     8.09
1024                    8.69
2048                    8.11
4096                    8.17
8192                    8.69
16384                   8.54
32768                   9.80
65536                  17.43
131072                 24.93
262144                 16.02
524288                 25.19
1048576                56.21
2097152                97.90
4194304               173.49

Run Internode

Use the UCX_RNDV_PIPELINE_SEND_THRESH and UCX_RNDV_FRAG_SIZE environment variables to improve the large messages performance between 2 nodes.

cd osu
$OMPI_DIR/bin/mpirun -np 2 -host <host1 ip>,<host2 ip> --mca btl '^openib' --mca pml ucx -x UCX_RNDV_PIPELINE_SEND_THRESH=256k -x UCX_RNDV_FRAG_SIZE=4m mpi/pt2pt/osu_bw -d rocm D D

Troubleshooting

If you see OSU hang with error message:

WARNING: There was an error initializing an OpenFabrics device.
rc_iface.c:778  UCX  ERROR error modifying QP to RTR: Invalid argument

you are running on GID-based multi-host setup, try again with UCX_IB_ADDR_TYPE=ib_global.

BIOS options

To troubleshoot performance issue, you may also want to check the system BIOS settings.

For the descriptions of the BIOS settings, they can be found in the "Chapter 3.4 HPC and Telco Settings" on pages 23-25 of the Workload Tuning Guide for AMD EPYC™ 7002 Series Processor Based Servers

The recommended options for AMD EPYC Rome CPU with the Instinct GPUs are shown below:

PCIe Settings Options
Advanced > PCIe > Above 4G Decoding Enable
SMT Settings Options
Advanced > AMD CBS > CPU Common Options > Performance > CCD/Core/Thread Enablement Accept
Advanced > AMD CBS > CPU Common Options > Performance > CCD/Core/Thread Enablement > SMT Control Disable
Global Core C-States Options
Advanced > AMD CBS > CPU Common Options > Global C-state Control Auto
NUMA Nodes (NPS) Options
Advanced > AMD CBS > DF Common Options > Memory Addressing > NUMA nodes per socket NPS4
Memory Interleaving Options
Advanced > AMD CBS > DF Common Options > Memory Addressing > Memory Interleaving Auto
IOMMU Options
Advanced > AMD CBS > NBIO Common Options > IOMMU Disabled
PCIe Ten Bit Tag Support Options
Advanced > AMD CBS > NBIO Common Options > PCIe Ten Bit Tag Support Enable
Determinism Slider Options
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Determinism Control Manual
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Determinism Slider Power
Set cTDP Options
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > cTDP Control Manual
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > cTDP 240 (EPYC 7742)
Package Power Limit Options
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Package Power Limit Control Manual
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Package Power Limit 240 (EPYC 7742)
AMD CPU xGMI Options
Set AMD CPU xGMI width to 16 bits and speed to 18 Gbps if CPU xGMI is supported by Chassis design
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > xGMI Link Width Control Manual
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > xGMI Force Link Width 2
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > xGMI Force Link Width Control Force
Advanced > AMD CBS > DF Common Options > Link > 4-Link xGMI Max Speed 18Gbps
Advanced > AMD CBS > DF Common Options > Link > 3-Link xGMI Max Speed 18Gbps
APBDIS Options
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > APBDIS 1
DF C-States Options
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > DF Cstates Auto
Fixed SoC P State Options
Advanced > AMD CBS > NBIO Common Options > SMU Common Options > Fixed SOC Pstate P0
Memory Speed Options
Set to max Memory Speed, if using 3200MHz DIMMs 1600MHz
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration Accept
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration > Overclock Enabled
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration > Memory Clock Speed 1600MHz (for DDR4 3200MHz)
RAM Power Down Options
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Controller Configuration > DRAM Power Options > Power Down Enable Disabled
SME Options
Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > Security > TSME Disabled
Clone this wiki locally