MatRIS is a truly heterogeneous BLAS+LAPACK library packaged with compute-unit-specific optimized BLAS libraries provided by different vendors. BLAS+LAPACK has long been a core computation component of high performance computing (HPC) and machine learning applications, and chip vendors and researchers have provided hand-optimized BLAS kernels for specific compute units. Although many implementations of the BLAS kernels exist, application developers typically hard-code compute-unit-specific BLAS function calls into their applications, which leaves little choice of selecting the underlying heterogeneous compute unit on the fly. Some state-of-the-art work abstracts the BLAS library APIs with wrappers so that a BLAS library can be selected at compile time, but it lacks run-time selection of the library function for different compute units. In this paper, we provide MatRIS BLAS APIs written with an extended IRIS framework to support vendor-specific library function calls. Our MatRIS BLAS functions call the appropriate compute-unit-specific BLAS library function based on the run-time task mapping onto heterogeneous cores. We demonstrate the proof of concept with example DGEMM/SGEMM BLAS library functions backed by OpenBLAS, cuBLAS, CLBlast, Xilinx BLAS, and hipBLAS implementations, with run-time selection of the compute unit. We also show performance metrics collected by running GEMM on heterogeneous compute units.
MatRIS
|--- src
     |--- algorithms
          |--- matris_algorithm.h
          |--- blas
               |--- matris_algo_gemm.c
               |--- matris_algo_<method>_gemm.c   (if multiple methods exist)
               |--- node
                    |--- matris_node_gemm.c
          |--- lapack
               |--- matris_algo_getrf_no_pivot.c
               |--- node
                    |--- matris_node_getrf.c
          |--- utility
               |--- matris_algo_util_tile2flat.c
               |--- matris_algo_util_flat2tile.c
     |--- kernel
          |--- matris_kernel.h
          |--- blas
               |--- matris_kernel_blas.h
               |--- cuBLAS
                    |--- L3
                         |--- matris_cuda_gemm_kernel.c
               |--- L3
                    |--- matris_gemm_kernel.h
          |--- lapack
               |--- matris_kernel_lapack.h
               |--- cuBLAS
                    |--- L3
                         |--- matris_cuda_getrf_kernel.c
               |--- L3
                    |--- matris_blas_getrf.h
|--- utils
|--- deffe
|--- scripts
|--- tests
     |--- matris_test_getrf.c
     |--- matris_test_gemm.c
void init_matris_blas(int argc, char **argv);   // initialize the MatRIS BLAS runtime
void finalize_matris_blas();                    // release runtime resources
void set_matris_blas_target(int target);        // select the target compute unit at run time
int get_matris_blas_target();                   // query the currently selected target
void matris_core_sgemm(...);                    // single-precision GEMM (arguments elided)
void matris_core_dgemm(...);                    // double-precision GEMM (arguments elided)
Set up the environment for the NVIDIA toolchain
export NVIDIA_PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/21.7
export MATH_LIBS=${NVIDIA_PATH}/math_libs/11.4/targets/x86_64-linux/   # or ${NVIDIA_PATH}/math_libs for the unversioned layout
export PATH=$NVIDIA_PATH/cuda/bin/:$PATH
export LD_LIBRARY_PATH=$NVIDIA_PATH/cuda/lib64/:$MATH_LIBS/lib:$LD_LIBRARY_PATH
Set up the environment for the AMD toolchain
export ROCMLOC=/opt/rocm   # or a versioned install, e.g. /opt/rocm-4.1.0
export HIP_CLANG_PATH=$ROCMLOC/llvm/bin/
export ROCM_PATH=$ROCMLOC
export HIPCC="$ROCMLOC/bin/hipcc"
export HIP_CFLAGS="-std=c++11 -I$ROCMLOC/hip/include -I$ROCMLOC/hsa/include"
export HIP_LDFLAGS="-L$ROCMLOC/hip/lib -lamdhip64 -L$ROCMLOC/hsa/lib -lhsa-runtime64 -L$ROCMLOC/lib64 -L$ROCMLOC/lib -lamd_comgr -lhsakmt -Wl,-rpath=$ROCMLOC/lib,--enable-new-dtags"
export PATH=$ROCMLOC/hip/bin:$ROCMLOC/bin:$ROCMLOC/opencl/bin:$PATH
export LD_LIBRARY_PATH=$ROCMLOC/lib:$ROCMLOC/lib64:$LD_LIBRARY_PATH
The steps below install the vendor-specific BLAS libraries and set the required environment variables
git clone https://github.com/CNugteren/CLBlast.git
cd CLBlast
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$PWD/../install
make -j install
cd ../install
export CLBLAST=$PWD
export LD_LIBRARY_PATH=$CLBLAST/lib:$LD_LIBRARY_PATH
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$PWD/../install -DBUILD_SHARED_LIBS=ON
make -j install
cd ../install
export OPENBLAS=$PWD
export LD_LIBRARY_PATH=$OPENBLAS/lib:$LD_LIBRARY_PATH
git clone https://github.com/ROCmSoftwarePlatform/hipBLAS.git
cd hipBLAS
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$PWD/../install -DBUILD_SHARED_LIBS=ON
make -j install
cd ../install
export HIPBLAS=$PWD
export LD_LIBRARY_PATH=$HIPBLAS/lib:$LD_LIBRARY_PATH
git clone https://github.com/ORNL/iris.git
cd iris
mkdir build
cd build
cmake ../ -DCMAKE_INSTALL_PREFIX=$PWD/../install -DAUTO_FLUSH=ON -DAUTO_PARALLEL=ON
make -j install
cd ..
source install/setup.source
To enable CUDA compute, use '-DUSE_CUDA=ON'. To enable AMD compute, use '-DUSE_HIP=ON'.
mkdir build
cd build
cmake ../ -DCMAKE_INSTALL_PREFIX=$PWD/../install -DUSE_CUDA=ON
make -j install
cd ..
source install/setup.source
Run SGEMM with an input matrix size of 8 (second argument: 0 = CPU target, 1 = GPU target)
cd tests
make
export IRIS_ARCHS=openmp; ./sgemm.x 8 0   # target CPU
export IRIS_ARCHS=cuda; ./sgemm.x 8 1     # target GPU
export IRIS_ARCHS=hip; ./sgemm.x 8 1
export IRIS_ARCHS=opencl; ./sgemm.x 8 1
Run DGEMM with an input matrix size of 8 (second argument: 0 = CPU target, 1 = GPU target)
cd tests
make
export IRIS_ARCHS=openmp; ./dgemm.x 8 0   # target CPU
export IRIS_ARCHS=cuda; ./dgemm.x 8 1     # target GPU
export IRIS_ARCHS=hip; ./dgemm.x 8 1
export IRIS_ARCHS=opencl; ./dgemm.x 8 1
Run the multi-task DGEMM example with input matrix size 8, target CPU or GPU, and number of tiles 2
cd install/bin/
export IRIS_ARCHS=openmp; ./dbig_gemm.x 8 0 2   # matrix size 8, target CPU, 2 tiles per dimension: a 2x2 decomposition where each tile is 4x4
export IRIS_ARCHS=cuda; ./dbig_gemm.x 8 1 2
export IRIS_ARCHS=hip; ./dbig_gemm.x 8 1 2
export IRIS_ARCHS=opencl; ./dbig_gemm.x 8 1 2
Monil, M. A. H., Miniskar, N. R., Teranishi, K., Vetter, J. S., Valero-Lara, P. MatRIS: Multi-level Math Library Abstraction for Heterogeneity and Performance Portability using IRIS Runtime. In Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Networking, Storage, and Analysis, November 2023, pp. 1081-1092.
- Narasinga Rao Miniskar
- Monil Mohammad Alaul Haque
- Pedro Valero-Lara