Important note: while the implementation is expected to be high-performant on all Ada/Ampere/Volta/Turing devices, it was specifically fine-tuned for and tested on NVIDIA RTX 3090 (GA102 chip - RTX 3080, A10, A40, A6000).
Avoid using WSL for performance measurements. To ensure accurate and reliable results, please use a native Linux environment.
To benchmark the code, specify compute capability of your CUDA device and run benchmark.sh
. For example, on RTX 3090:
bash benchmark.sh 86
The benchmark settings such as minimum/maximum matrix sizes, step size, number of warm-up iterations etc. can be adjusted in the benchmark.sh
file.
To visualize benchmark results, please install matplotlib
and run
python plot_benchmark_data.py benchmark_results
Use test.sh
to test the implementation for correctness. For example, on RTX 3090:
bash test.sh 86
Test environment:
- OS: Ubuntu 24.04.1 LTS
- GPU: NVIDIA RTX 3090
- Driver Version: 550.120
- CUDA Driver: 12.4, CUDA Runtime: 12.6, V12.6.85
- CMake 3.28.3
- g++ 13.3