OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 #425

Merged
merged 23 commits into from
Feb 8, 2021
Changes from 2 commits
Commits
23 commits
1ce2427
OpenCL-BE/LIBSMM: verbose output and documentation. Improved auto-tun…
hfp Feb 4, 2021
69a0e85
Fixed Makefile used to build acc_bench_trans/acc_bench_smm with CUDA …
hfp Feb 4, 2021
3bf854c
Disabled ACC_OPENCL_THREADLOCAL_CONTEXT since DBCSR calls init/finali…
hfp Feb 4, 2021
f14eee4
Updated LIBXSMM prior to v1.17.
hfp Feb 4, 2021
9598017
Attempt to runtime-test OpenCL BE/LIBSMM.
hfp Feb 4, 2021
6721baa
Reduced console output to potentially improve runtime of (CI-)tests.
hfp Feb 4, 2021
ee12e07
Increased timeout from 15m to 20m.
hfp Feb 4, 2021
8baf7ae
Fetch all commits before referring to some SHA.
hfp Feb 4, 2021
2b9335f
Revert "Attempt to runtime-test OpenCL BE/LIBSMM."
hfp Feb 4, 2021
20f9d25
Revised enabling ACC_OPENCL_THREADLOCAL_CONTEXT.
hfp Feb 5, 2021
a26e779
Repeated note about combining auto-tuned parameters for SP and DP in …
hfp Feb 5, 2021
cb91474
Only print device name if the device changed (and avoid duplicated ve…
hfp Feb 5, 2021
9221d1b
Removed tabs from source file (minor/unrelated change).
hfp Feb 5, 2021
8ac6f1d
More prefixes in follow-up of #419 (c_dbcsr_).
hfp Feb 5, 2021
b58a37b
Supply platform when forming context.
hfp Feb 5, 2021
6f3c910
Code cleanup.
hfp Feb 5, 2021
b5cb129
Try to avoid MPS issue (temporarily) testing with only one rank. Sync…
hfp Feb 5, 2021
367a117
Enabled OpenCL based runtime tests.
hfp Feb 5, 2021
6c9f84c
Fixed CI-scripts.
hfp Feb 5, 2021
9ff03d9
Fixed another variable which was left unbound (CI-script).
hfp Feb 5, 2021
d17f03f
Incorporated #428.
hfp Feb 8, 2021
1265b26
Merge branch 'develop' of https://github.com/cp2k/dbcsr into oclverbose
hfp Feb 8, 2021
2e57682
Warn about potentially exclusive device-mode.
hfp Feb 8, 2021
16 changes: 11 additions & 5 deletions src/acc/acc_bench_smm.c
@@ -106,15 +106,18 @@ int main(int argc, char* argv[])
printf("%s%s%i %i %i %i %i %i %i %i\n", 0 < argc ? argv[0] : "", 0 < argc ? " " : "",
nrepeat, stack_size, m, n, k, nc, na, nb);
CHECK(c_dbcsr_acc_init(), &result);
/* note: libsmm_acc_init() may imply acc_init() */
CHECK(libsmm_acc_init(), &result);
CHECK(c_dbcsr_acc_get_ndevices(&ndevices), &result);
if (0 < ndevices) {
#if defined(_DEBUG)
fprintf(stderr, "number of devices found: %i\n", ndevices);
#endif
}
else {
#if defined(_DEBUG)
fprintf(stderr, "Error: no device found!\n");
fprintf(stderr, "No ACC-device found!\n");
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
return result;
@@ -165,14 +168,14 @@ int main(int argc, char* argv[])
CHECK(libsmm_acc_transpose(trans_dev, 0/*offset*/, nb, bmat_dev,
DBCSR_TYPE(ELEM_TYPE), n, k, MAX_KERNEL_DIM, stream), &result);
}
#if defined(USE_LIBXSMM)
# if defined(USE_LIBXSMM)
CHECK(c_dbcsr_acc_stream_sync(stream), &result);
start = libxsmm_timer_tick();
#endif
# endif
/* to perform NN-SMMs on the device, all B-matrices are transposed upfront (SMM-kernel is limited to NT) */
CHECK(libsmm_acc_transpose(trans_dev, 0/*offset*/, nb, bmat_dev,
DBCSR_TYPE(ELEM_TYPE), k, n, MAX_KERNEL_DIM, stream), &result);
#if defined(USE_LIBXSMM)
# if defined(USE_LIBXSMM)
CHECK(c_dbcsr_acc_stream_sync(stream), &result);
transpose = libxsmm_timer_duration(start, libxsmm_timer_tick());
# endif
@@ -282,6 +285,9 @@ int main(int argc, char* argv[])
CHECK(c_dbcsr_acc_dev_mem_deallocate(bmat_dev), NULL);
CHECK(c_dbcsr_acc_dev_mem_deallocate(cmat_dev), NULL);
CHECK(c_dbcsr_acc_stream_destroy(stream), NULL);
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
if (EXIT_SUCCESS != result) {
fprintf(stderr, "FAILED\n");
10 changes: 8 additions & 2 deletions src/acc/acc_bench_trans.c
@@ -91,15 +91,18 @@ int main(int argc, char* argv[])
assert(m <= (mn / n) && 0 == (mn % n));
printf("%s%s%i %i %i %i\n", 0 < argc ? argv[0] : "", 0 < argc ? " " : "", nrepeat, stack_size, m, n);
CHECK(c_dbcsr_acc_init(), &result);
/* note: libsmm_acc_init() may imply acc_init() */
CHECK(libsmm_acc_init(), &result);
CHECK(c_dbcsr_acc_get_ndevices(&ndevices), &result);
if (0 < ndevices) {
#if defined(_DEBUG)
fprintf(stderr, "number of devices found: %i\n", ndevices);
#endif
}
else {
#if defined(_DEBUG)
fprintf(stderr, "Error: no device found!\n");
fprintf(stderr, "No ACC-device found!\n");
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
return result;
@@ -210,6 +213,9 @@ int main(int argc, char* argv[])
CHECK(c_dbcsr_acc_dev_mem_deallocate(stack_dev), NULL);
CHECK(c_dbcsr_acc_dev_mem_deallocate(mat_dev), NULL);
CHECK(c_dbcsr_acc_stream_destroy(stream), NULL);
#if !defined(__CUDA)
CHECK(libsmm_acc_finalize(), NULL);
#endif
CHECK(c_dbcsr_acc_finalize(), NULL);
if (EXIT_SUCCESS != result) {
fprintf(stderr, "FAILED\n");
9 changes: 3 additions & 6 deletions src/acc/cuda/Makefile
@@ -1,6 +1,6 @@
INCACC := $(wildcard *.h*) ../acc.h
SRCACC := $(wildcard *.cpp)
OBJACC := $(SRCACC:.cpp=.o) acc_cublas.o
OBJACC := $(SRCACC:.cpp=.o)

GPUSMM := $(wildcard ../libsmm_acc/kernels/*.h*)
INCSMM := $(wildcard ../libsmm_acc/*.h*) ../acc_libsmm.h \
@@ -130,10 +130,7 @@ test: ../dbcsr_acc_test
../libsmm_acc/smm_acc_kernels.h: $(GPUSMM) Makefile ../libsmm_acc/generate_kernels.py ../libsmm_acc/parameters/parameters_$(WITH_GPU).json
@cd ../libsmm_acc && $(PYTHON) ../libsmm_acc/generate_kernels.py ../libsmm_acc/kernels

acc_cublas.o: acc_cublas.cu Makefile
$(NVCC) $(addprefix -Xcompiler $(NULL),$(CXXFLAGS)) -c $< -o $@

../dbcsr_acc.a: $(OBJACC) acc_cublas.o ../libsmm_acc/libsmm_acc_init.o
../dbcsr_acc.a: $(OBJACC) ../libsmm_acc/libsmm_acc_init.o
$(AR) -rs $@ $^

../dbcsr_acc_smm.a: $(OBJSMM)
@@ -153,7 +150,7 @@ acc_bench_trans.o: ../acc_bench_trans.c Makefile
$(CXX) $^ $(LDFLAGS) -o $@

dbcsr_acc_test.o: ../../../tests/dbcsr_acc_test.c Makefile
$(CC) $(CFLAGS) -c $< -o $@
$(CC) $(CFLAGS) -I../.. -c $< -o $@
../dbcsr_acc_test: dbcsr_acc_test.o ../dbcsr_acc_smm.a ../dbcsr_acc.a
$(CXX) $^ $(LDFLAGS) -o $@

5 changes: 4 additions & 1 deletion src/acc/opencl/README.md
@@ -8,7 +8,7 @@ The OpenCL backend implements the [ACC interface](https://github.com/cp2k/dbcsr/

### Compile-time Settings

Compile-time settings are (implicitly) documented and can be adjusted by editing [acc_opencl.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/acc_opencl.h) (adjusting the build-line via `-D` is possible as well but less convenient). For example, `ACC_OPENCL_STREAM_PRIORITIES` is enabled by default (and further confirmed at runtime/build-time) but can be disabled, or `ACC_OPENCL_VERBOSE` (which is disabled by default) can be enabled for debugging purposes. More sensitive/private compile-time settings may be available within particular translation units like in `acc_opencl_mem.c`.
Compile-time settings are (implicitly) documented and can be adjusted by editing [acc_opencl.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/acc_opencl.h) (adjusting the build-line via `-D` is possible as well but less convenient). For example, `ACC_OPENCL_STREAM_PRIORITIES` is enabled by default (and further confirmed at runtime/build-time) but can be disabled, or `ACC_OPENCL_DEBUG` (which is disabled by default) can be enabled for debugging purposes. More sensitive/private compile-time settings may be available within particular translation units like in `acc_opencl_mem.c`.

An application of compile-time settings (and perhaps a valuable contribution) might be to call a GPU library in OpenCL-based LIBSMM. In such a case, Shared Virtual Memory (SVM) support in OpenCL comes in handy and can be enabled via `ACC_OPENCL_SVM`. The latter then allows simply taking the raw pointer out of a `cl_mem` object and passing it into such a library/function (which in turn can work across language borders, etc.).

@@ -19,6 +19,9 @@ Runtime settings are made by the means of environment variables (implemented in
* `ACC_OPENCL_VENDOR`: character string matching the vendor of the OpenCL device in a case-insensitive fashion, e.g., "intel".
* `ACC_OPENCL_DEVTYPE`: character string matching the device-kind like "cpu", "gpu", or another kind if neither CPU nor GPU.
* `ACC_OPENCL_DEVICE`: non-negative integer number to select a device from the (internally enumerated) list of devices.
* `ACC_OPENCL_VERBOSE`: verbosity level (integer).
* `ACC_OPENCL_VERBOSE=1`: outputs (stderr) the number of devices found and the name of the selected device.
* `ACC_OPENCL_VERBOSE=2`: outputs (stderr) the duration needed to generate a requested kernel.
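
As a minimal sketch of how such a level can be read, the following C helpers mirror the `getenv`/`atoi` pattern visible in `c_dbcsr_acc_init` in this PR. The function names `verbosity_from_string` and `verbosity_from_env` are hypothetical and not part of DBCSR.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of DBCSR): converts the raw value of
 * ACC_OPENCL_VERBOSE into a level; an unset variable (NULL) means silent. */
int verbosity_from_string(const char* value)
{
  return (NULL == value ? 0 : atoi(value));
}

/* Hypothetical helper: reads the level from the environment,
 * e.g., when the program is started with ACC_OPENCL_VERBOSE=2. */
int verbosity_from_env(void)
{
  return verbosity_from_string(getenv("ACC_OPENCL_VERBOSE"));
}
```

Because `atoi` yields zero for an empty or non-numeric string, such values leave the output silent in this sketch.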

The OpenCL backend enumerates and orders devices primarily by device-kind (GPU, CPU, and others in that order) and by memory capacity (secondary criterion). Device IDs are zero-based as per ACC interface (and less than what is permitted/returned by `acc_get_ndevices`).
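
The ordering described above can be sketched as a `qsort` comparator. This is an illustration only: the record type `device_info_t`, the enumeration, and both function names are hypothetical, and preferring the larger memory capacity within the same device-kind is an assumption (the text only names capacity as the secondary criterion).

```c
#include <stdlib.h>

/* Hypothetical device-kind, ordered as documented: GPU, CPU, others. */
typedef enum { KIND_GPU = 0, KIND_CPU = 1, KIND_OTHER = 2 } device_kind_t;

/* Hypothetical record; the real backend queries OpenCL for such properties. */
typedef struct device_info_t {
  device_kind_t kind;
  unsigned long long mem_bytes;
} device_info_t;

/* Comparator: device-kind first, memory capacity second
 * (assumption: larger capacity is enumerated first). */
int compare_devices(const void* a, const void* b)
{
  const device_info_t *const x = (const device_info_t*)a;
  const device_info_t *const y = (const device_info_t*)b;
  if (x->kind != y->kind) return (x->kind < y->kind ? -1 : 1);
  if (x->mem_bytes != y->mem_bytes) return (x->mem_bytes > y->mem_bytes ? -1 : 1);
  return 0;
}

/* Sorts the list in place; the resulting index plays the role of the
 * zero-based device ID. */
void sort_devices(device_info_t* devices, size_t ndevices)
{
  qsort(devices, ndevices, sizeof(*devices), compare_devices);
}
```

With such an ordering, device ID zero refers to a GPU whenever at least one GPU is present.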

35 changes: 27 additions & 8 deletions src/acc/opencl/acc_opencl.c
@@ -131,7 +131,11 @@ int c_dbcsr_acc_init(void)
{
#if defined(_OPENMP)
/* initialization/finalization is not meant to be thread-safe */
int result = (0 == omp_in_parallel() ? EXIT_SUCCESS : EXIT_FAILURE);
int result = ((0 == omp_in_parallel()
# if /*WORKAROUND*/defined(__DBCSR_ACC)
|| 0/*master*/ == omp_get_thread_num()
# endif
) ? EXIT_SUCCESS : EXIT_FAILURE);
#else
int result = EXIT_SUCCESS;
#endif
@@ -177,7 +181,6 @@ int c_dbcsr_acc_init(void)
if (device_id < acc_opencl_ndevices) {
if (NULL != env_device_vendor && '\0' != *env_device_vendor) {
for (i = 0; i < (cl_uint)acc_opencl_ndevices;) {
buffer[0] = '\0';
if (CL_SUCCESS == clGetDeviceInfo(acc_opencl_devices[i],
CL_DEVICE_VENDOR, ACC_OPENCL_BUFFERSIZE, buffer, NULL))
{
@@ -216,7 +219,9 @@ int c_dbcsr_acc_init(void)
}
}
if (EXIT_SUCCESS == result) {
const char *const env_verbose = getenv("ACC_OPENCL_VERBOSE");
cl_device_id active_device;
acc_opencl_options.verbosity = (NULL == env_verbose ? 0 : atoi(env_verbose));
result = c_dbcsr_acc_opencl_set_active_device(device_id, &active_device);
#if defined(_OPENMP) && defined(ACC_OPENCL_THREADLOCAL_CONTEXT)
if (EXIT_SUCCESS == result) {
@@ -284,7 +289,11 @@ int c_dbcsr_acc_finalize(void)
{
#if defined(_OPENMP)
/* initialization/finalization is not meant to be thread-safe */
int result = (0 == omp_in_parallel() ? EXIT_SUCCESS : EXIT_FAILURE);
int result = ((0 == omp_in_parallel()
# if /*WORKAROUND*/defined(__DBCSR_ACC)
|| 0/*master*/ == omp_get_thread_num()
# endif
) ? EXIT_SUCCESS : EXIT_FAILURE);
#else
int result = EXIT_SUCCESS;
#endif
@@ -325,7 +334,6 @@ void c_dbcsr_acc_clear_errors(void)
int c_dbcsr_acc_get_ndevices(int* ndevices)
{
int result;

#if defined(__DBCSR_ACC)
/* DBCSR calls acc_get_ndevices before calling acc_init(). */
result = c_dbcsr_acc_init();
@@ -375,7 +383,6 @@ int c_dbcsr_acc_opencl_device_vendor(cl_device_id device, const char* vendor)
char buffer[ACC_OPENCL_BUFFERSIZE];
int result = EXIT_SUCCESS;
assert(NULL != device && NULL != vendor);
buffer[0] = '\0';
ACC_OPENCL_CHECK(clGetDeviceInfo(device,
CL_DEVICE_VENDOR, ACC_OPENCL_BUFFERSIZE, buffer, NULL),
"retrieve device vendor", result);
@@ -477,8 +484,20 @@ int c_dbcsr_acc_opencl_set_active_device(int device_id, cl_device_id* device)
ACC_OPENCL_CHECK(result, "create context", result);
}
}
if (NULL != device) {
*device = (EXIT_SUCCESS == result ? active_id : NULL);
if (EXIT_SUCCESS == result) {
if (NULL != device) *device = active_id;
if (0 != acc_opencl_options.verbosity) {
char buffer[ACC_OPENCL_BUFFERSIZE];
if (CL_SUCCESS == clGetDeviceInfo(active_id,
CL_DEVICE_NAME, ACC_OPENCL_BUFFERSIZE, buffer, NULL))
{
fprintf(stderr, "INFO ACC/OpenCL: ndevices=%i device%i=\"%s\"\n",
acc_opencl_ndevices, device_id, buffer);
}
}
}
else {
if (NULL != device) *device = NULL;
}
}
ACC_OPENCL_RETURN(result);
@@ -546,7 +565,7 @@ int c_dbcsr_acc_opencl_wgsize(cl_device_id device, cl_kernel kernel,
int c_dbcsr_acc_opencl_kernel(const char* source, const char* build_options,
const char* kernel_name, cl_kernel* kernel)
{
char buffer[ACC_OPENCL_BUFFERSIZE] = "\0";
char buffer[ACC_OPENCL_BUFFERSIZE] = "";
cl_int result;
assert(NULL != kernel);
if (NULL != acc_opencl_context) {
9 changes: 6 additions & 3 deletions src/acc/opencl/acc_opencl.h
@@ -86,8 +86,8 @@
#if !defined(ACC_OPENCL_MEM_ASYNC) && 1
# define ACC_OPENCL_MEM_ASYNC
#endif
#if !defined(ACC_OPENCL_VERBOSE) && 0
# define ACC_OPENCL_VERBOSE
#if !defined(ACC_OPENCL_DEBUG) && 0
# define ACC_OPENCL_DEBUG
#endif
#if !defined(ACC_OPENCL_SVM) && 0
# if defined(CL_VERSION_2_0)
@@ -189,9 +189,12 @@ extern "C" {

/** Settings depending on OpenCL vendor or standard level (discovered/setup in acc_init). */
typedef struct acc_opencl_options_t {
/** Asynchronous memory operations may crash for some OpenCL implementations. */
/** Asynchronous memory operations (may crash for some OpenCL implementations). */
cl_bool async_memops;
/** Runtime SVM support (needs ACC_OPENCL_SVM at compile-time). */
cl_bool svm_interop;
/** Runtime verbosity (output on stderr). */
cl_int verbosity;
} acc_opencl_options_t;

extern acc_opencl_options_t acc_opencl_options;
4 changes: 2 additions & 2 deletions src/acc/opencl/acc_opencl_event.c
@@ -107,7 +107,7 @@ int c_dbcsr_acc_event_query(void* event, acc_bool_t* has_occurred)
}
assert(NULL != has_occurred);
*has_occurred = (CL_COMPLETE == status || 0 > status);
#if defined(ACC_OPENCL_VERBOSE) && defined(_DEBUG)
#if defined(ACC_OPENCL_DEBUG) && defined(_DEBUG)
fprintf(stderr, "c_dbcsr_acc_event_query(%p, %i)\n", event, *has_occurred);
#endif
ACC_OPENCL_RETURN(result);
@@ -118,7 +118,7 @@ int c_dbcsr_acc_event_synchronize(void* event)
{ /* Waits on the host-side. */
int result = EXIT_SUCCESS;
assert(NULL != event);
#if defined(ACC_OPENCL_VERBOSE) && defined(_DEBUG)
#if defined(ACC_OPENCL_DEBUG) && defined(_DEBUG)
fprintf(stderr, "c_dbcsr_acc_event_synchronize(%p)\n", event);
#endif
ACC_OPENCL_CHECK(clWaitForEvents(1, ACC_OPENCL_EVENT(event)),
4 changes: 2 additions & 2 deletions src/acc/opencl/acc_opencl_stream.c
@@ -165,7 +165,7 @@ int c_dbcsr_acc_stream_sync(void* stream)
{ /* Blocks the host-thread. */
int result = EXIT_SUCCESS;
assert(NULL != stream);
#if defined(ACC_OPENCL_VERBOSE) && defined(_DEBUG)
#if defined(ACC_OPENCL_DEBUG) && defined(_DEBUG)
fprintf(stderr, "c_dbcsr_acc_stream_sync(%p)\n", stream);
#endif
ACC_OPENCL_CHECK(clFinish(*ACC_OPENCL_STREAM(stream)),
@@ -178,7 +178,7 @@ int c_dbcsr_acc_stream_wait_event(void* stream, void* event)
{ /* Wait for an event (device-side). */
int result = EXIT_SUCCESS;
assert(NULL != stream && NULL != event);
#if defined(ACC_OPENCL_VERBOSE) && defined(_DEBUG)
#if defined(ACC_OPENCL_DEBUG) && defined(_DEBUG)
fprintf(stderr, "c_dbcsr_acc_stream_wait_event(%p, %p)\n", stream, event);
#endif
#if defined(ACC_OPENCL_STREAM_SYNCFLUSH)
19 changes: 14 additions & 5 deletions src/acc/opencl/smm/README.md
@@ -14,7 +14,16 @@ The `OPENCL_LIBSMM_DEBUG` compile-time setting enables side-by-side validation o

### Runtime Settings

Runtime settings are made by means of environment variables (implemented in `opencl_libsmm.c`). There are two categories (for the two major functions): matrix transpose (`OPENCL_LIBSMM_TRANS_*`) and matrix multiplication (`OPENCL_LIBSMM_SMM_*`). For transposing matrices:
Runtime settings are made by means of environment variables (implemented in `opencl_libsmm.c`). There are two categories (for the two major functions): matrix transpose (`OPENCL_LIBSMM_TRANS_*`) and matrix multiplication (`OPENCL_LIBSMM_SMM_*`). Common settings are (see OpenCL backend documentation for more details):

* `ACC_OPENCL_VENDOR`: character string matching the vendor of the OpenCL device in a case-insensitive fashion, e.g., "intel".
* `ACC_OPENCL_DEVTYPE`: character string matching the device-kind like "cpu", "gpu", or another kind if neither CPU nor GPU.
* `ACC_OPENCL_DEVICE`: non-negative integer number to select a device from the (internally enumerated) list of devices.
* `ACC_OPENCL_VERBOSE`: verbosity level (integer).
* `ACC_OPENCL_VERBOSE=1`: outputs (stderr) the number of devices found and the name of the selected device.
* `ACC_OPENCL_VERBOSE=2`: outputs (stderr) the duration needed to generate a requested kernel.
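
The two levels can be pictured as thresholds on a single integer. In the sketch below, the test for device information follows the `0 != verbosity` check visible in `c_dbcsr_acc_opencl_set_active_device` in this PR, whereas the `2 <= verbosity` threshold, the message text, and all function names are assumptions made for illustration.

```c
#include <stdio.h>
#include <time.h>

/* Any non-zero level prints the device information (as in this PR). */
int verbose_device_info(int verbosity) { return 0 != verbosity; }

/* Assumption: kernel timings are printed at level 2 and above. */
int verbose_kernel_time(int verbosity) { return 2 <= verbosity; }

/* Hypothetical stand-in for generating (compiling) a kernel. */
void dummy_generate(void) {}

/* Calls the given generator and returns the elapsed time in seconds;
 * note that clock() measures processor time rather than wall-clock time. */
double timed_generate(void (*generate)(void), int verbosity)
{
  const clock_t start = clock();
  double seconds;
  generate();
  seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
  if (verbose_kernel_time(verbosity)) { /* illustrative message format */
    fprintf(stderr, "INFO: kernel generated in %.3f s\n", seconds);
  }
  return seconds;
}
```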

For transposing matrices:

* `OPENCL_LIBSMM_TRANS_BUILDOPTS`: character string with build options (compile and link) supplied to the OpenCL runtime compiler.
* `OPENCL_LIBSMM_TRANS_INPLACE`: Boolean value (zero or non-zero integer) for inplace matrix transpose not relying on local memory.
@@ -28,13 +37,13 @@ For multiplying matrices:
* `OPENCL_LIBSMM_SMM_BLOCK_M`: non-negative integer number (less/equal than the M-extent) denoting the blocksize in M-direction.
* `OPENCL_LIBSMM_SMM_BLOCK_N`: non-negative integer number (less/equal than the N-extent) denoting the blocksize in N-direction.

**NOTE**: above runtime settings may be non-smooth in the sense of enabling a distinct code-path depending on a specific value, e.g., `OPENCL_LIBSMM_SMM_BATCHSIZE=1`.
**NOTE**: LIBSMM's tunable runtime settings may be non-smooth in the sense of enabling a distinct code-path depending on a specific value, e.g., `OPENCL_LIBSMM_SMM_BATCHSIZE=1` vs. `OPENCL_LIBSMM_SMM_BATCHSIZE=2`.
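
To illustrate what "non-smooth" can mean for a tuner, the following sketch selects a code path from the batch size and bounds a block size by the matrix extent. Everything here is hypothetical (names, the two code paths, and the fallback to the full extent for out-of-range values); it does not reproduce the logic of `opencl_libsmm.c`.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the integer value of a setting or a default
 * if the setting is unset (NULL) or empty. */
int setting_or_default(const char* value, int default_value)
{
  return ((NULL == value || '\0' == *value) ? default_value : atoi(value));
}

/* Assumption: a block size outside of [1, extent] falls back to the extent. */
int block_or_extent(int requested, int extent)
{
  return ((0 < requested && requested <= extent) ? requested : extent);
}

/* Hypothetical code paths: a batch size of one selects a distinct variant,
 * i.e., the measured cost may jump between a value of 1 and 2. */
typedef enum { PATH_SINGLE, PATH_BATCHED } smm_path_t;

smm_path_t select_path(int batchsize)
{
  return (1 == batchsize ? PATH_SINGLE : PATH_BATCHED);
}
```

A search strategy that assumes neighboring parameter values perform similarly can be misled by such a jump, which is one reason to explore the parameter space rather than to interpolate.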

## Auto Tuning

Auto tuning code for performance is a practical way to find the "best" setting for parameterized code (e.g., GPU kernels). Introducing effective parameters is a prerequisite, and exploring the (potentially) high-dimensional parameter space in an efficient way is an art. It is desirable to have reasonable defaults even without auto-tuning the parameters. It would be even better to avoid auto-tuning if best performance was possible right away, i.e., if auto-tuning is not able to find better settings.

For the OpenCL based LIBSMM, `OPENCL_LIBSMM_SMM_BATCHSIZE`, `OPENCL_LIBSMM_SMM_BLOCK_M`, and `OPENCL_LIBSMM_SMM_BLOCK_N` are explored using [OpenTuner](http://opentuner.org/). The script [tune_multiply.py](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/smm/tune_multiply.py) leverages for instance the [acc_bench_smm](index.html) benchmark by parsing console output (timing, data type, etc.). This way, the tuning is implemented without being intermingled with the subject being tuned. To build the benchmarks:
For the OpenCL based LIBSMM, `OPENCL_LIBSMM_SMM_BATCHSIZE`, `OPENCL_LIBSMM_SMM_BLOCK_M`, and `OPENCL_LIBSMM_SMM_BLOCK_N` are explored using [OpenTuner](http://opentuner.org/). The script [tune_multiply.py](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/smm/tune_multiply.py) leverages for instance the `acc_bench_smm` benchmark by parsing console output (timing, data type, etc.). This way, the tuning is implemented without being intermingled with the subject being tuned. To build the benchmarks:

```bash
cd src/acc/opencl
@@ -66,7 +75,7 @@ The OpenTuner script implements multiple objectives ("cost"), primarily "accurac
[ 67s] INFO opentuner.search.plugin.DisplayPlugin: tests=53, best {'BS': 48, 'BM': 8, 'BN': 1}, cost accuracy=32.20000000, size=1.0, found by UniformGreedyMutation
```

The script finally writes a JSON-file with a filename like `tune_multiply-float-12x12x12-60gflops.json` which encodes the benchmark (multiply), the precision (float), the kernel (12x12x12), and the achieved performance (60gflops). Tuning starts from an internal default that is supposed to match LIBSMM's internal default parameters. However, tuning can be (re-)started with specific parameters (e.g., `-bs 64`, `-bm 13`, `-bn 1` for `OPENCL_LIBSMM_SMM_BATCHSIZE`, `OPENCL_LIBSMM_SMM_BLOCK_M`, and `OPENCL_LIBSMM_SMM_BLOCK_N` respectively).
The script finally writes a JSON-file with a filename like `tune_multiply-float-12x12x12-60gflops.json` which encodes the benchmark (multiply), the precision (float), the kernel (12x12x12), and the achieved performance (60gflops). The script handles SIGINT (like Ctrl-C), and output is still written despite not terminating normally (this can be used to tune interactively). Tuning starts from an internal default that is supposed to match LIBSMM's internal default parameters. However, tuning can be (re-)started with specific parameters (e.g., `-bs 64`, `-bm 13`, `-bn 1` for `OPENCL_LIBSMM_SMM_BATCHSIZE`, `OPENCL_LIBSMM_SMM_BLOCK_M`, and `OPENCL_LIBSMM_SMM_BLOCK_N` respectively).
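
The naming scheme of the JSON-file can be illustrated in C as follows. The actual file is written by the Python script `tune_multiply.py`; the function `tuned_filename` is hypothetical, and rounding the performance to a whole number is an assumption.

```c
#include <stdio.h>

/* Hypothetical helper: formats a filename which encodes the benchmark
 * ("multiply"), the precision, the kernel shape (MxNxK), and the achieved
 * performance. Returns the number of characters as per snprintf. */
int tuned_filename(char* buffer, size_t size, const char* precision,
  int m, int n, int k, double gflops)
{
  return snprintf(buffer, size, "tune_multiply-%s-%ix%ix%i-%.0fgflops.json",
    precision, m, n, k, gflops);
}
```

For example, calling `tuned_filename(buffer, sizeof(buffer), "float", 12, 12, 12, 60.0)` reproduces the filename mentioned above.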

## Optimized Kernels

@@ -114,4 +123,4 @@ cd src/acc/opencl/smm
./tune_multiply.sh 300 8 1 4 10 15, 6 7 8, 23
```

The script `tune_multiply.sh` is tuning 1444 kernels by default (`./acc_bench_smm 300 8 1` taking approximately 15 hours per part).
The script `tune_multiply.sh` is tuning 1444 kernels by default (`./acc_bench_smm 300 8 1` taking approximately 15 hours per part). If the process is interrupted earlier (per SIGINT or Ctrl-C), the execution terminates for all requested kernels (triplet specification) unless an environment variable `CONTINUE=1` is set (proceeds to the next kernel).