Nallani Bhaskar edited this page Mar 18, 2026 · 3 revisions

Frequently Asked Questions

Getting Started

Which GEMM variant should I use?

It depends on your precision needs and hardware:

| Scenario | Recommended variant |
| --- | --- |
| Maximum accuracy | `aocl_gemm_f32f32f32of32` |
| Good accuracy, less memory (Zen4+) | `aocl_gemm_bf16bf16f32of32` |
| Good accuracy (Zen1-3, no AVX512_BF16) | `aocl_gemm_bf16bf16f32of32` (auto-falls back to f32) |
| Quantized inference | `aocl_gemm_u8s8s32os32` or `aocl_gemm_s8s8s32os32` |
| Weight-only quantization | `aocl_gemm_bf16s4f32of32` |
| Half-precision pipeline (Zen5+) | `aocl_gemm_f16f16f16of16` |

See the GEMM Guide for the full data type matrix.

How do I check if my CPU supports AVX512_BF16?

```shell
# Linux
grep -o 'avx512_bf16' /proc/cpuinfo | head -1

# If output is empty, your CPU does not have native BF16 support.
# AOCL-DLP will automatically fall back to f32 kernels.
```

Can I use AOCL-DLP on Intel CPUs?

Yes. AOCL-DLP is optimized for AMD processors but is compatible with any x86_64 CPU that meets the minimum ISA requirements (AVX2 for f32/bf16, AVX512_VNNI for integer). Performance is tuned for AMD microarchitectures, so you may see different performance characteristics on Intel hardware.

Building & Linking

Do I need --whole-archive for static linking?

Yes. When statically linking AOCL-DLP, the --whole-archive flag is required. Without it, the linker may discard constructor functions that initialize internal kernel dispatch tables, leading to silent performance degradation or runtime failures.

```shell
# Correct
gcc -o app main.c -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive -lstdc++ -lm -fopenmp

# Wrong -- may silently break
gcc -o app main.c -laocl-dlp_static -lstdc++ -lm -fopenmp
```

With CMake 3.24+, use $<LINK_LIBRARY:WHOLE_ARCHIVE,...>. See the Integration Guide for full details.
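A sketch of the CMake 3.24+ form. The target name `app` is a placeholder for your own executable, and `aocl-dlp_static` is assumed to be the library target or name your build has already located:

```cmake
# Keep every object from the static archive, including the constructor
# functions that register the internal kernel dispatch tables.
target_link_libraries(app PRIVATE
    "$<LINK_LIBRARY:WHOLE_ARCHIVE,aocl-dlp_static>")
```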

CMake can't find AoclDlp

Set CMAKE_PREFIX_PATH to the AOCL-DLP install location:

```shell
cmake -DCMAKE_PREFIX_PATH=/usr/local ..
```

Library not found at runtime

Set LD_LIBRARY_PATH to include the install directory:

```shell
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
```

Or use rpath during linking. See Integration Guide - Troubleshooting.

Threading

What is the difference between DLP_NUM_THREADS and OMP_NUM_THREADS?

Both control the thread count, but DLP_NUM_THREADS takes precedence over OMP_NUM_THREADS. The full precedence order, from highest to lowest:

  1. API calls (dlp_thread_set_num_threads()) -- highest
  2. DLP_NUM_THREADS -- library-specific, overrides OpenMP
  3. OpenMP API (omp_set_num_threads())
  4. OMP_NUM_THREADS -- OpenMP environment variable
  5. System default -- number of available cores

Use DLP_NUM_THREADS when you want DLP threading independent of your application's OpenMP settings.
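For example (the application binary name is a placeholder):

```shell
# App-wide OpenMP setting: the application's own parallel regions use 16 threads
export OMP_NUM_THREADS=16

# DLP-specific override: AOCL-DLP kernels use 8 threads regardless of OpenMP
export DLP_NUM_THREADS=8

# ./your_application
```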

What are DLP_JC_NT and DLP_IC_NT?

These control 2D thread decomposition for the GEMM loops:

  - DLP_JC_NT -- threads for the outer (JC) loop
  - DLP_IC_NT -- threads for the inner (IC) loop

When both are set, DLP_NUM_THREADS is ignored and total threads = JC_NT * IC_NT.
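For example, a 4x8 decomposition:

```shell
export DLP_JC_NT=4   # 4-way parallelism on the outer (JC) loop
export DLP_IC_NT=8   # 8-way parallelism on the inner (IC) loop

echo $((DLP_JC_NT * DLP_IC_NT))   # total DLP threads: 32
```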

My application uses OpenMP too. Will threads conflict?

Potentially. If both your application and DLP use OpenMP, thread over-subscription can occur. Consider:

  - Set DLP_NUM_THREADS to control DLP independently
  - Avoid calling DLP from within an OpenMP parallel region
  - Use dlp_thread_set_num_threads() to limit DLP threads when needed

See Environment Variables for full threading configuration.

Performance

Why is my BF16 code running slower than expected?

On CPUs without native AVX512_BF16 (Zen1-3, Intel pre-Cooper Lake), AOCL-DLP transparently falls back to f32 kernels. This means:

  - BF16 inputs are converted to f32 before computation
  - Computation runs on f32 kernels
  - Output is converted back to bf16 if needed

This fallback is correct but incurs conversion overhead and uses 2x the memory bandwidth. Check your hardware with `grep avx512_bf16 /proc/cpuinfo`. See Library Overview for details.

How do I get the best performance on multi-socket systems?

Use NUMA-aware thread binding:

```shell
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application
```

See the Performance Guide and Environment Variables for detailed tuning.

Should I reorder my weight matrices?

Yes, if you reuse the same weight matrix across multiple GEMM calls (common in inference). Reordering transforms the matrix into a cache-friendly layout that the GEMM kernel accesses optimally. The overhead of a single reorder call is amortized across many subsequent GEMM calls.

See GEMM Guide - Matrix Reordering.

API Usage

How do I get the library version at runtime?

```c
int major, minor, patch;
dlp_version_query(&major, &minor, &patch);
printf("AOCL-DLP version: %d.%d.%d\n", major, minor, patch);
```

See the version.c example in the examples directory.

How do I check for errors after a GEMM call?

Pass a dlp_metadata_t struct and inspect error_hndl.error_code:

```c
dlp_metadata_t meta = {0};
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N', b, ldb, 'N', 0.0f, c, ldc, &meta);

if (meta.error_hndl.error_code != DLP_CLSC_SUCCESS) {
    printf("Error: %d\n", meta.error_hndl.error_code);
}
```

Error codes are defined in dlp_errors.h. See the GEMM Guide for the full list.
