FAQ
Which GEMM variant to use depends on your precision needs and hardware:
| Scenario | Recommended variant |
|---|---|
| Maximum accuracy | aocl_gemm_f32f32f32of32 |
| Good accuracy, less memory (Zen4+) | aocl_gemm_bf16bf16f32of32 |
| Good accuracy (Zen1-3, no AVX512_BF16) | aocl_gemm_bf16bf16f32of32 (auto-falls back to f32) |
| Quantized inference | aocl_gemm_u8s8s32os32 or aocl_gemm_s8s8s32os32 |
| Weight-only quantization | aocl_gemm_bf16s4f32of32 |
| Half-precision pipeline (Zen5+) | aocl_gemm_f16f16f16of16 |

See the GEMM Guide for the full data type matrix.

To check whether your CPU has native BF16 support, look for the avx512_bf16 flag:

# Linux
grep -o 'avx512_bf16' /proc/cpuinfo | head -1
# If output is empty, your CPU does not have native BF16 support.
# AOCL-DLP will automatically fall back to f32 kernels.

AOCL-DLP also runs on non-AMD CPUs. It is optimized for AMD processors but is compatible with any x86_64 CPU that meets the minimum ISA requirements (AVX2 for f32/bf16, AVX512_VNNI for integer). Because performance is tuned for AMD microarchitectures, you may see different performance characteristics on Intel hardware.
Static linking requires a special linker flag. When statically linking AOCL-DLP, the --whole-archive flag is required. Without it, the linker may discard constructor functions that initialize internal kernel dispatch tables, leading to silent performance degradation or runtime failures.
# Correct
gcc -o app main.c -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive -lstdc++ -lm -fopenmp
# Wrong -- may silently break
gcc -o app main.c -laocl-dlp_static -lstdc++ -lm -fopenmp

With CMake 3.24+, use $<LINK_LIBRARY:WHOLE_ARCHIVE,...>. See the Integration Guide for full details.
If CMake cannot locate AOCL-DLP, set CMAKE_PREFIX_PATH to the AOCL-DLP install location:

cmake -DCMAKE_PREFIX_PATH=/usr/local ..

If the shared library cannot be found at run time, set LD_LIBRARY_PATH to include the install directory:

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Alternatively, use rpath during linking. See Integration Guide - Troubleshooting.
DLP_NUM_THREADS and OMP_NUM_THREADS both control the thread count, but DLP_NUM_THREADS takes higher precedence. The full precedence order, from highest to lowest:
- API calls (dlp_thread_set_num_threads()) -- highest
- DLP_NUM_THREADS -- library-specific, overrides OpenMP
- OpenMP API (omp_set_num_threads())
- OMP_NUM_THREADS -- OpenMP environment variable
- System default -- number of available cores
Use DLP_NUM_THREADS when you want DLP threading independent of your application's OpenMP settings.
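
The precedence order above can be sketched as a small lookup that walks the sources from highest to lowest and returns the first one that is set. This is only an illustration of the ordering: `resolve_thread_count` and `parse_positive` are hypothetical helpers, not AOCL-DLP functions, and the sketch assumes each environment variable holds a single positive integer (it ignores list forms such as `OMP_NUM_THREADS=4,2`).

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a positive integer from an environment-style string.
 * Returns 0 when the string is NULL, empty, malformed, or not positive. */
static int parse_positive(const char *text)
{
    if (text == NULL || *text == '\0') {
        return 0;
    }
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > 1000000) {
        return 0;
    }
    return (int)value;
}

/* Hypothetical helper (not part of AOCL-DLP): choose a thread count by
 * walking the documented precedence list from highest to lowest.
 * A value of 0 (or NULL) means "this source was not set". */
static int resolve_thread_count(int dlp_api_threads,  /* dlp_thread_set_num_threads() */
                                const char *dlp_env,  /* DLP_NUM_THREADS */
                                int omp_api_threads,  /* omp_set_num_threads() */
                                const char *omp_env,  /* OMP_NUM_THREADS */
                                int available_cores)  /* system default */
{
    assert(available_cores > 0);

    if (dlp_api_threads > 0) {
        return dlp_api_threads;
    }
    int n = parse_positive(dlp_env);
    if (n > 0) {
        return n;
    }
    if (omp_api_threads > 0) {
        return omp_api_threads;
    }
    n = parse_positive(omp_env);
    if (n > 0) {
        return n;
    }
    return available_cores;
}
```

For example, `resolve_thread_count(0, getenv("DLP_NUM_THREADS"), 0, getenv("OMP_NUM_THREADS"), 64)` models a process that made no API calls on a 64-core machine: with `DLP_NUM_THREADS=8` and `OMP_NUM_THREADS=32` set, the sketch returns 8.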
DLP_JC_NT and DLP_IC_NT control 2D thread decomposition for the GEMM loops:
- DLP_JC_NT -- threads for the outer (JC) loop
- DLP_IC_NT -- threads for the inner (IC) loop

When both are set, DLP_NUM_THREADS is ignored and total threads = DLP_JC_NT * DLP_IC_NT.
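
As a worked example of that rule, the sketch below computes the effective thread count. `total_gemm_threads` is a hypothetical helper, not a library function. The source only describes the case where both variables are set, so falling back to the DLP_NUM_THREADS value when just one of them is set is an assumption of this sketch, not documented behavior.

```c
#include <assert.h>

/* Hypothetical helper (not part of AOCL-DLP).
 * num_threads: value that would otherwise apply (e.g. DLP_NUM_THREADS)
 * jc_nt, ic_nt: DLP_JC_NT and DLP_IC_NT, or 0 when unset */
static int total_gemm_threads(int num_threads, int jc_nt, int ic_nt)
{
    assert(num_threads > 0);

    if (jc_nt > 0 && ic_nt > 0) {
        /* Both set: DLP_NUM_THREADS is ignored, the 2D grid decides. */
        return jc_nt * ic_nt;
    }
    /* ASSUMPTION: with only one (or neither) set, keep num_threads. */
    return num_threads;
}
```

With `DLP_NUM_THREADS=64`, `DLP_JC_NT=4`, and `DLP_IC_NT=8`, the call `total_gemm_threads(64, 4, 8)` yields 32 threads, not 64.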
AOCL-DLP can potentially conflict with your application's own OpenMP usage. If both your application and DLP use OpenMP, thread over-subscription can occur. To avoid it:
- Set DLP_NUM_THREADS to control DLP independently
- Avoid calling DLP from within an OpenMP parallel region
- Use dlp_thread_set_num_threads() to limit DLP threads when needed
See Environment Variables for full threading configuration.
On CPUs without native AVX512_BF16 (Zen1-3, Intel pre-Cooper Lake), AOCL-DLP transparently falls back to f32 kernels. This means:
- BF16 inputs are converted to f32 before computation
- Computation runs on f32 kernels
- Output is converted back to bf16 if needed
This fallback is correct but incurs conversion overhead and uses about 2x the memory bandwidth, since each f32 element is 4 bytes wide versus 2 bytes for bf16. Check your hardware with grep avx512_bf16 /proc/cpuinfo. See Library Overview for details.
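
To make the conversion steps concrete, here is a minimal sketch of the bf16/f32 round trip in portable C. It relies only on the fact that bf16 is the upper 16 bits of an IEEE-754 binary32 value. The function names are hypothetical and this is not AOCL-DLP's internal implementation, which the source does not show; the dot product simply stands in for "computation runs on f32 kernels".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen bf16 to f32. bf16 is the top half of a binary32 value,
 * so a 16-bit left shift reproduces the number exactly. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Narrow f32 to bf16 using round-to-nearest-even. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: truncate and force a quiet-NaN mantissa bit. */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }
    /* Add 0x7FFF plus the lowest kept bit, so exact ties go to even. */
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* Stand-in for an f32 kernel: bf16 inputs are widened, then multiplied
 * and accumulated in f32. */
static float dot_bf16_via_f32(const uint16_t *a, const uint16_t *b, int k)
{
    assert(k >= 0);
    float acc = 0.0f;
    for (int i = 0; i < k; ++i) {
        acc += bf16_to_f32(a[i]) * bf16_to_f32(b[i]);
    }
    return acc;
}
```

Widening is lossless, so the extra cost of the fallback comes from the conversion work and the wider f32 data, not from any loss of accuracy on the inputs.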
For best performance on multi-socket or multi-NUMA-node systems, use NUMA-aware thread binding:
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

See the Performance Guide and Environment Variables for detailed tuning.
Pre-reordering the weight matrix is worthwhile if you reuse the same weight matrix across multiple GEMM calls (common in inference). Reordering transforms the matrix into a cache-friendly layout that the GEMM kernel accesses optimally. The overhead of a single reorder call is amortized across many subsequent GEMM calls.
See GEMM Guide - Matrix Reordering.
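
The amortization argument can be turned into a rough break-even estimate. The sketch below uses a deliberately simple cost model, one fixed reorder cost plus a constant per-call time with and without reordering. `reorder_break_even_calls` is a hypothetical helper, and the timings you feed it would have to come from your own measurements; nothing here reflects measured AOCL-DLP numbers.

```c
#include <assert.h>

/* Hypothetical cost model (not part of AOCL-DLP).
 * Returns the smallest number of GEMM calls n for which
 *   reorder_cost + n * gemm_reordered <= n * gemm_plain,
 * or -1 if the reordered kernel is not faster per call.
 * All arguments use the same time unit. */
static long reorder_break_even_calls(double reorder_cost,
                                     double gemm_plain,
                                     double gemm_reordered)
{
    assert(reorder_cost >= 0.0);

    double saving = gemm_plain - gemm_reordered;  /* time saved per call */
    if (saving <= 0.0) {
        return -1;
    }
    long calls = (long)(reorder_cost / saving);   /* truncates toward zero */
    if ((double)calls * saving < reorder_cost) {
        calls += 1;                               /* round up to cover the cost */
    }
    return calls;
}
```

With a made-up reorder cost of 7 time units and per-call times of 5 (plain) and 3 (reordered), the break-even point is 4 calls; an inference loop that reuses the weights thousands of times is far past it.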
To check the AOCL-DLP version at run time, call dlp_version_query():

int major, minor, patch;
dlp_version_query(&major, &minor, &patch);
printf("AOCL-DLP version: %d.%d.%d\n", major, minor, patch);

See the version.c example in the examples directory.
To detect errors from a GEMM call, pass a dlp_metadata_t struct and inspect error_hndl.error_code afterwards:

dlp_metadata_t meta = {0};
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
                        1.0f, a, lda, 'N', b, ldb, 'N', 0.0f, c, ldc, &meta);
if (meta.error_hndl.error_code != DLP_CLSC_SUCCESS) {
    printf("Error: %d\n", meta.error_hndl.error_code);
}

Error codes are defined in dlp_errors.h. See the GEMM Guide for the full list.
- Quick Start -- Build and run your first program
- Integration Guide -- Comprehensive linking reference
- Performance Guide -- Optimization tips
- Environment Variables -- Complete variable reference