Environment
This document provides a comprehensive guide to environment variables that can be used to configure the AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) library for optimal performance and behavior.
AOCL-DLP follows a specific precedence order when determining the number of threads to use, listed here from highest to lowest priority:

- API calls - `dlp_thread_set_num_threads()` or `dlp_thread_set_ways()`
- `DLP_NUM_THREADS` - Library-specific environment variable
- OpenMP API - `omp_set_num_threads()`
- `OMP_NUM_THREADS` - OpenMP environment variable
- System default - Number of available CPU cores
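The environment-variable part of this order can be sketched in shell. This is an illustration only, not the library's implementation: `resolve_env_threads` is a hypothetical helper, `nproc` stands in for the system default, and API calls made inside the application (which rank higher) cannot be observed from the shell.

```shell
# Illustration of the documented precedence for the environment variables.
# resolve_env_threads is a hypothetical helper, not part of AOCL-DLP.
resolve_env_threads() {
    if [ -n "${DLP_NUM_THREADS:-}" ]; then
        # Library-specific variable ranks above the OpenMP variable.
        echo "$DLP_NUM_THREADS"
    elif [ -n "${OMP_NUM_THREADS:-}" ]; then
        echo "$OMP_NUM_THREADS"
    else
        # Assumption: nproc approximates "number of available CPU cores".
        nproc
    fi
}

resolve_env_threads
```

Running the helper before launching a workload gives a quick check of which variable would determine the thread count if the application itself makes no threading API calls.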
| Variable | Type | Description | Example Values |
|---|---|---|---|
| `DLP_NUM_THREADS` | Integer | Sets the total number of threads for GEMM operations. Overrides OpenMP settings when set. | `1`, `4`, `8`, `16` |
| `DLP_IC_NT` | Integer | Sets the number of threads for inner loop parallelization (IC dimension). When used with `DLP_JC_NT`, `DLP_NUM_THREADS` is ignored. | `1`, `2`, `4` |
| `DLP_JC_NT` | Integer | Sets the number of threads for outer loop parallelization (JC dimension). When used with `DLP_IC_NT`, `DLP_NUM_THREADS` is ignored. | `1`, `2`, `4` |
| Variable | Type | Description | Supported Values |
|---|---|---|---|
| `AOCL_ENABLE_INSTRUCTIONS` | String | Forces the library to use a specific instruction set instead of the auto-detected one, subject to whether the datatype supports it. Case-insensitive. | `zen5`, `zen4`, `zen3`, `zen2`, `zen`, `avx512`, `avx512_ymm`, `avx2`, `avx`, `sse4_2`, `sse4_1`, `sse4a`, `sse4`, `ssse3`, `sse3`, `sse2` |
Datatype-specific behavior with `AOCL_ENABLE_INSTRUCTIONS=avx512_ymm`:

- `avx512_ymm`: This option applies only to float32 (f32) datatypes. It forces the use of 256-bit YMM registers on AVX512-capable architectures, which can be beneficial in certain scenarios where 256-bit operations provide better performance than 512-bit operations.
- Other datatypes (bf16, int8): These datatypes use the default AVX512 implementation when `avx512` is specified, regardless of the `avx512_ymm` setting. The `avx512_ymm` option has no effect on these datatypes.
- BFloat16 fallback behavior: When BFloat16 operations are rerouted to float32 due to hardware limitations, they use the default AVX512 implementation, even if `avx512_ymm` is specified.
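Because the variable accepts only the values in the table above, a launch script can check a requested value before exporting it. The sketch below is a hypothetical helper (`is_supported_isa` is not part of AOCL-DLP); it lowercases the input because the variable is documented as case-insensitive. How the library reacts to an unsupported value is not stated in this guide, so the sketch simply refuses to export one.

```shell
# Values copied from the "Supported Values" column of the table above.
SUPPORTED_ISAS="zen5 zen4 zen3 zen2 zen avx512 avx512_ymm avx2 avx sse4_2 sse4_1 sse4a sse4 ssse3 sse3 sse2"

# Hypothetical helper: succeeds if $1 is one of the documented values.
is_supported_isa() {
    # Lowercase the argument, since the variable is case-insensitive.
    candidate=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    for isa in $SUPPORTED_ISAS; do
        [ "$isa" = "$candidate" ] && return 0
    done
    return 1
}

# Example usage: export the value only if it is in the documented list.
requested="AVX512_YMM"
if is_supported_isa "$requested"; then
    export AOCL_ENABLE_INSTRUCTIONS="$requested"
fi
```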
| Variable | Type | Description | Example Values |
|---|---|---|---|
| `AOCL_ENABLE_LPGEMM_LOGGER` | Boolean | Enables detailed logging for low-precision GEMM operations. Logs are written to files with pattern `aocl_lpgemm_P<pid>_T<tid>.log`. | `1`, `true`, `yes`, `on` (enable); `0`, `false`, `no`, `off` (disable) |
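After a run with logging enabled, the log files can be collected by matching the documented file-name pattern. The sketch below assumes the files are written to the directory passed as the first argument (the current directory by default); this guide documents only the name pattern, not the location, and `list_lpgemm_logs` is a hypothetical helper.

```shell
# Hypothetical helper: print log files matching the documented pattern
# aocl_lpgemm_P<pid>_T<tid>.log in directory $1 (default: current directory).
# Assumption: the directory is where the library writes its logs.
list_lpgemm_logs() {
    dir=${1:-.}
    for f in "$dir"/aocl_lpgemm_P*_T*.log; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

export AOCL_ENABLE_LPGEMM_LOGGER=1
# ./your_application    # run the workload here
list_lpgemm_logs .
```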
AOCL-DLP leverages OpenMP for parallel execution. The following OpenMP environment variables can significantly impact performance:
| Variable | Type | Description | Example Values |
|---|---|---|---|
| `OMP_NUM_THREADS` | Integer | Sets the number of OpenMP threads. Used when `DLP_NUM_THREADS` is not set. | `1`, `4`, `8`, `16` |
| `OMP_PROC_BIND` | String | Controls thread affinity policy. | `true`, `false`, `master`, `close`, `spread` |
| `OMP_PLACES` | String | Specifies places where threads should be bound. | `threads`, `cores`, `sockets`, `{0,1,2,3}` |
| Variable | Type | Description | Example Values |
|---|---|---|---|
| `OMP_WAIT_POLICY` | String | Sets the behavior of idle threads. Recommended: `active` for best performance. | `active`, `passive` |
| `OMP_DYNAMIC` | Boolean | Enables dynamic adjustment of thread count. | `true`, `false` |
| `OMP_NESTED` | Boolean | Enables nested parallelism (deprecated, use `OMP_MAX_ACTIVE_LEVELS`). | `true`, `false` |
| `OMP_MAX_ACTIVE_LEVELS` | Integer | Maximum number of nested parallel regions. | `1`, `2`, `3` |
| `OMP_STACKSIZE` | Size | Sets the stack size for OpenMP threads. | `4M`, `8M`, `16M` |
| Variable | Type | Description | Example Values |
|---|---|---|---|
| `GOMP_CPU_AFFINITY` | String | Binds threads to specific CPU cores (GNU OpenMP specific). | `0 1 2 3`, `0-7`, `0,2,4,6` |
| `GOMP_SPINCOUNT` | Integer | Number of spin iterations before blocking. | `300000`, `1000000` |
# Use 8 threads for all GEMM operations
export DLP_NUM_THREADS=8
# Use 2x4 thread decomposition (2 threads for JC, 4 for IC)
export DLP_JC_NT=2
export DLP_IC_NT=4

# Force AVX512 instructions on Zen4 processors
export AOCL_ENABLE_INSTRUCTIONS=avx512
# Use Zen3-optimized kernels
export AOCL_ENABLE_INSTRUCTIONS=zen3

# Bind threads close to master thread
export OMP_PROC_BIND=close
export OMP_PLACES=cores
# Use active wait policy for better performance
export OMP_WAIT_POLICY=active
# Bind to specific cores (0-15)
export GOMP_CPU_AFFINITY="0-15"

# Enable detailed logging
export AOCL_ENABLE_LPGEMM_LOGGER=1
# Keep threads active (good for benchmarking)
export OMP_WAIT_POLICY=active

For optimal AOCL-DLP performance on multi-socket systems, use the following command template:
# Optimal configuration for second socket with 128 cores
# Adjust OMP_NUM_THREADS based on your system's core count per socket
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

Parameter explanation:

- `OMP_WAIT_POLICY=active` - Keeps threads active for best benchmark performance
- `OMP_NUM_THREADS=128` - Set to total cores in target socket (machine dependent)
- `OMP_PLACES=cores` - Bind threads to physical cores for better locality
- `OMP_PROC_BIND=close` - Keep threads close together within the socket
- `numactl --cpunodebind=1` - Bind to NUMA node 1 (second socket)
- `numactl --interleave=1` - Interleave memory allocation within node 1
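Since `OMP_NUM_THREADS` should match the core count of the target socket, that count can be derived from `lscpu` rather than hard-coded. The following is a minimal sketch, assuming a Linux system with util-linux installed and one NUMA node per socket; `cores_on_node` is a hypothetical helper that reads `lscpu -p=CORE,NODE` output on stdin.

```shell
# Hypothetical helper: reads "core,node" lines (the format produced by
# `lscpu -p=CORE,NODE`) on stdin and prints the number of distinct
# physical cores that belong to NUMA node $1.
cores_on_node() {
    awk -F, -v node="$1" '
        /^#/ { next }                       # skip lscpu header comments
        $2 == node && !seen[$1]++ { n++ }   # count each core id once
        END { print n + 0 }
    '
}

# Example usage on the target machine (not executed here):
#   OMP_NUM_THREADS=$(lscpu -p=CORE,NODE | cores_on_node 1)
```

Counting distinct core IDs rather than lines keeps SMT siblings from being counted twice, which matches the use of `OMP_PLACES=cores`.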
# Multi-socket system - bind to specific NUMA node with interleaved memory
# Example: 128 cores total, using second socket (cores 64-127, NUMA node 1)
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64   # one thread per core on the 64-core second socket
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl --cpunodebind=1 --interleave=1 ./your_application
# Alternative: Bind to specific core range
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl -C 64-127 --interleave=1 ./your_application
# Single-socket system - keep threads and memory local
export OMP_WAIT_POLICY=active
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=16
numactl --cpunodebind=0 --membind=0 ./your_application

- Use `OMP_PROC_BIND=close` with `numactl --cpunodebind` to bind to specific NUMA nodes
- Set `OMP_PLACES=cores` for fine-grained thread control
- Use `numactl --interleave` for memory interleaving across NUMA nodes
- Set `OMP_WAIT_POLICY=active` for optimal performance
- Consider binding to the second socket for better performance (machine dependent)

- Use `OMP_PROC_BIND=close` to keep threads near each other
- Set `OMP_PLACES=cores` for better cache locality
- Set `OMP_WAIT_POLICY=active` for reduced thread wake-up overhead
- Use `numactl --cpunodebind=0 --membind=0` for local memory binding

- Consider using fewer threads than physical cores
- Set `OMP_WAIT_POLICY=active` for consistent thread responsiveness
- Use `numactl` for optimal memory placement

- Use all available cores with `OMP_NUM_THREADS=<core_count>`
- Set `OMP_WAIT_POLICY=active` to reduce thread wake-up overhead

- When both `DLP_IC_NT` and `DLP_JC_NT` are set, `DLP_NUM_THREADS` is ignored
- `AOCL_ENABLE_INSTRUCTIONS` overrides automatic CPU feature detection
- GOMP-specific variables only apply when using GNU OpenMP runtime
- Setting `DLP_NUM_THREADS` disables OpenMP environment variable effects on thread count
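These interactions are easy to overlook in a long launch script. The sketch below reports which threading variables would take effect according to the notes above. `describe_dlp_threading` is a hypothetical helper, and the case where only one of `DLP_IC_NT` and `DLP_JC_NT` is set is not described in this guide, so the sketch does not attempt to model it.

```shell
# Hypothetical helper: summarize the effective DLP threading variables
# according to the documented notes. Not part of AOCL-DLP.
describe_dlp_threading() {
    if [ -n "${DLP_IC_NT:-}" ] && [ -n "${DLP_JC_NT:-}" ]; then
        # Both loop-level variables set: DLP_NUM_THREADS is ignored.
        msg="JC=${DLP_JC_NT} x IC=${DLP_IC_NT}"
        [ -n "${DLP_NUM_THREADS:-}" ] && msg="$msg (DLP_NUM_THREADS=${DLP_NUM_THREADS} ignored)"
        echo "$msg"
    elif [ -n "${DLP_NUM_THREADS:-}" ]; then
        # DLP_NUM_THREADS takes over from OMP_NUM_THREADS for thread count.
        echo "DLP_NUM_THREADS=${DLP_NUM_THREADS} (OMP_NUM_THREADS not used for thread count)"
    else
        echo "no DLP threading variables set"
    fi
}

describe_dlp_threading
```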
Problem: Poor performance despite setting threading variables
Solution:
- Verify thread affinity with `OMP_PROC_BIND=true` and `OMP_PLACES=cores`
- Check for thread over-subscription (threads > physical cores)
- Monitor CPU utilization to ensure threads are active
Problem: Inconsistent performance across runs
Solution:
- Set `OMP_WAIT_POLICY=active` to keep threads spinning
- Ensure consistent thread affinity with `OMP_PROC_BIND=close`
- Use `numactl` for consistent NUMA binding
- Disable CPU frequency scaling during benchmarks
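To check whether frequency scaling is a likely factor, the active cpufreq governors can be read from sysfs on Linux. This is a minimal sketch: `list_governors` is a hypothetical helper, the default path is the standard Linux cpufreq location, and the helper prints nothing on systems that do not expose cpufreq (for example, many virtual machines).

```shell
# Hypothetical helper: print each distinct cpufreq governor found under
# the given sysfs root (default: /sys/devices/system/cpu).
list_governors() {
    root=${1:-/sys/devices/system/cpu}
    cat "$root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort -u
}

list_governors
```

More than one line of output means different cores use different governors, which is itself a plausible source of run-to-run variation.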
Problem: Library not respecting DLP_NUM_THREADS
Solution:
- Ensure no conflicting `DLP_IC_NT`/`DLP_JC_NT` settings
- Check that OpenMP is enabled in the library build
- Verify the variable is set in the correct shell environment
Problem: Application hangs or deadlocks
Solution:
- Check `OMP_MAX_ACTIVE_LEVELS` if using nested parallelism
- Avoid mixing DLP threading with application-level OpenMP
- Ensure sufficient stack size with `OMP_STACKSIZE`
Problem: Library not using optimal instruction set
Solution:
- Explicitly set `AOCL_ENABLE_INSTRUCTIONS` to desired architecture
- Verify CPU capabilities with `/proc/cpuinfo` on Linux
- Check for virtualization overhead affecting instruction set detection
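The `/proc/cpuinfo` check can be scripted. The sketch below looks for a few Linux kernel flag names related to AVX2 and AVX-512. `cpu_has_flag` is a hypothetical helper, and which flags the library itself consults during auto-detection is not stated in this guide, so treat the flag list as an assumption.

```shell
# Hypothetical helper: succeeds if flag $1 appears on the first "flags"
# line of the cpuinfo file $2 (default: /proc/cpuinfo).
cpu_has_flag() {
    awk -v want="$1" '
        /^flags/ {
            for (i = 1; i <= NF; i++) if ($i == want) found = 1
            exit                      # only the first CPU entry is needed
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/cpuinfo}" 2>/dev/null
}

# Assumed flag list: Linux names for AVX2 and common AVX-512 features.
for flag in avx2 avx512f avx512_vnni avx512_bf16; do
    if cpu_has_flag "$flag"; then
        echo "$flag: yes"
    else
        echo "$flag: no"
    fi
done
```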
- Enable logging: Set `AOCL_ENABLE_LPGEMM_LOGGER=1` to see detailed operation logs
- Check environment: Use `env | grep -E "(DLP_|AOCL_|OMP_|GOMP_)"` to verify settings
- Monitor resources: Use `htop` or `top` to verify thread count and CPU usage
- Profile performance: Use tools like `perf` or Intel VTune for detailed analysis
- Performance Guide - Detailed performance optimization strategies
- GEMM Guide - Specific GEMM operation optimization
- DLP Testing - Testing and validation procedures