
Environment

Vishal edited this page Feb 11, 2026 · 4 revisions

AOCL-DLP Environment Variables Configuration Guide

This document provides a comprehensive guide to environment variables that can be used to configure the AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) library for optimal performance and behavior.

Threading Control Precedence

AOCL-DLP follows a specific precedence order when determining the number of threads to use:

  1. API calls - dlp_thread_set_num_threads() or dlp_thread_set_ways()
  2. DLP_NUM_THREADS - Library-specific environment variable
  3. OpenMP API - omp_set_num_threads()
  4. OMP_NUM_THREADS - OpenMP environment variable
  5. System default - Number of available CPU cores
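
The environment-variable part of this order can be sketched in plain POSIX shell. This is illustrative only: `resolve_threads` is not a library function, and steps 1 and 3 are API calls that cannot be modelled from the shell.

```shell
#!/bin/sh
# Sketch of how AOCL-DLP is documented to resolve the thread count from
# the environment; resolve_threads is an illustrative helper, not an API.
resolve_threads() {
    if [ -n "${DLP_NUM_THREADS}" ]; then
        echo "${DLP_NUM_THREADS}"    # 2. library-specific variable
    elif [ -n "${OMP_NUM_THREADS}" ]; then
        echo "${OMP_NUM_THREADS}"    # 4. OpenMP environment variable
    else
        nproc                        # 5. system default: available cores
    fi
}

export DLP_NUM_THREADS=8 OMP_NUM_THREADS=4
resolve_threads    # prints 8: DLP_NUM_THREADS wins

unset DLP_NUM_THREADS
resolve_threads    # prints 4: falls through to OMP_NUM_THREADS
```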

AOCL-DLP Specific Environment Variables

Threading Configuration

| Variable | Type | Description | Example Values |
|----------|------|-------------|----------------|
| DLP_NUM_THREADS | Integer | Sets the total number of threads for GEMM operations. Overrides OpenMP settings when set. | 1, 4, 8, 16 |
| DLP_IC_NT | Integer | Sets the number of threads for inner-loop parallelization (IC dimension). When used with DLP_JC_NT, DLP_NUM_THREADS is ignored. | 1, 2, 4 |
| DLP_JC_NT | Integer | Sets the number of threads for outer-loop parallelization (JC dimension). When used with DLP_IC_NT, DLP_NUM_THREADS is ignored. | 1, 2, 4 |
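
When both partition variables are set, they determine the decomposition on their own. Assuming the effective total is the product IC × JC (as the 2x4 decomposition example under Usage Examples suggests), the override behaves roughly as follows; `effective_threads` is an illustrative helper, not a library function:

```shell
#!/bin/sh
# Assumption: with both DLP_IC_NT and DLP_JC_NT set, the effective
# total is their product and DLP_NUM_THREADS is ignored.
effective_threads() {
    if [ -n "${DLP_IC_NT}" ] && [ -n "${DLP_JC_NT}" ]; then
        echo $(( DLP_IC_NT * DLP_JC_NT ))
    else
        echo "${DLP_NUM_THREADS:-1}"
    fi
}

export DLP_NUM_THREADS=16 DLP_IC_NT=4 DLP_JC_NT=2
effective_threads    # prints 8 (4 x 2); DLP_NUM_THREADS=16 is ignored
```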

Architecture and Instruction Set Control

| Variable | Type | Description | Supported Values |
|----------|------|-------------|------------------|
| AOCL_ENABLE_INSTRUCTIONS | String | Forces the library to use a specific instruction set, overriding auto-detection (subject to datatype support; see notes below). Case-insensitive. | zen5, zen4, zen3, zen2, zen, avx512, avx512_ymm, avx2, avx, sse4_2, sse4_1, sse4a, sse4, ssse3, sse3, sse2 |

Miscellaneous Notes

Datatype-Specific Behavior with AOCL_ENABLE_INSTRUCTIONS = avx512_ymm:

  • avx512_ymm: This option is only applicable for float32 (f32) datatypes. It forces the use of 256-bit YMM registers on AVX512-capable architectures, which can be beneficial in certain scenarios where 256-bit operations provide better performance than 512-bit operations.

  • Other datatypes (bf16, int8): These datatypes always use the default AVX512 implementation; the avx512_ymm option has no effect on them.

  • BFloat16 fallback behavior: When BFloat16 operations are rerouted to float32 due to hardware limitations, they will use the default AVX512 implementation, even if avx512_ymm is specified.

Debugging and Logging

| Variable | Type | Description | Example Values |
|----------|------|-------------|----------------|
| AOCL_ENABLE_LPGEMM_LOGGER | Boolean | Enables detailed logging for low-precision GEMM operations. Logs are written to files with the pattern aocl_lpgemm_P&lt;pid&gt;_T&lt;tid&gt;.log. | 1, true, yes, on (enable); 0, false, no, off (disable) |
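
A minimal sketch of enabling the logger and predicting the log file name. The PID here is the current shell's, and thread id 0 is assumed purely for illustration:

```shell
#!/bin/sh
# Enable detailed low-precision GEMM logging for the next run.
export AOCL_ENABLE_LPGEMM_LOGGER=1

# Log files follow the documented pattern aocl_lpgemm_P<pid>_T<tid>.log;
# for this shell's PID and thread 0 the name would be:
logfile="aocl_lpgemm_P$$_T0.log"
echo "${logfile}"
```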

OpenMP Environment Variables

AOCL-DLP leverages OpenMP for parallel execution. The following OpenMP environment variables can significantly impact performance:

Core Threading Variables

| Variable | Type | Description | Example Values |
|----------|------|-------------|----------------|
| OMP_NUM_THREADS | Integer | Sets the number of OpenMP threads. Used when DLP_NUM_THREADS is not set. | 1, 4, 8, 16 |
| OMP_PROC_BIND | String | Controls the thread affinity policy. | true, false, master, close, spread |
| OMP_PLACES | String | Specifies the places to which threads are bound. | threads, cores, sockets, {0,1,2,3} |

Performance Tuning Variables

| Variable | Type | Description | Example Values |
|----------|------|-------------|----------------|
| OMP_WAIT_POLICY | String | Sets the behavior of idle threads. Recommended: active for best performance. | active, passive |
| OMP_DYNAMIC | Boolean | Enables dynamic adjustment of the thread count. | true, false |
| OMP_NESTED | Boolean | Enables nested parallelism (deprecated; use OMP_MAX_ACTIVE_LEVELS). | true, false |
| OMP_MAX_ACTIVE_LEVELS | Integer | Maximum number of nested parallel regions. | 1, 2, 3 |
| OMP_STACKSIZE | Size | Sets the stack size for OpenMP threads. | 4M, 8M, 16M |

GNU OpenMP (GOMP) Specific Variables

| Variable | Type | Description | Example Values |
|----------|------|-------------|----------------|
| GOMP_CPU_AFFINITY | String | Binds threads to specific CPU cores (GNU OpenMP specific). | 0 1 2 3, 0-7, 0,2,4,6 |
| GOMP_SPINCOUNT | Integer | Number of spin iterations before blocking. | 300000, 1000000 |

Usage Examples

Basic Threading Configuration

```shell
# Use 8 threads for all GEMM operations
export DLP_NUM_THREADS=8

# Use a 2x4 thread decomposition (2 threads for JC, 4 for IC)
export DLP_JC_NT=2
export DLP_IC_NT=4
```

Architecture-Specific Optimization

```shell
# Force AVX512 instructions on Zen4 processors
export AOCL_ENABLE_INSTRUCTIONS=avx512

# Use Zen3-optimized kernels
export AOCL_ENABLE_INSTRUCTIONS=zen3
```

OpenMP Optimization for NUMA Systems

```shell
# Bind threads close to the master thread
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# Use the active wait policy for better performance
export OMP_WAIT_POLICY=active

# Bind to specific cores (0-15)
export GOMP_CPU_AFFINITY="0-15"
```

Debugging and Performance Analysis

```shell
# Enable detailed logging
export AOCL_ENABLE_LPGEMM_LOGGER=1

# Keep threads active (good for benchmarking)
export OMP_WAIT_POLICY=active
```

Recommended Production Command

For optimal AOCL-DLP performance on multi-socket systems, use this comprehensive command template:

```shell
# Optimal configuration for a second socket with 128 cores
# Adjust OMP_NUM_THREADS based on your system's core count per socket
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application
```

Parameter Explanation:

  • OMP_WAIT_POLICY=active - Keeps threads active for best benchmark performance
  • OMP_NUM_THREADS=128 - Set to total cores in target socket (machine dependent)
  • OMP_PLACES=cores - Bind threads to physical cores for better locality
  • OMP_PROC_BIND=close - Keep threads close together within the socket
  • numactl --cpunodebind=1 - Bind to NUMA node 1 (second socket)
  • numactl --interleave=1 - Interleave memory allocation within node 1
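
Before picking these values on a new machine, it can help to inspect the topology first. lscpu is part of util-linux and is present on most Linux systems; numactl may need to be installed separately:

```shell
#!/bin/sh
# Show CPU totals, socket count, and NUMA layout to choose
# OMP_NUM_THREADS and the --cpunodebind / -C values used above.
lscpu | grep -E '^(CPU\(s\)|Socket|NUMA)'

# Per-node CPU lists and memory sizes, if numactl is available:
numactl --hardware 2>/dev/null || true
```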

NUMA-Optimized Examples

```shell
# Multi-socket system - bind to a specific NUMA node with interleaved memory
# Example: 128 cores total, using the second socket (cores 64-127, NUMA node 1)
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl --cpunodebind=1 --interleave=1 ./your_application

# Alternative: bind to a specific core range
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl -C 64-127 --interleave=1 ./your_application

# Single-socket system - keep threads and memory local
export OMP_WAIT_POLICY=active
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=16
numactl --cpunodebind=0 --membind=0 ./your_application
```

Performance Recommendations

For Multi-Socket Systems

  • Use OMP_PROC_BIND=close with numactl --cpunodebind to bind to specific NUMA nodes
  • Set OMP_PLACES=cores for fine-grained thread control
  • Use numactl --interleave for memory interleaving across NUMA nodes
  • Set OMP_WAIT_POLICY=active for optimal performance
  • Consider binding to the second socket for better performance (machine dependent)

For Single-Socket Systems

  • Use OMP_PROC_BIND=close to keep threads near each other
  • Set OMP_PLACES=cores for better cache locality
  • Set OMP_WAIT_POLICY=active for reduced thread wake-up overhead
  • Use numactl --cpunodebind=0 --membind=0 for local memory binding

For Memory-Bound Workloads

  • Consider using fewer threads than physical cores
  • Set OMP_WAIT_POLICY=active for consistent thread responsiveness
  • Use numactl for optimal memory placement

For Compute-Bound Workloads

  • Use all available cores with OMP_NUM_THREADS=<core_count>
  • Set OMP_WAIT_POLICY=active to reduce thread wake-up overhead
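
Assuming GNU coreutils' nproc is available, the "all available cores" advice can be wired in directly:

```shell
#!/bin/sh
# Use every online CPU; substitute `getconf _NPROCESSORS_ONLN` on
# systems without GNU coreutils.
export OMP_NUM_THREADS="$(nproc)"
export OMP_WAIT_POLICY=active
echo "${OMP_NUM_THREADS}"
```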

Environment Variable Interactions

  • When both DLP_IC_NT and DLP_JC_NT are set, DLP_NUM_THREADS is ignored
  • AOCL_ENABLE_INSTRUCTIONS overrides automatic CPU feature detection
  • GOMP-specific variables only apply when using GNU OpenMP runtime
  • Setting DLP_NUM_THREADS disables OpenMP environment variable effects on thread count

Troubleshooting Common Issues

Performance Issues

Problem: Poor performance despite setting threading variables
Solution:

  • Set thread affinity explicitly with OMP_PROC_BIND and OMP_PLACES=cores, and verify the resulting binding with OMP_DISPLAY_AFFINITY=true (OpenMP 5.0+)
  • Check for thread over-subscription (threads > physical cores)
  • Monitor CPU utilization to ensure threads are active
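
The over-subscription check can be scripted; `check_threads` below is an illustrative helper:

```shell
#!/bin/sh
# Compare a requested thread count against the number of online CPUs
# (nproc, GNU coreutils).
check_threads() {
    cpus=$(nproc)
    if [ "$1" -gt "$cpus" ]; then
        echo "over-subscribed: $1 threads > $cpus CPUs"
    else
        echo "ok: $1 threads on $cpus CPUs"
    fi
}

check_threads 4
check_threads 100000    # far beyond any current core count
```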

Problem: Inconsistent performance across runs
Solution:

  • Set OMP_WAIT_POLICY=active to keep threads spinning
  • Ensure consistent thread affinity with OMP_PROC_BIND=close
  • Use numactl for consistent NUMA binding
  • Disable CPU frequency scaling during benchmarks

Threading Issues

Problem: Library not respecting DLP_NUM_THREADS
Solution:

  • Ensure no conflicting DLP_IC_NT/DLP_JC_NT settings
  • Check that OpenMP is enabled in the library build
  • Verify the variable is set in the correct shell environment

Problem: Application hangs or deadlocks
Solution:

  • Check OMP_MAX_ACTIVE_LEVELS if using nested parallelism
  • Avoid mixing DLP threading with application-level OpenMP
  • Ensure sufficient stack size with OMP_STACKSIZE

Architecture Detection Issues

Problem: Library not using optimal instruction set
Solution:

  • Explicitly set AOCL_ENABLE_INSTRUCTIONS to desired architecture
  • Verify CPU capabilities with /proc/cpuinfo on Linux
  • Check for virtualization overhead affecting instruction set detection

Debugging Tips

  1. Enable logging: Set AOCL_ENABLE_LPGEMM_LOGGER=1 to see detailed operation logs
  2. Check environment: Use env | grep -E "(DLP_|OMP_|GOMP_|AOCL_)" to verify all relevant settings
  3. Monitor resources: Use htop or top to verify thread count and CPU usage
  4. Profile performance: Use tools like perf, AMD uProf, or Intel VTune for detailed analysis
