Environment

AOCL-DLP Environment Variables Configuration Guide

This document provides a comprehensive guide to environment variables that can be used to configure the AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) library for optimal performance and behavior.

Threading Control Precedence

AOCL-DLP determines the number of threads to use from the following sources, listed from highest to lowest priority:

1. API calls - dlp_thread_set_num_threads() or dlp_thread_set_ways()
2. DLP_NUM_THREADS - Library-specific environment variable
3. OpenMP API - omp_set_num_threads()
4. OMP_NUM_THREADS - OpenMP environment variable
5. System default - Number of available CPU cores
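
The environment-variable part of this order can be sketched in shell. The helper name `effective_threads` is hypothetical and not part of AOCL-DLP. It cannot see API calls such as dlp_thread_set_num_threads() or omp_set_num_threads(), and it does not consider DLP_IC_NT or DLP_JC_NT, so treat it as a rough preview of what the library is likely to read from the environment rather than a statement of its internal logic.

```shell
# Hypothetical helper: mirrors only the environment-variable steps of the
# precedence order. API calls made inside the application are not visible here.
effective_threads() {
    if [ -n "${DLP_NUM_THREADS:-}" ]; then
        echo "$DLP_NUM_THREADS"      # library-specific variable wins
    elif [ -n "${OMP_NUM_THREADS:-}" ]; then
        echo "$OMP_NUM_THREADS"      # OpenMP variable is used next
    else
        nproc                        # fallback: CPUs available to this process
    fi
}

echo "thread count suggested by the environment: $(effective_threads)"
```

Running this in the same shell that launches the application is a quick way to spot a stale DLP_NUM_THREADS that would shadow OMP_NUM_THREADS.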

AOCL-DLP Specific Environment Variables

Threading Configuration

Variable	Type	Description	Example Values
`DLP_NUM_THREADS`	Integer	Sets the total number of threads for GEMM operations. Overrides OpenMP settings when set.	`1`, `4`, `8`, `16`
`DLP_IC_NT`	Integer	Sets the number of threads for inner loop parallelization (IC dimension). When used with `DLP_JC_NT`, `DLP_NUM_THREADS` is ignored.	`1`, `2`, `4`
`DLP_JC_NT`	Integer	Sets the number of threads for outer loop parallelization (JC dimension). When used with `DLP_IC_NT`, `DLP_NUM_THREADS` is ignored.	`1`, `2`, `4`

Architecture and Instruction Set Control

Variable	Type	Description	Supported Values
`AOCL_ENABLE_INSTRUCTIONS`	String	Forces the library to use a specific instruction set instead of the auto-detected one. Whether the override takes effect depends on whether the datatype supports that instruction set. Case-insensitive.	`zen5`, `zen4`, `zen3`, `zen2`, `zen`, `avx512`, `avx512_ymm`, `avx2`, `avx`, `sse4_2`, `sse4_1`, `sse4a`, `sse4`, `ssse3`, `sse3`, `sse2`
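
Because a mistyped value is easy to miss, a launch script can check the value against the supported list before exporting it. The following is a minimal sketch; `is_supported_isa` is a hypothetical helper, not something AOCL-DLP provides, and how the library itself reacts to an unrecognized value is not covered here.

```shell
# Hypothetical helper: succeeds if $1 is one of the supported values listed
# in the table, compared case-insensitively.
is_supported_isa() {
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$value" in
        zen5|zen4|zen3|zen2|zen) return 0 ;;
        avx512|avx512_ymm|avx2|avx) return 0 ;;
        sse4_2|sse4_1|sse4a|sse4|ssse3|sse3|sse2) return 0 ;;
        *) return 1 ;;
    esac
}

isa=avx512_ymm
if is_supported_isa "$isa"; then
    export AOCL_ENABLE_INSTRUCTIONS="$isa"
else
    echo "unsupported AOCL_ENABLE_INSTRUCTIONS value: $isa" >&2
fi
```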

Miscellaneous Notes

Datatype-Specific Behavior with AOCL_ENABLE_INSTRUCTIONS = avx512_ymm:

avx512_ymm: This option is only applicable for float32 (f32) datatypes. It forces the use of 256-bit YMM registers on AVX512-capable architectures, which can be beneficial in certain scenarios where 256-bit operations provide better performance than 512-bit operations.
Other datatypes (bf16, int8): The avx512_ymm option has no effect on these datatypes. They use the default AVX512 implementation whether avx512 or avx512_ymm is specified.
BFloat16 fallback behavior: When BFloat16 operations are rerouted to float32 due to hardware limitations, they will use the default AVX512 implementation, even if avx512_ymm is specified.

Debugging and Logging

Variable	Type	Description	Example Values
`AOCL_ENABLE_LPGEMM_LOGGER`	Boolean	Enables detailed logging for low-precision GEMM operations. Logs are written to files named with the pattern `aocl_lpgemm_P<pid>_T<tid>.log`.	`1`, `true`, `yes`, `on` (enable); `0`, `false`, `no`, `off` (disable)
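
Since each process and thread gets its own log file, it can help to group the files by process ID after a run. The sketch below relies only on the documented file name pattern; `log_pid` is a hypothetical helper, and the usage lines assume the logs land in the current working directory, which this guide does not state.

```shell
# Hypothetical helper: extract the process ID from a log file name that
# follows the documented pattern aocl_lpgemm_P<pid>_T<tid>.log.
log_pid() {
    name=${1##*/}                  # drop any leading directory
    name=${name#aocl_lpgemm_P}     # drop the fixed prefix
    printf '%s\n' "${name%%_T*}"   # keep everything before "_T"
}

# Usage sketch (assumes the logs are written to the current directory):
#   AOCL_ENABLE_LPGEMM_LOGGER=1 ./your_application
#   for f in aocl_lpgemm_P*_T*.log; do
#       echo "$f belongs to pid $(log_pid "$f")"
#   done
```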

OpenMP Environment Variables

AOCL-DLP leverages OpenMP for parallel execution. The following OpenMP environment variables can significantly impact performance:

Core Threading Variables

Variable	Type	Description	Example Values
`OMP_NUM_THREADS`	Integer	Sets the number of OpenMP threads. Used when `DLP_NUM_THREADS` is not set.	`1`, `4`, `8`, `16`
`OMP_PROC_BIND`	String	Controls thread affinity policy.	`true`, `false`, `master`, `close`, `spread`
`OMP_PLACES`	String	Specifies places where threads should be bound.	`threads`, `cores`, `sockets`, `{0,1,2,3}`

Performance Tuning Variables

Variable	Type	Description	Example Values
`OMP_WAIT_POLICY`	String	Sets the behavior of idle threads. Recommended: `active` for best performance.	`active`, `passive`
`OMP_DYNAMIC`	Boolean	Enables dynamic adjustment of thread count.	`true`, `false`
`OMP_NESTED`	Boolean	Enables nested parallelism (deprecated, use `OMP_MAX_ACTIVE_LEVELS`).	`true`, `false`
`OMP_MAX_ACTIVE_LEVELS`	Integer	Maximum number of nested parallel regions.	`1`, `2`, `3`
`OMP_STACKSIZE`	Size	Sets the stack size for OpenMP threads.	`4M`, `8M`, `16M`

GNU OpenMP (GOMP) Specific Variables

Variable	Type	Description	Example Values
`GOMP_CPU_AFFINITY`	String	Binds threads to specific CPU cores (GNU OpenMP specific).	`0 1 2 3`, `0-7`, `0,2,4,6`
`GOMP_SPINCOUNT`	Integer	Number of spin iterations before blocking.	`300000`, `1000000`

Usage Examples

Basic Threading Configuration

# Use 8 threads for all GEMM operations
export DLP_NUM_THREADS=8

# Use 2x4 thread decomposition (2 threads for JC, 4 for IC)
export DLP_JC_NT=2
export DLP_IC_NT=4

Architecture-Specific Optimization

# Force AVX512 instructions on Zen4 processors
export AOCL_ENABLE_INSTRUCTIONS=avx512

# Use Zen3-optimized kernels
export AOCL_ENABLE_INSTRUCTIONS=zen3

OpenMP Optimization for NUMA Systems

# Bind threads close to master thread
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# Use active wait policy for better performance
export OMP_WAIT_POLICY=active

# GNU OpenMP only: bind to specific cores (0-15)
export GOMP_CPU_AFFINITY="0-15"

Debugging and Performance Analysis

# Enable detailed logging
export AOCL_ENABLE_LPGEMM_LOGGER=1

# Keep threads active (good for benchmarking)
export OMP_WAIT_POLICY=active

Recommended Production Command

For optimal AOCL-DLP performance on multi-socket systems, use this comprehensive command template:

# Optimal configuration for second socket with 128 cores
# Adjust OMP_NUM_THREADS based on your system's core count per socket
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

Parameter Explanation:

OMP_WAIT_POLICY=active - Keeps threads active for best benchmark performance
OMP_NUM_THREADS=128 - Set to total cores in target socket (machine dependent)
OMP_PLACES=cores - Bind threads to physical cores for better locality
OMP_PROC_BIND=close - Keep threads close together within the socket
numactl --cpunodebind=1 - Bind to NUMA node 1 (second socket)
numactl --interleave=1 - Interleave memory allocation within node 1
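
Because OMP_NUM_THREADS should match the core count of the target socket, that number can be read from the machine instead of being hard-coded. The sketch below assumes a Linux system with `lscpu` from util-linux; `cores_on_node` is a hypothetical helper that counts distinct physical cores on one NUMA node from the `CORE,NODE` columns printed by `lscpu -p`.

```shell
# Hypothetical helper: count distinct physical cores on NUMA node $1.
# Reads "core,node" CSV lines on stdin (the format of `lscpu -p=CORE,NODE`),
# skipping the comment header lines that start with '#'.
cores_on_node() {
    grep -v '^#' | awk -F, -v node="$1" '
        $2 == node { seen[$1] = 1 }
        END { n = 0; for (c in seen) n++; print n }'
}

if command -v lscpu >/dev/null 2>&1; then
    echo "physical cores on NUMA node 1: $(lscpu -p=CORE,NODE | cores_on_node 1)"
fi
```

The printed value is a candidate for OMP_NUM_THREADS in the command template. Counting cores rather than logical CPUs keeps it consistent with OMP_PLACES=cores.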

NUMA-Optimized Examples

# Multi-socket system - bind to specific NUMA node with interleaved memory
# Example: 128 cores total, using second socket (cores 64-127, NUMA node 1)
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64   # one thread per core on the second socket (cores 64-127)
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl --cpunodebind=1 --interleave=1 ./your_application

# Alternative: Bind to specific core range
export OMP_WAIT_POLICY=active
export OMP_NUM_THREADS=64
export OMP_PLACES=cores
export OMP_PROC_BIND=close
numactl -C 64-127 --interleave=1 ./your_application

# Single-socket system - keep threads and memory local
export OMP_WAIT_POLICY=active
export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=16
numactl --cpunodebind=0 --membind=0 ./your_application

Performance Recommendations

For Multi-Socket Systems

Use OMP_PROC_BIND=close with numactl --cpunodebind to bind to specific NUMA nodes
Set OMP_PLACES=cores for fine-grained thread control
Use numactl --interleave=<nodes> to interleave memory allocations across the listed NUMA nodes
Set OMP_WAIT_POLICY=active for optimal performance
Consider binding to the second socket; whether this improves performance is machine dependent, so measure both sockets

For Single-Socket Systems

Use OMP_PROC_BIND=close to keep threads near each other
Set OMP_PLACES=cores for better cache locality
Set OMP_WAIT_POLICY=active for reduced thread wake-up overhead
Use numactl --cpunodebind=0 --membind=0 for local memory binding

For Memory-Bound Workloads

Consider using fewer threads than physical cores
Set OMP_WAIT_POLICY=active for consistent thread responsiveness
Use numactl for optimal memory placement
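
Finding the best thread count for a memory-bound workload usually comes down to measuring a few candidates. A minimal sweep could look like the following; `sweep_threads` is a hypothetical helper, and the sketch assumes DLP_NUM_THREADS, DLP_IC_NT, and DLP_JC_NT are unset so that OMP_NUM_THREADS actually controls the thread count.

```shell
# Hypothetical helper: run a command once per thread count.
# $1 is the command line to run; the remaining arguments are thread counts.
sweep_threads() {
    cmd=$1
    shift
    for t in "$@"; do
        echo "== OMP_NUM_THREADS=$t =="
        # Variables are set for this one invocation only.
        OMP_NUM_THREADS=$t OMP_WAIT_POLICY=active sh -c "$cmd"
    done
}

# Usage sketch: compare fewer-than-core-count settings against the full count.
#   sweep_threads 'numactl --cpunodebind=0 --membind=0 ./your_application' 16 32 48 64
```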

For Compute-Bound Workloads

Use all available cores with OMP_NUM_THREADS=<core_count>
Set OMP_WAIT_POLICY=active to reduce thread wake-up overhead

Environment Variable Interactions

When both DLP_IC_NT and DLP_JC_NT are set, DLP_NUM_THREADS is ignored
AOCL_ENABLE_INSTRUCTIONS overrides automatic CPU feature detection
GOMP-specific variables only apply when using GNU OpenMP runtime
When DLP_NUM_THREADS is set, OpenMP settings (omp_set_num_threads(), OMP_NUM_THREADS) no longer determine the thread count
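
These interactions are easy to overlook when variables are inherited from a parent shell or job script. The check below is a sketch that reports the two overrides described above; `check_dlp_threading` is a hypothetical name, and the messages are illustrative rather than anything the library prints.

```shell
# Hypothetical pre-launch check: report which threading variables are
# overridden according to the interaction rules listed above.
check_dlp_threading() {
    if [ -n "${DLP_IC_NT:-}" ] && [ -n "${DLP_JC_NT:-}" ]; then
        if [ -n "${DLP_NUM_THREADS:-}" ]; then
            echo "warning: DLP_NUM_THREADS=$DLP_NUM_THREADS is ignored because DLP_IC_NT and DLP_JC_NT are both set"
        fi
        echo "decomposition: JC=$DLP_JC_NT x IC=$DLP_IC_NT"
    elif [ -n "${DLP_NUM_THREADS:-}" ] && [ -n "${OMP_NUM_THREADS:-}" ]; then
        echo "note: DLP_NUM_THREADS=$DLP_NUM_THREADS takes precedence over OMP_NUM_THREADS=$OMP_NUM_THREADS"
    fi
}

check_dlp_threading
```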

Troubleshooting Common Issues

Performance Issues

Problem: Poor performance despite setting threading variables
Solution:

Verify thread affinity with OMP_PROC_BIND=true and OMP_PLACES=cores
Check for thread over-subscription (threads > physical cores)
Monitor CPU utilization to ensure threads are active
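
To confirm what the OpenMP runtime actually applied, the standard OpenMP variables OMP_DISPLAY_ENV (OpenMP 4.0) and OMP_DISPLAY_AFFINITY (OpenMP 5.0) can be switched on for a single run. This is a sketch that assumes the OpenMP runtime linked into the application supports both. The output format differs between runtimes.

```shell
# Ask the OpenMP runtime to print its settings at startup and the binding of
# each thread when a parallel region starts. Both are standard OpenMP variables.
export OMP_DISPLAY_ENV=true
export OMP_DISPLAY_AFFINITY=true
export OMP_PROC_BIND=true
export OMP_PLACES=cores

# Then launch the application as usual, for example:
#   ./your_application
```

Note that OMP_DISPLAY_ENV reports OpenMP's own settings only, so a thread count set through DLP_NUM_THREADS would not be expected to show up there.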

Problem: Inconsistent performance across runs
Solution:

Set OMP_WAIT_POLICY=active to keep threads spinning
Ensure consistent thread affinity with OMP_PROC_BIND=close
Use numactl for consistent NUMA binding
Disable CPU frequency scaling during benchmarks
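
Before a benchmark run, the current frequency-scaling governor can be inspected. The sketch below assumes a Linux system that exposes the cpufreq sysfs interface; `governor_summary` is a hypothetical helper, and the directory argument exists only so the function can be pointed at a different root.

```shell
# Hypothetical helper: count how many CPUs use each cpufreq governor.
# $1 optionally overrides the sysfs CPU directory.
governor_summary() {
    root=${1:-/sys/devices/system/cpu}
    cat "$root"/cpu*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

governor_summary
```

A mix of governors, or a governor that scales frequency on demand, is one plausible source of run-to-run variation. Empty output means the interface is not available on this system.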

Threading Issues

Problem: Library not respecting DLP_NUM_THREADS
Solution:

Ensure no conflicting DLP_IC_NT/DLP_JC_NT settings
Check that OpenMP is enabled in the library build
Verify the variable is set in the correct shell environment
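
When the application is started by a launcher, service manager, or job scheduler, the variables exported in an interactive shell may never reach it. On Linux, the environment a process was started with can be read from /proc/<pid>/environ. In the sketch below, `proc_env_vars` is a hypothetical helper that filters that file for the variable prefixes used in this guide.

```shell
# Hypothetical helper: list threading and AOCL variables from a
# NUL-separated environment file such as /proc/<pid>/environ.
proc_env_vars() {
    tr '\0' '\n' < "$1" | grep -E '^(DLP_|AOCL_|OMP_|GOMP_)'
}

# Usage sketch (replace <pid> with the process ID of the running application):
#   proc_env_vars /proc/<pid>/environ
```

The file reflects the environment at process start, so variables changed later from inside the program do not appear in it.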

Problem: Application hangs or deadlocks
Solution:

Check OMP_MAX_ACTIVE_LEVELS if using nested parallelism
Avoid mixing DLP threading with application-level OpenMP
Ensure sufficient stack size with OMP_STACKSIZE

Architecture Detection Issues

Problem: Library not using optimal instruction set
Solution:

Explicitly set AOCL_ENABLE_INSTRUCTIONS to desired architecture
Verify CPU capabilities with /proc/cpuinfo on Linux
Check whether a virtualized environment hides CPU features from instruction set detection
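
The /proc/cpuinfo check can be scripted so that an instruction set is forced only when the kernel reports the matching CPU flag. This is a sketch for Linux; `has_cpu_flag` is a hypothetical helper, and `avx512f` is the kernel's flag name for AVX-512 Foundation. Which flags each AOCL_ENABLE_INSTRUCTIONS value requires is not specified in this guide, so the pairing below is an assumption.

```shell
# Hypothetical helper: succeed if CPU flag $1 appears in the first "flags"
# line of a cpuinfo file ($2, defaulting to /proc/cpuinfo).
has_cpu_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null \
        | tr ' \t' '\n\n' \
        | grep -qx "$1"
}

if has_cpu_flag avx512f; then
    echo "avx512f reported: AOCL_ENABLE_INSTRUCTIONS=avx512 is plausible on this machine"
else
    echo "avx512f not reported: avoid forcing avx512 here"
fi
```

Inside a virtual machine the flag list shows what the hypervisor exposes, which can be narrower than what the physical CPU supports.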

Debugging Tips

Enable logging: Set AOCL_ENABLE_LPGEMM_LOGGER=1 to see detailed operation logs
Check environment: Use env | grep -E "^(DLP_|AOCL_|OMP_|GOMP_)" to verify settings
Monitor resources: Use htop or top to verify thread count and CPU usage
Profile performance: Use tools like perf or Intel VTune for detailed analysis
