Performance Guide
Nallani Bhaskar edited this page Mar 18, 2026
Practical tips for getting the best performance out of AOCL-DLP on AMD processors.
- Set the thread count with `dlp_thread_set_num_threads`
- Use the `DLP_NUM_THREADS` environment variable for global control
- For GEMM operations, consider 2D thread decomposition with `DLP_IC_NT` and `DLP_JC_NT`
- Always use `OMP_WAIT_POLICY=active` for best benchmark performance
- Set `OMP_PROC_BIND=close` to keep threads close together
- Use `OMP_PLACES=cores` for fine-grained thread control
- Avoid `OMP_SCHEDULE` - it is not applicable to DLP's internal parallelization
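The threading controls above can be combined on the command line; a minimal sketch, assuming a 4x4 split of 16 threads across the IC and JC GEMM loops (the particular split is an illustrative choice, not a prescription):

```shell
# 16 threads total, decomposed 4x4 across the IC and JC GEMM loop dimensions
export DLP_NUM_THREADS=16
export DLP_IC_NT=4
export DLP_JC_NT=4
export OMP_WAIT_POLICY=active
export OMP_PROC_BIND=close
export OMP_PLACES=cores
# Sanity check: the 2D decomposition should cover the requested thread count
echo "$(( DLP_IC_NT * DLP_JC_NT ))"
```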
Multi-Socket Systems:
```
# Recommended configuration for the second socket (128-core example)
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application
```
Single-Socket Systems:
```
# Local memory binding for single-socket
OMP_WAIT_POLICY=active \
OMP_PROC_BIND=close \
OMP_PLACES=cores \
OMP_NUM_THREADS=16 \
numactl --cpunodebind=0 --membind=0 \
./your_application
```
- Prefer row-major layout where applicable for better cache utilization
- Align matrix buffers to cache line boundaries (64-byte alignment)
- Use leading dimension values that avoid cache bank conflicts
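One common way to address both the alignment and leading-dimension bullets is to pad the leading dimension so each row starts on a 64-byte boundary; a sketch of the arithmetic (a general guideline, not a DLP-specific requirement):

```shell
# Pad an fp32 leading dimension up to the next 16-float (64-byte) multiple
N=1000
lda=$(( (N + 15) / 16 * 16 ))
echo "lda=$lda"   # 1008: row starts stay cache-line aligned
```

As a rule of thumb, also avoid leading dimensions that are large powers of two, which can map successive rows onto the same cache sets.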
- Reorder weights for repeated GEMMs using matrix tags:
  - `mtagA: "pack"` - pack matrix A for better performance
  - `mtagB: "reorder"` - reorder matrix B for optimal access patterns
- Pre-process weight matrices once, reuse them multiple times
- Consider matrix sizes relative to the cache hierarchy:
  - L1 cache: ~32KB per core - optimize for small matrices
  - L2 cache: ~512KB per core - medium matrices
  - L3 cache: ~32MB shared - large matrices
- Use memory interleaving on NUMA systems with `numactl --interleave`
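To relate a problem size to the cache figures above, compute the working-set footprint; a quick sketch for a square fp32 matrix:

```shell
# Footprint of an N x N fp32 matrix (4 bytes per element) in KB
N=256
kb=$(( N * N * 4 / 1024 ))
echo "footprint=${kb}KB"   # 256KB: fits a ~512KB L2, far too large for a ~32KB L1
```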
Force optimal instruction sets using `AOCL_ENABLE_INSTRUCTIONS`:
```
# For Zen4+ processors with AVX512 support
export AOCL_ENABLE_INSTRUCTIONS=avx512
# For Zen3 processors
export AOCL_ENABLE_INSTRUCTIONS=zen3
# For Zen2 processors
export AOCL_ENABLE_INSTRUCTIONS=zen2
```
- Use AVX512_VNNI-capable systems (Zen4+) for best int8 throughput
- Leverage AVX512_BF16 instructions for bfloat16 operations
- Consider mixed-precision workflows: train in FP32, infer in BF16/INT8
- Use all available cores with appropriate thread count
- Set `OMP_WAIT_POLICY=active` to eliminate thread wake-up overhead
- Consider hyperthreading benefits for your specific workload
- Use larger matrix sizes that fully utilize compute units
- Consider using fewer threads than physical cores to reduce memory pressure
- Ensure optimal memory placement with numactl
- Use matrix reordering to improve memory access patterns
- Monitor memory bandwidth utilization
- Use batch GEMM APIs for multiple small matrices
- Optimize batch size for cache hierarchy
- Consider data layout transformations for batch operations
- Use a consistent environment:
```
# Fix CPU frequency scaling
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Set process affinity
taskset -c 0-127 ./your_application
```
- Take multiple measurements for statistical significance
- Isolate benchmark runs from other system activity
- Run warm-up iterations to ensure a consistent cache state
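The warm-up and repeated-measurement advice can be scripted; a minimal sketch, where `true` is a stand-in for your actual benchmark command:

```shell
# Warm up, then collect several timing samples in milliseconds
for i in 1 2 3; do true; done           # warm-up iterations (replace true with your benchmark)
samples=()
for i in 1 2 3 4 5; do
  start=$(date +%s%N)                   # GNU date, nanosecond resolution
  true                                  # replace with your benchmark command
  end=$(date +%s%N)
  samples+=( $(( (end - start) / 1000000 )) )
done
echo "samples=${#samples[@]}"
```

Report a median or trimmed mean of the samples rather than a single run.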
- FLOPS (Floating Point Operations Per Second) - primary performance metric
- Memory bandwidth utilization - identify memory bottlenecks
- CPU utilization - ensure threads are active and not over-subscribed
- Cache hit rates - L1/L2/L3 cache effectiveness
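As a worked example of the FLOPS metric above: a GEMM of dimensions M x N x K performs roughly 2*M*N*K floating-point operations (one multiply and one add per inner-product term), so dividing that count by elapsed seconds gives FLOPS:

```shell
# Operation count for a 1024^3 GEMM; divide by elapsed seconds to get FLOPS
M=1024; N=1024; K=1024
flops=$(( 2 * M * N * K ))
echo "flops=$flops"   # 2147483648 operations, i.e. ~2.1 GFLOP of work
```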
- Use `htop` or `top` to monitor thread count and CPU usage
- Monitor NUMA topology with `numastat` and `lstopo`
- Profile with tools like `perf` for detailed performance analysis
- Enable DLP logging: `export AOCL_ENABLE_LPGEMM_LOGGER=1`
- Over-subscription: More threads than physical cores
- Poor thread affinity: Threads migrating between cores
- Conflicting parallelism: DLP threads competing with application threads
- NUMA placement: Data on remote NUMA nodes
- Cache conflicts: Poor memory alignment or access patterns
- Memory bandwidth: Saturated memory subsystem
- Wrong instruction set: Not leveraging optimal CPU features
- Suboptimal matrix sizes: Too small to amortize overhead
- Incorrect data types: Using higher precision than needed
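A quick shell check for the first pitfall, over-subscription, comparing the requested thread count against what `nproc` reports (note `nproc` counts logical CPUs, so adjust the comparison if you want to exclude SMT siblings):

```shell
# Warn if OMP_NUM_THREADS exceeds the available CPUs
requested=${OMP_NUM_THREADS:-$(nproc)}
cores=$(nproc)
if [ "$requested" -gt "$cores" ]; then
  echo "oversubscribed: $requested threads on $cores CPUs"
else
  echo "ok"
fi
```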
- Establish baseline performance metrics for your workloads
- Test across different matrix sizes and data types
- Validate performance after library updates
- Use automated performance tracking in CI/CD
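Automated tracking in CI can be as simple as gating on a stored baseline; a sketch with made-up numbers (the 10% threshold and the GFLOPS values are arbitrary examples):

```shell
# Fail the CI step if measured GFLOPS drops more than 10% below baseline
baseline=100   # stored baseline GFLOPS (example value)
measured=95    # current run (example value)
min=$(( baseline * 90 / 100 ))
if [ "$measured" -lt "$min" ]; then
  echo "regression"
  exit 1
else
  echo "pass"
fi
```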
For detailed environment configuration, see: Environment Configuration Guide
For comprehensive benchmarking setup, see: DLP Benchmarking Guide
For specific GEMM optimization techniques, see: GEMM Optimization Guide