
Performance Guide

Nallani Bhaskar edited this page Mar 18, 2026 · 3 revisions


Practical tips for getting the best performance out of AOCL-DLP on AMD processors.

Threading Optimization

Thread Count Configuration

  • Set threads appropriately using dlp_thread_set_num_threads
  • Use DLP_NUM_THREADS environment variable for global control
  • For GEMM operations, consider 2D thread decomposition with DLP_IC_NT and DLP_JC_NT
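As a sketch, the variables above might be combined like this for a 32-core socket. The counts are illustrative, and the assumption that `DLP_IC_NT × DLP_JC_NT` should equal the total thread count follows from the 2D-decomposition description:

```shell
# Hypothetical values for a 32-core socket; variable names are the
# ones this guide documents.
export DLP_NUM_THREADS=32   # global DLP thread count

# 2D thread decomposition for GEMM: IC_NT x JC_NT worker threads.
# Assumption: the product should match the total thread count.
export DLP_IC_NT=4
export DLP_JC_NT=8

echo $(( DLP_IC_NT * DLP_JC_NT ))   # -> 32, matching DLP_NUM_THREADS
```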

OpenMP Configuration

  • Always use OMP_WAIT_POLICY=active for best benchmark performance
  • Set OMP_PROC_BIND=close to keep threads close together
  • Use OMP_PLACES=cores for fine-grained thread control
  • Avoid OMP_SCHEDULE - not applicable to DLP's internal parallelization
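Putting those recommendations together, a minimal OpenMP environment for a DLP run could look like this (no `OMP_SCHEDULE`, per the note above):

```shell
# Keep worker threads spinning between kernels instead of sleeping
export OMP_WAIT_POLICY=active
# Pin threads near each other to share cache and memory locality
export OMP_PROC_BIND=close
# One thread per physical core place
export OMP_PLACES=cores
# OMP_SCHEDULE is intentionally left unset; DLP partitions work internally
unset OMP_SCHEDULE
```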

NUMA Optimization

Multi-Socket Systems:

# Recommended configuration for second socket (128 cores example)
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

Single-Socket Systems:

# Local memory binding for single-socket
OMP_WAIT_POLICY=active \
OMP_PROC_BIND=close \
OMP_PLACES=cores \
OMP_NUM_THREADS=16 \
numactl --cpunodebind=0 --membind=0 \
./your_application

Memory and Layout Optimization

Matrix Layout

  • Prefer row-major layout where applicable for better cache utilization
  • Align matrix buffers to cache line boundaries (64-byte alignment)
  • Use leading dimension values that avoid cache bank conflicts
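One way to pick such a leading dimension is sketched below: round up to a whole number of 64-byte cache lines, then nudge away from large power-of-two strides, which tend to map rows onto the same cache sets. The helper name and the 512-element threshold are illustrative choices, not part of the DLP API:

```shell
# pad_ld N ELEM_SIZE: round N elements up to a multiple of a 64-byte
# cache line, then add one line if the result is a large power of two
pad_ld() {
  local n=$1 esz=$2
  local per_line=$(( 64 / esz ))   # elements per cache line
  local ld=$(( (n + per_line - 1) / per_line * per_line ))
  # power-of-two test: (x & (x-1)) == 0
  if [ $(( ld & (ld - 1) )) -eq 0 ] && [ "$ld" -ge 512 ]; then
    ld=$(( ld + per_line ))
  fi
  echo "$ld"
}

pad_ld 1000 4   # -> 1008 (multiple of 16 floats, i.e. one cache line)
pad_ld 1024 4   # -> 1040 (steps off the 1024 power-of-two stride)
```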

Matrix Reordering

  • Reorder weights for repeated GEMMs using matrix tags:
    • mtagA: "pack" - Pack matrix A for better performance
    • mtagB: "reorder" - Reorder matrix B for optimal access patterns
  • Pre-process weight matrices once, reuse multiple times

Memory Access Patterns

  • Consider matrix sizes relative to cache hierarchy:
    • L1 Cache: ~32KB per core - optimize for small matrices
    • L2 Cache: ~512KB per core - medium matrices
    • L3 Cache: ~32MB shared - large matrices
  • Use memory interleaving on NUMA systems with numactl --interleave
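A quick back-of-the-envelope check against those per-core sizes can guide blocking decisions. The helper below is a sketch using the approximate capacities quoted above, not a measurement of any particular part:

```shell
# cache_level M N ESZ: rough cache level for an M x N matrix of
# ESZ-byte elements, using the approximate sizes quoted above
cache_level() {
  local bytes=$(( $1 * $2 * $3 ))
  if   [ "$bytes" -le $(( 32 * 1024 )) ];        then echo L1
  elif [ "$bytes" -le $(( 512 * 1024 )) ];       then echo L2
  elif [ "$bytes" -le $(( 32 * 1024 * 1024 )) ]; then echo L3
  else echo DRAM
  fi
}

cache_level 64 64 4      # 16 KB   -> L1
cache_level 256 256 4    # 256 KB  -> L2
cache_level 2048 2048 4  # 16 MB   -> L3
```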

Architecture-Specific Optimization

Instruction Set Selection

Force optimal instruction sets using AOCL_ENABLE_INSTRUCTIONS:

# For Zen4+ processors with AVX512 support
export AOCL_ENABLE_INSTRUCTIONS=avx512

# For Zen3 processors  
export AOCL_ENABLE_INSTRUCTIONS=zen3

# For Zen2 processors
export AOCL_ENABLE_INSTRUCTIONS=zen2

Quantized Operations

  • Use AVX512_VNNI-capable systems (Zen4+) for best int8 throughput
  • Leverage AVX512_BF16 instructions for bfloat16 operations
  • Consider mixed-precision workflows: train in FP32, infer in BF16/INT8
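Before relying on the VNNI or BF16 paths, it is worth confirming the CPU actually advertises those features. The `has_flag` helper below is a small illustration; on Linux the live flag list comes from `/proc/cpuinfo`, and the sample string here stands in for a Zen4-style flags line:

```shell
# has_flag FLAGS NAME: check a whitespace-separated flag list for NAME
has_flag() {
  case " $1 " in
    *" $2 "*) echo yes ;;
    *)        echo no  ;;
  esac
}

# On Linux the real list would be:
#   flags="$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
flags="fpu avx2 avx512f avx512_vnni avx512_bf16"   # sample Zen4-style flags

has_flag "$flags" avx512_vnni   # -> yes (int8 VNNI path available)
has_flag "$flags" avx512_bf16   # -> yes (bfloat16 path available)
```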

Workload-Specific Optimizations

Compute-Bound Workloads

  • Use all available cores with appropriate thread count
  • Set OMP_WAIT_POLICY=active to eliminate thread wake-up overhead
  • Consider whether SMT (simultaneous multithreading) benefits your specific workload
  • Use larger matrix sizes that fully utilize compute units

Memory-Bound Workloads

  • Consider using fewer threads than physical cores to reduce memory pressure
  • Ensure optimal memory placement with numactl
  • Use matrix reordering to improve memory access patterns
  • Monitor memory bandwidth utilization
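To find the point where extra threads stop helping a memory-bound kernel, a simple doubling sweep of thread counts works well. The sketch below only generates the sweep and prints the commands it would run; `./your_application` is the same placeholder used elsewhere in this guide:

```shell
# thread_sweep MAX: emit doubling thread counts 1, 2, 4, ... up to MAX
thread_sweep() {
  local max=$1 t=1 list=""
  while [ "$t" -le "$max" ]; do
    list="$list $t"
    t=$(( t * 2 ))
  done
  echo "${list# }"
}

# Print one benchmark invocation per thread count (cap at core count)
for t in $(thread_sweep 16); do
  echo "OMP_NUM_THREADS=$t ./your_application"
done
```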

Batch Processing

  • Use batch GEMM APIs for multiple small matrices
  • Optimize batch size for cache hierarchy
  • Consider data layout transformations for batch operations

Performance Measurement and Analysis

Benchmarking Best Practices

  • Use consistent environment:

    # Fix CPU frequency scaling
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
    # Set process affinity  
    taskset -c 0-127 ./your_application
  • Take multiple measurements for statistical significance

  • Isolate benchmark runs from other system activity

  • Run warm-up iterations to ensure a consistent cache state
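When summarizing repeated runs, the median is more robust to outliers (a stray context switch, a cold cache) than the mean. A small helper, as a sketch:

```shell
# median: print the median of newline-separated numbers on stdin;
# use it to summarize repeated timing runs instead of a single sample
median() {
  sort -n | awk '{ v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

printf '%s\n' 1.9 2.1 2.0 8.5 2.0 | median   # -> 2.0 (ignores the 8.5 outlier)
```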

Key Metrics to Monitor

  • FLOPS (Floating Point Operations Per Second) - primary performance metric
  • Memory bandwidth utilization - identify memory bottlenecks
  • CPU utilization - ensure threads are active and not over-subscribed
  • Cache hit rates - L1/L2/L3 cache effectiveness
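For GEMM, the FLOPS figure follows from the standard operation count of roughly 2·m·n·k floating-point operations. A small converter from wall-clock time to GFLOP/s (the function name is illustrative):

```shell
# gemm_gflops M N K SECONDS: GEMM performs ~2*m*n*k floating-point ops;
# convert a measured wall-clock time into GFLOP/s
gemm_gflops() {
  local m=$1 n=$2 k=$3 seconds=$4
  awk -v m="$m" -v n="$n" -v k="$k" -v s="$seconds" \
      'BEGIN { printf "%.1f\n", 2 * m * n * k / s / 1e9 }'
}

gemm_gflops 4096 4096 4096 0.5   # -> 274.9 GFLOP/s
```

Comparing this number against the machine's theoretical peak shows how close a given configuration gets to compute-bound behavior.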

Profiling and Analysis Tools

  • Use htop or top to monitor thread count and CPU usage
  • Monitor NUMA topology with numastat and lstopo
  • Profile with tools like perf for detailed performance analysis
  • Enable DLP logging: export AOCL_ENABLE_LPGEMM_LOGGER=1

Common Performance Pitfalls

Threading Issues

  • Over-subscription: More threads than physical cores
  • Poor thread affinity: Threads migrating between cores
  • Conflicting parallelism: DLP threads competing with application threads

Memory Issues

  • NUMA placement: Data on remote NUMA nodes
  • Cache conflicts: Poor memory alignment or access patterns
  • Memory bandwidth: Saturated memory subsystem

Configuration Issues

  • Wrong instruction set: Not leveraging optimal CPU features
  • Suboptimal matrix sizes: Too small to amortize overhead
  • Incorrect data types: Using higher precision than needed

Validation and Testing

Performance Regression Testing

  • Establish baseline performance metrics for your workloads
  • Test across different matrix sizes and data types
  • Validate performance after library updates
  • Use automated performance tracking in CI/CD
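A minimal regression gate for CI could compare measured GFLOP/s against a stored baseline with a tolerance band. The function name, the 5% threshold, and the baseline numbers below are all illustrative:

```shell
# regression_check BASELINE MEASURED TOL_PCT: flag a regression when the
# measured GFLOP/s drops more than TOL_PCT percent below the baseline
regression_check() {
  local baseline=$1 measured=$2 tol_pct=$3
  awk -v b="$baseline" -v m="$measured" -v t="$tol_pct" \
      'BEGIN { print (m < b * (1 - t / 100)) ? "REGRESSION" : "OK" }'
}

regression_check 1500 1480 5   # -> OK (within 5% of baseline)
regression_check 1500 1350 5   # -> REGRESSION
```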

Getting Help

For detailed environment configuration, see: Environment Configuration Guide

For comprehensive benchmarking setup, see: DLP Benchmarking Guide

For specific GEMM optimization techniques, see: GEMM Optimization Guide
