
Performance Guide

Nallani Bhaskar edited this page Mar 18, 2026 · 3 revisions


Practical tips for getting the best performance out of AOCL-DLP on AMD processors.

Threading Optimization

Thread Count Configuration

  • Set threads appropriately using dlp_thread_set_num_threads
  • Use DLP_NUM_THREADS environment variable for global control
  • For GEMM operations, consider 2D thread decomposition with DLP_IC_NT and DLP_JC_NT
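As a sketch, the variables above might be combined like this for a 32-core socket. The counts are illustrative, and the assumption that `DLP_IC_NT × DLP_JC_NT` should equal the total thread count follows from the 2D-decomposition description:

```shell
# Hypothetical values for a 32-core socket; variable names are the
# ones this guide documents.
export DLP_NUM_THREADS=32   # global DLP thread count

# 2D thread decomposition for GEMM: IC_NT x JC_NT worker threads.
# Assumption: the product should match the total thread count.
export DLP_IC_NT=4
export DLP_JC_NT=8

echo $(( DLP_IC_NT * DLP_JC_NT ))   # -> 32, matching DLP_NUM_THREADS
```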

OpenMP Configuration

  • Always use OMP_WAIT_POLICY=active for best benchmark performance
  • Set OMP_PROC_BIND=close to keep threads close together
  • Use OMP_PLACES=cores for fine-grained thread control
  • Avoid OMP_SCHEDULE - not applicable to DLP's internal parallelization
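Putting those recommendations together, a minimal OpenMP environment for a DLP run could look like this (no `OMP_SCHEDULE`, per the note above):

```shell
# Keep worker threads spinning between kernels instead of sleeping
export OMP_WAIT_POLICY=active
# Pin threads near each other to share cache and memory locality
export OMP_PROC_BIND=close
# One thread per physical core place
export OMP_PLACES=cores
# OMP_SCHEDULE is intentionally left unset; DLP partitions work internally
unset OMP_SCHEDULE
```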

NUMA Optimization

Multi-Socket Systems:

# Recommended configuration for second socket (128 cores example)
OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

Single-Socket Systems:

# Local memory binding for single-socket
OMP_WAIT_POLICY=active \
OMP_PROC_BIND=close \
OMP_PLACES=cores \
OMP_NUM_THREADS=16 \
numactl --cpunodebind=0 --membind=0 \
./your_application

Memory and Layout Optimization

Matrix Layout

  • Prefer row-major layout where applicable for better cache utilization
  • Align matrix buffers to cache line boundaries (64-byte alignment)
  • Use leading dimension values that avoid cache bank conflicts
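One way to pick such a leading dimension is sketched below: round up to a whole number of 64-byte cache lines, then nudge away from large power-of-two strides, which tend to map rows onto the same cache sets. The helper name and the 512-element threshold are illustrative choices, not part of the DLP API:

```shell
# pad_ld N ELEM_SIZE: round N elements up to a multiple of a 64-byte
# cache line, then add one line if the result is a large power of two
pad_ld() {
  local n=$1 esz=$2
  local per_line=$(( 64 / esz ))   # elements per cache line
  local ld=$(( (n + per_line - 1) / per_line * per_line ))
  # power-of-two test: (x & (x-1)) == 0
  if [ $(( ld & (ld - 1) )) -eq 0 ] && [ "$ld" -ge 512 ]; then
    ld=$(( ld + per_line ))
  fi
  echo "$ld"
}

pad_ld 1000 4   # -> 1008 (multiple of 16 floats, i.e. one cache line)
pad_ld 1024 4   # -> 1040 (steps off the 1024 power-of-two stride)
```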

Matrix Reordering

  • Reorder weights for repeated GEMMs using matrix tags:
    • mtagA: "pack" - Pack matrix A for better performance
    • mtagB: "reorder" - Reorder matrix B for optimal access patterns
  • Pre-process weight matrices once, reuse multiple times

Memory Access Patterns

  • Consider matrix sizes relative to cache hierarchy:
    • L1 Cache: ~32KB per core - optimize for small matrices
    • L2 Cache: ~512KB per core - medium matrices
    • L3 Cache: ~32MB shared - large matrices
  • Use memory interleaving on NUMA systems with numactl --interleave
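A quick back-of-the-envelope check against those per-core sizes can guide blocking decisions. The helper below is a sketch using the approximate capacities quoted above, not a measurement of any particular part:

```shell
# cache_level M N ESZ: rough cache level for an M x N matrix of
# ESZ-byte elements, using the approximate sizes quoted above
cache_level() {
  local bytes=$(( $1 * $2 * $3 ))
  if   [ "$bytes" -le $(( 32 * 1024 )) ];        then echo L1
  elif [ "$bytes" -le $(( 512 * 1024 )) ];       then echo L2
  elif [ "$bytes" -le $(( 32 * 1024 * 1024 )) ]; then echo L3
  else echo DRAM
  fi
}

cache_level 64 64 4      # 16 KB   -> L1
cache_level 256 256 4    # 256 KB  -> L2
cache_level 2048 2048 4  # 16 MB   -> L3
```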

Architecture-Specific Optimization

Instruction Set Selection

Force optimal instruction sets using AOCL_ENABLE_INSTRUCTIONS:

# For Zen4+ processors with AVX512 support
export AOCL_ENABLE_INSTRUCTIONS=avx512

# For Zen3 processors  
export AOCL_ENABLE_INSTRUCTIONS=zen3

# For Zen2 processors
export AOCL_ENABLE_INSTRUCTIONS=zen2

Quantized Operations

  • Use AVX512_VNNI-capable systems (Zen4+) for best int8 throughput
  • Leverage AVX512_BF16 instructions for bfloat16 operations
  • Consider mixed-precision workflows: train in FP32, infer in BF16/INT8
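Before relying on the VNNI or BF16 paths, it is worth confirming the CPU actually advertises those features. The `has_flag` helper below is a small illustration; on Linux the live flag list comes from `/proc/cpuinfo`, and the sample string here stands in for a Zen4-style flags line:

```shell
# has_flag FLAGS NAME: check a whitespace-separated flag list for NAME
has_flag() {
  case " $1 " in
    *" $2 "*) echo yes ;;
    *)        echo no  ;;
  esac
}

# On Linux the real list would be:
#   flags="$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
flags="fpu avx2 avx512f avx512_vnni avx512_bf16"   # sample Zen4-style flags

has_flag "$flags" avx512_vnni   # -> yes (int8 VNNI path available)
has_flag "$flags" avx512_bf16   # -> yes (bfloat16 path available)
```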

Workload-Specific Optimizations

Compute-Bound Workloads

  • Use all available cores with appropriate thread count
  • Set OMP_WAIT_POLICY=active to eliminate thread wake-up overhead
  • Consider whether SMT (simultaneous multithreading) benefits your specific workload
  • Use larger matrix sizes that fully utilize compute units

Memory-Bound Workloads

  • Consider using fewer threads than physical cores to reduce memory pressure
  • Ensure optimal memory placement with numactl
  • Use matrix reordering to improve memory access patterns
  • Monitor memory bandwidth utilization
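To find the point where extra threads stop helping a memory-bound kernel, a simple doubling sweep of thread counts works well. The sketch below only generates the sweep and prints the commands it would run; `./your_application` is the same placeholder used elsewhere in this guide:

```shell
# thread_sweep MAX: emit doubling thread counts 1, 2, 4, ... up to MAX
thread_sweep() {
  local max=$1 t=1 list=""
  while [ "$t" -le "$max" ]; do
    list="$list $t"
    t=$(( t * 2 ))
  done
  echo "${list# }"
}

# Print one benchmark invocation per thread count (cap at core count)
for t in $(thread_sweep 16); do
  echo "OMP_NUM_THREADS=$t ./your_application"
done
```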

Batch Processing

  • Use batch GEMM APIs for multiple small matrices
  • Optimize batch size for cache hierarchy
  • Consider data layout transformations for batch operations

Performance Measurement and Analysis

Benchmarking Best Practices

  • Use consistent environment:

    # Fix CPU frequency scaling
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
    # Set process affinity  
    taskset -c 0-127 ./your_application
  • Take multiple measurements for statistical significance

  • Isolate benchmark runs from other system activity

  • Run warm-up iterations to ensure a consistent cache state
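When summarizing repeated runs, the median is more robust to outliers (a stray context switch, a cold cache) than the mean. A small helper, as a sketch:

```shell
# median: print the median of newline-separated numbers on stdin;
# use it to summarize repeated timing runs instead of a single sample
median() {
  sort -n | awk '{ v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

printf '%s\n' 1.9 2.1 2.0 8.5 2.0 | median   # -> 2.0 (ignores the 8.5 outlier)
```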

Key Metrics to Monitor

  • FLOPS (Floating Point Operations Per Second) - primary performance metric
  • Memory bandwidth utilization - identify memory bottlenecks
  • CPU utilization - ensure threads are active and not over-subscribed
  • Cache hit rates - L1/L2/L3 cache effectiveness
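For GEMM, the FLOPS figure follows from the standard operation count of roughly 2·m·n·k floating-point operations. A small converter from wall-clock time to GFLOP/s (the function name is illustrative):

```shell
# gemm_gflops M N K SECONDS: GEMM performs ~2*m*n*k floating-point ops;
# convert a measured wall-clock time into GFLOP/s
gemm_gflops() {
  local m=$1 n=$2 k=$3 seconds=$4
  awk -v m="$m" -v n="$n" -v k="$k" -v s="$seconds" \
      'BEGIN { printf "%.1f\n", 2 * m * n * k / s / 1e9 }'
}

gemm_gflops 4096 4096 4096 0.5   # -> 274.9 GFLOP/s
```

Comparing this number against the machine's theoretical peak shows how close a given configuration gets to compute-bound behavior.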

Profiling and Analysis Tools

  • Use htop or top to monitor thread count and CPU usage
  • Monitor NUMA topology with numastat and lstopo
  • Profile with tools like perf for detailed performance analysis
  • Enable DLP logging: export AOCL_ENABLE_LPGEMM_LOGGER=1

Common Performance Pitfalls

Threading Issues

  • Over-subscription: More threads than physical cores
  • Poor thread affinity: Threads migrating between cores
  • Conflicting parallelism: DLP threads competing with application threads

Memory Issues

  • NUMA placement: Data on remote NUMA nodes
  • Cache conflicts: Poor memory alignment or access patterns
  • Memory bandwidth: Saturated memory subsystem

Configuration Issues

  • Wrong instruction set: Not leveraging optimal CPU features
  • Suboptimal matrix sizes: Too small to amortize overhead
  • Incorrect data types: Using higher precision than needed

Validation and Testing

Performance Regression Testing

  • Establish baseline performance metrics for your workloads
  • Test across different matrix sizes and data types
  • Validate performance after library updates
  • Use automated performance tracking in CI/CD
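A minimal regression gate for CI could compare measured GFLOP/s against a stored baseline with a tolerance band. The function name, the 5% threshold, and the baseline numbers below are all illustrative:

```shell
# regression_check BASELINE MEASURED TOL_PCT: flag a regression when the
# measured GFLOP/s drops more than TOL_PCT percent below the baseline
regression_check() {
  local baseline=$1 measured=$2 tol_pct=$3
  awk -v b="$baseline" -v m="$measured" -v t="$tol_pct" \
      'BEGIN { print (m < b * (1 - t / 100)) ? "REGRESSION" : "OK" }'
}

regression_check 1500 1480 5   # -> OK (within 5% of baseline)
regression_check 1500 1350 5   # -> REGRESSION
```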

Getting Help

For detailed environment configuration, see: Environment Configuration Guide

For comprehensive benchmarking setup, see: DLP Benchmarking Guide

For specific GEMM optimization techniques, see: GEMM Optimization Guide
