Integration Guide

Nallani Bhaskar edited this page Mar 18, 2026 · 2 revisions

Comprehensive guide for integrating AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) into your application.

Table of Contents

  1. Prerequisites
  2. Installation
  3. Integration Methods
  4. Static vs Dynamic Linking
  5. Complete Integration Examples
  6. Troubleshooting & FAQ
  7. Best Practices

Prerequisites

Before integrating AOCL-DLP, ensure your development environment meets these requirements:

Build-Time Requirements

  • CMake ≥ 3.26
  • C/C++ Compiler with C11/C++17 support (GCC 11+, Clang 14+)
  • OpenMP (optional, for multi-threading)
  • AOCL-DLP installed on your system

Runtime Requirements

  • x86_64 CPU with AVX2/FMA3 support (minimum)
  • AVX512 support for enhanced performance (optional)
  • AVX512_VNNI for int8 GEMM operations (optional)
  • AVX512_BF16 for bfloat16 GEMM operations (optional)

Installation

First, build and install AOCL-DLP on your system:

# Clone the repository
git clone <repository-url>
cd aocl-dlp

# Configure with CMake
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local ..

# Build and install
make -j$(nproc)
sudo make install

For a custom installation prefix:

cmake -DCMAKE_INSTALL_PREFIX=$HOME/local/aocl-dlp ..
make -j$(nproc)
make install  # No sudo needed for user-local install

Refer to BUILD.md and INSTALL.md for detailed build/install instructions.


Integration Methods

CMake Package Integration (Recommended)

AOCL-DLP provides CMake package configuration files for seamless integration.

Basic Usage

cmake_minimum_required(VERSION 3.26)
project(MyApp VERSION 1.0.0 LANGUAGES C CXX)

# Find AOCL-DLP package
find_package(AoclDlp REQUIRED)

# Create your application
add_executable(my_app main.c)

# Link with AOCL-DLP shared library
target_link_libraries(my_app PRIVATE AoclDlp::aocl-dlp)

Available Targets

After find_package(AoclDlp), the following targets are available:

| Target | Description |
| --- | --- |
| AoclDlp::aocl-dlp | Shared library (recommended for most applications) |
| AoclDlp::aocl-dlp_static | Static library (requires special linking flags) |

Custom Installation Prefix

If AOCL-DLP is installed in a non-standard location:

# Method 1: Set CMAKE_PREFIX_PATH
set(CMAKE_PREFIX_PATH "/path/to/aocl-dlp/install" ${CMAKE_PREFIX_PATH})
find_package(AoclDlp REQUIRED)

# Method 2: Set AoclDlp_DIR directly
set(AoclDlp_DIR "/path/to/aocl-dlp/install/lib/cmake/AoclDlp")
find_package(AoclDlp REQUIRED)

Or via the command line:

cmake -DCMAKE_PREFIX_PATH=/path/to/aocl-dlp/install ..
# or
cmake -DAoclDlp_DIR=/path/to/aocl-dlp/install/lib/cmake/AoclDlp ..

Manual Linking

If you're not using CMake or prefer manual control:

Compiler Flags

# Include directories
-I/usr/local/include

# Library directories
-L/usr/local/lib

# Link with shared library
-laocl-dlp

# Additional dependencies
-lpthread -lm

# OpenMP (if enabled during AOCL-DLP build)
-fopenmp

Complete Compilation Example

# Shared library linking
gcc -o my_app main.c \
    -I/usr/local/include \
    -L/usr/local/lib \
    -laocl-dlp \
    -lpthread -lm -fopenmp

# Static library linking (requires whole-archive - see below)
gcc -o my_app main.c \
    -I/usr/local/include \
    -L/usr/local/lib \
    -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive \
    -lpthread -lm -fopenmp -lstdc++

Static vs Dynamic Linking

Dynamic (Shared) Library Linking

Recommended for most applications.

Advantages

  • Smaller executable size
  • Easier updates (just replace the .so file)
  • No special linking flags needed
  • Simpler build configuration

CMake Example

find_package(AoclDlp REQUIRED)
add_executable(my_app main.c)
target_link_libraries(my_app PRIVATE AoclDlp::aocl-dlp)

Manual Example

gcc -o my_app main.c -I/usr/local/include -L/usr/local/lib -laocl-dlp -lpthread -lm

Static Library Linking

⚠️ CRITICAL: Requires --whole-archive Flag

AOCL-DLP uses static registration via constructor functions to automatically register optimized kernels and JIT generators at program startup. Without --whole-archive, the linker will discard object files containing only static constructors, resulting in:

  • ❌ Missing JIT kernels
  • ❌ Degraded performance (10-100x slower)
  • ❌ Fallback to reference implementations

Why Whole-Archive is Required

AOCL-DLP internally uses macros like DLP_REGISTER_STATIC_GEMM_KERNEL and DLP_REGISTER_STATIC_GEMM_JIT_GENERATOR that create static constructor functions. These functions register kernels into singleton registries (kernelRegister and jitGeneratorRegister) at program startup.

Problem: Static libraries are archives of object files. By default, the linker only pulls in object files that resolve undefined symbols. An object file that contains only static constructors defines no symbol your code references, so the linker never pulls it in and its constructors never run.

Solution: The --whole-archive flag forces the linker to include ALL object files from the static library, ensuring static constructors execute.
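The registration pattern described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `register_kernel`, `REGISTER_KERNEL`, `registry_count`, `registry_find`, and the kernel names are stand-ins, not AOCL-DLP's internal macros or registries. It relies on the GCC/Clang `__attribute__((constructor))` extension.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a kernel registry. AOCL-DLP's real registries
 * are internal to the library and are not reproduced here. */
typedef void (*kernel_fn)(void);

#define MAX_KERNELS 16

static struct {
    const char *name;
    kernel_fn   fn;
} g_registry[MAX_KERNELS];
static size_t g_registry_len = 0;

static void register_kernel(const char *name, kernel_fn fn) {
    if (g_registry_len < MAX_KERNELS) {
        g_registry[g_registry_len].name = name;
        g_registry[g_registry_len].fn = fn;
        g_registry_len++;
    }
}

/* Illustrative macro: emits a constructor that registers `fn` before main() runs. */
#define REGISTER_KERNEL(fn)                                         \
    __attribute__((constructor)) static void register_##fn(void) { \
        register_kernel(#fn, fn);                                   \
    }

static void gemm_f32_avx2(void)   { /* kernel body elided */ }
static void gemm_f32_avx512(void) { /* kernel body elided */ }

REGISTER_KERNEL(gemm_f32_avx2)
REGISTER_KERNEL(gemm_f32_avx512)

/* Number of kernels registered so far. */
size_t registry_count(void) { return g_registry_len; }

/* Looks a kernel up by name; returns NULL if it was never registered. */
kernel_fn registry_find(const char *name) {
    for (size_t i = 0; i < g_registry_len; i++) {
        if (strcmp(g_registry[i].name, name) == 0) return g_registry[i].fn;
    }
    return NULL;
}
```

Within a single translation unit, both kernels are registered by the time main() starts. If the two REGISTER_KERNEL lines lived in a separate object file inside a static archive, and nothing else referenced that object file, the linker would skip it and `registry_find` would return NULL. That is the failure mode --whole-archive prevents.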

CMake Example (Static Linking)

find_package(AoclDlp REQUIRED)
add_executable(my_app main.c)

# Use ONE of the two methods below, not both.

# Method 1: CMake's LINK_LIBRARY generator expression (CMake 3.24+)
target_link_libraries(my_app PRIVATE
    $<LINK_LIBRARY:WHOLE_ARCHIVE,AoclDlp::aocl-dlp_static>
)

# Method 2: Manual linker flags (for CMake older than 3.24; GNU-style linkers only)
target_link_libraries(my_app PRIVATE
    -Wl,--whole-archive
    AoclDlp::aocl-dlp_static
    -Wl,--no-whole-archive
)

# Don't forget OpenMP if it was enabled during the AOCL-DLP build.
# OpenMP::OpenMP_CXX requires CXX to be listed in your project() LANGUAGES.
find_package(OpenMP REQUIRED)
target_link_libraries(my_app PRIVATE OpenMP::OpenMP_CXX)

Manual Example (Static Linking)

# Linux/GCC
gcc -o my_app main.c \
    -I/usr/local/include \
    -L/usr/local/lib \
    -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive \
    -lpthread -lm -lstdc++ -fopenmp

# Clang (similar)
clang -o my_app main.c \
    -I/usr/local/include \
    -L/usr/local/lib \
    -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive \
    -lpthread -lm -lstdc++ -fopenmp

Verification

Verify that static constructors were included:

# Check for JIT generator symbols
nm my_app | grep -i "jit.*register"

# Check for kernel registration symbols
nm my_app | grep -i "kernel.*register"

# If you see multiple matches, static linking is working correctly

Complete Integration Examples

Example 1: Simple CMake Project (Shared Library)

Directory Structure:

my_project/
├── CMakeLists.txt
└── main.c

CMakeLists.txt:

cmake_minimum_required(VERSION 3.26)
project(MyGemmApp VERSION 1.0.0 LANGUAGES C)

# Find AOCL-DLP
find_package(AoclDlp REQUIRED)

# Create executable
add_executable(my_gemm_app main.c)

# Link with AOCL-DLP shared library
target_link_libraries(my_gemm_app PRIVATE AoclDlp::aocl-dlp)

# Link with math library
target_link_libraries(my_gemm_app PRIVATE m)

main.c:

#include <aocl_dlp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Matrix dimensions: C(128x128) = A(128x64) × B(64x128)
    md_t m = 128, n = 128, k = 64;
    
    // Allocate matrices
    float *a = (float*)malloc(m * k * sizeof(float));
    float *b = (float*)malloc(k * n * sizeof(float));
    float *c = (float*)malloc(m * n * sizeof(float));
    if (!a || !b || !c) {
        fprintf(stderr, "Matrix allocation failed\n");
        free(a); free(b); free(c);
        return 1;
    }
    
    // Initialize matrices (simplified)
    for (size_t i = 0; i < m * k; i++) a[i] = 1.0f;
    for (size_t i = 0; i < k * n; i++) b[i] = 2.0f;
    for (size_t i = 0; i < m * n; i++) c[i] = 0.0f;
    
    // Perform GEMM: C = A × B
    aocl_gemm_f32f32f32of32(
        'R',           // Row-major layout
        'N', 'N',      // No transpose for A or B
        m, n, k,       // Matrix dimensions
        1.0f,          // alpha
        a, k, 'N',     // Matrix A
        b, n, 'N',     // Matrix B
        0.0f,          // beta
        c, n,          // Matrix C
        NULL           // No post-ops
    );
    
    printf("GEMM completed: C[0] = %f (expected ~128.0)\n", c[0]);
    
    free(a); free(b); free(c);
    return 0;
}

Build and Run:

mkdir build && cd build
cmake ..
make
./my_gemm_app
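To check the library's output against something independent, a naive reference GEMM is often enough for small problems. The sketch below is not part of AOCL-DLP; `ref_gemm_f32` and `max_abs_diff` are hypothetical helper names, and the routine assumes row-major storage with no transposition, matching the call in main.c.

```c
#include <stddef.h>

/* Naive row-major reference: C = alpha * A * B + beta * C.
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc).
 * When beta is zero, C is overwritten without being read. */
void ref_gemm_f32(size_t m, size_t n, size_t k,
                  float alpha, const float *a, size_t lda,
                  const float *b, size_t ldb,
                  float beta, float *c, size_t ldc) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++) {
                acc += a[i * lda + p] * b[p * ldb + j];
            }
            if (beta == 0.0f) {
                c[i * ldc + j] = alpha * acc;
            } else {
                c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j];
            }
        }
    }
}

/* Largest absolute element-wise difference between two m x n row-major
 * matrices that share the leading dimension ld. */
float max_abs_diff(const float *x, const float *y,
                   size_t m, size_t n, size_t ld) {
    float worst = 0.0f;
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float d = x[i * ld + j] - y[i * ld + j];
            if (d < 0.0f) d = -d;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

A typical use is to run `ref_gemm_f32` into a second output buffer and compare it with the library result through `max_abs_diff`, accepting a small tolerance rather than exact equality, since optimized kernels may sum terms in a different order.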

Example 2: Static Linking with Whole-Archive

CMakeLists.txt:

cmake_minimum_required(VERSION 3.26)
project(MyStaticApp VERSION 1.0.0 LANGUAGES C CXX)

# Find AOCL-DLP
find_package(AoclDlp REQUIRED)

# Find OpenMP (needed for static linking if AOCL-DLP was built with OpenMP)
find_package(OpenMP REQUIRED)

# Create executable
add_executable(my_static_app main.c)

# Link with AOCL-DLP static library using whole-archive
target_link_libraries(my_static_app PRIVATE
    $<LINK_LIBRARY:WHOLE_ARCHIVE,AoclDlp::aocl-dlp_static>
    OpenMP::OpenMP_CXX
    m
)

Build:

mkdir build && cd build
cmake ..
make
./my_static_app

Example 3: Makefile-Based Project

Makefile:

CC = gcc
CFLAGS = -O3 -std=c11 -fopenmp
INCLUDES = -I/usr/local/include
LDFLAGS = -L/usr/local/lib
LIBS = -laocl-dlp -lpthread -lm

# For static linking (uncomment and comment the line above)
# LIBS = -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive -lpthread -lm -lstdc++

TARGET = my_app
SOURCES = main.c

all: $(TARGET)

$(TARGET): $(SOURCES)
	$(CC) $(CFLAGS) $(INCLUDES) -o $@ $^ $(LDFLAGS) $(LIBS)

clean:
	rm -f $(TARGET)

.PHONY: all clean

Example 4: Multiple Source Files

CMakeLists.txt:

cmake_minimum_required(VERSION 3.26)
project(MultiFileApp VERSION 1.0.0 LANGUAGES C)

find_package(AoclDlp REQUIRED)

# Create executable from multiple sources
add_executable(multi_app
    main.c
    gemm_ops.c
    matrix_utils.c
)

# Link with AOCL-DLP
target_link_libraries(multi_app PRIVATE
    AoclDlp::aocl-dlp
    m
)

Troubleshooting & FAQ

Issue 1: Poor Performance with Static Library

Symptoms:

  • Static binary runs 10-100x slower than expected
  • Performance is similar to naive C implementation
  • No JIT-generated code being used

Cause: The static library was linked without the --whole-archive flag, so the linker discarded the static-constructor registration code.

Solution: Use --whole-archive when linking with the static library:

# CMake solution
target_link_libraries(my_app PRIVATE
    $<LINK_LIBRARY:WHOLE_ARCHIVE,AoclDlp::aocl-dlp_static>
)

# Manual linking solution
gcc -o my_app main.c \
    -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive \
    -lstdc++ -lpthread -lm -fopenmp

Verification:

# Run with verbose logging to see which kernels are registered
AOCL_DLP_VERBOSE=3 ./my_app

# Check symbol table for static registrations
nm my_app | grep -i "register"

Issue 2: find_package(AoclDlp) Not Found

Symptoms:

CMake Error: Could not find a package configuration file provided by "AoclDlp"

Solutions:

  1. Specify installation directory:

    cmake -DCMAKE_PREFIX_PATH=/path/to/aocl-dlp/install ..
  2. Set AoclDlp_DIR:

    cmake -DAoclDlp_DIR=/path/to/aocl-dlp/install/lib/cmake/AoclDlp ..
  3. Check installation:

    # Verify CMake config files exist
    ls /usr/local/lib/cmake/AoclDlp/
    # Should show: AoclDlpConfig.cmake, AoclDlpTargets.cmake, etc.

Issue 3: Undefined Reference Errors

Symptoms:

undefined reference to `aocl_gemm_f32f32f32of32'

Causes and Solutions:

  1. Missing library link:

    # Add this line
    target_link_libraries(my_app PRIVATE AoclDlp::aocl-dlp)
  2. Wrong library order (manual linking):

    # Correct order: sources first, then libraries
    gcc -o my_app main.c -laocl-dlp  # ✓ Correct
    gcc -o my_app -laocl-dlp main.c  # ✗ Wrong
  3. Missing C++ standard library (static linking):

    # Add -lstdc++ for static library
    gcc ... -laocl-dlp_static -lstdc++ -lpthread -lm

Issue 4: Runtime Library Not Found

Symptoms:

./my_app: error while loading shared libraries: libaocl-dlp.so: cannot open shared object file

Solutions:

  1. Add to LD_LIBRARY_PATH:

    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
    ./my_app
  2. Add to system library path (permanent):

    # Create config file
    sudo sh -c 'echo "/usr/local/lib" > /etc/ld.so.conf.d/aocl-dlp.conf'
    sudo ldconfig
  3. Use RPATH (CMake):

    set_target_properties(my_app PROPERTIES
        INSTALL_RPATH "/usr/local/lib"
        BUILD_WITH_INSTALL_RPATH TRUE
    )
  4. Use static linking (avoids runtime dependency):

    target_link_libraries(my_app PRIVATE
        $<LINK_LIBRARY:WHOLE_ARCHIVE,AoclDlp::aocl-dlp_static>
    )

Issue 5: OpenMP Errors

Symptoms:

undefined reference to `omp_get_num_threads'

Solution: Link with OpenMP if AOCL-DLP was built with OpenMP support:

find_package(OpenMP REQUIRED)
target_link_libraries(my_app PRIVATE
    AoclDlp::aocl-dlp
    OpenMP::OpenMP_C  # or OpenMP::OpenMP_CXX for C++
)

Manual linking:

gcc -o my_app main.c -laocl-dlp -fopenmp

Issue 6: Performance Not Matching Expectations

Checklist:

  1. Verify CPU features:

    lscpu | grep -o -w -E "avx2|fma|avx512[a-z0-9_]*" | sort -u
  2. Check if JIT is being used:

    AOCL_DLP_VERBOSE=3 ./my_app
    # Look for "JIT generator registered" messages
  3. For static builds, ensure whole-archive was used:

    nm my_app | grep -i "jit.*register" | wc -l
    # Should show multiple symbols (>10)
  4. Verify threading:

    OMP_NUM_THREADS=8 ./my_app
    # Or programmatically (a C API call, not a shell command):
    dlp_thread_set_num_threads(8);
  5. Check matrix sizes:

    • AOCL-DLP is optimized for matrices >= 64x64
    • Very small matrices may not see benefits

FAQ: Should I Use Static or Shared Library?

| Criterion | Shared Library | Static Library |
| --- | --- | --- |
| Ease of Use | ✓ Simple | ✗ Requires --whole-archive |
| Binary Size | ✓ Smaller | ✗ Larger |
| Updates | ✓ Easy (replace .so) | ✗ Rebuild required |
| Deployment | ✗ Requires .so on target | ✓ Self-contained |
| Performance | Same | Same (if linked correctly) |

Recommendation: Use the shared library unless you specifically need static linking (e.g., containerized deployment, embedded systems).


FAQ: How to Check AOCL-DLP Version?

Runtime Check (C code):

#include <aocl_dlp.h>
#include <stdio.h>

int main(void) {
    printf("AOCL-DLP Version: %s\n", aocl_dlp_get_version());
    return 0;
}

CMake Check:

find_package(AoclDlp REQUIRED)
message(STATUS "Found AOCL-DLP version: ${AoclDlp_VERSION}")

Command Line:

# After installation
grep VERSION /usr/local/include/aocl_dlp_version.h

FAQ: Can I Use AOCL-DLP from C++?

Yes. The AOCL-DLP API is C-compatible, so it can be called directly from C++:

#include <aocl_dlp.h>
#include <iostream>
#include <vector>

int main() {
    std::vector<float> a(128 * 64, 1.0f);
    std::vector<float> b(64 * 128, 2.0f);
    std::vector<float> c(128 * 128, 0.0f);
    
    aocl_gemm_f32f32f32of32('R', 'N', 'N', 
        128, 128, 64,
        1.0f, a.data(), 64, 'N',
        b.data(), 128, 'N',
        0.0f, c.data(), 128,
        nullptr
    );
    
    std::cout << "Result: " << c[0] << std::endl;
    return 0;
}

CMake for C++:

cmake_minimum_required(VERSION 3.26)
project(MyCppApp LANGUAGES CXX)
find_package(AoclDlp REQUIRED)
add_executable(my_cpp_app main.cpp)
target_link_libraries(my_cpp_app PRIVATE AoclDlp::aocl-dlp)

Best Practices

1. Use Shared Library for Development

During development, use the shared library for faster iteration:

target_link_libraries(my_app PRIVATE AoclDlp::aocl-dlp)

2. Test with Static Library Before Deployment

If deploying a static binary, test it thoroughly:

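# Build static (assumes your own CMakeLists.txt defines a USE_STATIC_AOCL_DLP
# option that links AoclDlp::aocl-dlp_static with whole-archive)
cmake -DUSE_STATIC_AOCL_DLP=ON ..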
make

# Verify performance matches shared library
./benchmark_shared  # baseline
./benchmark_static  # should be similar

# Check that JIT kernels are registered
nm my_static_app | grep -i "jit.*register"
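The benchmark_shared and benchmark_static binaries are not defined in this guide. A minimal timing harness that both could share might look like the sketch below. All names in it (`gemm_call_fn`, `time_gemm`, `gemm_flops`, `gflops`, `demo_workload`) are hypothetical, and `demo_workload` is only a placeholder for a real GEMM call. It uses C11 `timespec_get` for wall-clock time.

```c
#include <stddef.h>
#include <time.h>

/* Callback type: runs one GEMM of the problem size being measured. */
typedef void (*gemm_call_fn)(void *ctx);

/* Floating-point operations in one m x n x k GEMM
 * (one multiply and one add per inner-product term). */
double gemm_flops(size_t m, size_t n, size_t k) {
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs `call` once untimed as a warm-up, then `iters` timed repetitions.
 * Returns the average number of seconds per timed call. */
double time_gemm(gemm_call_fn call, void *ctx, int iters) {
    call(ctx);
    double start = now_seconds();
    for (int i = 0; i < iters; i++) {
        call(ctx);
    }
    return (now_seconds() - start) / (double)iters;
}

/* Converts a measured time per call into GFLOP/s. */
double gflops(size_t m, size_t n, size_t k, double seconds) {
    return gemm_flops(m, n, k) / seconds / 1e9;
}

/* Placeholder workload: counts invocations through ctx.
 * Replace the body with your own GEMM call when benchmarking. */
void demo_workload(void *ctx) {
    int *calls = (int *)ctx;
    (*calls)++;
}
```

Building the same harness once against the shared library and once against the static library, then comparing the reported GFLOP/s, gives a quick signal: a static build that is an order of magnitude slower suggests the registration code was dropped at link time.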

3. Set Threading Explicitly

Don't rely on defaults; set thread count explicitly:

#include <aocl_dlp.h>

int main() {
    // Set to the number of physical cores (8 is only an example value)
    dlp_thread_set_num_threads(8);
    
    // Your GEMM calls...
}
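Rather than hard-coding a value, the thread count can be derived at startup. The sketch below is one possible policy, not something AOCL-DLP prescribes: `parse_thread_count` and `choose_thread_count` are hypothetical helpers, the 4096 upper bound is an arbitrary sanity limit, and `sysconf(_SC_NPROCESSORS_ONLN)` reports online logical CPUs, which on SMT systems is usually larger than the physical core count.

```c
#include <stdlib.h>
#include <unistd.h>

/* Parses a positive thread count from `text`.
 * Returns 0 if the text is missing, not a plain integer, or out of range. */
int parse_thread_count(const char *text) {
    if (text == NULL || *text == '\0') return 0;
    char *end = NULL;
    long value = strtol(text, &end, 10);
    /* 4096 is an arbitrary sanity limit chosen for this sketch. */
    if (*end != '\0' || value <= 0 || value > 4096) return 0;
    return (int)value;
}

/* Chooses a thread count: a plain integer in OMP_NUM_THREADS wins,
 * otherwise fall back to the number of online logical CPUs. */
int choose_thread_count(void) {
    int from_env = parse_thread_count(getenv("OMP_NUM_THREADS"));
    if (from_env > 0) return from_env;
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    return online > 0 ? (int)online : 1;
}
```

The result of `choose_thread_count()` can then be passed to `dlp_thread_set_num_threads` in place of the literal 8. List-valued settings such as `OMP_NUM_THREADS=4,2` are treated as invalid here and fall through to the CPU count.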

4. Reuse Reordered Matrices

For repeated GEMM with the same weights:

// Get buffer size needed
size_t reorder_size = aocl_get_reorder_buf_size_f32f32f32of32('R', 'N', k, n, n);

// Allocate and reorder
float *b_reordered = (float*)malloc(reorder_size);
aocl_reorder_f32f32f32of32('R', 'N', k, n, b, n, b_reordered);

// Use reordered matrix (pass 'R' as trans_b)
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, k, 'N',
    b_reordered, 0, 'R',  // 'R' indicates reordered
    0.0f, c, n, NULL
);

5. Check CPU Features at Runtime

#include <aocl_dlp.h>
#include <cpuid.h>
#include <stdio.h>

void check_features(void) {
    // AOCL-DLP automatically selects the best kernel for your CPU,
    // but you can check features manually if needed.
    
    unsigned int eax, ebx, ecx, edx;
    
    // CPUID leaf 7, subleaf 0: EBX reports AVX2 and AVX512F
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        if (ebx & bit_AVX2) printf("AVX2 supported\n");
        if (ebx & bit_AVX512F) printf("AVX512F supported\n");
    }
}
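The raw CPUID check above reports what the processor advertises. A shorter alternative, sketched here on the assumption of a GCC or Clang toolchain, uses the `__builtin_cpu_supports` builtins. The helper names `cpu_meets_minimum`, `cpu_has_avx512f`, and `print_cpu_summary` are hypothetical, and the "minimum" is the AVX2/FMA3 requirement stated in this guide's prerequisites.

```c
#include <stdio.h>

/* Returns 1 if AVX2 and FMA3 are reported as usable, else 0.
 * On non-x86 builds this always returns 0. */
int cpu_meets_minimum(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
#else
    return 0;
#endif
}

/* Returns 1 if AVX512F is reported as usable, else 0. */
int cpu_has_avx512f(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return 0;
#endif
}

/* Prints a two-line summary of the checks above. */
void print_cpu_summary(void) {
    printf("AVX2+FMA3: %s\n", cpu_meets_minimum() ? "yes" : "no");
    printf("AVX512F:   %s\n", cpu_has_avx512f() ? "yes" : "no");
}
```

Calling `print_cpu_summary()` early in main gives a quick sanity check before any GEMM runs. Kernel selection itself remains the library's job, so this is a diagnostic aid only.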

6. Use Post-Operations for Fused Kernels

Leverage fused operations for better performance:

// Setup bias post-op
dlp_metadata_t meta = {0};
meta.post_ops_len = 1;
meta.post_ops[0].op_type = AOCL_POST_OP_BIAS;
meta.post_ops[0].bias.bias = bias_vector;

// GEMM with fused bias
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, k, 'N', b, n, 'N',
    0.0f, c, n, &meta
);

7. Profile Your Application

# Use perf to identify bottlenecks
perf record -g ./my_app
perf report

# Check cache efficiency
perf stat -e cache-references,cache-misses ./my_app


Quick Reference Card

// Include header
#include <aocl_dlp.h>

// Set thread count (optional)
dlp_thread_set_num_threads(8);

// Basic GEMM: C = A × B
aocl_gemm_f32f32f32of32(
    'R',              // Row-major
    'N', 'N',         // No transpose
    m, n, k,          // Dimensions
    1.0f,             // alpha
    a, lda, 'N',      // Matrix A
    b, ldb, 'N',      // Matrix B
    0.0f,             // beta
    c, ldc,           // Matrix C
    NULL              // No post-ops
);

# Link shared library (CMake)
target_link_libraries(app PRIVATE AoclDlp::aocl-dlp)

# Link static library (CMake)
target_link_libraries(app PRIVATE
    $<LINK_LIBRARY:WHOLE_ARCHIVE,AoclDlp::aocl-dlp_static>
)

# Compile manually (shared)
gcc -o app main.c -laocl-dlp -lm -fopenmp

# Compile manually (static)
gcc -o app main.c \
    -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive \
    -lstdc++ -lm -fopenmp

Need more help? See FAQ or open an issue on GitHub.
