Library Overview

Nallani Bhaskar edited this page Mar 18, 2026 · 4 revisions

AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) provides optimized matrix operations for deep learning inference and training on AMD CPUs.

Components

| Component | Description |
|---|---|
| GEMM Kernels | Optimized General Matrix Multiplication for multiple data types |
| Batch GEMM | Process multiple independent GEMM operations in one call |
| Post-Operations | Fused operations (BIAS, activations, SCALE, MATRIX_ADD/MUL) applied after GEMM |
| Eltwise Operations | Standalone element-wise transforms (independent of GEMM) |
| Matrix Reordering | Transform matrices into cache-optimal layouts for repeated use |
| Utility Functions | Standalone GELU and softmax activations |
| Threading | Parallel execution via OpenMP with 2D loop decomposition |

Data Types

AOCL-DLP supports multiple precision formats:

| Type | Bits | C Type | Typical use |
|---|---|---|---|
| float32 | 32 | float | Training, high-accuracy inference |
| float16 | 16 | float16 | Memory-efficient inference (native on Zen5+) |
| bfloat16 | 16 | bfloat16 | Training and inference (good range, lower precision) |
| int8 | 8 | int8_t / uint8_t | Quantized inference |
| int4 | 4 | packed in int8_t | Extreme weight quantization |
| int32 | 32 | int32_t | Integer accumulation |

For the full type combination matrix, see the GEMM Guide. Type definitions are in dlp_base_types.h.

BFloat16 Fallback Behavior

AOCL-DLP automatically handles BF16 operations on hardware that lacks native AVX512_BF16 support. BF16 API calls work unchanged across all hardware -- the library performs runtime detection and transparent rerouting.

How it works:

| Hardware | ISA Available | What happens |
|---|---|---|
| AMD Zen4+, Intel Cooper Lake+ | AVX512_BF16 | Native BF16 instructions -- best performance |
| AMD Zen1-3 | AVX2 only | BF16 inputs converted to f32, computed on AVX2 f32 kernels |
| Intel Skylake, Cascade Lake, Ice Lake | AVX512 (no BF16) | BF16 inputs converted to f32, computed on AVX512 f32 kernels |

When fallback is active:

  • BF16 to F32 conversion on input
  • F32 computation
  • F32 to BF16 conversion on output (if output type is bf16)
  • Performance impact: conversion overhead + 2x memory bandwidth for intermediates

No code changes needed. The same aocl_gemm_bf16bf16f32of32() call works on all platforms.

Call Flow

A typical AOCL-DLP workflow:

1. Prepare data
   - Choose memory layout (row/column major)
   - Set leading dimensions

2. (Optional) Reorder weights
   - aocl_get_reorder_buf_size_*() -> allocate -> aocl_reorder_*()
   - Worthwhile for repeated GEMM with same weights

3. Configure post-ops
   - Populate dlp_metadata_t with BIAS, ELTWISE, SCALE, etc.
   - Set seq_vector to define execution order

4. Call GEMM or Eltwise
   - aocl_gemm_*() for matrix multiplication
   - aocl_gemm_eltwise_ops_*() for standalone transforms

5. (Optional) Unreorder output
   - aocl_unreorder_*() if you need original layout back

See API Lifecycle for a concise version with direct API links.

Hardware Features

AOCL-DLP leverages AMD CPU features through runtime detection:

| ISA | Available On | Enables |
|---|---|---|
| AVX2 / FMA3 | AMD Zen1+ | f32 GEMM, bf16 fallback |
| AVX512 | AMD Zen4+ | Wider vectors, bf16 x s4/u4 |
| AVX512_VNNI | AMD Zen4+ | Accelerated integer GEMM |
| AVX512_BF16 | AMD Zen4+ | Native bfloat16 operations |
| AVX512_FP16 | AMD Zen5+ | Native half-precision GEMM |

The library automatically selects the best available kernel. You can override this with the AOCL_ENABLE_INSTRUCTIONS environment variable -- see Environment Variables.

Threading Model

AOCL-DLP uses OpenMP for parallelism with a 2D loop decomposition (JC and IC loops). Thread count can be controlled at multiple levels with a clear precedence order:

Thread-local API (highest) > Library-global API > DLP env vars > OpenMP env vars > System default

Key APIs:

  • dlp_thread_set_num_threads(n) -- set thread count (thread-local)
  • dlp_thread_set_ways(jc, ic) -- set 2D decomposition (thread-local)
  • dlp_thread_set_num_threads_library(n) -- set thread count (process-wide)
  • dlp_thread_set_ways_library(jc, ic) -- set 2D decomposition (process-wide)

See Environment Variables and Performance Guide for detailed threading configuration.

Version Query

int major, minor, patch;
dlp_version_query(&major, &minor, &patch);
