Library Overview

Nallani Bhaskar edited this page Mar 18, 2026 · 4 revisions

AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) provides optimized matrix operations for deep learning inference and training on AMD CPUs.

Components

| Component | Description |
|---|---|
| GEMM Kernels | Optimized General Matrix Multiplication for multiple data types |
| Batch GEMM | Process multiple independent GEMM operations in one call |
| Post-Operations | Fused operations (BIAS, activations, SCALE, MATRIX_ADD/MUL) applied after GEMM |
| Eltwise Operations | Standalone element-wise transforms (independent of GEMM) |
| Matrix Reordering | Transform matrices into cache-optimal layouts for repeated use |
| Utility Functions | Standalone GELU and softmax activations |
| Threading | Parallel execution via OpenMP with 2D loop decomposition |

Data Types

AOCL-DLP supports multiple precision formats:

| Type | Bits | C Type | Typical use |
|---|---|---|---|
| float32 | 32 | float | Training, high-accuracy inference |
| float16 | 16 | float16 | Memory-efficient inference (native on Zen5+) |
| bfloat16 | 16 | bfloat16 | Training and inference (good range, lower precision) |
| int8 | 8 | int8_t / uint8_t | Quantized inference |
| int4 | 4 | packed in int8_t | Extreme weight quantization |
| int32 | 32 | int32_t | Integer accumulation |

For the full type combination matrix, see the GEMM Guide. Type definitions are in dlp_base_types.h.

BFloat16 Fallback Behavior

AOCL-DLP automatically handles BF16 operations on hardware that lacks native AVX512_BF16 support. BF16 API calls work unchanged across all hardware -- the library performs runtime detection and transparent rerouting.

How it works:

| Hardware | ISA Available | What happens |
|---|---|---|
| AMD Zen4+, Intel Cooper Lake+ | AVX512_BF16 | Native BF16 instructions -- best performance |
| AMD Zen1-3 | AVX2 only | BF16 inputs converted to f32, computed on AVX2 f32 kernels |
| Intel Skylake, Cascade Lake, Ice Lake | AVX512 (no BF16) | BF16 inputs converted to f32, computed on AVX512 f32 kernels |

When fallback is active:

  • BF16 to F32 conversion on input
  • F32 computation
  • F32 to BF16 conversion on output (if output type is bf16)
  • Performance impact: conversion overhead + 2x memory bandwidth for intermediates

No code changes needed. The same aocl_gemm_bf16bf16f32of32() call works on all platforms.

Call Flow

A typical AOCL-DLP workflow:

1. Prepare data
   - Choose memory layout (row/column major)
   - Set leading dimensions

2. (Optional) Reorder weights
   - aocl_get_reorder_buf_size_*() -> allocate -> aocl_reorder_*()
   - Worthwhile for repeated GEMM with same weights

3. Configure post-ops
   - Populate dlp_metadata_t with BIAS, ELTWISE, SCALE, etc.
   - Set seq_vector to define execution order

4. Call GEMM or Eltwise
   - aocl_gemm_*() for matrix multiplication
   - aocl_gemm_eltwise_ops_*() for standalone transforms

5. (Optional) Unreorder output
   - aocl_unreorder_*() if you need original layout back

See API Lifecycle for a concise version with direct API links.

Hardware Features

AOCL-DLP leverages AMD CPU features through runtime detection:

| ISA | Available On | Enables |
|---|---|---|
| AVX2 / FMA3 | AMD Zen1+ | f32 GEMM, bf16 fallback |
| AVX512 | AMD Zen4+ | Wider vectors, bf16 x s4/u4 |
| AVX512_VNNI | AMD Zen4+ | Accelerated integer GEMM |
| AVX512_BF16 | AMD Zen4+ | Native bfloat16 operations |
| AVX512_FP16 | AMD Zen5+ | Native half-precision GEMM |

The library automatically selects the best available kernel. You can override this with the AOCL_ENABLE_INSTRUCTIONS environment variable -- see Environment Variables.

Threading Model

AOCL-DLP uses OpenMP for parallelism with a 2D loop decomposition (JC and IC loops). Thread count can be controlled at multiple levels with a clear precedence order:

Thread-local API (highest) > Library-global API > DLP env vars > OpenMP env vars > System default

Key APIs:

  • dlp_thread_set_num_threads(n) -- set thread count (thread-local)
  • dlp_thread_set_ways(jc, ic) -- set 2D decomposition (thread-local)
  • dlp_thread_set_num_threads_library(n) -- set thread count (process-wide)
  • dlp_thread_set_ways_library(jc, ic) -- set 2D decomposition (process-wide)

See Environment Variables and Performance Guide for detailed threading configuration.

Version Query

int major, minor, patch;
dlp_version_query(&major, &minor, &patch);
