Library Overview
AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) provides optimized matrix operations for deep learning inference and training on AMD CPUs.
| Component | Description |
|---|---|
| GEMM Kernels | Optimized General Matrix Multiplication for multiple data types |
| Batch GEMM | Process multiple independent GEMM operations in one call |
| Post-Operations | Fused operations (BIAS, activations, SCALE, MATRIX_ADD/MUL) applied after GEMM |
| Eltwise Operations | Standalone element-wise transforms (independent of GEMM) |
| Matrix Reordering | Transform matrices into cache-optimal layouts for repeated use |
| Utility Functions | Standalone GELU and softmax activations |
| Threading | Parallel execution via OpenMP with 2D loop decomposition |
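To make the "fused post-operations" idea concrete, here is a minimal plain-C sketch (not the AOCL-DLP API) of a naive f32 GEMM that applies a bias-add and ReLU to each output element as it is produced, rather than in separate passes over the output matrix:

```c
#include <stddef.h>

/* Illustrative sketch only: a naive row-major f32 GEMM with a fused
 * BIAS post-op and a ReLU eltwise post-op. A fused kernel touches each
 * C element once, avoiding extra passes over memory. */
static void gemm_bias_relu(size_t m, size_t n, size_t k,
                           const float *a, const float *b,
                           const float *bias, float *c)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            acc += bias[j];                          /* post-op 1: BIAS   */
            c[i * n + j] = acc > 0.0f ? acc : 0.0f;  /* post-op 2: ReLU   */
        }
    }
}
```

In the real library, the same fusion is expressed declaratively through the post-op metadata rather than hand-written loops.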
AOCL-DLP supports multiple precision formats:
| Type | Bits | C Type | Typical use |
|---|---|---|---|
| float32 | 32 | float | Training, high-accuracy inference |
| float16 | 16 | float16 | Memory-efficient inference (native on Zen5+) |
| bfloat16 | 16 | bfloat16 | Training and inference (good range, lower precision) |
| int8 | 8 | int8_t / uint8_t | Quantized inference |
| int4 | 4 | packed in int8_t | Extreme weight quantization |
| int32 | 32 | int32_t | Integer accumulation |
For the full type combination matrix, see the GEMM Guide.
Type definitions are in dlp_base_types.h.
AOCL-DLP automatically handles BF16 operations on hardware that lacks native AVX512_BF16 support. BF16 API calls work unchanged across all hardware -- the library performs runtime detection and transparent rerouting.
How it works:
| Hardware | ISA Available | What happens |
|---|---|---|
| AMD Zen4+, Intel Cooper Lake+ | AVX512_BF16 | Native BF16 instructions -- best performance |
| AMD Zen1-3 | AVX2 only | BF16 inputs converted to f32, computed on AVX2 f32 kernels |
| Intel Skylake, Cascade Lake, Ice Lake | AVX512 (no BF16) | BF16 inputs converted to f32, computed on AVX512 f32 kernels |
When fallback is active:
- BF16 to F32 conversion on input
- F32 computation
- F32 to BF16 conversion on output (if output type is bf16)
- Performance impact: conversion overhead + 2x memory bandwidth for intermediates
No code changes needed. The same aocl_gemm_bf16bf16f32of32() call works on all platforms.
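For intuition about what the fallback conversions cost, here is a sketch of BF16 <-> F32 conversion in plain C (illustrative, not AOCL-DLP's internal code). bfloat16 is the top 16 bits of an IEEE f32, so narrowing is a truncation (with round-to-nearest-even here) and widening is a shift:

```c
#include <stdint.h>
#include <string.h>

/* F32 -> BF16: keep the high 16 bits, rounding to nearest even
 * on the discarded low 16 bits. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* BF16 -> F32: widen by shifting back into the high 16 bits. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

On fallback hardware, every BF16 input passes through a conversion like `bf16_to_f32` before the f32 kernel runs, which is where the conversion overhead and extra bandwidth come from.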
A typical AOCL-DLP workflow:
1. Prepare data
- Choose memory layout (row/column major)
- Set leading dimensions
2. (Optional) Reorder weights
- aocl_get_reorder_buf_size_*() -> allocate -> aocl_reorder_*()
- Worthwhile for repeated GEMM with same weights
3. Configure post-ops
- Populate dlp_metadata_t with BIAS, ELTWISE, SCALE, etc.
- Set seq_vector to define execution order
4. Call GEMM or Eltwise
- aocl_gemm_*() for matrix multiplication
- aocl_gemm_eltwise_ops_*() for standalone transforms
5. (Optional) Unreorder output
- aocl_unreorder_*() if you need original layout back
See API Lifecycle for a concise version with direct API links.
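Step 2 is worth a closer look. The sketch below (illustrative only; the real layout and API belong to `aocl_reorder_*()`) packs a row-major k x n weight matrix into contiguous column panels, the kind of cache-friendly layout that makes one-time reordering pay off across many GEMM calls with the same weights:

```c
#include <stddef.h>

#define NB 4  /* panel width; real kernels pick this per-ISA */

/* Illustrative only: pack row-major B (k x n) into column panels of
 * width NB so a GEMM kernel can stream each panel contiguously. */
static void reorder_colpanels(size_t k, size_t n,
                              const float *b, float *packed)
{
    size_t idx = 0;
    for (size_t j0 = 0; j0 < n; j0 += NB)           /* one panel per NB columns */
        for (size_t p = 0; p < k; p++)              /* rows within the panel    */
            for (size_t j = j0; j < j0 + NB && j < n; j++)
                packed[idx++] = b[p * n + j];
}
```

The library's `aocl_get_reorder_buf_size_*()` call exists because the packed layout may need a differently sized buffer than the plain matrix.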
AOCL-DLP leverages AMD CPU features through runtime detection:
| ISA | Available On | Enables |
|---|---|---|
| AVX2 / FMA3 | AMD Zen1+ | f32 GEMM, bf16 fallback |
| AVX512 | AMD Zen4+ | Wider vectors, bf16 x s4/u4 |
| AVX512_VNNI | AMD Zen4+ | Accelerated integer GEMM |
| AVX512_BF16 | AMD Zen4+ | Native bfloat16 operations |
| AVX512_FP16 | AMD Zen5+ | Native half-precision GEMM |
The library automatically selects the best available kernel. You can override this with the AOCL_ENABLE_INSTRUCTIONS environment variable -- see Environment Variables.
AOCL-DLP uses OpenMP for parallelism with a 2D loop decomposition (JC and IC loops). Thread count can be controlled at multiple levels with a clear precedence order:
Thread-local API (highest) > Library-global API > DLP env vars > OpenMP env vars > System default
Key APIs:
- dlp_thread_set_num_threads(n) -- set thread count (thread-local)
- dlp_thread_set_ways(jc, ic) -- set 2D decomposition (thread-local)
- dlp_thread_set_num_threads_library(n) -- set thread count (process-wide)
- dlp_thread_set_ways_library(jc, ic) -- set 2D decomposition (process-wide)
See Environment Variables and Performance Guide for detailed threading configuration.
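To illustrate what the jc/ic "ways" mean, here is a small sketch (not AOCL-DLP's internal code) of how a flat thread id maps onto coordinates in a JC x IC decomposition, where JC partitions the column-block loop and IC the row-block loop:

```c
#include <stddef.h>

/* Illustrative only: map thread id tid (0 <= tid < jc_ways * ic_ways)
 * to its (jc, ic) coordinates in a 2D loop decomposition. */
static void ways_coords(int tid, int jc_ways, int ic_ways,
                        int *jc_id, int *ic_id)
{
    (void)jc_ways;            /* bounds the valid tid range */
    *jc_id = tid / ic_ways;   /* which column-block partition */
    *ic_id = tid % ic_ways;   /* which row-block partition    */
}
```

With 8 threads set as ways (4, 2), thread 5 works on column-block 2, row-block 1; choosing the split to match matrix shape is what dlp_thread_set_ways() exposes.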
Query the library version at runtime:

```c
int major, minor, patch;
dlp_version_query(&major, &minor, &patch);
```

- GEMM Guide -- Data types, parameters, reordering
- Post-Ops Guide -- Fused post-operations
- Quick Start -- Build and run your first program
- API Reference -- Generated docs