Post Ops Guide
AOCL-DLP can fuse common operations (bias addition, activations, scaling, matrix arithmetic) directly into GEMM computation. This avoids separate passes over the output matrix and reduces memory traffic.
The effective computation becomes:
C = post_ops( alpha * op(A) * op(B) + beta * C )
All post-operations are configured through a single dlp_metadata_t struct passed as the last argument to any GEMM function. Pass NULL when no post-ops are needed.
```c
#include <aocl_dlp.h>

dlp_metadata_t meta = {0}; // zero-initialize

// Configure the post-op sequence (see below)
// ...

aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
                        1.0f, a, lda, 'N', b, ldb, 'N',
                        0.0f, c, ldc, &meta);
```

The fields of `dlp_metadata_t`:

| Field | Type | Description |
|---|---|---|
| `seq_length` | `md_t` | Number of post-operations to apply |
| `seq_vector` | `DLP_POST_OP_TYPE*` | Array defining the order of post-ops |
| `bias` | `dlp_post_op_bias*` | Bias parameters (when `BIAS` is in sequence) |
| `eltwise` | `dlp_post_op_eltwise*` | Eltwise/activation parameters |
| `scale` | `dlp_scale_t*` | Scale + zero-point parameters |
| `matrix_add` | `dlp_post_op_matrix_add*` | Matrix addition parameters |
| `matrix_mul` | `dlp_post_op_matrix_mul*` | Matrix multiplication parameters |
| `num_eltwise` | `md_t` | Number of eltwise operations (when multiple are chained) |
Post-ops execute in the order defined by seq_vector. For example, if seq_vector = {BIAS, ELTWISE}, bias is applied first, then the activation.
Adds a 1D bias vector (length n) to each row of the output matrix.
```c
// Bias vector (one value per output column)
float bias_values[N] = { /* ... */ };

dlp_post_op_bias bias_op = {
    .bias      = bias_values,
    .stor_type = DLP_F32, // data type of bias values
    .sf        = NULL,    // scale factor (NULL if not needed)
    .zp        = NULL     // zero point (NULL if not needed)
};

DLP_POST_OP_TYPE seq[] = { BIAS };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.bias = &bias_op;

aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
                        1.0f, a, lda, 'N', b, ldb, 'N',
                        0.0f, c, ldc, &meta);

// Result: C[i][j] = (A * B)[i][j] + bias[j]
```

Applies an element-wise activation function to the output.
```c
// RELU example
dlp_post_op_eltwise eltwise_op = {
    .sf = NULL,
    .algo = {
        .alpha     = NULL, // unused for RELU
        .beta      = NULL, // unused for RELU
        .algo_type = RELU,
        .stor_type = DLP_F32
    }
};

DLP_POST_OP_TYPE seq[] = { ELTWISE };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.eltwise = &eltwise_op;
meta.num_eltwise = 1;
```

Supported activation functions:
| `DLP_ELT_ALGO_TYPE` | Formula | Parameters |
|---|---|---|
| `RELU` | `max(0, x)` | None |
| `PRELU` | `x >= 0 ? x : alpha * x` | `alpha`: leak factor |
| `GELU_TANH` | GELU with tanh approximation | None |
| `GELU_ERF` | GELU with erf approximation | None |
| `CLIP` | `clamp(x, alpha, beta)` | `alpha`: min, `beta`: max |
| `SWISH` | `x * sigmoid(alpha * x)` | `alpha`: scaling |
| `TANH` | `tanh(x)` | None |
| `SIGMOID` | `1 / (1 + exp(-x))` | None |
PRELU example with alpha parameter:
```c
float alpha_val = 0.01f;

dlp_post_op_eltwise prelu_op = {
    .sf = NULL,
    .algo = {
        .alpha     = &alpha_val,
        .beta      = NULL,
        .algo_type = PRELU,
        .stor_type = DLP_F32
    }
};
```

CLIP example with min/max:
```c
float clip_min = -1.0f;
float clip_max = 1.0f;

dlp_post_op_eltwise clip_op = {
    .sf = NULL,
    .algo = {
        .alpha     = &clip_min,
        .beta      = &clip_max,
        .algo_type = CLIP,
        .stor_type = DLP_F32
    }
};
```

Applies per-channel or per-tensor scaling to the output, with optional zero-point offset.
```c
float scale_vals[] = { 0.5f, 0.5f, /* ... one per column */ };

dlp_sf_t sf = {
    .scale_factor      = scale_vals,
    .scale_factor_len  = n, // per-channel (or 1 for per-tensor)
    .scale_factor_type = DLP_F32
};

dlp_scale_t scale_op = {
    .sf = &sf,
    .zp = NULL // or provide a dlp_zp_t for zero-point
};

DLP_POST_OP_TYPE seq[] = { SCALE };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.scale = &scale_op;
```

Adds another matrix to the GEMM output, with optional scaling.
```c
float residual[M * N] = { /* ... */ };

dlp_post_op_matrix_add add_op = {
    .matrix    = residual,
    .ldm       = n, // leading dimension of the added matrix
    .stor_type = DLP_F32,
    .sf        = NULL // optional scale factor
};

DLP_POST_OP_TYPE seq[] = { MATRIX_ADD };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.matrix_add = &add_op;

// Result: C[i][j] = (A * B)[i][j] + residual[i][j]
```

Multiplies the GEMM output element-wise with another matrix.
```c
float mask[M * N] = { /* ... */ };

dlp_post_op_matrix_mul mul_op = {
    .matrix    = mask,
    .ldm       = n,
    .stor_type = DLP_F32,
    .sf        = NULL
};

DLP_POST_OP_TYPE seq[] = { MATRIX_MUL };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.matrix_mul = &mul_op;

// Result: C[i][j] = (A * B)[i][j] * mask[i][j]
```

Post-ops can be chained by listing multiple types in `seq_vector`. They execute left to right.
Example: BIAS + RELU (common in neural networks)
```c
float bias_values[N] = { /* ... */ };

dlp_post_op_bias bias_op = {
    .bias = bias_values, .stor_type = DLP_F32, .sf = NULL, .zp = NULL
};

dlp_post_op_eltwise relu_op = {
    .sf = NULL,
    .algo = { .alpha = NULL, .beta = NULL, .algo_type = RELU, .stor_type = DLP_F32 }
};

DLP_POST_OP_TYPE seq[] = { BIAS, ELTWISE };

dlp_metadata_t meta = {0};
meta.seq_length = 2;
meta.seq_vector = seq;
meta.bias = &bias_op;
meta.eltwise = &relu_op;
meta.num_eltwise = 1;

// Result: C = RELU(A * B + bias)
```

Example: SCALE + GELU_TANH + BIAS
```c
// Assumes scale_op, gelu_op, and bias_op are configured as in the
// individual examples above
DLP_POST_OP_TYPE seq[] = { SCALE, ELTWISE, BIAS };

dlp_metadata_t meta = {0};
meta.seq_length = 3;
meta.seq_vector = seq;
meta.scale = &scale_op;
meta.eltwise = &gelu_op;
meta.bias = &bias_op;
meta.num_eltwise = 1;

// Result: C = GELU(scale * (A * B)) + bias
```

- **Align buffers** -- Align bias, scale, and residual matrix buffers to 64-byte boundaries for best performance.
- **Match data types** -- Ensure the `stor_type` of post-op parameters matches the accumulator type of your GEMM variant. For float GEMM, use `DLP_F32`. For integer GEMM, scale/bias still use `DLP_F32` since post-ops operate on the accumulator.
- **Zero-initialize metadata** -- Always start with `dlp_metadata_t meta = {0}` to avoid uninitialized fields.
- **Maximum post-ops** -- Up to `AOCL_MAX_POST_OPS` (8) post-operations can be chained.
- GEMM Guide -- GEMM parameters, data types, and reordering
- Eltwise Guide -- Standalone element-wise ops (not fused with GEMM)
- Quantization Guide -- Scale/zero-point setup for quantized workflows
- Examples -- `simple_gemm_with_bias.c`, `simple_gemm_with_relu.c`, `post_ops_combinations.c`
- API Reference -- Generated struct documentation