

Nallani Bhaskar edited this page Mar 18, 2026 · 3 revisions

Post-Operations Guide

AOCL-DLP can fuse common operations (bias addition, activations, scaling, matrix arithmetic) directly into GEMM computation. This avoids separate passes over the output matrix and reduces memory traffic.

The effective computation becomes:

C = post_ops( alpha * op(A) * op(B) + beta * C )

The dlp_metadata_t Structure

All post-operations are configured through a single dlp_metadata_t struct passed as the last argument to any GEMM function. Pass NULL when no post-ops are needed.

#include <aocl_dlp.h>

dlp_metadata_t meta = {0};  // zero-initialize

// Configure the post-op sequence (see below)
// ...

aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N', b, ldb, 'N',
    0.0f, c, ldc, &meta);

Key Fields

Field        Type                       Description
seq_length   md_t                       Number of post-operations to apply
seq_vector   DLP_POST_OP_TYPE*          Array defining the order of post-ops
bias         dlp_post_op_bias*          Bias parameters (when BIAS is in the sequence)
eltwise      dlp_post_op_eltwise*       Eltwise/activation parameters
scale        dlp_scale_t*               Scale and zero-point parameters
matrix_add   dlp_post_op_matrix_add*    Matrix addition parameters
matrix_mul   dlp_post_op_matrix_mul*    Matrix multiplication parameters
num_eltwise  md_t                       Number of eltwise operations (when multiple are chained)

Execution Order

Post-ops execute in the order defined by seq_vector. For example, if seq_vector = {BIAS, ELTWISE}, bias is applied first, then the activation.

Post-Op Types

BIAS -- Add a Bias Vector

Adds a 1D bias vector (length n) to each row of the output matrix.

// Bias vector (one value per output column)
float bias_values[N] = { /* ... */ };

dlp_post_op_bias bias_op = {
    .bias      = bias_values,
    .stor_type = DLP_F32,     // data type of bias values
    .sf        = NULL,        // scale factor (NULL if not needed)
    .zp        = NULL         // zero point (NULL if not needed)
};

DLP_POST_OP_TYPE seq[] = { BIAS };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.bias       = &bias_op;

aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N', b, ldb, 'N',
    0.0f, c, ldc, &meta);
// Result: C[i][j] = (A * B)[i][j] + bias[j]

ELTWISE -- Activation Functions

Applies an element-wise activation function to the output.

// RELU example
dlp_post_op_eltwise eltwise_op = {
    .sf   = NULL,
    .algo = {
        .alpha     = NULL,        // unused for RELU
        .beta      = NULL,        // unused for RELU
        .algo_type = RELU,
        .stor_type = DLP_F32
    }
};

DLP_POST_OP_TYPE seq[] = { ELTWISE };

dlp_metadata_t meta = {0};
meta.seq_length  = 1;
meta.seq_vector  = seq;
meta.eltwise     = &eltwise_op;
meta.num_eltwise = 1;

Supported activation functions:

DLP_ELT_ALGO_TYPE   Formula                      Parameters
RELU                max(0, x)                    None
PRELU               x >= 0 ? x : alpha * x       alpha: leak factor
GELU_TANH           GELU, tanh approximation     None
GELU_ERF            GELU, exact erf form         None
CLIP                clamp(x, alpha, beta)        alpha: min, beta: max
SWISH               x * sigmoid(alpha * x)       alpha: scaling factor
TANH                tanh(x)                      None
SIGMOID             1 / (1 + exp(-x))            None

PRELU example with alpha parameter:

float alpha_val = 0.01f;

dlp_post_op_eltwise prelu_op = {
    .sf   = NULL,
    .algo = {
        .alpha     = &alpha_val,
        .beta      = NULL,
        .algo_type = PRELU,
        .stor_type = DLP_F32
    }
};

CLIP example with min/max:

float clip_min = -1.0f;
float clip_max = 1.0f;

dlp_post_op_eltwise clip_op = {
    .sf   = NULL,
    .algo = {
        .alpha     = &clip_min,
        .beta      = &clip_max,
        .algo_type = CLIP,
        .stor_type = DLP_F32
    }
};
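A SWISH configuration follows the same pattern; this is a sketch that assumes the same struct layout as the PRELU and CLIP examples above, with alpha carrying the scaling factor from the table:

```c
float swish_alpha = 1.0f;

dlp_post_op_eltwise swish_op = {
    .sf   = NULL,
    .algo = {
        .alpha     = &swish_alpha,   // scaling factor in x * sigmoid(alpha * x)
        .beta      = NULL,           // unused for SWISH
        .algo_type = SWISH,
        .stor_type = DLP_F32
    }
};
```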

SCALE -- Scaling and Zero-Point

Applies per-channel or per-tensor scaling to the output, with optional zero-point offset.

float scale_vals[] = { 0.5f, 0.5f, /* ... one per column */ };
dlp_sf_t sf = {
    .scale_factor     = scale_vals,
    .scale_factor_len = n,           // per-channel (or 1 for per-tensor)
    .scale_factor_type = DLP_F32
};

dlp_scale_t scale_op = {
    .sf = &sf,
    .zp = NULL   // or provide a dlp_zp_t for zero-point
};

DLP_POST_OP_TYPE seq[] = { SCALE };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.scale      = &scale_op;

MATRIX_ADD -- Element-wise Addition

Adds another matrix to the GEMM output, with optional scaling.

float residual[M * N] = { /* ... */ };

dlp_post_op_matrix_add add_op = {
    .matrix    = residual,
    .ldm       = n,           // leading dimension of the added matrix
    .stor_type = DLP_F32,
    .sf        = NULL         // optional scale factor
};

DLP_POST_OP_TYPE seq[] = { MATRIX_ADD };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.matrix_add = &add_op;
// Result: C[i][j] = (A * B)[i][j] + residual[i][j]

MATRIX_MUL -- Element-wise Multiplication

Multiplies the GEMM output element-wise with another matrix.

float mask[M * N] = { /* ... */ };

dlp_post_op_matrix_mul mul_op = {
    .matrix    = mask,
    .ldm       = n,
    .stor_type = DLP_F32,
    .sf        = NULL
};

DLP_POST_OP_TYPE seq[] = { MATRIX_MUL };

dlp_metadata_t meta = {0};
meta.seq_length = 1;
meta.seq_vector = seq;
meta.matrix_mul = &mul_op;
// Result: C[i][j] = (A * B)[i][j] * mask[i][j]

Chaining Multiple Post-Ops

Post-ops can be chained by listing multiple types in seq_vector. They execute left to right.

Example: BIAS + RELU (common in neural networks)

float bias_values[N] = { /* ... */ };

dlp_post_op_bias bias_op = {
    .bias = bias_values, .stor_type = DLP_F32, .sf = NULL, .zp = NULL
};

dlp_post_op_eltwise relu_op = {
    .sf = NULL,
    .algo = { .alpha = NULL, .beta = NULL, .algo_type = RELU, .stor_type = DLP_F32 }
};

DLP_POST_OP_TYPE seq[] = { BIAS, ELTWISE };

dlp_metadata_t meta = {0};
meta.seq_length  = 2;
meta.seq_vector  = seq;
meta.bias        = &bias_op;
meta.eltwise     = &relu_op;
meta.num_eltwise = 1;

// Result: C = RELU(A * B + bias)

Example: SCALE + GELU_TANH + BIAS (reusing scale_op and bias_op from the sections above)

dlp_post_op_eltwise gelu_op = {
    .sf   = NULL,
    .algo = {
        .alpha     = NULL,        // unused for GELU_TANH
        .beta      = NULL,        // unused for GELU_TANH
        .algo_type = GELU_TANH,
        .stor_type = DLP_F32
    }
};

DLP_POST_OP_TYPE seq[] = { SCALE, ELTWISE, BIAS };

dlp_metadata_t meta = {0};
meta.seq_length  = 3;
meta.seq_vector  = seq;
meta.scale       = &scale_op;
meta.eltwise     = &gelu_op;
meta.bias        = &bias_op;
meta.num_eltwise = 1;

// Result: C = GELU(scale * (A * B)) + bias

Tips

  • Align buffers -- Align bias, scale, and residual matrix buffers to 64-byte boundaries for best performance.
  • Match data types -- Ensure stor_type of post-op parameters matches the accumulator type of your GEMM variant. For float GEMM, use DLP_F32. For integer GEMM, scale/bias still use DLP_F32 since post-ops operate on the accumulator.
  • Zero-initialize metadata -- Always start with dlp_metadata_t meta = {0} to avoid uninitialized fields.
  • Maximum post-ops -- Up to AOCL_MAX_POST_OPS (8) post-operations can be chained.

See Also

  • GEMM Guide -- GEMM parameters, data types, and reordering
  • Eltwise Guide -- Standalone element-wise ops (not fused with GEMM)
  • Quantization Guide -- Scale/zero-point setup for quantized workflows
  • Examples -- simple_gemm_with_bias.c, simple_gemm_with_relu.c, post_ops_combinations.c
  • API Reference -- Generated struct documentation
