Quantization Guide

Nallani Bhaskar edited this page Mar 18, 2026 · 3 revisions

AOCL-DLP supports quantized GEMM operations for efficient inference workloads. This guide covers symmetric quantization, mixed-precision workflows, and how to configure scale factors and zero-points.

Quantization Concepts

Quantization maps floating-point values to lower-precision integers for faster computation and smaller memory footprint.

Symmetric quantization centers the quantized range around zero:

q = round(x * scale)
x = q / scale

Asymmetric quantization uses a zero-point offset:

q = round(x * scale) + zero_point
x = (q - zero_point) / scale

Integer GEMM Variants

For workloads where both inputs are already quantized:

Input A | Input B | Accumulator | Outputs                | Function pattern
------- | ------- | ----------- | ---------------------- | --------------------
u8      | s8      | s32         | s32, s8, u8, f32, bf16 | aocl_gemm_u8s8s32o*
s8      | s8      | s32         | s32, s8, u8, f32, bf16 | aocl_gemm_s8s8s32o*

These variants compute the GEMM in integer arithmetic (using AVX512_VNNI when available) and can downconvert the s32 accumulator to any of the listed output types. Use SCALE and BIAS post-ops for dequantization.

Example: Quantized Inference Layer

#include <aocl_dlp.h>

// Quantized activations (uint8) and weights (int8)
uint8_t activations[M * K] = { /* quantized input */ };
int8_t  weights[K * N]     = { /* quantized weights */ };
int32_t output[M * N]      = {0};

// Basic quantized GEMM (no post-ops, raw int32 accumulation)
aocl_gemm_u8s8s32os32(
    'R', 'N', 'N', m, n, k,
    1,                          // alpha (int32)
    activations, lda, 'N',
    weights, ldb, 'N',
    0,                          // beta (int32)
    output, ldc, NULL);

Dequantize Output with Post-Ops

To get float output from integer GEMM with dequantization:

// Scale factors for dequantization (one per output channel)
float scale_vals[N] = { /* calibrated scales */ };
dlp_sf_t sf = {
    .scale_factor      = scale_vals,
    .scale_factor_len  = n,
    .scale_factor_type = DLP_F32
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };

// Bias (applied after scaling)
float bias_vals[N] = { /* bias per channel */ };
dlp_post_op_bias bias_op = {
    .bias = bias_vals, .stor_type = DLP_F32, .sf = NULL, .zp = NULL
};

// Chain: SCALE then BIAS
DLP_POST_OP_TYPE seq[] = { SCALE, BIAS };

dlp_metadata_t meta = {0};
meta.seq_length = 2;
meta.seq_vector = seq;
meta.scale      = &scale_op;
meta.bias       = &bias_op;

float output_f32[M * N];
aocl_gemm_u8s8s32of32(
    'R', 'N', 'N', m, n, k,
    1, activations, lda, 'N',
    weights, ldb, 'N',
    0, output_f32, ldc, &meta);
// output_f32 = bias + scale * (activations * weights)

Symmetric Quantization GEMM

AOCL-DLP provides specialized symmetric quantization variants that handle grouped quantization natively:

  • aocl_gemm_s8s8s32of32_sym_quant
  • aocl_gemm_s8s8s32obf16_sym_quant

The reorder functions (aocl_get_reorder_buf_size_s8s8s32os32_sym_quant and aocl_reorder_s8s8s32os32_sym_quant) accept a DLP_SYMM_STAT_QUANT* parameter to pack quantization group metadata alongside the reordered matrix. The GEMM call itself uses the standard signature with dlp_metadata_t* as the last parameter.

// Symmetric quantization config
DLP_SYMM_STAT_QUANT symq = {
    .group_size = 128   // quantization group size (e.g., 128 elements per group)
};

// Reorder weights with symmetric quantization metadata
msz_t buf_size = aocl_get_reorder_buf_size_s8s8s32os32_sym_quant(
    'R', 'N', 'B', k, n, &symq, NULL);

int8_t *b_reordered = (int8_t *)malloc(buf_size);
aocl_reorder_s8s8s32os32_sym_quant(
    'R', 'N', 'B', weights, b_reordered, k, n, ldb, &symq, NULL);

// Compute with symmetric quantization
float output_f32[M * N];
aocl_gemm_s8s8s32of32_sym_quant(
    'R', 'N', 'N', m, n, k,
    1, activations_s8, lda, 'N',
    b_reordered, ldb, 'R',
    0, output_f32, ldc, NULL);

Mixed-Precision Quantized GEMM

For workloads where activations are in higher precision and weights are quantized:

Input A (activations) | Input B (weights) | Accumulator | Outputs                | Use case
--------------------- | ----------------- | ----------- | ---------------------- | --------------------------------------------
bf16                  | s8                | s32         | s32, f32, bf16, s8, u8 | BF16 activations with int8 weights
bf16                  | s4                | f32         | f32, bf16              | BF16 activations with 4-bit weights
bf16                  | u4                | f32         | f32, bf16              | BF16 activations with unsigned 4-bit weights
f32                   | s8                | s32         | s32, f32, bf16, s8, u8 | F32 activations with int8 weights

These variants handle on-the-fly quantization of the higher-precision input internally.

The dlp_quant_op Structure

For advanced quantization workflows, dlp_metadata_t supports pre- and post-quantization operations via the dlp_quant_op struct:

typedef struct {
    md_t      group_size;  // elements per quantization group
    DLP_TYPE  src_type;    // source type (e.g., DLP_BF16)
    DLP_TYPE  dst_type;    // destination type (e.g., DLP_S8)
    dlp_sf_t* scl;         // scale factors
    dlp_zp_t* zp;          // zero-points (NULL for symmetric)
    bool      symmetric;   // true = symmetric, false = asymmetric
} dlp_quant_op;

These can be attached to dlp_metadata_t as:

  • a_pre_quant / b_pre_quant -- quantize inputs before GEMM
  • a_post_quant / b_post_quant -- quantize after GEMM
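As an illustrative (untested) sketch, a symmetric pre-quantization of the A input might be configured as below. The dlp_metadata_t member name a_pre_quant follows the list above; the scale-factor count and values are placeholders you would compute from calibration.

```c
// Sketch: quantize BF16 activations to s8 in groups of 128 before the GEMM.
float a_scale_vals[8] = { /* calibrated scales, one per group */ };
dlp_sf_t a_scales = {
    .scale_factor      = a_scale_vals,
    .scale_factor_len  = 8,
    .scale_factor_type = DLP_F32
};

dlp_quant_op a_quant = {
    .group_size = 128,
    .src_type   = DLP_BF16,
    .dst_type   = DLP_S8,
    .scl        = &a_scales,
    .zp         = NULL,      // symmetric quantization: no zero-point
    .symmetric  = true
};

dlp_metadata_t meta = {0};
meta.a_pre_quant = &a_quant;
// Pass &meta as the last argument of the mixed-precision GEMM call.
```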

Tips

  • Calibrate scales carefully -- Scale factors significantly impact accuracy. Use representative calibration data.
  • Validate against float baselines -- Compare quantized output against f32 GEMM to verify acceptable accuracy loss.
  • Use per-channel quantization -- Per-channel scales improve accuracy over per-tensor at minimal performance cost.
  • Reorder quantized weights -- Pre-reorder weights for repeated inference calls using aocl_reorder_* functions.
  • Choose output type wisely -- Writing quantized output (os8, ou8) avoids a separate requantization pass.
