# Quantization Guide
AOCL-DLP supports quantized GEMM operations for efficient inference workloads. This guide covers symmetric quantization, mixed-precision workflows, and how to configure scale factors and zero-points.
Quantization maps floating-point values to lower-precision integers for faster computation and smaller memory footprint.

Symmetric quantization centers the quantized range around zero:

```
q = round(x * scale)
x = q / scale
```

Asymmetric quantization uses a zero-point offset:

```
q = round(x * scale) + zero_point
x = (q - zero_point) / scale
```
For workloads where both inputs are already quantized:
| Input A | Input B | Accumulator | Outputs | Function pattern |
|---|---|---|---|---|
| u8 | s8 | s32 | s32, s8, u8, f32, bf16 | aocl_gemm_u8s8s32o* |
| s8 | s8 | s32 | s32, s8, u8, f32, bf16 | aocl_gemm_s8s8s32o* |
These variants compute the GEMM in integer arithmetic (using AVX512_VNNI when available) and can output in various types. Use SCALE and BIAS post-ops for dequantization.
```c
#include <aocl_dlp.h>

// Quantized activations (uint8) and weights (int8)
uint8_t activations[M * K] = { /* quantized input */ };
int8_t weights[K * N] = { /* quantized weights */ };
int32_t output[M * N] = {0};

// Basic quantized GEMM (no post-ops, raw int32 accumulation)
aocl_gemm_u8s8s32os32(
    'R', 'N', 'N', m, n, k,
    1,                       // alpha (int32)
    activations, lda, 'N',
    weights, ldb, 'N',
    0,                       // beta (int32)
    output, ldc, NULL);
```

To get float output from integer GEMM with dequantization:
```c
// Scale factors for dequantization (one per output channel)
float scale_vals[N] = { /* calibrated scales */ };
dlp_sf_t sf = {
    .scale_factor = scale_vals,
    .scale_factor_len = n,
    .scale_factor_type = DLP_F32
};
dlp_scale_t scale_op = { .sf = &sf, .zp = NULL };

// Bias (applied after scaling)
float bias_vals[N] = { /* bias per channel */ };
dlp_post_op_bias bias_op = {
    .bias = bias_vals, .stor_type = DLP_F32, .sf = NULL, .zp = NULL
};

// Chain: SCALE then BIAS
DLP_POST_OP_TYPE seq[] = { SCALE, BIAS };
dlp_metadata_t meta = {0};
meta.seq_length = 2;
meta.seq_vector = seq;
meta.scale = &scale_op;
meta.bias = &bias_op;

float output_f32[M * N];
aocl_gemm_u8s8s32of32(
    'R', 'N', 'N', m, n, k,
    1, activations, lda, 'N',
    weights, ldb, 'N',
    0, output_f32, ldc, &meta);
// output_f32 = bias + scale * (activations * weights)
```

AOCL-DLP provides specialized symmetric quantization variants that handle grouped quantization natively:
- `aocl_gemm_s8s8s32of32_sym_quant`
- `aocl_gemm_s8s8s32obf16_sym_quant`
The reorder functions (aocl_get_reorder_buf_size_s8s8s32os32_sym_quant and aocl_reorder_s8s8s32os32_sym_quant) accept a DLP_SYMM_STAT_QUANT* parameter to pack quantization group metadata alongside the reordered matrix. The GEMM call itself uses the standard signature with dlp_metadata_t* as the last parameter.
```c
// Symmetric quantization config
DLP_SYMM_STAT_QUANT symq = {
    .group_size = 128   // quantization group size (e.g., 128 elements per group)
};

// Reorder weights with symmetric quantization metadata
msz_t buf_size = aocl_get_reorder_buf_size_s8s8s32os32_sym_quant(
    'R', 'N', 'B', k, n, &symq, NULL);
int8_t *b_reordered = (int8_t *)malloc(buf_size);
aocl_reorder_s8s8s32os32_sym_quant(
    'R', 'N', 'B', weights, b_reordered, k, n, ldb, &symq, NULL);

// Compute with symmetric quantization
float output_f32[M * N];
aocl_gemm_s8s8s32of32_sym_quant(
    'R', 'N', 'N', m, n, k,
    1, activations_s8, lda, 'N',
    b_reordered, ldb, 'R',
    0, output_f32, ldc, NULL);
```

For workloads where activations are in higher precision and weights are quantized:
| Input A (activations) | Input B (weights) | Accumulator | Outputs | Use case |
|---|---|---|---|---|
| bf16 | s8 | s32 | s32, f32, bf16, s8, u8 | BF16 activations with int8 weights |
| bf16 | s4 | f32 | f32, bf16 | BF16 activations with 4-bit weights |
| bf16 | u4 | f32 | f32, bf16 | BF16 activations with unsigned 4-bit weights |
| f32 | s8 | s32 | s32, f32, bf16, s8, u8 | F32 activations with int8 weights |
These variants handle on-the-fly quantization of the higher-precision input internally.
For advanced quantization workflows, dlp_metadata_t supports pre- and post-quantization operations via the dlp_quant_op struct:
```c
typedef struct {
    md_t group_size;      // elements per quantization group
    DLP_TYPE src_type;    // source type (e.g., DLP_BF16)
    DLP_TYPE dst_type;    // destination type (e.g., DLP_S8)
    dlp_sf_t* scl;        // scale factors
    dlp_zp_t* zp;         // zero-points (NULL for symmetric)
    bool symmetric;       // true = symmetric, false = asymmetric
} dlp_quant_op;
```

These can be attached to `dlp_metadata_t` as:
- `a_pre_quant` / `b_pre_quant` -- quantize inputs before GEMM
- `a_post_quant` / `b_post_quant` -- quantize after GEMM
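For example, a pre-quantization op that converts BF16 activations to int8 before the GEMM might be configured as follows. This is a sketch using the fields described above; `a_scales` and `num_groups` are assumed to come from calibration:

```c
// Quantize BF16 activations to int8 on the fly before the GEMM
// (values are illustrative; a_scales/num_groups come from calibration)
dlp_sf_t a_sf = {
    .scale_factor = a_scales,
    .scale_factor_len = num_groups,
    .scale_factor_type = DLP_F32
};
dlp_quant_op a_quant = {
    .group_size = 128,
    .src_type = DLP_BF16,
    .dst_type = DLP_S8,
    .scl = &a_sf,
    .zp = NULL,          // symmetric: no zero-point
    .symmetric = true
};
dlp_metadata_t meta = {0};
meta.a_pre_quant = &a_quant;
```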
- Calibrate scales carefully -- Scale factors significantly impact accuracy. Use representative calibration data.
- Validate against float baselines -- Compare quantized output against f32 GEMM to verify acceptable accuracy loss.
- Use per-channel quantization -- It offers better accuracy than per-tensor quantization at minimal performance cost.
- Reorder quantized weights -- Pre-reorder weights for repeated inference calls using the `aocl_reorder_*` functions.
- Choose output type wisely -- Writing quantized output (`os8`, `ou8`) avoids a separate requantization pass.
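A simple way to run such a baseline validation is to compare the quantized path elementwise against the f32 reference and check the worst-case error. The helper below is illustrative, not an AOCL-DLP API; the acceptable threshold depends on the model and calibration data:

```c
#include <math.h>
#include <stddef.h>

// Largest absolute elementwise difference between an f32 reference
// output and the dequantized output of the quantized path.
static float max_abs_error(const float *ref, const float *quantized,
                           size_t len) {
    float max_err = 0.0f;
    for (size_t i = 0; i < len; i++) {
        float err = fabsf(ref[i] - quantized[i]);
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```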
- GEMM Guide -- All GEMM variants and parameter details
- Post-Ops Guide -- SCALE and BIAS post-ops for dequantization
- Examples -- `quantization.c`, `simple_gemm_s8.c`
- API Reference -- Generated API docs