Abhiram S edited this page Sep 23, 2025 · 1 revision

JIT - Just-In-Time Code Generation (Developer Guide)

AOCL-DLP uses just-in-time (JIT) compilation to generate code at runtime that is optimized for specific matrix sizes and data types. This lets the library produce implementations tailored to the exact parameters of the GEMM operations being performed.

How JIT Works

  1. Kernel Generation: When a specific operation is requested, AOCL-DLP analyzes the parameters and generates a tailored kernel optimized for those parameters.
  2. Caching: Once a kernel is generated, it can be cached for future use, reducing the overhead of recompilation.
  3. Dynamic Optimization: The JIT compiler can apply various optimization techniques based on the current execution context, such as loop unrolling, vectorization, and more.
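The generate-then-cache pattern described above can be sketched as a lookup table keyed by the kernel parameters. This is a minimal illustration, not AOCL-DLP's implementation: `KernelKey`, `KernelCache`, and `get_or_generate` are hypothetical names, and an ordinary C++ callable stands in for the generated machine code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>

// Hypothetical parameter set that identifies one generated kernel.
struct KernelKey {
    int mr;        // rows of the register tile
    int nr;        // columns of the register tile
    int dtype_id;  // illustrative data-type tag
    bool operator<(const KernelKey& o) const {
        return std::tie(mr, nr, dtype_id) < std::tie(o.mr, o.nr, o.dtype_id);
    }
};

// Stand-in for generated code: a callable specialized for one key.
using Kernel = std::function<int(int)>;

class KernelCache {
public:
    // Returns the kernel for `key`, generating it only on the first request.
    const Kernel& get_or_generate(const KernelKey& key) {
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++generated_;
            it = cache_.emplace(key, generate(key)).first;
        }
        return it->second;
    }

    // Number of times generation actually ran (cache misses).
    std::size_t generated_count() const { return generated_; }

private:
    // Placeholder for the expensive step. The returned callable computes the
    // multiply-accumulate count of an mr x nr tile over depth k.
    static Kernel generate(const KernelKey& key) {
        assert(key.mr > 0 && key.nr > 0);
        const int tile = key.mr * key.nr;
        return [tile](int k) { return tile * k; };
    }

    std::map<KernelKey, Kernel> cache_;
    std::size_t generated_ = 0;
};
```

Requesting `{16, 64, 0}` twice runs `generate` once; a different key such as `{6, 64, 0}` triggers a second generation. A production cache would also need to consider thread safety and memory limits, which this sketch ignores.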

Benefits of JIT Compilation

  • Performance: Code generated for the exact matrix sizes and data types in use can outperform generic precompiled kernels, especially for non-standard configurations.
  • Flexibility: JIT supports a wide range of hardware and software configurations without precompiling a kernel for every combination.
  • Reduced Latency: For workloads with varying parameters, cached kernels are reused, so the generation cost is paid once per unique configuration rather than on every call.

Xbyak JIT Assembler

AOCL-DLP leverages the Xbyak JIT assembler to generate optimized assembly code on the fly. Xbyak provides a high-level C++ interface for writing JIT-compiled code, allowing developers to focus on the algorithm rather than the intricacies of assembly language.

How JIT Works in AOCL-DLP

  1. Parameter Specification: When a GEMM operation is requested, the user specifies the matrix dimensions, data types, and any additional parameters (e.g., post-operations).
  2. Code Generation: AOCL-DLP uses Xbyak to generate assembly code optimized for the specified parameters. This code is tailored to leverage the specific capabilities of the underlying hardware (e.g., AVX2, AVX512).
  3. Compilation: Xbyak encodes the emitted instructions directly into machine code in an executable memory buffer at runtime; no separate compiler or assembler pass is involved.
  4. Execution: The compiled code is executed to perform the GEMM operation, providing high performance for the specific use case.
  5. Caching: To avoid the overhead of regenerating code for the same parameters, AOCL-DLP caches the generated code for reuse in future operations with identical parameters.
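Xbyak hides the low-level details, but steps 2 to 4 come down to writing instruction bytes into memory, marking that memory executable, and calling it through a function pointer. The sketch below shows that mechanism without Xbyak. It is an illustration under stated assumptions (x86-64 Linux, a hypothetical helper named `emit_and_run_constant`), not code from AOCL-DLP, and it emits a trivial two-instruction function rather than a GEMM kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && defined(__linux__)
#include <sys/mman.h>
#define JIT_SKETCH_SUPPORTED 1
#else
#define JIT_SKETCH_SUPPORTED 0
#endif

// Hypothetical helper: emits `mov eax, imm32; ret`, runs it, and stores the
// returned value in `result`. Returns false if the platform is unsupported
// or executable memory cannot be obtained.
bool emit_and_run_constant(std::int32_t value, int& result) {
#if JIT_SKETCH_SUPPORTED
    // Code generation: B8 imm32 is `mov eax, imm32`, C3 is `ret`.
    unsigned char code[6] = {0xB8, 0, 0, 0, 0, 0xC3};
    std::memcpy(code + 1, &value, sizeof value);  // little-endian immediate

    const std::size_t size = 4096;
    assert(sizeof code <= size);

    // Allocate a writable buffer and copy the machine code into it.
    void* buf = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return false;
    std::memcpy(buf, code, sizeof code);

    // Switch the buffer from writable to executable before calling it.
    if (mprotect(buf, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, size);
        return false;
    }

    // Execution: call the buffer as an ordinary function.
    using Fn = int (*)();
    Fn fn = reinterpret_cast<Fn>(buf);
    result = fn();

    munmap(buf, size);
    return true;
#else
    (void)value;
    (void)result;
    return false;
#endif
}
```

On a supported platform, `emit_and_run_constant(42, r)` returns true and leaves 42 in `r`. A real generator keeps the buffer alive and caches the function pointer (step 5) instead of unmapping it after one call.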

How to dump JIT generated code

To dump the JIT-generated code for inspection or debugging, define the DLP_DUMP_JIT_CODE macro at build time using one of the following methods:

Method 1: Build flag

cmake -DCMAKE_CXX_FLAGS="-DDLP_DUMP_JIT_CODE" ...

Method 2: Source modification

Add #define DLP_DUMP_JIT_CODE at the top of src/jit/amdzen/amdzen_generator.cc before building.

Output files

Dumped files are created in the current working directory with names like:

  • jit_kernel_16x64.bin (GEMM kernel for MR=16, NR=64)
  • jit_gemv_n1_kernel_16x5.bin (GEMV N=1, MR=16, config index 5)
  • jit_gemv_m1_kernel_32x2.bin (GEMV M=1, NR=32, config index 2)

To disassemble:

objdump -D -b binary -m i386:x86-64 jit_kernel_16x64.bin
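To make the dump format concrete, the sketch below writes a raw code buffer to a flat binary file named after the GEMM pattern listed above. The helpers `gemm_dump_name` and `dump_code` are hypothetical and only mirror the file naming shown in this section; how AOCL-DLP writes its dumps internally is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: builds a name of the form jit_kernel_<MR>x<NR>.bin.
std::string gemm_dump_name(int mr, int nr) {
    return "jit_kernel_" + std::to_string(mr) + "x" + std::to_string(nr) + ".bin";
}

// Hypothetical helper: writes the raw machine-code bytes to `path` with no
// header or metadata. Returns true if every byte was written.
bool dump_code(const std::string& path, const std::vector<unsigned char>& code) {
    assert(!code.empty());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(code.data()),
              static_cast<std::streamsize>(code.size()));
    out.flush();
    return out.good();
}
```

Because the file contains only instruction bytes, objdump needs the `-b binary` and `-m i386:x86-64` options to know how to decode it. For example, dumping the six bytes `B8 2A 00 00 00 C3` and disassembling the result shows a `mov` of 0x2a into `eax` followed by `ret`.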
