JIT
AOCL-DLP uses Just In Time (JIT) compilation to generate optimized code for specific matrix sizes and data types at runtime. This approach allows the library to produce highly efficient implementations tailored to the exact parameters of the GEMM operations being performed.
- Kernel Generation: When a specific operation is requested, AOCL-DLP analyzes the parameters and generates a tailored kernel optimized for those parameters.
- Caching: Once a kernel is generated, it can be cached for future use, reducing the overhead of recompilation.
- Dynamic Optimization: The JIT compiler can apply optimization techniques such as loop unrolling and vectorization, chosen according to the current execution context.
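The generate-once, reuse-later behavior described above can be sketched as a small lookup table. This is an illustration only, not AOCL-DLP's implementation: `KernelKey`, `KernelCache`, and `get_or_generate` are hypothetical names, and a `std::function` stands in for a pointer to generated machine code.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>

// Hypothetical key: the parameters that decide which kernel gets generated.
struct KernelKey {
    int mr;
    int nr;
    int dtype;  // placeholder tag for the data type
    bool operator<(const KernelKey& o) const {
        return std::tie(mr, nr, dtype) < std::tie(o.mr, o.nr, o.dtype);
    }
};

// Stand-in for a JIT-generated kernel (a real one would be machine code).
using Kernel = std::function<int(int)>;

class KernelCache {
public:
    // Returns the cached kernel for `key`, generating it on first use.
    const Kernel& get_or_generate(const KernelKey& key) {
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++generated_;  // counts how often the expensive path runs
            it = cache_.emplace(key, generate(key)).first;
        }
        return it->second;
    }

    std::size_t generated() const { return generated_; }

private:
    // Placeholder "code generation": specializes a closure on the tile size.
    static Kernel generate(const KernelKey& key) {
        const int tile = key.mr * key.nr;
        return [tile](int x) { return x * tile; };
    }

    std::map<KernelKey, Kernel> cache_;
    std::size_t generated_ = 0;
};
```

Requesting the same key twice runs `generate` only once. The sketch is not thread-safe; a cache shared across threads would need locking around the lookup and insert.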
- Performance: By generating optimized code on the fly, JIT compilation can significantly improve the performance of operations, especially for non-standard configurations.
- Flexibility: JIT allows for greater flexibility in supporting a wide range of hardware and software configurations without the need for extensive pre-compilation.
- Reduced Latency: For workloads with varying parameters, kernel caching keeps latency low: code is generated once per unique configuration and reused whenever that configuration recurs.
AOCL-DLP leverages the Xbyak JIT assembler to generate optimized assembly code on the fly. Xbyak provides a high-level C++ interface for writing JIT-compiled code, allowing developers to focus on the algorithm rather than the intricacies of assembly language.
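To make "generating code on the fly" concrete, the sketch below emits two hand-encoded x86-64 instructions into a memory buffer and calls them. It deliberately does not use Xbyak, which takes care of the instruction encoding and buffer handling that are done by hand here. It assumes an x86-64 POSIX system (for example Linux); `run_generated_code` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Emits "mov eax, 42; ret" into a fresh buffer, runs it, and returns the
// value the generated code produced. Returns -1 if the platform is not
// x86-64 or executable memory could not be obtained.
int run_generated_code() {
#if defined(__x86_64__)
    static const unsigned char code[] = {
        0xB8, 0x2A, 0x00, 0x00, 0x00,  // mov eax, 42
        0xC3                           // ret
    };

    // Allocate writable memory, copy the bytes in, then flip it to executable.
    void* mem = mmap(nullptr, sizeof(code), PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    std::memcpy(mem, code, sizeof(code));
    if (mprotect(mem, sizeof(code), PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, sizeof(code));
        return -1;
    }

    // Treat the buffer as a function taking no arguments and returning int.
    using Fn = int (*)();
    Fn fn = reinterpret_cast<Fn>(mem);
    const int result = fn();

    munmap(mem, sizeof(code));
    return result;
#else
    return -1;
#endif
}
```

On a system that permits executable mappings, the call returns 42. A JIT assembler replaces the byte array with instruction-level calls, so kernels can be built from loops and conditions in ordinary C++.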
- Parameter Specification: When a GEMM operation is requested, the user specifies the matrix dimensions, data types, and any additional parameters (e.g., post-operations).
- Code Generation: AOCL-DLP uses Xbyak to generate assembly code optimized for the specified parameters. This code is tailored to leverage the specific capabilities of the underlying hardware (e.g., AVX2, AVX512).
- Compilation: Xbyak assembles the generated instructions directly into machine code in an in-memory buffer at runtime.
- Execution: The compiled code is executed to perform the GEMM operation, providing high performance for the specific use case.
- Caching: To avoid the overhead of regenerating code for the same parameters, AOCL-DLP caches the generated code for reuse in future operations with identical parameters.
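The code generation step above targets hardware capabilities such as AVX2 or AVX512. One way to find out which of these the host CPU reports is the GCC/Clang CPU-feature builtins, sketched below. This is an assumption for illustration: the source does not describe how AOCL-DLP detects the instruction set, and `select_isa` is a hypothetical helper.

```cpp
#include <string>

// Returns the widest vector ISA level the host CPU reports, using the
// GCC/Clang x86 builtins. Falls back to "generic" on other compilers or
// architectures.
std::string select_isa() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // populate the CPU feature data before querying
    if (__builtin_cpu_supports("avx512f")) return "avx512";
    if (__builtin_cpu_supports("avx2")) return "avx2";
#endif
    return "generic";
}
```

A generator could use a result like this to decide between 256-bit and 512-bit register code paths. The result also makes a natural part of a kernel cache key if kernels for several ISA levels can coexist.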
To dump the JIT generated code for inspection or debugging purposes:
Method 1: Build flag
cmake -DCMAKE_CXX_FLAGS="-DDLP_DUMP_JIT_CODE" ...
Method 2: Source modification
Add #define DLP_DUMP_JIT_CODE at the top of src/jit/amdzen/amdzen_generator.cc before building.
Dumped files are created in the current working directory with names like:
- jit_kernel_16x64.bin (GEMM kernel for MR=16, NR=64)
- jit_gemv_n1_kernel_16x5.bin (GEMV N=1, MR=16, config index 5)
- jit_gemv_m1_kernel_32x2.bin (GEMV M=1, NR=32, config index 2)
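When many kernels are dumped, decoding the file names programmatically can help sort them. The sketch below parses only the three name patterns listed above; other patterns may exist and are not covered. `DumpKind`, `DumpInfo`, and `parse_dump_name` are hypothetical names and not part of AOCL-DLP.

```cpp
#include <cstdio>
#include <string>

enum class DumpKind { Gemm, GemvN1, GemvM1 };

struct DumpInfo {
    DumpKind kind;
    int first;   // MR for Gemm and GemvN1, NR for GemvM1
    int second;  // NR for Gemm, config index for the GEMV variants
};

// Parses a dump file name such as "jit_kernel_16x64.bin".
// Returns true and fills `out` on a full match, false otherwise.
bool parse_dump_name(const std::string& name, DumpInfo& out) {
    struct Pattern {
        const char* fmt;
        DumpKind kind;
    };
    // %n records how many characters were consumed, which lets us verify
    // that the ".bin" suffix matched and nothing trails after it.
    static const Pattern patterns[] = {
        {"jit_kernel_%dx%d.bin%n", DumpKind::Gemm},
        {"jit_gemv_n1_kernel_%dx%d.bin%n", DumpKind::GemvN1},
        {"jit_gemv_m1_kernel_%dx%d.bin%n", DumpKind::GemvM1},
    };
    for (const Pattern& p : patterns) {
        int a = 0, b = 0, consumed = 0;
        if (std::sscanf(name.c_str(), p.fmt, &a, &b, &consumed) == 2 &&
            consumed == static_cast<int>(name.size())) {
            out = DumpInfo{p.kind, a, b};
            return true;
        }
    }
    return false;
}
```

For `jit_gemv_m1_kernel_32x2.bin` the function reports the GEMV M=1 kind with `first` set to 32 (NR) and `second` set to 2 (config index).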
To disassemble:
objdump -D -b binary -m i386:x86-64 jit_kernel_16x64.bin
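The dumps are raw machine code with no ELF header, which is why objdump is given the `-b binary` and `-m` options. To try the command before building the library with dumping enabled, a few known bytes can be written to a file by hand. This is a minimal sketch: `write_sample_dump` and the file name `sample_kernel.bin` are hypothetical, and the bytes are a hand-encoded example, not an AOCL-DLP kernel.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Writes a tiny headerless code blob to `path`:
//   C5 FC 58 C1   vaddps ymm0, ymm0, ymm1
//   C3            ret
// Returns true if every byte was written.
bool write_sample_dump(const std::string& path) {
    static const unsigned char code[] = {0xC5, 0xFC, 0x58, 0xC1, 0xC3};
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(code), sizeof(code));
    return static_cast<bool>(out.flush());
}
```

After calling `write_sample_dump("sample_kernel.bin")`, running the objdump command above on `sample_kernel.bin` should list a `vaddps` on ymm registers followed by `ret`. objdump prints AT&T syntax by default, so the operand order is reversed relative to the comments.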