Some tensors of a machine learning model are constant during inference, such as the weights of convolution filters. There are two types of constant tensors:
- Type 1: they are available at compile time. They can appear as literal values within the model, such as arith.constant operations in MLIR, or as arguments of the MLIR module entry function that are explicitly marked as compile-time available constants. Constants in OpenVINO belong to this type. The IR of an OpenVINO model consists of its topology and constant values, like weights, in memory. When transforming OpenVINO IR to MLIR, the constants can be lowered into arith.constant operations or arguments of the MLIR module entry function. Since the concrete values of the constants are available at compile time in the OpenVINO case, it is possible to fold them at compile time.
- Type 2: they are only available at runtime. Constants in oneDNN Graph belong to this type. According to the oneDNN Graph specification, ahead-of-time compilation is not supported and kernel compilation happens with logical tensors instead of real tensors. The literal values of these constant tensors only become available at runtime.
Within the IR, there are operations that take the constant tensors as parameters and process them, such as reordering or packing. Outputs of such operations are also constant. However, these operations run every time the kernel is executed, which causes redundant memory and computation overhead. This pass modifies the IR so that these operations only run once. For Type 1 constants, the compiler can choose to run these operations once at compile time or at runtime. For Type 2 constants, these operations only run once at runtime, more specifically, during the first execution.
There is currently no similar pass in the MLIR community.
A related pass is constant folding, which processes explicit constants at compile time. However, in machine learning the tensors are usually high-dimensional and the operations that process the constant tensors are complex and require compiled kernels, so traditional constant folding cannot handle them well.
Our pass can be thought of as enhanced constant folding: it makes the constant tensors be processed only once, and the processed tensors are cached into buffers for reuse in later executions.
There are already some similar transformations in OpenVINO. Each Graph has a GraphContext member. A GraphContext holds a WeightsSharing, which is basically a std::unordered_map<std::string, MemoryInfo::Ptr> that stores the memory of cached tensors. In the compile stage, the operations (for example, type-casting ops) that follow the constant Input operations (weights, bias or others) are executed and the results are cached in the unordered_map of the GraphContext.
For each FullyConnected (FC for short) operation with a DNNL primitive implementation, there is a DnnlFCExecutor, which has an attribute of type ExecutorContext. The ExecutorContext holds an unordered_map<string, MemoryPtr> to store the memory of its private cached weights. When the FC has dynamic-shape inputs, which is the case for llama2, nothing is done with the weights in the compile stage. Actually, there is no explicit Reorder operation in the graph after the constant Input operation that holds the weight of an FC. In the first execution, all the input shapes are defined, so the DnnlFCExecutor is constructed. During the construction, the weight is packed into the blocking format that the DNNL primitive requires, and the memory is stored in the unordered_map of the ExecutorContext. In later executions, the packed weight can be used directly. When the FC has static shapes, the DnnlFCExecutor is constructed in the compile stage, so the above packing and caching process can be done at compile time. All the executions directly use the cached weight.
We cannot utilize the work in OpenVINO because
- it happens far later than the transformation that replaces subgraphs with MLIR-world operations;
- it is deeply coupled with DNNL-related data structures.
There are two steps to complete the pass: an analysis step and a transform step. These two steps will be implemented as Passes in the MLIR world.
The analysis step mainly identifies the operations that take constant tensors as inputs and produce constant tensors as outputs. They will be marked as interested ops.
The main work of the analysis step will be implemented as a DataFlow Analysis pass. In the MLIR module's entry function, the constant tensors appear as results of arith.constant operations or as arguments of the entry function marked with constant attributes. The constantness propagates from these tensors to the output tensors of the operations that process them. Eventually, these operations form a subgraph, which is named the 'constant subgraph'. Another subgraph, which contains the non-constant operations, consumes the outputs of the constant subgraph and the non-constant parameters of the graph. Because the constantness information is carried by tensors, the analysis step works at the linalg-on-tensor level.
The interested ops are most likely reorder, pack or broadcast ops, so the analysis step should run after the layout propagation pass. A minimal sketch of the constantness propagation is shown below.
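The following C++ sketch only illustrates the idea of the propagation; it assumes the entry function carries the compiletime_const_args_index / runtime_const_args_index attributes shown in the example below, and the helper name collectInterestedOps is hypothetical. A real implementation would likely build on MLIR's dataflow analysis framework and would also distinguish compile-time from runtime constants.

// A minimal sketch of the constantness propagation (not the actual pass).
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "llvm/ADT/DenseSet.h"

using namespace mlir;

static llvm::DenseSet<Operation *> collectInterestedOps(func::FuncOp entry) {
  llvm::DenseSet<Value> constantValues;
  // Seed the analysis with the entry arguments marked as constant by the frontend.
  for (StringRef attrName :
       {"compiletime_const_args_index", "runtime_const_args_index"})
    if (auto indices = entry->getAttrOfType<ArrayAttr>(attrName))
      for (Attribute idx : indices)
        constantValues.insert(entry.getArgument(cast<IntegerAttr>(idx).getInt()));

  llvm::DenseSet<Operation *> interestedOps;
  // Forward walk in SSA order: an op whose operands are all constant produces constants.
  entry.walk([&](Operation *op) {
    if (isa<func::ReturnOp>(op))
      return;
    // Ops without operands (arith.constant, init tensors such as tensor.empty)
    // produce values that do not block folding.
    if (op->getNumOperands() == 0) {
      for (Value result : op->getResults())
        constantValues.insert(result);
      return;
    }
    if (llvm::all_of(op->getOperands(),
                     [&](Value v) { return constantValues.contains(v); })) {
      interestedOps.insert(op); // candidate for the constant subgraph
      for (Value result : op->getResults())
        constantValues.insert(result);
    }
  });
  return interestedOps;
}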
Take the following IR as an example (to keep the IR short and easy to understand, only the important information is shown):
module {
  // %weight0 and %weight1 are Type 1 constants. %weight2 is a Type 2 constant.
  entry(%feature0: tensor<*xbf16>, %weight1: tensor<*xbf16>, %weight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> attributes {compiletime_const_args_index = [1 : i32], runtime_const_args_index = [2 : i32]} {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %packedWeight0 = tensor.pack(%weight0, ...)
    %feature1 = linalg.matmul(%feature0, %packedWeight0)
    %packedWeight1 = tensor.pack(%weight1, ...)
    %feature2 = linalg.matmul(%feature1, %packedWeight1)
    %packedWeight2 = tensor.pack(%weight2, ...)
    %feature3 = linalg.matmul(%feature2, %packedWeight2)
    return %feature3
  }
}
After the transformation, there will be three functions in the module: one for compile-time folding, one for runtime folding, and one as the new entry. The compile-time folding function contains the operations that consume and produce Type 1 constants. The runtime folding function contains the operations that consume and produce Type 2 constants. The new entry function takes all folded tensors as inputs. The expected output IR will look like:
module {
  entry(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> {
    %feature1 = linalg.matmul(%feature0, %foldedWeight0)
    %feature2 = linalg.matmul(%feature1, %foldedWeight1)
    %feature3 = linalg.matmul(%feature2, %foldedWeight2)
    return %feature3
  }
  compiletime_fold(%weight1: tensor<*xbf16>) -> %foldedWeight0, %foldedWeight1: tensor<*xbf16>, tensor<*xbf16> {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %foldedWeight0 = tensor.pack(%weight0, ...)
    %foldedWeight1 = tensor.pack(%weight1, ...)
    return %foldedWeight0, %foldedWeight1
  }
  runtime_fold(%weight2: tensor<*xbf16>) -> %foldedWeight2: tensor<*xbf16> {
    %foldedWeight2 = tensor.pack(%weight2, ...)
    return %foldedWeight2
  }
}
However, this requires the compiletime_fold function to be called at compile time, which makes the compilation pipeline more complex. So we also provide a simplified version, which does all the folding at runtime. In this case, the output IR will look like:
module {
  entry(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> {
    %feature1 = linalg.matmul(%feature0, %foldedWeight0)
    %feature2 = linalg.matmul(%feature1, %foldedWeight1)
    %feature3 = linalg.matmul(%feature2, %foldedWeight2)
    return %feature3
  }
  runtime_fold(%weight1: tensor<*xbf16>, %weight2: tensor<*xbf16>)
      -> %foldedWeight0, %foldedWeight1, %foldedWeight2: tensor<*xbf16>, tensor<*xbf16>, tensor<*xbf16> {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %foldedWeight0 = tensor.pack(%weight0, ...)
    %foldedWeight1 = tensor.pack(%weight1, ...)
    %foldedWeight2 = tensor.pack(%weight2, ...)
    return %foldedWeight0, %foldedWeight1, %foldedWeight2
  }
}
The simplified version is adopted as the default choice.
We place this transformation at linalg-on-tensor level, right after the analysis step.
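To make the ordering concrete, here is a hedged sketch of where the two passes could sit in a pass pipeline; the create*Pass factory names (createLayoutPropagationPass, createConstantSubgraphAnalysisPass, createConstantSubgraphTransformPass) are hypothetical placeholders rather than existing APIs.

// Hedged sketch of the pipeline ordering; the factory names are hypothetical.
mlir::PassManager pm(&context);
pm.addPass(createLayoutPropagationPass());         // produces pack/reorder ops on constants
pm.addPass(createConstantSubgraphAnalysisPass());  // analysis step: marks interested ops
pm.addPass(createConstantSubgraphTransformPass()); // transform step: splits fold/entry functions
// Bufferization and further lowering run after the split, so both the folding
// function(s) and the new entry function end up in the final executable.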
This part is designed for integration with OpenVINO. For other frontends (like benchgc or oneDNN Graph), the details may differ.
Later, after the module is compiled to an executable, the folding function will be executed to generate the folded tensors, which need to be cached into buffers for future use. These buffers will be managed by a runtime context, if there is one, or by the MLIROps.
An example implementation is a map that stores pairs of a global index and an allocated buffer:
struct CachedBuffer {
void* buffer;
std::vector<int64_t> shape;
std::vector<int64_t> strides;
};
class OPENVINO_API MLIROp {
    ...
    // Maps a global buffer index to the cached folded tensor.
    std::unordered_map<int64_t, CachedBuffer> cached_const_buffers;
    int executionCount = 0;
};
When a buffer is allocated for a folded tensor, an index is assigned to the buffer, and the map stores these pairs. This map can either be shared by all MLIROps, with each MLIROp holding the indexes of the buffers it uses, or each MLIROp can hold its own map. In the current implementation, each MLIROp holds its own map. A sketch of registering such a buffer is shown below.
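As an illustration of the paragraph above, a hypothetical helper that allocates and registers one folded buffer might look as follows; allocateAligned and registerCachedBuffer are illustrative names and not part of the actual OpenVINO API.

// Hypothetical sketch: allocate the backing memory for one folded tensor and
// record it under its global index.
void ov::MLIROp::registerCachedBuffer(int64_t globalIndex,
                                      const std::vector<int64_t>& shape,
                                      const std::vector<int64_t>& strides,
                                      size_t byteSize) {
    CachedBuffer buffer;
    buffer.buffer = allocateAligned(byteSize); // filled later by the runtime folding function
    buffer.shape = shape;
    buffer.strides = strides;
    // Each MLIROp holds its own index -> buffer map in the current implementation.
    cached_const_buffers[globalIndex] = buffer;
}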
During the execution of the folding function, the buffers are filled with the folded values. In the first execution, both the runtime folding function and the entry function are executed. In later executions, only the entry function is executed.
void ov::MLIROp::execute(InputTensors& inputs, OutputTensors& outputs) {
    // Buffers that hold the folded constants, owned by this MLIROp.
    std::vector<void *> cachedBuffers = ...;
    if (executionCount == 0) {
        // First execution: run the runtime folding function once to fill
        // the cached buffers with the folded values.
        std::vector<void *> constantInputs = ...;
        runtimeFold(constantInputs, cachedBuffers);
    }
    // Every execution: run the entry function with the non-constant inputs
    // and the cached folded constants.
    std::vector<void *> nonConstantInputs = ...;
    entry(nonConstantInputs, cachedBuffers, outputs);
    executionCount += 1;
    ...
}
There is another optimization during the transform. Some operations, such as Broadcast, expand the tensor's size dramatically. Folding these operations is not profitable. However, leaving them in place also prevents their children operations from being folded. If we swap the order of the size-expanding op and its children ops, the children ops can be folded. Take the following IR as an example:
%15 = tensor.empty() : tensor<8x32xbf16>
%packed_arg2 = tensor.pack %arg2 outer_dims_perm = [0] inner_dims_pos = [0] inner_tiles = [32] into %15 : tensor<256xbf16> -> tensor<8x32xbf16>
%bc_arg2_init = tensor.empty() : tensor<2x8x32x32xbf16>
%bc_arg2 = linalg.broadcast ins(%packed_arg2 : tensor<8x32xbf16>) outs(%bc_arg2_init : tensor<2x8x32x32xbf16>) dimensions = [0, 2]
%extf32 = arith.extf %bc_arg2 : tensor<2x8x32x32xbf16> to tensor<2x8x32x32xf32>
%cst_2 = arith.constant 2.000000e+00 : f32
%extf32_mul2_init = tensor.empty() : tensor<2x8x32x32xf32>
%extf32_mul2 = linalg.generic {indexing_maps = [#map4, #map4], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%extf32 : tensor<2x8x32x32xf32>) outs(%extf32_mul2_init : tensor<2x8x32x32xf32>) {
^bb0(%in: f32, %out: f32):
%8 = arith.mulf %in, %cst_2 : f32
linalg.yield %8 : f32
} -> tensor<2x8x32x32xf32>
%truncbf16 = arith.truncf %extf32_mul2 : tensor<2x8x32x32xf32> to tensor<2x8x32x32xbf16>
%arg2 is processed sequentially by pack, broadcast, extf, mulf and truncf. The broadcast stops the folding on extf, mulf and truncf. The pass then moves the broadcast after truncf and transforms the IR to:
%2 = tensor.empty() : tensor<8x32xbf16>
%pack_1 = tensor.pack %arg2 outer_dims_perm = [0] inner_dims_pos = [0] inner_tiles = [32] into %2 : tensor<256xbf16> -> tensor<8x32xbf16>
%3 = arith.extf %pack_1 : tensor<8x32xbf16> to tensor<8x32xf32>
%4 = tensor.empty() : tensor<8x32xf32>
%5 = linalg.generic {indexing_maps = [#map5, #map5], iterator_types = ["parallel", "parallel"]} ins(%3 : tensor<8x32xf32>) outs(%4 : tensor<8x32xf32>) {
^bb0(%in: f32, %out: f32):
%10 = arith.mulf %in, %cst : f32
linalg.yield %10 : f32
} -> tensor<8x32xf32>
%6 = arith.truncf %5 : tensor<8x32xf32> to tensor<8x32xbf16>
%7 = tensor.empty() : tensor<2x8x32x32xbf16>
%broadcasted = linalg.broadcast ins(%6 : tensor<8x32xbf16>) outs(%7 : tensor<2x8x32x32xbf16>) dimensions = [0, 2]
Then the extf, mulf and truncf can be folded, while the broadcast itself remains unfolded.
Strict constraints have to be applied to this optimization to ensure semantic correctness. The children ops should be element-wise operations from the linalg, arith or math dialects, so that applying them before or after the broadcast produces the same values. A minimal sketch of such a legality check is shown below.
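The sketch below assumes a hypothetical helper named isMovableAcrossBroadcast; a real implementation would also need to check, for example, that the op has a single tensor result and no other uses that depend on the broadcasted shape.

// Minimal sketch of the legality check for swapping an op with a broadcast:
// linalg ops must be purely element-wise, other ops must come from the arith
// or math dialects and carry the Elementwise trait.
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Utils/Utils.h"
#include "mlir/Dialect/Math/IR/Math.h"

using namespace mlir;

static bool isMovableAcrossBroadcast(Operation *op) {
  if (auto linalgOp = dyn_cast<linalg::LinalgOp>(op))
    return linalg::isElementwise(linalgOp);
  return isa<arith::ArithDialect, math::MathDialect>(op->getDialect()) &&
         op->hasTrait<OpTrait::Elementwise>();
}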