[RFC] Constant tensors caching pass #183

Open

niuxiaog wants to merge 4 commits into main from xgniu/rfc/const-weight-cache

# Constant tensors folding pass

## 1 Motivation
Some tensors of a machine learning model are constant during inference, such as the filter weights of convolution
layers. There are two types of constant tensors:

- Type 1: they are available at compile time. They can appear as literal values within the model, such as
`arith.constant` operations in MLIR. They can also appear as arguments of the MLIR module entry function that are
explicitly marked as compile-time-available constants. Constants in OpenVINO belong to this type. The IR of an
OpenVINO model consists of its topology and constant values, like weights, in memory.
When transforming OpenVINO IR to MLIR, the constants can be lowered into `arith.constant` operations
or arguments of the MLIR module entry function. Since the concrete values of the constants are available at
compile time in the OpenVINO case, it is possible to fold them at compile time.

- Type 2: they are only available at runtime. Constants in oneDNN Graph belong to this type. According to the
specification of oneDNN Graph, ahead-of-time compilation is not supported and kernel compilation happens with logical
tensors instead of real tensors. The literal values of these constant tensors only become available at runtime.

Within the IR, there are operations that take the constant tensors as inputs and process them, such as
reordering or packing. The outputs of such operations are also constant. However, these operations run every time
the kernel is executed, which causes redundant memory and computation consumption.
This pass modifies the IR so that these operations run only once. For Type 1 constants, the compiler can choose to
run these operations once at compile time or at runtime. For Type 2 constants, these operations run only once at
runtime, more specifically, during the first execution.

## 2 Background
There is no similar pass in the MLIR community currently.

### 2.1 Constant folding
A related pass is constant folding, which processes **explicit** constants at compile time. But in machine learning,
the tensors are usually high-dimensional, and the operations that process the constant tensors are complex and require
compiled kernels. So traditional constant folding cannot handle them well.

Our pass can be thought of as enhanced constant folding. It ensures that the constant tensors are processed only once
and that the processed tensors are cached in buffers for reuse in later executions.

### 2.2 Constant tensors caching in OpenVINO
There are already some similar transformations in OpenVINO. For each `Graph`, there is a `GraphContext` member.
A `GraphContext` holds a `WeightsSharing`, which is basically a `std::unordered_map<std::string, MemoryInfo::Ptr>`
that stores the memory of cached tensors. In the compile stage, the operations (for example, type-casting ops) that
follow the constant `Input` operations (weights, bias or others) are executed and the results are cached in the
`unordered_map` of the `GraphContext`.

For each `FullyConnected` (`FC` for short) operation with a DNNL primitive implementation, there is a `DnnlFCExecutor`,
which has an attribute of type `ExecutorContext`. The `ExecutorContext` holds an `unordered_map<string, MemoryPtr>`
to store the memory of its private cached weights. When the `FC` has dynamic-shape inputs, which is the case for
llama2, there is nothing to do with the weights in the compile stage. In fact, there is no explicit `Reorder` operation
in the graph after the constant `Input` operation which holds the weight of an `FC`. In the first execution, all the
input shapes are defined, so the `DnnlFCExecutor` is constructed. During the construction, the weight is packed into
the blocking format that the DNNL primitive requires, and the memory is stored in the `unordered_map` of
the `ExecutorContext`. In later executions, the packed weight can be used directly. When the `FC` has static shapes,
the `DnnlFCExecutor` is constructed in the compile stage, so the above packing and caching can be done at compile
time. All executions directly use the cached weight.

We cannot reuse the work in OpenVINO because
- it happens far later than the transformation that replaces subgraphs with MLIR-world operations;
- it is deeply coupled with DNNL-related data structures.

## 3 Algorithm
There are two steps to complete the pass: an analysis step and a transform step.
These two steps will be implemented as `Pass`es in the MLIR world.

### 3.1 Analysis step
The analysis step mainly identifies the operations that take constant tensors as inputs and output constant
tensors. They will be marked as ops of interest.

The main work of the analysis step will be implemented as a
[DataFlow Analysis](https://mlir.llvm.org/docs/Tutorials/DataFlowAnalysis/) pass. In the MLIR module's entry function,
the constant tensors appear as outputs of `arith.constant` operations or as arguments of the MLIR module
entry function marked with `constant` attributes. Constantness propagates
from these tensors to the output tensors of the operations that process them. Eventually, these operations
form a subgraph, which we call the 'constant subgraph'. Another subgraph, which contains the non-constant operations,
consumes the outputs of the constant subgraph and the non-constant parameters of the graph.

Because the constantness information is carried by tensors, the analysis step works at the linalg-on-tensor level.
The ops of interest are most likely `reorder`, `pack` or `broadcast` ops, so the analysis step should run after the
layout propagation pass.

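The following is a minimal C++ sketch of the constantness propagation, not the actual implementation: it uses a plain
fixed-point loop instead of MLIR's DataFlow Analysis framework, and the `constant_subgraph` attribute name and the
`constArgIndices` parameter are only illustrative.
```c++
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinAttributes.h"
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/STLExtras.h"

using namespace mlir;

// Mark every op whose inputs are all constant; its results become constant too.
void markConstantSubgraph(func::FuncOp entry, ArrayRef<int64_t> constArgIndices) {
  llvm::DenseSet<Value> constantValues;

  // Seed 1: results of arith.constant ops.
  entry.walk([&](arith::ConstantOp op) { constantValues.insert(op.getResult()); });
  // Seed 2: entry arguments marked as constants by the frontend.
  for (int64_t idx : constArgIndices)
    constantValues.insert(entry.getArgument(idx));

  // Propagate constantness until a fixed point is reached.
  bool changed = true;
  while (changed) {
    changed = false;
    entry.walk([&](Operation *op) {
      if (op->getNumResults() == 0 || op->hasAttr("constant_subgraph"))
        return;
      bool allConstant = llvm::all_of(op->getOperands(), [&](Value operand) {
        return constantValues.contains(operand);
      });
      if (!allConstant)
        return;
      // The op only consumes constants, so it belongs to the constant subgraph.
      op->setAttr("constant_subgraph", UnitAttr::get(op->getContext()));
      for (Value result : op->getResults())
        constantValues.insert(result);
      changed = true;
    });
  }
}
```
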
### 3.2 Transform step
Take the following IR as an example (to keep the IR short and easy to understand, only the important information is
shown):
```mlir
module {
  // %weight0 and %weight1 are Type 1 constants. %weight2 is a Type 2 constant.
  entry(%feature0: tensor<*xbf16>, %weight1: tensor<*xbf16>, %weight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> attributes {compiletime_const_args_index = [1 : i32], runtime_const_args_index = [2 : i32]} {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %packedWeight0 = tensor.pack(%weight0, ...)
    %feature1 = linalg.matmul(%feature0, %packedWeight0)
    %packedWeight1 = tensor.pack(%weight1, ...)
    %feature2 = linalg.matmul(%feature1, %packedWeight1)
    %packedWeight2 = tensor.pack(%weight2, ...)
    %feature3 = linalg.matmul(%feature2, %packedWeight2)
    return %feature3
  }
}
```

After the transformation, there will be three functions in the module: one for compile-time folding, one for runtime
folding, and one as the new entry. The compile-time folding function contains the operations that consume and produce
constants of Type 1. The runtime folding function contains the operations that consume and produce
constants of Type 2. The new entry function takes all folded tensors as inputs. The expected output IR will be like:
```mlir
module {
  entry(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> {
    %feature1 = linalg.matmul(%feature0, %foldedWeight0)
    %feature2 = linalg.matmul(%feature1, %foldedWeight1)
    %feature3 = linalg.matmul(%feature2, %foldedWeight2)
    return %feature3
  }
  compiletime_fold(%weight1: tensor<*xbf16>) -> %foldedWeight0, %foldedWeight1: tensor<*xbf16>, tensor<*xbf16> {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %foldedWeight0 = tensor.pack(%weight0, ...)
    %foldedWeight1 = tensor.pack(%weight1, ...)
    return %foldedWeight0, %foldedWeight1
  }
  runtime_fold(%weight2: tensor<*xbf16>) -> %foldedWeight2: tensor<*xbf16> {
    %foldedWeight2 = tensor.pack(%weight2, ...)
    return %foldedWeight2
  }
}
```
However, this requires `compiletime_fold` to be called at compile time, which makes the compilation pipeline
complex. So we also provide a simplified version, which does all the folding at runtime. In this case, the output IR
will be like:
```mlir
module {
  entry(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> {
    %feature1 = linalg.matmul(%feature0, %foldedWeight0)
    %feature2 = linalg.matmul(%feature1, %foldedWeight1)
    %feature3 = linalg.matmul(%feature2, %foldedWeight2)
    return %feature3
  }
  runtime_fold(%weight1: tensor<*xbf16>, %weight2: tensor<*xbf16>)
      -> %foldedWeight0, %foldedWeight1, %foldedWeight2: tensor<*xbf16>, tensor<*xbf16>, tensor<*xbf16> {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %foldedWeight0 = tensor.pack(%weight0, ...)
    %foldedWeight1 = tensor.pack(%weight1, ...)
    %foldedWeight2 = tensor.pack(%weight2, ...)
    return %foldedWeight0, %foldedWeight1, %foldedWeight2
  }
}
```
The simplified version is adopted as the default choice.

We place this transformation at the linalg-on-tensor level, right after the analysis step. A rough sketch of how the
outlining of `runtime_fold` could look is given below.

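The following is a rough C++ sketch, not the actual pass, of how the simplified version could outline the marked
operations into `runtime_fold`. It assumes the analysis step tagged foldable ops with an illustrative
`constant_subgraph` attribute, takes the runtime-constant argument indices as a parameter, and omits the rewiring of
the entry function.
```c++
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/IRMapping.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"

using namespace mlir;

void outlineRuntimeFold(ModuleOp module, func::FuncOp entry,
                        ArrayRef<int64_t> constArgIndices) {
  OpBuilder builder(module.getContext());

  // 1. Collect the ops marked by the analysis step, in program order.
  SmallVector<Operation *> foldableOps;
  entry.walk([&](Operation *op) {
    if (op->hasAttr("constant_subgraph"))
      foldableOps.push_back(op);
  });

  // 2. Values produced inside the constant subgraph but used outside of it
  //    become the results of runtime_fold (and, in the real pass, the new
  //    folded-weight arguments of the entry function).
  llvm::SetVector<Value> foldedValues;
  for (Operation *op : foldableOps)
    for (Value result : op->getResults())
      for (Operation *user : result.getUsers())
        if (!user->hasAttr("constant_subgraph"))
          foldedValues.insert(result);

  // 3. runtime_fold takes the runtime-constant entry arguments as inputs.
  SmallVector<Type> inputTypes;
  for (int64_t idx : constArgIndices)
    inputTypes.push_back(entry.getArgument(idx).getType());
  auto foldType = builder.getFunctionType(
      inputTypes, TypeRange(ValueRange(foldedValues.getArrayRef())));
  builder.setInsertionPointToEnd(module.getBody());
  auto foldFunc =
      builder.create<func::FuncOp>(entry.getLoc(), "runtime_fold", foldType);
  Block *body = foldFunc.addEntryBlock();

  // 4. Clone the constant subgraph, remapping the constant entry arguments
  //    to the arguments of runtime_fold.
  IRMapping mapping;
  for (auto en : llvm::enumerate(constArgIndices))
    mapping.map(entry.getArgument(en.value()), body->getArgument(en.index()));
  builder.setInsertionPointToStart(body);
  for (Operation *op : foldableOps)
    builder.clone(*op, mapping);

  SmallVector<Value> results;
  for (Value v : foldedValues)
    results.push_back(mapping.lookup(v));
  builder.create<func::ReturnOp>(entry.getLoc(), results);
  // Rewriting the entry function to consume the folded tensors as new
  // arguments and erasing the now-dead constant subgraph is omitted here.
}
```
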
### 3.3 Management of cached tensors and integration
This part is designed for integration with OpenVINO. For other frontends (like benchgc or oneDNN Graph),
the details may be different.

After being compiled to an executable, the folding function will be executed to generate the folded tensors,
which need to be cached in buffers for future use. These buffers will be under the management of a runtime context,
if there is one, or of the `MLIROp`s.

An example implementation is a map which stores pairs of a global index and an allocated buffer:
```c++
// Shape and strides are kept so the buffer can be interpreted as a tensor later.
struct CachedBuffer {
    void* buffer;
    std::vector<int64_t> shape;
    std::vector<int64_t> strides;
};

class OPENVINO_API MLIROp {
    ...
    // Maps a global buffer index to the cached, already-folded constant tensor.
    std::unordered_map<int64_t, CachedBuffer> cached_const_buffers;
    int executionCount = 0;
};
```

When a buffer is allocated for a folded tensor, an index is assigned to the buffer, and the map stores these pairs.
The map can be shared by all `MLIROp`s, with each `MLIROp` holding the indexes of the buffers it uses, or each
`MLIROp` can hold its own map. In the current implementation, each `MLIROp` holds its own map.
During the execution of the folding function, the buffers are filled with the folded values.

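For illustration, a folded buffer could be allocated and recorded under its index roughly as follows; the helper name
`cacheFoldedTensor` and the 64-byte-aligned allocator are assumptions, not part of the actual integration.
```c++
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <numeric>
#include <unordered_map>
#include <vector>

struct CachedBuffer {
    void* buffer;
    std::vector<int64_t> shape;
    std::vector<int64_t> strides;
};

// Allocate a buffer for one folded tensor and record it under a global index.
void cacheFoldedTensor(std::unordered_map<int64_t, CachedBuffer>& cache,
                       int64_t index, const std::vector<int64_t>& shape,
                       const std::vector<int64_t>& strides, size_t elemSize) {
    int64_t numElems = std::accumulate(shape.begin(), shape.end(), int64_t{1},
                                       std::multiplies<int64_t>());
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    size_t bytes = (numElems * elemSize + 63) / 64 * 64;
    void* buffer = std::aligned_alloc(/*alignment=*/64, bytes);
    cache[index] = CachedBuffer{buffer, shape, strides};
    // The runtime folding function later writes the folded values into this
    // buffer; the entry function reads it in every execution.
}
```
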
In the first execution, both the runtime folding function and the entry function are executed.
In later executions, only the entry function is executed.
```C++
void ov::MLIROp::execute(InputTensors& inputs, OutputTensors& outputs) {
    // Buffers holding the folded constants; they are filled once and reused afterwards.
    std::vector<void *> cachedBuffers = ...;
    if (executionCount == 0) {
        // First execution: run the runtime folding function to fill the cached buffers.
        std::vector<void *> constantInputs = ...;
        runtimeFold(constantInputs, cachedBuffers);
    }

    // Every execution runs the entry function with the already-folded weights.
    std::vector<void *> nonConstantInputs = ...;
    entry(nonConstantInputs, cachedBuffers, outputs);

    executionCount += 1;
    ...
}
```

### 3.4 Postpone expanding-size ops
There is another optimization during the transform. Some operations, such as `Broadcast`, expand the tensor's
size dramatically, so folding them is not profitable. However, leaving them unfolded makes it impossible to fold
their child operations. If we can swap the order of the expanding-size op and its child ops, the child ops
can be folded. Take the following IR as an example:
```mlir
%15 = tensor.empty() : tensor<8x32xbf16>
%packed_arg2 = tensor.pack %arg2 outer_dims_perm = [0] inner_dims_pos = [0] inner_tiles = [32] into %15 : tensor<256xbf16> -> tensor<8x32xbf16>
%bc_arg2_init = tensor.empty() : tensor<2x8x32x32xbf16>
%bc_arg2 = linalg.broadcast ins(%packed_arg2 : tensor<8x32xbf16>) outs(%bc_arg2_init : tensor<2x8x32x32xbf16>) dimensions = [0, 2]
%extf32 = arith.extf %bc_arg2 : tensor<2x8x32x32xbf16> to tensor<2x8x32x32xf32>
%cst_2 = arith.constant 2.000000e+00 : f32
%extf32_mul2_init = tensor.empty() : tensor<2x8x32x32xf32>
%extf32_mul2 = linalg.generic {indexing_maps = [#map4, #map4], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%extf32 : tensor<2x8x32x32xf32>) outs(%extf32_mul2_init : tensor<2x8x32x32xf32>) {
^bb0(%in: f32, %out: f32):
  %8 = arith.mulf %in, %cst_2 : f32
  linalg.yield %8 : f32
} -> tensor<2x8x32x32xf32>
%truncbf16 = arith.truncf %extf32_mul2 : tensor<2x8x32x32xf32> to tensor<2x8x32x32xbf16>
```
`%arg2` is processed sequentially by `pack`, `broadcast`, `extf`, `mulf` and `truncf`. The `broadcast` blocks
the folding of `extf`, `mulf` and `truncf`. The pass therefore moves the `broadcast` after `truncf` and transforms the IR to:
```mlir
%2 = tensor.empty() : tensor<8x32xbf16>
%pack_1 = tensor.pack %arg2 outer_dims_perm = [0] inner_dims_pos = [0] inner_tiles = [32] into %2 : tensor<256xbf16> -> tensor<8x32xbf16>
%3 = arith.extf %pack_1 : tensor<8x32xbf16> to tensor<8x32xf32>
%4 = tensor.empty() : tensor<8x32xf32>
%5 = linalg.generic {indexing_maps = [#map5, #map5], iterator_types = ["parallel", "parallel"]} ins(%3 : tensor<8x32xf32>) outs(%4 : tensor<8x32xf32>) {
^bb0(%in: f32, %out: f32):
  %10 = arith.mulf %in, %cst : f32
  linalg.yield %10 : f32
} -> tensor<8x32xf32>
%6 = arith.truncf %5 : tensor<8x32xf32> to tensor<8x32xbf16>
%7 = tensor.empty() : tensor<2x8x32x32xbf16>
%broadcasted = linalg.broadcast ins(%6 : tensor<8x32xbf16>) outs(%7 : tensor<2x8x32x32xbf16>) dimensions = [0, 2]
```
Now the `extf`, `mulf` and `truncf` can be folded, while the `broadcast` remains unfolded.
Strict constraints have to be applied to this optimization to ensure semantic correctness. The child ops
should be element-wise operations from the `linalg`, `arith` or `math` dialects. A rough sketch of such a legality
check is given below.
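
As a rough illustration of that constraint, a check along the following lines could decide whether an expanding-size
op may be moved below its users. The function names are illustrative; a real implementation needs a stricter
element-wise check than just looking at the dialect.
```c++
#include "mlir/IR/Operation.h"
#include "llvm/ADT/StringRef.h"

using namespace mlir;

// A user is a candidate for swapping only if it is an element-wise op from
// the linalg, arith or math dialects.
static bool isPostponableUser(Operation *user) {
  StringRef dialect = user->getDialect()->getNamespace();
  if (dialect != "linalg" && dialect != "arith" && dialect != "math")
    return false;
  // A real implementation must additionally verify the op is element-wise
  // (e.g. a linalg.generic with only parallel iterators and identity indexing
  // maps), so that swapping it with the broadcast preserves semantics.
  return true;
}

// Decide whether an expanding-size op (e.g. linalg.broadcast) can be
// postponed below all of its users.
static bool canPostpone(Operation *expandingOp) {
  for (Operation *user : expandingOp->getUsers())
    if (!isPostponableUser(user))
      return false;
  return true;
}
```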

`compiletime_fold` will be a problem because … `runtime_fold` from it, the binary size is wasted and this is not
friendly for kernel cache. If we want to support direct compile-time folding, I suggest following the direction of
section 2.1 to implement it.

For the compile-time available tensors, the integration can choose to:
- embed them in the kernel as literal tensors (e.g. `arith.constant` ops);
- pass them as entry arguments marked `compiletime_const_args`;
- pass them as entry arguments marked `runtime_const_args`.

For the first two choices, they will be folded by `compiletime_fold`, and for the third choice, by `runtime_fold`.
There will be no large literal tensors in the kernel for the last two choices.

I'm not clear whether we can generate a new module in the pass pipeline. If so, shall we put `compiletime_fold` in one
module, and `runtime_fold` and `compute` in another module?

I think so. But my current thinking is that we can support `compiletime_fold` in the future when there's a demand for
it.

OK, I will put all folding operations into `runtime_fold`.