[RFC] Constant tensors caching pass #183

docs/constant_tensors_cache.md

# Constant tensors folding pass

## 1 Motivation
Some tensors of a machine learning model are constant during inference, such as the filter weights of convolution
layers. There are two types of constant tensors:

- Type 1: they are available at compile time. They can appear as literal values within the model, such as
  `arith.constant` operations in MLIR. They can also appear as arguments of the MLIR module entry function that
  are explicitly marked as compile-time-available constants. Constants in OpenVINO belong to this type: the IR of an
  OpenVINO model consists of its topology and constant values, such as weights, in memory.
  When transforming OpenVINO IR to MLIR, the constants can be lowered into `arith.constant` operations
  or into arguments of the MLIR module entry function. Since the concrete values of the constants are available
  at compile time in the OpenVINO case, it is possible to fold them at compile time.

- Type 2: they are only available at runtime. Constants in oneDNN Graph belong to this type. According to the
  oneDNN Graph specification, ahead-of-time compilation is not supported and kernel compilation happens with logical
  tensors instead of real tensors, so the literal values of these constant tensors only become available at runtime.

Within the IR, there are operations that take the constant tensors as parameters and process them, such as
reordering or packing. The outputs of such operations are also constant. However, these operations run every time
the kernel is executed, which wastes both memory and computation.
This pass modifies the IR so that these operations run only once. For Type 1 constants, the compiler can choose to
run these operations once at compile time or at runtime. For Type 2 constants, these operations run only once at
runtime, specifically during the first execution.

## 2 Background
There is currently no similar pass in the MLIR community.

### 2.1 Constant folding
A related pass is constant folding, which processes **explicit** constants at compile time. But in machine learning,
the tensors are usually high-dimensional, and the operations that process the constant tensors are complex and require
compiled kernels, so traditional constant folding cannot handle them well.

Our pass can be thought of as enhanced constant folding. It ensures that the constant tensors are processed only once
and that the processed tensors are cached in buffers for reuse in later executions.

### 2.2 Constant tensors caching in OpenVINO
There are already some similar transformations in OpenVINO. Each `Graph` has a `GraphContext` member.
A `GraphContext` holds a `WeightsSharing`, which is essentially a `std::unordered_map<std::string, MemoryInfo::Ptr>`
that stores the memory of cached tensors. At the compile stage, the operations (for example, type-casting ops) that
follow the constant `Input` operations (weights, biases or others) are executed and the results are cached in the
`unordered_map` of the `GraphContext`.

For each `FullyConnected` (`FC` for short) operation with a DNNL primitive implementation, there is a `DnnlFCExecutor`,
which has an attribute of type `ExecutorContext`. The `ExecutorContext` holds an `unordered_map<string, MemoryPtr>`
to store the memory of its private cached weights. When the `FC` has dynamic-shape inputs, which is the case for
llama2, there is nothing to do with the weights at the compile stage. Indeed, there is no explicit `Reorder` operation
in the graph after the constant `Input` operation that holds the weight of an `FC`. At the first execution, all the
input shapes are defined, so the `DnnlFCExecutor` is constructed. During the construction, the weight is packed into
the blocking format that the DNNL primitive requires, and the memory is stored in the `unordered_map` of
the `ExecutorContext`. In later executions, the packed weight is used directly. When the `FC` has static shapes,
the `DnnlFCExecutor` is constructed at the compile stage, so the above packing and caching can be done at compile
time. All executions then directly use the cached weight.

We cannot utilize the work in OpenVINO because
- it happens far later than the transformation that replaces subgraphs with MLIR-world operations;
- it is deeply coupled with DNNL-related data structures.

## 3 Algorithm
The pass consists of two steps: an analysis step and a transform step.
Both steps will be implemented as `Pass`es in the MLIR world.
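
For instance, the two steps could be registered back to back in a pass pipeline. The sketch below is illustrative
only; the pass factory names are hypothetical, not existing APIs:
```c++
#include "mlir/Pass/PassManager.h"

// Sketch: run the analysis step and the transform step back to back,
// after layout propagation. Both factory names are hypothetical.
void buildConstantCachingPipeline(mlir::PassManager &pm) {
  pm.addPass(createConstantSubgraphAnalysisPass()); // hypothetical: section 3.1
  pm.addPass(createConstantTensorFoldingPass());    // hypothetical: section 3.2
}
```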

### 3.1 Analysis step
The analysis step identifies the operations that take constant tensors as inputs and produce constant
tensors as outputs. These will be marked as ops of interest.

The main work of the analysis step will be implemented as a
[DataFlow Analysis](https://mlir.llvm.org/docs/Tutorials/DataFlowAnalysis/) pass. In the MLIR module's entry function,
the constant tensors appear as outputs of `arith.constant` operations or as arguments of the entry
function marked with `constant` attributes. Constantness propagates
from these tensors to the output tensors of the operations that process them. Eventually, these operations
form a subgraph, named the 'constant subgraph'. Another subgraph, which contains the non-constant operations,
consumes the outputs of the constant subgraph together with the non-constant parameters of the graph.

Because the constantness information is carried by tensors, the analysis step works at the linalg-on-tensor level.
The ops of interest are most likely `reorder`, `pack` or `broadcast` ops, so the analysis step should run after the
layout-propagation pass.
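
To make the propagation concrete, below is a minimal sketch of the fixed-point logic in plain C++ over a toy op
representation. This is not the MLIR DataFlow framework API; the `Op` struct, the integer value ids, and
`propagateConstantness` are illustrative assumptions:
```c++
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy op representation: SSA values are integer ids.
struct Op {
  std::vector<int64_t> inputs;
  std::vector<int64_t> outputs;
};

// Iterates to a fixed point: a value is constant if it is a root constant
// (an `arith.constant` result or an entry argument marked `constant`) or if
// every input of its defining op is constant. Ops whose outputs all end up
// in the returned set form the 'constant subgraph'.
std::unordered_set<int64_t>
propagateConstantness(const std::vector<Op> &ops,
                      std::unordered_set<int64_t> constants /*root constants*/) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (const Op &op : ops) {
      bool allInputsConstant = true;
      for (int64_t in : op.inputs)
        allInputsConstant = allInputsConstant && constants.count(in) != 0;
      // Skip ops with a non-constant input; also skip zero-input ops
      // (e.g. tensor.empty), whose outputs are not literal constants.
      if (!allInputsConstant || op.inputs.empty())
        continue;
      for (int64_t out : op.outputs)
        changed = constants.insert(out).second || changed;
    }
  }
  return constants;
}
```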

### 3.2 Transform step
Take the following IR as an example (to keep the IR short and easy to understand, only the important information is
shown):
```mlir
module {
  // %weight0 and %weight1 are Type 1 constants. %weight2 is a Type 2 constant.
  entry(%feature0: tensor<*xbf16>, %weight1: tensor<*xbf16>, %weight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> attributes {compiletime_const_args_index = [1 : i32], runtime_const_args_index = [2 : i32]} {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %packedWeight0 = tensor.pack(%weight0, ...)
    %feature1 = linalg.matmul(%feature0, %packedWeight0)
    %packedWeight1 = tensor.pack(%weight1, ...)
    %feature2 = linalg.matmul(%feature1, %packedWeight1)
    %packedWeight2 = tensor.pack(%weight2, ...)
    %feature3 = linalg.matmul(%feature2, %packedWeight2)
    return %feature3
  }
}
```

After the transformation, there will be three functions in the module: one for compile-time folding, one for runtime
folding, and one as the new entry. The compile-time folding function contains the operations that consume and produce
Type 1 constants. The runtime folding function contains the operations that consume and produce
Type 2 constants. The new entry function takes all the folded tensors as inputs. The expected output IR looks like:
```mlir
module {
  entry(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> {
    %feature1 = linalg.matmul(%feature0, %foldedWeight0)
    %feature2 = linalg.matmul(%feature1, %foldedWeight1)
    %feature3 = linalg.matmul(%feature2, %foldedWeight2)
    return %feature3
  }
  compiletime_fold(%weight1: tensor<*xbf16>) -> %foldedWeight0, %foldedWeight1 : tensor<*xbf16>, tensor<*xbf16> {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %foldedWeight0 = tensor.pack(%weight0, ...)
    %foldedWeight1 = tensor.pack(%weight1, ...)
    return %foldedWeight0, %foldedWeight1
  }
  runtime_fold(%weight2: tensor<*xbf16>) -> %foldedWeight2 : tensor<*xbf16> {
    %foldedWeight2 = tensor.pack(%weight2, ...)
    return %foldedWeight2
  }
}
```

An inline review thread on `compiletime_fold`:

> **Reviewer:** `compiletime_fold` will be a problem because:
>
> 1. The kernel binary will contain the whole tensor to be folded, which is too large. If we only want to use `runtime_fold` from it, the binary size is wasted, and this is not friendly for the kernel cache.
> 2. For GPU devices, we may want to do the folding on CPU. We shouldn't put three functions in the same module to achieve that.
>
> If we want to support direct compile-time folding, I suggest following the direction of section 2.1 to implement it.

> **Author:** For the compile-time available tensors, the integration can choose to:
>
> - lower them to `arith.constant`, which is not suggested;
> - put them into the arguments list of the module and mark them as `compiletime_const_args`;
> - put them into the arguments list of the module and mark them as `runtime_const_args`.
>
> The first two choices will be folded by `compiletime_fold`, and the third by `runtime_fold`. There will be no large literal tensors in the kernel for the last two choices.

> **Author:** Regarding "We shouldn't put three functions in the same module to achieve that": I'm not clear whether we can generate a new module in the pass pipeline. If so, shall we put `compiletime_fold` in one module, and `runtime_fold` and the computation in another module?

> **Reviewer:** I think so. But my current thinking is that we can support `compiletime_fold` in the future when there is a demand for it.

> **Author:** OK, I will put all folding operations into `runtime_fold`.

However, this requires `compiletime_fold` to be called at compile time, which makes the compilation pipeline
more complex. So we also provide a simplified version, which does all the folding at runtime. In this case, the
output IR looks like:
```mlir
module {
  entry(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>)
      -> %feature3: tensor<*xbf16> {
    %feature1 = linalg.matmul(%feature0, %foldedWeight0)
    %feature2 = linalg.matmul(%feature1, %foldedWeight1)
    %feature3 = linalg.matmul(%feature2, %foldedWeight2)
    return %feature3
  }
  runtime_fold(%weight1: tensor<*xbf16>, %weight2: tensor<*xbf16>)
      -> %foldedWeight0, %foldedWeight1, %foldedWeight2 : tensor<*xbf16>, tensor<*xbf16>, tensor<*xbf16> {
    %weight0 = arith.constant dense<"0x01234567..."> : tensor<*xbf16>
    %foldedWeight0 = tensor.pack(%weight0, ...)
    %foldedWeight1 = tensor.pack(%weight1, ...)
    %foldedWeight2 = tensor.pack(%weight2, ...)
    return %foldedWeight0, %foldedWeight1, %foldedWeight2
  }
}
```
The simplified version is adopted as the default choice.

We place this transformation at the linalg-on-tensor level, right after the analysis step.

### 3.3 Management of cached tensors and integration
This part is designed for the integration with OpenVINO. For other frontends (like benchgc or oneDNN Graph),
the details may differ.

Later, after compilation to an executable, the folding function will be executed to generate the folded tensors,
which need to be cached in buffers for future use. These buffers will be managed by a runtime context,
if there is one, or by the `MLIROp`s.

An example implementation will be a map which stores pairs of a global index and an allocated buffer:
```c++
#include <cstdint>
#include <unordered_map>
#include <vector>

// Describes one cached buffer holding a folded tensor.
struct CachedBuffer {
    void* buffer;                 // base pointer of the allocated memory
    std::vector<int64_t> shape;
    std::vector<int64_t> strides;
};

class OPENVINO_API MLIROp {
    ...
    // Maps a global buffer index to the cached folded tensor it refers to.
    std::unordered_map<int64_t, CachedBuffer> cached_const_buffers;
    int executionCount = 0;
};
```

When a buffer is allocated for a folded tensor, an index is assigned to the buffer, and the map stores these
(index, buffer) pairs. The map can either be shared by all `MLIROp`s, with each `MLIROp` holding the indexes of the
buffers it uses, or each `MLIROp` can hold its own map. The current implementation takes the latter approach.
During the execution of the folding function, the buffers are filled with the folded values.
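
As an illustration of the index assignment, here is a hypothetical registration helper. The function name, the
static counter, and the `elementSize` parameter are assumptions for the sketch, not part of the actual OpenVINO
integration:
```c++
#include <cstdlib>

// Allocates a buffer for one folded tensor and registers it in the cache
// map under a freshly assigned index. The buffer contents are filled later,
// when the runtime folding function executes.
int64_t registerFoldedBuffer(std::unordered_map<int64_t, CachedBuffer>& cache,
                             std::vector<int64_t> shape,
                             std::vector<int64_t> strides,
                             size_t elementSize) {
    static int64_t nextIndex = 0;
    int64_t numElements = 1;
    for (int64_t dim : shape)
        numElements *= dim;
    CachedBuffer buf;
    buf.buffer = std::malloc(numElements * elementSize);
    buf.shape = std::move(shape);
    buf.strides = std::move(strides);
    int64_t index = nextIndex++;
    cache.emplace(index, std::move(buf));
    return index;
}
```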

At the first execution, both the runtime folding function and the entry function are executed.
In later executions, only the entry function runs.
```c++
void ov::MLIROp::execute(InputTensors& inputs, OutputTensors& outputs) {
    // Buffers that cache the folded tensors (see CachedBuffer above).
    std::vector<void *> cachedBuffers = ...;
    if (executionCount == 0) {
        // First execution: run the folding function once to fill the caches.
        std::vector<void *> constantInputs = ...;
        runtimeFold(constantInputs, cachedBuffers);
    }

    // Every execution runs the entry function with the cached folded tensors.
    std::vector<void *> nonConstantInputs = ...;
    entry(nonConstantInputs, cachedBuffers, outputs);

    executionCount += 1;
    ...
}
```

### 3.4 Postpone expanding-size ops
There is another optimization during the transform. Some operations, such as `Broadcast`, expand the tensor's
size dramatically. Folding these operations is not profitable. However, leaving them in place makes it impossible to
fold their child operations. If we swap the order of the expanding-size op and its children, the children
can be folded. Take the following IR as an example:
```mlir
%15 = tensor.empty() : tensor<8x32xbf16>
%packed_arg2 = tensor.pack %arg2 outer_dims_perm = [0] inner_dims_pos = [0] inner_tiles = [32] into %15 : tensor<256xbf16> -> tensor<8x32xbf16>
%bc_arg2_init = tensor.empty() : tensor<2x8x32x32xbf16>
%bc_arg2 = linalg.broadcast ins(%packed_arg2 : tensor<8x32xbf16>) outs(%bc_arg2_init : tensor<2x8x32x32xbf16>) dimensions = [0, 2]
%extf32 = arith.extf %bc_arg2 : tensor<2x8x32x32xbf16> to tensor<2x8x32x32xf32>
%cst_2 = arith.constant 2.000000e+00 : f32
%extf32_mul2_init = tensor.empty() : tensor<2x8x32x32xf32>
%extf32_mul2 = linalg.generic {indexing_maps = [#map4, #map4], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%extf32 : tensor<2x8x32x32xf32>) outs(%extf32_mul2_init : tensor<2x8x32x32xf32>) {
^bb0(%in: f32, %out: f32):
  %8 = arith.mulf %in, %cst_2 : f32
  linalg.yield %8 : f32
} -> tensor<2x8x32x32xf32>
%truncbf16 = arith.truncf %extf32_mul2 : tensor<2x8x32x32xf32> to tensor<2x8x32x32xbf16>
```
`%arg2` is processed sequentially by `pack`, `broadcast`, `extf`, `mulf` and `truncf`. The `broadcast` blocks
the folding of `extf`, `mulf` and `truncf`. The pass therefore moves the `broadcast` after the `truncf`, transforming the IR into:
```mlir
%2 = tensor.empty() : tensor<8x32xbf16>
%pack_1 = tensor.pack %arg2 outer_dims_perm = [0] inner_dims_pos = [0] inner_tiles = [32] into %2 : tensor<256xbf16> -> tensor<8x32xbf16>
%3 = arith.extf %pack_1 : tensor<8x32xbf16> to tensor<8x32xf32>
%4 = tensor.empty() : tensor<8x32xf32>
%5 = linalg.generic {indexing_maps = [#map5, #map5], iterator_types = ["parallel", "parallel"]} ins(%3 : tensor<8x32xf32>) outs(%4 : tensor<8x32xf32>) {
^bb0(%in: f32, %out: f32):
  %10 = arith.mulf %in, %cst : f32
  linalg.yield %10 : f32
} -> tensor<8x32xf32>
%6 = arith.truncf %5 : tensor<8x32xf32> to tensor<8x32xbf16>
%7 = tensor.empty() : tensor<2x8x32x32xbf16>
%broadcasted = linalg.broadcast ins(%6 : tensor<8x32xbf16>) outs(%7 : tensor<2x8x32x32xbf16>) dimensions = [0, 2]
```
Now `extf`, `mulf` and `truncf` can be folded, while the `broadcast` remains unfolded.
Strict constraints have to be applied to this optimization to ensure semantic correctness: the child ops
must be element-wise operations from the `linalg`, `arith` or `math` dialects.
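
For reference, below is a minimal sketch of such a legality check in MLIR C++. It assumes upstream helpers like
`linalg::isElementwise` and the `Elementwise` op trait; the helper name `canPostponeExpandingOpPast` and the exact
conditions are illustrative, not the pass's actual code:
```c++
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Utils/Utils.h"
#include "mlir/IR/OpDefinition.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// Returns true if `consumer` is safe to swap with a preceding expanding-size
// op: it must be an element-wise op from the linalg, arith or math dialects.
static bool canPostponeExpandingOpPast(Operation *consumer) {
  StringRef dialect = consumer->getDialect()->getNamespace();
  if (dialect != "linalg" && dialect != "arith" && dialect != "math")
    return false;
  // linalg ops must be element-wise generics (e.g. the mulf generic above).
  if (auto linalgOp = dyn_cast<linalg::LinalgOp>(consumer))
    return linalg::isElementwise(linalgOp);
  // arith/math ops carry the Elementwise trait when they apply per element.
  return consumer->hasTrait<OpTrait::Elementwise>();
}
```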