[RFC] Constant tensors caching pass #183
base: main
Conversation
%feature3 = linalg.matmul(%feature2, %foldedWeight2)
return %feature3
}
compiletime_fold(%weight1: tensor<*xbf16>) -> %foldedWeight0, %foldedWeight1: tensor<*xbf16>, tensor<*xbf16> {
`compiletime_fold` will be a problem because:
- The kernel binary will contain the whole tensor to be folded, which is too large. If we only want to use `runtime_fold` from it, the binary size is wasted, and this is not friendly for the kernel cache.
- For a GPU device, we may want to do the folding on the CPU. We shouldn't put three functions in the same module to achieve that.

If we want to support direct compile-time folding, I suggest following the direction of section 2.1 to implement it.
For the compile-time-available tensors, the integration can choose to:
- lower them to `arith.constant`, which is not suggested;
- put them into the arguments list of the module and mark them as `compiletime_const_args`;
- put them into the arguments list of the module and mark them as `runtime_const_args`.

The first two choices will be folded by `compiletime_fold`, and the third by `runtime_fold`. There will be no large literal tensors in the kernel for the last two choices.
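For illustration, a rough sketch of the last two choices in the pseudo-IR style used above (the argument-attribute spelling here is my assumption, not normative syntax):

// Choice 2: the weight is marked as a compile-time constant argument;
// compiletime_fold consumes it, so no literal tensor lands in the kernel.
compute(%weight1: tensor<*xbf16> {compiletime_const_args}, %feature0: tensor<*xbf16>) -> %feature3 { ... }

// Choice 3: the weight is marked as a runtime constant argument;
// runtime_fold folds it once at runtime and caches the folded result.
compute(%weight1: tensor<*xbf16> {runtime_const_args}, %feature0: tensor<*xbf16>) -> %feature3 { ... }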
> We shouldn't put three functions in the same module to achieve that.

I'm not sure whether we can generate a new module in the pass pipeline. If so, shall we put `compiletime_fold` in one module, and `runtime_fold` and `compute` in another module?
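If such a split is possible, a hedged sketch of the layout in the same pseudo-IR style (the module names here are hypothetical):

module @fold_on_host {  // used at compile time only; never shipped in the kernel binary
  compiletime_fold(%weight1: tensor<*xbf16>) -> %foldedWeight0, %foldedWeight1: tensor<*xbf16>, tensor<*xbf16> { ... }
}
module @kernel {  // shipped to the device; contains no large literal tensors
  runtime_fold(%weight2: tensor<*xbf16>) -> %foldedWeight2: tensor<*xbf16> { ... }
  compute(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>) -> %feature3 { ... }
}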
> I'm not sure whether we can generate a new module in the pass pipeline. If so, shall we put `compiletime_fold` in one module, and `runtime_fold` and `compute` in another module?

I think so. But my current thinking is that we can support `compiletime_fold` in the future, when there's a demand for it.
OK, I will put all folding operations into `runtime_fold`.
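With all folding in `runtime_fold`, the earlier example would reduce to something like the following sketch (same pseudo-IR style; the exact weight names and counts are assumptions based on the visible fragment):

runtime_fold(%weight1: tensor<*xbf16>, %weight2: tensor<*xbf16>) -> %foldedWeight0, %foldedWeight1, %foldedWeight2: tensor<*xbf16>, tensor<*xbf16>, tensor<*xbf16> {
  // all constant folding happens here, once at runtime; results are cached
  ...
}
compute(%feature0: tensor<*xbf16>, %foldedWeight0: tensor<*xbf16>, %foldedWeight1: tensor<*xbf16>, %foldedWeight2: tensor<*xbf16>) -> %feature3 {
  ...
  %feature3 = linalg.matmul(%feature2, %foldedWeight2)
  return %feature3
}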
Waiting on the OV integration: how do they handle constant weights?
RFC LGTM
Do we have an idea of the performance impact of this pass on some example?
Ideally, the execution time of operations on the weights of matmuls, such as
Related to issue #146.