Questions on deploying Quantized models ... #8213
-
Hi, this is more of a question than an issue, but I couldn't find documentation or source code examples that address it. We have a backend that only supports fixed-point operators, and I am trying to evaluate ExecuTorch for deploying to our platform. I am new to using PyTorch as a deployment platform, so please bear with me if my questions are too basic. When I use PyTorch quantization, I see that it creates a graph where each operator is sandwiched between dequantize and quantize nodes (i.e., a "dq -> op -> q" pattern); a minimal sketch of the flow I mean is below.
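Just to illustrate (this is a toy model, with XNNPACKQuantizer standing in for a backend-specific quantizer; the capture API follows the PT2E tutorial and may differ across PyTorch versions):

```python
# Illustrative PT2E quantization flow; the model and quantizer here are
# placeholders, not our actual backend's.
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3)

    def forward(self, x):
        return torch.relu(self.conv(x))

example_inputs = (torch.randn(1, 3, 32, 32),)
m = capture_pre_autograd_graph(Toy().eval(), example_inputs)

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)          # calibration
m = convert_pt2e(m)

# The printed graph shows every compute op sandwiched between
# dequantize_per_tensor and quantize_per_tensor nodes ("dq -> op -> q").
m.print_readable()
```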
1. So, when I use ExecuTorch partitioning, is it the expectation that we pattern match these "dq -> op -> q" sequences and map them to our backend's fixed-point operators?
2. Suppose I have a Python model of each fixed-point op. Is there any straightforward way I can run the ExecuTorch program directly in Python by substituting the Python model for the corresponding lowered module? Since the graph schema is known, it should be possible to do this myself, but I'm wondering if someone has already solved this problem.
3. If I lower the entire graph onto the backend as a single lowered module, I suppose that memory planning doesn't apply inside the lowered module, i.e., the lowered module needs to take care of memory planning for the tensors inside it?
4. Finally, is there an example that shows how I can pass already-quantized inputs to the ExecuTorch program? For example, if I use fixed quantization for inputs and outputs, clients can directly pass quantized inputs and receive quantized outputs without the need to deal with floating-point data. Is this possible with ExecuTorch?

Appreciate your help with my questions. This is an impressive platform! Thanks.
Replies: 8 comments
-
That is correct. However, there is some WIP to represent quantized ops via integer compute instead of "dq -> op -> q". See here: https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html#convert-the-calibrated-model-to-a-quantized-model
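(For reference, the tutorial above exposes that integer representation via a flag on convert_pt2e; this assumes a recent enough PyTorch, and since the work is in progress the flag may still change.)

```python
# Assuming `prepared` is a calibrated model from prepare_pt2e, the reference
# (integer-arithmetic) representation is requested like this:
from torch.ao.quantization.quantize_pt2e import convert_pt2e

quantized_ref = convert_pt2e(prepared, use_reference_representation=True)
quantized_ref.print_readable()  # quantized ops expressed as integer compute
```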
Are you trying to use the export pipeline to generate an ExecuTorch model (.pte file) and run it in a Python environment? If so, yes, but this requires Python bindings, which are being enabled (or may already be). @larryliu0820 I saw you land some diffs for this.
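Once the bindings are available, usage would look roughly like the sketch below (the module path and function name are assumptions based on the in-progress pybind work and may differ in your build):

```python
# Load and run a .pte file from Python via the ExecuTorch pybindings.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

program = _load_for_executorch("model.pte")
outputs = program.forward((torch.randn(1, 3, 32, 32),))
print(outputs)
```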
This is correct. It might be possible to leverage memory planning, if only to get an idea of the arena that needs to be allocated and the tensor offsets. You might want to file a feature request if you need this.
The answer to this is yes. Imagine you have a quantized model that is (roughly) "q -> dq -> conv -> q -> dq".
Now say you delegate the quantized conv to your backend, so you have "q -> [dq -> conv -> q] -> dq", where the bracketed part is the lowered module.
This hasn't been tested, but in theory it should be possible for you to rewrite this graph to remove the boundary q/dq nodes, so that what you have is just the lowered module with quantized inputs and outputs; a rough sketch of such a rewrite follows.
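This is an untested sketch written as a torch.fx pass; it assumes the boundary q/dq nodes are the PT2E quantized_decomposed ops, that each model input feeds exactly one quantize node, and it deliberately ignores updating the exported program's input/output specs:

```python
import torch

def _is_quant(node: torch.fx.Node) -> bool:
    # Loose match on the PT2E quantize op (e.g. quantized_decomposed.quantize_per_tensor).
    return (node.op == "call_function"
            and "quantize_per_tensor" in str(node.target)
            and "dequantize" not in str(node.target))

def _is_dequant(node: torch.fx.Node) -> bool:
    # Loose match on the PT2E dequantize op.
    return node.op == "call_function" and "dequantize_per_tensor" in str(node.target)

def strip_io_qdq(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    graph = gm.graph

    # Inputs: placeholder -> q -> ...  becomes  placeholder (already integer) -> ...
    for node in list(graph.nodes):
        if node.op == "placeholder":
            users = list(node.users)
            if len(users) == 1 and _is_quant(users[0]):
                q = users[0]
                q.replace_all_uses_with(node)
                graph.erase_node(q)

    # Outputs: ... -> dq -> output  becomes  ... -> output (still integer)
    output_node = next(n for n in graph.nodes if n.op == "output")
    new_outputs = []
    for arg in output_node.args[0]:
        if isinstance(arg, torch.fx.Node) and _is_dequant(arg):
            new_outputs.append(arg.args[0])  # the integer tensor feeding the dq
        else:
            new_outputs.append(arg)
    output_node.args = (tuple(new_outputs),)

    graph.eliminate_dead_code()
    gm.recompile()
    return gm
```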
Since we are changing dtypes for the inputs and outputs, two things need to be considered: the dtypes recorded in the program's input/output specs, and the quantization parameters the client must apply, which have to be agreed upon out of band.
-
Thank you so much for this information. This is very helpful.
Yes. This is one way of doing this and I think it should work for me (I was probably thinking of running it at an earlier stage of the compilation process, but running a .pte in Python using bindings is good as well).
I'll try to play around with this to see if there is any way I can take advantage of the existing memory planning code. I'll raise a feature request if necessary.
Correct. My plan for this would be to use "fixed" quantization (for example, Q15) for the input and output Q/DQ, with the quantization scales and biases implicitly known. This way the entire inference is executed purely using integers.
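For instance (purely illustrative; the exact Q-format is whatever we standardize on), the client-side conversion for Q15 data would just be:

```python
import numpy as np

def to_q15(x: np.ndarray) -> np.ndarray:
    # Quantize float data to Q15 (int16, 15 fractional bits, scale fixed at 2**-15).
    return np.clip(np.round(x * (1 << 15)), -32768, 32767).astype(np.int16)

def from_q15(q: np.ndarray) -> np.ndarray:
    # Dequantize Q15 data back to float.
    return q.astype(np.float32) / (1 << 15)
```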
-
Yes, I'm trying to land PR #1006 to add pybind support. Currently running into some errors on macOS; still debugging.
-
That sounds good. Although do note that we don't really have a fixed-point dtype the way you have specified, so I would like to learn how you would leverage the PT2 quantization to achieve your objectives. It may be best to create a PyTorch forums post here https://discuss.pytorch.org/c/executorch/42 for further discussion on fixed-point quantization. @larryliu0820 once you have the PR landed, we can close this.
-
Yes. I'll start a discussion on this once I have a solidified proposal. The Q-formats are simply special cases of (scale, zero-point, dtype) based affine quantization where the zero point is always 0 and the scale is a power of 2 (i.e., Q8 in affine representation is (scale=2^7, zero-point=0, dtype=int8)). So, the existing PyTorch quantization framework should still work, with the scales constrained to powers of two and the zero points fixed at 0.
-
That's great. Although, I would like to understand how 2^7 will translate into Q8, or did you mean Q15? For Q15 it makes sense, as the fractional part is really dividing by 2^7?
-
I think I goofed. What I meant was Q0.7, where there are 7 fractional bits, 0 integer bits, and 1 sign bit. This corresponds to (2^7, 0, int8). There are different variations of the notation and we have played fast and loose with it; for example, sometimes we use Q7 to automatically refer to Q0.7. The key thing is that we'll standardize the input and output Q-formats for a given model, which are known to the client, and this allows us to use a 100% fixed-point data path where it will be the client's responsibility to do the input/output quantization.
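For what it's worth, in PyTorch's affine-quantization convention the scale is the dequantization multiplier (real = scale * (q - zero_point)), so Q0.7 maps to scale = 2**-7 (equivalently, dividing the integer code by 2^7), zero_point = 0, dtype = int8. A tiny sanity check:

```python
import torch

# Q0.7: 1 sign bit, 7 fractional bits -> scale 2**-7, zero point 0, int8.
scale, zero_point = 2.0 ** -7, 0

x = torch.tensor([0.5, -0.25, 0.9921875])
qx = torch.quantize_per_tensor(x, scale, zero_point, torch.qint8)
print(qx.int_repr())    # tensor([ 64, -32, 127], dtype=torch.int8)
print(qx.dequantize())  # the representable values round-trip exactly
```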
-
@rvijayc do you still need help with the issue?