-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert i64 input tensors to i32 for GPU plugin #781
base: main
Are you sure you want to change the base?
Conversation
e5fff9f
to
1d4800f
Compare
… always converted to i32. To avoid runtime conversion, updated IO tensor precision to i32.
1d4800f
to
b502ee0
Compare
@yeonbok thank you for your PR, but I'm not sure that it is right place to do that:
|
@eaidova Yes, it might be benefitial for CPU as well (since internally we anyway convert i64 to i32). I would say it sounds very logical to have such logic aligned for all devices. |
@eaidova Thanks for the review and @dmitry-gorokhov thanks for the inputs for the CPU plugin. Sounds that it can be applied generally.. I will remove the GPU condition.
Actually current GPU (and CPU) plugin is always converting to i32, even though the model requires i64 we dont support. I will add a comment regarding this. |
there was a discussion about this some time ago. Maybe @slyalin remembers the context. |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
What's the expected gain from this change? Do you have numbers? By doing this and having the desired effect of "avoiding runtime conversion" you need to align other parts of the pipeline that supply these data: here in optimum-intel and in the deployment scenarios in C++ code. I don't think that in optimum you can avoid this runtime conversion, it will just happen in another place where int64 torch tensor will be converted to our updated i32 input tensor. Right? As for C++ code, for example, code here https://github.com/openvinotoolkit/openvino.genai/blob/master/src/cpp/src/greedy_decoding.cpp directly relies on i64. You need to propagate this change to other parts to understand the impact (and value). Including tokenizers and detokenizers. Just to "avoid runtime conversion". Sure, in practice we won't see real sizes above 2G I believe, so the numbers could be represented as i32. But I don't see how this PR is enough to leverage this possibility in the consistent manner. |
OpenVINO Tokenizer uses i32 internally too but sets output/input types to i64 for compatibility by default. One can set the I didn't remember any model with more than 1_000_00 vocab size, used by the multilingual encoder models. Gemma2 model has 256k vocab, GPT4 - 200k vocab, other LLMs have smaller vocabs. Extrapolating TinyLlama vocab to 1m tokens with the hidden state size of 2048, results in a |
Hi everyone, sorry for the belated update. I will collect performance data in the near future, thanks for the discussion. |
BTW, apart from this change, I'd like to ask a question about the logits (output) precision. Currently it is set as fp32, so if the context size is very large, it allocate a quite large memory. However, when the inference precision is fp16, the logits are actually computed in fp16 and then just converted as fp32. So I think we can just set the logits precision as fp16 if the inference precision is fp16. How do you think? E.g., when the context size is 10000, it allocates more than 1 GB for output for mistral, however it can be reduced to half(500 MB) for example. |
@yeonbok I do not recommend to do that due to several reasons:
|
Thanks for the response. Oh I see, as you mentioned before the runtime device an be changed by model.to, which will be problematic for other device... |
What does this PR do?
OpenVINO GPU plugin does not support int64 natively so i64 inputs are always converted to i32. To avoid runtime conversion, updated IO tensor precision to i32
Fixes # (issue)
Before submitting