Extend on-device sampling support for dual QPC VLMs #597
Conversation
quic-hemagnih
left a comment
Can you please add the CI test cases.
Signed-off-by: quic-xiyushi <[email protected]>
Can you add a test for the Intern model, i.e., a VLM in dual QPC mode?
InternVL does not support the new generation interface yet; that will be added later in PR #610. So instead of making changes to the legacy API, I added tests for Qwen2.5VL, which supports continuous batching in the current mainline.
```python
QEffGPTJForCausalLM,
QEffGraniteForCausalLM,
QEffGraniteMoeForCausalLM,
QEffInternDecoderWrapper,
```
Does this mean we are enabling sampling only for intern model?
Will other VLMs also be supported?
Other VLMs are also intended to be supported, but currently only InternVL and Qwen 2.5 VL have been tested.
@quic-hemagnih CI tests added. Please review this PR again. Thank you!
```python
top_ps: Optional[torch.Tensor] = None,
min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
```
Please add both `vision_embeds` and `image_idx` to the Args list in the docstring.
```python
min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
image_idx: Optional[torch.Tensor] = None,
```
Please keep the dtype of these two consistent with lines 27-28, and also update the function docstring for these newly added args.
Please resolve the conflicts.
Overview
On-device sampling can significantly reduce host overhead and improve inference throughput; however, so far it has only been implemented for `QEffForCausalLM` models. This PR extends on-device sampling support to the language decoder of dual QPC vision-language models, `QEffCausalLMForTextImageToTextModel`. In addition, it fixes a bug in the Gumbel noise so that it correctly simulates a multinomial distribution for random sampling.

Implementation details
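To illustrate the sampling behavior the Gumbel-noise fix is meant to guarantee, here is a minimal sketch of the Gumbel-max trick, written in NumPy for clarity (an illustration, not the QEfficient code): adding `g = -log(-log(u))` noise to the logits and taking the argmax is equivalent to drawing from the softmax distribution, so any error in the noise transform skews the sampled token distribution.

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Draw one index from Categorical(softmax(logits)) via the Gumbel-max trick."""
    # Transform uniform noise into Gumbel noise: g = -log(-log(u)).
    # The lower bound keeps u away from 0 so log(u) is finite.
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # argmax(logits + g) is distributed as softmax(logits).
    return int(np.argmax(logits + gumbel))

# Empirical check: sampled frequencies should match softmax(logits).
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0])
counts = np.zeros(3)
for _ in range(20_000):
    counts[gumbel_max_sample(logits, rng)] += 1
freqs = counts / counts.sum()
target = np.exp(logits) / np.exp(logits).sum()
```

A correct implementation makes `freqs` converge to `target`; the pre-fix noise did not satisfy this.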
Usage

The usage is the same as enabling on-device sampling for `QEffForCausalLM`.
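For reference, the `top_ps` and `min_ps` tensors that appear in the diff correspond to standard nucleus (top-p) and min-p filtering of the next-token distribution. A minimal NumPy sketch of what these parameters do (an illustration of the technique, not the QEfficient kernel):

```python
import numpy as np

def filter_top_p_min_p(probs: np.ndarray, top_p: float, min_p: float) -> np.ndarray:
    """Apply min-p and top-p (nucleus) filtering, then renormalize."""
    # Min-p: drop tokens whose probability is below min_p * max(probs).
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
    # Top-p: keep the smallest prefix of tokens (sorted by descending
    # probability) whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = np.searchsorted(cumulative, top_p) + 1  # number of tokens to keep
    mask = np.zeros_like(probs)
    mask[order[:keep]] = 1.0
    probs = probs * mask
    return probs / probs.sum()

# Example: with top_p=0.6, only the two most likely tokens survive.
probs = np.array([0.5, 0.25, 0.15, 0.1])
filtered = filter_top_p_min_p(probs, top_p=0.6, min_p=0.0)
```

On device these filters run per batch row (hence per-sequence tensors rather than scalars), which is what lets each request in a continuous batch use its own sampling parameters.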