Support for multimodal, multi-image summary inference #16794
vivekt-max started this conversation in Ideas
Hi Team,
It would be great if someone could share some pointers on how to achieve multimodal, multi-image inference using a model like Qwen-VL 3B. We want to generate a summary across a set of images using the multimodal Qwen model on-device.
Currently, it appears that the visual summary is restricted to just one image at a time, with the mmproj multimodal projector handling the visual content processing.
However, inference through --mmproj seems to accept only a single image per request.
Can it be extended to allow multiple images?
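For context, here is roughly what we are hoping a multi-image request could look like. This is only a sketch, assuming llama-server is launched with the model plus its --mmproj projector and that its OpenAI-compatible /v1/chat/completions endpoint accepts several image_url parts in one user message; the file names, model names, and port below are placeholders, and whether multiple images are actually consumed end-to-end is exactly what we are asking about.

```python
# Sketch only (unverified): send several images in one chat request to a local
# llama-server started with a Qwen-VL GGUF and its --mmproj projector, e.g.
#   llama-server -m qwen-vl-3b.gguf --mmproj mmproj-qwen-vl-3b.gguf --port 8080
# Payload shape follows the OpenAI-compatible chat completions API.
import base64
import requests  # any HTTP client works

def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

image_paths = ["frame1.jpg", "frame2.jpg", "frame3.jpg"]  # hypothetical inputs

content = [{"type": "text",
            "text": "Summarize what happens across these images as one story."}]
content += [{"type": "image_url", "image_url": {"url": to_data_uri(p)}}
            for p in image_paths]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local server/port
    json={"messages": [{"role": "user", "content": content}],
          "max_tokens": 512},
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```

If the runtime only ever processes the first image, the fallback we can think of is captioning each image separately and then summarizing the captions with a text-only pass, but that loses cross-image context, which is why native multi-image support would be very valuable.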
Kindly help!