Support for multimodal, multi-image summary inference #16794
vivekt-max started this conversation in Ideas
Hi Team,
It would be great if someone could share some pointers on how to achieve multimodal, multi-image inference using a model like Qwen-VL 3B. We want to generate a summary across a set of images using the multimodal Qwen model on-device.
Currently, it appears that the visual summary is restricted to just one image at a time, with the mmproj multimodal projector handling the visual content processing.
However, inference through --mmproj seems to accept only a single image per request.
Can it be extended to allow multiple images?
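For context, here is roughly what we are hoping a multi-image request could look like. This is only a sketch, assuming llama-server is launched with the model plus its --mmproj projector and that its OpenAI-compatible /v1/chat/completions endpoint accepts several image_url parts in one user message; the file names, model names, and port below are placeholders, and whether multiple images are actually consumed end-to-end is exactly what we are asking about.

```python
# Sketch only (unverified): send several images in one chat request to a local
# llama-server started with a Qwen-VL GGUF and its --mmproj projector, e.g.
#   llama-server -m qwen-vl-3b.gguf --mmproj mmproj-qwen-vl-3b.gguf --port 8080
# Payload shape follows the OpenAI-compatible chat completions API.
import base64
import requests  # any HTTP client works

def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

image_paths = ["frame1.jpg", "frame2.jpg", "frame3.jpg"]  # hypothetical inputs

content = [{"type": "text",
            "text": "Summarize what happens across these images as one story."}]
content += [{"type": "image_url", "image_url": {"url": to_data_uri(p)}}
            for p in image_paths]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # assumed local server/port
    json={"messages": [{"role": "user", "content": content}],
          "max_tokens": 512},
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```

If the runtime only ever processes the first image, the fallback we can think of is captioning each image separately and then summarizing the captions with a text-only pass, but that loses cross-image context, which is why native multi-image support would be very valuable.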
Kindly help!