[Feature Request / Discussion] Add multi-image support for Qwen2.5-VL models in llama.cpp #16802
Sanskar002 started this conversation in Ideas
Hey everyone,
I’ve been experimenting with Qwen2.5-VL-3B-Instruct inside llama.cpp, and I wanted to start a discussion about extending it to support multi-image inference — something that works in the official PyTorch pipeline but is currently limited in the llama.cpp implementation.
Context & Goal
Goal:
Enable inference with multiple image inputs (e.g., img1, img2, …) in a single prompt, so the model can reason across all of them.
Example use case:
Summarizing a series of related photos, comparing two documents, or analyzing a timeline of events.
Current Observations (limitations in llama.cpp)
While reviewing the source code and testing the inference path with Qwen2.5-VL, I found:
The mtmd_tokenize() API accepts multiple images (const mtmd_bitmap *images[], size_t n_images), but internally only one image embedding is actually processed.
Functions like mtmd_helper_eval_chunks() and llava_image_embed_make_with_image() assume a single vision-tower forward pass and store only one embedding tensor.
There’s no logic for merging or concatenating multiple vision embeddings before feeding into the multimodal adapter.
(In contrast, the official Qwen processor concatenates them before tokenization.)
The multimodal adapter matrix in qwen2vl is built for one [CLS] embedding, so multi-image reasoning requires changes at that level, e.g. merging the per-image embeddings before the adapter.
Current workaround I'm using: running each image through its own single-image prompt. This works but lacks cross-image reasoning ability; a rough sketch of the fusion step I have in mind follows below.
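To make that fusion idea concrete, here is a minimal NumPy sketch (not llama.cpp code; the embedding width and token counts are made-up placeholders). Each image would yield a (n_tokens_i, n_embd) block of vision embeddings, and concatenating along the token axis gives the adapter a single combined block to project:

```python
# Minimal sketch of the embedding-fusion step, assuming each image's vision
# pass produces a (n_tokens_i, n_embd) float block. Not llama.cpp code.
import numpy as np

n_embd = 1280  # hypothetical vision embedding width


def fuse_image_embeddings(per_image_embeds):
    """Concatenate per-image embedding blocks along the token axis."""
    assert all(e.shape[1] == per_image_embeds[0].shape[1] for e in per_image_embeds)
    return np.concatenate(per_image_embeds, axis=0)


# Two images with different numbers of vision tokens (e.g. different resolutions).
img1 = np.random.randn(256, n_embd).astype(np.float32)
img2 = np.random.randn(324, n_embd).astype(np.float32)

fused = fuse_image_embeddings([img1, img2])
print(fused.shape)  # (580, 1280)
```

The same bookkeeping would apply on the C++ side: the adapter input only grows along the token dimension, so no per-image shape information is lost.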
What I’m Proposing
I'd like to work (or collaborate) on extending llama.cpp to accept several images in a single prompt, fuse their vision embeddings, and pass the combined result through the multimodal adapter (e.g., handle sequences like <image1> <image2> Compare these.).
Proposed small goal example: get two images tokenized into separate chunks of one prompt and evaluated together; a rough sketch of that chunking follows below.
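As a rough illustration of that chunking (hypothetical helper and marker names, not the actual mtmd API), the prompt would be split on image markers into alternating text/image chunks, one image chunk per input bitmap:

```python
# Hypothetical sketch: split a prompt containing image markers into alternating
# text / image chunks, mirroring the kind of output mtmd_tokenize() would need
# to emit for multi-image prompts. Marker string is a placeholder.
import re

IMAGE_MARKER = "<__image__>"  # hypothetical placeholder token


def split_prompt(prompt: str):
    chunks = []
    for part in re.split(f"({re.escape(IMAGE_MARKER)})", prompt):
        if part == IMAGE_MARKER:
            chunks.append(("image", None))   # to be paired with the next bitmap
        elif part.strip():
            chunks.append(("text", part))
    return chunks


print(split_prompt(f"{IMAGE_MARKER} {IMAGE_MARKER} Compare these two documents."))
# [('image', None), ('image', None), ('text', ' Compare these two documents.')]
```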
Technical Notes & Evidence
Qwen2.5-VL official support:
The Hugging Face implementation accepts multiple images per conversation through the chat template and processor (hedged example below):
→ Qwen2.5-VL-3B-Instruct on Hugging Face
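For reference, a multi-image call in the official pipeline, adapted from the model card's usage pattern (file paths and generation settings are placeholders):

```python
# Multi-image inference with the official Hugging Face pipeline.
# Requires: transformers (with Qwen2.5-VL support), qwen-vl-utils, torch.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Two images plus a text instruction in a single user turn (paths are placeholders).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "img1.jpg"},
        {"type": "image", "image": "img2.jpg"},
        {"type": "text", "text": "Compare these two images."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```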
llama.cpp evidence:
llava.cpp#L350–L420: only one image_embed instance is used.
mtmd_helper_eval_chunks() runs a single vision forward pass per chunk.
So as of now, llama.cpp is limited to one vision embedding per prompt.
Request for Collaboration
Would love guidance or contributions from anyone familiar with the mtmd/llava internals.
I can help test on Android (OpenCL Adreno) and Linux builds, and share C++ patches once we agree on an approach.
My Setup
Any thoughts, suggestions, or pointers on implementing multi-image embedding fusion in llama.cpp would be appreciated.
Happy to open a draft PR once we finalize a direction.
Thanks,