[Feature Request / Discussion] Add multi-image support for Qwen2.5-VL models in llama.cpp #16802
Sanskar002 started this conversation in Ideas
Hey everyone,
I’ve been experimenting with Qwen2.5-VL-3B-Instruct inside llama.cpp, and I wanted to start a discussion about extending it to support multi-image inference — something that works in the official PyTorch pipeline but is currently limited in the llama.cpp implementation.
Context & Goal
Goal:
Enable inference with multiple image inputs (e.g., img1, img2, …) in a single prompt, so the model can reason across all of them.
Example use case:
Summarizing a series of related photos, comparing two documents, or analyzing a timeline of events.
Current Observations (limitations in llama.cpp)
While reviewing the source code and testing the inference path with Qwen2.5-VL, I found:
The mtmd_tokenize() API accepts multiple images (const mtmd_bitmap *images[], size_t n_images), but internally only one image embedding is actually processed.
Functions like mtmd_helper_eval_chunks() and llava_image_embed_make_with_image() assume a single vision-tower forward pass and store only one embedding tensor.
There’s no logic for merging or concatenating multiple vision embeddings before feeding into the multimodal adapter.
(In contrast, the official Qwen processor concatenates them before tokenization.)
The multimodal adapter matrix in qwen2vl is built for one [CLS] embedding, so multi-image reasoning requires changes at that level, e.g. merging the per-image embeddings before the adapter.
Current workaround I'm using: running each image through its own single-image prompt. This works but lacks cross-image reasoning ability; a rough sketch of the fusion step I have in mind follows below.
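To make that fusion idea concrete, here is a minimal NumPy sketch (not llama.cpp code; the embedding width and token counts are made-up placeholders). Each image would yield a (n_tokens_i, n_embd) block of vision embeddings, and concatenating along the token axis gives the adapter a single combined block to project:

```python
# Minimal sketch of the embedding-fusion step, assuming each image's vision
# pass produces a (n_tokens_i, n_embd) float block. Not llama.cpp code.
import numpy as np

n_embd = 1280  # hypothetical vision embedding width


def fuse_image_embeddings(per_image_embeds):
    """Concatenate per-image embedding blocks along the token axis."""
    assert all(e.shape[1] == per_image_embeds[0].shape[1] for e in per_image_embeds)
    return np.concatenate(per_image_embeds, axis=0)


# Two images with different numbers of vision tokens (e.g. different resolutions).
img1 = np.random.randn(256, n_embd).astype(np.float32)
img2 = np.random.randn(324, n_embd).astype(np.float32)

fused = fuse_image_embeddings([img1, img2])
print(fused.shape)  # (580, 1280)
```

The same bookkeeping would apply on the C++ side: the adapter input only grows along the token dimension, so no per-image shape information is lost.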
What I’m Proposing
I'd like to work (or collaborate) on extending llama.cpp to accept several images in a single prompt, fuse their vision embeddings, and pass the combined result through the multimodal adapter (e.g., handle sequences like <image1> <image2> Compare these.).
Proposed small goal example: get two images tokenized into separate chunks of one prompt and evaluated together; a rough sketch of that chunking follows below.
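As a rough illustration of that chunking (hypothetical helper and marker names, not the actual mtmd API), the prompt would be split on image markers into alternating text/image chunks, one image chunk per input bitmap:

```python
# Hypothetical sketch: split a prompt containing image markers into alternating
# text / image chunks, mirroring the kind of output mtmd_tokenize() would need
# to emit for multi-image prompts. Marker string is a placeholder.
import re

IMAGE_MARKER = "<__image__>"  # hypothetical placeholder token


def split_prompt(prompt: str):
    chunks = []
    for part in re.split(f"({re.escape(IMAGE_MARKER)})", prompt):
        if part == IMAGE_MARKER:
            chunks.append(("image", None))   # to be paired with the next bitmap
        elif part.strip():
            chunks.append(("text", part))
    return chunks


print(split_prompt(f"{IMAGE_MARKER} {IMAGE_MARKER} Compare these two documents."))
# [('image', None), ('image', None), ('text', ' Compare these two documents.')]
```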
Technical Notes & Evidence
Qwen2.5-VL official support:
The Hugging Face implementation accepts multiple images per conversation through the chat template and processor (hedged example below):
→ Qwen2.5-VL-3B-Instruct on Hugging Face
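For reference, a multi-image call in the official pipeline, adapted from the model card's usage pattern (file paths and generation settings are placeholders):

```python
# Multi-image inference with the official Hugging Face pipeline.
# Requires: transformers (with Qwen2.5-VL support), qwen-vl-utils, torch.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Two images plus a text instruction in a single user turn (paths are placeholders).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "img1.jpg"},
        {"type": "image", "image": "img2.jpg"},
        {"type": "text", "text": "Compare these two images."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```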
llama.cpp evidence:
llava.cpp#L350–L420: only one image_embed instance is used.
mtmd_helper_eval_chunks() runs a single vision forward pass per chunk.
So as of now, llama.cpp is limited to one vision embedding per prompt.
Request for Collaboration
Would love guidance or contributions from anyone familiar with the mtmd/llava internals.
I can help test on Android (OpenCL Adreno) and Linux builds, and share C++ patches once we agree on an approach.
My Setup
Any thoughts, suggestions, or pointers on implementing multi-image embedding fusion in llama.cpp would be appreciated.
Happy to open a draft PR once we finalize a direction.
Thanks,