[Benchmark] Support MVU_Eval #1348

Tianhao-Peng · 2025-12-05T04:34:47Z

Intro

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing benchmarks remain limited to single-video understanding.
To address this gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating multi-video understanding in MLLMs.

MVU-Eval contains 1,824 carefully curated QA pairs spanning 4,959 videos from diverse domains, covering both fundamental perception and high-order reasoning tasks.
It assesses eight core competencies: Object Recognition (OR), Spatial Understanding (SU), Counting, Comparison, Knowledge-Intensive Reasoning (KIR), In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and Temporal Reasoning (TR).

Evaluation Results

Model	Overall	OR	SU	Counting	Comparison	KIR	ICL	RAG	TR
Random Choice	26.0	25.5	25.3	24.3	13.6	25.0	25.0	25.0	34.0
Closed-source Models
Gemini 2.5 Pro	58.4	47.6	54.7	65.6	76.3	50.2	34.8	43.7	83.1
Gemini 1.5 Pro	57.3	51.6	55.3	66.1	67.4	43.1	47.6	44.0	78.6
Gemini 2.0 Flash	56.3	46.0	52.0	45.4	75.6	53.7	45.1	44.5	79.1
Open-source Models
Model Size > 40B
Qwen2.5-VL-72B	57.1	52.4	56.4	58.1	77.8	43.8	35.4	48.1	78.6
InternVL3-78B	50.6	42.9	56.4	49.8	72.6	43.8	34.1	49.0	56.8
InternVL2.5-78B	48.7	44.4	47.5	45.8	72.6	38.1	28.7	48.1	61.4
LLaVA-OneVision-72B	44.6	31.7	50.8	44.5	61.5	37.4	26.2	44.5	53.6
8B < Model Size ≤ 40B
Qwen2.5-VL-32B	55.6	48.4	57.0	59.5	71.1	43.4	28.7	48.4	76.9
InternVL3-38B	48.4	46.0	46.4	47.1	69.6	42.0	30.5	42.8	61.1
InternVL2.5-38B	44.5	37.3	40.8	40.1	67.4	40.2	28.0	43.1	54.7
4B < Model Size ≤ 8B
Qwen2.5-VL-7B	51.9	50.8	55.3	62.1	65.2	32.4	29.3	49.3	66.8
VideoChat-Flash-7B	48.5	48.4	55.9	55.5	67.4	38.1	25.0	43.1	57.1
VideoLLaMA3-7B	47.5	48.4	50.3	52.9	60.0	37.0	29.9	44.0	57.1
InternVideo2.5-8B	46.4	45.2	43.0	44.9	63.7	37.7	28.7	48.1	56.0
mPLUG-Owl3-7B	45.0	48.4	53.6	50.2	50.4	29.5	24.4	41.6	58.2
InternVL3-8B	41.7	41.3	44.1	31.3	54.8	34.5	26.8	43.7	52.5
InternVL2.5-8B	41.1	38.1	40.8	28.2	54.8	36.9	28.0	44.5	51.1
LLaVA-OneVision-7B	40.4	40.5	36.3	36.6	45.9	29.9	28.0	45.1	51.5
MiniCPM-o	40.6	31.0	45.3	37.9	63.7	26.7	21.3	42.5	52.0
Slow-Fast-MLLM-7B	38.7	44.4	38.5	37.4	54.8	20.3	24.4	46.9	44.5
MiniCPM-V	37.9	34.1	41.3	32.6	45.9	26.3	23.2	43.7	47.7
LLaVA-Video-7B	27.4	26.2	26.3	35.7	43.0	7.9	22.0	18.9	42.4
LLaVA-NeXT-Video-7B	26.8	22.2	29.1	23.8	20.7	27.8	12.8	28.9	34.9
Model Size ≤ 4B
Qwen2.5-VL-3B	46.2	46.0	45.8	44.1	46.7	36.3	27.4	46.3	63.3
InternVL2.5-4B	37.3	32.5	40.2	28.2	45.2	33.8	17.7	42.8	46.4
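
For readers who want to see how accuracy figures of this kind are typically derived, here is a minimal sketch of per-task and overall scoring for multiple-choice QA. It is not the scoring code from this PR. The `score` function, the `(task, predicted, correct)` record layout, and the toy records are all hypothetical, and the sketch assumes that Overall is the fraction of all QA pairs answered correctly rather than the mean of the per-task scores, which the description above does not state.

```python
from collections import defaultdict


def score(records):
    """Return (overall_accuracy, per_task_accuracy) in percent.

    `records` is an iterable of (task, predicted_option, correct_option)
    tuples. This layout is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, predicted, correct in records:
        totals[task] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        hits[task] += int(predicted.strip().upper() == correct.strip().upper())

    per_task = {task: 100.0 * hits[task] / totals[task] for task in totals}
    # Assumed definition: correct answers over all questions, not a mean of task scores.
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    return overall, per_task


# Toy records for illustration; these are not MVU-Eval data.
records = [
    ("OR", "A", "A"),
    ("OR", "b", "C"),
    ("TR", "D", "D"),
    ("TR", "B", "B"),
    ("Counting", "C", "A"),
]
overall, per_task = score(records)
print(overall, per_task)
```

With tasks of unequal size, this pooled definition of Overall generally differs from the plain average of the eight task columns.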

References

@inproceedings{
  peng2025mvueval,
  title={{MVU}-Eval: Towards Multi-Video Understanding Evaluation for Multimodal {LLM}s},
  author={Tianhao Peng and Haochen Wang and Yuanxing Zhang and Zekun Moore Wang and Zili Wang and Ge Zhang and Jian Yang and Shihao Li and Yanghai Wang and Xintao Wang and Houyi Li and Wei Ji and Pengfei Wan and Wenhao Huang and Zhaoxiang Zhang and Jiaheng Liu},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025},
  url={https://openreview.net/forum?id=UZD5CQV6f9}
}

Tianhao-Peng added 2 commits December 5, 2025 12:29

add mvu_eval

9ec215c

Update video_dataset_config.py

ffc5378
