Intro

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing benchmarks remain limited to single-video understanding.
To address this gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating multi-video understanding in MLLMs.

MVU-Eval contains 1,824 carefully curated QA pairs spanning 4,959 videos from diverse domains, covering both fundamental perception and high-order reasoning tasks.
It assesses eight core competencies: Object Recognition, Spatial Understanding, Counting, Comparison, Knowledge-Intensive Reasoning, In-Context Learning, Retrieval-Augmented Generation, and Temporal Reasoning.
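To make the task format concrete, here is a minimal scoring sketch for a multiple-choice benchmark of this shape. It is illustrative only: the field names (`id`, `answer`, `category`) and the JSON layout are assumptions for illustration, not the released MVU-Eval schema.

```python
# Minimal scoring sketch for a multiple-choice benchmark of this shape.
# NOTE: the field names ("id", "answer", "category") and the JSON layout
# are assumptions for illustration, not the released MVU-Eval schema.
import json
from collections import defaultdict

def load_items(path):
    """Load QA items from a JSON file assumed to contain a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def score(items, predictions):
    """Compute overall and per-category accuracy (in %).

    `predictions` maps an item id to a predicted option letter, e.g. "B".
    """
    per_cat = defaultdict(lambda: [0, 0])  # category -> [num_correct, num_total]
    for item in items:
        hit = int(predictions.get(item["id"]) == item["answer"])
        per_cat[item["category"]][0] += hit
        per_cat[item["category"]][1] += 1
    overall = 100.0 * sum(c for c, _ in per_cat.values()) / sum(t for _, t in per_cat.values())
    return overall, {cat: 100.0 * c / t for cat, (c, t) in per_cat.items()}
```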

Evaluation Results

Column abbreviations follow the eight competencies listed above (OR = Object Recognition, SU = Spatial Understanding, KIR = Knowledge-Intensive Reasoning, ICL = In-Context Learning, RAG = Retrieval-Augmented Generation, TR = Temporal Reasoning).

| Model | Overall | OR | SU | Counting | Comparison | KIR | ICL | RAG | TR |
|---|---|---|---|---|---|---|---|---|---|
| Random Choice | 26.0 | 25.5 | 25.3 | 24.3 | 13.6 | 25.0 | 25.0 | 25.0 | 34.0 |
| **Closed-source Models** | | | | | | | | | |
| Gemini 2.5 Pro | 58.4 | 47.6 | 54.7 | 65.6 | 76.3 | 50.2 | 34.8 | 43.7 | 83.1 |
| Gemini 1.5 Pro | 57.3 | 51.6 | 55.3 | 66.1 | 67.4 | 43.1 | 47.6 | 44.0 | 78.6 |
| Gemini 2.0 Flash | 56.3 | 46.0 | 52.0 | 45.4 | 75.6 | 53.7 | 45.1 | 44.5 | 79.1 |
| **Open-source Models** | | | | | | | | | |
| *Model Size > 40B* | | | | | | | | | |
| Qwen2.5-VL-72B | 57.1 | 52.4 | 56.4 | 58.1 | 77.8 | 43.8 | 35.4 | 48.1 | 78.6 |
| InternVL3-78B | 50.6 | 42.9 | 56.4 | 49.8 | 72.6 | 43.8 | 34.1 | 49.0 | 56.8 |
| InternVL2.5-78B | 48.7 | 44.4 | 47.5 | 45.8 | 72.6 | 38.1 | 28.7 | 48.1 | 61.4 |
| LLaVA-OneVision-72B | 44.6 | 31.7 | 50.8 | 44.5 | 61.5 | 37.4 | 26.2 | 44.5 | 53.6 |
| *8B < Model Size ≤ 40B* | | | | | | | | | |
| Qwen2.5-VL-32B | 55.6 | 48.4 | 57.0 | 59.5 | 71.1 | 43.4 | 28.7 | 48.4 | 76.9 |
| InternVL3-38B | 48.4 | 46.0 | 46.4 | 47.1 | 69.6 | 42.0 | 30.5 | 42.8 | 61.1 |
| InternVL2.5-38B | 44.5 | 37.3 | 40.8 | 40.1 | 67.4 | 40.2 | 28.0 | 43.1 | 54.7 |
| *4B < Model Size ≤ 8B* | | | | | | | | | |
| Qwen2.5-VL-7B | 51.9 | 50.8 | 55.3 | 62.1 | 65.2 | 32.4 | 29.3 | 49.3 | 66.8 |
| VideoChat-Flash-7B | 48.5 | 48.4 | 55.9 | 55.5 | 67.4 | 38.1 | 25.0 | 43.1 | 57.1 |
| VideoLLaMA3-7B | 47.5 | 48.4 | 50.3 | 52.9 | 60.0 | 37.0 | 29.9 | 44.0 | 57.1 |
| InternVideo2.5-8B | 46.4 | 45.2 | 43.0 | 44.9 | 63.7 | 37.7 | 28.7 | 48.1 | 56.0 |
| mPLUG-Owl3-7B | 45.0 | 48.4 | 53.6 | 50.2 | 50.4 | 29.5 | 24.4 | 41.6 | 58.2 |
| InternVL3-8B | 41.7 | 41.3 | 44.1 | 31.3 | 54.8 | 34.5 | 26.8 | 43.7 | 52.5 |
| InternVL2.5-8B | 41.1 | 38.1 | 40.8 | 28.2 | 54.8 | 36.9 | 28.0 | 44.5 | 51.1 |
| LLaVA-OneVision-7B | 40.4 | 40.5 | 36.3 | 36.6 | 45.9 | 29.9 | 28.0 | 45.1 | 51.5 |
| MiniCPM-o | 40.6 | 31.0 | 45.3 | 37.9 | 63.7 | 26.7 | 21.3 | 42.5 | 52.0 |
| Slow-Fast-MLLM-7B | 38.7 | 44.4 | 38.5 | 37.4 | 54.8 | 20.3 | 24.4 | 46.9 | 44.5 |
| MiniCPM-V | 37.9 | 34.1 | 41.3 | 32.6 | 45.9 | 26.3 | 23.2 | 43.7 | 47.7 |
| LLaVA-Video-7B | 27.4 | 26.2 | 26.3 | 35.7 | 43.0 | 7.9 | 22.0 | 18.9 | 42.4 |
| LLaVA-NeXT-Video-7B | 26.8 | 22.2 | 29.1 | 23.8 | 20.7 | 27.8 | 12.8 | 28.9 | 34.9 |
| *Model Size ≤ 4B* | | | | | | | | | |
| Qwen2.5-VL-3B | 46.2 | 46.0 | 45.8 | 44.1 | 46.7 | 36.3 | 27.4 | 46.3 | 63.3 |
| InternVL2.5-4B | 37.3 | 32.5 | 40.2 | 28.2 | 45.2 | 33.8 | 17.7 | 42.8 | 46.4 |
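
The Random Choice row varies across columns, which is consistent with the number of answer options differing by category. Below is a minimal sketch (not from the MVU-Eval release) of how such a baseline could be estimated, under the assumption that each question carries an explicit `options` list.

```python
# Minimal sketch (not from the MVU-Eval release) of estimating a per-category
# random-choice baseline: the expected accuracy of a uniform random guess is
# the mean of 1/num_options over that category's questions.
from collections import defaultdict

def random_baseline(items):
    """items: iterable of dicts with assumed keys 'category' and 'options'."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item in items:
        sums[item["category"]] += 1.0 / len(item["options"])
        counts[item["category"]] += 1
    return {cat: 100.0 * sums[cat] / counts[cat] for cat in sums}

# Example: a category whose questions all have four options yields 25.0,
# matching the ballpark of most columns in the Random Choice row above.
print(random_baseline([
    {"category": "Counting", "options": ["A", "B", "C", "D"]},
    {"category": "Counting", "options": ["A", "B", "C", "D"]},
]))
```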

References

@inproceedings{peng2025mvueval,
  title={{MVU}-Eval: Towards Multi-Video Understanding Evaluation for Multimodal {LLM}s},
  author={Tianhao Peng and Haochen Wang and Yuanxing Zhang and Zekun Moore Wang and Zili Wang and Ge Zhang and Jian Yang and Shihao Li and Yanghai Wang and Xintao Wang and Houyi Li and Wei Ji and Pengfei Wan and Wenhao Huang and Zhaoxiang Zhang and Jiaheng Liu},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2025},
  url={https://openreview.net/forum?id=UZD5CQV6f9}
}
