Website · Code · Leaderboard · Dataset · Dataset-Zip · Issue
Each reported number is the average score for the corresponding category (overall, temporal, spatial, and intent reasoning).
| Difficulty | Model | Size | Overall Avg. | Temporal | Spatial | Intent |
|---|---|---|---|---|---|---|
| Hard | GPT 4o | - | 22.21 | 24.92 | 27.14 | 13.80 |
| Hard | Gemini 2.5 Pro 🥇 | - | 31.01 | 38.18 | 30.08 | 25.20 |
| Hard | Gemini 1.5 Pro | - | 19.07 | 22.53 | 21.57 | 17.25 |
| Hard | Claude 3.5 | - | 28.89 | 32.84 | 29.18 | 23.41 |
| Hard | InternVL2.5 | 26B | 22.45 | 25.33 | 27.42 | 12.64 |
| Hard | InternVL2.5 | 8B | 20.39 | 21.30 | 29.41 | 11.42 |
| Hard | InternVL2.5 | 4B | 17.31 | 17.39 | 23.04 | 13.13 |
| Hard | LLaVA Next | 32B | 17.83 | 11.28 | 26.09 | 10.10 |
| Hard | LLaVA Video | 7B | 17.35 | 13.02 | 27.49 | 10.18 |
| Hard | LLaVA OneVision | 7B | 14.27 | 9.55 | 24.74 | 10.15 |
| Hard | Qwen2.5 VL | 32B | 19.39 | 13.19 | 27.85 | 14.05 |
| Hard | Qwen2.5 VL | 7B | 20.34 | 12.31 | 28.40 | 15.48 |
| Medium | GPT 4o 🥇 | - | 41.21 | 44.89 | 47.03 | 28.19 |
| Medium | Gemini 2.5 Pro | - | 41.07 | 41.31 | 48.33 | 33.06 |
| Medium | Gemini 1.5 Pro | - | 37.13 | 40.69 | 43.81 | 31.06 |
| Medium | Claude 3.5 | - | 37.99 | 36.46 | 47.34 | 31.09 |
| Medium | InternVL2.5 | 26B | 36.39 | 37.85 | 47.51 | 27.55 |
| Medium | InternVL2.5 | 8B | 35.44 | 39.85 | 51.07 | 18.98 |
| Medium | InternVL2.5 | 4B | 36.53 | 31.21 | 45.36 | 32.68 |
| Medium | LLaVA Next | 32B | 21.07 | 13.57 | 33.08 | 14.24 |
| Medium | LLaVA Video | 7B | 24.04 | 19.33 | 30.50 | 19.72 |
| Medium | LLaVA OneVision | 7B | 17.76 | 17.81 | 24.71 | 17.12 |
| Medium | Qwen2.5 VL | 32B | 29.93 | 23.34 | 41.94 | 25.82 |
| Medium | Qwen2.5 VL | 7B | 28.79 | 22.18 | 34.64 | 22.89 |
| Easy | GPT 4o | - | 45.01 | 55.33 | 38.08 | 43.72 |
| Easy | Gemini 2.5 Pro 🥇 | - | 59.36 | 61.16 | 54.51 | 58.09 |
| Easy | Gemini 1.5 Pro | - | 48.05 | 53.22 | 47.85 | 45.37 |
| Easy | Claude 3.5 | - | 50.14 | 53.28 | 48.51 | 46.40 |
| Easy | InternVL2.5 | 26B | 55.08 | 58.41 | 53.46 | 44.45 |
| Easy | InternVL2.5 | 8B | 51.03 | 53.64 | 54.52 | 42.20 |
| Easy | InternVL2.5 | 4B | 48.93 | 46.55 | 52.31 | 43.65 |
| Easy | LLaVA Next | 32B | 35.32 | 31.22 | 40.09 | 34.34 |
| Easy | LLaVA Video | 7B | 30.44 | 29.41 | 34.12 | 31.64 |
| Easy | LLaVA OneVision | 7B | 31.10 | 29.46 | 33.78 | 29.88 |
| Easy | Qwen2.5 VL | 32B | 48.35 | 50.68 | 47.82 | 44.97 |
| Easy | Qwen2.5 VL | 7B | 37.97 | 38.87 | 33.20 | 36.45 |
More results can be found at https://open-space-reasoning.github.io/.
This benchmark includes approximately 2,000 videos and 19,000 human-annotated question-answer pairs covering a wide range of reasoning tasks (as shown in Figure 1). For efficient evaluation, we also provide a sample set of approximately 4K examples randomly selected from the full 19K-example dataset. All annotations were performed by highly educated annotators, each holding at least a master's degree in an engineering-related field such as mathematics or computer science.

The dataset features a variety of video lengths, categories, and frame counts, and spans three primary open-space reasoning scenarios: land space, water space, and air space. An overview of the dataset's characteristics is shown in Figure 2, which illustrates the distributions of video duration, domain coverage, and reasoning styles. During annotation, we first design the hard-level tasks and label each question with its ground-truth answer; based on these, we then construct the medium and easy tasks. The primary difference between difficulty levels lies in the number and types of answer choices. Details of the annotation procedure and difficulty levels are provided in our paper.
One example from air space:

```json
{
  "id": 1,
  "dataset": "air_space_long",
  "scene_name": "air_space_long_1.mp4",
  "reasoning_style": "intent_goal_reasoning",
  "question": "How many moving airplanes are observed in this video?",
  "ground_truth": "A",
  "options": [
    "E. [0,1]",
    "C. [8,9]",
    "A. [4,5]",
    "D. [6,7]",
    "B. [10,11]",
    "F. [2,3]"
  ]
}
```

For development, you can install the package by cloning the repository and running the following commands:
```bash
pip install uv
git clone [email protected]:SafeRL-Lab/m4r.git
cd m4r
uv venv dev
source dev/bin/activate
uv pip install -e .
uv pip install -U "qwen-vl-utils"
```

You can download the dataset directly from our Hugging Face repository:
```bash
git lfs install
git clone https://huggingface.co/datasets/Open-Space-Reasoning/M4R
```
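If git lfs is inconvenient, the same files can usually be fetched with the `huggingface_hub` Python client. The snippet below is a minimal sketch; the `local_dir` target is a placeholder rather than a path this repository expects.

```python
# Minimal sketch: fetch the dataset with huggingface_hub instead of git lfs.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Open-Space-Reasoning/M4R",
    repo_type="dataset",
    local_dir="/your-dataset-path",  # placeholder target directory
)
print(f"Dataset downloaded to: {local_path}")
```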
If you encounter any issues during the download, we also provide a zipped version for convenience: Download Dataset (ZIP)
Here's a basic evaluation example:
Download the dataset from Hugging Face, and set the dataset path in the corresponding task configuration file. For example, specify the dataset path as `/your-dataset-path/land_space/short/hard/spatial_reasoning.json` in the task configuration file located at `/Open-Space-Reasoning/lmms_eval/tasks/land_space_short/land_space_hard.yaml`.
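Before launching a run, it can help to sanity-check the annotation file you just configured. The sketch below is only an illustration (not part of the toolkit); it assumes each annotation file is a JSON list with the fields shown in the example entry above.

```python
# Illustration only: verify that the configured annotation file has the documented fields.
# Assumes the file is a JSON list of entries; the path matches the example above.
import json

path = "/your-dataset-path/land_space/short/hard/spatial_reasoning.json"
required = {"id", "dataset", "scene_name", "reasoning_style",
            "question", "ground_truth", "options"}

with open(path) as f:
    records = json.load(f)  # assumed to be a JSON list of entries

missing = [rec.get("id") for rec in records if required - rec.keys()]
print(f"{len(records)} entries loaded; {len(missing)} entries with missing fields: {missing}")
```

Then launch the evaluation: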
```bash
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=12845056,use_flash_attention_2=False,interleave_visuals=True \
    --tasks land_space_hard \
    --batch_size 1 \
    --log_samples \
    --output_path /pasteur2/u/xhanwang/lmms-eval/outputs/land_space_hard/
```

Modify the following examples, following the script above, to test more models.
More examples can be found in `examples/models`.
Evaluation of OpenAI-Compatible Model

```bash
bash examples/models/openai_compatible.sh
bash examples/models/xai_grok.sh
```

Evaluation of vLLM

```bash
bash examples/models/vllm_qwen2vl.sh
```

Evaluation of LLaVA-OneVision

```bash
bash examples/models/llava_onevision.sh
```

Evaluation of LLaMA-3.2-Vision

```bash
bash examples/models/llama_vision.sh
```

Evaluation of Qwen2-VL

```bash
bash examples/models/qwen2_vl.sh
bash examples/models/qwen2_5_vl.sh
```

Evaluation of LLaVA on MME

If you want to test LLaVA 1.5, you will have to clone their repo from LLaVA first, and then run:

```bash
bash examples/models/llava_next.sh
```

Evaluation with tensor parallel for bigger model (llava-next-72b)

```bash
bash examples/models/tensor_parallel.sh
```

Evaluation with SGLang for bigger model (llava-next-72b)

```bash
bash examples/models/sglang.sh
```

Evaluation with vLLM for bigger model (llava-next-72b)

```bash
bash examples/models/vllm_qwen2vl.sh
```

More Parameters

```bash
python3 -m lmms_eval --help
```

Environmental Variables

Before running experiments and evaluations, we recommend you export the following environment variables. Some are necessary for certain tasks to run.
```bash
export OPENAI_API_KEY="<YOUR_API_KEY>"
export HF_HOME="<Path to HF cache>"
export HF_TOKEN="<YOUR_API_KEY>"
export HF_HUB_ENABLE_HF_TRANSFER="1"
export REKA_API_KEY="<YOUR_API_KEY>"
# Other possible environment variables include
# ANTHROPIC_API_KEY, DASHSCOPE_API_KEY, etc.
```
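A quick check such as the sketch below (an illustration, not part of the toolkit) reports which of these variables are currently set:

```python
# Illustration only: report which of the environment variables listed above are set.
import os

variables = [
    "OPENAI_API_KEY",
    "HF_HOME",
    "HF_TOKEN",
    "HF_HUB_ENABLE_HF_TRANSFER",
    "REKA_API_KEY",
    "ANTHROPIC_API_KEY",
    "DASHSCOPE_API_KEY",
]

for name in variables:
    print(f"{name}: {'set' if os.environ.get(name) else 'missing'}")
```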
Common Environment Issues

Sometimes you might encounter common issues, for example errors related to httpx or protobuf. To resolve these issues, you can first try:
```bash
python3 -m pip install httpx==0.23.3;
python3 -m pip install protobuf==3.20;
# If you are using numpy==2.x, it may sometimes cause errors
python3 -m pip install numpy==1.26;
# Sometimes sentencepiece is required for the tokenizer to work
python3 -m pip install sentencepiece;
```

If you find the repository useful, please cite the study:
```bibtex
@article{gu2025accidentbench,
  title={AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond},
  author={Gu, Shangding and Wang, Xiaohan and Ying, Donghao and Zhao, Haoyu and Yang, Runing and Jin, Ming and Li, Boyi and Pavone, Marco and Yeung-Levy, Serena and Wang, Jun and others},
  journal={arXiv preprint arXiv:2509.26636},
  year={2025}
}
```

This repository is adapted from lmms-eval for use in our benchmark. We thank the contributors of lmms-eval for their efforts and contributions.





{ "id": , "dataset": "str", // e.g., sub dataset filename "scene_name": "str", // e.g., video filename "reasoning_style": "str", // e.g., temporal_reasoning, intent_goal_reasoning, etc. "question": "str", // The reasoning question related to the scene "ground_truth": "str", // Correct answer key (e.g., "A", "B", etc.) "options": ["str", "str", "str", "str", "str", "str"] // Multiple-choice options }