Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning



Illustration of outcome-hacking, where the generated video has the correct final state but an incorrect process.


👀 Overview

Current video generation models often suffer from outcome-hacking: they generate a video whose final state is correct while the process leading to it is wrong. This games traditional evaluation metrics that judge only a single final frame.

VIPER (VIdeo Process Evaluation for Reasoning) is designed to bridge this gap:

  • 🏆 Comprehensive Benchmark: 309 carefully curated samples spanning 6 distinct domains (Temporal, Structural, Symbolic, Spatial, Physics, and Planning).
  • 📏 New Metric (POC@r): Process-Outcome Consistency. We evaluate correctness at both the process and outcome levels by uniformly sampling frames at rate $r$ (see the sketch after this list).
  • 🚫 Failure Patterns: We identify and summarize four common failure patterns in current generative video models.
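
To make this concrete, below is a minimal sketch of a POC@r-style check. It assumes that sampling "at rate $r$" means keeping every r-th frame, and that judge_process and judge_outcome are hypothetical stand-ins for the MLLM judge; it illustrates the idea, not the repository's implementation.

from typing import Callable, List

def poc_at_r(frames: List, r: int,
             judge_process: Callable[[object], bool],
             judge_outcome: Callable[[object], bool]) -> bool:
    """Sketch of Process-Outcome Consistency: a video passes only if
    the sampled process frames AND the final outcome are judged correct."""
    sampled = frames[::r]  # assumption: "rate r" = keep every r-th frame
    process_ok = all(judge_process(frame) for frame in sampled)
    outcome_ok = judge_outcome(frames[-1])  # outcome = the final state
    return process_ok and outcome_ok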


Overview of VIPER. VIPER consists of 16 tasks across 6 domains.

📊 Dataset Statistics

VIPER covers diverse reasoning tasks to ensure a holistic evaluation of video generation capabilities.

Domain       Samples   Task Types
Physics      32        experiment, game
Planning     44        navigation, manipulation
Spatial      60        rotate, restore
Structural   70        chess, maze, sudoku
Symbolic     60        math, multimodal
Temporal     43        obj_move, zoom

🚀 Quick Start

Download

from datasets import load_dataset

# Load the full VIPER benchmark
dataset = load_dataset("Monosail/VIPER")

Data Fields

  • id: Unique identifier for the sample
  • domain: The reasoning domain (Physics, Planning, Spatial, Structural, Symbolic, Temporal)
  • task_type: Specific task category within the domain
  • prompt: Text prompt describing the task
  • image: The input image
  • reference_frames: Ground-truth image frames
  • reference_texts: Ground-truth text descriptions
  • protocol: Process-level task constraints
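
As a quick sanity check, these fields can be inspected on a single sample right after loading. The snippet below assumes only the field names listed above; split handling is kept generic, since split names are not documented here.

from datasets import load_dataset

dataset = load_dataset("Monosail/VIPER")
split = next(iter(dataset.values()))  # first available split, whatever it is named
sample = split[0]

print(sample["id"], sample["domain"], sample["task_type"])
print(sample["prompt"])
print(sample["protocol"])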

🛠️ Evaluation

The evaluation pipeline is split into two stages: inference and judgement. For inference, we provide scripts that generate model outputs on the VIPER benchmark for the following supported models:

  • Closed-source (API)
    • Sora2
    • Veo3.1
    • Seedance 1.5 Pro (Opened)
    • Wan2.6 (Opened)
  • Open-source
    • Wan2.2
    • Hunyuan-1.5

Inference

Seedance 1.5 Pro

To run video inference with Seedance 1.5 Pro:

bash scripts/run_sd.sh

Prerequisites:

  • Apply for the Seedance API
  • Set the environment variable ARK_API_KEY

Wan2.6

To run video inference with Wan2.6:

bash scripts/run_wan26.sh

Prerequisites:

  • Apply for the Wan2.6 API
  • Set the environment variable DASHSCOPE_API_KEY
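
Both inference scripts read their credentials from the environment, so a small preflight check can catch a missing key before any videos are generated. The helper below is a hypothetical convenience, not part of the repository.

import os
import sys

# Provider -> required environment variable, per the prerequisites above.
REQUIRED_KEYS = {
    "Seedance 1.5 Pro": "ARK_API_KEY",
    "Wan2.6": "DASHSCOPE_API_KEY",
}

missing = [var for var in REQUIRED_KEYS.values() if not os.environ.get(var)]
if missing:
    sys.exit(f"Missing API keys: {', '.join(missing)}")
print("All API keys set.")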

Judgement

During judgement, we use the OpenRouter API and default to gpt-5 as the judge. You may use any MLLM as long as it is compatible with the provider endpoint.
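
A minimal sketch of one judgement call through OpenRouter's OpenAI-compatible endpoint is shown below. The model slug openai/gpt-5, the OPENROUTER_API_KEY variable, and the prompt are illustrative assumptions, not the repository's actual judging protocol.

import os
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed variable name
)

# Illustrative prompt; the real judge would see frames plus the task protocol.
response = client.chat.completions.create(
    model="openai/gpt-5",  # assumed OpenRouter slug for gpt-5
    messages=[{"role": "user",
               "content": "Does this frame satisfy the process constraint? Answer yes or no."}],
)
print(response.choices[0].message.content)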

📝 Citation

If you find our benchmark useful for your research, please consider citing:

@article{li2026viper,
  title={Beyond the Last Frame: Process-aware Evaluation for Generative Video Reasoning},
  author={Li, Yifan and Gu, Yukai and Min, Yingqian and Liu, Zikang and Du, Yifan and Zhou, Kun and Yang, Min and Zhao, Wayne Xin and Qiu, Minghui},
  journal={arXiv preprint arXiv:2512.24952},
  year={2025}
}
