Agentic Video Understanding Benchmark: 100 multiple-choice video QA questions, 26 options each (A-Z; ~3.8% random baseline). Two evaluation frameworks, one entry point (inference.py).
by Seoyeon An*, Hyeonseo Jang*, Minsu Kim*, Chanho Lee, Younghan Park, Kangwook Lee (KRAFTON AI)
System requirement: ffmpeg on PATH (used by singleturn to sample frames). On Debian/Ubuntu: sudo apt install ffmpeg; on macOS: brew install ffmpeg.
git clone https://github.com/krafton-ai/agentvidbench.git && cd agentvidbench
conda create -n agentvideobench python=3.12 -y
conda activate agentvideobench
pip install -r requirements.txt
hf download agentvidbench/agentvidbench --repo-type dataset --local-dir dataset
This populates:
dataset/
├── questions.jsonl # 100 rows, one per Q (incl. transcript_path column)
├── videos.jsonl # video metadata
├── videos/video*.mp4 # 71 video files
└── transcripts/video*.srt
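A quick sanity check after the download can catch a partial or interrupted transfer. The sketch below is not part of the repository: `load_jsonl` and `missing_transcripts` are hypothetical helper names, and it assumes that `transcript_path` is stored relative to the `dataset/` root. Only the `transcript_path` column name comes from the layout above.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def missing_transcripts(rows, dataset_root):
    """Return transcript_path values that do not resolve to a file.

    Assumption: transcript_path is relative to the dataset root.
    """
    root = Path(dataset_root)
    missing = []
    for row in rows:
        rel = row.get("transcript_path")
        if rel and not (root / rel).is_file():
            missing.append(rel)
    return missing


if __name__ == "__main__":
    questions_file = Path("dataset") / "questions.jsonl"
    if questions_file.is_file():
        questions = load_jsonl(questions_file)
        print(len(questions), "questions loaded")  # the layout above lists 100 rows
        print(len(missing_transcripts(questions, "dataset")), "transcripts missing")
```

If the relative-path assumption does not hold for your copy of the dataset, adjust how `root / rel` is resolved before trusting the "missing" count.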
Copy .env.example to .env and fill in only the rows your chosen models / framework need:
cp .env.example .env

| variable | required by |
|---|---|
| GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_LOCATION | every Gemini-family run: ours (always), singleturn when --model gemini-* |
| AVB_GCS_BUCKET, AVB_GCS_PREFIX | same as above (videos uploaded once, cached) |
| OPENAI_API_KEY | gpt-*, o1, o3, o4 models |
| ANTHROPIC_API_KEY | claude-* models |
| VLLM_BASE_URL, VLLM_API_KEY | only if vLLM serves on a non-default host/port |

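The table can be read as a small lookup from (model, framework) to the variables that must be set. The following is a minimal sketch of that reading, not the repository's own validation code: `required_env` and `missing_env` are hypothetical names, and matching OpenAI models by the prefixes `gpt-`, `o1`, `o3`, `o4` is an assumption about how model ids are spelled.

```python
import os

GEMINI_VARS = [
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "AVB_GCS_BUCKET",
    "AVB_GCS_PREFIX",
]


def required_env(model, framework):
    """List the variables the table asks for, given a model id and a framework."""
    needed = []
    # Vertex/GCS rows: always for "ours", and for singleturn with gemini-* models.
    if framework == "ours" or model.startswith("gemini-"):
        needed += GEMINI_VARS
    # OpenAI row: gpt-* plus the o1 / o3 / o4 families (prefix match is an assumption).
    if model.startswith(("gpt-", "o1", "o3", "o4")):
        needed.append("OPENAI_API_KEY")
    # Anthropic row.
    if model.startswith("claude-"):
        needed.append("ANTHROPIC_API_KEY")
    return needed


def missing_env(model, framework, environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required_env(model, framework) if not env.get(name)]
```

Calling `missing_env("<id>", "ours")` before a long run reports unset variables up front. The vLLM variables are left out because the table marks them as needed only for a non-default host or port.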
For Gemini-family runs also authenticate gcloud once:
gcloud auth application-default login
gcloud storage buckets create gs://<your-bucket> --location=us-central1 # if you don't have one

Skip this section if you only run API models (Gemini / OpenAI / Anthropic).
Open-source models (Qwen, Gemma, Kimi-VL, etc.) are served via vLLM. Install and start the server in a separate shell:
pip install vllm==0.20.1
vllm serve <model-id> --port 8000 \
--allowed-local-media-path "$(pwd)/dataset/videos" \
--tensor-parallel-size 1 --gpu-memory-utilization 0.85 \
--max-model-len 32768

--max-model-len 32768 is needed on a single 24 GB GPU; the default for recent VL models (262K context) demands ~36 GB of KV cache and OOMs.
Examples of <model-id>: Qwen/Qwen3-VL-4B-Instruct, Qwen/Qwen3-VL-2B-Instruct, google/gemma-4-E2B-it. inference.py will refuse to launch and print this same vllm serve line if the endpoint isn't reachable.
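To confirm the server is up before starting a run, you can probe it yourself. This is a sketch of one way to do that, not the check inference.py actually performs: `vllm_reachable` is a hypothetical name, the default of `http://localhost:8000/v1` is an assumption based on the `--port 8000` flag above, and the probe relies on the `/v1/models` route of vLLM's OpenAI-compatible server.

```python
import http.client
import os
import urllib.request


def vllm_reachable(base_url=None, timeout=3.0):
    """Return True if GET <base_url>/models answers with HTTP 200."""
    # Assumed default; override with VLLM_BASE_URL for a non-default host/port.
    base = base_url or os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
    try:
        req = urllib.request.Request(base.rstrip("/") + "/models")
        api_key = os.environ.get("VLLM_API_KEY")
        if api_key:
            req.add_header("Authorization", f"Bearer {api_key}")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP errors, and malformed URLs
        # all count as "not reachable".
        return False


if __name__ == "__main__":
    print("vLLM reachable:", vllm_reachable())
```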
| framework | what it does | external deps |
|---|---|---|
| singleturn | One model call per question | API key for the chosen model family |
| ours | ReAct planner + Vertex Gemini analyze_video tool + offline transcript tool | Vertex AI + GCS bucket |

python inference.py --model <id> --framework {singleturn,ours} --tag run1
Re-running with the same --tag resumes (per-question results are cached).
Outputs:
exp/<framework>_<model>_<tag>/
└── inference/
├── progress/question<N>.json # per-Q metadata + tokens (resume marker)
├── trajectories/question<N>.txt # canonical judge-input string (every framework)
└── summary.json # inference manifest (results + run metadata)
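Because progress/question<N>.json acts as the resume marker, you can inspect how far a run has got by listing those files. The sketch below is illustrative only: `completed_questions` and `pending_questions` are hypothetical helpers, and numbering questions from 1 to 100 is an assumption drawn from the benchmark size, not from the code.

```python
import re
from pathlib import Path

_PROGRESS_NAME = re.compile(r"question(\d+)\.json")


def completed_questions(run_dir):
    """Return sorted question numbers that have a progress marker file."""
    progress = Path(run_dir) / "inference" / "progress"
    done = set()
    if progress.is_dir():
        for path in progress.iterdir():
            match = _PROGRESS_NAME.fullmatch(path.name)
            if match:
                done.add(int(match.group(1)))
    return sorted(done)


def pending_questions(run_dir, total=100):
    """Return question numbers in 1..total without a marker (numbering assumed)."""
    done = set(completed_questions(run_dir))
    return [n for n in range(1, total + 1) if n not in done]
```

For example, `pending_questions("exp/<framework>_<model>_<tag>")` lists what a re-run with the same --tag would still have to process, under the numbering assumption above.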
python evaluate.py exp/<framework>_<model>_<tag>/ \
[--questions 1,5,10-20] \
[--workers 16] \
[--n-extract 8] \
[--judge-provider anthropic] # or openai | gemini_vertex

Requires:
- GOOGLE_CLOUD_PROJECT (letter extraction always runs on Vertex Gemini Flash).
- API key for the chosen --judge-provider (ANTHROPIC_API_KEY, OPENAI_API_KEY, or Vertex auth).

Re-running skips question IDs that already have a cached evaluation/results/question*.json. Pass --rerun to re-score every question.
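The --questions value shown above (1,5,10-20) mixes single IDs and ranges. One plausible way to expand such a spec is sketched here; `parse_question_spec` is a hypothetical name, and treating ranges as inclusive on both ends is an assumption, since the actual parsing in evaluate.py is not shown.

```python
def parse_question_spec(spec):
    """Expand a spec such as "1,5,10-20" into a sorted list of unique ints."""
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            # Assumption: both endpoints are included.
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))
    return sorted(selected)
```

Under the inclusive-range assumption, `parse_question_spec("1,5,10-20")` selects questions 1, 5, and 10 through 20.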
Outputs:
exp/<framework>_<model>_<tag>/
├── inference/
└── evaluation/ # populated by `python evaluate.py …`
├── letters/question<N>.json
├── judge/question<N>.json
├── results/question<N>.json
└── summary.json # accuracy + process means
Browse summary.json and per-Q judge output across runs in a local web UI:
streamlit run viewer.py

Stopping the server. Ctrl+C is known to hang while Streamlit waits for open browser sessions and the file watcher to wind down. Workaround: Ctrl+Z to suspend, then killall -9 streamlit (or pkill -9 -f "streamlit run").

@misc{krafton2026agentvidbench,
title = {AgentVidBench: A Multi-Hop Video Question Answering Benchmark for Evaluating MLLM Agents},
author = {An, Seoyeon and Jang, Hyeonseo and Kim, Minsu and Lee, Chanho and Park, Younghan and Lee, Kangwook},
year = {2026},
url = {https://github.com/krafton-ai/agentvidbench}
}
