[ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers by XingruiWang · Pull Request #1365 · EvolvingLMMs-Lab/lmms-eval

XingruiWang · 2026-06-14T17:53:25Z

Summary

ICLR 2026 (Webpage | Paper)

Add XModBench: a 61,320-MCQ cross-modal benchmark (5 task families × 6 directional flips: a↔t, a↔v, t↔v) loaded from HF RyanWW/XModBench, with a stratified 6K Lite split.
Add 5 omni-LLM *_interleave chat wrappers (Qwen2.5-Omni, Qwen3-Omni, Baichuan-Omni, MiniCPM-o, OmniVinci) on top of a new shared InterleaveChatMixin, so XModBench-style multi-media stem + 4 multi-media options run end-to-end on
lmms-eval.
Add reproduction infra: slurm launchers (submit_lite.sh, submit_full.sh, generic ada/a5000 templates), upload_eval_logs.sh, summarize.py, and RESULTS.md / README.md / PR.md covering 8 Lite + Full reproductions vs paper.

In scope

lmms_eval/tasks/xmod_bench/ — task package: 14 yaml configs (6 Lite + 10 Full combos + group), utils.py (data loader, doc_to_visual/messages, scoring + family/subtask aggregation), make_lite.py, build_data.py,
subsample_perception.py, summarize.py, docs.
lmms_eval/models/chat/_interleave_base.py — shared InterleaveChatMixin (is_simple=False, doc_to_messages loop, shared IMAGE_KWARGS/VIDEO_KWARGS caps).
5 new chat wrappers under lmms_eval/models/chat/: qwen2_5_omni_interleave.py, qwen3_omni_interleave.py, baichuan_omni_interleave.py, minicpm_o_interleave.py, omnivinci_interleave.py (+ registration in models/__init__.py).
Reproduction launchers: submit_lite.sh, submit_full.sh, run_xmod_lite_generic.slurm, run_xmod_full_generic.slurm, run_xmod_bench_qwen2_5_omni.slurm, run_xmod_bench_lite_qwen2_5_omni.slurm, upload_eval_logs.sh.

Out of scope

No upstream model-file edits — all model-side fixes (Baichuan max_pixels cache, MiniCPM-o audio align / num_beams=1, OmniVinci fixed image tile, video-SALMONN CUDA_HOME auto-set, Gemma 4 multi-video frame stack) live in our wrapper
layer only.
No changes to existing model wrappers beyond an _interleave_base shared mixin import; legacy simple wrappers untouched.
No new dataset uploads from this PR — XModBench data is fetched at runtime from RyanWW/XModBench on HF.
API-only and gate-derived results (Gemini 2.x / 3.x, EchoInk paper-cited) are kept on the project page; not part of this code PR.

Validation

python -m lmms_eval --model qwen2_5_omni_interleave --tasks xmod_bench_lite | sample size: N=6000 (6 cfg × 1000) | key metrics: cfg_avg 59.1, fam_avg 59.1, empty 0/6000 (0.0%) | result: pass (matches paper Lite ±0.2pp).
python -m lmms_eval --model qwen3_omni_interleave --tasks xmod_bench_<combo> | sample size: N=61,320 (10 combos) | key metrics: Full cfg_avg 68.6, 0–2 empty / combo (<0.03%) | result: pass (Full leaderboard, SOTA open-source).
7 additional reproductions audited (Baichuan-Omni-1.5, MiniCPM-o-2.6, OmniVinci arch-2/2, video-SALMONN-2, Gemma 4 E4B-it, Gemini 3.1-pro & 3.5-flash Lite via the new SDK wrapper) | sample size: N=6000 or N=61,320 each | key metric:
every published config <3% empty after wrapper fixes | result: pass.

Risk / Compatibility

Net additive — no breaking changes to existing tasks/models. New interleave wrappers register fresh model ids (*_interleave); legacy simple wrappers (qwen2_5_omni, minicpm_o, …) remain available unchanged.
InterleaveChatMixin ships defaults (fps=2, max_frames=16, max_pixels=384²) tuned for 24 GB GPUs; downstream users wanting full-resolution recall should override on the wrapper subclass.

Type of Change

XModBench evaluates omni multimodal models across all combinations of Audio, Image, Video, and Text modalities. Each sample has a condition in one modality and four options in another modality (A/B/C/D). - 10 subtask YAMLs covering all modality pairs (audio_image, audio_text, audio_video, image_audio, image_text, text_audio, text_image, text_video, video_audio, video_text) - Group YAML `xmod_bench` to run all subtasks at once - utils.py with doc_to_visual, doc_to_messages (interleaved), doc_to_text, process_results, and aggregate_results with per-category breakdown - Built on top of AudioBench data; set AUDIOBENCH_ROOT env var for paths

- Corrected sample counts for various tasks in README.md. - Updated environment variable from AUDIOBENCH_ROOT to XMODBENCH in utils.py and README.md. - Enhanced media loading functionality in utils.py to handle different modalities. - Added group and subtask categorization in result processing for better accuracy reporting.

Switch xmod_bench task from local JSONL paths to the published HF dataset. utils.py now resolves XMODBENCH_ROOT via snapshot_download (RyanWW/XModBench), overridable by the XMODBENCH env var. The 10 task yamls load JSONL via hf://datasets/RyanWW/XModBench/data/*.jsonl. JSONL paths are normalized to Data/... (./benchmark/ prefix stripped). build_data.py: cap natures subtask at 500/file and truncate panaroma at 390.

…mary - XModBench-Lite: 5 families × 6 configs × 200 = 6000 samples, balanced. Generated by make_lite.py; uploaded to RyanWW/XModBench under data_lite/. New yamls: xmod_bench_lite_{a2t,a2v,t2a,t2v,v2a,v2t} + group xmod_bench_lite. - Metrics overhaul. process_results now emits the canonical record {family, subtask, config, correct} (vision = image ∪ video). Per-task aggregator logs accuracy by config, by family, and per (family, subtask, config) — sufficient as Level-1 raw data. - summarize.py: stand-alone Level-2 summary script over lmms-eval samples logs. Produces 17 numbers — 6 by-config, 5 by-family, 3 modality disparity (ΔT_vs_V/A, ΔV_vs_A), 3 directional imbalance (ΔT↔V/A, ΔV↔A). - New run_xmod_bench_lite_qwen2_5_omni.slurm (array 0-5). The full-bench slurm script now defaults XMODBENCH to AudioBench_data, matching the HF repo layout.

XModBench items carry media in the question stem AND every answer option (up to 5 per item). lmms-eval's *simple* model interface only attaches one media object per request via doc_to_visual, so qwen2.5-omni/qwen3-omni/ omnivinci/baichuan-omni silently drop the option media and score far below their paper numbers (e.g. Qwen2.5-Omni t2a 26% -> 51% once fixed). Add chat-style (is_simple=False) wrappers that consume the task's doc_to_messages output and feed the full interleaved prompt to each model: - _interleave_base.InterleaveChatMixin: shared request loop written once (Collator, chunk loop, doc_to_messages extraction, error/cache handling). Per-media size/frame caps (fps=12, max_frames=60, max_pixels=512^2) match the upstream XModBench/AudioBench runner to avoid GPU OOM. - qwen2_5_omni_interleave, qwen3_omni_interleave: process_mm_info path - omnivinci_interleave: VILA processor + media/media_config path - baichuan_omni_interleave: special-token string prompt path Each concrete wrapper only implements _infer_one (~50-90 lines). Registered under AVAILABLE_CHAT_TEMPLATE_MODELS. VITA is intentionally skipped: its wrapper's generate_until only supports a single image_tensor/audios per request, so multi-distinct-media needs deeper rework. Also add resource-aware launchers: run_xmod_lite_generic.slurm + submit_lite.sh split light (no-video a2t/t2a -> 1 GPU) vs heavy (video configs -> 4 GPU) so a full 6-config sweep fits under the 24-GPU QOS cap and runs concurrently. debug_xmod_bench_lite.slurm for --limit smoke tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

For multimodal inputs the processor expands media placeholder tokens, so inputs.input_ids length does NOT match the generated sequence prefix. Trimming via out_ids[len(in_ids):] sliced into / past the generated tokens and produced EMPTY strings on ~50% of video-condition samples (qwen2.5-omni v2a dropped to 43.9 vs paper 50.5 purely from empty outputs). Now decode the full sequence and take the text after the final "assistant\n" turn, mirroring the upstream XModBench/AudioBench Qwen2.5-Omni runner. Verified: v2a debug no longer emits empty responses. Applied to qwen2_5_omni_interleave and qwen3_omni_interleave. Also add baichuan_omni_interleave + register all wrappers; debug slurm now honors MODEL_ARGS_EXTRA so non-attn_implementation models work. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fps=12/max_frames=60/512px (AudioBench's ~80GB-GPU settings) OOM'd ~50% of video-condition samples on 24GB a5000s — they were silently caught and returned empty, dropping qwen2.5-omni v2a to 43.9 vs paper 50.5 (192/1000 OOM == 192 empty). Tighter budget (fps=2, 16 frames, 384px) keeps every sample on-GPU; XModBench video tasks don't need dense frames. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

t2a carries 4 audio options; singer_identification clips are long songs that OOM on a single 24GB GPU (100/1000 t2a samples dropped, qwen2.5-omni t2a 50.3 vs paper 55.4). Only a2t (1 audio) stays on the light profile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- InterleaveChatMixin.video_kwargs/image_kwargs are now overridable class attrs (Baichuan-Omni's video path needs far more memory than Qwen's; Qwen stays at the default fps=2/16f/384px that reproduces paper). - _to_qwen_messages takes explicit image/video kwargs. - omnivinci_interleave.__init__ mirrors the official AudioBench processor config (audio_chunk_length=max_3600, num_video_frames) without touching the upstream omnivinci model file; drops the system turn that broke VILA mm_info indexing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

a2v/t2v carry 4 vision options; at Baichuan's default ~1MP/image they OOM 700+/1000 even on 4x48GB. Set processor.config.max_pixels to ~0.2MP in the subclass __init__ (read at processor_omni.py:164), no upstream change. Mirrors the omnivinci processor-config approach. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

OmniImageProcessor caches max_pixels at __init__, so config.max_pixels set afterwards had no effect (a2v/t2v still OOM 800+/1000). Set the cached attribute directly on processor.visual_processor and .video_processor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Qwen2.5-Omni 6/6 configs within |Δ|<5 (paper reproduced on Lite). Qwen3-Omni numbers reported (new model). Baichuan-Omni-1.5 4/6 within |Δ|<5, the other two genuine positive Lite-subsample deviations. Documents all interleave-wrapper fixes; no upstream model file modified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

dynamic_s2 tiling explodes 4 vision options into many tiles/image, OOMing a2v/t2v (724/766 OOM). Set image_aspect_ratio=resize (single fixed tile) in the subclass; no upstream change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

OmniVinci image_aspect_ratio=resize avoids OOM but degenerates a2v/t2v to empty output, so revert it — OmniVinci is reported best-effort on its 2 clean configs (A→T 62.2, V→T 78.8). RESULTS.md now has the full 4-model table. a2v/t2v (4 vision opts) and t2a/v2a (4 audio opts) hit VILA-internal limits not resolvable without upstream edits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

PR hygiene: remove an accidental root-level Untitled file and revert the only upstream model-file change (a leftover ipdb debug comment) so the PR touches no original model code except the models/__init__.py registry (4 new *_interleave entries). All adaptations live in new task + chat wrapper files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ffeat: add XModBench [ICLR 2026] cross-modal benchmark + interleaved-multimedia omni model wrappers

XingruiWang and others added 24 commits February 27, 2026 17:53

Merge branch 'main' into feat/xmod-bench

9c25ff5

Merge branch 'EvolvingLMMs-Lab:main' into feat/xmod-bench

b04e983

style: auto-fix lint (black + isort)

152efbf

docs(xmod_bench): add PR.md (pull-request description)

cbb4f20

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

style: auto-fix lint (black + isort)

5ca073e

Merge pull request #1 from XingruiWang/feat/xmod-bench

3657db4

ffeat: add XModBench [ICLR 2026] cross-modal benchmark + interleaved-multimedia omni model wrappers

Merge branch 'EvolvingLMMs-Lab:main' into main

aef4cb8

style: auto-fix lint (black + isort)

03b68e3

Merge branch 'EvolvingLMMs-Lab:main' into main

0ced2d4

XingruiWang changed the title ~~# feat(xmod_bench): add cross-modal MCQ benchmark + omni-LLM interleave wrappers~~ # [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers Jun 14, 2026

XingruiWang changed the title ~~# [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers~~ # **ICLR 2026 XmodBench**: add New MCQ benchmark + omni-LLM interleave wrappers Jun 14, 2026

XingruiWang changed the title ~~# **ICLR 2026 XmodBench**: add New MCQ benchmark + omni-LLM interleave wrappers~~ # [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers Jun 14, 2026

Merge branch 'EvolvingLMMs-Lab:main' into main

14c715e

XingruiWang changed the title ~~# [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers~~ [ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers#1365

[ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers#1365
XingruiWang wants to merge 25 commits into
EvolvingLMMs-Lab:mainfrom
XingruiWang:main

XingruiWang commented Jun 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

XingruiWang commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

In scope

Out of scope

Validation

Risk / Compatibility

Type of Change

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

XingruiWang commented Jun 14, 2026 •

edited

Loading