Skip to content

[ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers#1365

Open
XingruiWang wants to merge 25 commits into
EvolvingLMMs-Lab:mainfrom
XingruiWang:main
Open

[ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers#1365
XingruiWang wants to merge 25 commits into
EvolvingLMMs-Lab:mainfrom
XingruiWang:main

Conversation

@XingruiWang

@XingruiWang XingruiWang commented Jun 14, 2026

Copy link
Copy Markdown

Summary

ICLR 2026 (Webpage | Paper)

  • Add XModBench: a 61,320-MCQ cross-modal benchmark (5 task families × 6 directional flips: a↔t, a↔v, t↔v) loaded from HF RyanWW/XModBench, with a stratified 6K Lite split.
  • Add 5 omni-LLM *_interleave chat wrappers (Qwen2.5-Omni, Qwen3-Omni, Baichuan-Omni, MiniCPM-o, OmniVinci) on top of a new shared InterleaveChatMixin, so XModBench-style multi-media stem + 4 multi-media options run end-to-end on
    lmms-eval.
  • Add reproduction infra: slurm launchers (submit_lite.sh, submit_full.sh, generic ada/a5000 templates), upload_eval_logs.sh, summarize.py, and RESULTS.md / README.md / PR.md covering 8 Lite + Full reproductions vs paper.

In scope

  • lmms_eval/tasks/xmod_bench/ — task package: 14 yaml configs (6 Lite + 10 Full combos + group), utils.py (data loader, doc_to_visual/messages, scoring + family/subtask aggregation), make_lite.py, build_data.py,
    subsample_perception.py, summarize.py, docs.
  • lmms_eval/models/chat/_interleave_base.py — shared InterleaveChatMixin (is_simple=False, doc_to_messages loop, shared IMAGE_KWARGS/VIDEO_KWARGS caps).
  • 5 new chat wrappers under lmms_eval/models/chat/: qwen2_5_omni_interleave.py, qwen3_omni_interleave.py, baichuan_omni_interleave.py, minicpm_o_interleave.py, omnivinci_interleave.py (+ registration in models/__init__.py).
  • Reproduction launchers: submit_lite.sh, submit_full.sh, run_xmod_lite_generic.slurm, run_xmod_full_generic.slurm, run_xmod_bench_qwen2_5_omni.slurm, run_xmod_bench_lite_qwen2_5_omni.slurm, upload_eval_logs.sh.

Out of scope

  • No upstream model-file edits — all model-side fixes (Baichuan max_pixels cache, MiniCPM-o audio align / num_beams=1, OmniVinci fixed image tile, video-SALMONN CUDA_HOME auto-set, Gemma 4 multi-video frame stack) live in our wrapper
    layer only.
  • No changes to existing model wrappers beyond an _interleave_base shared mixin import; legacy simple wrappers untouched.
  • No new dataset uploads from this PR — XModBench data is fetched at runtime from RyanWW/XModBench on HF.
  • API-only and gate-derived results (Gemini 2.x / 3.x, EchoInk paper-cited) are kept on the project page; not part of this code PR.

Validation

  • python -m lmms_eval --model qwen2_5_omni_interleave --tasks xmod_bench_lite | sample size: N=6000 (6 cfg × 1000) | key metrics: cfg_avg 59.1, fam_avg 59.1, empty 0/6000 (0.0%) | result: pass (matches paper Lite ±0.2pp).
  • python -m lmms_eval --model qwen3_omni_interleave --tasks xmod_bench_<combo> | sample size: N=61,320 (10 combos) | key metrics: Full cfg_avg 68.6, 0–2 empty / combo (<0.03%) | result: pass (Full leaderboard, SOTA open-source).
  • 7 additional reproductions audited (Baichuan-Omni-1.5, MiniCPM-o-2.6, OmniVinci arch-2/2, video-SALMONN-2, Gemma 4 E4B-it, Gemini 3.1-pro & 3.5-flash Lite via the new SDK wrapper) | sample size: N=6000 or N=61,320 each | key metric:
    every published config <3% empty after wrapper fixes | result: pass.

Risk / Compatibility

  • Net additive — no breaking changes to existing tasks/models. New interleave wrappers register fresh model ids (*_interleave); legacy simple wrappers (qwen2_5_omni, minicpm_o, …) remain available unchanged.
  • InterleaveChatMixin ships defaults (fps=2, max_frames=16, max_pixels=384²) tuned for 24 GB GPUs; downstream users wanting full-resolution recall should override on the wrapper subclass.

Type of Change

  • New benchmark/task
  • New model integration
  • Documentation update
  • Bug fix (non-breaking change)
  • New feature
  • Breaking change
  • Refactoring (no functional changes)

XingruiWang and others added 24 commits February 27, 2026 17:53
XModBench evaluates omni multimodal models across all combinations of
Audio, Image, Video, and Text modalities. Each sample has a condition
in one modality and four options in another modality (A/B/C/D).

- 10 subtask YAMLs covering all modality pairs (audio_image, audio_text,
  audio_video, image_audio, image_text, text_audio, text_image,
  text_video, video_audio, video_text)
- Group YAML `xmod_bench` to run all subtasks at once
- utils.py with doc_to_visual, doc_to_messages (interleaved), doc_to_text,
  process_results, and aggregate_results with per-category breakdown
- Built on top of AudioBench data; set AUDIOBENCH_ROOT env var for paths
- Corrected sample counts for various tasks in README.md.
- Updated environment variable from AUDIOBENCH_ROOT to XMODBENCH in utils.py and README.md.
- Enhanced media loading functionality in utils.py to handle different modalities.
- Added group and subtask categorization in result processing for better accuracy reporting.
Switch xmod_bench task from local JSONL paths to the published HF dataset.
utils.py now resolves XMODBENCH_ROOT via snapshot_download (RyanWW/XModBench),
overridable by the XMODBENCH env var. The 10 task yamls load JSONL via
hf://datasets/RyanWW/XModBench/data/*.jsonl. JSONL paths are normalized to
Data/... (./benchmark/ prefix stripped).

build_data.py: cap natures subtask at 500/file and truncate panaroma at 390.
…mary

- XModBench-Lite: 5 families × 6 configs × 200 = 6000 samples, balanced.
  Generated by make_lite.py; uploaded to RyanWW/XModBench under data_lite/.
  New yamls: xmod_bench_lite_{a2t,a2v,t2a,t2v,v2a,v2t} + group xmod_bench_lite.

- Metrics overhaul. process_results now emits the canonical record
  {family, subtask, config, correct} (vision = image ∪ video). Per-task
  aggregator logs accuracy by config, by family, and per (family, subtask,
  config) — sufficient as Level-1 raw data.

- summarize.py: stand-alone Level-2 summary script over lmms-eval samples
  logs. Produces 17 numbers — 6 by-config, 5 by-family, 3 modality disparity
  (ΔT_vs_V/A, ΔV_vs_A), 3 directional imbalance (ΔT↔V/A, ΔV↔A).

- New run_xmod_bench_lite_qwen2_5_omni.slurm (array 0-5). The full-bench
  slurm script now defaults XMODBENCH to AudioBench_data, matching the HF
  repo layout.
XModBench items carry media in the question stem AND every answer option
(up to 5 per item). lmms-eval's *simple* model interface only attaches one
media object per request via doc_to_visual, so qwen2.5-omni/qwen3-omni/
omnivinci/baichuan-omni silently drop the option media and score far below
their paper numbers (e.g. Qwen2.5-Omni t2a 26% -> 51% once fixed).

Add chat-style (is_simple=False) wrappers that consume the task's
doc_to_messages output and feed the full interleaved prompt to each model:

- _interleave_base.InterleaveChatMixin: shared request loop written once
  (Collator, chunk loop, doc_to_messages extraction, error/cache handling).
  Per-media size/frame caps (fps=12, max_frames=60, max_pixels=512^2)
  match the upstream XModBench/AudioBench runner to avoid GPU OOM.
- qwen2_5_omni_interleave, qwen3_omni_interleave: process_mm_info path
- omnivinci_interleave: VILA processor + media/media_config path
- baichuan_omni_interleave: special-token string prompt path

Each concrete wrapper only implements _infer_one (~50-90 lines). Registered
under AVAILABLE_CHAT_TEMPLATE_MODELS. VITA is intentionally skipped: its
wrapper's generate_until only supports a single image_tensor/audios per
request, so multi-distinct-media needs deeper rework.

Also add resource-aware launchers: run_xmod_lite_generic.slurm +
submit_lite.sh split light (no-video a2t/t2a -> 1 GPU) vs heavy
(video configs -> 4 GPU) so a full 6-config sweep fits under the 24-GPU
QOS cap and runs concurrently. debug_xmod_bench_lite.slurm for --limit
smoke tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
For multimodal inputs the processor expands media placeholder tokens, so
inputs.input_ids length does NOT match the generated sequence prefix.
Trimming via out_ids[len(in_ids):] sliced into / past the generated tokens
and produced EMPTY strings on ~50% of video-condition samples (qwen2.5-omni
v2a dropped to 43.9 vs paper 50.5 purely from empty outputs).

Now decode the full sequence and take the text after the final
"assistant\n" turn, mirroring the upstream XModBench/AudioBench
Qwen2.5-Omni runner. Verified: v2a debug no longer emits empty responses.

Applied to qwen2_5_omni_interleave and qwen3_omni_interleave. Also add
baichuan_omni_interleave + register all wrappers; debug slurm now honors
MODEL_ARGS_EXTRA so non-attn_implementation models work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fps=12/max_frames=60/512px (AudioBench's ~80GB-GPU settings) OOM'd ~50%
of video-condition samples on 24GB a5000s — they were silently caught and
returned empty, dropping qwen2.5-omni v2a to 43.9 vs paper 50.5 (192/1000
OOM == 192 empty). Tighter budget (fps=2, 16 frames, 384px) keeps every
sample on-GPU; XModBench video tasks don't need dense frames.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
t2a carries 4 audio options; singer_identification clips are long songs
that OOM on a single 24GB GPU (100/1000 t2a samples dropped, qwen2.5-omni
t2a 50.3 vs paper 55.4). Only a2t (1 audio) stays on the light profile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- InterleaveChatMixin.video_kwargs/image_kwargs are now overridable class
  attrs (Baichuan-Omni's video path needs far more memory than Qwen's;
  Qwen stays at the default fps=2/16f/384px that reproduces paper).
- _to_qwen_messages takes explicit image/video kwargs.
- omnivinci_interleave.__init__ mirrors the official AudioBench processor
  config (audio_chunk_length=max_3600, num_video_frames) without touching
  the upstream omnivinci model file; drops the system turn that broke
  VILA mm_info indexing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
a2v/t2v carry 4 vision options; at Baichuan's default ~1MP/image they
OOM 700+/1000 even on 4x48GB. Set processor.config.max_pixels to ~0.2MP
in the subclass __init__ (read at processor_omni.py:164), no upstream
change. Mirrors the omnivinci processor-config approach.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OmniImageProcessor caches max_pixels at __init__, so config.max_pixels set
afterwards had no effect (a2v/t2v still OOM 800+/1000). Set the cached
attribute directly on processor.visual_processor and .video_processor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qwen2.5-Omni 6/6 configs within |Δ|<5 (paper reproduced on Lite).
Qwen3-Omni numbers reported (new model). Baichuan-Omni-1.5 4/6 within
|Δ|<5, the other two genuine positive Lite-subsample deviations. Documents
all interleave-wrapper fixes; no upstream model file modified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dynamic_s2 tiling explodes 4 vision options into many tiles/image, OOMing
a2v/t2v (724/766 OOM). Set image_aspect_ratio=resize (single fixed tile)
in the subclass; no upstream change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OmniVinci image_aspect_ratio=resize avoids OOM but degenerates a2v/t2v to
empty output, so revert it — OmniVinci is reported best-effort on its 2
clean configs (A→T 62.2, V→T 78.8). RESULTS.md now has the full 4-model
table. a2v/t2v (4 vision opts) and t2a/v2a (4 audio opts) hit VILA-internal
limits not resolvable without upstream edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR hygiene: remove an accidental root-level Untitled file and revert the
only upstream model-file change (a leftover ipdb debug comment) so the PR
touches no original model code except the models/__init__.py registry
(4 new *_interleave entries). All adaptations live in new task + chat
wrapper files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ffeat: add XModBench [ICLR 2026] cross-modal benchmark + interleaved-multimedia omni model wrappers
@XingruiWang XingruiWang changed the title # feat(xmod_bench): add cross-modal MCQ benchmark + omni-LLM interleave wrappers # [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers Jun 14, 2026
@XingruiWang XingruiWang changed the title # [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers # **ICLR 2026 XmodBench**: add New MCQ benchmark + omni-LLM interleave wrappers Jun 14, 2026
@XingruiWang XingruiWang changed the title # **ICLR 2026 XmodBench**: add New MCQ benchmark + omni-LLM interleave wrappers # [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers Jun 14, 2026
@XingruiWang XingruiWang changed the title # [ICLR 2026 XmodBench]: add New MCQ benchmark + omni-LLM interleave wrappers [ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant