[ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers#1365
Open
XingruiWang wants to merge 25 commits into
Open
[ICLR 2026] XmodBench. New MCQ benchmark + omni-LLM interleave wrappers#1365XingruiWang wants to merge 25 commits into
XingruiWang wants to merge 25 commits into
Conversation
XModBench evaluates omni multimodal models across all combinations of Audio, Image, Video, and Text modalities. Each sample has a condition in one modality and four options in another modality (A/B/C/D). - 10 subtask YAMLs covering all modality pairs (audio_image, audio_text, audio_video, image_audio, image_text, text_audio, text_image, text_video, video_audio, video_text) - Group YAML `xmod_bench` to run all subtasks at once - utils.py with doc_to_visual, doc_to_messages (interleaved), doc_to_text, process_results, and aggregate_results with per-category breakdown - Built on top of AudioBench data; set AUDIOBENCH_ROOT env var for paths
- Corrected sample counts for various tasks in README.md. - Updated environment variable from AUDIOBENCH_ROOT to XMODBENCH in utils.py and README.md. - Enhanced media loading functionality in utils.py to handle different modalities. - Added group and subtask categorization in result processing for better accuracy reporting.
Switch xmod_bench task from local JSONL paths to the published HF dataset. utils.py now resolves XMODBENCH_ROOT via snapshot_download (RyanWW/XModBench), overridable by the XMODBENCH env var. The 10 task yamls load JSONL via hf://datasets/RyanWW/XModBench/data/*.jsonl. JSONL paths are normalized to Data/... (./benchmark/ prefix stripped). build_data.py: cap natures subtask at 500/file and truncate panaroma at 390.
…mary
- XModBench-Lite: 5 families × 6 configs × 200 = 6000 samples, balanced.
Generated by make_lite.py; uploaded to RyanWW/XModBench under data_lite/.
New yamls: xmod_bench_lite_{a2t,a2v,t2a,t2v,v2a,v2t} + group xmod_bench_lite.
- Metrics overhaul. process_results now emits the canonical record
{family, subtask, config, correct} (vision = image ∪ video). Per-task
aggregator logs accuracy by config, by family, and per (family, subtask,
config) — sufficient as Level-1 raw data.
- summarize.py: stand-alone Level-2 summary script over lmms-eval samples
logs. Produces 17 numbers — 6 by-config, 5 by-family, 3 modality disparity
(ΔT_vs_V/A, ΔV_vs_A), 3 directional imbalance (ΔT↔V/A, ΔV↔A).
- New run_xmod_bench_lite_qwen2_5_omni.slurm (array 0-5). The full-bench
slurm script now defaults XMODBENCH to AudioBench_data, matching the HF
repo layout.
XModBench items carry media in the question stem AND every answer option (up to 5 per item). lmms-eval's *simple* model interface only attaches one media object per request via doc_to_visual, so qwen2.5-omni/qwen3-omni/ omnivinci/baichuan-omni silently drop the option media and score far below their paper numbers (e.g. Qwen2.5-Omni t2a 26% -> 51% once fixed). Add chat-style (is_simple=False) wrappers that consume the task's doc_to_messages output and feed the full interleaved prompt to each model: - _interleave_base.InterleaveChatMixin: shared request loop written once (Collator, chunk loop, doc_to_messages extraction, error/cache handling). Per-media size/frame caps (fps=12, max_frames=60, max_pixels=512^2) match the upstream XModBench/AudioBench runner to avoid GPU OOM. - qwen2_5_omni_interleave, qwen3_omni_interleave: process_mm_info path - omnivinci_interleave: VILA processor + media/media_config path - baichuan_omni_interleave: special-token string prompt path Each concrete wrapper only implements _infer_one (~50-90 lines). Registered under AVAILABLE_CHAT_TEMPLATE_MODELS. VITA is intentionally skipped: its wrapper's generate_until only supports a single image_tensor/audios per request, so multi-distinct-media needs deeper rework. Also add resource-aware launchers: run_xmod_lite_generic.slurm + submit_lite.sh split light (no-video a2t/t2a -> 1 GPU) vs heavy (video configs -> 4 GPU) so a full 6-config sweep fits under the 24-GPU QOS cap and runs concurrently. debug_xmod_bench_lite.slurm for --limit smoke tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
For multimodal inputs the processor expands media placeholder tokens, so inputs.input_ids length does NOT match the generated sequence prefix. Trimming via out_ids[len(in_ids):] sliced into / past the generated tokens and produced EMPTY strings on ~50% of video-condition samples (qwen2.5-omni v2a dropped to 43.9 vs paper 50.5 purely from empty outputs). Now decode the full sequence and take the text after the final "assistant\n" turn, mirroring the upstream XModBench/AudioBench Qwen2.5-Omni runner. Verified: v2a debug no longer emits empty responses. Applied to qwen2_5_omni_interleave and qwen3_omni_interleave. Also add baichuan_omni_interleave + register all wrappers; debug slurm now honors MODEL_ARGS_EXTRA so non-attn_implementation models work. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fps=12/max_frames=60/512px (AudioBench's ~80GB-GPU settings) OOM'd ~50% of video-condition samples on 24GB a5000s — they were silently caught and returned empty, dropping qwen2.5-omni v2a to 43.9 vs paper 50.5 (192/1000 OOM == 192 empty). Tighter budget (fps=2, 16 frames, 384px) keeps every sample on-GPU; XModBench video tasks don't need dense frames. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
t2a carries 4 audio options; singer_identification clips are long songs that OOM on a single 24GB GPU (100/1000 t2a samples dropped, qwen2.5-omni t2a 50.3 vs paper 55.4). Only a2t (1 audio) stays on the light profile. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- InterleaveChatMixin.video_kwargs/image_kwargs are now overridable class attrs (Baichuan-Omni's video path needs far more memory than Qwen's; Qwen stays at the default fps=2/16f/384px that reproduces paper). - _to_qwen_messages takes explicit image/video kwargs. - omnivinci_interleave.__init__ mirrors the official AudioBench processor config (audio_chunk_length=max_3600, num_video_frames) without touching the upstream omnivinci model file; drops the system turn that broke VILA mm_info indexing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
a2v/t2v carry 4 vision options; at Baichuan's default ~1MP/image they OOM 700+/1000 even on 4x48GB. Set processor.config.max_pixels to ~0.2MP in the subclass __init__ (read at processor_omni.py:164), no upstream change. Mirrors the omnivinci processor-config approach. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OmniImageProcessor caches max_pixels at __init__, so config.max_pixels set afterwards had no effect (a2v/t2v still OOM 800+/1000). Set the cached attribute directly on processor.visual_processor and .video_processor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Qwen2.5-Omni 6/6 configs within |Δ|<5 (paper reproduced on Lite). Qwen3-Omni numbers reported (new model). Baichuan-Omni-1.5 4/6 within |Δ|<5, the other two genuine positive Lite-subsample deviations. Documents all interleave-wrapper fixes; no upstream model file modified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dynamic_s2 tiling explodes 4 vision options into many tiles/image, OOMing a2v/t2v (724/766 OOM). Set image_aspect_ratio=resize (single fixed tile) in the subclass; no upstream change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
OmniVinci image_aspect_ratio=resize avoids OOM but degenerates a2v/t2v to empty output, so revert it — OmniVinci is reported best-effort on its 2 clean configs (A→T 62.2, V→T 78.8). RESULTS.md now has the full 4-model table. a2v/t2v (4 vision opts) and t2a/v2a (4 audio opts) hit VILA-internal limits not resolvable without upstream edits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR hygiene: remove an accidental root-level Untitled file and revert the only upstream model-file change (a leftover ipdb debug comment) so the PR touches no original model code except the models/__init__.py registry (4 new *_interleave entries). All adaptations live in new task + chat wrapper files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ffeat: add XModBench [ICLR 2026] cross-modal benchmark + interleaved-multimedia omni model wrappers
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ICLR 2026 (Webpage | Paper)
RyanWW/XModBench, with a stratified 6K Lite split.*_interleavechat wrappers (Qwen2.5-Omni, Qwen3-Omni, Baichuan-Omni, MiniCPM-o, OmniVinci) on top of a new sharedInterleaveChatMixin, so XModBench-style multi-media stem + 4 multi-media options run end-to-end onlmms-eval.
submit_lite.sh,submit_full.sh, generic ada/a5000 templates),upload_eval_logs.sh,summarize.py, andRESULTS.md/README.md/PR.mdcovering 8 Lite + Full reproductions vs paper.In scope
lmms_eval/tasks/xmod_bench/— task package: 14 yaml configs (6 Lite + 10 Full combos + group),utils.py(data loader, doc_to_visual/messages, scoring + family/subtask aggregation),make_lite.py,build_data.py,subsample_perception.py,summarize.py, docs.lmms_eval/models/chat/_interleave_base.py— sharedInterleaveChatMixin(is_simple=False,doc_to_messagesloop, sharedIMAGE_KWARGS/VIDEO_KWARGScaps).lmms_eval/models/chat/:qwen2_5_omni_interleave.py,qwen3_omni_interleave.py,baichuan_omni_interleave.py,minicpm_o_interleave.py,omnivinci_interleave.py(+ registration inmodels/__init__.py).submit_lite.sh,submit_full.sh,run_xmod_lite_generic.slurm,run_xmod_full_generic.slurm,run_xmod_bench_qwen2_5_omni.slurm,run_xmod_bench_lite_qwen2_5_omni.slurm,upload_eval_logs.sh.Out of scope
num_beams=1, OmniVinci fixed image tile, video-SALMONNCUDA_HOMEauto-set, Gemma 4 multi-video frame stack) live in our wrapperlayer only.
_interleave_baseshared mixin import; legacy simple wrappers untouched.RyanWW/XModBenchon HF.Validation
python -m lmms_eval --model qwen2_5_omni_interleave --tasks xmod_bench_lite| sample size:N=6000(6 cfg × 1000) | key metrics: cfg_avg59.1, fam_avg59.1, empty0/6000 (0.0%)| result: pass (matches paper Lite ±0.2pp).python -m lmms_eval --model qwen3_omni_interleave --tasks xmod_bench_<combo>| sample size:N=61,320(10 combos) | key metrics: Full cfg_avg68.6, 0–2 empty / combo (<0.03%) | result: pass (Full leaderboard, SOTA open-source).N=6000orN=61,320each | key metric:every published config <3% empty after wrapper fixes | result: pass.
Risk / Compatibility
*_interleave); legacy simple wrappers (qwen2_5_omni,minicpm_o, …) remain available unchanged.InterleaveChatMixinships defaults (fps=2, max_frames=16, max_pixels=384²) tuned for 24 GB GPUs; downstream users wanting full-resolution recall should override on the wrapper subclass.Type of Change