by Lin Long, Changdae Oh, Seongheon Park, and Sharon Li.
This repository provides tools and scripts to analyze the language prior in Vision-Language Models by examining representation distances across different layers.
1. Prepare Data

First, set up the environment variable for your data path:
```bash
export DATA_PATH=/path/to/your/data
```

Create dataset files in JSONL format under the DATA_PATH directory. Each dataset should be named {dataset}.jsonl, where each line contains:
- image: the image path or base64 string starting with "data:image/"
- instruction: the instruction text
- target_tokens: the target tokens (e.g., ["Yes", "No"] or ["A", "B", "C", "D"])
- other keys: additional keys you want to include
Example JSONL entry:
{"image": "/path/to/image.jpg", "instruction": "What color is the sky?", "target_tokens": ["A", "B", "C", "D"], "answer": "A"}We provide reference data processing scripts for several datasets in the data_preparation/ folder.
2. Generate Hidden States
Generate hidden states for your model. Using Qwen2.5-VL as an example:
```bash
CUDA_VISIBLE_DEVICES=0 python generation/gen_qwenvl.py --dataset mme
```

Multi-GPU Support: This step supports multi-GPU parallel generation. After generation is complete, you need to merge the results:
```bash
python utils/merge.py
```
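The generation scripts handle this step end-to-end. Purely as an illustration of what the per-layer hidden states are, the sketch below pulls them from Qwen2.5-VL via the Hugging Face transformers integration; the checkpoint name, last-token pooling, and output file are assumptions for the sketch, not the behavior of gen_qwenvl.py:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Assumed checkpoint; the repository's generation script may use a different one.
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16
).to("cuda")

# One example in the dataset format described above.
image = Image.open("/path/to/image.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What color is the sky?"},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple with one tensor per layer (plus the input embeddings),
# each of shape (batch, sequence_length, hidden_size). Here we keep the last token's
# representation at every layer, one common choice for this kind of analysis.
per_layer = torch.stack([h[0, -1].float().cpu() for h in outputs.hidden_states])
torch.save(per_layer, "hidden_states.pt")  # hypothetical output path
```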
3. Plot Representation Distance Curves

Use the plotting script to visualize representation distance curves:

```bash
python plot_divergences.py --model qwenvl --dataset mme
```

Available options:
- --model: Model name (e.g., qwenvl, llava, gemma)
- --dataset: Dataset name (e.g., mme, mmbench, vlind)
- --data_path: Path to the data directory (default: "data")
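As a rough illustration of what such a curve represents, the sketch below compares per-layer hidden states for the same instruction with and without the image and plots a cosine distance per layer. The file names and the choice of cosine distance are assumptions; plot_divergences.py defines the metric actually used by the repository:

```python
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F

# Hypothetical inputs: per-layer hidden states saved for the same instruction with and
# without the image, each of shape (num_layers + 1, hidden_size).
with_image = torch.load("hidden_states_with_image.pt")
text_only = torch.load("hidden_states_text_only.pt")

# Cosine distance at every layer (an illustrative stand-in for the repository's metric).
distance = 1 - F.cosine_similarity(with_image.float(), text_only.float(), dim=-1)

plt.plot(distance.numpy())
plt.xlabel("Layer")
plt.ylabel("Representation distance")
plt.savefig("distance_curve.png")
```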
Install the required dependencies:
```bash
pip install -r requirements.txt
```