A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models
2025-01-23: Our paper is accepted to NAACL 2025 main conference! 🎉🎉🎉
2024-06-09: We released CogBench! Hope you find it useful and interesting!
Inspired by the widespread use of the "Cookie Theft" picture description task in human cognition tests, CogBench is proposed to evaluate the high-level cognitive abilities of LVLMs using semantically rich images.
Figure 1: Cookie Theft picture description task.
CogBench defines eight reasoning capabilities and consists of an Image Description task and a Visual Question Answering task.
The eight reasoning abilities include Special Time Reasoning, Location Reasoning, Character Reasoning, Character Relationship Reasoning, Event Reasoning, Event Relationship Reasoning, Next Moment Event Reasoning, and Mental State Reasoning.
For the Description task, [Entities], [Chain-of-Reasonings (CoRs)] and [Description] are annotated. [Entities] and [CoRs] are used to evaluate a model's low-level recognition ability and high-level cognitive reasoning abilities, respectively, based on its description. The evaluation metrics for both levels are recall scores, referred to as the Recognition Score and the Cognition Score.
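For intuition, a recall score of this kind can be sketched as follows. This is only an illustration, not the repository's evaluation code; the matcher is_covered is a hypothetical placeholder for however annotated items are matched against a model's description.
def recall_score(annotated_items, model_description, is_covered):
    # Illustrative recall: fraction of annotated items covered by the description.
    if not annotated_items:
        return 0.0
    hits = sum(1 for item in annotated_items if is_covered(item, model_description))
    return hits / len(annotated_items)

# Recognition Score uses [Entities]; Cognition Score uses [CoRs].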
The VQA task features standard four-option Multiple-Choice Questions. The evaluation metric for this task is accuracy.
Figure 2: An example from CogBench.
Figure 2 shows an example from CogBench. More samples are shown here.
Images in CogBench are carefully collected: they feature i) a prominent story theme, ii) richer content, and iii) more complex relationships among entities, and thus require stronger cognitive abilities to understand and describe.
Figure 3: The comparison between our images and those from the previous visual reasoning tasks.
Currently, CogBench consists of 251 semantically-rich images with a total of 2670 entities, 2243 CoRs, 251 descriptions and 2577 questions, as summarized in Table 1.
Table 1: Distribution of CoRs and questions in CogBench
| | Time | Location | Character | Character Relationship | Event | Event Relationship | Next Moment Event | Mental State |
|---|---|---|---|---|---|---|---|---|
| CoR | 47 | 177 | 106 | 263 | 701 | 425 | 107 | 417 |
| QA | 86 | 220 | 162 | 317 | 658 | 402 | 135 | 597 |
To get access to the data, you must sign a Data Use Agreement (DUA). Please read the DUA carefully, then send an email to [email protected] with the message "I consent to the Data Usage Agreement (DUA)." and attach the DUA with your handwritten signature.
After obtaining the password, you can download our dataset from Google Drive.
The annotated data for the Image Description task is organized in the following format.
{
"filename": {
"Image Name": "filename.jpg",
"Entities": ["..."],
"Special Time Reasoning": ["..."],
"Location Reasoning": ["..."],
"Character Reasoning": ["..."],
"Character Relationship Reasoning": ["..."],
"Event Reasoning": ["..."],
"Event Relationship Reasoning": ["..."],
"Next Moment Event Reasoning": ["..."],
"Mental State Reasoning": ["..."],
"Description": ["..."]
},
...
}
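A minimal sketch for loading and iterating over this annotation file; the file name used here is an assumption for illustration only.
import json

# Load the Image Description annotations (file name is an example).
with open("cogbench_description_file.json", "r", encoding="utf-8") as f:
    annotations = json.load(f)

for key, item in annotations.items():
    print(item["Image Name"], "-", len(item["Entities"]), "entities")
    # Each reasoning field, e.g. "Event Reasoning", holds a list of CoR strings.
    for cor in item["Event Reasoning"]:
        print("  CoR:", cor)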
The Multiple-Choice Questions for the VQA task are organized in the following format.
[
{
"question": "...",
"choice_a": "...",
"choice_b": "...",
"choice_c": "...",
"choice_d": "...",
"answer": "...",
"img_id": "...",
"category": "..."
},
...
]
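A minimal sketch for iterating over the question file and building a per-question prompt; the file name and prompt template are assumptions for illustration, not the official prompting setup.
import json

# Load the Multiple-Choice Questions (file name is an example).
with open("cogbench_vqa_file.json", "r", encoding="utf-8") as f:
    questions = json.load(f)

for q in questions:
    prompt = (
        f"{q['question']}\n"
        f"A. {q['choice_a']}\nB. {q['choice_b']}\n"
        f"C. {q['choice_c']}\nD. {q['choice_d']}\n"
        "Answer with the option letter only."
    )
    # Pair the prompt with the image identified by q["img_id"] when querying your LVLM.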
Step 0: Infer your model on CogBench and save your model outputs in a jsonl file like this.
{"filename": "example1.jpg", "model_output": "There are three girls sitting on a bench talking together..."}
{"filename": "example2.jpg", "model_output": "In a kitchen, a girl and her mother are putting cookies into the oven..."}
...
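A minimal sketch of Step 0, assuming a hypothetical describe_image(path) wrapper around your own LVLM; only the output format matters.
import json

def describe_image(image_path):
    # Hypothetical placeholder: call your LVLM here and return its description string.
    raise NotImplementedError

image_files = ["example1.jpg", "example2.jpg"]  # your local CogBench images

with open("model_output_file.jsonl", "w", encoding="utf-8") as f:
    for name in image_files:
        record = {"filename": name, "model_output": describe_image(name)}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")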
Step 1 (Recognition Score): Calculate the Recognition Score.
python eval/recognition_score.py --cogbench_description_file_path "/path/to/cogbench_description_file.json" --model_output_file_path "/path/to/model_output_file.jsonl"
Step 1 (Cognition Score): Run the GPT-based evaluation.
python eval/cognition_gpt_eval.py --cogbench_description_file_path "/path/to/cogbench_description_file.json" --model_output_file_path "/path/to/model_output_file.jsonl" --eval_output_file_path "/path/to/eval_output_file.jsonl" --gpt_name "gpt-4-turbo" --openai_api_key "your-openai-api-key"
Step 2 (Cognition Score): Calculate the Cognition Score.
python eval/cognition_score.py --eval_output_file_path "/path/to/eval_output_file.jsonl"
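For reference, the GPT-based evaluation step asks a GPT judge whether the model's description covers each annotated CoR. The snippet below is only a hedged sketch of such a judge call using the OpenAI Python client, not the logic of eval/cognition_gpt_eval.py; the prompt wording is an assumption.
from openai import OpenAI

client = OpenAI(api_key="your-openai-api-key")

def cor_is_covered(cor, model_description, gpt_name="gpt-4-turbo"):
    # Ask a GPT judge whether the description covers the annotated CoR.
    # The prompt wording is illustrative, not the official evaluation prompt.
    prompt = (
        f"Reasoning statement: {cor}\n"
        f"Image description: {model_description}\n"
        "Does the description cover this reasoning statement? Answer Yes or No."
    )
    resp = client.chat.completions.create(
        model=gpt_name,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")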
Step 0: Infer your model on CogBench and save your model outputs in a jsonl file like this.
{"question": "What is the boy's emotion?", "choice_a": "Sad.", "choice_b": "Angry.", "choice_c": "Scared.", "choice_d": "Happy.", "answer": "D", "img_id": "example1", "category": "mental", "response": "D. Happy."}
{"question": "What is the setting of the activity in the image?", "choice_a": "In a restaurant.", "choice_b": "At a bakery shop.", "choice_c": "In a school cafeteria.", "choice_d": "In the kitchen.", "answer": "D", "img_id": "example2", "category": "location", "response": "D. In the kitchen."}
...
Note that the first character of "response" must be the option letter (A/B/C/D) your model chose; a sketch for normalizing free-form answers into this format is shown below.
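If your model returns a free-form answer, a small normalization step can put the option letter first. This is only a sketch, under the assumption that the letter appears as a standalone token in the raw answer.
import re

def normalize_response(raw_answer):
    # Look for the first standalone A/B/C/D, e.g. in "D. In the kitchen." or "(D)".
    match = re.search(r"\b([ABCD])\b", raw_answer.upper())
    if match:
        return match.group(1) + ". " + raw_answer.strip()
    return raw_answer  # no option letter found; inspect this case manually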
Step 1: Calculate accuracy.
python eval/vqa_accuracy.py --model_output_file_path "/path/to/model_output_file.jsonl"
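For intuition, accuracy simply compares the first character of "response" with "answer". The sketch below illustrates this over the JSONL format above; it is not the eval/vqa_accuracy.py implementation.
import json

def vqa_accuracy(jsonl_path):
    correct = total = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            total += 1
            # The first character of "response" is compared with the gold "answer".
            if record["response"].strip()[:1].upper() == record["answer"].strip().upper():
                correct += 1
    return correct / total if total else 0.0

print(vqa_accuracy("model_output_file.jsonl"))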
Xiujie Song: [email protected]
The construction of this repository draws on content from MM-VET.
If you find our work interesting, please feel free to cite our paper:
@article{song2024cognitive,
title={A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models},
author={Song, Xiujie and Wu, Mengyue and Zhu, Kenny Q and Zhang, Chunhao and Chen, Yanyi},
journal={arXiv preprint arXiv:2402.18409},
year={2024}
}