Reflection-Bench is an open-source benchmark inspired by cognitive psychology, designed to systematically evaluate the epistemic agency of large language models (LLMs) as the core of autonomous agents. It focuses on seven interrelated cognitive dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. Correspondingly, Reflection-Bench comprises seven parameterized cognitive tests: the Weather Prediction Task, Wisconsin Card Sorting Task, Oddball Test, N-back Task, Double-Choice Iowa Gambling Task, Probabilistic Reversal Learning Task, and Meta-bandit Task.
🌟 Explore more on our project website: ReflectionBench.github.io.
Reflection-Bench: assessment architecture
| | Cognition focus | Task | Trials | Sessions |
|---|---|---|---|---|
| A | Perception | Oddball Paradigm | 50 | 3 |
| B | Working memory | N-back task | 52 | 2 |
| C | Belief updating | Probability Reversal Task (PRT) | 40 | 2 |
| D | Decision making | Wisconsin Card Sorting Task | 72 | 2 |
| E | Prediction | Weather Prediction Task | 50 | 2 |
| F | Counterfactual | Double-Choice Iowa Gambling Task | 50 | 2 |
| G | Meta-reflection | Meta-PRT | 40 | 2 |
Before running the pipeline, configure the settings in `config.py`. This file allows you to specify:
Model Information
- `model name`: The name of the model being evaluated.
- `output strategy`: The output strategy used during evaluation: `None` for free output, `'Direct'` for direct generation, or `True` for Chain-of-Thought (CoT) reasoning.
API Settings
Provide the base URL and API keys for the evaluated model and any external services (e.g., DeepSeek for result extraction, or the text-embedding-3-large API for the Oddball Test).
Task-Specific Settings
Customize the number of sessions and trials, the probabilities, or other parameters for each task.
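For orientation, a configuration along these lines might look roughly like the sketch below; the variable names, key names, and values are illustrative assumptions, not the actual contents of `config.py`.

```python
# Illustrative sketch only -- the actual config.py may use different names and structure.

# Model information
MODEL_NAME = "gpt-4o"        # name of the model being evaluated
OUTPUT_STRATEGY = None       # None = free output, 'Direct' = direct generation, True = CoT

# API settings (placeholder values)
MODEL_BASE_URL = "https://api.example.com/v1"   # endpoint of the evaluated model
MODEL_API_KEY = "sk-..."                        # key for the evaluated model
EXTRACTION_API_KEY = "sk-..."                   # external service for result extraction (e.g., DeepSeek)
EMBEDDING_API_KEY = "sk-..."                    # embedding API used by the Oddball Test

# Task-specific settings (trial/session counts mirror the table above)
TASK_SETTINGS = {
    "oddball":  {"sessions": 3, "trials": 50},
    "n_back":   {"sessions": 2, "trials": 52},
    "prt":      {"sessions": 2, "trials": 40, "reward_probability": 0.9},  # probability value is illustrative
    "wcst":     {"sessions": 2, "trials": 72},
    "wpt":      {"sessions": 2, "trials": 50},
    "igt":      {"sessions": 2, "trials": 50},
    "meta_prt": {"sessions": 2, "trials": 40},
}
```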
Once the configuration is set, you can execute the evaluation pipeline with the provided script `evaluate.py`. This script will:
- Load the model and configuration settings from `config.py`.
- Run all seven tasks sequentially and generate scores.
- Save the results of each task to a pickle file named `<model_name>_results_<output_strategy>.pkl`.
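Conceptually, the flow of `evaluate.py` corresponds to something like the following sketch; the `Pipeline` class name and the `config` attribute names are assumptions made for illustration, not the repository's actual API.

```python
# Illustrative sketch of the evaluate.py flow; actual class/attribute names may differ.
import config                   # model name, output strategy, API and task settings
from pipeline import Pipeline   # assumed entry point defined in pipeline.py


def main():
    # 1. Load the model and configuration settings from config.py
    pipeline = Pipeline(model_name=config.MODEL_NAME,
                        output_strategy=config.OUTPUT_STRATEGY)

    # 2. Run all seven tasks sequentially; the pipeline saves each task's result
    #    to <model_name>_results_<output_strategy>.pkl as it goes
    pipeline.run()


if __name__ == "__main__":
    main()
```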
To customize or extend the evaluation process, you can edit `pipeline.py`.
This file defines the core evaluation logic: initializing tasks from the `tasks/` directory, running evaluations sequentially, and saving task-specific results via the `update_pickle` method.
You can modify the pipeline to add or remove tasks, change the task execution order, or adjust the evaluation logic for specific requirements.
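To make these customization points concrete, here is a minimal sketch of such a pipeline; the task module names, method signatures, and result format are assumptions, not the repository's actual implementation.

```python
# Minimal sketch of a pipeline structure; module names and signatures are illustrative.
import pickle

from tasks import oddball, n_back, prt, wcst, wpt, igt, meta_prt  # tasks/ directory


class Pipeline:
    def __init__(self, model_name, output_strategy):
        self.model_name = model_name
        self.output_strategy = output_strategy
        self.results_file = f"{model_name}_results_{output_strategy}.pkl"
        # Add, remove, or reorder entries here to customize which tasks run and in what order
        self.tasks = [oddball, n_back, prt, wcst, wpt, igt, meta_prt]

    def run(self):
        results = {}
        for task in self.tasks:
            # Each task module is assumed to expose an evaluate() entry point
            score = task.evaluate(self.model_name, self.output_strategy)
            results[task.__name__] = score
            self.update_pickle(results)   # persist partial results after every task
        return results

    def update_pickle(self, results):
        with open(self.results_file, "wb") as f:
            pickle.dump(results, f)
```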
We conducted comprehensive evaluations that covered a diverse range of LLMs, including:
- Large reasoning models: OpenAI o1-preview, o1-mini, QwQ-32B-Preview
- Prominent LLMs: GPT-4o, GPT-4o-mini, Claude-3.5-Sonnet, Claude-3.5-Haiku, DeepSeek-V3, Gemini-2-flash, Grok-2, Llama-3.3-70B
- The Qwen-2.5 family at varying sizes: 72B, 32B, 14B, and 7B
The results reveal a three-tier hierarchy: six state-of-the-art LLMs score above 60, eight moderate models score between 50 and 60, and Qwen-2.5-7B-Instruct scores below 50. Although current LLMs exhibit a certain degree of epistemic agency, they all struggle with meta-reflection.
You can cite Reflection-Bench as:
@misc{li2025reflectionbenchevaluatingepistemicagency,
title={Reflection-Bench: Evaluating Epistemic Agency in Large Language Models},
author={Lingyu Li and Yixu Wang and Haiquan Zhao and Shuqi Kong and Yan Teng and Chunbo Li and Yingchun Wang},
year={2025},
eprint={2410.16270},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.16270},
}




