Reflection-Bench is an open-source benchmark inspired by cognitive psychology, designed to systematically evaluate the epistemic agency of large language models (LLMs) as the core of autonomous agents. It focuses on seven interrelated cognitive dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. Correspondingly, Reflection-Bench comprises seven parameterized cognitive tests: the Weather Prediction Task, Wisconsin Card Sorting Task, Oddball Test, N-back Task, Double-Choice Iowa Gambling Task, Probabilistic Reversal Learning Task, and Meta-bandit Task.
🌟 Explore more on our project website: ReflectionBench.github.io.
Reflection-Bench: assessment architecture
| | Cognition focus | Task | Trials | Sessions |
|---|---|---|---|---|
| A | Perception | Oddball Paradigm | 50 | 3 |
| B | Working memory | N-back task | 52 | 2 |
| C | Belief updating | Probability Reversal Task (PRT) | 40 | 2 |
| D | Decision making | Wisconsin Card Sorting Task | 72 | 2 |
| E | Prediction | Weather Prediction Task | 50 | 2 |
| F | Counterfactual | Double-Choice Iowa Gambling Task | 50 | 2 |
| G | Meta-reflection | Meta-PRT | 40 | 2 |
Before running the pipeline, configure the settings in `config.py`. This file allows you to specify:
Model Information
- `model name`: The name of the model being evaluated.
- `output strategy`: The output strategy used during evaluation: `None` for free output, `'Direct'` for direct generation, or `True` for Chain-of-Thought (CoT) reasoning.
API Settings
Provide the base URL and API keys for the evaluated model and any external services (e.g., DeepSeek for result extraction, or the text-embedding-3-large API for the Oddball Test).
Task-Specific Settings
Customize the number of sessions and trials, the probabilities, or other parameters for each task.
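For orientation, a configuration along these lines might look roughly like the sketch below; the variable names, key names, and values are illustrative assumptions, not the actual contents of `config.py`.

```python
# Illustrative sketch only -- the actual config.py may use different names and structure.

# Model information
MODEL_NAME = "gpt-4o"        # name of the model being evaluated
OUTPUT_STRATEGY = None       # None = free output, 'Direct' = direct generation, True = CoT

# API settings (placeholder values)
MODEL_BASE_URL = "https://api.example.com/v1"   # endpoint of the evaluated model
MODEL_API_KEY = "sk-..."                        # key for the evaluated model
EXTRACTION_API_KEY = "sk-..."                   # external service for result extraction (e.g., DeepSeek)
EMBEDDING_API_KEY = "sk-..."                    # embedding API used by the Oddball Test

# Task-specific settings (trial/session counts mirror the table above)
TASK_SETTINGS = {
    "oddball":  {"sessions": 3, "trials": 50},
    "n_back":   {"sessions": 2, "trials": 52},
    "prt":      {"sessions": 2, "trials": 40, "reward_probability": 0.9},  # probability value is illustrative
    "wcst":     {"sessions": 2, "trials": 72},
    "wpt":      {"sessions": 2, "trials": 50},
    "igt":      {"sessions": 2, "trials": 50},
    "meta_prt": {"sessions": 2, "trials": 40},
}
```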
Once the configuration is set, you can execute the evaluation pipeline with the provided script `evaluate.py`. This script will:
- Load the model and configuration settings from `config.py`.
- Run all seven tasks sequentially and generate scores.
- Save the results of each task to a pickle file named `<model_name>_results_<output_strategy>.pkl`.
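Conceptually, the flow of `evaluate.py` corresponds to something like the following sketch; the `Pipeline` class name and the `config` attribute names are assumptions made for illustration, not the repository's actual API.

```python
# Illustrative sketch of the evaluate.py flow; actual class/attribute names may differ.
import config                   # model name, output strategy, API and task settings
from pipeline import Pipeline   # assumed entry point defined in pipeline.py


def main():
    # 1. Load the model and configuration settings from config.py
    pipeline = Pipeline(model_name=config.MODEL_NAME,
                        output_strategy=config.OUTPUT_STRATEGY)

    # 2. Run all seven tasks sequentially; the pipeline saves each task's result
    #    to <model_name>_results_<output_strategy>.pkl as it goes
    pipeline.run()


if __name__ == "__main__":
    main()
```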
To customize or extend the evaluation process, you can edit `pipeline.py`.
This file defines the core evaluation logic: initializing tasks from the `tasks/` directory, running evaluations sequentially, and saving task-specific results via the `update_pickle` method.
You can modify the pipeline to add or remove tasks, change the task execution order, or adjust the evaluation logic for specific requirements.
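To make these customization points concrete, here is a minimal sketch of such a pipeline; the task module names, method signatures, and result format are assumptions, not the repository's actual implementation.

```python
# Minimal sketch of a pipeline structure; module names and signatures are illustrative.
import pickle

from tasks import oddball, n_back, prt, wcst, wpt, igt, meta_prt  # tasks/ directory


class Pipeline:
    def __init__(self, model_name, output_strategy):
        self.model_name = model_name
        self.output_strategy = output_strategy
        self.results_file = f"{model_name}_results_{output_strategy}.pkl"
        # Add, remove, or reorder entries here to customize which tasks run and in what order
        self.tasks = [oddball, n_back, prt, wcst, wpt, igt, meta_prt]

    def run(self):
        results = {}
        for task in self.tasks:
            # Each task module is assumed to expose an evaluate() entry point
            score = task.evaluate(self.model_name, self.output_strategy)
            results[task.__name__] = score
            self.update_pickle(results)   # persist partial results after every task
        return results

    def update_pickle(self, results):
        with open(self.results_file, "wb") as f:
            pickle.dump(results, f)
```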
We conducted comprehensive evaluations that covered a diverse range of LLMs, including:
- Large reasoning models: OpenAI o1-preview, o1-mini, QwQ-32B-Preview
- Prominent LLMs: GPT-4o, GPT-4o-mini, Claude-3.5-Sonnet, Claude-3.5-Haiku, DeepSeek-V3, Gemini-2-flash, Grok-2, Llama-3.3-70B
- The Qwen-2.5 family at varying sizes: 72B, 32B, 14B, and 7B
The results reveal a three-tier hierarchy: six state-of-the-art LLMs score above 60, eight moderate models score between 50 and 60, and Qwen-2.5-7B-Instruct scores below 50. Although current LLMs exhibit a certain degree of epistemic agency, they all struggle with meta-reflection.
You can cite Reflection-Bench as:
@misc{li2025reflectionbenchevaluatingepistemicagency,
title={Reflection-Bench: Evaluating Epistemic Agency in Large Language Models},
author={Lingyu Li and Yixu Wang and Haiquan Zhao and Shuqi Kong and Yan Teng and Chunbo Li and Yingchun Wang},
year={2025},
eprint={2410.16270},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.16270},
}




