📢 We are currently organizing the code for IS-Bench. If you are interested in our work, please star ⭐ our project.
📆[2025-11-08] 🎈 Our paper has been accepted to the AAAI Conference. See you in Singapore~ 🎈
📆[2025-07-07] 🎈 Our paper, code and dataset are released! 🎈
Existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps.
Our experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion.
- OS: Linux (Ubuntu 20.04+), Windows 10+
- RAM: 32GB+ recommended
- VRAM: 8GB+
- GPU: NVIDIA RTX 2080+
- Install Vulkan

```bash
sudo apt update
sudo apt install vulkan-utils libvulkan1
vulkaninfo | grep "Vulkan Instance Version"  # verify the installation

export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/nvidia/bin:$PATH
```
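If you prefer not to re-export these variables in every new shell, you can append them to your shell profile. A minimal sketch, assuming the NVIDIA driver paths above match your installation:

```bash
# Optional: persist the Vulkan/NVIDIA environment variables.
# The paths are assumptions; adjust them to your driver installation before appending.
cat >> ~/.bashrc << 'EOF'
export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/nvidia/bin:$PATH
EOF
```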
- Install OmniGibson

```bash
conda create -n isbench python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 "numpy<2" -c pytorch -c nvidia
conda activate isbench
pip install omnigibson==1.1.1 --index-url https://pypi.org/simple
python -m pip install --force-reinstall numpy scipy --index-url https://pypi.org/simple
python -m omnigibson.install  # install OmniGibson assets and datasets
```
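As a quick, optional check that the pinned simulator version landed in the new environment (the grep filter is just a convenience):

```bash
# Optional: confirm that OmniGibson 1.1.1 is installed in the isbench environment.
pip show omnigibson | grep -i "^version"
```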
If you want to use OmniGibson in Docker instead, please refer to its Docker documentation; the basic steps are:

```bash
docker pull stanfordvl/omnigibson:1.1.1  # this image already contains Vulkan
docker run stanfordvl/omnigibson:1.1.1
micromamba run -n omnigibson python -m omnigibson.install  # install dataset
```
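In practice you will likely want GPU access and a persistent dataset location inside the container. A sketch (the host mount path and everything beyond `--gpus all` are assumptions, not values shipped with IS-Bench):

```bash
# Sketch: run the image interactively with GPU access and a host-mounted data directory.
# $HOME/omnigibson-data is a placeholder; pick any host path you like.
docker run --gpus all -it \
    -v $HOME/omnigibson-data:/data \
    stanfordvl/omnigibson:1.1.1
```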
- Download Source Code and BDDL of IS-Bench

```bash
git clone https://github.com/AI45Lab/IS-Bench
cd IS-Bench
pip install -r requirements.txt  # for the docker image: micromamba run -n omnigibson pip install -r requirements.txt
pip install -e ./bddl  # for the docker image: micromamba run -n omnigibson pip install -e ./bddl
```
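An optional sanity check that the editable `bddl` package resolves from the checkout (prefix the command with `micromamba run -n omnigibson` inside the Docker image):

```bash
# Optional: the import should succeed and point back into the IS-Bench checkout.
python -c "import bddl; print(bddl.__file__)"
```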
- Download Scene Dataset

```bash
cd ../data
wget https://huggingface.co/datasets/Ursulalala/IS_Bench_scenes/resolve/main/scenes.tar.gz
tar -xzvf scenes.tar.gz
rm scenes.tar.gz
```

If you are using Slurm to run IS-Bench, please first revise the benchmark launcher at `scripts/launcher.sh`.
Our code supports API-based models with the openai or google-genai format.

- Configure `api_base` and `api_key` in `entrypoints/env.sh`.
- Add a proxy in `og_ego_prim/models/server_inference.py` if needed.
- Execute the following script:

```bash
bash entrypoints/eval_close.sh $MODEL_NAME $DATA_PARALLEL
```
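For example, a run against GPT-4o with four data-parallel workers might look like the following (the exact model identifier depends on your API provider; both values here are illustrative only):

```bash
# Illustrative invocation; substitute your own model name and parallelism.
bash entrypoints/eval_close.sh gpt-4o 4
```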
For locally served open-weight models:

- Execute `entrypoints/vllm_serve.sh` to deploy a server for the evaluated model and note the server IP:

```bash
bash entrypoints/vllm_serve.sh $LOCAL_MODEL_PATH $GPUS
```

- Execute the following script:

```bash
bash entrypoints/eval_open.sh $MODEL_NAME_OR_PATH $SERVER_IP $DATA_PARALLEL
```
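Putting the two steps together, a local open-weight model could be evaluated roughly as follows (the checkpoint path, server IP, and parallelism are placeholders, not values shipped with IS-Bench):

```bash
# Sketch of the open-model workflow; replace every value with your own.
bash entrypoints/vllm_serve.sh /path/to/Qwen2-VL-7B-Instruct 2
bash entrypoints/eval_open.sh /path/to/Qwen2-VL-7B-Instruct 10.0.0.5 4
```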
- Revise `entrypoints/task_list.txt` to specify the tasks that need to be evaluated.
- Revise `prompt_setting` to change the safety reminder:
  - v0: no safety reminder.
  - v1: implicit safety reminder.
  - v2: safety Chain-of-Thought (CoT) reminder.
  - v3: explicit safety reminder.
- Set the following parameters for optional scene information:
  - `draw_bbox_2d`
  - `use_initial_setup`
  - `use_self_caption`
- Set the following parameters for partial evaluation:
  - `not_eval_process_safety`
  - `not_eval_termination_safety`
  - `not_eval_awareness`
  - `not_eval_execution`
- Since the performance of OmniGibson may vary depending on the hardware environment, you can run the following script to check whether the tasks in IS-Bench can be successfully executed in your environment:

```bash
bash entrypoints/validate_gt.sh
```

We leveraged part of the data and code framework from the BEHAVIOR-1K dataset and the OmniGibson simulator.
```bibtex
@misc{lu2025isbench,
  title={IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks},
  author={Xiaoya Lu and Zeren Chen and Xuhao Hu and Yijin Zhou and Weichen Zhang and Dongrui Liu and Lu Sheng and Jing Shao},
  year={2025},
  eprint={2506.16402},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2506.16402},
}
```
