
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

📢 We are currently organizing the code for IS-Bench. If you are interested in our work, please star ⭐ our project.

🔥 Updates

📆[2025-11-08] 🎈 Our paper has been accepted to the AAAI Conference. See you in Singapore~ 🎈

📆[2025-07-07] 🎈 Our paper, code and dataset are released! 🎈

🎉 Introduction

[Figure: overview of IS-Bench]

Existing static, non-interactive evaluation paradigms fail to adequately assess the risks embodied agents face in interactive environments: they cannot simulate the dynamic risks that emerge from an agent's own actions, and they rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk-mitigation actions are performed before or after specific risk-prone steps.

📍 Results of IS-Bench

[Figure: main results of IS-Bench]

Our experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought reasoning can improve safety, it often compromises task completion.

⚙️ Installation

System Requirements for Omnigibson

OS: Linux (Ubuntu 20.04+), Windows 10+

RAM: 32GB+ recommended

VRAM: 8GB+

GPU: NVIDIA RTX 2080+
  1. Install Vulkan
sudo apt update
sudo apt install vulkan-utils libvulkan1
vulkaninfo | grep "Vulkan Instance Version" # test
export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64:$LD_LIBRARY_PATH
export PATH=/usr/local/nvidia/bin:$PATH
  2. Install Omnigibson
conda create -n isbench python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 "numpy<2" -c pytorch -c nvidia
conda activate isbench
pip install omnigibson==1.1.1 --index-url https://pypi.org/simple
python -m pip install --force-reinstall numpy scipy --index-url https://pypi.org/simple 
python -m omnigibson.install    # install omnigibson assets and datasets
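
After the assets and datasets finish downloading, a quick import check helps confirm the installation before moving on (a minimal sketch; depending on your setup the import itself may take some time):
python -c "import omnigibson; print('Omnigibson imported from', omnigibson.__file__)"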

If you want to use Omnigibson in Docker, use the official image instead:

docker pull stanfordvl/omnigibson:1.1.1 # this image already contains Vulkan
docker run --gpus all -it stanfordvl/omnigibson:1.1.1 # expose GPUs and open an interactive shell
micromamba run -n omnigibson python -m omnigibson.install # install omnigibson assets and datasets
  3. Download Source Code and BDDL of IS-Bench
git clone https://github.com/AI45Lab/IS-Bench
cd IS-Bench
pip install -r requirements.txt # for docker image: micromamba run -n omnigibson pip install -r requirements.txt
pip install -e ./bddl # for docker image: micromamba run -n omnigibson pip install -e ./bddl
  4. Download Scene Dataset
cd ../data
wget https://huggingface.co/datasets/Ursulalala/IS_Bench_scenes/resolve/main/scenes.tar.gz
tar -xzvf scenes.tar.gz
rm scenes.tar.gz

🚀 Usage

If you are using Slurm to run IS-Bench, first adapt the benchmark launcher at scripts/launcher.sh.

Evaluate Closed-Source Models

Our code supports API-based models that use the openai or google-genai request format.

  1. Configure api_base and api_key in entrypoints/env.sh
  2. Add a proxy in og_ego_prim/models/server_inference.py if needed.
  3. Execute the following script:
bash entrypoints/eval_close.sh $MODEL_NAME $DATA_PARALLEL
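
As a concrete illustration, an OpenAI-compatible setup might look like the following (the exact export names expected by entrypoints/env.sh, the model identifier, and the parallelism value are assumptions for illustration; check the scripts for the authoritative names):
export api_base="https://api.openai.com/v1"  # or your proxy endpoint
export api_key="YOUR_API_KEY"
bash entrypoints/eval_close.sh gpt-4o 4      # evaluate GPT-4o with 4 parallel workers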

Evaluate Open-Source Models

  1. Execute entrypoints/vllm_serve.sh to deploy a server for the evaluated model and note the server IP.
bash entrypoints/vllm_serve.sh $LOCAL_MODEL_PATH $GPUS
  2. Execute the following script:
bash entrypoints/eval_open.sh $MODEL_NAME_OR_PATH $SERVER_IP $DATA_PARALLEL
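
For example (the checkpoint path, server IP, and GPU/parallelism counts below are placeholders, not values shipped with the repo):
bash entrypoints/vllm_serve.sh /path/to/local-vlm-checkpoint 2   # serve the model on 2 GPUs
bash entrypoints/eval_open.sh /path/to/local-vlm-checkpoint 127.0.0.1 4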

Advanced Configuration

  1. Revise entrypoints/task_list.txt to specify the tasks that need to be evaluated.

  2. Revise prompt_setting to change the safety reminder:

  • v0: no safety reminder.
  • v1: implicit safety reminder.
  • v2: safety Chain-of-Thought (CoT) reminder.
  • v3: explicit safety reminder.
  3. Set the following parameters for optional scene information:
  • draw_bbox_2d
  • use_initial_setup
  • use_self_caption
  4. Set the following parameters for partial evaluation (an illustrative invocation combining these options appears at the end of this section):
  • not_eval_process_safety
  • not_eval_termination_safety
  • not_eval_awareness
  • not_eval_execution
  5. Since the performance of Omnigibson may vary with the hardware environment, you can run the following script to check whether the IS-Bench tasks execute successfully in your environment.
bash entrypoints/validate_gt.sh
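
As an illustration of the options above, a run with the safety-CoT reminder, 2D bounding boxes drawn in observations, and awareness evaluation disabled might be launched as follows. The flag syntax is an assumption; the parameters may instead be set inside the entrypoint scripts, so treat this as a sketch rather than an exact command:
bash entrypoints/eval_close.sh gpt-4o 4 --prompt_setting v2 --draw_bbox_2d --not_eval_awareness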

🙏 Acknowledgements

We leverage part of the data and the code framework from the Behavior-1K dataset and the Omnigibson simulator.

📑 Citation

@misc{lu2025isbench,
      title={IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks}, 
      author={Xiaoya Lu and Zeren Chen and Xuhao Hu and Yijin Zhou and Weichen Zhang and Dongrui Liu and Lu Sheng and Jing Shao},
      year={2025},
      eprint={2506.16402},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.16402}, 
}
