Haiyang Mei
Qiming Huang
Hai Ci
Mike Zheng Shou✉️
Show Lab, National University of Singapore
We introduce RobotSeg, the first foundation model for robot segmentation that:
- supports both images and videos,
- enables fine-grained segmentation of the robot arm, gripper, and whole robot, and
- offers promptable capabilities for flexible editing and annotation.
Table of Contents
🚀 1. Introduction
🎥 2. VRS Dataset
✨ 3. RobotSeg Model
🏆 4. State-of-the-Art Performance
🦾 5. Applications of RobotSeg
🙌 6. Acknowledgments
📚 7. Citation
🚀 1. Introduction
Existing segmentation models such as SAM 1/2/3 are remarkably powerful, yet, surprisingly, they still struggle to segment robots reliably ⚡️.
We are thrilled to introduce RobotSeg ✨, the first foundation model and dataset designed specifically for segmenting robots in images and videos.
RobotSeg targets four challenges that make robot segmentation uniquely difficult ⚡️:
- Embodiment Diversity – robots vary dramatically in shape, size, and articulation
- Appearance Ambiguity – their visual patterns often blend with cluttered backgrounds
- Structural Complexity – articulated arm links, joints, and grippers form intricate structures
- Rapid Shape Changes – fast manipulation causes large geometric and motion variations
RobotSeg delivers accurate and consistent robot masks that support:
🧩 robot-centric data augmentation
🏗️ digital-twin reconstruction for robotic systems
🤖 robot pose and action extraction
🎥 2. VRS Dataset
To support comprehensive evaluation and training, we construct VRS, the first video robot segmentation benchmark:
📌 2,812 videos (138,707 frames)
📌 10 robot embodiments (Franka, Fanuc Mate, UR5, KUKA iiwa, Google Robot, MobileALOHA, xArm, WidowX, Sawyer, Hello Stretch)
📌 Fine-grained masks for arm, gripper, and whole robot
✨ 3. RobotSeg Model
Built upon SAM 2, RobotSeg introduces three robot-centric innovations:
✨ Structure-Enhanced Memory Associator (SEMA): injects robot structural cues into memory matching to maintain stable, structure-preserving masks across video frames
✨ Robot Prompt Generator (RPG): produces semantic robot prompts that guide segmentation without requiring manual click or box inputs
✨ Label-Efficient Training (LET): supervises the model using only the first-frame ground-truth mask through cycle, semantic, and patch consistency losses
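To make the LET idea concrete, here is a minimal numpy sketch of the three consistency objectives. The specific formulations below (binary cross-entropy for the cycle term, cosine distance for the semantic term, patch-averaged L1 for the patch term) are our illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def cycle_loss(first_mask, cycled_mask):
    # The mask propagated forward to frame t and back to frame 0 should
    # reproduce the ground-truth first-frame mask (binary cross-entropy
    # here; an illustrative choice).
    eps = 1e-6
    p = np.clip(cycled_mask, eps, 1 - eps)
    return float(-np.mean(first_mask * np.log(p)
                          + (1 - first_mask) * np.log(1 - p)))

def semantic_loss(pred_emb, ref_emb):
    # Pull the embedding pooled over the predicted robot region toward a
    # reference robot embedding (cosine distance, illustrative).
    cos = np.dot(pred_emb, ref_emb) / (
        np.linalg.norm(pred_emb) * np.linalg.norm(ref_emb) + 1e-6)
    return float(1.0 - cos)

def patch_loss(mask_t, mask_t1, patch=4):
    # Keep patch-level mask statistics consistent across adjacent frames
    # (patch-averaged L1, illustrative).
    h, w = mask_t.shape
    h, w = h // patch * patch, w // patch * patch
    to_patches = lambda m: m[:h, :w].reshape(
        h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return float(np.mean(np.abs(to_patches(mask_t) - to_patches(mask_t1))))

def let_loss(first_mask, cycled_mask, pred_emb, ref_emb, mask_t, mask_t1):
    # Combined label-efficient objective: only the first-frame
    # ground-truth mask is used as direct supervision.
    return (cycle_loss(first_mask, cycled_mask)
            + semantic_loss(pred_emb, ref_emb)
            + patch_loss(mask_t, mask_t1))
```

The key property is that none of the three terms needs per-frame ground truth beyond frame 0, which is what makes the training label-efficient.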
🏆 4. State-of-the-Art Performance
🔥 Leading performance over robot-specific baselines (RoVi-Aug, RoboEngine)
🔥 Outperforms language-conditioned approaches including CLIPSeg, LISA, EVF-SAM, VideoLISA, and SAM 3
🔥 Surpasses SAM 2.1 across prompt settings (automatic, 1-click, 3-click, box, online-interactive)
🔥 Lightweight: only 41.3M parameters, running at over 10 FPS during inference
🔥 Robust to 10 diverse robot embodiments
The table below summarizes quantitative comparisons on the RoboEngine (image) and VRS (video) datasets across diverse prompt settings: automatic (AU), 1-click (1C), 3-click (3C), bounding-box (BB), and online-interactive (OI). "–" denotes that a method does not support the given setting. RobotSeg delivers the best segmentation performance while maintaining competitive computational efficiency.
(a) Comparison against image-level robot segmentation method RoboEngine
(b) Comparison against general promptable segmentation method SAM 2.1
(c) Comparison against concept segmentation method SAM 3
(d) Comparison under point or box prompts
🦾 5. Applications of RobotSeg
The accurate, consistent robot masks produced by RobotSeg enable a range of downstream uses:
Precise robot masks allow compositing the robot into new environments, generating diverse visual conditions for robust policy learning and sim-to-real adaptation.
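Such compositing can be illustrated with a minimal numpy sketch (the function name and the soft-blending choice are ours, not part of RobotSeg):

```python
import numpy as np

def composite_robot(frame, mask, background):
    # Paste the robot pixels selected by `mask` from `frame` onto a new
    # `background`. frame/background: (H, W, 3) uint8 images;
    # mask: (H, W) float in [0, 1], e.g. a whole-robot mask.
    # Soft mask values near the boundary blend the two images
    # for cleaner edges.
    m = mask[..., None].astype(np.float32)
    out = m * frame.astype(np.float32) + (1.0 - m) * background.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applying this per frame with randomized backgrounds yields the kind of robot-centric augmentation described above.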
RobotSeg provides accurate robot masks that can be used by modern 3D reconstruction pipelines (e.g., SAM-3D Objects) to generate high-quality robot geometry for digital-twin modeling.
🙌 6. Acknowledgments
RobotSeg is built upon SAM 2.
📚 7. Citation
If you find our work useful, please consider citing our paper:
@article{mei2025robotseg,
title={RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video},
author={Mei, Haiyang and Huang, Qiming and Ci, Hai and Shou, Mike Zheng},
journal={arXiv:2511.xxxxx},
year={2025}
}