Haiyang Mei
Qiming Huang
Hai Ci
Mike Zheng Shou✉️
Show Lab, National University of Singapore
We introduce RobotSeg, the first foundation model for robot segmentation that:
- supports both images and videos,
- enables fine-grained segmentation of the robot arm, gripper, and whole robot, and
- offers promptable capabilities for flexible editing and annotation.
Table of Contents
🚀 1. Introduction
🎥 2. VRS Dataset
✨ 3. RobotSeg Model
🏆 4. State-of-the-Art Performance
🦾 5. Applications of RobotSeg
🙌 6. Acknowledgments
📚 7. Citation
🚀 1. Introduction
Existing segmentation models such as SAM 1/2/3 are remarkably powerful, yet, surprisingly, they still struggle to segment robots reliably ⚡️.
We are thrilled to introduce RobotSeg ✨, the first foundation model and dataset designed specifically for segmenting robots in images and videos.
RobotSeg targets four challenges that make robot segmentation uniquely difficult ⚡️:
- Embodiment Diversity – robots vary dramatically in shape, size, and articulation
- Appearance Ambiguity – their visual patterns often blend with cluttered backgrounds
- Structural Complexity – articulated arm links, joints, and grippers form intricate structures
- Rapid Shape Changes – fast manipulation causes large geometric and motion variations
RobotSeg delivers accurate and consistent robot masks that support:
🧩 robot-centric data augmentation
🏗️ digital-twin reconstruction for robotic systems
🤖 robot pose and action extraction
🎥 2. VRS Dataset
To support comprehensive evaluation and training, we construct VRS, the first video robot segmentation benchmark:
📌 2,812 videos (138,707 frames)
📌 10 robot embodiments (Franka, Fanuc Mate, UR5, KUKA iiwa, Google Robot, MobileALOHA, xArm, WidowX, Sawyer, Hello Stretch)
📌 Fine-grained masks for arm, gripper, and whole robot
✨ 3. RobotSeg Model
Built upon SAM 2, RobotSeg introduces three robot-centric innovations:
✨ Structure-Enhanced Memory Associator (SEMA): injects robot structural cues into memory matching to maintain stable, structure-preserving masks across video frames
✨ Robot Prompt Generator (RPG): produces semantic robot prompts that guide segmentation without requiring manual click or box inputs
✨ Label-Efficient Training (LET): supervises the model using only the first-frame ground-truth mask through cycle, semantic, and patch consistency losses
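To make the LET idea concrete, here is a minimal numpy sketch of the three consistency objectives. The specific formulations below (binary cross-entropy for the cycle term, cosine distance for the semantic term, patch-averaged L1 for the patch term) are our illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def cycle_loss(first_mask, cycled_mask):
    # The mask propagated forward to frame t and back to frame 0 should
    # reproduce the ground-truth first-frame mask (binary cross-entropy
    # here; an illustrative choice).
    eps = 1e-6
    p = np.clip(cycled_mask, eps, 1 - eps)
    return float(-np.mean(first_mask * np.log(p)
                          + (1 - first_mask) * np.log(1 - p)))

def semantic_loss(pred_emb, ref_emb):
    # Pull the embedding pooled over the predicted robot region toward a
    # reference robot embedding (cosine distance, illustrative).
    cos = np.dot(pred_emb, ref_emb) / (
        np.linalg.norm(pred_emb) * np.linalg.norm(ref_emb) + 1e-6)
    return float(1.0 - cos)

def patch_loss(mask_t, mask_t1, patch=4):
    # Keep patch-level mask statistics consistent across adjacent frames
    # (patch-averaged L1, illustrative).
    h, w = mask_t.shape
    h, w = h // patch * patch, w // patch * patch
    to_patches = lambda m: m[:h, :w].reshape(
        h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return float(np.mean(np.abs(to_patches(mask_t) - to_patches(mask_t1))))

def let_loss(first_mask, cycled_mask, pred_emb, ref_emb, mask_t, mask_t1):
    # Combined label-efficient objective: only the first-frame
    # ground-truth mask is used as direct supervision.
    return (cycle_loss(first_mask, cycled_mask)
            + semantic_loss(pred_emb, ref_emb)
            + patch_loss(mask_t, mask_t1))
```

The key property is that none of the three terms needs per-frame ground truth beyond frame 0, which is what makes the training label-efficient.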
🏆 4. State-of-the-Art Performance
🔥 Leading performance over robot-specific baselines (RoVi-Aug, RoboEngine)
🔥 Outperforms language-conditioned approaches including CLIPSeg, LISA, EVF-SAM, VideoLISA, and SAM 3
🔥 Surpasses SAM 2.1 across prompt settings (automatic, 1-click, 3-click, box, online-interactive)
🔥 Lightweight: only 41.3M parameters, running at over 10 FPS during inference
🔥 Robust to 10 diverse robot embodiments
The table below summarizes quantitative comparisons on the RoboEngine (image) and VRS (video) datasets across diverse prompt settings: automatic (AU), 1-click (1C), 3-click (3C), bounding-box (BB), and online-interactive (OI). "–" denotes that a method does not support the given setting. RobotSeg delivers the best segmentation performance while maintaining competitive computational efficiency.
(a) Comparison against image-level robot segmentation method RoboEngine
(b) Comparison against general promptable segmentation method SAM 2.1
(c) Comparison against concept segmentation method SAM 3
(d) Comparison under point or box prompts
🦾 5. Applications of RobotSeg
The accurate, consistent robot masks produced by RobotSeg enable a range of downstream uses:
Precise robot masks allow compositing the robot into new environments, generating diverse visual conditions for robust policy learning and sim-to-real adaptation.
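Such compositing can be illustrated with a minimal numpy sketch (the function name and the soft-blending choice are ours, not part of RobotSeg):

```python
import numpy as np

def composite_robot(frame, mask, background):
    # Paste the robot pixels selected by `mask` from `frame` onto a new
    # `background`. frame/background: (H, W, 3) uint8 images;
    # mask: (H, W) float in [0, 1], e.g. a whole-robot mask.
    # Soft mask values near the boundary blend the two images
    # for cleaner edges.
    m = mask[..., None].astype(np.float32)
    out = m * frame.astype(np.float32) + (1.0 - m) * background.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applying this per frame with randomized backgrounds yields the kind of robot-centric augmentation described above.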
RobotSeg provides accurate robot masks that can be used by modern 3D reconstruction pipelines (e.g., SAM-3D Objects) to generate high-quality robot geometry for digital-twin modeling.
🙌 6. Acknowledgments
RobotSeg is built upon SAM 2.
📚 7. Citation
If you find our work useful, please consider citing our paper:
@article{mei2025robotseg,
title={RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video},
author={Mei, Haiyang and Huang, Qiming and Ci, Hai and Shou, Mike Zheng},
journal={arXiv:2511.xxxxx},
year={2025}
}