Discussion about Refining Zero-Shot Object Segmentation by Combining Vision Foundation Models #29549

saadkhi · 2025-03-18T19:20:15Z

Dear Daan Krol, Klaas Dijkstra, and Samet Akcay,

I hope you're doing well. My name is Saad Ather Ali, and I am excited about the Refining Zero-Shot Object Segmentation project for Google Summer of Code. With my background in computer vision, AI, and deep learning, I believe I can contribute effectively to this research-driven initiative.

My Interest in This Project

The concept of visual prompting with foundation models like DINOv2 and the Segment Anything Model (SAM) is fascinating, as it eliminates the dependency on labeled data while enabling efficient object segmentation. However, as mentioned in the project description, false positives and incomplete segmentations pose significant challenges. I am particularly interested in exploring novel filtering and merging techniques to enhance segmentation robustness and improve generalization across datasets.

How Can I Contribute?

I would love to understand how I can contribute to this project.

Does the project already have a codebase that I can explore?
Are there any specific areas where contributions are currently needed?
What would be the best starting point for someone looking to contribute?

I have hands-on experience with Python, OpenCV, TensorFlow, and PyTorch, and I am eager to deepen my understanding of vision foundation models like DINOv2 and CLIP while working on this project.

Looking forward to your guidance and insights on how I can get started!

Best regards,
Saad Ather Ali
GitHub | LinkedIn

kimdoeon · 2025-03-21T08:53:55Z

Dear Daan Krol, Klaas Dijkstra, and Samet Akcay,

Hello, my name is Doeon Kim, a Master’s student at the Graduate School of AI at Soongsil University in Korea.

I recently conducted research on cross-attention of heterogeneous feature maps, specifically fusing features from FPN-based backbones and a segmentation model to improve object detection performance. I have also worked with various vision transformers (ViT, SegFormer, HRFormer, Swin Transformer). That’s why the GSoC 2025 project titled “Visual Prompting-based Segmentation Refinement” stood out to me!

From the project description, I understand that the core goal is to improve the segmentation quality of Visual Prompting pipelines by refining masks generated by SAM—especially in cases where they are noisy or incomplete.

💡 While I’m still learning and have much to improve, I’ve been thinking about a few ideas that might be helpful:

Attention-guided mask refinement
Using cross-attention as a lightweight refinement module to improve mask consistency. Given that DINOv2 captures strong semantic features and SAM provides high-resolution masks, I believe a cross-attention mechanism between their outputs could help refine ambiguous regions.
Feature similarity-based mask filtering
I propose computing cosine similarity heatmaps between reference object features from DINOv2 and features across the query image. These maps can serve as a confidence prior to filter out false positive masks, or even weight mask predictions by similarity.
Benchmarking across various datasets
I'd love to help implement a flexible benchmarking framework that compares different mask refinement strategies (thresholding, filtering, etc.) using metrics like IoU and Dice, along with qualitative visualization.
Segmentation fusion using multi-layer feature maps
Based on my prior work with FPN + segmentation fusion, I’m curious to explore whether multi-scale feature integration between DINOv2 and SAM can help refine mask outputs. For example, we can experiment with fusing mid-level features from DINOv2 with SAM’s high-resolution mask representations to better handle fragmented or small-object masks.
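
As a rough illustration of the feature similarity-based mask filtering idea above, here is a minimal NumPy sketch. It is not code from the project: the function names, the mean-similarity scoring rule, and the 0.5 threshold are all hypothetical, and it assumes the patch-level feature map has already been resized to the same height and width as the candidate masks.

```python
import numpy as np


def cosine_similarity_map(ref_feature, query_features, eps=1e-8):
    """Cosine similarity between one reference vector and every location.

    ref_feature:    (C,) reference object embedding (e.g. from DINOv2).
    query_features: (H, W, C) per-location embeddings of the query image.
    Returns an (H, W) map with values in [-1, 1].
    """
    ref = ref_feature / (np.linalg.norm(ref_feature) + eps)
    norms = np.linalg.norm(query_features, axis=-1, keepdims=True)
    query = query_features / (norms + eps)
    return query @ ref


def filter_masks(masks, sim_map, threshold=0.5):
    """Keep masks whose mean similarity inside the mask reaches the threshold.

    masks:   iterable of boolean (H, W) arrays (e.g. candidate SAM masks).
    sim_map: (H, W) similarity map at the same resolution as the masks.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    kept = []
    for mask in masks:
        # Empty masks carry no evidence, so they are dropped.
        if mask.any() and sim_map[mask].mean() >= threshold:
            kept.append(mask)
    return kept
```

The mean similarity inside each mask could also be used as a soft weight rather than a hard cut-off, which would correspond to the "weight mask predictions by similarity" variant mentioned above.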

❓ I also had a few quick questions:

Is there an existing refinement baseline or method you'd prefer contributors to build upon?
Would it be possible to experiment with alternative backbones or multi-modal signals (e.g. text-guided refinement using CLIP)?

Thank you for your time and consideration. I'm genuinely excited about this project and hope to contribute with both implementation and idea exploration! I'm also very eager to learn from your expertise throughout the project!

Best regards,
Doeon Kim

1 reply

kimdoeon Mar 21, 2025

@Daankrol @samet-akcay

saadkhi · 2025-03-21T23:30:57Z

Author

@Daankrol
@samet-akcay

0 replies

Daankrol · 2025-03-25T13:54:46Z

@saadkhi @kimdoeon
The idea is to follow the interface/implementation of Visual Prompting in https://github.com/openvinotoolkit/model_api/tree/master
You are free to choose a method for mask refinement. It could be heuristic-based, ML-based, or something new.
You could check out repos such as SAM, DINO, PerSAM or Matcher.
We currently have no benchmarking framework in mind, so you are free to choose a suitable approach.
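
Since no benchmarking framework is prescribed, one possible starting point is to score refined masks against ground truth with the IoU and Dice metrics mentioned earlier in the thread. The sketch below is only an assumption about how that could look: the function names are hypothetical, and treating two empty masks as a perfect match (score 1.0) is a convention chosen here, not something the project specifies.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # Convention (assumed): two empty masks agree perfectly.
    return float(np.logical_and(pred, target).sum() / union)


def dice(pred, target):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # Same empty-mask convention as in iou().
    return float(2 * np.logical_and(pred, target).sum() / total)
```

Averaging these scores per image before and after a refinement step gives a simple way to compare strategies such as thresholding or filtering on the same dataset.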

2 replies

saadkhi Mar 26, 2025
Author

@Daankrol, can I select an issue and have it assigned to me so that I can better understand the codebase and become familiar with it?

Also, I have drafted a proposal. Could you please provide me with your email address so that I can share it with you? I would really appreciate your help in refining and polishing it.

Daankrol Apr 2, 2025

@saadkhi the current issues in the modelAPI repo are not about visual prompting, so I don't think that will help you. I would recommend running the pipeline that is currently implemented. Also look at repos such as Matcher.
You could share a public Google Doc with me. I might be able to review it.
