Discussion about : Improve OpenVINO training extensions classification component by introducing DinoV2 architecture with DoRA #29204

faizan102418 · 2025-02-28T06:27:09Z

faizan102418
Feb 28, 2025

Subject: Interest in OpenVINO GSoC Project: DinoV2 + DoRA for OTX

Dear Vladislav Sovrasov & Kirill Prokofiev,

I am Faizan, a final-year Computer Science student with a strong background in Machine Learning, Python, and AI. I am deeply interested in contributing to OpenVINO's Training Extensions project for GSoC 2025, particularly integrating DinoV2 with DoRA fine-tuning.

To better understand the project, I have:

Explored DinoV2 and its potential for self-supervised learning.
Read about DoRA, which optimizes fine-tuning efficiency.
Reviewed OpenVINO's OTX repository and how models are currently integrated.
I would love to discuss:

Which specific use cases or datasets would be ideal for testing DinoV2 with DoRA?
How OpenVINO handles model fine-tuning (e.g., will DinoV2 need custom preprocessing in OTX)?
What initial contributions I can make before GSoC starts?
Would it be possible to schedule a quick call or discuss this on OpenVINO’s community forum? I want to ensure I fully understand the expectations before preparing my proposal.

Looking forward to your guidance.
Best regards,
Faizan
@adrianboguszewski, could you please connect me with the mentors?

mlukasze · 2025-02-28T06:43:09Z

mlukasze
Feb 28, 2025
Collaborator

@kprokofi & @sovrasov
please join discussion

1 reply

faizan102418 Feb 28, 2025
Author

Thanks for your kind reply. I hope the mentors will join the discussion soon. @mlukasze

faizan102418 · 2025-03-01T15:51:33Z

faizan102418
Mar 1, 2025
Author

@kprokofi & @sovrasov, dear mentors, could you suggest an initial step for the project?

14 replies

faizan102418 Mar 5, 2025
Author

@faizan102418 it looks like the link hasn't been inserted

The link was not working. I have updated it; you can check now, @sovrasov.

faizan102418 Mar 6, 2025
Author

Really appreciate the time and insights in our discussion! The guidance on fine-tuning, datasets, and evaluation helped clarify the next steps. I'll focus on implementing LoRA and keep you updated.

@sovrasov, could you also share a summary of what we discussed and the next steps in the discussion thread? That would help keep everything aligned.

Looking forward to contributing—thanks again for your support!

Best,
Muhammad Faizan Sajid

aakashv000 Mar 7, 2025

Yes, a summary of the key discussion points would also help others who are interested in this project to catch up asynchronously.

sovrasov Mar 7, 2025

Hey guys, sure, we agreed on the following things:

- DoRA + linear layer fine-tuning is to be implemented in the PyTorch part of OTX. The goal is to reduce training time vs. existing recipes, and also vs. full model fine-tuning.
- Besides DinoV2, other transformer-based backbones can be considered during the experiments.
- All added models should be convertible to OV IR format.
- As a preliminary performance estimation, we can use FLOPs computed on a torch model, see: https://pytorch.org/tnt/stable/utils/generated/torchtnt.utils.flops.FlopTensorDispatchMode.html
- As an accuracy benchmark, a VTAB-like set of datasets should be prepared. The model accuracy score is the average validation score across the selected set of classification benchmark datasets.
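To make the "DoRA + linear layer" item more concrete, here is a minimal PyTorch sketch of a DoRA-wrapped linear layer. It is not OTX code: the class name `DoRALinear`, the `rank` and `alpha` defaults, and the choice of one magnitude per output feature are assumptions made for illustration. The idea is that the frozen weight is combined with a low-rank update, normalized, and rescaled by a trainable magnitude.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoRALinear(nn.Module):
    """Hypothetical wrapper: frozen nn.Linear + trainable magnitude and low-rank direction."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weight and bias stay frozen
        out_features, in_features = base.weight.shape
        # Low-rank direction update B @ A; B starts at zero so the layer
        # initially reproduces the pretrained one exactly.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank
        # Trainable magnitude, initialised to the L2 norm of each output row of W0
        # (assumed norm axis; conventions differ between implementations).
        self.magnitude = nn.Parameter(
            base.weight.detach().norm(p=2, dim=1, keepdim=True)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Direction: pretrained weight plus the scaled low-rank update.
        w = self.base.weight + self.scaling * (self.lora_B @ self.lora_A)
        # Normalise each row to unit length, then rescale by the learned magnitude.
        w = self.magnitude * w / w.norm(p=2, dim=1, keepdim=True)
        return F.linear(x, w, self.base.bias)


torch.manual_seed(0)
base = nn.Linear(8, 4)
layer = DoRALinear(base, rank=2)
x = torch.randn(3, 8)
trainable = sorted(n for n, p in layer.named_parameters() if p.requires_grad)
```

Only `lora_A`, `lora_B`, and `magnitude` receive gradients here, which is where the expected savings over full fine-tuning would come from.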

faizan102418 Mar 8, 2025
Author

Hi @sovrasov ,
For implementing DoRA + linear layer fine-tuning in the PyTorch part of OTX, should I do it directly in the repo and then submit a PR, or is it better to first implement it in a Colab notebook as an experiment and share the results before integrating it into the repo?
Looking forward to your guidance.
BR,
Muhammad Faizan Sajid

aakashv000 · 2025-03-07T04:27:44Z

aakashv000
Mar 7, 2025

Hi mentors @sovrasov, @kprokofi - Aakash here, keen to contribute to this project as part of GSoC 2025.

I wasn't able to find a good first issue that is either related to this project or unassigned and requiring Python.

Could you suggest an initial contribution for me too -
would it be the same, i.e.

baseline implementation of frozen DinoV2 + linear layer fine-tuning

or, would you like to suggest something new?

2 replies

sovrasov Mar 7, 2025

Hi @aakashv000 great to hear you'd like to join!
One more task that might be useful for us is adding a visual encoder from SmolVLM to OTX as a classification model, see https://huggingface.co/blog/smolvlm#architecture
It should be aligned with the other models we have in terms of code structure, and a training recipe should be available in https://github.com/openvinotoolkit/training_extensions/tree/develop/src/otx/recipe/classification/multi_class_cls alongside the other models

sovrasov Mar 17, 2025

Also, one more good first issue is now available: openvinotoolkit/training_extensions#4286
@kprokofi is to create a couple more issues

faizan102418 · 2025-03-10T04:53:55Z

faizan102418
Mar 10, 2025
Author

Hi @sovrasov,
I’ve implemented DoRA + linear layer fine-tuning in PyTorch for OTX, aiming to improve training speed compared to existing recipes and full model fine-tuning. I explored using DinoV2 and other transformer-based backbones. All models are convertible to OV IR format. For performance estimation, I used FLOPS computed on the PyTorch model. The accuracy benchmark is based on a VTAB-like dataset, with average validation scores across selected classification benchmarks. I tried pushing to the repository but faced issues. Here’s the link to the Google Colab notebook: https://colab.research.google.com/drive/1HDsV5ctJ-yDBRrlP5oOIgau5M1EfSw5j?usp=sharing .
I would appreciate your further guidance.
BR,
Muhammad Faizan Sajid

0 replies

faizan102418 · 2025-03-10T05:03:25Z

faizan102418
Mar 10, 2025
Author

@sovrasov , @kprokofi Could you please assign me a task for my initial contribution, as it is a prerequisite for GSoC 2025 by OpenVINO?

3 replies

kprokofi Mar 16, 2025

Hi @faizan102418, yes, we will create a good first issue early next week

faizan102418 Mar 17, 2025
Author

Hi @kprokofi,
For dataset selection, should we prioritize fine-grained classification tasks, domain shifts, or a combination of both? Also, are there specific preprocessing steps or evaluation metrics we should follow when selecting and sampling datasets?

kprokofi Mar 18, 2025

Hi @faizan102418, yes, it is important to focus on fine-grained classification tasks and also to consider diverse domains (medical images, textures, technical images, animals, vehicles, etc.). No specific preprocessing steps are required. You could consider long-tailed sampling strategies (class imbalance) to evaluate the performance of the VLMs against vanilla CNNs.
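One possible way to build a long-tailed (class-imbalanced) subset from a balanced dataset is sketched below. It is an illustration only: the function name `long_tailed_subset`, the exponential decay profile, and the `imbalance_ratio` parameter are assumptions, not something the mentors prescribed.

```python
import random
from collections import defaultdict


def long_tailed_subset(labels, imbalance_ratio=10.0, seed=0):
    """Return sorted sample indices whose per-class counts decay exponentially.

    The first class (in sorted label order) keeps the most samples and the
    last class keeps roughly 1 / imbalance_ratio as many.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    # Largest head-class size that every class is able to supply.
    n_max = min(len(by_class[c]) for c in classes)
    picked = []
    for rank, c in enumerate(classes):
        frac = rank / max(len(classes) - 1, 1)  # 0.0 for head, 1.0 for tail
        n_keep = max(1, round(n_max * imbalance_ratio ** (-frac)))
        picked.extend(rng.sample(by_class[c], n_keep))
    return sorted(picked)


# Toy balanced label list: 4 classes with 100 samples each.
labels = [c for c in range(4) for _ in range(100)]
subset = long_tailed_subset(labels, imbalance_ratio=10.0)
counts = [sum(1 for i in subset if labels[i] == c) for c in range(4)]
```

The returned indices can be passed to any dataset wrapper that selects samples by index; evaluating on the untouched validation split then shows how each model copes with the imbalance.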

GauravSRC · 2025-03-16T11:27:05Z

GauravSRC
Mar 16, 2025

Hello mentors @sovrasov and @kprokofi,

I'm writing to express my interest in contributing to the OpenVINO Training Extensions (OTX) project.

I'm a sophomore at IIT (BHU) Varanasi with experience in deep learning, Python, and computer vision architectures. I've worked with transformer models previously and am familiar with fine-tuning approaches.

Based on the project description and your guidance, I understand the goals include:
- Implementing DoRA + linear layer fine-tuning in OTX's PyTorch component
- Optimizing for reduced training time compared to existing approaches and full model fine-tuning
- Ensuring convertibility to OpenVINO IR format
- Benchmarking using FLOPS metrics and VTAB-like datasets for accuracy evaluation

I recognize that I'm reaching out a bit later than ideal, but I'm confident in my ability to quickly come up to speed and successfully complete the assigned tasks. I'm prepared to dedicate the necessary time and effort to meet all project milestones.

Some questions coming in my mind right now are:
Would you recommend starting with a baseline implementation of frozen DinoV2 + linear layer fine-tuning as a first contribution?
Are there specific datasets from the VTAB benchmark you'd like to prioritize for testing?
Or are there any other transformer backbones that might be worth exploring?

Looking forward to your response.
Best regards,
Gaurav

1 reply

kprokofi Mar 16, 2025

Hi Gaurav,

Would you recommend starting with a baseline implementation of frozen DinoV2 + linear layer fine-tuning as a first contribution?

Yes, this would be a reasonable first contribution. However, please note that we already have an implementation of DinoV2 in our repository, but we’ve encountered performance issues. As an initial contribution, improving performance or addressing these issues could be a valuable starting point.
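For readers new to the "frozen backbone + linear layer" baseline, a minimal PyTorch sketch follows. The tiny `nn.Sequential` backbone is only a stand-in so the example runs without downloading weights; `LinearProbe`, `feat_dim`, and the optimizer settings are hypothetical choices, not the existing OTX implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearProbe(nn.Module):
    """Frozen feature extractor followed by a trainable linear classifier."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # backbone weights are never updated
        self.head = nn.Linear(feat_dim, num_classes)

    def train(self, mode: bool = True):
        super().train(mode)
        # Keep normalisation/dropout layers of the frozen backbone in eval mode.
        self.backbone.eval()
        return self

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no activations are stored for the backbone
            feats = self.backbone(x)
        return self.head(feats)


torch.manual_seed(0)
# Stand-in backbone (assumption): a real run would plug in DinoV2 or another encoder.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.ReLU())
model = LinearProbe(backbone, feat_dim=16, num_classes=5)
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)

# One training step on random data, just to show the flow.
x = torch.randn(4, 3, 8, 8)
y = torch.tensor([0, 1, 2, 3])
backbone_before = backbone[1].weight.clone()
model.train()
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

Because only the head is optimized, each step costs a backbone forward pass plus a very small backward pass.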

Are there specific datasets from the VTAB benchmark you'd like to prioritize for testing?

I’d recommend avoiding overly simple datasets like CIFAR or MNIST. Also, keep in mind that the VTAB benchmark is just an example of the testing approach we’re aiming for. In general, you're free to choose datasets as long as they meet our requirements regarding size, diversity, and complexity.

Are there any other transformer backbones that might be worth exploring?

DinoV2 is just one example—we’re open to considering any VLM backbones that meet our requirements. The backbone should train quickly, achieve real-time performance (100+ FPS) with OpenVINO IR on modern Intel CPUs, and deliver high, robust, and stable accuracy across a diverse range of datasets, including smaller ones with only 6–10 training images. SmolVLM is another promising option to consider.
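A rough way to check the 100+ FPS target is to time repeated calls to the inference function. The helper below is a generic sketch: `measure_fps` and `infer_fn` are hypothetical names, and `infer_fn` stands in for whatever callable runs the converted model on one input. No OpenVINO API is assumed here.

```python
import time


def measure_fps(infer_fn, sample, warmup=5, iters=50):
    """Return average calls per second of infer_fn(sample) after a warm-up."""
    for _ in range(warmup):
        infer_fn(sample)  # warm-up runs are excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        infer_fn(sample)
    elapsed = time.perf_counter() - start
    return iters / max(elapsed, 1e-12)  # guard against a zero-length interval


# Dummy workload standing in for a real model call.
fps = measure_fps(lambda xs: sum(v * v for v in xs), list(range(1000)))
meets_target = fps >= 100.0
```

Single-image latency measured this way is only an approximation; batch size, threading, and the CPU used all affect the result.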

gyuilLim · 2025-03-18T09:02:15Z

gyuilLim
Mar 18, 2025

Dear @sovrasov and @kprokofi,

Hello, my name is Gyu Il Lim, and I am currently studying in the AI Convergence master's program at Soongsil University in Korea.

I’m conducting research on VLM lightweighting and fine-tuning, and I found an interesting project in OpenVINO that I would like to participate in, so I am leaving a comment!

I am reaching out regarding the GSoC 2025 project: Improve OpenVINO training extensions classification component by introducing DinoV2 architecture with DoRA. I apologize for the late contact.

Having read through this Discussion page, I have summarized the following points.

Model and Training Approach
- Implement fine-tuning based on DinoV2 + DoRA (baseline: frozen DinoV2 + linear classifier)
- Other transformer backbones are also considered
- Optimize training speed and achieve faster learning compared to existing methods
Datasets and Benchmarks
- Reference the VTAB benchmark, but exclude CIFAR/MNIST
- Focus on experiments with small datasets (20–1,000 images)
- Evaluate performance by sampling datasets from various categories
Performance Evaluation and OpenVINO Conversion
- Analyze FLOPs in PyTorch for performance prediction
- Maintain 100+ FPS after converting to OpenVINO
- Measure average accuracy across multiple datasets
Additional Potential Tasks
- Solve performance issues with DinoV2 (existing implementation available)
- Add SmolVLM-based classifier
- The first-contribution issue recommended by the mentors (Support linear classifier fine-tuning for classification models, training_extensions#4286)

In conclusion, I think the project will focus on optimizing fine-tuning based on Backbone (e.g., DinoV2) + DoRA in PyTorch, and maintaining performance (FPS, Acc) after conversion to OpenVINO, while conducting ablation studies to measure performance.
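The "average accuracy across multiple datasets" score mentioned in the summary above can be computed as a plain mean over per-dataset validation accuracies. The sketch below uses made-up dataset names and numbers purely for illustration; `benchmark_score` is a hypothetical helper.

```python
def benchmark_score(per_dataset_accuracy):
    """Unweighted mean of validation accuracies over the benchmark datasets."""
    if not per_dataset_accuracy:
        raise ValueError("at least one dataset result is required")
    return sum(per_dataset_accuracy.values()) / len(per_dataset_accuracy)


# Placeholder results (not real measurements).
results = {"textures": 0.80, "medical": 0.70, "vehicles": 0.90}
score = benchmark_score(results)
```

An unweighted mean treats every dataset equally regardless of its size, which matches the "average validation score across the selected set" wording used by the mentors.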

I have previous experience fine-tuning a VLM with LoRA for a project, and I am currently researching PEFT methods such as LoRA, QLoRA, and DoRA to improve the performance of lightweight Vision Language Models, so I believe I can contribute to this project.

First of all, I will work on the first recommended issue. Since the target datasets are very small (20–1,000 images), tuning only part of the model will likely be more beneficial than full fine-tuning.

Lastly, I have a few questions:

Is there a specific reason for focusing on dataset sizes (20–10,000)? It seems like overfitting could easily occur.
Is the core focus on finding a combination that improves inference performance in OTX after fine-tuning in PyTorch?
I am curious why DoRA was chosen as the PEFT method.

Thank you for your time and consideration. I look forward to your response and the possibility of contributing to this exciting project.

Best regards!!
Gyu Il Lim

0 replies

sovrasov · 2025-03-18T11:07:09Z

sovrasov
Mar 18, 2025

Hi @gyuilLim, welcome to the discussion! Great summary, more first issues are to come today, so you can choose.
Brief answers to the questions:

The requirement to fine-tune on small or medium datasets comes from one of OTX's major users, the Geti system. Since overfitting is possible, we're looking at methods that efficiently fine-tune only a fraction of the model parameters. This reduces the risk, but it's not the only countermeasure. The training strategy should be adjusted accordingly; in some cases we use early stopping and an adaptive learning rate schedule, for instance.
The core focus here is on efficient training. Inference performance should be taken into account as well, by considering backbones with a lower FLOPs budget.
We already have a LoRA implementation in OTX and would like to explore other similar methods. DoRA provides a number of improvements over LoRA, on paper at least. Overall, it's the same as with the backbone: DINOv2 is a starting point, but once we have a skeleton to benchmark different backbones and PEFT methods, we can relatively quickly explore more of them.
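The early stopping and adaptive learning rate idea from the first answer can be illustrated with a small framework-free controller. This is a sketch under assumptions, not OTX's training logic: `PlateauController`, its patience values, and the halving factor are all hypothetical.

```python
class PlateauController:
    """Reduce the learning rate on a validation plateau and signal when to stop."""

    def __init__(self, lr, lr_patience=2, stop_patience=5, factor=0.5, min_delta=0.0):
        self.lr = lr
        self.lr_patience = lr_patience      # stale epochs before each LR reduction
        self.stop_patience = stop_patience  # stale epochs before stopping
        self.factor = factor
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stale = 0

    def step(self, val_score):
        """Record one epoch's validation score; return True if training should stop."""
        if val_score > self.best + self.min_delta:
            self.best = val_score
            self.stale = 0
        else:
            self.stale += 1
            if self.stale % self.lr_patience == 0:
                self.lr *= self.factor
        return self.stale >= self.stop_patience


# Simulated validation accuracies that stop improving after the second epoch.
val_scores = [0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]
ctrl = PlateauController(lr=0.1, lr_patience=2, stop_patience=5)
stopped_at = None
for epoch, score in enumerate(val_scores):
    if ctrl.step(score):
        stopped_at = epoch
        break
```

On small datasets this kind of control limits the number of epochs spent after validation accuracy has saturated, which is one of the ways overfitting can be contained.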

1 reply

gyuilLim Mar 18, 2025

Hi @sovrasov!!

First of all, thanks for answering my questions!

I got a general understanding of the project. So there was a specific target user.
It also seems to be related to the users’ GPU conditions.
I’ll look into Geti.

Looking forward to the additional issues.

THANK YOU!

kprokofi · 2025-03-18T13:42:50Z

kprokofi
Mar 18, 2025

A few more good first issues: openvinotoolkit/training_extensions#4288
openvinotoolkit/training_extensions#4289

1 reply

faizan102418 Mar 19, 2025
Author

@sovrasov @kprokofi ,

I’ll be working on 4288 to improve DinoV2 fine-tuning. I’ll start by evaluating its performance against CNNs, optimizing the linear head, and exploring full fine-tuning as a baseline.
BR,
Muhammad Faizan Sajid

GauravSRC · 2025-03-19T15:35:20Z

GauravSRC
Mar 19, 2025

Dear @sovrasov and @kprokofi,
I had intended to begin work on issue #4289 (openvinotoolkit/training_extensions#4289); however, I've noticed that another contributor appears to have started addressing this task, although it has not yet been formally assigned.
In light of this, I would like to inquire about the following options:

Would you prefer to officially assign the issue "Introduce MobileNetV4 for Improved Performance" (training_extensions#4289) to me so I can proceed with it?
Is there another "good first issue" that you would recommend for my contribution?
Should I continue with my ongoing work on the SmolVLM implementation instead?
Thank you for your consideration. I look forward to your response.
Best regards,
Gaurav

1 reply

sovrasov Mar 19, 2025

@GauravSRC I'll sort this out, drop a comment in the MNV4 ticket, so I can assign you there

mohame54 · 2025-03-19T17:32:53Z

mohame54
Mar 19, 2025

Dear mentors, I'd like to contribute to this project as part of GSoC 2025.
I found a good first issue, but I think there is a problem, because I saw someone commenting that they had already finished this task: #20559
This is my pull request: #29499
I would be very grateful for your help. Thank you all in advance.

2 replies

kprokofi Mar 20, 2025

Hi @mohame54! Do you want to contribute to "Improve OpenVINO training extensions classification component by introducing DinoV2 architecture with DoRA" project or a different one? I see you are already assigned to #20559

mohame54 Mar 22, 2025

I'm just exploring my options. My main focus is on this project and another one, which is creating an OpenVINO wrapper for PyTorch, TensorFlow, and scikit-learn.

saadkhi · 2025-03-19T19:33:25Z

saadkhi
Mar 19, 2025

Dear @sovrasov and @kprokofi

I hope you are doing well. My name is Saad Ather Ali, and I am excited about the Gesture Control with OpenVINO project for Google Summer of Code. With my experience in computer vision and gesture-based interaction, I believe I can contribute effectively to this project.

I have been working on a project called Air Tracker ([GitHub Repository](https://github.com/saadkhi/AIR-TRACKER)), which is a hand gesture control system for project presentations. It enables users to interact with slides, multimedia, and on-screen elements using gesture recognition, eliminating the need for physical remotes or touch-based interfaces. My work on Air Tracker has given me hands-on experience with gesture-based navigation, media control, and custom gesture mapping, which aligns well with the objectives of this project.

Interest in This Project
The idea of porting MediaPipe Gesture Recognizer to OpenVINO and developing a local gesture control system for monitors greatly interests me. I see this as an opportunity to further optimize gesture recognition for real-time applications while enhancing the performance of gesture-based interactions using OpenVINO’s optimized inference capabilities.

Potential Contributions & Customization
With my prior experience, I would like to explore the following enhancements:

Efficient Model Optimization: Improve the performance of gesture recognition using OpenVINO optimizations.
Adaptive Gesture Control: Implement a dynamic mapping system that allows users to define their own gestures for various applications.
Cross-Application Compatibility: Extend the system to control not only presentations but also media players, browsers, and accessibility tools.
Multi-Device Integration: Investigate how this system can work across multiple screens or devices for a seamless user experience.
I have experience in Python, OpenCV, and TensorFlow, and I am eager to enhance my skills in C++ and OpenVINO while contributing to this project. I would love to hear your thoughts on additional areas I should focus on and any recommendations you might have for refining my approach.

Looking forward to your guidance and the opportunity to work on this project!

Best regards,
Saad Ather Ali

1 reply

kprokofi Mar 20, 2025

Hi @saadkhi! This is the "Improve OpenVINO training extensions classification component by introducing DinoV2 architecture with DoRA" project, not the Gesture Recognition one. You can reach out to the mentors in a different topic. I see you already did, though. Thank you!

faizan102418 · 2025-03-29T02:58:54Z

faizan102418
Mar 29, 2025
Author

Hey @sovrasov , how much improvement do we need in the dino_v2 for classification? Is there a specific accuracy target or percentage increase we should aim for?
BR,
Muhammad Faizan Sajid

1 reply

sovrasov Apr 1, 2025

dino_v2 is supposed to perform on par with other OTX models in the classification-layer fine-tuning scenario.
We've recently merged a PR that allows benchmarking other models in linear fine-tuning mode: openvinotoolkit/training_extensions#4298
Results of models like MobileNetV3 or EfficientNet-b0 can be used as a baseline.
