docs(2025): updated community bonding and week 1 documentation #310

48 changes: 47 additions & 1 deletion docs/2025/data-pipeline/index.md
@@ -2,4 +2,50 @@
sidebar_position: 2
title: Introduction
slug: /2025/data-pipeline/
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <[email protected]>
-->

## Author

[Abdulsobur Oyewale](https://github.com/smilingprogrammer)

## Contact info

- [Email](mailto:[email protected])

## Project title

Data Pipelining For Safaa

## What's the project about?

Safaa currently provides a framework for handling copyright notices, with a particular focus on identifying and reducing false positives and on streamlining the decluttering step that removes unnecessary content. Key features of Safaa include:
1. Model Flexibility
2. Integration with scikit-learn
3. spaCy Integration
4. Preprocessing Tools

However, data in the Safaa project is currently curated manually, and most of the other steps are manual as well.
This project will concentrate on creating a pipeline that automates them, using LLMs or other deep learning techniques where needed to improve accuracy.

This includes writing scripts that automatically copy copyright data (a group's data or some users' data) from a FOSSology instance to train the model.
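
As a rough illustration of what such a fetch script might look like, the sketch below exports rows through a generic Python DB-API connection into a CSV file. The table name `copyright`, the columns `content` and `is_enabled`, and both function names are assumptions made for this example and are not confirmed against the FOSSology schema. An in-memory `sqlite3` database stands in for the real server so that the sketch runs without a FOSSology instance.

```python
import csv
import sqlite3

# Hypothetical query: the table and column names are assumptions for
# illustration, not the confirmed FOSSology schema.
QUERY = "SELECT content, is_enabled FROM copyright WHERE content IS NOT NULL"


def export_copyrights(conn, out_path):
    """Write every non-empty copyright row to a CSV file; return the row count."""
    cursor = conn.cursor()
    cursor.execute(QUERY)
    rows = cursor.fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["content", "is_enabled"])
        writer.writerows(rows)
    return len(rows)


def make_demo_connection():
    """Build an in-memory stand-in database with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE copyright (content TEXT, is_enabled INTEGER)")
    conn.executemany(
        "INSERT INTO copyright VALUES (?, ?)",
        [
            ("Copyright (c) 2025 Example Author", 1),
            ("copyright notice must be retained", 0),
            (None, 1),  # skipped by the query because content is NULL
        ],
    )
    return conn
```

Calling `export_copyrights(make_demo_connection(), "copyrights.csv")` writes the two non-empty sample rows. A real run would pass a connection to the FOSSology database instead of the demo connection.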


## What should be done?

Here are the key tasks planned for the project:

1. Create scripts to fetch the copyright data from the copyright table of a FOSSology server (localhost).
2. Clean and preprocess the fetched copyright data (using the prewritten processing functions).
   - Preprocessed data should have a label and clean text.
3. Split the data into training, validation, and test sets.
4. Train the false positive model as well as the declutter model (using the prewritten train functions).
5. Evaluate the models (check precision, recall, etc.).
6. Version and release the models.
7. The pipeline should work for both GitLab and GitHub.
   - Manual trigger.
   - It should also be able to run as a cron job.
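
The split and evaluation steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not Safaa's actual code: the function names, the 80/10/10 ratio, and the fixed seed are assumptions chosen for the example, and the project may well rely on its prewritten functions or on scikit-learn instead.

```python
import random


def split_dataset(rows, train=0.8, val=0.1, seed=42):
    """Shuffle rows and cut them into train/validation/test lists.

    The ratios and the seed are illustrative defaults, not project settings.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for a single positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Fixing the seed keeps the split reproducible between pipeline runs, which makes evaluation numbers comparable across model versions.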
42 changes: 42 additions & 0 deletions docs/2025/data-pipeline/updates/2025-05-30.md
@@ -0,0 +1,42 @@
---
title: Community bonding
author: Abdulsobur Oyewale
---
<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <[email protected]>
-->
# **Meeting Summary for GSoC Community Bonding Period**

# Introduction Meeting
*(May 29, 2025)*

This was the inaugural meeting of the community bonding period for GSoC 2025.
* A general introduction of mentors and contributors took place.
* We were given an introduction to the FOSSology community.
* The time and platform for the weekly general meeting were discussed.
* The expectations for the GSoC program were also discussed with us.
* The mentors also emphasized the importance of communication in open source projects.
* At the end, there was a Q&A session to address any questions we had.

# Personal Meeting With The Mentors
*(May 30, 2025)*


* The mentors emphasized the importance of documentation in this project.
* I was encouraged to provide regular updates.
* We discussed the project and its targets and expectations.
* We also discussed timings for the weekly technical calls but did not make a final decision, since one of the mentors was not available on the call.
* I also discussed reviewing last year's work with my mentor.
* We also discussed adding my documentation to the FOSSology GSoC page and submitting a pull request.
* Lastly, I discussed with the mentors how to start my coding period: installing FOSSology locally and trying out different tests to understand how it works.


### Engagements

* Explored the process of installing FOSSology locally.
* Reviewed some crucial pipeline requirements for Safaa's automation efforts.


**This report summarizes my activities and interactions during the GSoC community bonding period.**
31 changes: 31 additions & 0 deletions docs/2025/data-pipeline/updates/2025-06-04.md
@@ -0,0 +1,31 @@
---
title: Week 1
author: Abdulsobur Oyewale
tags: [gsoc25, Safaa Data for Pipeline]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Abdulsobur Oyewale <[email protected]>
-->

# WEEK 1
*(May 30, 2025)*

## Attendees:
- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)
- [Ayush Kumar Bhardwaj](https://github.com/hastagAB)

### Engagements
* I installed FOSSology locally and overcame the obstacle of working on Windows. Since the FOSSology installation guide works best with Linux, I completed the installation using WSL2.
* I also ran various examples on the Safaa agent to test its features and functionality, which gave me insight into how it currently works. You can find this here.

## Discussion:
* I described how I installed FOSSology using the link provided by my mentors and how I familiarized myself with its features.
* Furthermore, I discussed with them the tests I conducted with Safaa's current copyright detection agent, and how I then experimented with the false positive deactivation agent, trying it on examples to assess its features and functionality.
* Lastly, Safaa's performance was critically evaluated, and strategies for acquiring data for my copyright script were discussed with me.


## Subsequent Steps
* I was asked to begin with the first task in the project list: creating a script to get copyright data from a FOSSology instance.
4 changes: 4 additions & 0 deletions docs/2025/data-pipeline/updates/_category_.json
@@ -0,0 +1,4 @@
{
"label": "Weekly Updates",
"position": 2
}
116 changes: 116 additions & 0 deletions docs/2025/minutes/2025-06-05.md
@@ -0,0 +1,116 @@
---
sidebar_position: 1
title: Week 1
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Shaheem Azmal M MD <[email protected]@gmail.com>
SPDX-FileCopyrightText: 2025 Siemens AG
-->

Welcome to the meeting minutes page for GSoC 2025 at FOSSology, 05-06-2025.


## Attendees:

- [Katharina Ettinger](https://github.com/EttingerK)

- [Gaurav Mishra](https://github.com/GMishx)

- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

- [Kaushlendra Pratap](https://github.com/Kaushl2208)

- [Soham Banerjee](https://github.com/soham4abc)

- [Ayush Bhardwaj](https://github.com/hastagAB)

- [Anupam Ghosh](https://github.com/ag4ums)

- [Sahil Jha](https://github.com/sjha2048)

- [Avinal Kumar](https://github.com/avinal)

- [Rajul Jha](https://github.com/rajuljha)

- [Sushant Kumar Mishra](https://github.com/its-sushant)

- [Jan Altenberg](https://github.com/JanAltenberg)

- [Dearsh Oberoi](https://github.com/deo002)

- [Amrit Kumar Verma](https://github.com/amritkv)

- [Muhammad Salman](https://github.com/SalmanDeveloperz)

- [Tiyasa Kundu](https://github.com/tiyasakundu)

- [Prakash-Mishra](https://github.com/Prakash-Mishra-9ghz)

- [Vaibhav Sahu](https://github.com/Vaibhavsahu2810)

- [Harshit Gandhi](https://github.com/harshitg927)

- [Chayan Das](https://github.com/ChayanDass)

- [Ahmed Gamal](https://github.com/Ahmed-Gamal24)

- [Oyewale Abdulsobur](https://github.com/smilingprogrammer)

- [Devanshi Sachan](https://github.com/devxnshi)

## Missed:


## General

- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd): No Update Available.

- [Gaurav Mishra](https://github.com/GMishx): No Update Available.

- [Kaushlendra Pratap](https://github.com/Kaushl2208): No Update Available.

## Updates from contributors

- [Rajul Jha](https://github.com/rajuljha)

- No Update Available.

- [Prakash-Mishra](https://github.com/Prakash-Mishra-9ghz)

- No Update Available.

- [Vaibhav Sahu](https://github.com/Vaibhavsahu2810)

- No Update Available.

- [Harshit Gandhi](https://github.com/harshitg927)

- No Update Available.

- [Chayan Das](https://github.com/ChayanDass)

- No Update Available.

- [Ahmed Gamal](https://github.com/Ahmed-Gamal24)

- No Update Available.

- [Oyewale Abdulsobur](https://github.com/smilingprogrammer)

- No Update Available.

- [Devanshi Sachan](https://github.com/devxnshi)

- No Update Available.

- [Muhammad Salman](https://github.com/SalmanDeveloperz)

- No Update Available.

- [Tiyasa Kundu](https://github.com/tiyasakundu)

- No Update Available.

116 changes: 116 additions & 0 deletions docs/2025/minutes/2025-06-12.md
@@ -0,0 +1,116 @@
---
sidebar_position: 2
title: Week 2
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0

SPDX-FileCopyrightText: 2025 Shaheem Azmal M MD <[email protected]@gmail.com>
SPDX-FileCopyrightText: 2025 Siemens AG
-->

Welcome to the meeting minutes page for GSoC 2025 at FOSSology, 12-06-2025.


## Attendees:

- [Katharina Ettinger](https://github.com/EttingerK)

- [Gaurav Mishra](https://github.com/GMishx)

- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd)

- [Kaushlendra Pratap](https://github.com/Kaushl2208)

- [Soham Banerjee](https://github.com/soham4abc)

- [Ayush Bhardwaj](https://github.com/hastagAB)

- [Anupam Ghosh](https://github.com/ag4ums)

- [Sahil Jha](https://github.com/sjha2048)

- [Avinal Kumar](https://github.com/avinal)

- [Rajul Jha](https://github.com/rajuljha)

- [Sushant Kumar Mishra](https://github.com/its-sushant)

- [Jan Altenberg](https://github.com/JanAltenberg)

- [Dearsh Oberoi](https://github.com/deo002)

- [Amrit Kumar Verma](https://github.com/amritkv)

- [Muhammad Salman](https://github.com/SalmanDeveloperz)

- [Tiyasa Kundu](https://github.com/tiyasakundu)

- [Prakash-Mishra](https://github.com/Prakash-Mishra-9ghz)

- [Vaibhav Sahu](https://github.com/Vaibhavsahu2810)

- [Harshit Gandhi](https://github.com/harshitg927)

- [Chayan Das](https://github.com/ChayanDass)

- [Ahmed Gamal](https://github.com/Ahmed-Gamal24)

- [Oyewale Abdulsobur](https://github.com/smilingprogrammer)

- [Devanshi Sachan](https://github.com/devxnshi)

## Missed:


## General

- [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd): No Update Available.

- [Gaurav Mishra](https://github.com/GMishx): No Update Available.

- [Kaushlendra Pratap](https://github.com/Kaushl2208): No Update Available.

## Updates from contributors

- [Rajul Jha](https://github.com/rajuljha)

- No Update Available.

- [Prakash-Mishra](https://github.com/Prakash-Mishra-9ghz)

- No Update Available.

- [Vaibhav Sahu](https://github.com/Vaibhavsahu2810)

- No Update Available.

- [Harshit Gandhi](https://github.com/harshitg927)

- No Update Available.

- [Chayan Das](https://github.com/ChayanDass)

- No Update Available.

- [Ahmed Gamal](https://github.com/Ahmed-Gamal24)

- No Update Available.

- [Oyewale Abdulsobur](https://github.com/smilingprogrammer)

- No Update Available.

- [Devanshi Sachan](https://github.com/devxnshi)

- No Update Available.

- [Muhammad Salman](https://github.com/SalmanDeveloperz)

- No Update Available.

- [Tiyasa Kundu](https://github.com/tiyasakundu)

- No Update Available.
