This repository contains scripts and tools to collect, preprocess, and prepare Go code/test file pairs for fine-tuning the Phi-3 language model for unit test generation. Phi-3 mini was selected due to its small size and strong performance on code-generation tasks.
This project was implemented from scratch using bash, the GitHub API, and the Unsloth library as part of an MSc project. Initial training was performed in the Kaggle environment. The goal of the project was to evaluate the feasibility of generating unit tests with a small fine-tuned model and to compare the results with state-of-the-art models.
| File | Description |
|---|---|
| fetch-repositories-urls.py | Python script to fetch repository URLs from the GitHub API using batching. |
| clone-repositories-optimised.sh | Efficiently clones the fetched repositories using shallow clones and other optimisations. |
| collect-files-in-folder.sh | Filters and copies Go files into a single folder for further processing. |
| preprocess-files.sh | Cleans and filters the collected files to prepare them for dataset generation. |
| generate-training-data.sh | Converts preprocessed files into structured input/output pairs suitable for model fine-tuning. |
| fine-tuning-and-inference-project-go-phi.ipynb | Notebook that fine-tunes the Phi-3 model on the generated dataset and runs inference on the trained model. |
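The collection step groups each Go source file with its corresponding test file. The Go toolchain's convention is that tests for foo.go live in foo_test.go, which makes the pairing logic simple. A minimal sketch of that logic (the function name and the convention-based matching are illustrative assumptions, not the scripts' exact implementation):

```python
from pathlib import Path


def pair_go_files(folder: str) -> list[tuple[str, str]]:
    """Pair each Go source file with its *_test.go counterpart.

    Relies on the Go convention that tests for foo.go live in
    foo_test.go; files without a matching test are skipped.
    """
    names = {p.name for p in Path(folder).glob("*.go")}
    pairs = []
    for name in sorted(names):
        if name.endswith("_test.go"):
            continue  # test files are outputs of the pairing, not inputs
        test_name = name[:-3] + "_test.go"  # foo.go -> foo_test.go
        if test_name in names:
            pairs.append((name, test_name))
    return pairs
```

Files without a matching test are dropped, since the dataset needs complete input/output pairs.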
- Repository Fetching: Automatically gather Go repository URLs according to specified criteria.
- Preprocessing Pipeline: Clean, filter, and normalize code and corresponding tests.
- Training Dataset Generation: Generate a dataset that contains Go code and test pairs.
- Training Pipeline: Jupyter notebook that implements training and evaluation using the Unsloth library.
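The dataset-generation step turns each code/test pair into a structured input/output record for fine-tuning. A hedged sketch of what one JSONL training record could look like (the instruction wording and field names are illustrative assumptions, not the exact format produced by generate-training-data.sh):

```python
import json


def make_training_example(code: str, test: str) -> str:
    """Serialise one Go code/test pair as a single JSON line.

    The instruction text and field names below are illustrative;
    instruction-tuned models commonly expect some variant of this
    instruction/input/output layout.
    """
    record = {
        "instruction": "Write Go unit tests for the following code.",
        "input": code,
        "output": test,
    }
    return json.dumps(record)
```

Emitting one JSON object per line (JSONL) keeps the dataset streamable, which suits memory-constrained training environments such as Kaggle.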
- Git
- Bash
- A GitHub API key (personal access token)
- An environment to run the notebook (e.g. Kaggle)
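The GitHub API key is used to authenticate batched requests against GitHub's REST search API, which caps results at 100 per page, so large result sets are fetched page by page. A minimal sketch of how one paginated request could be assembled (the endpoint and parameters follow the public GitHub REST API; the helper function and the search query itself are illustrative):

```python
from urllib.parse import urlencode

# Public GitHub REST API endpoint for repository search.
API = "https://api.github.com/search/repositories"


def build_search_request(token: str, page: int, per_page: int = 100):
    """Build the URL and headers for one page of a Go-repository search.

    GitHub caps per_page at 100, so batching means incrementing
    `page` until the result set is exhausted. The query string
    below is an example filter, not the project's exact criteria.
    """
    params = {
        "q": "language:Go stars:>100",  # example search criteria
        "page": page,
        "per_page": per_page,
    }
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }
    return API + "?" + urlencode(params), headers
```

Each returned URL can then be fetched with any HTTP client, and the next batch is requested by incrementing the page number.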
This project is licensed under the MIT license.
- Upload fine-tuned model to Hugging Face
- Upload generated dataset to Hugging Face
- Add instructions on how to run the code