This repository contains a complete YOLOv1 (You Only Look Once) object detection model implemented from scratch using PyTorch. It follows the original YOLOv1 paper and is trained on a subset of the PASCAL VOC 2005 dataset.
Object detection is a cornerstone of computer vision, and YOLO stands out for its real-time performance. In this project, I implement YOLOv1 from scratch in PyTorch to understand its core mechanics and architecture. The model uses ResNet50 as a backbone with a custom YOLO loss function, and is trained and evaluated on the PASCAL VOC 2005 dataset.
Install dependencies with:
pip install -r requirements.txt
- PASCAL VOC 2005 Dataset: Download here
- Supported Classes: bicycle, car, motorbike, person
- ResNet50 pretrained on ImageNet is used as the feature extractor.
- Fully connected layers are removed.
- Parameters are frozen during training to leverage transfer learning.
- Custom convolutional layers are added on top of the ResNet backbone.
- Final output is reshaped to predict, in a format compatible with YOLOv1:
  - Bounding box coordinates
  - Confidence score
  - Class probabilities per grid cell
- Images are resized to 224×224.
- Annotations are parsed from `.txt` files using a custom `preprocess_txt` function.
- Output labels are encoded into YOLOv1's S×S grid format using a `generate_output` function.
- Normalization and augmentation are applied using `torchvision.transforms`.
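To make the S×S grid encoding concrete, here is a small sketch of how one annotation might be turned into a target grid. It is not the repository's `generate_output` implementation: the function name `encode_labels`, the input format `(class_id, cx, cy, w, h)` with coordinates normalized to [0, 1], and the per-cell layout are all assumptions for illustration.

```python
S = 7            # assumed grid size (YOLOv1 paper default)
NUM_CLASSES = 4  # bicycle, car, motorbike, person


def encode_labels(boxes, s=S, num_classes=NUM_CLASSES):
    """Encode boxes into an s x s grid (hypothetical helper).

    boxes: list of (class_id, cx, cy, w, h), coordinates normalized to [0, 1].
    Each cell holds [x_offset, y_offset, w, h, confidence, one-hot classes...].
    """
    grid = [[[0.0] * (5 + num_classes) for _ in range(s)] for _ in range(s)]
    for class_id, cx, cy, w, h in boxes:
        # The cell containing the box center is responsible for the object.
        col = min(int(cx * s), s - 1)
        row = min(int(cy * s), s - 1)
        cell = grid[row][col]
        if cell[4] == 1.0:
            continue  # one object per cell in this sketch; the first one wins
        cell[0] = cx * s - col   # center offset within the cell, x
        cell[1] = cy * s - row   # center offset within the cell, y
        cell[2] = w              # width relative to the image
        cell[3] = h              # height relative to the image
        cell[4] = 1.0            # an object is present
        cell[5 + class_id] = 1.0
    return grid


grid = encode_labels([(1, 0.5, 0.5, 0.2, 0.4)])
print(grid[3][3][:5])  # [0.5, 0.5, 0.2, 0.4, 1.0]
```

A box centered in the image lands in the middle cell (row 3, column 3 of a 7×7 grid), with its center stored as an offset inside that cell.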
A custom loss function is implemented following the YOLOv1 paper, combining four components:
- Localization Loss: Penalizes bounding box prediction errors.
- Confidence Loss (Object): Penalizes confidence score errors when an object is present.
- Confidence Loss (No Object): Penalizes false positives.
- Classification Loss: Ensures correct class prediction.
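For reference, the loss as written in the YOLOv1 paper is shown below. The first two lines together form the localization term, followed by the object confidence, no-object confidence, and classification terms. The paper sets $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$; whether this implementation uses the same weights is an assumption.

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

Here $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when box predictor $j$ in cell $i$ is responsible for an object, and $\mathbb{1}_{i}^{\text{obj}}$ is 1 when an object's center falls in cell $i$. The square roots on width and height reduce the weight of size errors on large boxes relative to small ones.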
- Backbone: ResNet50 (pretrained)
- Epochs: 50
- Batch Size: 32
- Optimizer: Adam
- Final Loss: 6.9583
- Average Inference Time: < 300 ms/image
- Hardware Used: NVIDIA RTX 4060 GPU
- Overall IoU: 58%
- Best performance: Motorbike, Bicycle
- Worst performance: Car (due to limited samples in dataset)
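The IoU figure above refers to intersection over union between predicted and ground-truth boxes. How the repository aggregates it into an overall score is not stated here, so the following is only a sketch of the per-box computation, assuming boxes in `(x_min, y_min, x_max, y_max)` corner format; the function name `iou` is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, giving an IoU of 1/7.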





