This project focuses on object detection using the Berkeley DeepDrive (BDD100K) dataset, featuring analysis of class distributions, occlusion patterns, and model development.
```bash
# Clone the repository
git clone https://github.com/basaanithanaveenkumar/object-detection-BBD.git
cd object-detection-BBD

# Build the Docker image
sudo docker build -t object-detection .

# Run the Docker container
sudo docker run -it --rm --gpus all object-detection /bin/bash
```
```bash
# Download and extract the dataset
mkdir -p data
python scripts/download_dataset.py
mv data/100k/val data/100k/valid
```

Install the required dependencies:

```bash
pip install -r requirements.txt
```

To get class-wise statistics, run the following script:

```bash
python scripts/get_stats_classwise.py
```

To get scene-wise statistics, run the following script:

```bash
python scripts/get_stats_scenewise.py
```

To get the class distribution, run the following script:

```bash
python scripts/get_stats_dist.py
```

| Class | Count |
|---|---|
| Car | 714,121 |
| Traffic Sign | 239,961 |
| Traffic Light | 186,301 |
| Person | 91,435 |
| Truck | 30,012 |
| Bus | 11,688 |
| Bike | 7,227 |
| Rider | 4,522 |
| Motor | 3,002 |
| Train | 136 |
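The counts above can be reproduced from the BDD100K detection labels. As a minimal sketch (assuming the BDD100K detection format: a JSON list of images, each with a `labels` list whose entries carry a `category` field), counting classes reduces to a single `Counter` pass:

```python
import json
from collections import Counter

def class_counts(label_file):
    """Count object categories in a BDD100K-style detection label file.

    Assumes a JSON list of images, each with a "labels" list whose
    entries have a "category" field.
    """
    with open(label_file) as f:
        images = json.load(f)
    counts = Counter()
    for image in images:
        for label in image.get("labels") or []:
            counts[label["category"]] += 1
    return counts

# Tiny in-memory sample instead of the real label file:
sample = [
    {"name": "a.jpg", "labels": [{"category": "car"}, {"category": "person"}]},
    {"name": "b.jpg", "labels": [{"category": "car"}]},
]
counts = Counter(l["category"] for img in sample for l in img["labels"])
print(counts.most_common())  # [('car', 2), ('person', 1)]
```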
In addition to the object classes above, the table below includes lane and drivable-area annotations:

| Class | Count |
|---|---|
| Car | 714,121 |
| Lane/Single White | 247,108 |
| Traffic Sign | 239,961 |
| Traffic Light | 186,301 |
| Lane/Road Curb | 109,868 |
| Lane/Crosswalk | 108,284 |
| Person | 91,435 |
| Area/Drivable | 64,050 |
| Area/Alternative | 61,799 |
| Lane/Double Yellow | 37,519 |
| Truck | 30,012 |
| Lane/Single Yellow | 20,220 |
| Bus | 11,688 |
| Bike | 7,227 |
| Lane/Double White | 5,674 |
| Rider | 4,522 |
| Motor | 3,002 |
| Lane/Single Other | 249 |
| Train | 136 |
An analysis was performed on all classes with respect to the occluded attribute. Results show that nearly half of the objects are occluded.
| Occlusion Status | Count |
|---|---|
| False | 678,981 |
| True | 609,424 |
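A sketch of how such an occlusion tally can be computed, assuming BDD100K-style records where each label may carry an `attributes` dict with a boolean `occluded` flag:

```python
from collections import Counter

def occlusion_counts(images):
    """Tally the boolean "occluded" attribute across all labels.

    Assumes BDD100K-style records where each label may have an
    "attributes" dict containing an "occluded" flag.
    """
    counts = Counter()
    for image in images:
        for label in image.get("labels") or []:
            occluded = (label.get("attributes") or {}).get("occluded", False)
            counts[bool(occluded)] += 1
    return counts

# Tiny illustrative sample:
sample = [
    {"labels": [{"attributes": {"occluded": True}},
                {"attributes": {"occluded": False}}]},
    {"labels": [{"attributes": {"occluded": True}}]},
]
print(occlusion_counts(sample))  # Counter({True: 2, False: 1})
```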
To get bounding-box (BBOX) distribution statistics, run the following script:

```bash
python scripts/get_additional_analysis.py
```

| Size Category | Percentage | Interpretation |
|---|---|---|
| Small (<2% area) | 92.2% | Dominates dataset - consider higher resolution or small object detection techniques |
| Medium (2-10%) | 6.3% | Underrepresented - may need augmentation |
| Large (>10%) | 1.5% | Very rare - ensure model can handle context |
Recommendations:
- Implement multi-scale training
- Use Feature Pyramid Networks (FPN)
| Aspect Ratio | Percentage | Interpretation |
|---|---|---|
| Tall (<0.5) | 12.3% | Vertical objects present - adjust anchor boxes |
| **Square & rectangle** (0.5-2) | 76.9% | Majority class - standard anchors should work well |
| Wide (≥2) | 10.8% | Horizontal objects - may need custom anchors |
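The bucketing behind both tables can be sketched as a small helper. The thresholds mirror the tables above (relative area <2% / 2-10% / >10%, aspect ratio w/h <0.5 / 0.5-2 / ≥2); the function name is illustrative, not the project's actual implementation:

```python
def bucket_box(w, h, img_w, img_h):
    """Bucket a bounding box by relative area and aspect ratio.

    Thresholds match the tables: area <2% / 2-10% / >10% of the image,
    aspect ratio (w/h) <0.5 (tall) / 0.5-2 (square/rect) / >=2 (wide).
    """
    area_frac = (w * h) / (img_w * img_h)
    if area_frac < 0.02:
        size = "small"
    elif area_frac <= 0.10:
        size = "medium"
    else:
        size = "large"
    ratio = w / h
    if ratio < 0.5:
        shape = "tall"
    elif ratio < 2:
        shape = "square/rect"
    else:
        shape = "wide"
    return size, shape

# A 100x50 box in a 1280x720 image covers ~0.54% of the frame:
print(bucket_box(100, 50, 1280, 720))  # ('small', 'wide')
```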
To train the model with RF-DETR, the annotations first need to be converted to COCO format:

```bash
python scripts/convert_to_coco.py
```

To visualize the bounding boxes, run the following script:

```bash
# please modify this script according to your needs
python scripts/visualize_image.py
```

To train the model using RF-DETR, run the following script:

```bash
python scripts/train_rfdetr.py
```
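For reference, this is the shape of the COCO structure such a conversion targets. Field names follow the COCO detection spec; the file name and category list here are illustrative only, not taken from the project's converter:

```python
import json

# Minimal COCO-format skeleton produced by a BDD -> COCO conversion.
# Boxes are [x, y, width, height] in absolute pixels per the COCO spec.
coco = {
    "images": [
        {"id": 1, "file_name": "example.jpg", "width": 1280, "height": 720},
    ],
    "annotations": [
        # "area" and a unique "id" are required by most COCO loaders
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 200.0, 50.0, 40.0], "area": 2000.0, "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "car"},
        {"id": 2, "name": "person"},
    ],
}

# Serializes cleanly, so it can be written straight to an annotations file:
print(json.dumps(coco)[:60])
```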
Description: Custom ViT architecture with hierarchical features through global and windowed attention

Description: Architecture of the DINO self-supervised learning framework, which uses knowledge distillation with a teacher-student network to learn robust visual representations without labeled data.

Description: Comparison between standard convolution (fixed grid) and deformable convolution (adaptive sampling locations). Enhances CNNs for irregular object shapes.

Description: Visualization of deformable convolution offsets dynamically adjusting to object geometry.

Description: Combines deformable convolutions with Transformer attention for efficient object detection, reducing training complexity of vanilla DETR.

Description: Improves bounding box regression by accounting for both overlap and enclosure, addressing limitations of standard IoU in non-overlapping cases.
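A minimal sketch of the GIoU computation for axis-aligned boxes, following the standard formula GIoU = IoU - |C \ (A ∪ B)| / |C|, where C is the smallest box enclosing both:

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes given as (x1, y1, x2, y2).

    GIoU = IoU - |C \\ (A u B)| / |C|, where C is the smallest enclosing
    box; unlike IoU, it stays informative when the boxes don't overlap.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    iou = inter / union
    # Smallest enclosing box C
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))  # negative for disjoint boxes
```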
Below are the key evaluation metrics and visualizations from the RF-DETR model:

Description: The confusion matrix shows the model's classification performance across different classes. High diagonal values indicate correct predictions, while off-diagonal values represent misclassifications.

Description: Side-by-side overlay of predicted bounding boxes (red) and ground truth (green).

Description: Mean Average Precision (mAP) across IoU thresholds. Higher mAP (close to 1.0) indicates better detection accuracy.
Model training: directions for improving results

- Sampling techniques to balance the dataset, since the class distribution is heavily imbalanced
- Tweaking the focal loss
- Data augmentation
- Hyperparameter tuning
- Changing the backbone
- Bipartite-matching-based architectures
Evaluation metrics:

- Mean Average Precision (mAP) - standard for COCO/PASCAL VOC
- Precision-Recall curves - trade-off analysis
- F1 score - balanced precision/recall
- Inference speed (FPS) - critical for real-time applications (self-driving)
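As a sketch of how AP is derived from a precision-recall curve, here is an all-point-interpolation computation in pure Python (the COCO evaluator additionally averages over IoU thresholds and classes; this helper only shows the core area-under-curve step):

```python
def average_precision(recalls, precisions):
    """Area under an interpolated precision-recall curve.

    Inputs are parallel lists sorted by increasing recall. Precision is
    made monotonically non-increasing (all-point interpolation) before
    summing rectangles where recall increases.
    """
    # Pad the curve so it spans recall 0..1
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Interpolate: precision at recall r is the max precision at any r' >= r
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over recall increments
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# Three operating points of a hypothetical detector:
ap = average_precision([0.2, 0.4, 0.8], [1.0, 0.8, 0.5])
print(round(ap, 3))  # 0.56
```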






