Paper Replication: Vision Transformer (ViT)

Atyab Hakeem

This project replicates the results of the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The goal is to validate the performance of the Vision Transformer (ViT) on image classification using CIFAR-10.

Background

Architecture Details - ViT-Base

  • Layers: 12
  • Hidden size: 768
  • MLP size: 3072
  • Heads: 12
  • Params: 86M
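For concreteness, here is a minimal sketch of that configuration in plain PyTorch. The class name, its default arguments (224x224 input, dropout 0.1), and the use of nn.TransformerEncoder are illustrative choices of mine, not this repo's actual model code; only the sizes above and the 16x16 patching come from the paper.

    import torch
    import torch.nn as nn

    class ViTBase(nn.Module):
        """Illustrative ViT-Base: 12 layers, hidden 768, MLP 3072, 12 heads (~86M params)."""
        def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                     hidden=768, mlp=3072, heads=12, layers=12, dropout=0.1):
            super().__init__()
            num_patches = (image_size // patch_size) ** 2
            # Patch embedding: a strided conv splits the image into 16x16 patches.
            self.patch_embed = nn.Conv2d(3, hidden, kernel_size=patch_size, stride=patch_size)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden))
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=hidden, nhead=heads, dim_feedforward=mlp,
                dropout=dropout, activation="gelu", batch_first=True, norm_first=True)
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, x):
            x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, hidden)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
            x = self.encoder(x)
            return self.head(x[:, 0])                            # classify from the [CLS] token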

Training Details

  • Train Dataset: JFT-300M
  • Optimizer: Adam
    • Beta1: 0.9
    • Beta2: 0.999
    • Weight decay: 0.1
  • LR Scheduler: Linear warmup and decay
  • Batch size: 4096
  • Dropout: Yes
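A hedged sketch of this pre-training optimizer setup in PyTorch is shown below. The learning rate and step counts are placeholders, since the paper varies them per model and dataset, and `ViTBase` refers to the illustrative sketch above.

    import torch

    model = ViTBase()  # the illustrative ViT-Base sketch from the Background section

    # Adam with the paper's betas and high weight decay. Note that PyTorch's Adam
    # applies weight decay as an L2 term rather than decoupled as in AdamW.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,  # lr is illustrative
                                 betas=(0.9, 0.999), weight_decay=0.1)

    # Linear warmup followed by linear decay to zero; the step counts below
    # are placeholders, not the paper's schedule.
    warmup_steps, total_steps = 10_000, 100_000

    def linear_warmup_decay(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup_decay)

    # Each optimizer step over a batch of 4096 images would then be followed by:
    #   optimizer.step(); scheduler.step()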

Fine-tuning Details (Higher Resolution)

  • Fine-tune Dataset: CIFAR-10
  • Optimizer: SGD + Momentum
  • Batch size: 512
  • Callbacks: Early stopping
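A minimal sketch of this fine-tuning setup follows, assuming torchvision's CIFAR-10 loader and an illustrative learning rate (the paper sweeps it per dataset). The paper fine-tunes at a higher resolution than pre-training and 2D-interpolates the pretrained position embeddings to the new patch grid; that interpolation is omitted here.

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Upscale CIFAR-10 for higher-resolution fine-tuning (384x384 is a
    # fine-tuning resolution used in the paper).
    tfm = transforms.Compose([
        transforms.Resize((384, 384)),
        transforms.ToTensor(),
    ])
    train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=tfm)
    loader = DataLoader(train_set, batch_size=512, shuffle=True)

    model = ViTBase(image_size=384, num_classes=10)  # reusing the earlier sketch

    # SGD with momentum; the lr and momentum values are placeholders.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)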

Results

Accuracy on CIFAR-10: 99.50 ± 0.06% (Pretrained on JFT-300M)

Proposed Benefits

  • Matches or outperforms state-of-the-art CNNs while requiring substantially less compute to pre-train
  • Hybrids (CNN feature maps fed into a Transformer) outperform plain ViT at smaller model sizes, but the advantage vanishes for larger ones

My Replication

These are the details of my implementation:

  • Framework: PyTorch
  • Optimizer: AdamW
  • Learning Rate: 3e-4
  • Weight Decay: 1e-3
  • Batch Size: 1024
  • Gradient Norm Clipping: 10.0
  • Scheduler: Cosine Annealing LR
  • Epochs: 142
  • Number of Parameters: 8.45M
  • Callbacks: Early stopping, MLflow experiment tracking

Training was performed on a CUDA-enabled GPU. The model architecture and hyperparameters were chosen based on the original paper and adjusted to match its training setup as closely as possible; gradient clipping and a learning rate scheduler were used to stabilize training. A sketch of this setup is shown below.
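Under the settings listed above, the core of the training loop might look like the following; the loss function and loop structure are my assumptions, and `model` and `loader` stand in for the earlier sketches.

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-3)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=142)
    criterion = nn.CrossEntropyLoss()  # loss function is an assumption

    for epoch in range(142):
        for images, labels in loader:  # batch_size=1024 in the replication
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            # Gradient norm clipping at 10.0, as listed above.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
            optimizer.step()
        scheduler.step()
        # Early stopping and MLflow logging would hook in here (omitted).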

Results

  • Validation Accuracy on CIFAR-10: 70% (No pretraining)

Socials

Atyab Hakeem

GitHub
