This project replicates results from the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2021). The goal is to validate the performance of the Vision Transformer (ViT) on image classification using CIFAR-10.
Architecture Details - ViT-Base
- Layers: 12
- Hidden size: 768
- MLP size: 3072
- Heads: 12
- Params: 86M (sanity-checked in the sketch below)
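As a quick sanity check, the ~86M figure can be roughly reproduced from the numbers above. The sketch below counts only the dominant weight matrices (patch projection, attention, and MLP blocks) and ignores the class token, position embeddings, and classifier head, which together add well under 1M parameters:

```python
# Approximate ViT-Base parameter count from the table above.
hidden, mlp, layers = 768, 3072, 12
patch, channels = 16, 3

patch_embed = patch * patch * channels * hidden + hidden   # patch projection
attn = 4 * hidden * hidden + 4 * hidden                    # qkv + output projection
mlp_block = 2 * hidden * mlp + hidden + mlp                # two dense layers
norms = 2 * 2 * hidden                                     # two LayerNorms per block
per_block = attn + mlp_block + norms

total = patch_embed + layers * per_block
print(f"{total / 1e6:.1f}M")  # ~85.6M, consistent with the reported 86M
```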
Pre-training Details
- Train Dataset: JFT-300M
- Optimizer: Adam
- Beta1: 0.9
- Beta2: 0.999
- Weight decay: 0.1
- LR Scheduler: Linear warmup and decay (see the sketch after this list)
- Batch size: 4096
- Dropout: Yes (per the paper, applied after every dense layer except the qkv projections, and after adding position embeddings)
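The optimizer and schedule above translate to PyTorch roughly as follows. This is a hedged sketch: the peak learning rate and step counts are illustrative placeholders rather than values from the paper, `model` stands in for the actual network, and note that PyTorch's `Adam` applies classic L2 regularization rather than decoupled weight decay:

```python
import torch

model = torch.nn.Linear(768, 10)  # placeholder for the ViT

# Illustrative values, not taken from the paper.
peak_lr, warmup_steps, total_steps = 3e-3, 10_000, 100_000

optimizer = torch.optim.Adam(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.999), weight_decay=0.1
)

def linear_warmup_decay(step: int) -> float:
    # Linear ramp up to the peak LR, then linear decay to zero.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, linear_warmup_decay)
```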
Fine-tuning Details (Higher Resolution)
- Fine-tune Dataset: CIFAR-10
- Optimizer: SGD + Momentum
- Batch size: 512
- Callbacks: Early stopping
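Fine-tuning at a higher resolution than pre-training changes the number of patches, so the pre-trained position embeddings must be resized; the paper does this by 2D interpolation. Below is a minimal sketch under assumed names (a `pos_embed` of shape `(1, 1 + old_grid², dim)` with a leading class token):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """2D-interpolate patch position embeddings to a new grid size."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]  # split off class token
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, grid, grid) so F.interpolate treats it spatially
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. 224px -> 384px with 16x16 patches: 14x14 grid -> 24x24 grid
new_embed = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), 14, 24)
```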
Accuracy on CIFAR-10 reported in the paper: 99.50 ± 0.06% (ViT-H/14 pre-trained on JFT-300M)
Proposed Benefits
- Matches or exceeds state-of-the-art CNNs while requiring substantially less compute to pre-train
- Hybrids (CNN feature maps fed into a Transformer) outperform pure ViT at smaller model sizes, but the advantage vanishes at larger scales
My Implementation Details
- Framework: PyTorch
- Optimizer: AdamW
- Learning Rate: 3e-4
- Weight Decay: 1e-3
- Batch Size: 1024
- Gradient Norm Clipping: 10.0
- Scheduler: Cosine Annealing LR
- Epochs: 142
- Number of Parameters: 8.45M
- Callbacks: Early stopping, MLflow experiment tracking
- Validation Accuracy on CIFAR-10: 70% (trained from scratch, no pretraining)

Training was performed on a CUDA-enabled GPU. The model architecture and hyperparameters were chosen based on the original paper and adjusted to match its training setup as closely as possible at this smaller (~8.45M-parameter) scale. The training script used gradient norm clipping and a cosine learning-rate schedule to stabilize training; a minimal sketch of this setup follows.
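For concreteness, the loop below wires the listed hyperparameters together. It is a minimal sketch rather than the full training script: the model is a placeholder, the data loader is a single random batch standing in for the real CIFAR-10 loader, and early stopping and MLflow logging are omitted:

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder for the ~8.45M-param ViT
epochs = 142

# Stand-in for a CIFAR-10 DataLoader with batch size 1024.
train_loader = [(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))]

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images.flatten(1)), labels)
        loss.backward()
        # Clip the global gradient norm at 10.0, as listed above.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
        optimizer.step()
    scheduler.step()  # one cosine-annealing step per epoch
```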