This project replicates the original ViT paper, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", using PyTorch and applies it to the FoodVision Mini dataset (pizza, steak, sushi).
- `notebooks/` → step-by-step Jupyter notebook replication
- `src/` → modular PyTorch code for data setup, model, training, and utils
- `experiments/` → results, logs, trained models
-
Data Loaders and Setup
- Start with an input image of shape (224, 224, 3) -> (height, width, colour_channels)
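A minimal sketch of the data setup, assuming the standard `torchvision` `ImageFolder` layout; the paths and batch size are placeholders (the actual code lives in `src/`):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

IMG_SIZE = 224  # matches the (224, 224, 3) input above

manual_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),  # -> (3, 224, 224), values in [0, 1]
])

# Hypothetical folder layout: data/pizza_steak_sushi/{train,test}/{pizza,steak,sushi}/
train_data = datasets.ImageFolder("data/pizza_steak_sushi/train", transform=manual_transforms)
test_data = datasets.ImageFolder("data/pizza_steak_sushi/test", transform=manual_transforms)

train_dataloader = DataLoader(train_data, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=32, shuffle=False)
class_names = train_data.classes  # ['pizza', 'steak', 'sushi']
```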
-
Patch and Positional Embedding
- We want to process smaller "patches" of the image in parallel - we choose patch_size = 16
- Expected output after patch embedding -> 2D tensor of shape (196, 768) = (number_of_patches, embedding_dimension), where number_of_patches = (224/16)^2 = 14^2 and embedding_dimension = (patch_size)^2 * colour_channels = (16)^2 * 3
- Then run the image through a Conv2d layer (kernel_size = patch_size, stride = patch_size), so each patch becomes one position of the resulting "feature map" = the embedding layer
- output of this layer -> (batch_size, embedding_dimension, height_feature_map, width_feature_map) -> (1, 768, 14, 14)
- Need to flatten the spatial dimensions (14x14) into a 1D sequence of flattened 2D patches - flattening the last two dims gives 196 patch vectors (1, 768, 196) -> patch embeddings!
- Transpose the patch embeddings to get sequence-first format -> (1, 196, 768)
- These embeddings are trainable, but we also need a token to base the final "classifier prediction" on, so we prepend a learnable CLS token to the patch embeddings - CLS: (1, 1, 768); patch_embeddings: (1, 196, 768)
- Now the CLS token + patch embeddings look like [CLS_token_embedding, patch_embeddings] -> sequence length = number_of_patches + 1 -> (1, 197, 768)
- Finally, add a learnable positional embedding to each patch (so the model knows where each patch was in the original image): create a tensor of the same shape as the CLS + patch embeddings (1, 197, 768) and add it element-wise (see the sketch below)
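A sketch of the patch + positional embedding described above; class and argument names are illustrative, not necessarily the exact ones used in `src/`:

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Turns a (batch, 3, 224, 224) image into a (batch, 197, 768) sequence."""
    def __init__(self, in_channels=3, patch_size=16, embedding_dim=768, img_size=224):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196

        # Conv2d with kernel_size = stride = patch_size -> one embedding per patch
        self.patcher = nn.Conv2d(in_channels, embedding_dim,
                                 kernel_size=patch_size, stride=patch_size)
        # Flatten the (14, 14) feature map into a sequence of 196 positions
        self.flatten = nn.Flatten(start_dim=2, end_dim=3)

        # Learnable CLS token and positional embedding
        self.class_token = nn.Parameter(torch.randn(1, 1, embedding_dim))
        self.position_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embedding_dim))

    def forward(self, x):
        x = self.patcher(x)                                 # (batch, 768, 14, 14)
        x = self.flatten(x)                                 # (batch, 768, 196)
        x = x.permute(0, 2, 1)                              # (batch, 196, 768)
        cls = self.class_token.expand(x.shape[0], -1, -1)   # (batch, 1, 768)
        x = torch.cat((cls, x), dim=1)                      # (batch, 197, 768)
        return x + self.position_embedding                  # add positional information
```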
-
Multi-Head Self-Attention (MSA)
- The core of transformer architecture - allows each patch to "attend" to other patches
- Takes input of shape (1, 197, 768) and maintains same output shape
- Multiple attention heads (12 heads for ViT-Base) run in parallel, each learning different relationships
- Each head has embedding_dim // num_heads = 768 // 12 = 64 dimensions
- LayerNorm applied BEFORE the MSA block (Pre-Norm) - normalizes across the embedding dimension
- Residual (skip) connections - adds input directly to output of MSA block to help with gradient flow
- Formula:
output = MSA(LayerNorm(input)) + input  (residual connection)
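A sketch of the Pre-Norm MSA block using PyTorch's built-in `nn.MultiheadAttention`; class and argument names are illustrative, and the residual connection is included inside the block to match the formula above:

```python
import torch
from torch import nn

class MultiheadSelfAttentionBlock(nn.Module):
    """Pre-Norm multi-head self-attention: LayerNorm -> MSA -> + input."""
    def __init__(self, embedding_dim=768, num_heads=12, attn_dropout=0.0):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.multihead_attn = nn.MultiheadAttention(embed_dim=embedding_dim,
                                                    num_heads=num_heads,
                                                    dropout=attn_dropout,
                                                    batch_first=True)  # (batch, seq, dim)

    def forward(self, x):                                   # x: (batch, 197, 768)
        x_norm = self.layer_norm(x)
        attn_out, _ = self.multihead_attn(query=x_norm, key=x_norm, value=x_norm,
                                          need_weights=False)
        return attn_out + x                                 # residual (skip) connection
```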
-
MLP Block (Feed-Forward Network)
- Simple 2-layer neural network applied to each patch embedding independently
- Takes (1, 197, 768) -> expands to (1, 197, 3072) -> back to (1, 197, 768)
- Uses GELU activation function (smoother than ReLU)
- Includes dropout for regularization (0.1 in ViT-Base)
- LayerNorm applied BEFORE MLP block (Pre-Norm)
- Residual connection again:
output = MLP(LayerNorm(input)) + input
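A sketch of the MLP block (illustrative names; the residual connection is again included inside the block to match the formula above):

```python
import torch
from torch import nn

class MLPBlock(nn.Module):
    """Pre-Norm feed-forward block: LayerNorm -> Linear -> GELU -> Dropout -> Linear -> Dropout."""
    def __init__(self, embedding_dim=768, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(embedding_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, mlp_size),   # expand: 768 -> 3072
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, embedding_dim),   # project back: 3072 -> 768
            nn.Dropout(dropout),
        )

    def forward(self, x):                         # x: (batch, 197, 768)
        return self.mlp(self.layer_norm(x)) + x   # residual connection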
-
Transformer Encoder Block
- Combines MSA + MLP blocks in sequence with their respective LayerNorms and residual connections
- Complete block:
x -> LayerNorm -> MSA -> +x -> LayerNorm -> MLP -> +x
- This is the fundamental building block - ViT-Base uses 12 of these blocks stacked (see the sketch below)
- Input/Output shape stays (1, 197, 768) throughout all blocks
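A sketch of the encoder block, chaining the MSA and MLP block sketches above (each already adds its own residual connection). PyTorch's `nn.TransformerEncoderLayer` could be used instead; the hand-rolled version just mirrors the steps listed here:

```python
from torch import nn

class TransformerEncoderBlock(nn.Module):
    """One ViT encoder block: Pre-Norm MSA block followed by Pre-Norm MLP block."""
    def __init__(self, embedding_dim=768, num_heads=12, mlp_size=3072, dropout=0.1):
        super().__init__()
        self.msa_block = MultiheadSelfAttentionBlock(embedding_dim, num_heads)
        self.mlp_block = MLPBlock(embedding_dim, mlp_size, dropout)

    def forward(self, x):            # (batch, 197, 768) in and out
        x = self.msa_block(x)        # LayerNorm -> MSA -> + x
        x = self.mlp_block(x)        # LayerNorm -> MLP -> + x
        return x
```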
-
Full ViT Architecture
- Patch Embedding (with class token + positional embedding) -> (1, 197, 768)
- Stack of Transformer Encoder Blocks (12 blocks for ViT-Base) -> (1, 197, 768)
- Classification Head - takes only the CLS token (first token) -> (1, 768) -> Linear layer -> (1, num_classes)
- The CLS token accumulates global information from all patches through self-attention
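Putting it all together, a sketch of the full model built from the components sketched above (class name and defaults are illustrative):

```python
import torch
from torch import nn

class ViT(nn.Module):
    """Patch embedding -> 12 stacked Transformer encoder blocks -> classification head on the CLS token."""
    def __init__(self, img_size=224, patch_size=16, embedding_dim=768,
                 num_heads=12, mlp_size=3072, num_layers=12, num_classes=3):
        super().__init__()
        self.patch_embedding = PatchEmbedding(in_channels=3, patch_size=patch_size,
                                              embedding_dim=embedding_dim, img_size=img_size)
        self.encoder = nn.Sequential(*[
            TransformerEncoderBlock(embedding_dim, num_heads, mlp_size)
            for _ in range(num_layers)
        ])
        self.classifier = nn.Sequential(
            nn.LayerNorm(embedding_dim),
            nn.Linear(embedding_dim, num_classes),
        )

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.patch_embedding(x)          # (batch, 197, 768)
        x = self.encoder(x)                  # (batch, 197, 768)
        return self.classifier(x[:, 0])      # CLS token only -> (batch, num_classes)
```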
-
Training Process
- Custom ViT: Train from scratch with Adam optimizer (lr=3e-3, weight_decay=0.3)
- Loss Function: CrossEntropyLoss for multi-class classification (pizza/steak/sushi)
- Device-agnostic: Automatically detects CUDA/MPS/CPU
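A sketch of this training setup (the full training loop lives in `src/`; `ViT` and `train_dataloader` refer to the earlier sketches):

```python
import torch
from torch import nn

# Device-agnostic setup: CUDA -> MPS -> CPU
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")

model = ViT(num_classes=3).to(device)  # pizza / steak / sushi

# Optimizer and loss as listed above
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3, weight_decay=0.3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step
model.train()
for X, y in train_dataloader:
    X, y = X.to(device), y.to(device)
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    break  # single step shown; the real loop runs over all batches and epochs
```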
-
Pretrained ViT Transfer Learning
- Use `torchvision.models.vit_b_16` with pretrained ImageNet weights (see the sketch after this list)
- Freeze base parameters - only train the classification head
- Replace classifier head: Linear(768, num_classes=3) for our food classes
- Pretrained transforms: Use the same transforms the model was trained with
- Fine-tuning: Much faster training (10 epochs) with lower learning rate (1e-3)
- Transfer learning advantage: Leverages learned visual features from ImageNet
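A sketch of the transfer-learning setup, assuming a torchvision version with the multi-weight API (0.13+); variable names are illustrative:

```python
import torch
from torch import nn
import torchvision

# Load ViT-B/16 with ImageNet weights and the transforms it was trained with
weights = torchvision.models.ViT_B_16_Weights.DEFAULT
pretrained_vit = torchvision.models.vit_b_16(weights=weights)
pretrained_transforms = weights.transforms()

# Freeze the base parameters
for param in pretrained_vit.parameters():
    param.requires_grad = False

# Replace the classification head for our 3 food classes (stays trainable)
pretrained_vit.heads = nn.Linear(in_features=768, out_features=3)

# Lower learning rate for fine-tuning the head
optimizer = torch.optim.Adam(pretrained_vit.heads.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```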
-
ViT-Base Architecture Configuration
- Image Size: 224×224
- Patch Size: 16×16 (196 patches total)
- Embedding Dimension: 768
- Number of Heads: 12
- MLP Size: 3072 (4x embedding dim)
- Number of Layers: 12
- Dropout: 0.1
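The configuration above, collected into a small sketch (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ViTBaseConfig:
    img_size: int = 224
    patch_size: int = 16      # (224 // 16) ** 2 = 196 patches
    embedding_dim: int = 768
    num_heads: int = 12
    mlp_size: int = 3072      # 4 * embedding_dim
    num_layers: int = 12
    dropout: float = 0.1
    num_classes: int = 3      # pizza / steak / sushi
```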