A complete PyTorch implementation of the Transformer architecture from the seminal paper "Attention Is All You Need" by Vaswani et al. This implementation is designed for English-Italian translation tasks and includes comprehensive training and inference pipelines.
This implementation includes all core components of the Transformer architecture:
- Input Embeddings: Token embeddings with scaling by √d_model
- Positional Encoding: Sinusoidal position embeddings
- Multi-Head Attention: Scaled dot-product attention with multiple heads
- Feed-Forward Networks: Position-wise fully connected layers
- Layer Normalization: Applied before each sub-layer (pre-norm)
- Residual Connections: Skip connections around each sub-layer
- Encoder-Decoder Architecture: Complete sequence-to-sequence model
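As a quick illustration of the first two components, here is a minimal, self-contained sketch of the scaled token embedding and the sinusoidal positional encoding. It mirrors the formulas from the paper; class names and defaults are illustrative rather than copied from transformer/model.py.

```python
import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Token embeddings scaled by sqrt(d_model), as in the paper."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # (batch, seq_len) -> (batch, seq_len, d_model)
        return self.embedding(x) * math.sqrt(self.d_model)

class PositionalEncoding(nn.Module):
    """Fixed sinusoidal position embeddings added to the token embeddings."""
    def __init__(self, d_model: int, seq_len: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)      # (seq_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))                           # (1, seq_len, d_model)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1), :]
        return self.dropout(x)
```

The repository is organized as follows: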
├── config/
│ └── config.py # Model configuration and hyperparameters
├── data/
│ └── dataset.py # Bilingual dataset class with tokenization
├── transformer/
│ └── model.py # Complete Transformer implementation
├── scripts/
│ ├── train.py # Training script with WandB integration
│ └── translate.py # Translation inference utilities
├── notebooks/
│ ├── transformer_train.ipynb # Interactive training notebook
│ ├── transformer_inference.ipynb # Inference and testing notebook
│ └── attention_visualization.ipynb # Attention pattern visualization
├── requirements.txt # Project dependencies
├── README.md # Project documentation
└── .gitignore # Git ignore patterns
- Clone the repository:
  git clone https://github.com/Showmick119/Implementing-Attention-Is-All-You-Need.git
  cd Implementing-Attention-Is-All-You-Need
- Install dependencies:
  pip install -r requirements.txt
- Open notebooks/transformer_train.ipynb in Google Colab
- Follow the step-by-step training process
- Monitor training progress with built-in visualizations
Alternatively, run training from the command line:
python scripts/train.py

Use the notebooks/transformer_inference.ipynb notebook to:
- Load trained models
- Perform translation inference on custom inputs
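Inference reduces to encoding the source once and then decoding one token at a time. The sketch below shows greedy decoding; the encode/decode/project method names and the tokenizers-style token_to_id call are assumptions about the model and tokenizer interfaces, not guaranteed to match the notebook exactly.

```python
import torch

def greedy_decode(model, source, source_mask, tokenizer_tgt, max_len, device):
    """Greedily translate one source sentence.

    Hypothetical sketch: assumes the model exposes encode/decode/project
    methods and that the target tokenizer defines [SOS]/[EOS] tokens.
    """
    sos_id = tokenizer_tgt.token_to_id("[SOS]")
    eos_id = tokenizer_tgt.token_to_id("[EOS]")

    encoder_output = model.encode(source, source_mask)
    decoder_input = torch.tensor([[sos_id]], dtype=torch.long, device=device)

    while decoder_input.size(1) < max_len:
        # Causal mask so each position attends only to earlier positions.
        size = decoder_input.size(1)
        decoder_mask = torch.tril(torch.ones(1, size, size, device=device)).int()
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)
        logits = model.project(out[:, -1])                 # next-token logits
        next_token = logits.argmax(dim=-1, keepdim=True)
        decoder_input = torch.cat([decoder_input, next_token], dim=1)
        if next_token.item() == eos_id:
            break

    return decoder_input.squeeze(0)
```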
Use the notebooks/attention_visualization.ipynb notebook to:
- Load trained models
- Visualize attention patterns
- Analyze model behavior
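The notebook reads attention scores out of the trained model; the snippet below is only a generic illustration of how such a matrix can be rendered as a heatmap, with random weights standing in for real attention scores.

```python
import torch
import matplotlib.pyplot as plt

# Stand-in attention matrix (10 query positions x 10 key positions).
# In the notebook this would come from the model's stored attention scores.
attn = torch.softmax(torch.randn(10, 10), dim=-1)

fig, ax = plt.subplots(figsize=(5, 5))
im = ax.imshow(attn.numpy(), cmap="viridis")
ax.set_xlabel("Key positions")
ax.set_ylabel("Query positions")
fig.colorbar(im, ax=ax)
plt.show()
```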
The model configuration is managed in config/config.py:
{
"batch_size": 8, # Training batch size
"num_epochs": 20, # Number of training epochs
"lr": 1e-4, # Learning rate
"seq_len": 350, # Maximum sequence length
"d_model": 512, # Model dimension
"lang_src": "en", # Source language (English)
"lang_tgt": "it", # Target language (Italian)
"model_folder": "weights", # Model checkpoint directory
"preload": None, # Path to pretrained model
"experiment_name": "runs/tmodel" # Experiment tracking name
}

The implementation uses the OPUS Books dataset for English-Italian translation:
- Automatically downloaded via HuggingFace datasets
- Includes proper tokenization with special tokens ([SOS], [EOS], [PAD])
- Handles variable-length sequences with padding
- Creates appropriate attention masks for training
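As an illustration of the masking step, the sketch below builds a padding mask and a causal (look-ahead) mask. The exact shapes used inside the BilingualDataset class may differ, so treat the shapes here as an assumption.

```python
import torch

def padding_mask(token_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Mask out [PAD] positions: (batch, seq_len) -> (batch, 1, 1, seq_len)."""
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2).int()

def causal_mask(size: int) -> torch.Tensor:
    """Lower-triangular mask so decoder position i attends only to positions <= i."""
    return torch.tril(torch.ones(1, size, size, dtype=torch.int))

# Example: a padded target sequence of length 6 with pad_id = 0
tgt = torch.tensor([[5, 7, 9, 2, 0, 0]])
mask = padding_mask(tgt, pad_id=0) & causal_mask(tgt.size(1))
print(mask.shape)  # torch.Size([1, 1, 6, 6])
```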
- Model Dimension (d_model): 512
- Feed-Forward Dimension: 2048
- Number of Heads: 8
- Number of Layers: 6 (encoder) + 6 (decoder)
- Vocabulary Size: Dynamic (based on tokenizer)
- Maximum Sequence Length: 350 tokens
- Attention Mechanism: Scaled dot-product attention
- Positional Encoding: Sinusoidal functions (sin/cos)
- Normalization: Layer normalization (pre-norm configuration)
- Dropout: Applied throughout the model for regularization
- Weight Initialization: Xavier initialization
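The core of the attention mechanism is small enough to show directly. The following is a minimal sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / √d_k) V; the (batch, heads, seq_len, d_k) shape convention is a common choice rather than something taken verbatim from transformer/model.py.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None, dropout=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (batch, h, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block masked positions
    attn = scores.softmax(dim=-1)
    if dropout is not None:
        attn = dropout(attn)
    return attn @ v, attn

# Tiny smoke test: 8 heads of dimension d_model / h = 512 / 8 = 64
q = k = v = torch.randn(2, 8, 10, 64)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # (2, 8, 10, 64) and (2, 8, 10, 10)
```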
- Data Preprocessing: Tokenization and sequence preparation
- Model Initialization: Transformer model with specified configuration
- Training Loop: Forward pass, loss calculation, backpropagation
- Validation: BLEU score evaluation on validation set
- Checkpointing: Model state saving for resuming training
- Monitoring: Real-time metrics via Weights & Biases
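To make the loop concrete, here is a hedged sketch of a single training epoch. The batch keys (encoder_input, decoder_input, encoder_mask, decoder_mask, label) and the model methods (encode, decode, project) are assumptions about the interface, not guaranteed to match scripts/train.py exactly.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, train_dataloader, tokenizer_tgt, optimizer, device):
    """Hypothetical sketch of one epoch: forward pass, loss, backpropagation."""
    pad_id = tokenizer_tgt.token_to_id("[PAD]")
    loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id, label_smoothing=0.1)
    model.train()

    for batch in train_dataloader:
        encoder_input = batch["encoder_input"].to(device)    # (B, seq_len)
        decoder_input = batch["decoder_input"].to(device)    # (B, seq_len)
        encoder_mask = batch["encoder_mask"].to(device)
        decoder_mask = batch["decoder_mask"].to(device)
        label = batch["label"].to(device)                    # (B, seq_len)

        encoder_output = model.encode(encoder_input, encoder_mask)
        decoder_output = model.decode(encoder_output, encoder_mask,
                                      decoder_input, decoder_mask)
        logits = model.project(decoder_output)               # (B, seq_len, vocab)

        loss = loss_fn(logits.view(-1, logits.size(-1)), label.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```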
The model is evaluated using:
- BLEU Score, Word Error Rate (WER), Character Error Rate (CER): Standard metrics for translation quality
- Attention Visualization: Qualitative analysis of attention patterns
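Assuming the metrics are computed with torchmetrics (an assumption; the repository may use a different backend), evaluation on decoded strings looks roughly like this:

```python
from torchmetrics.text import BLEUScore, WordErrorRate, CharErrorRate

predictions = ["the cat sits on the mat"]
references = [["the cat sat on the mat"]]   # BLEU expects a list of references per prediction

bleu = BLEUScore()(predictions, references)
wer = WordErrorRate()(predictions, ["the cat sat on the mat"])
cer = CharErrorRate()(predictions, ["the cat sat on the mat"])
print(f"BLEU={bleu:.3f}  WER={wer:.3f}  CER={cer:.3f}")
```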
The implementation is fully compatible with Google Colab:
- All notebooks run seamlessly in Colab environment
- Automatic GPU detection and utilization
- Pre-configured for easy experimentation
- No local setup required
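For reference, GPU detection in the notebooks boils down to the standard PyTorch pattern shown below; this is a generic sketch, not a copy of the notebook code.

```python
import torch

# Use the Colab GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```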
To train on a new language pair:
- Update lang_src and lang_tgt in the configuration
- Ensure dataset availability for the language pair
- Adjust the vocabulary size if needed
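For example, retargeting the configuration to English-German could look like the sketch below; the get_config() helper and the import path are assumptions about how config/config.py exposes its settings.

```python
# Hypothetical example: reuse the existing configuration for a new language pair.
# Assumes config/config.py exposes a get_config() helper returning the dict above.
from config.config import get_config

config = get_config()
config["lang_src"] = "en"
config["lang_tgt"] = "de"                        # must be available for the chosen dataset
config["experiment_name"] = "runs/tmodel_en_de"
```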
To use a custom dataset:
- Modify the dataset loading in scripts/train.py
- Ensure data format compatibility with the BilingualDataset class
- Update tokenizer training if needed
To modify the model:
- Adjust hyperparameters in config/config.py
- Modify the model architecture in transformer/model.py
- Update the training script accordingly
- Attention Is All You Need - Original Transformer paper
- The Illustrated Transformer - Visual explanation
- The Annotated Transformer - Detailed implementation guide
Contributions are welcome! Please feel free to submit pull requests or open issues for:
- Bug fixes
- Performance improvements
- Additional features
- Documentation enhancements
This project is licensed under the MIT License - see the LICENSE file for details.
This implementation is based on the original Transformer paper. Special thanks to the PyTorch team and the open-source ML community for providing excellent tools and resources.