This project trains a transformer-based English ↔ Gujarati translation model from scratch using a custom-trained Byte-Level BPE tokenizer and Hugging Face's EncoderDecoderModel.
- Python 🐍
- Hugging Face Transformers 🤗
- Tokenizers (Byte-Level BPE)
- PyTorch (GPU with mixed precision)
- uv for clean dependency management
- Logging + TQDM for progress tracking
 
Set up the environment with uv:

```bash
uv venv --python python3.11
uv pip install -e .
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

Project structure:

```
eng-guj-translator/
├── data/
│   └── opus.gu-en.tsv
├── models/
│   └── v1/
├── logs/
│   └── train.log
├── src/
│   └── translator/
│       ├── train_model.py
│       ├── translate.py
│       └── version.py
├── pyproject.toml
└── README.md
```

Training pipeline:

```mermaid
flowchart TD
    A[TSV Dataset<br>English-Gujarati] --> B[Custom BPE Tokenizer]
    B --> C[PreTrainedTokenizerFast]
    C --> D[EncoderDecoderModel<br> BERT2BERT]
    D --> E["Trainer<br>(HuggingFace)"]
    E --> F[Trained Model + Tokenizer Saved]
```
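
The pipeline starts from the parallel TSV file in `data/`. Below is a minimal sketch of loading it into sentence pairs, assuming one tab-separated Gujarati/English pair per line (the actual column order in `opus.gu-en.tsv` may differ):

```python
# Sketch: load the OPUS TSV into parallel sentence pairs.
# Assumption: each line is "<gujarati>\t<english>"; adjust the column order
# if data/opus.gu-en.tsv stores English first.
import csv

def load_pairs(path: str = "data/opus.gu-en.tsv"):
    pairs = []
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip malformed lines
            gu, en = row[0].strip(), row[1].strip()
            if gu and en:
                pairs.append({"en": en, "gu": gu})
    return pairs
```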

Tokenizer pipeline:

```mermaid
flowchart TD
    RawText -->|split into| BPE[Byte-Level BPE Tokens]
    BPE --> TokenIDs[Assigned Token IDs]
    TokenIDs --> TokenizerJSON[tokenizer.json saved]
    TokenizerJSON --> HFTokenizer[PreTrainedTokenizerFast]
```
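
A minimal sketch of how this tokenizer stage could look with the `tokenizers` library. The vocabulary size comes from the training notes below; the corpus file name and special-token strings are assumptions, not the project's exact choices:

```python
# Sketch: train a Byte-Level BPE tokenizer and wrap it for Transformers.
# Assumptions: corpus.txt holds one sentence per line (English and Gujarati);
# special-token names are illustrative.
from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["corpus.txt"],
    vocab_size=32_000,                 # matches the vocabulary size used for training
    min_frequency=2,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
)
bpe.save("tokenizer.json")             # single file, reusable across models

# Wrap the saved tokenizer so it plugs into EncoderDecoderModel / Trainer.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    mask_token="<mask>",
)
```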

Train the model with:

```bash
python -m translator.train_model
```

Training runs on GPU with the following setup (sketched after this list):
- 🧠 Mixed precision (fp16)
- 📈 TQDM + custom ETA logging
- 🔁 3 epochs
- 🧱 BERT as encoder + decoder
- 🌟 Vocabulary size: 32,000
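
A minimal sketch of what this setup might look like; `hf_tokenizer` is the wrapped tokenizer from the sketch above, `train_dataset` is assumed to be an already-tokenized dataset, and values not listed above (hidden sizes, batch size, paths) are illustrative:

```python
# Sketch: BERT2BERT EncoderDecoderModel trained from scratch with the HF Trainer.
from transformers import (
    BertConfig, EncoderDecoderConfig, EncoderDecoderModel,
    Trainer, TrainingArguments,
)

vocab_size = 32_000  # matches the custom BPE tokenizer

enc_cfg = BertConfig(vocab_size=vocab_size)
dec_cfg = BertConfig(vocab_size=vocab_size, is_decoder=True, add_cross_attention=True)
model = EncoderDecoderModel(
    config=EncoderDecoderConfig.from_encoder_decoder_configs(enc_cfg, dec_cfg)
)

# Special-token ids are set explicitly so generation and loss masking work.
model.config.pad_token_id = hf_tokenizer.pad_token_id
model.config.decoder_start_token_id = hf_tokenizer.bos_token_id
model.config.bos_token_id = hf_tokenizer.bos_token_id
model.config.eos_token_id = hf_tokenizer.eos_token_id
model.config.vocab_size = vocab_size

args = TrainingArguments(
    output_dir="models/v1",
    num_train_epochs=3,              # 3 epochs
    fp16=True,                       # mixed precision on GPU
    per_device_train_batch_size=16,  # illustrative
    logging_steps=100,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("models/v1")
```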
 
Translate with:

```bash
python -m translator.translate
```

```text
Input:  "How are you?"
Output: "તમે કેમ છો?"
```
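
A sketch of what the inference path might look like, assuming the model and tokenizer were saved under `models/v1`; the generation parameters are illustrative:

```python
# Sketch: translate with the saved EncoderDecoderModel and tokenizer.
from transformers import EncoderDecoderModel, PreTrainedTokenizerFast

model = EncoderDecoderModel.from_pretrained("models/v1")
tokenizer = PreTrainedTokenizerFast.from_pretrained("models/v1")

def translate(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=64,       # illustrative generation settings
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(translate("How are you?"))   # expected: "તમે કેમ છો?"
```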
Implementation notes:

- `pad_token_id`, `bos_token_id`, and `vocab_size` are set explicitly
- `fp16=True` enables mixed-precision training
- `Trainer` handles GPU/CPU placement automatically
- `tokenizer.json` is reusable across models
Logging:

- Logs go to `logs/train.log`
- ETA + memory usage are logged (a callback sketch follows this list)
- tqdm progress bars are shown during tokenization
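
A sketch of how the ETA and memory logging could be wired in as a `TrainerCallback`; the exact implementation in `src/translator/train_model.py` may differ:

```python
# Sketch: log ETA and peak GPU memory to logs/train.log during training.
import logging
import time

import torch
from transformers import TrainerCallback

logging.basicConfig(filename="logs/train.log", level=logging.INFO)
logger = logging.getLogger("translator")

class EtaMemoryCallback(TrainerCallback):
    def on_train_begin(self, args, state, control, **kwargs):
        self.start = time.time()

    def on_log(self, args, state, control, **kwargs):
        if state.global_step and state.max_steps:
            elapsed = time.time() - self.start
            eta = elapsed / state.global_step * (state.max_steps - state.global_step)
            mem_gb = (torch.cuda.max_memory_allocated() / 1e9
                      if torch.cuda.is_available() else 0.0)
            logger.info("step %d/%d  ETA %.0fs  peak GPU mem %.2f GB",
                        state.global_step, state.max_steps, eta, mem_gb)

# trainer.add_callback(EtaMemoryCallback())
```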
 
Planned improvements:

- Add BLEU score evaluation (a possible starting point is sketched below)
- Build a Gradio interface for GUI-based translation
- Automate model versioning (`v1`, `v2`, etc.)
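
For the BLEU item, one possible starting point is `sacrebleu` (an extra dependency, not yet part of the project), reusing the `translate` helper sketched above:

```python
# Sketch: corpus-level BLEU over held-out pairs with sacrebleu.
# Assumption: test_pairs is a list of {"en": ..., "gu": ...} dicts.
import sacrebleu

def evaluate_bleu(test_pairs) -> float:
    hypotheses = [translate(pair["en"]) for pair in test_pairs]
    references = [[pair["gu"] for pair in test_pairs]]  # one reference stream
    return sacrebleu.corpus_bleu(hypotheses, references).score
```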
Author: Divyang — Solution Architect working in Cloud, AI/ML & Semiconductors

License: MIT