A lightweight English-to-Gujarati neural machine translation model built on a Transformer architecture. This project demonstrates custom training on parallel corpora using PyTorch and Hugging Face Transformers, and provides inference via a simple command-line interface.

👀 English–Gujarati Translator (BERT2BERT + Custom Tokenizer)

This project trains a transformer-based English → Gujarati translation model from scratch using a custom-trained Byte-Level BPE tokenizer and Hugging Face's EncoderDecoderModel.


🛠️ Tech Stack

  • Python 🐍
  • HuggingFace Transformers 🤗
  • Tokenizers (Byte-Level BPE)
  • PyTorch (GPU with Mixed Precision)
  • uv for clean dependency management
  • Logging + TQDM for progress tracking

🚀 Setup

uv venv --python python3.11
uv pip install -e .
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

📂 Directory Structure

eng-guj-translator/
├── data/
│   └── opus.gu-en.tsv
├── models/
│   └── v1/
├── logs/
│   └── train.log
├── src/
│   └── translator/
│       ├── train_model.py
│       ├── translate.py
│       └── version.py
├── pyproject.toml
└── README.md
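
The data file is expected to hold tab-separated Gujarati-English sentence pairs from OPUS. Below is a minimal loading sketch, assuming two columns and no header row; check the actual file and swap the columns if the order differs.

```python
# Minimal sketch: read parallel pairs out of data/opus.gu-en.tsv.
# Assumption: two tab-separated columns (Gujarati, English) and no header row.
import csv

def load_pairs(path="data/opus.gu-en.tsv"):
    pairs = []
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) >= 2 and row[0].strip() and row[1].strip():
                pairs.append({"gu": row[0].strip(), "en": row[1].strip()})
    return pairs

if __name__ == "__main__":
    pairs = load_pairs()
    print(f"{len(pairs)} sentence pairs, e.g. {pairs[0]}")
```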

🧠 Architecture Overview

flowchart TD
    A[TSV Dataset<br>English-Gujarati] --> B[Custom BPE Tokenizer]
    B --> C[PreTrainedTokenizerFast]
    C --> D[EncoderDecoderModel<br> BERT2BERT]
    D --> E["Trainer<br>(HuggingFace)"]
    E --> F[Trained Model + Tokenizer Saved]
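
A minimal sketch of the wiring above: two BERT configs combined into one randomly initialised EncoderDecoderModel. Layer counts and hidden sizes are left at BERT defaults purely for illustration; the project's actual values live in src/translator/train_model.py.

```python
# Sketch of the BERT2BERT setup: encoder and decoder are both BERT configs,
# the decoder gets cross-attention, and the combined model starts untrained.
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

VOCAB_SIZE = 32000  # matches the custom tokenizer's vocabulary

encoder_cfg = BertConfig(vocab_size=VOCAB_SIZE)
decoder_cfg = BertConfig(vocab_size=VOCAB_SIZE, is_decoder=True, add_cross_attention=True)

config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_cfg, decoder_cfg)
model = EncoderDecoderModel(config=config)  # random weights, i.e. trained from scratch
print(f"{model.num_parameters():,} parameters")
```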

🧪 Tokenizer Pipeline

flowchart TD
    RawText -->|split into| BPE[Byte-Level BPE Tokens]
    BPE --> TokenIDs[Assigned Token IDs]
    TokenIDs --> TokenizerJSON[tokenizer.json saved]
    TokenizerJSON --> HFTokenizer[PreTrainedTokenizerFast]
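
A minimal sketch of the same pipeline, assuming the tokenizer is trained on both columns of the OPUS TSV and that the special tokens are named <pad>, <s>, </s>, <unk>, <mask>; the actual names are defined in train_model.py.

```python
# Train a Byte-Level BPE tokenizer, save tokenizer.json, and wrap it for HF use.
from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

def iter_text(path="data/opus.gu-en.tsv"):
    """Yield every non-empty column (English and Gujarati) from the TSV."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            for col in line.rstrip("\n").split("\t"):
                if col.strip():
                    yield col.strip()

bpe = ByteLevelBPETokenizer()
bpe.train_from_iterator(
    iter_text(),
    vocab_size=32000,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>", "<mask>"],
)
bpe.save("tokenizer.json")  # single reusable file

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    pad_token="<pad>", bos_token="<s>", eos_token="</s>", unk_token="<unk>",
)
print(tokenizer("How are you?").input_ids)
```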

🏋️‍♂️ Training

python -m translator.train_model

Training runs on GPU with the following settings (a minimal Trainer sketch follows the list):

  • 🧠 Mixed Precision (fp16)
  • 📈 TQDM + custom ETA logging
  • 🔁 3 Epochs
  • 🧱 BERT as encoder + decoder
  • 🌟 Vocabulary size: 32,000
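
The sketch below mirrors the settings listed above. Batch size, logging interval, and the data collator are assumptions, and model, tokenizer, and train_dataset come from the earlier steps in src/translator/train_model.py.

```python
# Sketch of the Trainer setup: fp16 mixed precision, 3 epochs, output in models/v1.
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="models/v1",
    num_train_epochs=3,
    fp16=True,                       # mixed precision on GPU
    per_device_train_batch_size=32,  # assumption, tune to available GPU memory
    logging_steps=100,
    save_strategy="epoch",
    report_to="none",                # keep logging local (logs/train.log)
)

trainer = Trainer(
    model=model,                     # from the BERT2BERT sketch above
    args=args,
    train_dataset=train_dataset,     # tokenized pairs, built in train_model.py
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("models/v1")
tokenizer.save_pretrained("models/v1")
```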

🧪 Inference

python -m translator.translate

Input:  "How are you?"
Output: "તમે કેમ છો?"
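
Roughly what the translate module does under the hood, assuming the model and tokenizer were saved to models/v1; the actual paths and generation settings live in src/translator/translate.py.

```python
# Load the trained model + tokenizer and greedily decode one translation.
from transformers import EncoderDecoderModel, PreTrainedTokenizerFast

model = EncoderDecoderModel.from_pretrained("models/v1")
tokenizer = PreTrainedTokenizerFast.from_pretrained("models/v1")

inputs = tokenizer("How are you?", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    decoder_start_token_id=tokenizer.bos_token_id,  # assumption: decoding starts at <s>
    max_length=64,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```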


⚙️ Configuration Highlights

  • pad_token_id, bos_token_id, vocab_size set explicitly
  • fp16=True enables mixed-precision training
  • Trainer handles GPU/CPU placement automatically
  • tokenizer.json is reusable across models
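
A minimal sketch of that explicit wiring, continuing from the tokenizer and model sketches above; the token names are the same assumptions, and the authoritative values are set in train_model.py.

```python
# Tell the encoder-decoder which IDs mean padding / sequence start, and
# record the vocabulary size so generation and the loss use the right shapes.
model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id            # assumption: </s> ends a sequence
model.config.decoder_start_token_id = tokenizer.bos_token_id  # needed before .generate()
model.config.vocab_size = model.config.decoder.vocab_size     # 32,000
```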

📈 Metrics & Logging

  • Logs go to: logs/train.log
  • ETA + memory usage logged
  • Supports tqdm progress bars during tokenization
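
A minimal sketch of the assumed logging setup; the file path matches the tree above, while the format string is a guess.

```python
# Send training logs to logs/train.log.
import logging
import os

os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    filename="logs/train.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("training started")
```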

📌 Next Steps

  • Add BLEU score evaluation (a minimal sketch follows this list)
  • Build a Gradio interface for GUI-based translation
  • Automate model versioning (v1, v2, etc.)
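
For the BLEU item, a minimal sketch using sacrebleu, which is not yet a project dependency (install it with uv pip install sacrebleu).

```python
# Corpus-level BLEU over model outputs vs. reference translations.
import sacrebleu

hypotheses = ["તમે કેમ છો?"]    # model outputs, one string per test sentence
references = [["તમે કેમ છો?"]]  # one list of references aligned with the hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```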

🙌 Author

Divyang — Solution Architect working in Cloud, AI/ML & Semiconductors


📜 License

MIT
