This project is a pipeline for correcting spelling mistakes in handwritten Hebrew text. The solution involves three main components:
- Word Segmentation: Splits an image containing multiple words into individual word images.
- HTR (handwritten text recognition): Extracts text from the segmented word images using an existing model.
- Spelling Correction: Fixes spelling mistakes in the extracted text using an MT5-based language model fine-tuned on Hebrew data.
The pipeline takes an image of handwritten Hebrew text, splits it into individual word images, recognizes the text in each image using HTR, and then corrects any spelling mistakes using a language model trained on Hebrew song lyrics.
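The flow described above can be sketched as a plain function that chains the three stages. This is only an illustration of the data flow: `run_pipeline` and the three stage callables are hypothetical names, not the functions defined in this repository, and correcting the joined text in one pass (rather than word by word) is an assumption.

```python
from typing import Callable, Sequence

# Hypothetical alias: an image is treated as an opaque object in this sketch.
Image = object


def run_pipeline(
    page: Image,
    segment: Callable[[Image], Sequence[Image]],
    recognize: Callable[[Image], str],
    correct: Callable[[str], str],
) -> str:
    """Chain the stages: word segmentation -> HTR -> spelling correction."""
    word_images = segment(page)                        # one image per word
    raw_words = [recognize(img) for img in word_images]  # raw HTR output
    raw_text = " ".join(raw_words)
    return correct(raw_text)                           # corrected text


# Stand-in stages for demonstration only (assumptions, not the real models):
# a "page" is a string, "segmentation" splits on "|", "HTR" strips whitespace.
result = run_pipeline(
    "shalom | olam",
    segment=lambda page: page.split("|"),
    recognize=lambda img: img.strip(),
    correct=lambda text: text,
)
```

Keeping each stage behind a simple callable makes it possible to swap the HTR model or the correction model without touching the rest of the flow.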
This module handles segmenting an image into smaller images, each containing a single word. We implemented this component to support handwritten text.
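As an illustration of the idea, the sketch below splits a binarized text line into words using a vertical projection profile, a common baseline for word segmentation. It is not necessarily the method used in this repository; the function name `segment_columns`, the `min_gap` parameter, and the list-of-lists image format are assumptions made for the example.

```python
from typing import List, Tuple


def segment_columns(binary: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges of words, ordered right to left.

    `binary` is a row-major grid where 1 marks ink and 0 marks background.
    A run of at least `min_gap` empty columns separates two words.
    `end` is exclusive, so a word occupies columns start..end-1.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: does each column contain any ink?
    has_ink = [any(row[x] for row in binary) for x in range(width)]

    spans = []
    start = None     # first column of the word currently being built
    last_ink = None  # most recent column that contained ink
    for x, ink in enumerate(has_ink):
        if not ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The gap since the last inked column is wide enough: close the word.
            spans.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        spans.append((start, last_ink + 1))
    # Hebrew is written right to left, so the rightmost word comes first.
    return spans[::-1]
```

Each returned range can then be used to crop the original image into a single-word image. Small gaps inside a word (narrower than `min_gap`) are kept together, which matters for handwriting where letters are often not connected.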
The input image was taken from: https://github.com/Lotemn102/HebHTR
For text extraction, we integrated an existing HTR model from here: https://github.com/Lotemn102/HebHTR
It is based on Harald Scheidl's CTC-WordBeam, implemented here: https://github.com/githubharald/CTCWordBeamSearch.git
The final component is a language model based on MT5. We fine-tuned the model to correct spelling errors by training it on a custom dataset.
We trained the model on Guy Barash's Hebrew songs lyrics dataset (https://www.kaggle.com/datasets/guybarash/hebrew-songs-lyrics), which contains the lyrics of ~15,000 Hebrew songs. We randomly trimmed each entry and introduced random spelling mistakes into it.
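A minimal sketch of this kind of data preparation is shown below. The function names, the window size, the error rate, and the three edit operations (delete, substitute, duplicate) are all assumptions chosen for illustration; the actual trimming and corruption rules used for training may differ.

```python
import random

# The Hebrew base letters; final forms are left out to keep the sketch short.
HEBREW_LETTERS = "אבגדהוזחטיכלמנסעפצקרשת"


def trim_entry(text, rng, max_words=8):
    """Keep a random contiguous window of at most `max_words` words."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    start = rng.randrange(len(words) - max_words + 1)
    return " ".join(words[start:start + max_words])


def add_typos(text, rng, rate=0.1):
    """Apply a random edit to each Hebrew letter with probability `rate`."""
    out = []
    for ch in text:
        if ch in HEBREW_LETTERS and rng.random() < rate:
            op = rng.choice(("delete", "substitute", "duplicate"))
            if op == "delete":
                continue                               # drop the letter
            if op == "substitute":
                out.append(rng.choice(HEBREW_LETTERS))  # replace with a random letter
            else:
                out.append(ch + ch)                    # type the letter twice
            continue
        out.append(ch)  # spaces, punctuation, and untouched letters pass through
    return "".join(out)


def make_pair(lyric, rng):
    """Build one (noisy input, clean target) training pair."""
    target = trim_entry(lyric, rng)
    return add_typos(target, rng), target
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the generated dataset reproducible between runs.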
![image](https://private-user-images.githubusercontent.com/69907548/359826118-1c0220db-8700-424d-90e7-a7a8191fa6da.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMTY5MTEsIm5iZiI6MTczOTIxNjYxMSwicGF0aCI6Ii82OTkwNzU0OC8zNTk4MjYxMTgtMWMwMjIwZGItODcwMC00MjRkLTkwZTctYTdhODE5MWZhNmRhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDE5NDMzMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWIyNWI5ODg3OGE0MDkzMTM0NzRlYmIyNzViOGVmZDRlZThlZDMxMjM3OTYzODg1NjFmMDg0MGIxYjBlNTZmOTEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.zgin8eaoZBgGKQayAflqSwboE5f6112h5jyl7uAZL_Q)
- Python and TensorFlow Requirements:
  - Python version: 3.6 or 3.7
  - TensorFlow version: 1.15
- Clone the repository:
  `git clone https://github.com/netabecker/handwritten_Hebrew_spellchecking.git`
- Clone Harald Scheidl's CTC-WordBeam repository:
  `git clone https://github.com/githubharald/CTCWordBeamSearch.git`
- Create a virtual environment for the project and install the required dependencies:
  `pip install -r requirements.txt`
- In the virtual environment, install your local clone of CTC-WordBeam:
  `pip install -e path/to/CTC-WordBeam`
- Run `TrainModel.py`, or download a trained model and place it in the main directory.
To run the model, run `main.py`. If there is nothing in the default locations, the script will prompt you for an input image and a model checkpoint.
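Because the setup depends on specific Python and TensorFlow versions, it can help to verify the environment before running anything. The following is a small optional sketch, not part of the repository; the function names are hypothetical, and the only library attribute it relies on is `tensorflow.__version__`.

```python
import sys

SUPPORTED_PYTHON = {(3, 6), (3, 7)}  # versions listed in the requirements


def python_ok(version_info=sys.version_info):
    """True if the interpreter's major.minor version is supported."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def tensorflow_ok(version):
    """True for TensorFlow 1.15 and its patch releases (e.g. 1.15.5)."""
    return version == "1.15" or version.startswith("1.15.")


def check_environment():
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if not python_ok():
        found = "{}.{}".format(*sys.version_info[:2])
        problems.append("Python 3.6 or 3.7 is required, found " + found)
    try:
        import tensorflow as tf
    except ImportError:
        problems.append("TensorFlow is not installed")
    else:
        if not tensorflow_ok(tf.__version__):
            problems.append("TensorFlow 1.15 is required, found " + tf.__version__)
    return problems
```

Calling `check_environment()` inside the virtual environment and printing the returned list gives a quick report of any version mismatch.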
The pipeline outputs corrected text that is more accurate than the raw HTR output, especially where spelling errors are common in the handwritten source.
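One common way to quantify such an improvement is the character error rate (CER): the edit distance between the output and a ground-truth transcription, divided by the length of the ground truth. The sketch below is an assumption about how this could be measured, not the evaluation used by the project, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)
```

Computing `cer(raw_htr_text, reference)` and `cer(corrected_text, reference)` on the same samples gives a direct before-and-after comparison; a lower value means fewer character-level errors.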