OSCAR: Hebrew Handwritten Spell Checker

This project is a pipeline for correcting spelling mistakes in handwritten Hebrew text. The solution involves three main components:

Word Segmentation: Splits an image containing multiple words into individual word images.
HTR (Handwritten Text Recognition): Extracts text from the segmented word images using an existing model.
Spelling Correction: Fixes spelling mistakes in the extracted text using an MT5-based language model fine-tuned on Hebrew data.

Overview

The model takes an image of handwritten Hebrew text, splits it into individual word images, recognizes the text using HTR, and then corrects any spelling mistakes using a language model trained on Hebrew song lyrics.
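The flow described above can be sketched as a simple composition of three stages. This is only an illustrative sketch: `run_pipeline` and its three callable parameters are hypothetical names and do not correspond to the actual functions in main.py.

```python
def run_pipeline(image, segment_words, recognize_word, correct_spelling):
    """Chain the three stages; every callable here is a hypothetical stand-in.

    segment_words:    image -> list of word images, assumed to be in reading order
    recognize_word:   word image -> recognized string (the HTR step)
    correct_spelling: raw text -> corrected text (the MT5-based step)
    """
    word_images = segment_words(image)                    # stage 1: segmentation
    raw_words = [recognize_word(w) for w in word_images]  # stage 2: HTR per word
    raw_text = " ".join(raw_words)                        # rebuild the sentence
    return correct_spelling(raw_text)                     # stage 3: correction
```

Passing the stages in as callables keeps the sketch independent of any particular segmentation, HTR, or language-model implementation.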

Components

1. Word Segmentation

This module handles segmenting an image into smaller images, each containing a single word. We implemented this component to support handwritten text. The input image was taken from: https://github.com/Lotemn102/HebHTR
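One common way to split a single text line into words is a vertical projection profile: columns without ink that form a wide enough gap are treated as word boundaries. The sketch below illustrates that general technique and is not necessarily what segmentWordsInSentence.py does. The function name `segment_columns`, the `min_gap` parameter, and the assumption of an already binarized single-line image are all hypothetical.

```python
def segment_columns(rows, min_gap=3):
    """Return (start, end) column ranges of words, end exclusive.

    rows:    binarized image as a sequence of pixel rows, nonzero = ink
    min_gap: number of consecutive blank columns that separates two words
    """
    # A column "has ink" if any pixel in it is nonzero.
    ink = [any(column) for column in zip(*rows)]
    spans = []
    start = None  # first column of the word currently being scanned
    gap = 0       # blank columns seen since the last ink column
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        # The last word runs to the edge or ends in a gap shorter than min_gap.
        spans.append((start, len(ink) - gap))
    # Hebrew is read right to left, so the rightmost word comes first.
    return spans[::-1]
```

Each returned range can then be used to crop one word image from the line. Real handwriting often needs extra steps (binarization, line separation, noise removal) that this sketch leaves out.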

2. Hebrew Handwritten Text Recognition

For text extraction, we integrated an existing HTR model from here: https://github.com/Lotemn102/HebHTR

It is based on Harald Scheidl's CTC-WordBeam, implemented here: https://github.com/githubharald/CTCWordBeamSearch.git

3. Spelling Correction

The final component is a language model based on MT5. We fine-tuned the model to correct spelling errors by training it on a custom dataset.

We trained the model on Guy Barash's Hebrew songs lyrics dataset (https://www.kaggle.com/datasets/guybarash/hebrew-songs-lyrics), which contains the lyrics of ~15,000 Hebrew songs. To build training pairs, we randomly trimmed each entry and introduced random spelling mistakes into it.
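The trimming and error-injection step could look roughly like the sketch below. It is an assumption-laden illustration, not the logic in createDataset.py: the function names, the three error types (delete, replace, duplicate), the 10% error rate, and the 8-word window are all hypothetical choices.

```python
import random

# The 22 Hebrew letters plus the 5 final forms.
HEBREW_LETTERS = "אבגדהוזחטיכלמנסעפצקרשתךםןףץ"

def random_trim(text, max_words, rng=None):
    """Keep a random contiguous window of at most max_words words."""
    rng = random.Random() if rng is None else rng
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    start = rng.randrange(len(words) - max_words + 1)
    return " ".join(words[start:start + max_words])

def add_typos(text, rate=0.1, rng=None):
    """Corrupt each Hebrew letter with probability `rate` (hypothetical scheme)."""
    rng = random.Random() if rng is None else rng
    out = []
    for ch in text:
        if ch in HEBREW_LETTERS and rng.random() < rate:
            op = rng.choice(("delete", "replace", "duplicate"))
            if op == "delete":
                continue                                 # drop the letter
            if op == "replace":
                out.append(rng.choice(HEBREW_LETTERS))   # swap in a random letter
                continue
            out.append(ch)                               # duplicate: emit it twice
        out.append(ch)
    return "".join(out)

def make_pair(lyric, rng=None):
    """Return (noisy_input, clean_target) for seq2seq fine-tuning."""
    target = random_trim(lyric, max_words=8, rng=rng)
    return add_typos(target, rate=0.1, rng=rng), target
```

Passing a seeded `random.Random` instance as `rng` makes the generated dataset reproducible.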

Installation

Python and TensorFlow Requirements:
- Python version: 3.6 or 3.7
- TensorFlow version: 1.15

Clone the repository:

git clone https://github.com/netabecker/handwritten_Hebrew_spellchecking.git

Clone Harald Scheidl's CTC-WordBeam repository:

git clone https://github.com/githubharald/CTCWordBeamSearch.git

Create a virtual environment for the project and install the required dependencies:
```
pip install -r requirements.txt
```
In the virtual environment, install your local clone of CTC-WordBeam:
```
pip install -e path/to/CTC-WordBeam
```
Run trainModel.py to train the model yourself, or download a trained model and place it in the main directory.

Usage

To run the model, run main.py. If nothing is found in the default locations, the script will prompt you for an input image and a model checkpoint.

Results

The pipeline outputs corrected text with significantly improved accuracy over the raw HTR output, especially in cases where spelling errors are common in handwritten documents.
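One way to quantify such an improvement is the character error rate (CER): the edit distance between a predicted string and the ground truth, divided by the ground-truth length. The source does not state which metric was used, so the sketch below is only an assumption about how a comparison between raw and corrected output could be measured. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate of hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)
```

Computing `cer` once for the raw recognized text and once for the corrected text, against the same ground truth, gives two directly comparable numbers.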
