We developed a model that performs spellchecking in handwritten Hebrew

netabecker/handwritten_Hebrew_spellchecking

OSCAR: Hebrew Handwritten Spell Checker

This project is a pipeline for correcting spelling mistakes in handwritten Hebrew text. The solution involves three main components:

  1. Word Segmentation: Splits an image containing multiple words into individual word images.
  2. HTR: Extracts text from the segmented word images using an existing model.
  3. Spelling Correction: Fixes spelling mistakes in the extracted text using an mT5-based language model fine-tuned on Hebrew data.

Overview

The model takes an image of handwritten Hebrew text, splits it into individual word images, recognizes the text using HTR, and then corrects any spelling mistakes using a language model trained on Hebrew song lyrics.
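The flow described above can be sketched as a simple composition of three stages. This is only an illustrative outline, not the repository's code: `run_pipeline` and its three callable parameters are hypothetical names, and joining the recognized words with spaces before correction is an assumption.

```python
from typing import Any, Callable, List


def run_pipeline(
    page_image: Any,
    segment_words: Callable[[Any], List[Any]],
    recognize_word: Callable[[Any], str],
    correct_spelling: Callable[[str], str],
) -> str:
    """Chain the three stages: segmentation, HTR, spelling correction."""
    word_images = segment_words(page_image)               # stage 1: one image per word
    raw_words = [recognize_word(w) for w in word_images]  # stage 2: HTR on each word image
    raw_text = " ".join(raw_words)                        # assumption: words joined by spaces
    return correct_spelling(raw_text)                     # stage 3: language-model correction


# Stand-in stages for illustration only; they do no real image processing.
demo_output = run_pipeline(
    "helo wrld",
    segment_words=str.split,
    recognize_word=lambda w: w,
    correct_spelling=lambda t: t.replace("helo", "hello").replace("wrld", "world"),
)
print(demo_output)
```

Passing the stages as callables keeps each component replaceable, which matches the way the HTR stage here is an existing external model.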

Components

1. Word Segmentation

This module segments an image into smaller images, each containing a single word. We implemented this component ourselves to support handwritten text.

The input image used in the example was taken from: https://github.com/Lotemn102/HebHTR
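The README does not say which segmentation algorithm is used. As a rough illustration of the general idea, the sketch below uses a vertical projection profile, a common technique that is not necessarily the one implemented here. It assumes the image is already binarized into rows of 0/1 values (1 = ink); the function name `segment_words` and the `min_gap` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_words(binary: Sequence[Sequence[int]], min_gap: int = 3) -> List[Tuple[int, int]]:
    """Return (start_col, end_col) spans, end exclusive, ordered right to left.

    A word is a run of columns containing ink; it ends once at least
    `min_gap` consecutive blank columns follow it.
    """
    # A column "has ink" if any pixel in it is nonzero.
    has_ink = [any(column) for column in zip(*binary)]
    spans = []
    start = None  # first column of the word currently being scanned
    gap = 0       # number of consecutive blank columns seen after ink
    for col, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = col
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, col - gap + 1))
                start, gap = None, 0
    if start is not None:
        # Close the last word, dropping any trailing blank columns.
        spans.append((start, len(has_ink) - gap))
    # Hebrew is read right to left, so the rightmost word comes first.
    return spans[::-1]


# Two "words": columns 0-1 and columns 6-8, separated by four blank columns.
row = [1, 1, 0, 0, 0, 0, 1, 1, 1]
word_spans = segment_words([row, row])
print(word_spans)
```

Each span can then be used to crop the original image into a single-word image for the HTR stage.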

2. Hebrew Handwritten Text Recognition

For text extraction, we integrated an existing HTR model from here: https://github.com/Lotemn102/HebHTR

It is based on Harald Scheidl's CTC-WordBeam, implemented here: https://github.com/githubharald/CTCWordBeamSearch.git

3. Spelling Correction

The final component is a language model based on mT5. We fine-tuned the model to correct spelling errors by training it on a custom dataset.

We trained the model on Guy Barash's Hebrew songs lyrics dataset (https://www.kaggle.com/datasets/guybarash/hebrew-songs-lyrics), which contains the lyrics of ~15,000 Hebrew songs. To build training examples, we randomly trimmed each entry and induced random spelling mistakes.
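The exact trimming and corruption procedure is not documented here. The following is a minimal sketch of how such (noisy input, clean target) pairs could be generated. The function names, the set of error types (substitution, deletion, duplication), the `error_rate` and the `max_words` limit are all assumptions made for illustration.

```python
import random

# Hebrew alphabet including final forms; only these characters are corrupted.
HEBREW_LETTERS = "אבגדהוזחטיכךלמםנןסעפףצץקרשת"


def random_trim(text: str, rng: random.Random, max_words: int = 8) -> str:
    """Keep a random contiguous run of at most `max_words` words."""
    words = text.split()
    if not words:
        return ""
    length = rng.randint(1, min(max_words, len(words)))
    start = rng.randint(0, len(words) - length)
    return " ".join(words[start:start + length])


def corrupt(text: str, rng: random.Random, error_rate: float = 0.1) -> str:
    """Randomly substitute, delete or duplicate Hebrew letters."""
    out = []
    for ch in text:
        if ch in HEBREW_LETTERS and rng.random() < error_rate:
            op = rng.choice(("substitute", "delete", "duplicate"))
            if op == "substitute":
                out.append(rng.choice(HEBREW_LETTERS))
            elif op == "duplicate":
                out.append(ch + ch)
            # "delete": append nothing
        else:
            out.append(ch)  # spaces and punctuation are never changed
    return "".join(out)


def make_pair(lyric: str, rng: random.Random):
    """Return (noisy_input, clean_target) for seq2seq fine-tuning."""
    target = random_trim(lyric, rng)
    return corrupt(target, rng), target


rng = random.Random(0)  # fixed seed so the example is reproducible
noisy, clean = make_pair("שלום עולם יפה ונעים", rng)
print(noisy, "->", clean)
```

The model is then trained to map the noisy string back to the clean one, which is the text-to-text framing that mT5 uses.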

Installation

  1. Python and TensorFlow Requirements:

    • Python version: 3.6 or 3.7
    • TensorFlow version: 1.15
  2. Clone the repository:

    git clone https://github.com/netabecker/handwritten_Hebrew_spellchecking.git
  3. Clone Harald Scheidl's CTC-WordBeam repository:

    git clone https://github.com/githubharald/CTCWordBeamSearch.git
  4. Create a virtual environment for the project and install the required dependencies:

    pip install -r requirements.txt
  5. In the virtual environment, install your local clone of CTC-WordBeam:

    pip install -e path/to/CTC-WordBeam
  6. Run TrainModel.py, or download a trained model and place it in the main directory.

Usage

To run the model, run main.py. If nothing is found in the default locations, the script will prompt you for an input image and a model checkpoint.

Results

The pipeline outputs corrected text that is more accurate than the raw HTR output, particularly for handwritten documents in which spelling errors are common.
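No metric is reported here. One common way to quantify an improvement of this kind is the character error rate (CER): the edit distance between a predicted string and the reference, divided by the reference length. The sketch below is a generic implementation and not the project's evaluation code; `edit_distance` and `cer` are illustrative names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)
```

Computing `cer` once for the raw HTR output and once for the corrected output, against the same ground-truth transcription, gives a direct before-and-after comparison.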

References

  1. Harald Scheidl's SimpleHTR model
  2. CTC-WordBeamSearch
  3. "Hebrew songs lyrics" on Kaggle
  4. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer". 2021
