RekhaMandyaRaju/Text-Entity-recogniser

CSCI 544 Homework 3

Error detection and correction: dealing with homophone confusion

Training Data

  • Project Gutenberg
  • Free e-book downloads
  • NLTK Corpus Package

Creation of Training file

  • After extracting the files from Project Gutenberg and the free e-books, use mergeFiles.py to read them and write their contents to a single text file.

  • Use collect_data1.py to POS-tag the text file.

  • Use collect_data2.py to obtain tagged words from the NLTK corpus.

  • Append both sets of tagged words to a single file to form the training data.

    • python2.7 mergeFiles.py train.dat training.txt
    • python2.7 collect_data1.py training.txt training_tag
    • python2.7 collect_data2.py training_corpus
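
The internals of mergeFiles.py are not shown here. As a minimal sketch of the merge step, assuming it simply concatenates several plain-text files into one, the following could be used. The function name `merge_text_files` and the choice to drop blank lines are illustrative assumptions, not taken from the repository.

```python
def merge_text_files(input_paths, output_path):
    """Concatenate text files into output_path, skipping blank lines.

    Hypothetical helper: returns the number of lines written.
    """
    lines_written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            # errors="replace" keeps going on e-books with odd encodings
            with open(path, encoding="utf-8", errors="replace") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if line.strip():
                        out.write(line + "\n")
                        lines_written += 1
    return lines_written
```

Writing one non-empty line per output line keeps the merged file easy to feed to a tagger line by line.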

After obtaining the tagged words in the form of a training file, use hw3train.py to learn from the training file by executing the following command:

python3 hw3train.py TRAININGFILE train_format
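
The layout of train_format is not documented here. As a rough sketch of how tagged words might be turned into training examples for homophone confusion, assuming each example is the confusable word (the label) plus the neighbouring words and POS tags as features, one could write something like the following. The names `make_examples` and `format_example`, the feature names, and the one-example-per-line layout are all assumptions for illustration.

```python
def make_examples(tagged_tokens, confusion_words):
    """Build (label, features) pairs for every confusable word.

    tagged_tokens: list of (word, pos_tag) pairs, e.g. from a POS tagger.
    confusion_words: set of lowercase words to learn about (assumed input).
    """
    examples = []
    for i, (word, _tag) in enumerate(tagged_tokens):
        label = word.lower()
        if label not in confusion_words:
            continue
        # Use sentinel values at the edges of the token sequence.
        prev_word, prev_tag = tagged_tokens[i - 1] if i > 0 else ("<s>", "<s>")
        if i + 1 < len(tagged_tokens):
            next_word, next_tag = tagged_tokens[i + 1]
        else:
            next_word, next_tag = ("</s>", "</s>")
        feats = [
            "prev_word=" + prev_word.lower(),
            "prev_tag=" + prev_tag,
            "next_word=" + next_word.lower(),
            "next_tag=" + next_tag,
        ]
        examples.append((label, feats))
    return examples


def format_example(label, feats):
    """Serialise one example as 'label feat1 feat2 ...' (assumed layout)."""
    return label + " " + " ".join(feats)
```

The word itself is deliberately left out of the features, since at test time it is exactly the part that may be wrong.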

Execution of Perceptron

To learn from the training file and create MODELFILE, execute the following command:

python3 perceplearn.py train_format MODELFILE
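
The code of perceplearn.py is not reproduced here. For orientation, a minimal sketch of a multiclass perceptron over string features is given below, assuming each training example is a label plus a list of feature strings. The function names, the fixed number of epochs, and the tie-breaking rule are illustrative choices, and this plain (non-averaged) variant may differ from what the script actually does.

```python
from collections import defaultdict


def predict(weights, feats):
    """Return the label whose weight vector scores the features highest.

    Ties are broken by alphabetical label order so results are deterministic.
    """
    def score(label):
        return sum(weights[label].get(f, 0.0) for f in feats)
    return max(sorted(weights), key=score)


def train_perceptron(examples, epochs=5):
    """Train on a list of (label, feature_list) pairs.

    Returns a dict mapping label -> {feature: weight}.
    """
    labels = sorted({label for label, _ in examples})
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for gold, feats in examples:
            guess = predict(weights, feats)
            if guess != gold:
                # Standard perceptron update: reward the gold label,
                # penalise the wrongly predicted one.
                for f in feats:
                    weights[gold][f] += 1.0
                    weights[guess][f] -= 1.0
    return weights
```

A model file could then be produced by serialising the returned dictionary, for example with the `json` module.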

To classify the test data read from STDIN, execute the following command:

cat hw3.test.err.txt | python3 hw3tag.py MODELFILE hw3.output.txt

where MODELFILE is the model file generated by perceplearn.py.

Here the input is the file containing errors and the output is the corrected file.
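
To illustrate the correction step, here is a small sketch of how a trained model might be applied to one line of text. It assumes the model is a mapping from candidate word to feature weights, that features are the previous and next tokens, and that a word is only replaced when another member of its confusion set scores strictly higher. The confusion sets, the function names, and the feature scheme are assumptions for illustration and are not taken from hw3tag.py.

```python
# Illustrative confusion sets; the real script may use a different list.
CONFUSION_SETS = [{"its", "it's"}, {"your", "you're"}]


def context_features(tokens, i):
    """Features for position i: the neighbouring tokens, lowercased."""
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return ["prev=" + prev_tok, "next=" + next_tok]


def score(weights, candidate, feats):
    """Sum the weights of the active features for one candidate word."""
    return sum(weights.get(candidate, {}).get(f, 0.0) for f in feats)


def correct_line(line, weights):
    """Replace confusable words when another candidate scores strictly higher."""
    tokens = line.split()
    for i, tok in enumerate(tokens):
        lower = tok.lower()
        for conf_set in CONFUSION_SETS:
            if lower not in conf_set:
                continue
            feats = context_features(tokens, i)
            best = lower  # keep the original word unless it is beaten
            for cand in sorted(conf_set):
                if score(weights, cand, feats) > score(weights, best, feats):
                    best = cand
            if best != lower:
                # Preserve sentence-initial capitalisation.
                tokens[i] = best.capitalize() if tok[0].isupper() else best
    return " ".join(tokens)
```

To mirror the command above, each line read from `sys.stdin` could be passed through `correct_line` and written to the output file. Note that this sketch splits on whitespace only, so a word with attached punctuation (such as "its,") is not matched, and runs of spaces are collapsed.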

Third Party Software Used

  • NLTK Toolkit - POS tagging feature of NLTK
  • NLTK Corpus
