CSCI 544 Homework 3
Error detection and correction: dealing with homophone confusion
Training Data
- Project Gutenberg
- Free e-book downloads
- NLTK Corpus Package
Creation of Training File
- After extracting the files from Project Gutenberg and the free e-books, use mergeFiles.py to read them and write their contents to a single text file.
- Use collect_data1.py to obtain POS tags for the text file.
- Use collect_data2.py to obtain tagged_words from the NLTK Corpus.
- Append both sets of tagged words to a single file to form the training data.
- python2.7 mergeFiles.py train.dat training.txt
- python2.7 collect_data1.py training.txt training_tag
- python2.7 collect_data2.py training_corpus
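The append step above can be sketched roughly as follows. This is a minimal illustration, not the code in collect_data1.py or collect_data2.py: the `word/TAG` line layout and the helper names `format_tagged` and `append_tagged` are assumptions made for the example.

```python
def format_tagged(pairs):
    """Render (word, tag) pairs as space-separated word/TAG tokens (assumed layout)."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def append_tagged(pairs, path):
    """Append one line of tagged words to the training file at `path`."""
    # Mode "a" keeps whatever an earlier step already wrote to the file.
    with open(path, "a", encoding="utf-8") as out:
        out.write(format_tagged(pairs) + "\n")
```

With NLTK installed and its data downloaded, the pairs could come from `nltk.pos_tag(nltk.word_tokenize(text))` for the merged text, or from the `tagged_words()` method of a tagged NLTK corpus reader.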
After obtaining the tagged words in the form of a training file, run hw3train.py on it with the following command:
- python3 hw3train.py TRAININGFILE train_format
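One plausible way to turn tagged words into per-example training lines is sketched below. Everything here is an assumption for illustration: this README does not list the homophone sets, the feature templates, or the layout of train_format, so `CONFUSION_SETS`, `context_features`, and `training_lines` are hypothetical.

```python
# Hypothetical confusion sets; the README does not say which homophones are handled.
CONFUSION_SETS = [{"its", "it's"}, {"your", "you're"}, {"their", "they're"}]
CANDIDATES = set().union(*CONFUSION_SETS)


def context_features(tagged, i):
    """Features from the word and POS tag on each side of position i."""
    prev_word, prev_tag = tagged[i - 1] if i > 0 else ("<s>", "BOS")
    next_word, next_tag = tagged[i + 1] if i + 1 < len(tagged) else ("</s>", "EOS")
    return [
        f"pw:{prev_word.lower()}",
        f"pt:{prev_tag}",
        f"nw:{next_word.lower()}",
        f"nt:{next_tag}",
    ]


def training_lines(tagged):
    """Yield 'LABEL feat feat ...' for every confusable word in a tagged sentence."""
    for i, (word, _tag) in enumerate(tagged):
        label = word.lower()
        if label in CANDIDATES:
            yield " ".join([label] + context_features(tagged, i))
```

The word itself is used as the label and only its neighbours as features, so the classifier has to learn the choice from context rather than from the possibly misspelled token.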
Execution of Perceptron
To learn from the training file and create MODELFILE, execute the following command:
python3 perceplearn.py train_format MODELFILE
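For orientation, a multiclass perceptron over sparse string features can be sketched as below. This is a generic textbook version written for this README, not the contents of perceplearn.py; the function names, the `(label, features)` example layout, and the default of 10 epochs are assumptions.

```python
from collections import defaultdict


def predict(weights, feats):
    """Return the label whose weight vector scores the features highest."""
    # Iterating labels in sorted order makes tie-breaking deterministic.
    return max(
        sorted(weights),
        key=lambda label: sum(weights[label].get(f, 0.0) for f in feats),
    )


def train_perceptron(examples, epochs=10):
    """examples: list of (label, [feature, ...]). Returns weights[label][feature]."""
    labels = sorted({label for label, _ in examples})
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        mistakes = 0
        for gold, feats in examples:
            guess = predict(weights, feats)
            if guess != gold:
                mistakes += 1
                # Standard update: reward the gold label, penalize the wrong guess.
                for f in feats:
                    weights[gold][f] += 1.0
                    weights[guess][f] -= 1.0
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights
```

The returned nested dictionary contains only strings and floats, so it could be written to a model file with `json.dump` and read back with `json.load`.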
To classify the TEST_DATA taken from STDIN, execute the following command:
cat hw3.test.err.txt | python3 hw3tag.py MODELFILE hw3.output.txt
where MODELFILE is the model file generated by perceplearn.py.
Here the input is the file containing errors and the output is the corrected file.
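The correction pass can be pictured as rewriting only the confusable tokens of each input line and leaving everything else untouched. The sketch below is an assumption about how such a pass might look, not the logic of hw3tag.py: the single `CONFUSION` pair and the `choose` callback (standing in for the trained classifier) are hypothetical.

```python
import re

# Hypothetical confusion pair; a real run would cover every homophone set.
CONFUSION = {"its": "it's", "it's": "its"}


def correct_line(line, choose):
    """Replace each confusable token with choose(token, prev_word, next_word)."""
    # The capturing group keeps the whitespace pieces, so spacing is preserved.
    parts = re.split(r"(\s+)", line)
    words = [i for i, p in enumerate(parts) if p and not p.isspace()]
    for k, i in enumerate(words):
        token = parts[i]
        if token.lower() in CONFUSION:
            prev_word = parts[words[k - 1]].lower() if k > 0 else "<s>"
            next_word = parts[words[k + 1]].lower() if k + 1 < len(words) else "</s>"
            best = choose(token.lower(), prev_word, next_word)
            # Keep the capitalization of the original first letter.
            parts[i] = best.capitalize() if token[0].isupper() else best
    return "".join(parts)
```

A driver script could call `correct_line` on every line read from `sys.stdin` and write the results to the output file. Tokens with attached punctuation, such as "its," at the end of a clause, are not handled by this simple whitespace split.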
Third Party Software Used
- NLTK Toolkit - POS_TAGGING feature of nltk
- NLTK Corpus