Laura's Code
Posts
##Post 1:
Program: LongestWord
Purpose and description: This program finds the longest word in a given text. The example shown here is Milton's Paradise Lost. This program, like the UnusualWords program, is useful for ascertaining the reading level of a given text.
Program capabilities:
- Read the given file
- Calculate the average word length in the text
- Output the longest word(s)
Code for LongestWord:
import nltk
text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word
print(longest)
Output:
unextinguishable
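The average-word-length capability listed above is not covered by the snippet, which also keeps only the first longest word it sees. A minimal sketch of both, using a hand-made token list in place of the Gutenberg corpus:

```python
def average_word_length(words):
    # Mean length of alphabetic tokens; punctuation is skipped.
    alpha = [w for w in words if w.isalpha()]
    return sum(len(w) for w in alpha) / float(len(alpha))

def longest_words(words):
    # All words tied for the maximum length, not just the first one found.
    best = max(len(w) for w in words)
    return sorted({w for w in words if len(w) == best})

# Toy sample standing in for nltk.corpus.gutenberg.words('milton-paradise.txt')
sample = ["Of", "Mans", "First", "Disobedience", ",", "and", "the", "Fruit"]
print(average_word_length(sample))
print(longest_words(sample))
```

Swap in the real corpus words and the same two functions cover both bullet points.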
##Post 2:
Program: GenderNames
Purpose and description: A program that reads lists of names and predicts whether a name is male or female. It also produces a graph illustrating how often names end in certain letters, by gender.
Program capabilities:
- Read two files
- Read the words in both files
- Output the names
- Plot the frequency of final letters across male and female names
Code for GenderNames:
import nltk
names = nltk.corpus.names
names.fileids()
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]  # names that appear in both lists
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()  # see a neat plot
Output:
Here you can see the frequency of the last letter of a name and how often it occurs in male or female names. The graph shows that names ending in 'a' are usually female names. Pretty neat!
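The code above only plots; the prediction step can be sketched with plain dictionaries. The name lists here are tiny made-up samples standing in for male.txt and female.txt:

```python
from collections import Counter, defaultdict

# Tiny illustrative samples; the real program reads names.words('male.txt') etc.
male = ["John", "Mark", "Peter", "Hugo"]
female = ["Anna", "Maria", "Julia", "Ingrid"]

# Count last letters per gender, mirroring the ConditionalFreqDist above.
last_letter = defaultdict(Counter)
for name in male:
    last_letter[name[-1].lower()]["male"] += 1
for name in female:
    last_letter[name[-1].lower()]["female"] += 1

def predict_gender(name):
    # Guess the gender whose names more often end with this letter.
    counts = last_letter[name[-1].lower()]
    return counts.most_common(1)[0][0] if counts else "unknown"

print(predict_gender("Laura"))
```

With the full NLTK name lists behind it, this one-feature guesser is the same idea the plot illustrates.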
##Post 3:
Program: POSTTagger
Purpose and description: POSTTagger is a simple program that tags sentences with their parts of speech (POS). This can be useful for ESL instructors breaking down sentences for their students.
Program capabilities:
- Hold any sentence the user inputs in a variable
- Tokenize the sentence
- Attach a part of speech to each word in the sentence
- Output the grammatically tagged sentence
Code for POSTTagger:
import nltk
text = nltk.word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
Output:
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
Note on the output:
- CC = coordinating conjunction
- RB = adverb
- IN = preposition/subordinating conjunction
- NN = noun
- JJ = adjective
- These tags and others are defined in the Penn Treebank tagset.
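The tag glosses above can be folded into a small lookup so the tagged output reads naturally. TAG_GLOSS is a hand-made dictionary covering just the tags in this example, and the tagged sentence is hard-coded from the output above so the sketch runs without NLTK:

```python
TAG_GLOSS = {
    "CC": "coordinating conjunction",
    "RB": "adverb",
    "IN": "preposition/subordinating conjunction",
    "NN": "noun",
    "JJ": "adjective",
}

# The tagged sentence from the output above.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

for word, tag in tagged:
    # Fall back to the raw tag for anything not in the gloss table.
    print("%-12s %s" % (word, TAG_GLOSS.get(tag, tag)))
```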
##Post 4:
Program: UnusualWords
Purpose and description: This program finds the unusual words in a text, i.e. words that do not appear in a standard English word list. This is great for judging the reading level of a text.
Program capabilities:
- Read all the lowercase words in a text
- Compare those words to a standard English word list (nltk.corpus.words); anything not in the list counts as unusual
- Output the resulting unusual words
Code for UnusualWords:
This example is going to look through the 1801 inaugural addresses, specifically Jefferson's. What kinds of words do you think might be there?
import nltk
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)
print(unusual_words(nltk.corpus.inaugural.words('1801-Jefferson.txt')))
Output:
['abuses', 'acknowledging', 'acquisitions', 'actions', 'administrations', 'adoring', 'affairs', 'agonizing', 'alliances', 'angels', 'announced', 'assembled', 'authorities', 'banished', 'bestowed', 'billows', 'bled', 'blessings', 'bulwarks', 'burthened', 'called', 'cases', 'charged', 'citizens', 'committed', 'concerns', 'convulsions', 'councils', 'debts', 'decisions', 'degradations', 'delights', 'descendants', 'destined', 'destinies', ...]
Are you surprised?
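The set difference is the whole trick, and it can be seen in isolation with a toy vocabulary standing in for nltk.corpus.words.words():

```python
# Toy English vocabulary; the real program builds this set from
# nltk.corpus.words.words().
english_vocab = {"we", "hold", "these", "truths", "to", "be"}

def unusual_words(tokens, vocab=english_vocab):
    # Lowercased alphabetic tokens not found in the vocabulary.
    text_vocab = set(w.lower() for w in tokens if w.isalpha())
    return sorted(text_vocab - vocab)

print(unusual_words(["We", "hold", "these", "burthened", "billows", ","]))
```

Note that inflected forms ('abuses', 'called') show up as "unusual" because the NLTK word list holds mostly base forms; stemming the tokens first would tighten the results.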
##Post 5:
Program: VocabularyHTML
Purpose and description: This program extracts selected words from a webpage and outputs them as a list, ready to copy and paste with no further formatting. This program is useful for gathering quick lists of words, e.g. vocabulary.
Program capabilities:
- Capture a webpage
- Import nltk, urllib, urlopen, pprint
- Target a certain range of words
- Output those words as a list with no punctuation
Code for VocabularyHTML:
import nltk
from urllib import urlopen  # Python 3: from urllib.request import urlopen
import pprint
url = "http://www.worldwidewords.org/weirdwords"
html = urlopen(url).read()
raw = nltk.clean_html(html)  # removed in NLTK 3; use BeautifulSoup to strip HTML instead
tokens = nltk.word_tokenize(raw)
tokens = [x.replace(";", "") for x in tokens]
tokens = filter(None, tokens)  # drop empty strings
print('\n'.join(tokens[200:300]))
Output:
Blurb
Bodacious
Bodger
Bombilation
Bonzer
Boondoggle
Bootless
Borborygmus
Boscage
Boustrophedonic
Bowdlerise
Bridewell
Brimborion
Brobdingnagian
Bromopnea
Brosiering
Brummagem
Bruxer
Burgoo
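The two cleanup steps in the code (stripping semicolons, then dropping empty strings) can be run in isolation on a hand-made token list:

```python
# Hand-made tokens; the real program gets these from nltk.word_tokenize.
tokens = ["Blurb;", "", "Bodacious", ";", "Bodger"]
tokens = [t.replace(";", "") for t in tokens]  # strip semicolons
tokens = [t for t in tokens if t]              # Python 3 spelling of filter(None, tokens)
print(tokens)
```

The list-comprehension form is preferable in Python 3, where filter() returns an iterator that cannot be sliced the way tokens[200:300] requires.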
##Post 6:
Program: WebpageFrequency
Purpose and description: This program lets a user tokenize a webpage and find the most frequent words used in it. It would be useful for those who need to stay abreast of commonly used words on the internet, e.g. copywriters, web designers, etc.
Program capabilities:
- Read a URL
- Tokenize all the words
- Count word frequencies
- Output the given number of most-used words
Code for WebpageFrequency:
import nltk
constitution = "http://www.archives.gov/exhibits/charters/constitution_transcript.html"
def freq_words(url):
    freqdist = nltk.FreqDist()
    text = nltk.clean_url(url)  # removed in NLTK 3; fetch the page and strip HTML yourself
    for word in nltk.word_tokenize(text):
        freqdist.inc(word.lower())  # NLTK 3: freqdist[word.lower()] += 1
    return freqdist
fd = freq_words(constitution)
print(fd.keys()[0:50])  # NLTK 2 sorts keys by frequency; NLTK 3: fd.most_common(50)
Output:
['the', ',', 'of', 'and', 'shall', 'be', 'to', 'in', 'or', ';', 'states', 'united', 'a', 'by', 'for', 'state', 'any', 'which', 'all', 'may', 'such', 'president', 'have', 'as', 'on', 'congress', 'no', 'from', 'he', 'other', 'house', 'not', 'but', 'section.', 'law', 'their', 'each', 'one', 'constitution', 'office', 'that', 'two', 'this', ':', 'at', 'person', 'senate', 'time', 'under', 'with']
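Since clean_url and FreqDist.inc are gone from NLTK 3, the counting step can be sketched with the standard library's Counter. The text here is a short hard-coded stand-in for the fetched page:

```python
from collections import Counter

# Stand-in for the cleaned page text; the original program fetched and
# stripped the Constitution transcript page.
text = "We the People of the United States , in Order to form a more perfect Union"

# Lowercase each whitespace-separated token and tally it.
fd = Counter(w.lower() for w in text.split())
print(fd.most_common(3))
```

most_common(n) returns (word, count) pairs sorted by frequency, which is what the old fd.keys()[0:50] idiom relied on.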
