Lasher09 edited this page May 1, 2014 · 1 revision

Posts

## Post 1:

Program: LongestWord

Purpose and description: This program finds the longest word in a given text. The example here uses Milton's Paradise Lost. Like the UnusualWords program, it is useful for gauging the reading level of a text.

Program capabilities:

  • Read the given file
  • Calculate the average word length in a given text
  • Output the longest word(s)

Code for LongestWord:

import nltk
text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word

print longest

Output:

unextinguishable
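The capability list above also mentions average word length, which the snippet doesn't compute. A minimal sketch of that calculation, using a short hypothetical token list in place of the Gutenberg corpus words:

```python
# Average word length over a token list (the sample tokens stand in
# for nltk.corpus.gutenberg.words('milton-paradise.txt')).
tokens = ["Of", "Mans", "First", "Disobedience", ",", "and", "the", "Fruit"]
words = [w for w in tokens if w.isalpha()]  # drop punctuation tokens
avg_len = sum(len(w) for w in words) / float(len(words))
print(round(avg_len, 2))  # 4.86 for this sample
```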

## Post 2:

Program: GenderNames

Purpose and description: A program that reads lists of names and predicts whether a given name is male or female. It also produces a graph illustrating how often names ending in certain letters are male or female.

Program capabilities:

  • Read two files
  • Read the words in both files
  • Output the names
  • Plot: compare last letters and output a plot showing how names are distributed by their final letter

Code for GenderNames:

import nltk
names = nltk.corpus.names
names.fileids()
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]    # names on both lists

cfd = nltk.ConditionalFreqDist((file, name[-1])
    for file in names.fileids()
    for name in names.words(file))
cfd.plot()    # see a neat plot

Output: the GenderMap plot (conditional frequency of last letters by gender)

Here you can see the frequency of each last letter and how often it occurs in male versus female names. The graph shows that names ending in 'a' are usually female. Pretty neat!
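The prediction step the description promises can be built on this same last-letter idea. A rough sketch on tiny hypothetical name lists (the real program would read male.txt and female.txt from nltk.corpus.names):

```python
# Last-letter gender heuristic on hypothetical sample name lists.
male = ["John", "Peter", "Mark", "Aaron"]
female = ["Anna", "Maria", "Susan", "Julia"]

def last_letter_counts(names):
    counts = {}
    for name in names:
        letter = name[-1].lower()
        counts[letter] = counts.get(letter, 0) + 1
    return counts

m, f = last_letter_counts(male), last_letter_counts(female)

def predict(name):
    letter = name[-1].lower()
    # more female than male names end in this letter -> guess female;
    # ties and unseen letters default to male
    return "female" if f.get(letter, 0) > m.get(letter, 0) else "male"

print(predict("Olivia"))  # ends in 'a' -> female
print(predict("Robin"))   # ends in 'n' -> male
```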


## Post 3:

Program: POSTTagger

Purpose and description: POSTTagger is a simple program that parses sentences and labels the parts of speech (POS). This can be useful for ESL instructors breaking down sentences for their students.

Program capabilities:

  • Have a variable that holds any sentence the user inputs
  • Tokenize the sentence
  • Attach parts of speech to each word in the sentence
  • Output the grammatically parsed sentence

Code for POSTTagger:

import nltk
text = nltk.word_tokenize("And now for something completely different")
print nltk.pos_tag(text)

Output:

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

Note on the output:

  • CC = coordinating conjunction
  • RB = adverb
  • IN = preposition/subordinating conjunction
  • NN = noun
  • JJ = adjective
  • These tags and others are documented in the Penn Treebank tagset.
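To get the "grammatically parsed sentence" as readable text rather than a list of tuples, the tagged pairs can be joined into word/TAG form. A small sketch, starting from the output list shown above:

```python
# Format tagged (word, tag) pairs as word/TAG text
# (input is the list produced by nltk.pos_tag above).
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
parsed = ' '.join('%s/%s' % (w, t) for w, t in tagged)
print(parsed)  # And/CC now/RB for/IN something/NN completely/RB different/JJ
```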

## Post 4:

Program: UnusualWords

Purpose and description: This program scans a text for its most unusual words. This is great for judging the reading level of a text.

Program capabilities:

  • Lowercase all the alphabetic words in a text
  • Compare the words in the text against a standard English word list
  • Output the words not found in that list, i.e. the unusual ones

Code for UnusualWords:

This example looks through the 1801 inaugural addresses, specifically Jefferson's. What kind of words do you think might be there?

import nltk
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

print unusual_words(nltk.corpus.inaugural.words('1801-Jefferson.txt'))

Output:

['abuses', 'acknowledging', 'acquisitions', 'actions', 'administrations', 'adoring', 'affairs', 'agonizing', 'alliances', 'angels', 'announced', 'assembled', 'authorities', 'banished', 'bestowed', 'billows', 'bled', 'blessings', 'bulwarks', 'burthened', 'called', 'cases', 'charged', 'citizens', 'committed', 'concerns', 'convulsions', 'councils', 'debts', 'decisions', 'degradations', 'delights', 'descendants', 'destined', 'destinies', ...]

Are you surprised?
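The core of unusual_words is just a set difference. The same logic on tiny hypothetical data, with a small stand-in for nltk.corpus.words.words():

```python
# Set-difference sketch of unusual_words on hypothetical sample data.
english_vocab = {"the", "people", "government", "of"}
text = ["The", "burthened", "people", ",", "convulsions", "of", "government"]
text_vocab = set(w.lower() for w in text if w.isalpha())
print(sorted(text_vocab - english_vocab))  # ['burthened', 'convulsions']
```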


## Post 5:

Program: VocabularyHTML

Purpose and description: This program extracts selected words from a webpage and outputs them in list format for easy copying and pasting, with no further formatting needed. It is useful for gathering quick word lists, e.g. vocabulary.

Program capabilities:

  • Capture a webpage
  • Import nltk, urllib, urlopen, pprint
  • Target a certain number of words
  • Output those words into a list format with no punctuation

Code for VocabularyHTML:

import nltk
from urllib import urlopen
import pprint
url = "http://www.worldwidewords.org/weirdwords"
html = urlopen(url).read()
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
tokens = [x.replace(";", "") for x in tokens]   # strip stray semicolons
tokens = filter(None, tokens)                   # drop empty strings
print '\n'.join(tokens[200:300])

Output:

Blurb
Bodacious
Bodger
Bombilation
Bonzer
Boondoggle
Bootless
Borborygmus
Boscage
Boustrophedonic
Bowdlerise
Bridewell
Brimborion
Brobdingnagian
Bromopnea
Brosiering
Brummagem
Bruxer
Burgoo
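Note that nltk.clean_html was dropped from later NLTK releases. A minimal stdlib alternative for stripping tags, sketched on Python 3 with a hypothetical HTML snippet standing in for the worldwidewords.org page:

```python
# Stripping HTML tags with the standard library instead of
# nltk.clean_html (sample HTML stands in for the fetched page).
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        # collect only the text between tags
        self.chunks.append(data)

html = "<html><body><b>Blurb</b> <i>Bodacious</i></body></html>"
stripper = TagStripper()
stripper.feed(html)
raw = ' '.join(''.join(stripper.chunks).split())  # normalize whitespace
print(raw)  # Blurb Bodacious
```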

## Post 6:

Program: WebpageFrequency

Purpose and description: This program lets a user tokenize a webpage and get the most frequent words used in it. It would be useful for those who need to stay abreast of the words commonly used on the web, e.g. copywriters, web designers, etc.

Program capabilities:

  • Read a URL
  • Tokenize all the words
  • Count how often each word occurs
  • Output the most frequently used words

Code for WebpageFrequency:

import nltk
constitution = "http://www.archives.gov/exhibits/charters/constitution_transcript.html"

def freq_words(url):
    freqdist = nltk.FreqDist()
    text = nltk.clean_url(url)
    for word in nltk.word_tokenize(text):
        freqdist.inc(word.lower())
    return freqdist

fd = freq_words(constitution)
print fd.keys()[0:50]    # in NLTK 2, keys() are sorted by decreasing frequency

Output:

['the', ',', 'of', 'and', 'shall', 'be', 'to', 'in', 'or', ';', 'states', 'united', 'a', 'by', 'for', 'state', 'any', 'which', 'all', 'may', 'such', 'president', 'have', 'as', 'on', 'congress', 'no', 'from', 'he', 'other', 'house', 'not', 'but', 'section.', 'law', 'their', 'each', 'one', 'constitution', 'office', 'that', 'two', 'this', ':', 'at', 'person', 'senate', 'time', 'under', 'with']
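FreqDist.inc is NLTK 2.x API and was removed in NLTK 3. The same counting can be done with a plain dict; a sketch on a hypothetical token list standing in for the tokenized Constitution page:

```python
# Word-frequency count with a plain dict (sample tokens stand in
# for nltk.word_tokenize of the fetched page text).
tokens = ["We", "the", "People", "of", "the", "United", "States"]
freqdist = {}
for word in tokens:
    w = word.lower()
    freqdist[w] = freqdist.get(w, 0) + 1
# sort words by decreasing count, like NLTK 2's fd.keys()
top = sorted(freqdist, key=freqdist.get, reverse=True)
print(top[0])  # 'the' appears twice, everything else once
```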
