Lasher09 edited this page May 1, 2014 · 1 revision

Posts

## Post 1:

Program: LongestWord

Purpose and description: This program finds the longest word in a given text. The example here uses Milton's Paradise Lost. Like the UnusualWords program, it is useful for gauging the reading level of a text.

Program capabilities:

  • Read the given file
  • Calculate the average word length in a given text
  • Output the longest word(s)

Code for LongestWord:

import nltk
text = nltk.corpus.gutenberg.words('milton-paradise.txt')
longest = ''
for word in text:
    if len(word) > len(longest):
        longest = word

print longest

Output:

unextinguishable
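The capability list above also mentions average word length, which the snippet doesn't compute. A minimal sketch of that calculation, using a short hypothetical token list in place of the Gutenberg corpus words:

```python
# Average word length over a token list (the sample tokens stand in
# for nltk.corpus.gutenberg.words('milton-paradise.txt')).
tokens = ["Of", "Mans", "First", "Disobedience", ",", "and", "the", "Fruit"]
words = [w for w in tokens if w.isalpha()]  # drop punctuation tokens
avg_len = sum(len(w) for w in words) / float(len(words))
print(round(avg_len, 2))  # 4.86 for this sample
```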

## Post 2:

Program: GenderNames

Purpose and description: A program that reads lists of names and predicts whether a given name is male or female. It also produces a graph illustrating how often names ending in certain letters are male or female.

Program capabilities:

  • Read two files
  • Read the words in both files
  • Output the names
  • Plot: compare last letters and output a plot showing how names are distributed by their final letter

Code for GenderNames:

import nltk
names = nltk.corpus.names
names.fileids()
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]    # names on both lists

cfd = nltk.ConditionalFreqDist((file, name[-1])
    for file in names.fileids()
    for name in names.words(file))
cfd.plot()    # see a neat plot

Output: the GenderMap plot (conditional frequency of last letters by gender)

Here you can see the frequency of each last letter and how often it occurs in male versus female names. The graph shows that names ending in 'a' are usually female. Pretty neat!
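The prediction step the description promises can be built on this same last-letter idea. A rough sketch on tiny hypothetical name lists (the real program would read male.txt and female.txt from nltk.corpus.names):

```python
# Last-letter gender heuristic on hypothetical sample name lists.
male = ["John", "Peter", "Mark", "Aaron"]
female = ["Anna", "Maria", "Susan", "Julia"]

def last_letter_counts(names):
    counts = {}
    for name in names:
        letter = name[-1].lower()
        counts[letter] = counts.get(letter, 0) + 1
    return counts

m, f = last_letter_counts(male), last_letter_counts(female)

def predict(name):
    letter = name[-1].lower()
    # more female than male names end in this letter -> guess female;
    # ties and unseen letters default to male
    return "female" if f.get(letter, 0) > m.get(letter, 0) else "male"

print(predict("Olivia"))  # ends in 'a' -> female
print(predict("Robin"))   # ends in 'n' -> male
```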


## Post 3:

Program: POSTTagger

Purpose and description: POSTTagger is a simple program that parses sentences and labels the parts of speech (POS). This can be useful for ESL instructors breaking down sentences for their students.

Program capabilities:

  • Have a variable that holds any sentence the user inputs
  • Tokenize the sentence
  • Attach parts of speech to each word in the sentence
  • Output the grammatically parsed sentence

Code for POSTTagger:

import nltk
text = nltk.word_tokenize("And now for something completely different")
print nltk.pos_tag(text)

Output:

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

Note on the output:

  • CC = coordinating conjunction
  • RB = adverb
  • IN = preposition/subordinating conjunction
  • NN = noun
  • JJ = adjective
  • These tags and others are documented in the Penn Treebank tagset.
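To get the "grammatically parsed sentence" as readable text rather than a list of tuples, the tagged pairs can be joined into word/TAG form. A small sketch, starting from the output list shown above:

```python
# Format tagged (word, tag) pairs as word/TAG text
# (input is the list produced by nltk.pos_tag above).
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
parsed = ' '.join('%s/%s' % (w, t) for w, t in tagged)
print(parsed)  # And/CC now/RB for/IN something/NN completely/RB different/JJ
```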

## Post 4:

Program: UnusualWords

Purpose and description: This program scans a text for its most unusual words. This is great for judging the reading level of a text.

Program capabilities:

  • Lowercase all the alphabetic words in a text
  • Compare the words in the text against a standard English word list
  • Output the words not found in that list, i.e. the unusual ones

Code for UnusualWords:

This example looks through the 1801 inaugural addresses, specifically Jefferson's. What kind of words do you think might be there?

import nltk
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

print unusual_words(nltk.corpus.inaugural.words('1801-Jefferson.txt'))

Output:

['abuses', 'acknowledging', 'acquisitions', 'actions', 'administrations', 'adoring', 'affairs', 'agonizing', 'alliances', 'angels', 'announced', 'assembled', 'authorities', 'banished', 'bestowed', 'billows', 'bled', 'blessings', 'bulwarks', 'burthened', 'called', 'cases', 'charged', 'citizens', 'committed', 'concerns', 'convulsions', 'councils', 'debts', 'decisions', 'degradations', 'delights', 'descendants', 'destined', 'destinies', ...]

Are you surprised?
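The core of unusual_words is just a set difference. The same logic on tiny hypothetical data, with a small stand-in for nltk.corpus.words.words():

```python
# Set-difference sketch of unusual_words on hypothetical sample data.
english_vocab = {"the", "people", "government", "of"}
text = ["The", "burthened", "people", ",", "convulsions", "of", "government"]
text_vocab = set(w.lower() for w in text if w.isalpha())
print(sorted(text_vocab - english_vocab))  # ['burthened', 'convulsions']
```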


## Post 5:

Program: VocabularyHTML

Purpose and description: This program extracts selected words from a webpage and outputs them in list format for easy copying and pasting, with no further formatting needed. It is useful for gathering quick word lists, e.g. vocabulary.

Program capabilities:

  • Capture a webpage
  • Import nltk, urllib, urlopen, pprint
  • Target a certain number of words
  • Output those words into a list format with no punctuation

Code for VocabularyHTML:

import nltk
from urllib import urlopen
import pprint
url = "http://www.worldwidewords.org/weirdwords"
html = urlopen(url).read()
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
tokens = [x.replace(";", "") for x in tokens]   # strip stray semicolons
tokens = filter(None, tokens)                   # drop empty strings
print '\n'.join(tokens[200:300])

Output:

Blurb
Bodacious
Bodger
Bombilation
Bonzer
Boondoggle
Bootless
Borborygmus
Boscage
Boustrophedonic
Bowdlerise
Bridewell
Brimborion
Brobdingnagian
Bromopnea
Brosiering
Brummagem
Bruxer
Burgoo
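Note that nltk.clean_html was dropped from later NLTK releases. A minimal stdlib alternative for stripping tags, sketched on Python 3 with a hypothetical HTML snippet standing in for the worldwidewords.org page:

```python
# Stripping HTML tags with the standard library instead of
# nltk.clean_html (sample HTML stands in for the fetched page).
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        # collect only the text between tags
        self.chunks.append(data)

html = "<html><body><b>Blurb</b> <i>Bodacious</i></body></html>"
stripper = TagStripper()
stripper.feed(html)
raw = ' '.join(''.join(stripper.chunks).split())  # normalize whitespace
print(raw)  # Blurb Bodacious
```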

## Post 6:

Program: WebpageFrequency

Purpose and description: This program lets a user tokenize a webpage and get the most frequent words used in it. It would be useful for those who need to stay abreast of the words commonly used on the web, e.g. copywriters, web designers, etc.

Program capabilities:

  • Read a URL
  • Tokenize all the words
  • Count how often each word occurs
  • Output the most frequently used words

Code for WebpageFrequency:

import nltk
constitution = "http://www.archives.gov/exhibits/charters/constitution_transcript.html"

def freq_words(url):
    freqdist = nltk.FreqDist()
    text = nltk.clean_url(url)
    for word in nltk.word_tokenize(text):
        freqdist.inc(word.lower())
    return freqdist

fd = freq_words(constitution)
print fd.keys()[0:50]    # in NLTK 2, keys() are sorted by decreasing frequency

Output:

['the', ',', 'of', 'and', 'shall', 'be', 'to', 'in', 'or', ';', 'states', 'united', 'a', 'by', 'for', 'state', 'any', 'which', 'all', 'may', 'such', 'president', 'have', 'as', 'on', 'congress', 'no', 'from', 'he', 'other', 'house', 'not', 'but', 'section.', 'law', 'their', 'each', 'one', 'constitution', 'office', 'that', 'two', 'this', ':', 'at', 'person', 'senate', 'time', 'under', 'with']
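FreqDist.inc is NLTK 2.x API and was removed in NLTK 3. The same counting can be done with a plain dict; a sketch on a hypothetical token list standing in for the tokenized Constitution page:

```python
# Word-frequency count with a plain dict (sample tokens stand in
# for nltk.word_tokenize of the fetched page text).
tokens = ["We", "the", "People", "of", "the", "United", "States"]
freqdist = {}
for word in tokens:
    w = word.lower()
    freqdist[w] = freqdist.get(w, 0) + 1
# sort words by decreasing count, like NLTK 2's fd.keys()
top = sorted(freqdist, key=freqdist.get, reverse=True)
print(top[0])  # 'the' appears twice, everything else once
```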
