Lasher09 edited this page May 1, 2014 · 1 revision

layout: post
author: jacobthill
title: Lemmatizing a Text

Program: Text Lemmatizing

Purpose and Description: Lemmatization is the process of grouping the various inflected forms of a word together for the purpose of analysis. In English, it most often involves removing affixes from the word. For example, if you want to determine the frequency of words from the same root as 'apply', you would need to identify forms such as 'application', 'applied', 'applies', etc. This NLTK function will help with this process.

Program Capabilities: Because English grammar rules are often inconsistent, it is difficult to write a program that handles every case. The lemmatization function in NLTK deals with this by listing each word along with its inflections in a dictionary (WordNet). Consequently, the function will only lemmatize words found in its dictionary.
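The dictionary-lookup idea described above can be sketched in plain Python. The word list here is a made-up toy example, not WordNet's actual data:

```python
# Toy illustration of dictionary-based lemmatization: each inflected
# form maps to its lemma. Unknown words are returned unchanged, just
# as the NLTK lemmatizer leaves out-of-dictionary words alone.
LEMMA_DICT = {
    'applies': 'apply',
    'applied': 'apply',
    'applying': 'apply',
    'geese': 'goose',
}

def lemmatize(word):
    # fall back to the word itself if it is not in the dictionary
    return LEMMA_DICT.get(word.lower(), word)

print(lemmatize('applied'))  # apply
print(lemmatize('geese'))    # goose
print(lemmatize('banana'))   # banana (not in dictionary, unchanged)
```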

Code:

import nltk # this should be done once at the start of every session

raw = open('document.txt').read() # opens, reads content of 'document.txt' and stores in variable 'raw'

tokens = nltk.word_tokenize(raw) # splits the raw text into a list of word tokens, stored in 'tokens'

wnl = nltk.WordNetLemmatizer()  # creates a WordNetLemmatizer object and stores it in 'wnl'
lemmas = [wnl.lemmatize(t) for t in tokens] # lemmatizes each token and stores the results in 'lemmas'

Output: Output will vary depending on the content of the 'document.txt' file.



layout: post
author: jacobthill
title: Scraping text from a webpage

Program: Scraping text from a webpage

Purpose and Description: This is a simple Python script for extracting text from a webpage. It can be challenging to get all of the text formatted the way you want, but this does a decent job, and you should be able to find other resources on the web to adapt the script to your particular needs. To get it to work, cut and paste the script into a text file, save it as name.py, and run it from your terminal (python name.py). Before you start, make sure you have Beautiful Soup installed (urllib2 is part of the Python 2 standard library, so this script requires Python 2).

Program Capabilities: Can remove html, css, and javascript code from a webpage.

Code:

# modified from http://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript

import urllib2
from bs4 import BeautifulSoup

# get url from user
url = raw_input("Enter a url:")
# open url
html = urllib2.urlopen(url)
# read content from url
page_content = html.read()
soup = BeautifulSoup(page_content, 'html.parser')  # name a parser explicitly so bs4 does not have to guess

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print text

Output: Output will vary depending on the content of the webpage.
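The script above is Python 2 only (urllib2 and the print statement). A rough Python 3 sketch of the same idea, using only the standard library's html.parser in place of Beautiful Soup, might look like the following; a hard-coded HTML string stands in for the download step (in a real script, urllib.request.urlopen(url).read() replaces urllib2.urlopen):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content while skipping <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep only non-blank text that is outside script/style
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

html = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello</p><script>var x=1;</script><p>world</p></body></html>")
parser = TextExtractor()
parser.feed(html)
print('\n'.join(parser.parts))  # Hello
                                # world
```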



layout: post
author: jacobthill
title: Scraping URLs from a webpage

Program: Scraping URLs from a webpage

Purpose and Description: This is a simple script for scraping URLs from a webpage. To get it to work, cut and paste the script into a text file, save it as name.py, and run it from your terminal (python name.py). Before getting started, make sure you have Beautiful Soup and Requests installed on your system.

Program Capabilities: Can find and extract URLs embedded in a webpage.

Code:

# Simple script for extracting url's from webpages, modified from www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/

from bs4 import BeautifulSoup

import requests

url = raw_input("Enter a url:")  # raw_input is Python 2; use input() in Python 3

r = requests.get("http://" + url)

data = r.text

soup = BeautifulSoup(data, 'html.parser')  # name a parser explicitly so bs4 does not have to guess

for link in soup.find_all('a'):
    print(link.get('href'))

Output: Output will vary depending on the content of the webpage.
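To see what Beautiful Soup's find_all('a') is doing under the hood, the same link extraction can be sketched with the standard library alone; the sample HTML here is made up:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="http://example.com">one</a> <a href="/about">two</a>')
print(collector.links)  # ['http://example.com', '/about']
```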



layout: post
author: ethan
title: Text Concordance

Program: Concordance

Purpose: This will help a user find where a word shows up in a text.

Code

import nltk # this should be done once at the start of every session
from nltk.book import * # loads the NLTK book texts, defining text1 through text9
text6.concordance("this") # Looks for the word "this" in text 6 (the Monty Python and the Holy Grail script).

Output

The output will vary depending on which text you decide to use and which word you put in the quotation marks.

Displaying 25 of 67 matches:
 other one ! ARTHUR : I am , ... and this is my trusty servant Patsy . We have
en since the snows of winter covered this land , through the kingdom of Mercea
t do you mean ? SOLDIER # 1 : Well , this is a temperate zone . ARTHUR : The s
, good Sir Knight , but I must cross this bridge . BLACK KNIGHT : Then you sha
ne . WITCH : They dressed me up like this . CROWD : Augh , we didn ' t ! We di
dn ' t ! We didn ' t ... WITCH : And this isn ' t my nose . It ' s a false one
BEDEVERE : Did you dress her up like this ? VILLAGER # 1 : No ! VILLAGER # 2 a
 VILLAGER # 4 : Here is a duck . Use this duck . [ quack quack quack ] BEDEVER
tly named Sir Not - appearing - in - this - film . Together they formed a band
rth to be banana - shaped . ARTHUR : This new learning amazes me , Sir Bedever
ry to talk to someone it ' s ' sorry this ' and ' forgive me that ' and ' I ' 
 ! Behold ! [ angels sing ] Arthur , this is the Holy Grail . Look well , Arth
, for it is your sacred task to seek this grail . That is your purpose , Arthu
 the Round Table . Who ' s castle is this ? FRENCH GUARD : This is the castle 
 ' s castle is this ? FRENCH GUARD : This is the castle of my master Guy de Lo
: I ' m French ! Why do think I have this outrageous accent , you silly king -
 time - a ! [ sniff ] ARTHUR : Now , this is your last chance . I ' ve been mo
ge ! [ mayhem ] FRENCH GUARD : Hey , this one is for your mother ! There you g
u go . [ mayhem ] FRENCH GUARD : And this one ' s for your dad ! ARTHUR : Run 
 , l -- look , i -- i -- if we built this large wooden badger -- [ clank ] [ t
ividually . [ clop clop clop ] Now , this is what they did : Launcelot -- KNIG
k . I have seen it ! It is here , in this -- ZOOT : Sir Galahad ! You would no
neteen - and - a - half , cut off in this castle with no one to protect us . O
 . We are doctors . GALAHAD : Look ! This cannot be . I am sworn to chastity .
the Grail ! I have seen it , here in this castle ! DINGO : Oh no . Oh , no ! B
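Under the hood, a concordance just prints a fixed-width window of context around each match. A minimal pure-Python sketch of that idea, with a tiny made-up token list:

```python
def concordance(tokens, word, width=4):
    """Return lines showing `width` tokens of context on each side of every match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append(f'{left} {tok} {right}'.strip())
    return lines

sample = 'I am and this is my trusty servant Patsy and this is a temperate zone'.split()
for line in concordance(sample, 'this'):
    print(line)
```

NLTK's version additionally pads each window to a fixed character width so the matches line up in a column, as in the output above.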


layout: post
author: ethan
title: Lexical Dispersion

Program Name: Lexical Text Dispersion

Program Purpose: The user wants to create a lexical dispersion plot, a chart showing where given words appear across the span of a text.

import nltk # this should be done once at the start of every session
from nltk.book import * # loads the NLTK book texts, defining text1 through text9
text1.dispersion_plot(["call", "today", "freedom", "whale"]) # This will create a dispersion chart in a new window (requires matplotlib).

Output

Here is a screenshot of the output for this example:

![Lexical Dispersion](http://i.imgur.com/uQo5tob.jpg)
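A dispersion plot is built from word offsets: the position of each occurrence of a word in the token stream. Computing those offsets takes one line of Python; the plotting is what NLTK hands off to matplotlib:

```python
def offsets(tokens, word):
    """Return the positions at which `word` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == word]

tokens = 'call me today and call me tomorrow'.split()
print(offsets(tokens, 'call'))  # [0, 4]
print(offsets(tokens, 'me'))    # [1, 5]
```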



layout: post
author: ethan
title: Text Count

Program Name: Text Count

Program Purpose: The user wants to know exactly how many times a word shows up in a text. Very simple.

import nltk # this should be done once at the start of every session
from nltk.book import * # loads the NLTK book texts, defining text1 through text9
text2.count("see") # Counts how many times "see" shows up in text 2.

Output

The output for this is very simple and straightforward.

173 # "See" shows up this many times. 
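Since an NLTK text behaves like a list of tokens, count is just list counting, and the same idea extends to relative frequency. A sketch with a made-up token list:

```python
tokens = 'to see or not to see that is the question'.split()
n = tokens.count('see')         # number of occurrences of 'see'
print(n)                        # 2
print(100 * n / len(tokens))    # 20.0, i.e. the word's percentage of the text
```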


layout: post
author: ethan
title: Top 50

Program Name: Top 50

Program Purpose: The user wants to generate a graph that shows a distribution of the top 50 words in the selected text.

Code and Output

First we want to build a frequency distribution and store it in a variable:

import nltk # this should be done once at the start of every session
from nltk.book import * # loads the NLTK book texts, defining text1 through text9
fdist1 = nltk.FreqDist(text2) # builds a frequency distribution over text 2 (Sense and Sensibility)
fdist1

This will give you some output

<FreqDist with 6833 samples and 141576 outcomes>

Continue with this:

vocabulary1 = fdist1.keys() # in older NLTK versions, keys() is sorted by frequency
vocabulary1[:50] # in NLTK 3 and later, use fdist1.most_common(50) instead

This gives you the top 50 most frequent tokens in the text (note that punctuation marks count as tokens):

[',', 'to', '.', 'the', 'of', 'and', 'her', 'a', 'I', 'in', 'was', 'it', '"', ';', 'she', 'be', 'that', 'for', 'not', 'as', 'you', 'with', 'had', 'his', 'he', "'", 'have', 'at', 'by', 'is', '."', 's', 'Elinor', 'on', 'all', 'him', 'so', 'but', 'which', 'could', 'Marianne', 'my', 'Mrs', 'from', 'would', 'very', 'no', 'their', 'them', '--']

After that output is produced, continue with this:

fdist1['have']

This will produce some output as well. It tells you the exact number of times that the word “have” shows up in the text.

807

Now to the good part, the distribution graph. Input this:

fdist1.plot(50, cumulative=True)

You will get a beautiful graph that looks something like this:

![Distribution](http://i.imgur.com/JLGVhQq.jpg)
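A FreqDist is essentially a frequency counter over tokens, so the same top-N table can be sketched with the standard library's collections.Counter; the token list here is a made-up toy example:

```python
from collections import Counter

tokens = 'the cat sat on the mat and the dog sat too'.split()
fdist = Counter(tokens)            # counts how often each token occurs
print(fdist.most_common(1))        # [('the', 3)] -- the single most frequent token
print(fdist['the'])                # 3, like fdist1['have'] above
```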