Jacob's Code
Program: Text Lemmatizing
Purpose and Description: Lemmatization is the process of grouping the various inflections of a given word for the purpose of analysis. In English, it most often involves removing the affixes from the word. For example, if you want to determine the frequency of words from the same root as 'apply', you would need to identify words such as 'application', 'applied', 'applies', etc. This nltk function will help with this process.
Program Capabilities: Because English grammar rules are often inconsistent, it can be difficult to write a program that works in all cases. The lemmatization function in nltk deals with this by listing each word along with its inflections in a dictionary. Consequently, the function will only lemmatize words found in its dictionary.
Code:
import nltk # this should be done once at the start of every session
raw = open('document.txt').read() # opens 'document.txt', reads its content and stores it in the variable 'raw'
tokens = nltk.word_tokenize(raw) # splits the raw text into a list of word tokens stored in 'tokens'
wnl = nltk.WordNetLemmatizer() # creates a WordNetLemmatizer object and stores it in 'wnl'
[wnl.lemmatize(t) for t in tokens] # returns a list with the lemma of each token in 'tokens'
Output: Output will vary depending on the content of the 'document.txt' file.
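To picture the dictionary-lookup behaviour described under Program Capabilities, here is a minimal sketch. It is a toy, not how nltk is implemented: the table `LEMMA_TABLE` and the function `toy_lemmatize` are made up for illustration, and the handful of entries are hypothetical data.

```python
# Toy lookup table: inflected form -> lemma.
# Hypothetical data for illustration, not nltk's dictionary.
LEMMA_TABLE = {
    "applies": "apply",
    "applied": "apply",
    "applying": "apply",
    "geese": "goose",
}

def toy_lemmatize(word):
    """Return the lemma if the word is in the table, else the word unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)

words = ["applies", "Applied", "geese", "blorfed"]
lemmas = [toy_lemmatize(w) for w in words]
print(lemmas)  # ['apply', 'apply', 'goose', 'blorfed']
```

The last word is not in the table, so it comes back unchanged. That is the same limitation noted above: a word outside the dictionary is left as it is.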
Program: Scraping text from a webpage
Purpose and Description: This is a simple Python script for extracting text from a webpage. It can be challenging to get all of the text formatted the way you want, but this does a decent job. You should be able to find other resources on the web to adapt this script to your particular needs. To get this to work, cut and paste the script into a text file, save it as name.py and run it from your terminal (python name.py). Before you start, make sure you have Beautiful Soup installed; urllib2 is part of the Python 2 standard library, so it needs no separate installation.
Program Capabilities: Can remove html, css, and javascript code from a webpage.
Code:
# modified from http://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript
import urllib2
from bs4 import BeautifulSoup
# get url from user
url = raw_input("Enter a url:")
# open url
html = urllib2.urlopen(url)
# read content from url
page_content = html.read()
soup = BeautifulSoup(page_content, "html.parser") # name the parser explicitly to avoid a warning
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract() # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (phrases separated by two spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print text
Output: Output will vary depending on the content of the webpage.
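If you want to see the idea without fetching a live page, the sketch below runs on a small inline HTML string. It swaps Beautiful Soup for the standard library's `html.parser` and assumes Python 3; the class name `TextOnly` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a script or style element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

# Hypothetical sample page
sample_html = """<html><head><style>p {color: red}</style>
<script>var x = 1;</script></head>
<body><h1>  Title  </h1>
<p>First line</p>

<p>Second line</p></body></html>"""

parser = TextOnly()
parser.feed(sample_html)
parser.close()
raw_text = "".join(parser.parts)

# Same clean-up idea as the script: strip each line, drop the blank ones.
lines = (line.strip() for line in raw_text.splitlines())
clean_text = "\n".join(line for line in lines if line)
print(clean_text)
```

The CSS and JavaScript never reach `clean_text`, and the blank lines and stray spaces around the title are gone.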
Program: Scraping URLs from a webpage
Purpose and Description: This is a simple script for scraping URLs from a webpage. To get this to work, cut and paste the script into a text file, save it as name.py and run it from your terminal (python name.py). Before getting started, make sure you have Beautiful Soup and requests installed on your system.
Program Capabilities: Can find and extract URLs embedded in a webpage.
Code:
# Simple script for extracting URLs from webpages, modified from www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/
from bs4 import BeautifulSoup
import requests
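url = raw_input("Enter a url (without http://): ")
r = requests.get("http://" + url) # the scheme is added here, so leave it out when typing the url
data = r.text
soup = BeautifulSoup(data, "html.parser") # name the parser explicitly to avoid a warning
for link in soup.find_all('a'):
    print(link.get('href')) # prints None for <a> tags that have no href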
Output: Output will vary depending on the content of the webpage.
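Many links on a page are relative (for example `/about`), so the raw `href` values are not always usable on their own. Here is a rough sketch of one way to collect links and resolve them against the page address. It uses only the Python 3 standard library instead of Beautiful Soup and requests; the class name `LinkCollector`, the sample HTML and the example.com addresses are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip <a> tags that have no href
            self.links.append(urljoin(self.base_url, href))

# Hypothetical sample markup
sample_html = (
    '<a href="/about">About</a> '
    '<a name="top">Top</a> '
    '<a href="http://example.org/x">X</a>'
)

collector = LinkCollector("http://example.com/index.html")
collector.feed(sample_html)
print(collector.links)
# ['http://example.com/about', 'http://example.org/x']
```

The middle anchor has no `href`, so it is skipped rather than reported as `None`.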
Program: Concordance
Purpose: This will help a user find where a word shows up in a text.
Code
import nltk # this should be done once at the start of every session
from nltk.book import * # loads the sample texts (text1 to text9) used below
text6.concordance("this") # Looks for the word "this" in text 6.
Output
The output will vary depending on which text you decide to use and which word you put in the quotation marks.
Displaying 25 of 67 matches:
other one ! ARTHUR : I am , ... and this is my trusty servant Patsy . We have
en since the snows of winter covered this land , through the kingdom of Mercea
t do you mean ? SOLDIER # 1 : Well , this is a temperate zone . ARTHUR : The s
, good Sir Knight , but I must cross this bridge . BLACK KNIGHT : Then you sha
ne . WITCH : They dressed me up like this . CROWD : Augh , we didn ' t ! We di
dn ' t ! We didn ' t ... WITCH : And this isn ' t my nose . It ' s a false one
BEDEVERE : Did you dress her up like this ? VILLAGER # 1 : No ! VILLAGER # 2 a
VILLAGER # 4 : Here is a duck . Use this duck . [ quack quack quack ] BEDEVER
tly named Sir Not - appearing - in - this - film . Together they formed a band
rth to be banana - shaped . ARTHUR : This new learning amazes me , Sir Bedever
ry to talk to someone it ' s ' sorry this ' and ' forgive me that ' and ' I '
! Behold ! [ angels sing ] Arthur , this is the Holy Grail . Look well , Arth
, for it is your sacred task to seek this grail . That is your purpose , Arthu
the Round Table . Who ' s castle is this ? FRENCH GUARD : This is the castle
' s castle is this ? FRENCH GUARD : This is the castle of my master Guy de Lo
: I ' m French ! Why do think I have this outrageous accent , you silly king -
time - a ! [ sniff ] ARTHUR : Now , this is your last chance . I ' ve been mo
ge ! [ mayhem ] FRENCH GUARD : Hey , this one is for your mother ! There you g
u go . [ mayhem ] FRENCH GUARD : And this one ' s for your dad ! ARTHUR : Run
, l -- look , i -- i -- if we built this large wooden badger -- [ clank ] [ t
ividually . [ clop clop clop ] Now , this is what they did : Launcelot -- KNIG
k . I have seen it ! It is here , in this -- ZOOT : Sir Galahad ! You would no
neteen - and - a - half , cut off in this castle with no one to protect us . O
. We are doctors . GALAHAD : Look ! This cannot be . I am sworn to chastity .
the Grail ! I have seen it , here in this castle ! DINGO : Oh no . Oh , no ! B
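To get a feel for what a concordance does, here is a minimal pure-Python sketch that lists each match with a few words of context on either side. It is only a rough picture of the idea, not nltk's implementation; the function name `toy_concordance` and the `width` parameter are made up for illustration, and matching is case-insensitive to mirror the output above, where both "this" and "This" appear.

```python
def toy_concordance(tokens, word, width=3):
    """Return (left, match, right) context tuples for each case-insensitive match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample text, already split into tokens
tokens = "I must cross this bridge . This is the castle".split()
hits = toy_concordance(tokens, "this")
for left, match, right in hits:
    print("{:>20} {} {}".format(left, match, right))
```

Each hit keeps the matched token as it appears in the text, so you can still see whether it was capitalized.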
Program Name: Lexical Text Dispersion
Program Purpose: The user wants to create a lexical dispersion plot, a chart that shows where chosen words occur across the length of a text.
import nltk # this should be done once at the start of every session
from nltk.book import * # loads the sample texts (text1 to text9) used below
text1.dispersion_plot(["call", "today", "freedom", "whale"]) # This will create a dispersion chart in a new window.
Output
Here is a screenshot of the output for this example:
![Lexical Dispersion](http://i.imgur.com/uQo5tob.jpg)
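The data behind a plot like this is just the position of each occurrence of a word in the token list. Here is a small sketch of how those positions could be gathered by hand; the function name `word_offsets` and the sample sentence are made up for illustration, and this is not how nltk computes its plot internally.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {t: [] for t in targets}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# Hypothetical sample text, already split into tokens
tokens = "call me today or call me never but call the whale".split()
offsets = word_offsets(tokens, ["call", "whale", "freedom"])
print(offsets)
# {'call': [0, 4, 8], 'whale': [10], 'freedom': []}
```

A word that never occurs, like "freedom" here, gets an empty list, which would show up as an empty row in the chart.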
Program Name: Text Count
Program Purpose: The user wants to know exactly how many times a word shows up in a text. Very simple.
import nltk # this should be done once at the start of every session
from nltk.book import * # loads the sample texts (text1 to text9) used below
text2.count("see") # Sees how many times "see" shows up in text 2.
Output
The output for this is very simple and straightforward.
173 # "See" shows up this many times.
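One thing to watch for: assuming the count matches tokens exactly, the way a plain Python list's `count` does, "See" and "see" are counted separately. The sketch below shows the difference on a small made-up token list.

```python
# Hypothetical sample tokens
tokens = ["See", "what", "I", "see", ",", "see", "?"]

# Exact match: capitalized "See" is not counted
exact = tokens.count("see")

# Case-insensitive match: lower-case every token before comparing
any_case = sum(1 for tok in tokens if tok.lower() == "see")

print(exact, any_case)  # 2 3
```

If you want a total that ignores capitalization, lower-case the tokens first and count those.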
Program Name: Top 50
Program Purpose: The user wants to generate a graph that shows a distribution of the top 50 words in the selected text.
Code and Output
First we want to establish a variable
import nltk # this should be done once at the start of every session
from nltk.book import * # loads the sample texts and makes FreqDist available
fdist1 = FreqDist(text2)
fdist1
This will give you some output
<FreqDist with 6833 samples and 141576 outcomes>
Continue with this:
vocabulary1 = [word for word, count in fdist1.most_common(50)] # in NLTK 3, keys() is no longer sorted by frequency
vocabulary1
This gives you the output of the top 50 words in a text
[',', 'to', '.', 'the', 'of', 'and', 'her', 'a', 'I', 'in', 'was', 'it', '"', ';', 'she', 'be', 'that', 'for', 'not', 'as', 'you', 'with', 'had', 'his', 'he', "'", 'have', 'at', 'by', 'is', '."', 's', 'Elinor', 'on', 'all', 'him', 'so', 'but', 'which', 'could', 'Marianne', 'my', 'Mrs', 'from', 'would', 'very', 'no', 'their', 'them', '--']
After that output is produced, continue with this:
fdist1['have']
This will produce some output as well. It tells you the exact number of times that the word “have” shows up in the text.
807
Now to the good part, the distribution graph. Input this:
fdist1.plot(50, cumulative=True)
You will get a beautiful graph that looks something like this:
![Distribution](http://i.imgur.com/JLGVhQq.jpg)
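In NLTK 3, FreqDist is built on Python's `collections.Counter`, so you can try the same steps on a tiny example without loading any corpus. The sketch below uses a plain `Counter` and a made-up sentence; the cumulative totals are the kind of running sum that a cumulative plot draws.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample text, already split into tokens
tokens = "the cat and the dog and the bird".split()

freq = Counter(tokens)
top = freq.most_common(2)       # the two most frequent words with their counts
have_count = freq["the"]        # look up a single word, like fdist1['have']

# Running total of counts, from most to least frequent word
counts = [count for _, count in freq.most_common()]
cumulative = list(accumulate(counts))

print(top)         # [('the', 3), ('and', 2)]
print(have_count)  # 3
print(cumulative)  # [3, 5, 6, 7, 8]
```

The last cumulative value equals the total number of tokens, which is why a cumulative curve always ends at the size of the text.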