 .. standard global imports

     >>> import nltk, re, pprint
+    >>> from nltk import word_tokenize

 .. TODO: more on regular expressions, including ()
 .. TODO: talk about fact that English lexicon is open set (e.g. malware = malicious software)
@@ -57,8 +58,9 @@ see how to dispense with markup.
 begin your interactive session or your program with the following import
 statements:

-    >>> from __future__ import division
+    >>> from __future__ import division  # Python 2 users only
     >>> import nltk, re, pprint
+    >>> from nltk import word_tokenize

 .. _sec-accessing-text:

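With the added import, ``word_tokenize`` can be called without the ``nltk.`` prefix throughout the rest of the chapter. A minimal sketch of the new import block in use (the sample sentence is just for illustration):

    >>> import nltk, re, pprint
    >>> from nltk import word_tokenize
    >>> word_tokenize("A sketch of the tokenizer in action.")
    ['A', 'sketch', 'of', 'the', 'tokenizer', 'in', 'action', '.']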
@@ -114,11 +116,11 @@ words and punctuation, as we saw in chap-introduction_. This step is
 called `tokenization`:dt:, and it produces our familiar structure, a list of words
 and punctuation.

-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> type(tokens)
     <class 'list'>
     >>> len(tokens)
-    255809
+    254354
     >>> tokens[:10]
     ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

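For context, ``raw`` in this hunk is the full Project Gutenberg text of *Crime and Punishment* downloaded earlier in the section. A sketch of how such a string is typically obtained (the URL is illustrative; any plain-text ebook works):

    >>> from urllib import request
    >>> url = "http://www.gutenberg.org/files/2554/2554-0.txt"   # example Gutenberg URL
    >>> raw = request.urlopen(url).read().decode('utf8')
    >>> tokens = word_tokenize(raw)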
@@ -132,7 +134,7 @@ like slicing:
     >>> text = nltk.Text(tokens)
     >>> type(text)
     <class 'nltk.text.Text'>
-    >>> text[1020:1060]
+    >>> text[1020:1062]
     ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
     'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
     'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
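Wrapping the tokens in ``nltk.Text`` restores the methods introduced in Chapter 1, such as concordancing and collocation finding; a brief sketch (the search term is arbitrary):

    >>> text = nltk.Text(tokens)
    >>> text.concordance('Petersburg')    # keyword-in-context display
    >>> text.collocations()               # frequently co-occurring word pairs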
@@ -197,7 +199,7 @@ available from ``http://www.crummy.com/software/BeautifulSoup/``:

     >>> from bs4 import BeautifulSoup
     >>> raw = BeautifulSoup(html).get_text()
-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> tokens
     ['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]

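The ``html`` string passed to ``BeautifulSoup`` above is the raw page source; a sketch of the fetch step that precedes it (the URL is shown as an example for the story tokenized above; passing an explicit parser such as ``'html.parser'`` avoids a warning from recent ``bs4`` releases):

    >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"   # example URL
    >>> html = request.urlopen(url).read().decode('utf8')       # encoding may differ by site
    >>> raw = BeautifulSoup(html, 'html.parser').get_text()
    >>> tokens = word_tokenize(raw)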
@@ -287,14 +289,14 @@ of a blog, as shown below:
     >>> content[:70]
     '<p>Today I was chatting with three of our visiting graduate students f'
     >>> raw = BeautifulSoup(content).get_text()
-    >>> nltk.word_tokenize(raw)
+    >>> word_tokenize(raw)
     ['Today', 'I', 'was', 'chatting', 'with', 'three', 'of', 'our', 'visiting',
     'graduate', 'students', 'from', 'the', 'PRC', '.', 'Thinking', 'that', 'I',
     'was', 'being', 'au', 'courant', ',', 'I', 'mentioned', 'the', 'expression',
     'DUI4XIANG4', '\u5c0d\u8c61', '("', 'boy', '/', 'girl', 'friend', '"', ...]

 ..
-    >>> nltk.word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))
+    >>> word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))

 With some further work, we can write programs to create a small corpus of blog posts,
 and use this as the basis for our |NLP| work.
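Here ``content`` and ``llog`` come from reading a blog feed with the third-party ``feedparser`` package; a sketch of the steps that produce them, using the Language Log feed discussed in this part of the chapter (the entry index is arbitrary):

    >>> import feedparser
    >>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
    >>> post = llog.entries[2]
    >>> content = post.content[0].value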
@@ -427,7 +429,7 @@ manipulate it just as we have done for other strings.

     >>> s = input("Enter some text: ")
     Enter some text: On an exceptionally hot evening early in July
-    >>> print("You typed", len(nltk.word_tokenize(s)), "words.")
+    >>> print("You typed", len(word_tokenize(s)), "words.")
     You typed 8 words.

 The NLP Pipeline
@@ -462,7 +464,7 @@ we are dealing with strings, Python's ``<str>`` data type
 When we tokenize a string we produce a list (of words), and this is Python's ``<list>``
 type. Normalizing and sorting lists produces other lists:

-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> type(tokens)
     <class 'list'>
     >>> words = [w.lower() for w in tokens]
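A short sketch completing the normalize-and-sort step this passage describes; note that ``sorted`` applied to a set hands back an ordinary list:

    >>> words = [w.lower() for w in tokens]    # normalization: still a list of strings
    >>> vocab = sorted(set(words))             # a set goes in, a list comes out
    >>> type(vocab)
    <class 'list'>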
@@ -1064,7 +1066,7 @@ the ``re`` module in the following section.)
 |NLTK| tokenizers allow Unicode strings as input, and
 correspondingly yield Unicode strings as output.

-    >>> nltk.word_tokenize(line) # doctest: +NORMALIZE_WHITESPACE
+    >>> word_tokenize(line) # doctest: +NORMALIZE_WHITESPACE
     ['niemców', 'pod', 'koniec', 'ii', 'wojny', 'światowej', 'na', 'dolny', 'śląsk', ',', 'zostały']

 Using your local encoding in Python
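The variable ``line`` above holds one line of Polish text that was read from a file with an explicit encoding. A minimal sketch, assuming a local file saved in Latin-2 (the filename is hypothetical):

    >>> f = open('polish-sample.txt', encoding='latin2')   # hypothetical Latin-2 encoded file
    >>> for line in f:
    ...     line = line.strip()
    ...     print(word_tokenize(line))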
@@ -1510,7 +1512,7 @@ on to define a function to perform stemming, and apply it to a whole text:
     >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
     ... is no basis for a system of government. Supreme executive power derives from
     ... a mandate from the masses, not from some farcical aquatic ceremony."""
-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> [stem(t) for t in tokens]
     ['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut',
     'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', 'Supreme',
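The ``stem`` function applied in this hunk is the regular-expression suffix stripper defined in the surrounding text; one definition consistent with the output shown (it deliberately over-strips, e.g. ``lying`` becomes ``ly``):

    >>> def stem(word):
    ...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    ...     stem, suffix = re.findall(regexp, word)[0]
    ...     return stem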
@@ -1624,7 +1626,7 @@ to define the data we will use in this section:
     >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
     ... is no basis for a system of government. Supreme executive power derives from
     ... a mandate from the masses, not from some farcical aquatic ceremony."""
-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)

 Stemmers
 --------
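The Stemmers subsection that opens here applies NLTK's off-the-shelf stemmers to these same ``tokens``; a brief sketch of the two stemmers the chapter goes on to compare:

    >>> porter = nltk.PorterStemmer()
    >>> lancaster = nltk.LancasterStemmer()
    >>> [porter.stem(t) for t in tokens]
    >>> [lancaster.stem(t) for t in tokens]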