
Commit 552dd89

py3 updates
1 parent df8f606 commit 552dd89

File tree

6 files changed: 34 additions & 30 deletions


book/Makefile

Lines changed: 4 additions & 4 deletions
@@ -1,8 +1,8 @@
 # Tutorial Makefile
 #
-# Copyright (C) 2001-2012 NLTK Project
-# Author: Steven Bird <[email protected]>
-# Edward Loper <edloper@gradient.cis.upenn.edu>
+# Copyright (C) 2001-2013 NLTK Project
+# Author: Steven Bird <[email protected]>
+# Edward Loper <edloper@gmail.com>
 # URL: <http://www.nltk.org/>
 # For license information, see LICENSE.TXT

@@ -31,7 +31,7 @@ PY := $(CHAPTERS:.rst=.py)

 BIBTEX_FILE = ../refs.bib

-PYTHON = python3
+PYTHON = python
 PDFLATEX = TEXINPUTS=".:..:ucs:" pdflatex -halt-on-error
 EXAMPLES = ../examples.py
 LATEXHACKS = ../latexhacks.py

book/ch03.rst

Lines changed: 14 additions & 12 deletions
@@ -5,6 +5,7 @@
 .. standard global imports

 >>> import nltk, re, pprint
+>>> from nltk import word_tokenize

 .. TODO: more on regular expressions, including ()
 .. TODO: talk about fact that English lexicon is open set (e.g. malware = malicious software)

@@ -57,8 +58,9 @@ see how to dispense with markup.
 begin your interactive session or your program with the following import
 statements:

->>> from __future__ import division
+>>> from __future__ import division # Python 2 users only
 >>> import nltk, re, pprint
+>>> from nltk import word_tokenize

 .. _sec-accessing-text:

@@ -114,11 +116,11 @@ words and punctuation, as we saw in chap-introduction_. This step is
 called `tokenization`:dt:, and it produces our familiar structure, a list of words
 and punctuation.

->>> tokens = nltk.word_tokenize(raw)
+>>> tokens = word_tokenize(raw)
 >>> type(tokens)
 <class 'list'>
 >>> len(tokens)
-255809
+254354
 >>> tokens[:10]
 ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

@@ -132,7 +134,7 @@ like slicing:
 >>> text = nltk.Text(tokens)
 >>> type(text)
 <class 'nltk.text.Text'>
->>> text[1020:1060]
+>>> text[1020:1062]
 ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
 'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',

@@ -197,7 +199,7 @@ available from ``http://www.crummy.com/software/BeautifulSoup/``:

 >>> from bs4 import BeautifulSoup
 >>> raw = BeautifulSoup(html).get_text()
->>> tokens = nltk.word_tokenize(raw)
+>>> tokens = word_tokenize(raw)
 >>> tokens
 ['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]

@@ -287,14 +289,14 @@ of a blog, as shown below:
 >>> content[:70]
 '<p>Today I was chatting with three of our visiting graduate students f'
 >>> raw = BeautifulSoup(content).get_text()
->>> nltk.word_tokenize(raw)
+>>> word_tokenize(raw)
 ['Today', 'I', 'was', 'chatting', 'with', 'three', 'of', 'our', 'visiting',
 'graduate', 'students', 'from', 'the', 'PRC', '.', 'Thinking', 'that', 'I',
 'was', 'being', 'au', 'courant', ',', 'I', 'mentioned', 'the', 'expression',
 'DUI4XIANG4', '\u5c0d\u8c61', '("', 'boy', '/', 'girl', 'friend', '"', ...]

 ..
->>> nltk.word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))
+>>> word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))

 With some further work, we can write programs to create a small corpus of blog posts,
 and use this as the basis for our |NLP| work.

@@ -427,7 +429,7 @@ manipulate it just as we have done for other strings.

 >>> s = input("Enter some text: ")
 Enter some text: On an exceptionally hot evening early in July
->>> print("You typed", len(nltk.word_tokenize(s)), "words.")
+>>> print("You typed", len(word_tokenize(s)), "words.")
 You typed 8 words.

 The NLP Pipeline

@@ -462,7 +464,7 @@ we are dealing with strings, Python's ``<str>`` data type
 When we tokenize a string we produce a list (of words), and this is Python's ``<list>``
 type. Normalizing and sorting lists produces other lists:

->>> tokens = nltk.word_tokenize(raw)
+>>> tokens = word_tokenize(raw)
 >>> type(tokens)
 <class 'list'>
 >>> words = [w.lower() for w in tokens]

@@ -1064,7 +1066,7 @@ the ``re`` module in the following section.)
 |NLTK| tokenizers allow Unicode strings as input, and
 correspondingly yield Unicode strings as output.

->>> nltk.word_tokenize(line) # doctest: +NORMALIZE_WHITESPACE
+>>> word_tokenize(line) # doctest: +NORMALIZE_WHITESPACE
 ['niemców', 'pod', 'koniec', 'ii', 'wojny', 'światowej', 'na', 'dolny', 'śląsk', ',', 'zostały']

 Using your local encoding in Python

@@ -1510,7 +1512,7 @@ on to define a function to perform stemming, and apply it to a whole text:
 >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
 ... is no basis for a system of government. Supreme executive power derives from
 ... a mandate from the masses, not from some farcical aquatic ceremony."""
->>> tokens = nltk.word_tokenize(raw)
+>>> tokens = word_tokenize(raw)
 >>> [stem(t) for t in tokens]
 ['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut',
 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', 'Supreme',

@@ -1624,7 +1626,7 @@ to define the data we will use in this section:
 >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
 ... is no basis for a system of government. Supreme executive power derives from
 ... a mandate from the masses, not from some farcical aquatic ceremony."""
->>> tokens = nltk.word_tokenize(raw)
+>>> tokens = word_tokenize(raw)

 Stemmers
 --------
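
The recurring ch03 edit replaces each qualified ``nltk.word_tokenize(...)`` call with a bare ``word_tokenize(...)`` after the added ``from nltk import word_tokenize``. The following is a minimal stand-alone sketch of that usage (not part of the commit), assuming NLTK and its ``punkt`` tokenizer data are installed; the sample sentence is illustrative only.

import nltk
from nltk import word_tokenize

# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer models

raw = "On an exceptionally hot evening early in July a young man came out of the garret."
tokens = word_tokenize(raw)       # equivalent to nltk.word_tokenize(raw)
print(type(tokens), len(tokens))  # a plain Python list of word and punctuation strings
print(tokens[:8])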

book/ch04.rst

Lines changed: 10 additions & 9 deletions
@@ -5,6 +5,7 @@
 .. standard global imports

 >>> import nltk, re, pprint
+>>> from nltk import word_tokenize

 .. TODO: recipes for flattening a list of lists into a list, and for the reverse grouping a list into a list of lists
 .. TODO: discuss duck typing

@@ -404,7 +405,7 @@ Some other objects, such as a ``FreqDist``, can be converted into a sequence (us
 and support iteration, e.g.

 >>> raw = 'Red lorry, yellow lorry, red lorry, yellow lorry.'
->>> text = nltk.word_tokenize(raw)
+>>> text = word_tokenize(raw)
 >>> fdist = nltk.FreqDist(text)
 >>> list(fdist)
 ['lorry', ',', 'yellow', '.', 'Red', 'red']

@@ -558,16 +559,16 @@ processing of texts. Here's an example where we tokenize and normalize a text:

 >>> text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
 ... "it means just what I choose it to mean - neither more nor less."'''
->>> [w.lower() for w in nltk.word_tokenize(text)]
+>>> [w.lower() for w in word_tokenize(text)]
 ['``', 'when', 'i', 'use', 'a', 'word', ',', "''", 'humpty', 'dumpty', 'said', ...]

 Suppose we now want to process these words further. We can do this by inserting the above
 expression inside a call to some other function max-comprehension_,
 but Python allows us to omit the brackets max-generator_.

->>> max([w.lower() for w in nltk.word_tokenize(text)]) # [_max-comprehension]
+>>> max([w.lower() for w in word_tokenize(text)]) # [_max-comprehension]
 'word'
->>> max(w.lower() for w in nltk.word_tokenize(text)) # [_max-generator]
+>>> max(w.lower() for w in word_tokenize(text)) # [_max-generator]
 'word'

 The second line uses a `generator expression`:dt:. This is more than a notational convenience:

@@ -1143,7 +1144,7 @@ passed in as a parameter, and it also prints a list of the

 def freq_words(url, freqdist, n):
     text = nltk.clean_url(url)
-    for word in nltk.word_tokenize(text):
+    for word in word_tokenize(text):
         freqdist.inc(word.lower())
     print(freqdist.keys()[:n])

@@ -1168,7 +1169,7 @@ and simplify its interface by providing a single ``url`` parameter.

 def freq_words(url):
     text = nltk.clean_url(url)
-    freqdist = nltk.FreqDist(word.lower() for word in nltk.word_tokenize(text))
+    freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text))
     return freqdist

 >>> fd = freq_words(constitution)

@@ -1179,7 +1180,7 @@ and simplify its interface by providing a single ``url`` parameter.
 Note that we have now simplified the work of ``freq_words``
 to the point that we can do its work with three lines of code:

->>> words = nltk.word_tokenize(nltk.clean_url(constitution))
+>>> words = word_tokenize(nltk.clean_url(constitution))
 >>> fd = nltk.FreqDist(word.lower() for word in words)
 >>> fd.keys()[:20]
 ['the', 'of', 'charters', 'bill', 'constitution', 'rights', ',',

@@ -1484,7 +1485,7 @@ definition, along with three equivalent ways to call the function:

 >>> def freq_words(file, min=1, num=10):
 ...     text = open(file).read()
-...     tokens = nltk.word_tokenize(text)
+...     tokens = word_tokenize(text)
 ...     freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
 ...     return freqdist.most_common(num)
 >>> fw = freq_words('ch01.rst', 4, 10)

@@ -1503,7 +1504,7 @@ progress if a ``verbose`` flag is set:
 ...     if verbose: print("Opening", file)
 ...     text = open(file).read()
 ...     if verbose: print("Read in %d characters" % len(file))
-...     for word in nltk.word_tokenize(text):
+...     for word in word_tokenize(text):
 ...         if len(word) >= min:
 ...             freqdist.inc(word)
 ...             if verbose and freqdist.N() % 100 == 0: print(".")
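
The ch04 changes apply the same substitution inside the ``freq_words`` examples. Below is a self-contained sketch (not part of the commit) of that tokenize-lowercase-count pipeline, using a literal string in place of ``nltk.clean_url`` so it also runs on NLTK 3; the text and the ``min``/``num`` values are illustrative.

import nltk
from nltk import word_tokenize

def freq_words(text, min=1, num=10):
    # tokenize, keep tokens of at least `min` characters, and count lowercase forms
    tokens = word_tokenize(text)
    freqdist = nltk.FreqDist(t.lower() for t in tokens if len(t) >= min)
    return freqdist.most_common(num)

raw = "Red lorry, yellow lorry, red lorry, yellow lorry."
print(freq_words(raw, min=3, num=5))  # e.g. [('lorry', 4), ('red', 2), ('yellow', 2)]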

book/ch05.rst

Lines changed: 4 additions & 3 deletions
@@ -4,6 +4,7 @@
 .. standard global imports

 >>> import nltk, re, pprint
+>>> from nltk import word_tokenize

 .. TODO: exercise on cascaded tagging
 .. TODO: motivate trigram tagging by showing some cases where bigram tagging doesn't work

@@ -61,7 +62,7 @@ Using a Tagger
 A part-of-speech tagger, or `POS-tagger`:dt:, processes a sequence of words, and attaches a
 part of speech tag to each word (don't forget to ``import nltk``):

->>> text = nltk.word_tokenize("And now for something completely different")
+>>> text = word_tokenize("And now for something completely different")
 >>> nltk.pos_tag(text)
 [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
 ('completely', 'RB'), ('different', 'JJ')]

@@ -82,7 +83,7 @@ Here we see that `and`:lx: is ``CC``, a coordinating conjunction;

 Let's look at another example, this time including some homonyms:

->>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
+>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
 >>> nltk.pos_tag(text)
 [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

@@ -1138,7 +1139,7 @@ most likely (now using the unsimplified tagset):
 Now we can create a tagger that tags everything as ``NN``.

 >>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
->>> tokens = nltk.word_tokenize(raw)
+>>> tokens = word_tokenize(raw)
 >>> default_tagger = nltk.DefaultTagger('NN')
 >>> default_tagger.tag(tokens)
 [('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
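
The ch05 snippets pair the directly imported tokenizer with the taggers. A minimal sketch of that combination (not part of the commit), assuming the ``punkt`` tokenizer models and a POS tagger model (e.g. ``averaged_perceptron_tagger``; the exact data package name varies across NLTK releases) have been downloaded:

import nltk
from nltk import word_tokenize

text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))                  # (word, tag) pairs, e.g. ('And', 'CC')

default_tagger = nltk.DefaultTagger('NN')  # tags every token as NN
print(default_tagger.tag(text))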

definitions.rst

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 .. def:: definitions

 .. |version| replace:: 3.0
-.. |copyrightinfo| replace:: 2001-2012 the authors
+.. |copyrightinfo| replace:: 2001-2013 the authors
 .. |license| replace:: Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License

 .. CAP abbreviations (map to small caps in LaTeX)

doctest_split.py

Lines changed: 1 addition & 1 deletion
@@ -15,8 +15,8 @@

 # include this at the top of each output file
 HDR = """
->>> from __future__ import division
 >>> import nltk, re, pprint
+>>> from nltk import word_tokenize
 """

 for filename in sys.argv[1:]:
