 .. standard global imports

     >>> import nltk, re, pprint
+    >>> from nltk import word_tokenize

 .. TODO: more on regular expressions, including ()
 .. TODO: talk about fact that English lexicon is open set (e.g. malware = malicious software)
@@ -57,8 +58,9 @@ see how to dispense with markup.
 begin your interactive session or your program with the following import
 statements:

-    >>> from __future__ import division
+    >>> from __future__ import division  # Python 2 users only
     >>> import nltk, re, pprint
+    >>> from nltk import word_tokenize

 .. _sec-accessing-text:

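With the added import, ``word_tokenize`` can be called without the ``nltk.`` prefix throughout the rest of the chapter. A minimal sketch of the new import block in use (the sample sentence is just for illustration):

    >>> import nltk, re, pprint
    >>> from nltk import word_tokenize
    >>> word_tokenize("A sketch of the tokenizer in action.")
    ['A', 'sketch', 'of', 'the', 'tokenizer', 'in', 'action', '.']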
@@ -114,11 +116,11 @@ words and punctuation, as we saw in chap-introduction_. This step is
 called `tokenization`:dt:, and it produces our familiar structure, a list of words
 and punctuation.

-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> type(tokens)
     <class 'list'>
     >>> len(tokens)
-    255809
+    254354
     >>> tokens[:10]
     ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

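For context, ``raw`` in this hunk is the full Project Gutenberg text of *Crime and Punishment* downloaded earlier in the section. A sketch of how such a string is typically obtained (the URL is illustrative; any plain-text ebook works):

    >>> from urllib import request
    >>> url = "http://www.gutenberg.org/files/2554/2554-0.txt"   # example Gutenberg URL
    >>> raw = request.urlopen(url).read().decode('utf8')
    >>> tokens = word_tokenize(raw)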
@@ -132,7 +134,7 @@ like slicing:
     >>> text = nltk.Text(tokens)
     >>> type(text)
     <class 'nltk.text.Text'>
-    >>> text[1020:1060]
+    >>> text[1020:1062]
     ['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
     'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
     'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
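Wrapping the tokens in ``nltk.Text`` restores the methods introduced in Chapter 1, such as concordancing and collocation finding; a brief sketch (the search term is arbitrary):

    >>> text = nltk.Text(tokens)
    >>> text.concordance('Petersburg')    # keyword-in-context display
    >>> text.collocations()               # frequently co-occurring word pairs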
@@ -197,7 +199,7 @@ available from ``http://www.crummy.com/software/BeautifulSoup/``:

     >>> from bs4 import BeautifulSoup
     >>> raw = BeautifulSoup(html).get_text()
-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> tokens
     ['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]

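The ``html`` string passed to ``BeautifulSoup`` above is the raw page source; a sketch of the fetch step that precedes it (the URL is shown as an example for the story tokenized above; passing an explicit parser such as ``'html.parser'`` avoids a warning from recent ``bs4`` releases):

    >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"   # example URL
    >>> html = request.urlopen(url).read().decode('utf8')       # encoding may differ by site
    >>> raw = BeautifulSoup(html, 'html.parser').get_text()
    >>> tokens = word_tokenize(raw)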
@@ -287,14 +289,14 @@ of a blog, as shown below:
     >>> content[:70]
     '<p>Today I was chatting with three of our visiting graduate students f'
     >>> raw = BeautifulSoup(content).get_text()
-    >>> nltk.word_tokenize(raw)
+    >>> word_tokenize(raw)
     ['Today', 'I', 'was', 'chatting', 'with', 'three', 'of', 'our', 'visiting',
     'graduate', 'students', 'from', 'the', 'PRC', '.', 'Thinking', 'that', 'I',
     'was', 'being', 'au', 'courant', ',', 'I', 'mentioned', 'the', 'expression',
     'DUI4XIANG4', '\u5c0d\u8c61', '("', 'boy', '/', 'girl', 'friend', '"', ...]

 ..
-    >>> nltk.word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))
+    >>> word_tokenize(nltk.clean_html(llog.entries[2].content[0].value))

 With some further work, we can write programs to create a small corpus of blog posts,
 and use this as the basis for our |NLP| work.
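Here ``content`` and ``llog`` come from reading a blog feed with the third-party ``feedparser`` package; a sketch of the steps that produce them, using the Language Log feed discussed in this part of the chapter (the entry index is arbitrary):

    >>> import feedparser
    >>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
    >>> post = llog.entries[2]
    >>> content = post.content[0].value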
@@ -427,7 +429,7 @@ manipulate it just as we have done for other strings.

     >>> s = input("Enter some text: ")
     Enter some text: On an exceptionally hot evening early in July
-    >>> print("You typed", len(nltk.word_tokenize(s)), "words.")
+    >>> print("You typed", len(word_tokenize(s)), "words.")
     You typed 8 words.

 The NLP Pipeline
@@ -462,7 +464,7 @@ we are dealing with strings, Python's ``<str>`` data type
 When we tokenize a string we produce a list (of words), and this is Python's ``<list>``
 type. Normalizing and sorting lists produces other lists:

-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> type(tokens)
     <class 'list'>
     >>> words = [w.lower() for w in tokens]
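A short sketch completing the normalize-and-sort step this passage describes; note that ``sorted`` applied to a set hands back an ordinary list:

    >>> words = [w.lower() for w in tokens]    # normalization: still a list of strings
    >>> vocab = sorted(set(words))             # a set goes in, a list comes out
    >>> type(vocab)
    <class 'list'>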
@@ -1064,7 +1066,7 @@ the ``re`` module in the following section.)
 |NLTK| tokenizers allow Unicode strings as input, and
 correspondingly yield Unicode strings as output.

-    >>> nltk.word_tokenize(line) # doctest: +NORMALIZE_WHITESPACE
+    >>> word_tokenize(line) # doctest: +NORMALIZE_WHITESPACE
     ['niemców', 'pod', 'koniec', 'ii', 'wojny', 'światowej', 'na', 'dolny', 'śląsk', ',', 'zostały']

 Using your local encoding in Python
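The variable ``line`` above holds one line of Polish text that was read from a file with an explicit encoding. A minimal sketch, assuming a local file saved in Latin-2 (the filename is hypothetical):

    >>> f = open('polish-sample.txt', encoding='latin2')   # hypothetical Latin-2 encoded file
    >>> for line in f:
    ...     line = line.strip()
    ...     print(word_tokenize(line))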
@@ -1510,7 +1512,7 @@ on to define a function to perform stemming, and apply it to a whole text:
     >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
     ... is no basis for a system of government. Supreme executive power derives from
     ... a mandate from the masses, not from some farcical aquatic ceremony."""
-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)
     >>> [stem(t) for t in tokens]
     ['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut',
     'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', 'Supreme',
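The ``stem`` function applied in this hunk is the regular-expression suffix stripper defined in the surrounding text; one definition consistent with the output shown (it deliberately over-strips, e.g. ``lying`` becomes ``ly``):

    >>> def stem(word):
    ...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    ...     stem, suffix = re.findall(regexp, word)[0]
    ...     return stem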
@@ -1624,7 +1626,7 @@ to define the data we will use in this section:
     >>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
     ... is no basis for a system of government. Supreme executive power derives from
     ... a mandate from the masses, not from some farcical aquatic ceremony."""
-    >>> tokens = nltk.word_tokenize(raw)
+    >>> tokens = word_tokenize(raw)

 Stemmers
 --------
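The Stemmers subsection that opens here applies NLTK's off-the-shelf stemmers to these same ``tokens``; a brief sketch of the two stemmers the chapter goes on to compare:

    >>> porter = nltk.PorterStemmer()
    >>> lancaster = nltk.LancasterStemmer()
    >>> [porter.stem(t) for t in tokens]
    >>> [lancaster.stem(t) for t in tokens]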