Update README and help message to be more useful.

cbh66 · cbh66 · commit 05b87dd53b9c · 2016-05-16T11:46:25.000-04:00
diff --git a/README.md b/README.md
@@ -1 +1,58 @@
 # ngrams
+
+## Overview
+
+ngrams is a tool for language analysis.  In particular, it classifies documents
+whose language is unknown.
+
+## How to Use
+
+The accuracy of the program's analysis depends on its training data.  When
+running the program, a source file should be included (or typed in when the
+program begins) specifying known languages and example files.  Such a file
+should have the form:
+
+    English english1.txt english2.txt
+    French  french/
+    English english3.txt
+    Unknown tbd.txt
+
+Each line has a language name followed by the names of one or more files
+written in that language, or the name of a directory, all of whose files are
+written in that language.  If the language name is the 'Unknown' keyword, the
+listed documents will be classified.  (A different keyword can be used if the
+program is invoked with the -u option.)
+
+### Dealing with large training sets
+
+While prediction accuracy is enhanced by large sets of training data, such sets
+can both slow the program down and be difficult to manage.
+
+For speed, ngrams offers a caching option, which creates a cache file for each
+language.  This way, that language's training data only needs to be read the
+first time the program is run; subsequent runs will be much faster.  If the
+cache for a language is out of date (i.e. a file has been added, deleted, or
+modified), the program will automatically recalculate that language's
+statistics and update the cache.
+
+Note that this magical cache business has not actually been implemented yet.
+
+To better manage large training sets, it is suggested that you put training
+documents in directories named after their language.  The source file then
+only needs to specify the directory, and the program will automatically see
+any new or renamed files.
+
+### Other Options
+
+Eventually, I'll try to get a character prediction system in place.  It can
+either try to finish a sentence or generate a random string of characters
+that should look like it's from the language.  Fun!
+
+## Author
+
+Colin Hamilton, Tufts University
+
+## Acknowledgements
+
+The idea for this program came from the final project for Tufts COMP 11, Fall
+2015.  Thanks to Chris Gregg, Bruce Molay, and Ben Hescott.
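For reference, the source-file format described in the README hunk above could be read roughly as follows. This is only a minimal sketch and not the code in this repository: `parse_source` and its `unknown_keyword` parameter are hypothetical names, and splitting on whitespace assumes that file and directory names contain no spaces.

```python
import io
from collections import defaultdict


def parse_source(lines, unknown_keyword="Unknown"):
    """Split source-file lines into known-language paths and paths to classify.

    Hypothetical helper for illustration; not part of the ngrams code base.
    """
    known = defaultdict(list)   # language name -> list of file/directory paths
    unknown = []                # paths whose language should be classified
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue            # skip blank lines and lines that list no paths
        language, paths = fields[0], fields[1:]
        if language == unknown_keyword:
            unknown.extend(paths)
        else:
            # A language may appear on several lines; its paths accumulate.
            known[language].extend(paths)
    return dict(known), unknown


# The example source file from the README, supplied as an in-memory file.
example = io.StringIO(
    "English english1.txt english2.txt\n"
    "French  french/\n"
    "English english3.txt\n"
    "Unknown tbd.txt\n"
)
known, unknown = parse_source(example)
print(known)
print(unknown)
```

Both `English` lines end up under a single key, which matches the README's example of listing a language more than once. Directory entries such as `french/` are kept as plain paths here; expanding them into their files would be a separate step.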
diff --git a/main.py b/main.py
@@ -5,8 +5,6 @@
     Inspired by the Tufts University COMP 11 Final project "trigrams", Fall 2015
 
     TODO:
-    - Remove details from description (a help message should be concise);
-        add details to README
     - Allow wildcards and such in filenames.  See expanduser(), expandvars(), glob
     - Add caching for speedup (either single files, by language, or by directory)
         - Need to figure out when to use cache, when to update
@@ -15,9 +13,13 @@
             - Gotta make sure to minimize risk of accidental duplication of data,
                 ie having a file's data in cache and then reading it again in
                 addition to that
+            - Could keep filenames/last modified date to see if it's up to date
             - Would need a hard refresh option, probably one for individual
                 languages, and one to refresh all languages
     - Add option for directory traversal
+    - Implement prediction -- language name, optional "seed", num letters to
+        predict, choose randomized or max likelihood (both using randomization
+        for tiebreakers)
 """
 import sys
 import os
@@ -26,18 +28,7 @@
 import language_match
 
 
-# Make more concise, add details to README
-DESCRIPTION = ("Compares documents written in unknown languages to known languages.  " +
-"An input file must be provided with known languages, with lines of the form" +
-"""
-    English english1.txt english2.txt
-    French  french/
-    English english3.txt
-    Unknown tbd.txt
-""" +
-"Where each line has a language name and the name of 1 or more files written in that language, " +
-"or the name of a directory containing files in that language.  " +
-"If the language name is the 'Unknown' keyword, the language will be classified.")
+DESCRIPTION = ("Compares documents written in unknown languages to known languages.")
 
 
 parser = argparse.ArgumentParser(description=DESCRIPTION,
@@ -54,7 +45,7 @@
 parser.add_argument("--unknown", "-u",
                     help="the keyword designating unknown languages in input file " +
                         "(default '%(default)s')")
-parser.add_argument("--data", "-d", default=None,
+parser.add_argument("--data", "-d", default=None,  # Just a flag?  Make hidden files for langs?
                     help="file to use as cache for languages") #read and write to
 
 parser.set_defaults(n_gram_max=3,