Commit 05b87dd

Update README and help message to be more useful.
1 parent 6606454 commit 05b87dd

2 files changed: +63 -15 lines changed

README.md

Lines changed: 57 additions & 0 deletions
@@ -1 +1,58 @@
 # ngrams
+
+## Overview
+
+ngrams is for language analysis. In particular it is for classifying documents
+whose language is unknown.
+
+## How to Use
+
+The accuracy of the program's analysis depends on its training data. When
+running the program, a source file should be included (or typed in when the
+program begins) specifying known languages and example files. Such a file
+should have the form:
+
+    English english1.txt english2.txt
+    French french/
+    English english3.txt
+    Unknown tbd.txt
+
+Where each line has a language name and the name of 1 or more files written in
+that language, or the name of a directory, all of whose files are written in
+that language. If the language name is the 'Unknown' keyword, the document
+will be classified. (A different keyword can be used if the program is invoked
+with the -u option).
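
The source-file format described above could be parsed roughly as follows. This is a minimal sketch, not the code in main.py: the helper name `parse_source_lines`, its `unknown_keyword` parameter, and its return shape are assumptions made for illustration, and it assumes paths contain no spaces.

```python
from collections import defaultdict


def parse_source_lines(lines, unknown_keyword="Unknown"):
    """Split source-file lines into training paths and paths to classify.

    Hypothetical helper: the name, signature, and return shape are
    assumptions for illustration, not taken from main.py.
    """
    known = defaultdict(list)   # language name -> list of file/directory paths
    unknown = []                # paths whose language should be classified
    for line in lines:
        fields = line.split()
        if not fields:
            continue            # skip blank lines
        language, paths = fields[0], fields[1:]
        if language == unknown_keyword:
            unknown.extend(paths)
        else:
            known[language].extend(paths)
    return dict(known), unknown


example = """\
English english1.txt english2.txt
French french/
English english3.txt
Unknown tbd.txt
"""
known, unknown = parse_source_lines(example.splitlines())
print(known)
print(unknown)
```

Repeated language names are merged, so the two `English` lines in the example contribute to a single list of training paths.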
+
+### Dealing with large training sets
+
+While prediction accuracy is enhanced by large sets of training data, this
+can both slow the program down and be difficult to manage.
+
+For speed, ngrams offers a caching option, which creates a cache file for each
+language. This way, that language's training data only needs to be read the
+first time the program is run; subsequent runs will be much faster. If the
+cache for a language is out of date (i.e. a file has been added, deleted, or
+modified), the program will automatically recalculate that language's
+statistics and update the cache.
+
+Note that this magical cache business has not actually been implemented yet.
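
Since the cache is described but not yet implemented, the staleness check could work along the lines of the sketch below: record each training file's name and last-modified time, and treat any difference as "out of date". The helper names `snapshot` and `cache_is_stale` are hypothetical, and comparing modification times is an assumption about how the check might be done (it can miss edits made within the filesystem's timestamp resolution).

```python
import os
import tempfile


def snapshot(paths):
    """Map each path to its last-modified time (hypothetical helper)."""
    return {path: os.path.getmtime(path) for path in paths}


def cache_is_stale(cached_snapshot, current_paths):
    """Return True if files were added, deleted, or modified since caching."""
    return snapshot(current_paths) != cached_snapshot


# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "english1.txt")
    with open(first, "w") as f:
        f.write("hello world")
    cached = snapshot([first])
    unchanged = cache_is_stale(cached, [first])

    # Adding a training file changes the set of paths, so the cache is stale.
    second = os.path.join(tmp, "english2.txt")
    with open(second, "w") as f:
        f.write("a new training file")
    after_add = cache_is_stale(cached, [first, second])

print(unchanged, after_add)
```

A dictionary comparison covers all three cases the README names: an added or deleted file changes the keys, and a modified file changes a value.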
+
+To better manage large training sets, it is suggested that you put training
+documents in directories whose name is their language. The source file then
+needs only specify the directory, and the program will automatically see
+any new or renamed files.
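
One way a directory entry could be turned into a list of training files is sketched below. The helper `expand_paths` is hypothetical, and listing only the files directly inside the directory (no recursion) is an assumption; the main.py TODO list mentions directory traversal as a separate option.

```python
import os
import tempfile


def expand_paths(paths):
    """Replace each directory with the regular files directly inside it.

    Hypothetical helper; non-recursive by assumption.
    """
    files = []
    for path in paths:
        if os.path.isdir(path):
            for name in sorted(os.listdir(path)):
                full = os.path.join(path, name)
                if os.path.isfile(full):
                    files.append(full)
        else:
            files.append(path)  # plain file names are passed through as-is
    return files


# Demonstration: a "french" directory plus one ordinary file name.
with tempfile.TemporaryDirectory() as tmp:
    french = os.path.join(tmp, "french")
    os.mkdir(french)
    for name in ("b.txt", "a.txt"):
        with open(os.path.join(french, name), "w") as f:
            f.write("bonjour")
    expanded = [os.path.basename(p) for p in expand_paths([french, "english1.txt"])]

print(expanded)
```

Because the directory is listed on every run, new or renamed files are picked up without editing the source file.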
+
+### Other Options
+
+Eventually, I'll try to get a character prediction system in place. It can
+either try to finish a sentence or generate a random string of characters
+that should look like it's from the language. Fun!
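
As a rough illustration of what such a prediction system might look like, the sketch below counts which character follows each two-character context and then extends a seed string. Everything here is an assumption: the function names `build_model` and `predict`, their parameters, and the choice of a simple character n-gram model with n >= 2. The two modes (weighted random sampling and most-likely follower with random tie-breaking) follow the TODO note in main.py.

```python
import random
from collections import Counter, defaultdict


def build_model(text, n=3):
    """Count which character follows each (n-1)-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model


def predict(model, seed, num_chars, n=3, randomized=True, rng=random):
    """Extend seed by up to num_chars characters (assumes n >= 2).

    randomized=True samples in proportion to counts; False takes the most
    frequent follower, breaking ties at random.
    """
    out = seed
    for _ in range(num_chars):
        followers = model.get(out[-(n - 1):])
        if not followers:
            break               # unseen context: stop early
        if randomized:
            chars, weights = zip(*followers.items())
            out += rng.choices(chars, weights=weights)[0]
        else:
            best = max(followers.values())
            out += rng.choice([c for c, k in followers.items() if k == best])
    return out


model = build_model("the theme of the thesis is the theory", n=3)
print(predict(model, "th", 10, randomized=False))
print(predict(model, "th", 10, rng=random.Random(0)))
```

Training on a real language's documents instead of one short sentence would give output that looks more like that language.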
+
+## Author
+
+Colin Hamilton, Tufts University
+
+## Acknowledgements
+
+The idea for this program came from the final project for Tufts COMP 11, Fall
+2015. Thanks to Chris Gregg, Bruce Molay, and Ben Hescott.

main.py

Lines changed: 6 additions & 15 deletions
@@ -5,8 +5,6 @@
 Inspired by the Tufts University COMP 11 Final project "trigrams", Fall 2015
 
 TODO:
-    - Remove details from description (a help message should be concise);
-      add details to README
     - Allow wildcards and such in filenames. See expanduser(), expandvars(), glob
     - Add caching for speedup (either single files, by language, or by directory)
         - Need to figure out when to use cache, when to update
@@ -15,9 +13,13 @@
         - Gotta make sure to minimize risk of accidental duplication of data,
           ie having a file's data in cache and then reading it again in
           addition to that
+        - Could keep filenames/last modified date to see if it's up to date
         - Would need a hard refresh option, probably one for individual
           languages, and one to refresh all languages
     - Add option for directory traversal
+    - Implement prediction -- language name, optional "seed", num letters to
+      predict, choose randomized or max likelihood (both using randomization
+      for tiebreakers)
 """
 import sys
 import os
@@ -26,18 +28,7 @@
 import language_match
 
 
-# Make more concise, add details to README
-DESCRIPTION = ("Compares documents written in unknown languages to known languages. " +
-               "An input file must be provided with known languages, with lines of the form" +
-               """
-English english1.txt english2.txt
-French french/
-English english3.txt
-Unknown tbd.txt
-""" +
-               "Where each line has a language name and the name of 1 or more files written in that language, " +
-               "or the name of a directory containing files in that language. " +
-               "If the language name is the 'Unknown' keyword, the language will be classified.")
+DESCRIPTION = ("Compares documents written in unknown languages to known languages.")
 
 
 parser = argparse.ArgumentParser(description=DESCRIPTION,
@@ -54,7 +45,7 @@
 parser.add_argument("--unknown", "-u",
                     help="the keyword designating unknown languages in input file " +
                          "(default '%(default)s')")
-parser.add_argument("--data", "-d", default=None,
+parser.add_argument("--data", "-d", default=None, # Just a flag? Make hidden files for langs?
                     help="file to use as cache for languages") #read and write to
 
 parser.set_defaults(n_gram_max=3,
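
The TODO item in main.py about allowing wildcards in filenames names `expanduser()`, `expandvars()`, and `glob`. A minimal sketch of how those three standard-library pieces could be combined is shown below. The helper `expand_pattern` and the environment variable `NGRAMS_DEMO_DIR` are hypothetical and exist only for this example.

```python
import glob
import os
import tempfile


def expand_pattern(pattern):
    """Expand ~, environment variables, and shell-style wildcards.

    Hypothetical helper. Returns the sorted matches, or the expanded
    pattern itself when nothing matches so the caller can report it.
    """
    expanded = os.path.expandvars(os.path.expanduser(pattern))
    matches = sorted(glob.glob(expanded))
    return matches or [expanded]


# Demonstration with a temporary directory exposed through an
# environment variable (the variable name is made up for this example).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("english1.txt", "english2.txt", "notes.md"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("sample")
    os.environ["NGRAMS_DEMO_DIR"] = tmp
    found = [os.path.basename(p)
             for p in expand_pattern("$NGRAMS_DEMO_DIR/english*.txt")]

print(found)
```

Returning the unexpanded name when nothing matches keeps a mistyped filename visible in later error messages instead of silently dropping it.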
