2015_nyu_class/lecture1.txt

What Actually Works
- Trigrams and beyond
unigrams and bigrams are useless
4-5 grams useful in MT but not as much for speech
- Discounting
Absolute discounting, Good-Turing, held-out estimation, Witten-Bell
- Context counting
Kneser-Ney construction



Language Models (LM)
1. NGram
2. PCFG
3.





Measuring LM quality
http://nlpers.blogspot.com/2014/05/perplexity-versus-error-rate-for.html
1. Perplexity
    Interpretation: the average branching factor of the model.
    Perplexity is the exponentiated cross entropy between the empirical distribution (what actually appears in the test data) and the model's predicted distribution: average the negative log probability per word, throwing out unseen words, then exponentiate. (See the sketch after this list.)
    More of an intrinsic, generative-style measure.
2. Word Error Rate (prediction error)
3. External metric - custom to whatever task you are doing

Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible
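
A minimal sketch of the perplexity computation described above, assuming a hypothetical model_prob(word, history) that returns the model's conditional probability for a word given its history:

```python
import math

def perplexity(test_sentences, model_prob):
    """Perplexity = exp of the average negative log-probability per word.

    model_prob(word, history) is assumed to return P(word | history) under
    the language model being evaluated; zero-probability (unseen) words are
    skipped here, mirroring the note above.
    """
    log_prob_sum = 0.0
    n_words = 0
    for sentence in test_sentences:
        history = []
        for word in sentence:
            p = model_prob(word, tuple(history))
            if p > 0:                      # throw out unseen words
                log_prob_sum += math.log(p)
                n_words += 1
            history.append(word)
    return math.exp(-log_prob_sum / n_words)
```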


Sparsity
- haven't seen most words/ngrams before.

Parameter Estimation
- Maximum likelihood estimates won’t get us very far. Need to smooth these estimates.


Smoothing
- counting/pseudo-counting (Laplace, hierarchical counting, Dirichlet)
Problem: works quite poorly!
- Linear Interpolation (Chen and Goodman)
Works better than Dirichlet priors; not entirely clear why.
- Good-Turing reweighting
- Kneser-Ney - more successful
http://www.foldl.me/2014/kneser-ney-smoothing/
http://www.aclweb.org/website/old_anthology/P/P06/P06-1124.pdf
Idea 1: observed n-grams occur more often in training data than they will in future data:
    Absolute Discounting:
    + Save ourselves some time and just subtract 0.75 (or some discount d) from every observed count
    + Maybe have a separate value of d for very low counts
Idea 2: type-based fertility rather than token counts
    - how many distinct words precede this word in the corpus (i.e. how much probability the word is allowed in a novel context). See the sketch below.
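
A rough bigram-level sketch of both ideas, assuming tokenized training sentences; the discount d = 0.75 matches the note above, and this is the interpolated form rather than full modified Kneser-Ney:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(corpus, d=0.75):
    """Return prob(w, prev) ~ P(w | prev) with absolute discounting
    plus a continuation (type-fertility) lower-order distribution.

    corpus: list of token lists. d: discount subtracted from every
    observed bigram count (idea 1). The lower-order distribution uses
    the number of distinct words preceding w (idea 2).
    """
    bigrams = Counter()
    context_count = Counter()            # count(prev) as a bigram context
    preceders = defaultdict(set)         # distinct words seen before w
    for sent in corpus:
        for prev, w in zip(sent, sent[1:]):
            bigrams[(prev, w)] += 1
            context_count[prev] += 1
            preceders[w].add(prev)

    total_bigram_types = len(bigrams)

    def p_continuation(w):
        return len(preceders[w]) / total_bigram_types

    def prob(w, prev):
        c = context_count[prev]
        if c == 0:
            return p_continuation(w)
        discounted = max(bigrams[(prev, w)] - d, 0) / c
        # interpolation weight = probability mass freed up by the discount
        n_follow = sum(1 for (p, _) in bigrams if p == prev)
        lam = d * n_follow / c
        return discounted + lam * p_continuation(w)

    return prob
```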






A Statistical MT Tutorial Workbook
- syntactic transfer
- could not stand reading this paper






2007 Large Language Models in Machine Translation - Google
- N-gram model
- "Stupid backoff" - not quite Kneser-Ney. Uses un-normalized scores.
- mostly shows the effect of using more data
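
A rough sketch of the stupid-backoff score, assuming a counts dict that maps n-gram tuples (including the empty tuple for the total token count) to corpus counts; the 0.4 backoff factor is the value reported in the paper:

```python
def stupid_backoff(word, context, counts, alpha=0.4):
    """Un-normalized score S(word | context).

    counts maps n-gram tuples of any order to their corpus counts,
    with counts[()] holding the total number of tokens. If the full
    n-gram was seen, return its relative frequency; otherwise recurse
    on a shorter context, multiplied by alpha.
    """
    context = tuple(context)
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]
    if not context:
        # backed off all the way: unigram relative frequency (0 if unseen)
        return counts.get((word,), 0) / counts.get((), 1)
    return alpha * stupid_backoff(word, context[1:], counts, alpha)
```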








A Neural Probabilistic Language Model
- curse of dimensionality for building language models (too many rare words; the number of possible word sequences grows combinatorially)
- Contrast with non-parametric density estimation, where probability mass is initially concentrated on the training points in a large volume; the goal is to distribute probability mass where it matters rather than uniformly around the training points.
- Proposal
1. associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in R^m),
2. express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence, and
3. learn simultaneously the word feature vectors and the parameters of that probability function.
- Intuition
1. Similar words should have similar word vectors
2. A small change in the features should induce a small change in probability because it's a smooth function of the features
- instead of characterizing the similarity with a discrete random or deterministic variable (which corresponds to a soft or hard partition of the set of words), we use a continuous real-vector for each word, i.e. a learned distributed feature vector, to represent similarity between words
- Experiments suggest that learning jointly the representation (word features) and the model is very useful. We tried (unsuccessfully) using as fixed word features for each word w the first principal components of the co-occurrence frequencies of w with the words occurring in text around the occurrence of w.
- Learn both feature vectors C, and a neural network g that can predict the next word given past words.
- Neural network description
1. Single layer, tanh activation
2. Have to account for variable length input
- Neural network parameters
1. b - output biases
2. d - hidden layer biases
3. W - word features to output weights - if direct connection to output layer
4. U - hidden to output layer weights
5. H - input (word features) to hidden layer weights
6. C - word features
- Comparison against a linearly interpolated n-gram model with backoff; the interpolation parameters are estimated via EM (up to trigrams):
    P(w_t | w_{t-1}, w_{t-2}) = α_0(q_t) p_0 + α_1(q_t) p_1(w_t) + α_2(q_t) p_2(w_t | w_{t-1}) + α_3(q_t) p_3(w_t | w_{t-1}, w_{t-2})
- Out-of-vocabulary words - first guess an initial feature vector for such a word by taking a weighted convex combination of the feature vectors of other words that could have occurred in the same context, with weights proportional to their conditional probability; then put it in the network and train.
- seems to be a fixed window
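
A small numpy sketch of the forward pass as I read the parameter list above (y = b + Wx + U tanh(d + Hx), softmax over y); the dimensions here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, h, n = 10_000, 60, 50, 3        # vocab size, feature dim, hidden units, n-gram order

C = rng.normal(scale=0.1, size=(V, m))            # word feature vectors
H = rng.normal(scale=0.1, size=(h, (n - 1) * m))  # input (word features) -> hidden
d = np.zeros(h)                                   # hidden layer biases
U = rng.normal(scale=0.1, size=(V, h))            # hidden -> output
W = np.zeros((V, (n - 1) * m))                    # optional direct input -> output
b = np.zeros(V)                                   # output biases

def next_word_distribution(context_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for a fixed-size context window."""
    x = np.concatenate([C[i] for i in context_ids])   # (n-1)*m input vector
    a = np.tanh(d + H @ x)                            # hidden activations
    y = b + W @ x + U @ a                             # unnormalized scores over vocab
    e = np.exp(y - y.max())
    return e / e.sum()                                # softmax

probs = next_word_distribution([12, 7])               # ids of the two previous words
```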









2015_nyu_class/lecture3_postagging.txt

Note:
- look at the MCollins class notes and the Jurafsky book on POS tagging as well
- about accuracy - most reported metrics are at the word level, not the sentence level.
    Sentence-level accuracy is around 50-60%; word-level accuracy is around 97%.

http://spacy.io/blog/part-of-speech-pos-tagger-in-python
- he says just use an averaged perceptron
- but he also praises Collobert's system (NLP (almost) from scratch), as well as Manning's 2011 work

####################################################################################################################################
MCollins
---------------
2 forms of tagging:
1. POS
2. NER

POS tagging challenges
----------------
- one of the main challenges is ambiguity of words depending on location
- sparsity of data for words

NER as a tagging problem
----------------
- to tag a particular entity type, we can create a training dataset where words that are not part of an entity are labeled as such (NA)
- if an entity consists of multiple words, we tag the first word with a start tag (SC) and the following words with a continuation tag (CC); see the toy example below
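
A toy example of that scheme for company names, where SC starts an entity, CC continues it, and NA marks non-entity words:

```python
# "Profits at Boeing Aircraft Corp. rose" tagged for company-name NER
tagged = [
    ("Profits", "NA"),
    ("at", "NA"),
    ("Boeing", "SC"),      # start of a company entity
    ("Aircraft", "CC"),    # continuation of the same entity
    ("Corp.", "CC"),
    ("rose", "NA"),
]
```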

MeMM features
----------------
- word-tag interaction
- prefix and suffix features interacting with tag
- ngram tag features - ngrams of tags up to current location to predict
- surrounding words-tag interaction


####################################################################################################################################

Jurafsky
---------------

Other Languages - Chinese
---------------
Different problems occur with languages like Chinese in which words are not segmented in the writing system. For Chinese part-of-speech tagging, word segmentation (Chapter 2) is therefore generally applied before tagging. It is also possible to build sequence models that do joint segmentation and tagging. Although Chinese words are on average very short (around 2.4 characters per unknown word compared with 7.7 for English) the problem of unknown words is still large, although while English unknown words tend to be proper nouns, in Chinese the majority of unknown words are common nouns and verbs because of extensive compounding. Tagging models for Chinese use similar unknown word features to English, including character prefix and suffix features, as well as novel features like the radicals of each character in a word. One standard unknown feature for Chinese is to build a dictionary in which each character is listed with a vector of the part-of-speech tags that it occurred with in any word in the training set. The vectors of each of the characters in a word are then used as a feature in classification.
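
A rough sketch of that character-level dictionary feature, assuming (word, tag) training pairs and a fixed tagset (both hypothetical placeholders):

```python
from collections import defaultdict, Counter

def build_char_tag_dict(tagged_words, tagset):
    """tagged_words: iterable of (word, tag) pairs from the training set.

    Returns {character: count vector over tagset}, recording which POS
    tags each character occurred with in any training word.
    """
    char_tags = defaultdict(Counter)
    for word, tag in tagged_words:
        for ch in word:
            char_tags[ch][tag] += 1
    return {ch: [cnts[t] for t in tagset] for ch, cnts in char_tags.items()}

def unknown_word_features(word, char_tag_vectors, tagset):
    """Concatenate the per-character tag vectors for an unseen word."""
    zero = [0] * len(tagset)
    feats = []
    for ch in word:
        feats.extend(char_tag_vectors.get(ch, zero))
    return feats
```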


####################################################################################################################################

Data and TagSets
--------------------------------
1. Penn Tree TagSet
2. Brown tagset

####################################################################################################################################

English
--------------------------------
Ambiguous - even though most word types are unambiguous, the ambiguous ones tend to be among the most commonly used words.


####################################################################################################################################


Linguistic Structures
--------------------------------
1. Syntactic word classes (lexical categories - to distinguish from phrasal categories)
- https://en.wikipedia.org/wiki/Syntactic_category
- Example types
1. verbs
2. adverbs
3. nouns
4. adjectives
5. numbers
6. determiners
7. conjunctions
8. pronouns
9. particles
2. phrasal categories
- https://en.wikipedia.org/wiki/Phrase_structure_grammar
- aka constituency grammars
- context-free
- Examples:
a. Noun phrase (NP)
b. VP (verb phrase)


Open vs Closed Word Class
--------------------------------
Closed:
prepositions: on, under, over, near, by, at, from, to, with
determiners: a, an, the
pronouns: she, who, I, others
conjunctions: and, but, or, as, if, when
auxiliary verbs: can, may, should, are
particles: up, down, on, off, in, out, at, by
numerals: one, two, three, first, second, third


Constituency vs dependency
--------------------------------
- https://en.wikipedia.org/wiki/Dependency_grammar#Dependency_vs._constituency


Constituency Relation
--------------------------------
- initial binary division of sentence. (subject-predicate)
- one-to-one-or-more correspondence


Dependency relation
--------------------------------
- one-to-one relation (1 node per word)




Content word
--------------------------------
- means nouns, verbs, adjectives, etc. Not "function words"

Function words
--------------------------------
- words to express grammatical relationships between words in a sentence
- examples
1. pronouns
2. conjunctions
3. adpositions


Lexical category
--------------------------------
- can have 2 distinct meanings
1. word classes
2. phrases that start with a content word


Why POS tagging?
--------------------------------
1. useful in and of itself
- text to speech
- lemmatization
- quick and dirty NP chunk detection
2. Useful as preprocessing step for parsing
- less ambiguity means fewer parses

HMM - POS tagging classical solutions
--------------------------------
- condition only on the state so far.
- assumptions are fairly naive/broken.
- HMM variations
bigram/trigram/ngram tagger.
- state estimation
last N tags.
Use smoothed estimation like with ngram language model (kneser Ney, etc)
- emission estimation
    tricky because of unknown words and unseen state-word pairs.
    Can use Good-Turing, or create an unknown-word class.
- disambiguation (Inference)
1. Finding the most likely path (the Viterbi path through the sequence) by brute force creates too many paths to enumerate.
2. First solution: use beam search. (recall: just keep top N at each step or candidates within % of best)
- works ok in practice.
- sometimes we want optimal though
3. Viterbi Algorithm (see the sketch below)
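
A compact sketch of Viterbi decoding for a bigram HMM tagger, assuming already-smoothed log-probability tables log_trans[prev][tag] and log_emit[tag][word] (with unknown words handled upstream):

```python
def viterbi(words, tags, log_trans, log_emit, start="<s>"):
    """Most likely tag sequence under a bigram HMM.

    log_trans[prev][tag] and log_emit[tag][word] are log probabilities;
    the start symbol "<s>" must be a key of log_trans.
    """
    # best[i][t] = best log score of any tag path for words[:i+1] ending in t
    best = [{t: log_trans[start][t] + log_emit[t][words[0]] for t in tags}]
    back = [{}]
    for i, word in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for t in tags:
            score, prev = max(
                (best[i - 1][p] + log_trans[p][t] + log_emit[t][word], p)
                for p in tags
            )
            best[i][t] = score
            back[i][t] = prev
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```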


--------------------------------
Has nice slides for accuracy of state of the art for various methods and languages.


TnT tagger
--------------------------------
- uses trigrams of tags to estimate next tag. (last 2 tags as state)
- add smoothing (similar to ngram language models)
- the smoothing's parameters (on unigram/bigram/etc) are global (not context or word dependent)
- used suffix tree to handle unknown words. Also did conditioning on letters/suffix of the unknown word


Taggers:
--------------------------------
http://www.nltk.org/api/nltk.tag.html#module-nltk.tag
1. crf
2. MEMM (maximum entropy markov model)
3. MaxEnt
4. TriGram HMM
5. TnT (Trigrams'n'Tags) HMM
6. MeMM with neural network (state of the art 2016?)

- (Mine) how about bi-RNN with previous tags?


Evaluating taggers:
--------------------------------
- accuracy on known vs unknown words


Tagger features:
--------------------------------
1. The word itself and its "shape" (suffix, prefix, capitalization, contains a digit, contains a dash).
2. The surrounding words, without ordering, and their features/shapes.
3. The same surrounding-word features with ordering (position-specific). See the sketch below.
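
A small sketch of feature strings along these lines (also covering the MeMM features from the Collins notes earlier: word-tag, prefix/suffix, tag n-grams, surrounding words); the naming scheme is arbitrary:

```python
def tag_features(words, i, prev_tag, prev2_tag):
    """Feature strings for predicting the tag of words[i]."""
    w = words[i]
    return [
        "word=" + w.lower(),
        "suffix3=" + w[-3:],
        "prefix1=" + w[:1],
        "is_cap=" + str(w[0].isupper()),
        "has_digit=" + str(any(c.isdigit() for c in w)),
        "has_dash=" + str("-" in w),
        # surrounding words, with their position encoded (ordering)
        "prev_word=" + (words[i - 1].lower() if i > 0 else "<s>"),
        "next_word=" + (words[i + 1].lower() if i + 1 < len(words) else "</s>"),
        # tag n-gram features on the history
        "prev_tag=" + prev_tag,
        "prev_two_tags=" + prev2_tag + "_" + prev_tag,
    ]
```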

MeMM taggers
--------------------------------
- condition on N previous tags.
- Natural extension of MaxEnt: neural net version! (latest state-of-the-art)
- label bias problem
“This per-state normalization of transition scores implies a “conservation of score mass” (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.”



Accuracy:
--------------------------------
- in-domain: > 97%
- out-of-domain: < 90%


Papers:
--------------------------------
- A Universal Part-of-Speech Tagset
http://arxiv.org/abs/1104.2086
- Senna (2011 NLP (almost) from scratch)
from Ronan Collobert







