Dense notes on NLP.
More extensive lists of abbreviations 1, 2
ASGD Averaged Stochastic Gradient Descent
AWD-LSTM ASGD Weight-Dropped LSTM
BERT Bidirectional Encoder Representations from Transformers
BPE Byte Pair Encoding
BiLM Bidirectional Language Model
CBOW Continuous Bag-Of-Words
CFG Context-free Grammar
CL Computational Linguistics
CVT Cross-View Training
CoLA Corpus of Linguistic Acceptability
CoVe Contextual Word Vectors
CRF Conditional Random Field
DAG Directed Acyclic Graph
DAE Denoising Auto-Encoder
DCN Dynamic Coattention Network
DCNN Dynamic Convolutional Neural Network
DMN Dynamic Memory Network
EDA Exploratory Data Analysis
ELMo Embeddings from Language Model
ESA Explicit Semantic Analysis
FGN Fine-Grained NER
FOL First-Order Logic
GAN Generative Adversarial Network
GEC Grammatical Error Correction
GPT Generative Pre-training Transformer
GRU Gated Recurrent Unit
GloVe Global Vectors for Word Representation
HAL Hyperspace Analogue to Language
HDP Hierarchical Dirichlet Process
IE Information Extraction
IR Information Retrieval
LDA Latent Dirichlet Allocation
LSA Latent Semantic Analysis (Truncated SVD)
LSI Latent Semantic Indexing
LSTM Long Short-Term Memory
MAE Mean Absolute Error
MLM Masked Language Model
MNLI Multi-Genre NLI
MRPC Microsoft Research Paraphrase Corpus
MSE Mean Squared Error
MaxEnt Maximum Entropy (classifier) (softmax)
NER Named-Entity Recognition
NLG Natural Language Generation
NLI Natural Language Inference (Text Entailment)
NLP Natural Language Processing
NLU Natural Language Understanding
NMT Neural Machine Translation
NTN Neural Tensor Network
NiN Network-in-Network (1x1 convolutions)
PCFG Probabilistic Context Free Grammar
POS Parts-Of-Speech
QRNN Quasi-Recurrent Neural Networks
QNLI Question NLI
RACE ReAding Comprehension from Examinations
RMSE Root Mean Squared Error
RNN Recurrent Neural Network
RNN Recursive Neural Network
RNTN Recursive Neural Tensor Network
RP Random Projections
RTE Recognizing Textual Entailment (now called NLI)
SG Skip-Gram
SNLI Stanford Natural Language Inference
SOTA State-Of-The-Art
SQuAD Stanford Question Answering Dataset
SRL Semantic Role Labeling
SST Stanford Sentiment Treebank
STLR Slanted Triangular Learning Rates
SWAG Situations With Adversarial Generations
TDNN Time-Delay Neural Network
TF Term Frequency
TF-IDF Term Frequency-Inverse Document Frequency
TLM Translation Language Modeling
ULMFiT Universal Language Model Fine-tuning
USE Universal Sentence Encoder
VAE Variational Autoencoder
VSM Vector Space Model
WSD Word Sense Disambiguation
ZSL Zero-Shot Learning
t-SNE t-distributed Stochastic Neighbor Embedding
Denotational semantics: The concept of representing an idea as a symbol (a word or a one-hot vector). It is sparse and cannot capture similarity. This is a "localist" representation.
Distributional semantics: The concept of representing the meaning of a word based on the context in which it usually appears. It is dense and can better capture similarity.
Distributional similarity: similar words appear in similar contexts.
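A minimal numpy sketch contrasting the two representations (the vectors are made up for illustration): one-hot vectors of distinct words are orthogonal, so their cosine similarity is always zero, whereas dense vectors learned from context can score related words as similar.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Localist (one-hot) vectors: "hotel" and "motel" are orthogonal, similarity is 0.
hotel_onehot = np.array([0, 1, 0, 0, 0])
motel_onehot = np.array([0, 0, 0, 1, 0])
print(cosine(hotel_onehot, motel_onehot))  # 0.0

# Dense (distributional) vectors can capture similarity (toy values for illustration).
hotel_dense = np.array([0.2, -0.4, 0.7, 0.1])
motel_dense = np.array([0.25, -0.35, 0.6, 0.05])
print(cosine(hotel_dense, motel_dense))    # close to 1
```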
The Transformer is an architecture for transforming one sequence into another with the help of two parts, an encoder and a decoder.
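As a rough sketch of the encoder-decoder interface (not any particular model), PyTorch ships a built-in `torch.nn.Transformer` module; the dimensions and random tensors below are assumptions chosen only to show the sequence-to-sequence shapes.

```python
import torch
import torch.nn as nn

# Toy encoder-decoder Transformer: maps a source sequence to a target sequence.
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, 32)   # (source length, batch, d_model)
tgt = torch.rand(7, 1, 32)    # (target length, batch, d_model)

out = model(src, tgt)         # decoder output, shape (target length, batch, d_model)
print(out.shape)
```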
Constituency parsing is a way to break a piece of text (e.g. one sentence) into sub-phrases. One of the goals of constituency parsing (also known as "phrase structure parsing") is to identify the constituents in the text, which is useful when extracting information from it. Knowing the constituents after parsing the sentence also makes it possible to generate similar sentences that are syntactically correct.
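A minimal sketch with NLTK, using a toy hand-written CFG (the grammar and sentence are made up for illustration), showing how a sentence is broken into constituents:

```python
import nltk

# A tiny context-free grammar (toy rules for illustration only)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'ball'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the ball".split()):
    tree.pretty_print()  # prints the constituency tree (S, NP, VP, ...)
```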
Lemmas are root forms of words.
Named Entity Recognition: which words in a sentence are a proper name, organization name, or entity?
Textual Entailment: given two sentences, does the first sentence entail or contradict the second sentence?
Coreference Resolution: given a pronoun like “it” in a sentence that discusses multiple objects, which object does “it” refer to?
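A minimal sketch of lemmas and named entities with spaCy, assuming the small English pipeline en_core_web_sm is installed:

```python
import spacy

# Assumes the "en_core_web_sm" pipeline has been downloaded beforehand.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Lemmas: root forms of the tokens
print([(token.text, token.lemma_) for token in doc])

# Named entities found by the pipeline, with their labels (ORG, GPE, MONEY, ...)
print([(ent.text, ent.label_) for ent in doc.ents])
```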
Popular off-the-shelf word embedding models:
- Word2Vec (by Google)
- GloVe (by Stanford)
- fastText (by Facebook)
- 2 algorithms: continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors. Skip-gram does the opposite, and predicts the distribution (probability) of context words from a center word.
- 2 training methods: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the vocabulary.
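A minimal gensim sketch (toy corpus made up for illustration; parameter names follow gensim 4.x) showing how the algorithm and the training method are selected via sg, hs, and negative:

```python
from gensim.models import Word2Vec

# Toy corpus, made up for illustration; a real corpus would be much larger.
sentences = [
    ["the", "dog", "chased", "the", "ball"],
    ["the", "cat", "chased", "the", "mouse"],
    ["dogs", "and", "cats", "are", "animals"],
]

# sg=0 -> CBOW (predict center word from context); sg=1 -> skip-gram (predict context from center word)
# hs=1 -> hierarchical softmax; hs=0 with negative > 0 -> negative sampling
model = Word2Vec(
    sentences,
    vector_size=50,   # gensim >= 4.0; older versions call this `size`
    window=2,
    min_count=1,
    sg=1,
    hs=0,
    negative=5,
    epochs=50,
)
print(model.wv.most_similar("dog", topn=3))
```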
https://amitness.com/2020/05/data-augmentation-for-nlp/
Perplexity is often used as an intrinsic evaluation metric for gauging how well a language model can capture the real word distribution conditioned on the context.
The perplexity of a discrete probability distribution $p$ is defined as the exponentiation of its entropy:
$ 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)} $
Given a sentence $s$ with $N$ words $w_1, \dots, w_N$, the perplexity for the sentence becomes:
$ 2^{H(s)} = 2^{-\frac{1}{N} \sum_{i=1}^N \log_2 p(w_i)} = (2^{\sum_{i=1}^N \log_2 p(w_i)})^{-\frac{1}{N}} = (p(w_1) \dots p(w_N))^{-\frac{1}{N}} $
A good language model should assign high probabilities to the observed words. Therefore, the smaller the perplexity, the better.
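A small numeric sketch (the per-word probabilities are made up for illustration) showing that the two forms of the formula above agree:

```python
import numpy as np

# Toy per-word probabilities p(w_i | context) assigned by some language model.
word_probs = [0.2, 0.1, 0.05, 0.3]
N = len(word_probs)

perplexity = 2 ** (-np.mean(np.log2(word_probs)))   # 2^{H(s)}
same_thing = np.prod(word_probs) ** (-1 / N)         # (p(w_1)...p(w_N))^{-1/N}
print(perplexity, same_thing)                         # identical values
```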
- cs224u: Evaluation metrics in NLP
- scikit: Metrics and scoring: quantifying the quality of predictions
After applying softmax, each component will be in the interval (0, 1) and the total will add up to 1, so that they can be interpreted as probabilities.
The larger input components will correspond to larger probabilities.
Temperature is used to scale the logits before applying softmax. (logits/τ)
- For high temperatures (τ → ∞), all components have nearly the same probability; the lower the temperature, the more the expected values affect the probability. High temperatures result in more diversity and also more mistakes.
- When the temperature is 1, the softmax is computed on the unscaled logits.
- For a low temperature (τ → 0), the probability of the action with the highest expected value tends to 1. Larger logit values make the softmax more confident, but also more conservative in its samples (it is less likely to sample from unlikely candidates).
https://cs.stackexchange.com/a/79242/113823
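A minimal numpy sketch of temperature-scaled softmax (the logits and temperatures are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    scaled = np.asarray(logits, dtype=float) / tau
    scaled -= scaled.max()          # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, tau=1.0))   # plain softmax on unscaled logits
print(softmax_with_temperature(logits, tau=10.0))  # high tau -> near-uniform (more diversity)
print(softmax_with_temperature(logits, tau=0.1))   # low tau -> nearly all mass on the largest logit
```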
- Hyponymy & hypernymy: a hyponym is a word or phrase whose semantic field is included within that of another word, its hypernym. San Francisco (hyponym) is an instance of a city (hypernym). A pigeon is a hyponym of bird, which in turn is a hyponym of animal; conversely, a bird is a hypernym of a pigeon and an animal is a hypernym of a bird (see the WordNet sketch after this list).
- Antonymy: acidic is the opposite of basic.
- Meronymy: an alternator is a part of a car.
- Polysemy: the capacity for a word or phrase to have multiple meanings, usually related by contiguity of meaning within a semantic field, e.g. crane: (n) machine, (n) bird, (v) to stretch out one's neck.
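A minimal sketch of the pigeon/bird/animal example with NLTK's WordNet interface (assumes the wordnet corpus has been downloaded; the synset names are assumptions based on WordNet's usual naming):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") beforehand

pigeon = wn.synset("pigeon.n.01")
bird = wn.synset("bird.n.01")
animal = wn.synset("animal.n.01")

# Walk up the hypernym hierarchy starting from "pigeon"
hypernym_closure = set(pigeon.closure(lambda s: s.hypernyms()))
print(bird in hypernym_closure)    # True: pigeon is a hyponym of bird
print(animal in hypernym_closure)  # True: bird is in turn a hyponym of animal
```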
Semantic change (also semantic shift, semantic progression, semantic development, or semantic drift) is a form of language change regarding the evolution of word usage—usually to the point that the modern meaning is radically different from the original usage.
Monotonicity. A system is monotonic if it grows without shrinking.
Monotonicity reasoning is a type of reasoning based on word replacement; it requires the ability to capture the interaction between lexical and syntactic structures. Consider the examples in (1) and (2).
(1) a. All [workers ↓] joined for a [French dinner ↑]
    b. All [workers] joined for a [dinner]
    c. All [new workers] joined for a [French dinner]
(2) a. Not all [new workers ↑] joined for a dinner
    b. Not all [workers] joined for a dinner
A context is upward entailing (shown by [... ↑]) if it allows an inference from (1a) to (1b), where French dinner is replaced by the more general concept dinner. On the other hand, a downward entailing context (shown by [... ↓]) allows an inference from (1a) to (1c), where workers is replaced by the more specific concept new workers. Interestingly, the direction of monotonicity can be reversed again by embedding yet another downward entailing context (e.g., not in (2)), as witnessed by the fact that (2a) entails (2b). To properly handle both directions of monotonicity, NLI models must detect monotonicity operators (e.g., all, not) and their arguments from the syntactic structure. (This excerpt is from Can neural networks understand monotonicity reasoning?)
Useful libraries and modules:
- Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point (see the sketch after this list).
- Hugging Face transformers. The main transformers library.
- Simple Transformers. Transformers made simple, with training, evaluation, and prediction possible with one line each.
- AdaptNLP. A high-level framework and library for running, training, and deploying state-of-the-art Natural Language Processing (NLP) models for end-to-end tasks. Built on top of Zalando Research's Flair and Hugging Face's Transformers.
- spacy-transformers provides spaCy model pipelines that wrap Hugging Face's transformers package, so you can use them in spaCy.
- AWD-LSTM: Average-SGD Weight-Dropped LSTM
- ULMFiT: Universal Language Model Fine-tuning
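A minimal Annoy sketch (random vectors made up for illustration) showing the index-build-query workflow mentioned in the Annoy item above:

```python
import numpy as np
from annoy import AnnoyIndex

dim = 50
index = AnnoyIndex(dim, "angular")  # angular distance, related to cosine similarity

# Add some random "word vectors" (toy data for illustration)
rng = np.random.default_rng(0)
for i in range(1000):
    index.add_item(i, rng.normal(size=dim).tolist())

index.build(10)                      # number of trees; more trees -> better recall, bigger index
print(index.get_nns_by_item(0, 5))   # 5 approximate nearest neighbours of item 0
```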
- Chris Manning and Hinrich Schuetze - Foundations of Statistical Natural Language Processing