---
engine: knitr
---

# Text as data {#sec-text-as-data}

**Prerequisites**

- Read *Text as data: An overview*, [@benoit2020text]
  - This chapter provides an overview of using text as data.
- Read *Supervised Machine Learning for Text Analysis in R*, [@hvitfeldt2021supervised]
  - Focus on Chapters 6 "Regression" and 7 "Classification", which implement linear and generalized linear models using text as data.
- Read *The Naked Truth: How the names of 6,816 complexion products can reveal bias in beauty*, [@thenakedtruth]
  - Analysis of text on make-up products.

**Key concepts and skills**

- Understanding text as a source of data that we can analyze enables many interesting questions to be considered.
- Text cleaning and preparation are especially critical because of the large number of possible outcomes. There are many decisions that need to be made at this stage, which have important effects later in the analysis.
- One way to consider a text dataset is to look at which words distinguish particular documents.
- Another is to consider which topics are contained in a document.

**Software and packages**

- Base R [@citeR]
- `astrologer` [@astrologer] (this package is not on CRAN, so install it with: `devtools::install_github("sharlagelfand/astrologer")`)
- `beepr` [@beepr]
- `fs` [@fs]
- `gutenbergr` [@gutenbergr]
- `quanteda` [@quanteda]
- `stm` [@stm]
- `tidytext` [@SilgeRobinson2016]
- `tidyverse` [@tidyverse]
- `tinytable` [@tinytable]

```{r}
#| message: false
#| warning: false

library(astrologer)
library(beepr)
library(fs)
library(gutenbergr)
library(quanteda)
library(stm)
library(tidytext)
library(tidyverse)
library(tinytable)
```

## Introduction

Text is all around us.\index{text!data} In many cases, text is the earliest type of data that we are exposed to. Increases in computational power, the development of new methods, and the enormous availability of text, mean that there has been a great deal of interest in using text as data. Using text as data provides opportunities for unique analyses. For instance:

- text analysis of state-run newspapers in African countries can identify manipulation by governments [@Hassan2022];
- the text from UK daily newspapers can be used to generate better forecasts of GDP and inflation [@Kalamara2022], and similarly, *The New York Times*\index{text!newspapers} can be used to create an uncertainty index which correlates with US economic activity [@Alexopoulos2015];
- the analysis of notes in Electronic Health Records (EHR)\index{Electronic Health Records (EHR)} can improve the efficiency of disease prediction [@jessgronsbell]; and
- analysis of US congressional records indicates just how often women legislators are interrupted by men [@millersutherland2022].

Earlier approaches to the analysis of text tend to convert words into numbers, divorced of context.\index{text!context} They could then be analyzed using traditional approaches, such as variants of logistic regression. More recent methods try to take advantage of the structure inherent in text, which can bring additional meaning.\index{text!structure} The difference is perhaps like a child who can group similar colors, compared with a child who knows what objects are; although both crocodiles and trees are green, and you can do something with that knowledge, it is useful to know that a crocodile could eat you while a tree probably would not.

Text can be considered an unwieldy, yet similar, version of the datasets that we have used throughout this book. The main difference is that we will typically begin with wide data, where each variable is a word, or token more generally. Often each entry is then a count. We would then typically transform this into rather long data, with one variable of words and another of the counts. Considering text as data naturally requires some abstraction from its context. But text should not be entirely separated from its context, as this can perpetuate historical inequities. For instance, @koenecke2020 find that automated speech recognition systems perform much worse for Black compared with White speakers, and @davidson2019racial find that tweets that use Black American English\index{Black American!English}, which is a specifically defined technical term, are classified as hate speech at higher rates than similar tweets in Standard American English, which again is a technical term.
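To make the wide-to-long reshaping concrete, the following is a minimal sketch that uses a made-up table of counts. The documents, words, and counts are hypothetical and chosen only for illustration.

```{r}
#| message: false
#| warning: false

library(tidyverse)

# Hypothetical wide data: one row per document, one column per word
wide_counts <- tibble(
  document = c("doc_1", "doc_2"),
  the = c(2, 1),
  cat = c(1, 0),
  sat = c(0, 3)
)

# Reshape to long data: one variable of words and another of counts
long_counts <-
  wide_counts |>
  pivot_longer(
    cols = -document,
    names_to = "word",
    values_to = "n"
  )

long_counts
```

Each row of the long version is one document-word pair, which is the shape that counting and joining functions in the `tidyverse` tend to expect.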

One exciting aspect of text data is that it is typically not generated for the purposes of our analysis. The trade-off is that we typically must do a bunch more work to get it into a form that we can work with. There are a lot of decisions to be made in the data cleaning and preparation stages.\index{data cleaning}

The larger size of text datasets means that it is especially important to simulate, and start small, when it comes to their analysis.\index{simulation} Using text as data is exciting because of the quantity and variety of text that is available to us. But in general, dealing with text datasets is messy. There is a lot of cleaning and preparation that is typically required. Often, text datasets are large. As such, having a reproducible workflow in place, and then clearly communicating your findings, becomes critical.\index{reproducibility}\index{workflow} Nonetheless, it is an exciting area.

:::{.callout-note}
## Shoulders of giants

Professor Kenneth Benoit is Professor of Computational Social Science and Director of the Data Science Institute at the London School of Economics and Political Science (LSE). After obtaining a PhD in Government from Harvard University in 1998, supervised by Gary King and Kenneth Shepsle, he took a position at Trinity College, Dublin, where he was promoted to professor in 2007. He moved to the LSE in 2020. He is an expert in using quantitative methods to analyse text data, especially political text, and social media. Some of his important papers include @laver2003 which extracted policy positions from political texts and helped start the "text as data" subfield in political science. He has also worked extensively in other methods for estimating policy positions, such as @benoitbook, which provided original expert survey positions in dozens of countries, and @benoit2007 in which he compared expert surveys with hand coded analysis of party policy positions. A core contribution is the family of R packages known as `quanteda`, for the "quantitative analysis of textual data" [@quanteda] which makes it easy to analyse text data.
:::

In this chapter we first consider preparing text datasets. We then consider Term Frequency-Inverse Document Frequency (TF-IDF) and topic models.

## Text cleaning and preparation

Text modeling is an exciting area of research.\index{text!cleaning} But, and this is true more generally, the cleaning and preparation aspect is often at least as difficult as the modeling.\index{data!cleaning} We will cover some essentials and provide a foundation that can be built on.

The first step is to get some data. We discussed data gathering in @sec-gather-data and mentioned in passing many sources including:

- Using *Inside Airbnb*, which provides text from reviews.
- Project Gutenberg which provides the text from out-of-copyright books.
- Scraping Wikipedia or other websites.

The workhorse packages that we need for text cleaning and preparation are `stringr`, which is part of the `tidyverse`, and `quanteda`.

For illustrative purposes we construct a corpus of the first sentence or two, from three books: *Beloved* by Toni Morrison, *The Last Samurai* by Helen DeWitt, and *Jane Eyre* by Charlotte Brontë.

```{r}
#| message: false
#| warning: false

last_samurai <- "My father's father was a Methodist minister."

beloved <- "124 was spiteful. Full of Baby's venom."

jane_eyre <- "There was no possibility of taking a walk that day."

bookshelf <-
  tibble(
    book = c("Last Samurai", "Beloved", "Jane Eyre"),
    first_sentence = c(last_samurai, beloved, jane_eyre)
  )

bookshelf
```

We typically want to construct a document-feature matrix, which has documents in each observation, words in each column, and a count for each combination, along with associated metadata.\index{text!document-feature matrix} For instance, if our corpus was the text from Airbnb reviews, then each document may be a review, and typical features could include: "The", "Airbnb", "was", "great". Notice here that the sentence has been split into different words. We typically talk of "tokens" to generalize away from words, because of the variety of aspects we may be interested in, but words are commonly used.\index{text!tokens}

```{r}
books_corpus <-
  corpus(bookshelf,
         docid_field = "book",
         text_field = "first_sentence")

books_corpus
```

We use the tokens in the corpus to construct a document-feature matrix (DFM) using `dfm()` from `quanteda` [@quanteda].\index{text!document-feature matrix}

```{r}
#| message: false
#| warning: false

books_dfm <-
  books_corpus |>
  tokens() |>
  dfm()

books_dfm
```

We now consider some of the many decisions that need to be made as part of this process. There is no definitive right or wrong answer. Instead, we make those decisions based on what we will be using the dataset for.

### Stop words

Stop words are words such as "the", "and", and "a".\index{text!stop words} For a long time stop words were not thought to convey much meaning, and there were concerns around memory-constrained computation. A common step of preparing a text dataset was to remove stop words. We now know that stop words can have a great deal of meaning [@schofield2017]. The decision to remove them is a nuanced one that depends on circumstances.

We can get a list of stop words using `stopwords()` from `quanteda`.

```{r}
stopwords(source = "snowball")[1:10]
```

We could then look for all instances of words in that list and crudely remove them with `str_replace_all()`.

```{r}
stop_word_list <-
  paste(stopwords(source = "snowball"), collapse = " | ")

bookshelf |>
  mutate(no_stops = str_replace_all(
    string = first_sentence,
    pattern = stop_word_list,
    replacement = " ")
  ) |>
  select(no_stops, first_sentence)
```
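The pattern above only matches a stop word that has a space on both sides, so it can miss stop words at the start of a sentence, before punctuation, or immediately after another removed stop word. One possible refinement, sketched here under the assumption that word boundaries (`\\b`) and case-insensitive matching are appropriate for the text at hand, is the following. The object names are illustrative only.

```{r}
#| message: false
#| warning: false

library(quanteda)
library(stringr)

# Wrap the stop words in word boundaries instead of requiring spaces
stop_word_pattern <-
  paste0(
    "\\b(",
    paste(stopwords(source = "snowball"), collapse = "|"),
    ")\\b"
  )

no_stops_with_boundaries <-
  str_replace_all(
    string = "There was no possibility of taking a walk that day.",
    pattern = regex(stop_word_pattern, ignore_case = TRUE),
    replacement = ""
  ) |>
  str_squish() # Collapse the repeated spaces left behind

no_stops_with_boundaries
```

This is still crude. For instance, contractions may be only partly removed, which is one reason to prefer token-based tools.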

There are many different lists of stop words that have been put together by others.\index{text!stop words} For instance, `stopwords()` can use lists including: "snowball", "stopwords-iso", "smart", "marimo", "ancient", and "nltk". More generally, if we decide to use stop words then we often need to augment such lists with project-specific words. We can do this by creating a count of individual words in the corpus, and then sorting by the most common and adding those to the stop words list as appropriate.
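As a minimal sketch of that counting step, we could tokenize the sentences and sort the words by frequency. It uses `unnest_tokens()` from `tidytext`, and recreates the three first sentences under a new, illustrative name so that the chunk is self-contained.

```{r}
#| message: false
#| warning: false

library(tidytext)
library(tidyverse)

# The same three first sentences, recreated under an illustrative name
first_sentences <- tibble(
  book = c("Last Samurai", "Beloved", "Jane Eyre"),
  first_sentence = c(
    "My father's father was a Methodist minister.",
    "124 was spiteful. Full of Baby's venom.",
    "There was no possibility of taking a walk that day."
  )
)

# Count each word across the corpus, most common first
word_counts <-
  first_sentences |>
  unnest_tokens(output = word, input = first_sentence) |>
  count(word, sort = TRUE)

head(word_counts)
```

Words near the top of such a list are candidates for a project-specific stop word list, although whether to add them remains a judgment call.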

```{r}
stop_word_list_updated <-
  paste(
    "Methodist |",
    "spiteful |",
    "possibility |",
    stop_word_list,
    collapse = " | "
  )

bookshelf |>
  mutate(no_stops = str_replace_all(
    string = first_sentence,
    pattern = stop_word_list_updated,
    replacement = " ")
  ) |>
  select(no_stops)
```

We can integrate the removal of stop words into our construction of the DFM with `dfm_remove()` from `quanteda`.

```{r}
#| message: false
#| warning: false

books_dfm |>
  dfm_remove(stopwords(source = "snowball"))
```

When we remove stop words we artificially adjust our dataset.\index{text!stop words} Sometimes there may be a good reason to do that. But it must not be done unthinkingly. For instance, in @sec-farm-data and @sec-store-and-share we discussed how sometimes datasets may need to be censored, truncated, or manipulated in other similar ways, to preserve the privacy of respondents. It is possible that the integration of the removal of stop words as a default step in natural language processing was due to computational power, which may have been more limited when these methods were developed. In any case, @jurafskymartin [p. 62] conclude that removing stop words does not improve performance for text classification. Relatedly, @schofield2017 find that inference from topic models is not improved by the removal of anything other than the most frequent words. If stop words are to be removed, then they recommend doing this after topics are constructed.

### Case, numbers, and punctuation

There are times when all we care about is the word, not the case or punctuation.\index{text!case}\index{text!numbers}\index{text!punctuation}\index{text!cleaning} For instance, if the text corpus was particularly messy or the existence of particular words was informative. We trade off the loss of information for the benefit of making things simpler. We can convert to lower case with `str_to_lower()`, and use `str_replace_all()` to remove punctuation with "[:punct:]", and numbers with "[:digit:]".

```{r}
bookshelf |>
  mutate(lower_sentence = str_to_lower(string = first_sentence)) |>
  select(lower_sentence)
```


```{r}
bookshelf |>
  mutate(no_punctuation_numbers = str_replace_all(
    string = first_sentence,
    pattern = "[:punct:]|[:digit:]",
    replacement = " "
  )) |>
  select(no_punctuation_numbers)
```

As an aside, we can remove letters, numbers, and punctuation with "[:graph:]" in `str_replace_all()`. While this is rarely needed in textbook examples, it is especially useful with real datasets, because they will typically have a small number of unexpected symbols that we need to identify and then remove. We use it to remove everything that we are used to, leaving only that which we are not.

More generally, we can use arguments in `tokens()` from `quanteda` to do this.

```{r}
books_corpus |>
  tokens(remove_numbers = TRUE, remove_punct = TRUE)
```

### Typos and uncommon words

Then we need to decide what to do about typos and other minor issues.\index{text!typos}\index{text!cleaning} Every real-world text has typos. Sometimes these should clearly be fixed. But if they are made in a systematic way, for instance, a certain writer always makes the same mistakes, then they could have value if we were interested in grouping by the writer. The use of OCR\index{Optical Character Recognition} will introduce common issues as well, as was seen in @sec-gather-data. For instance, "the" is commonly incorrectly recognized as "thc".
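Errors of this kind are often handled with a small lookup of known corrections. The following is a minimal sketch, assuming a hand-built lookup. The example text and the second correction ("Tbe") are hypothetical, and word boundaries are used so that letters inside other words are left alone.

```{r}
library(stringr)

# Hypothetical lookup: names are patterns to find, values are corrections
ocr_corrections <- c(
  "\\bthc\\b" = "the",
  "\\bTbe\\b" = "The"
)

ocr_text <- "Tbe cat sat on thc mat near thc door."

corrected_text <- str_replace_all(ocr_text, ocr_corrections)

corrected_text
```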

We could fix typos in the same way that we fixed stop words, i.e. with lists of corrections. When it comes to uncommon words, we can build this into our document-feature matrix creation with `dfm_trim()`. For instance, we could use "min_termfreq = 2" to remove any word that does not occur at least twice. With `docfreq_type = "prop"`, we could use "min_docfreq = 0.05" to remove any word that is not in at least five per cent of documents, or "max_docfreq = 0.90" to remove any word that is in more than 90 per cent of documents.

```{r}
#| message: false
#| warning: false

books_corpus |>
  tokens(remove_numbers = TRUE, remove_punct = TRUE) |>
  dfm(tolower = TRUE) |>
  dfm_trim(min_termfreq = 2)
```
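Trimming by the share of documents works similarly. The following is a minimal sketch using a made-up corpus of three tiny documents, with `docfreq_type = "prop"` so that the thresholds are read as proportions and not counts. The documents and thresholds are hypothetical.

```{r}
#| message: false
#| warning: false

library(quanteda)

# Hypothetical corpus of three tiny documents
toy_dfm <-
  c(
    d1 = "apple banana cherry",
    d2 = "apple banana",
    d3 = "apple date"
  ) |>
  tokens() |>
  dfm()

# Keep words in at least half, but no more than 90 per cent, of documents
trimmed_dfm <-
  toy_dfm |>
  dfm_trim(
    min_docfreq = 0.5,
    max_docfreq = 0.9,
    docfreq_type = "prop"
  )

featnames(trimmed_dfm)
```

Here "apple" is dropped because it is in every document, while "cherry" and "date" are dropped because each is in only one of the three.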

### Tuples

A tuple is an ordered list of elements. In the context of text it is a series of words.\index{text!tuples} If the tuple comprises two words, then we term this a "bi-gram", three words is a "tri-gram", etc. These are an issue when it comes to text cleaning and preparation because we often separate terms based on a space. This would result in an inappropriate separation.

This is a clear issue when it comes to place names. For instance, consider "British Columbia", "New Hampshire", "United Kingdom", and "Port Hedland". One way forward is to create a list of such places and then use `str_replace_all()` to add an underscore, for instance, "British_Columbia", "New_Hampshire", "United_Kingdom", and "Port_Hedland". Another option is to use `tokens_compound()` from `quanteda`.

```{r}
some_places <- c("British Columbia",
                 "New Hampshire",
                 "United Kingdom",
                 "Port Hedland")
a_sentence <-
  c("Vancouver is in British Columbia and New Hampshire is not")

tokens(a_sentence) |>
  tokens_compound(pattern = phrase(some_places))
```
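The `str_replace_all()` alternative mentioned above could be sketched as follows. It assumes that the place names contain no characters with special meaning in regular expressions, and the object names are illustrative only.

```{r}
library(stringr)

place_names <- c(
  "British Columbia",
  "New Hampshire",
  "United Kingdom",
  "Port Hedland"
)

# Named vector: names are what to find, values are the underscored versions
place_lookup <-
  setNames(
    str_replace_all(place_names, pattern = " ", replacement = "_"),
    place_names
  )

joined_sentence <-
  str_replace_all(
    string = "Vancouver is in British Columbia and New Hampshire is not",
    pattern = place_lookup
  )

joined_sentence
```

Because the replacement happens before tokenization, a tokenizer that splits on spaces will then keep each place name together.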

In that case, we knew what the tuples were. But it might be that we were not sure what the common tuples were in the corpus. We could use `tokens_ngrams()` to identify them. We could ask for, say, all bi-grams in an excerpt from *Jane Eyre*.\index{Brontë, Charlotte!Jane Eyre} We showed how to download the text of this book from Project Gutenberg\index{Project Gutenberg} in @sec-its-just-a-generalized-linear-model and so here we load the local version that we saved earlier.

```{r}
#| eval: false
#| echo: true

jane_eyre <- read_csv(
  "jane_eyre.csv",
  col_types = cols(
    gutenberg_id = col_integer(),
    text = col_character()
  )
)

jane_eyre
```

```{r}
#| eval: true
#| echo: false

# INTERNAL

jane_eyre <- read_csv(
  "inputs/jane_eyre.csv",
  col_types = cols(
    gutenberg_id = col_integer(),
    text = col_character()
  )
)

jane_eyre
```

As there are many blank lines we will remove them.

```{r}
jane_eyre <-
  jane_eyre |>
  filter(!is.na(text))
```


```{r}
jane_eyre_text <- tibble(
  book = "Jane Eyre",
  text = paste(jane_eyre$text, collapse = " ") |>
    str_replace_all(pattern = "[:punct:]",
                    replacement = " ") |>
    str_replace_all(pattern = stop_word_list,
                    replacement = " ")
)

jane_eyre_corpus <-
  corpus(jane_eyre_text, docid_field = "book", text_field = "text")
ngrams <- tokens_ngrams(tokens(jane_eyre_corpus), n = 2)
ngram_counts <-
  tibble(ngrams = unlist(ngrams)) |>
  count(ngrams, sort = TRUE)

head(ngram_counts)
```

Having identified some common bi-grams, we could add them to the list to be changed. This example includes names like "Mr Rochester" and "St John" which would need to remain together for analysis.

### Stemming and lemmatizing

Stemming and lemmatizing words is another common approach for reducing the dimensionality of a text dataset.\index{text!stemming} Stemming means to remove the last part of the word, in the expectation that this will result in more general words. For instance, "Canadians", "Canadian", and "Canada" all stem to "Canad". Lemmatizing\index{text!lemmatizing} is similar, but is more involved. It means changing words based not just on their spelling, but on their canonical form [@textasdata, p. 54]. For instance, "Canadians", "Canadian", "Canucks", and "Canuck" may all be changed to "Canada".

We can stem words with `dfm_wordstem()`. Notice that, for instance, "minister" has been changed to "minist".

```{r}
char_wordstem(c("Canadians", "Canadian", "Canada"))

books_corpus |>
  tokens(remove_numbers = TRUE, remove_punct = TRUE) |>
  dfm(tolower = TRUE) |>
  dfm_wordstem()
```

While this is a common step in using text as data, @schofield2017understanding find that in the context of topic modeling, which we cover later, stemming has little effect and there is little need to do it.
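Lemmatizing, as described above, needs a mapping from variants to a canonical form. The following is a minimal sketch that uses `tokens_replace()` from `quanteda` with a small hand-built lookup. The lookup and the example sentence are hypothetical, and in practice a much larger dictionary would be needed.

```{r}
#| message: false
#| warning: false

library(quanteda)

# Hypothetical lookup from variants to a canonical form
variants <- c("Canadians", "Canadian", "Canucks", "Canuck")
canonical <- rep("Canada", times = length(variants))

lemmatized_tokens <-
  tokens("Canadians and Canucks love Canada") |>
  tokens_replace(
    pattern = variants,
    replacement = canonical,
    valuetype = "fixed"
  )

lemmatized_tokens
```

Unlike stemming, this maps "Canucks" to "Canada" even though the two words share no common stem.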

### Duplication

Duplication is a major concern with text datasets because of their size.\index{text!duplication} For instance, @bandy2021addressing showed that around 30 per cent of the data were inappropriately duplicated in the BookCorpus dataset,\index{computer science} and @schofield2017quantifying show that this is a major concern and could substantially affect results. However, it can be a subtle and difficult-to-diagnose problem. For instance, in @sec-its-just-a-generalized-linear-model when we considered counts of page numbers for various authors in the context of Poisson regression, we could easily have accidentally included each Shakespeare entry twice because not only are there entries for each play, but also many anthologies that contained all of them. Careful consideration of our dataset identified the issue, but that would be difficult at scale.
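A first, simple check for exact or near-exact duplicates could be sketched as follows. The documents are made up, and the normalization (lower case and collapsed white space) is only one assumption about what should count as "the same" text. It would not catch the anthology problem described above, where the duplicated text sits inside a larger document.

```{r}
#| message: false
#| warning: false

library(tidyverse)

# Hypothetical documents, two of which differ only in case and spacing
possible_duplicates <- tibble(
  doc_id = 1:4,
  text = c(
    "To be, or not to be",
    "to be,  or not to be",
    "All the world's a stage",
    "Now is the winter of our discontent"
  )
)

flagged_duplicates <-
  possible_duplicates |>
  mutate(
    normalized = text |> str_to_lower() |> str_squish(),
    is_duplicate = duplicated(normalized)
  )

flagged_duplicates |>
  select(doc_id, is_duplicate)
```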

## Term Frequency-Inverse Document Frequency (TF-IDF)

### Distinguishing horoscopes

To explore a real dataset, install and load `astrologer`, which contains a dataset of horoscopes.\index{zodiac}\index{astrology}

We can then access the "horoscopes" dataset.

```{r}
horoscopes
```

There are four variables: "startdate", "zodiacsign", "horoscope", and "url" (note that the URLs are out of date because the website has been updated; for instance, the first one refers to [here](https://chaninicholas.com/horoscopes-week-january-5th/)). We are interested in the words that are used to distinguish the horoscope of each zodiac sign.

```{r}
horoscopes |>
  count(zodiacsign)
```

There are 106 horoscopes for each zodiac sign. In this example we first tokenize by word, and then create counts based on zodiac sign only, not date. We use `tidytext` because it is used extensively in @hvitfeldt2021supervised.

```{r}
horoscopes_by_word <-
  horoscopes |>
  select(-startdate, -url) |>
  unnest_tokens(output = word,
                input = horoscope,
                token = "words")

horoscopes_counts_by_word <-
  horoscopes_by_word |>
  count(zodiacsign, word, sort = TRUE)

horoscopes_counts_by_word
```

We can see that the most popular words appear to be similar for the different zodiacs. At this point, we could use the data in a variety of ways.

We might be interested to know which words characterize each group---that is to say, which words are commonly used only in each group.\index{term frequency–inverse document frequency} We can do that by first looking at a word's term frequency (TF), which is how often a word is used in the horoscopes for each zodiac sign. The issue is that there are a lot of words that are commonly used regardless of context. As such, we may also like to look at the inverse document frequency (IDF) in which we "penalize" words that occur in the horoscopes for many zodiac signs. A word that occurs in the horoscopes of many zodiac signs would have a lower IDF than a word that only occurs in the horoscopes of one. The term frequency–inverse document frequency (TF-IDF) is then the product of these.
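To see the arithmetic, the following minimal sketch computes these quantities by hand for a made-up set of counts. It assumes the common definitions in which TF is a word's share of all words for that sign, and IDF is the natural logarithm of the number of signs divided by the number of signs that use the word. The signs, words, and counts are hypothetical.

```{r}
#| message: false
#| warning: false

library(tidyverse)

# Hypothetical word counts for two zodiac signs
toy_counts <- tibble(
  sign = c("aries", "aries", "taurus", "taurus"),
  word = c("week", "ram", "week", "bull"),
  n = c(3, 1, 2, 2)
)

number_of_signs <- n_distinct(toy_counts$sign)

toy_tf_idf <-
  toy_counts |>
  # TF: share of a sign's words accounted for by this word
  mutate(tf = n / sum(n), .by = sign) |>
  # IDF: penalize words that are used by many signs
  mutate(idf = log(number_of_signs / n_distinct(sign)), .by = word) |>
  mutate(tf_idf = tf * idf)

toy_tf_idf
```

The word "week" is used by both signs, so its IDF, and hence its TF-IDF, is zero, while "ram" and "bull" each distinguish one sign.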

We can create this value using `bind_tf_idf()` from `tidytext`. It will create new variables for each of these measures.

```{r}
horoscopes_counts_by_word_tf_idf <-
  horoscopes_counts_by_word |>
  bind_tf_idf(
    term = word,
    document = zodiacsign,
    n = n
  ) |>
  arrange(-tf_idf)

horoscopes_counts_by_word_tf_idf
```

In @tbl-zodiac we look at the words that distinguish the horoscopes of each zodiac sign. The first thing to notice is that, for some signs, the name of the sign is itself one of the distinguishing words. On the one hand, there is an argument for removing these, but on the other hand, the fact that it does not happen for all of them is perhaps informative of the nature of the horoscopes for each sign.

```{r}
#| label: tbl-zodiac
#| tbl-cap: "Most common words in horoscopes that are unique to a particular zodiac sign"

horoscopes_counts_by_word_tf_idf |>
  slice(1:5,
        .by = zodiacsign) |>
  select(zodiacsign, word) |>
  summarise(all = paste0(word, collapse = "; "),
            .by = zodiacsign) |>
  tt() |>
  style_tt(j = 1:2, align = "lr") |>
  setNames(c("Zodiac sign", "Most common words unique to that sign"))
```

## Topic models

Topic models\index{text!topic models} are useful when we have many statements and we want to create groups based on which statements use similar words. We consider those groups of similar words to define topics. One way to get consistent estimates of the topics of each statement is to use topic models. While there are many variants, one way is to use the latent Dirichlet allocation (LDA)\index{latent Dirichlet allocation} method of @Blei2003latent, as implemented by `stm`. For clarity, in the context of this chapter, LDA refers to latent Dirichlet allocation and not Linear Discriminant Analysis, although the latter is another common meaning of the acronym LDA.

The key assumption behind the LDA method is that each statement, a document, is made by a person who decides the topics they would like to talk about in that document, and who then chooses words, or terms, that are appropriate to those topics. A topic could be thought of as a collection of terms, and a document as a collection of topics. The topics are not specified *ex ante*; they are an outcome of the method. Terms are not necessarily unique to a particular topic, and a document could be about more than one topic. This provides more flexibility than other approaches such as a strict word count method. The goal is to have the words found in documents group themselves to define topics.

LDA considers each statement to be a result of a process where a person first chooses the topics they want to speak about. After choosing the topics, the person then chooses appropriate words to use for each of those topics. More generally, the LDA topic model works by considering each document as having been generated by some probability distribution over topics. For instance, if there were five topics and two documents, then the first document may be comprised mostly of the first few topics; the other document may be mostly about the final few topics (@fig-topicsoverdocuments).

```{r}
#| echo: false
#| fig-cap: "Probability distributions over topics"
#| label: fig-topicsoverdocuments
#| layout-ncol: 2
#| fig-subcap: ["Distribution for Document 1", "Distribution for Document 2"]

topics <- c("topic 1", "topic 2", "topic 3", "topic 4", "topic 5")

document_1 <- tibble(
  Topics = topics,
  Probability = c(0.40, 0.40, 0.1, 0.05, 0.05)
)

document_2 <- tibble(
  Topics = topics,
  Probability = c(0.01, 0.04, 0.35, 0.20, 0.4)
)

ggplot(document_1, aes(Topics, Probability)) +
  geom_point() +
  theme_classic() +
  coord_cartesian(ylim = c(0, 0.4))

ggplot(document_2, aes(Topics, Probability)) +
  geom_point() +
  theme_classic() +
  coord_cartesian(ylim = c(0, 0.4))
```

Similarly, each topic could be considered a probability distribution over terms. To choose the terms used in each document the speaker picks terms from each topic in the appropriate proportion. For instance, if there were ten terms, then one topic could be defined by giving more weight to terms related to immigration; and some other topic may give more weight to terms related to the economy (@fig-topicsoverterms).

```{r}
#| echo: false
#| fig-cap: "Probability distributions over terms"
#| label: fig-topicsoverterms
#| layout-ncol: 2
#| fig-subcap: ["Distribution for Topic 1", "Distribution for Topic 2"]

some_terms <- c(
  "immigration", "race", "influx", "loans", "wealth",
  "saving", "chinese", "france", "british", "english")

topic_1 <- tibble(
  Terms = some_terms,
  Probability = c(0.0083, 0.0083, 0.0083, 0.0083, 0.0083, 0.0083, 0.2, 0.15, 0.4, 0.2)
)

topic_2 <- tibble(
  Terms = some_terms,
  Probability = c(0.0142, 0.0142, 0.0142, 0.25, 0.35, 0.30, 0.0142, 0.0142, 0.0142, 0.0142)
)

ggplot(topic_1, aes(Terms, Probability)) +
  geom_point() +
  theme_classic() +
  coord_cartesian(ylim = c(0, 0.4))
ggplot(topic_2, aes(Terms, Probability)) +
  geom_point() +
  theme_classic() +
  coord_cartesian(ylim = c(0, 0.4))
```
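The two-step process described above, in which a topic is chosen for each term and then a term is chosen from that topic, can be simulated directly. The following is a minimal sketch in base R. The vocabulary, the probabilities, and the document length are all hypothetical, and the distributions are fixed by hand instead of being drawn from Dirichlet priors as they would be in a full LDA model.

```{r}
set.seed(853)

vocabulary <- c("immigration", "race", "influx", "loans", "wealth", "saving")

# Hypothetical distributions over terms, one row per topic (rows sum to one)
topic_term_probs <- rbind(
  topic_1 = c(0.30, 0.30, 0.30, 0.03, 0.03, 0.04),
  topic_2 = c(0.03, 0.03, 0.04, 0.30, 0.30, 0.30)
)

# Hypothetical distribution over topics for one document
doc_topic_probs <- c(topic_1 = 0.8, topic_2 = 0.2)

number_of_terms <- 10

# Step 1: choose a topic for each term position in the document
chosen_topics <- sample(
  x = names(doc_topic_probs),
  size = number_of_terms,
  replace = TRUE,
  prob = doc_topic_probs
)

# Step 2: for each position, choose a term from the chosen topic
simulated_document <- vapply(
  chosen_topics,
  function(topic) {
    sample(x = vocabulary, size = 1, prob = topic_term_probs[topic, ])
  },
  FUN.VALUE = character(1),
  USE.NAMES = FALSE
)

simulated_document
```

Because this document gives most of its weight to the first topic, we would expect most, though not necessarily all, of the simulated terms to come from the first three words of the vocabulary.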

<!-- Following @BleiLafferty2009, @blei2012 and @GriffithsSteyvers2004, the process by which a document is generated is more formally considered to be: -->

<!-- 1. There are $1, 2, \dots, k, \dots, K$ topics and the vocabulary consists of $1, 2, \dots, V$ terms. For each topic, decide the terms that the topic uses by randomly drawing distributions over the terms. The distribution over the terms for the $k$th topic is $\beta_k$. Typically a topic would be a small number of terms and so the Dirichlet distribution with hyperparameter $0<\eta<1$ is used: $\beta_k \sim \mbox{Dirichlet}(\eta)$.[^Dirichletfootnote] Strictly, $\eta$ is actually a vector of hyperparameters, one for each $K$, but in practice they all tend to be the same value. -->
<!-- 2. Decide the topics that each document will cover by randomly drawing distributions over the $K$ topics for each of the $1, 2, \dots, d, \dots, D$ documents. The topic distributions for the $d$th document are $\theta_d$, and $\theta_{d,k}$ is the topic distribution for topic $k$ in document $d$. Again, the Dirichlet distribution with the hyperparameter $0<\alpha<1$ is used here because usually a document would only cover a handful of topics: $\theta_d \sim \mbox{Dirichlet}(\alpha)$. Again, strictly $\alpha$ is vector of length $K$ of hyperparameters, but in practice each is usually the same value. -->
<!-- 3. If there are $1, 2, \dots, n, \dots, N$ terms in the $d$th document, then to choose the $n$th term, $w_{d, n}$: -->
<!--     a. Randomly choose a topic for that term $n$, in that document $d$, $z_{d,n}$, from the multinomial distribution over topics in that document, $z_{d,n} \sim \mbox{Multinomial}(\theta_d)$.  -->
<!--     b. Randomly choose a term from the relevant multinomial distribution over the terms for that topic, $w_{d,n} \sim \mbox{Multinomial}(\beta_{z_{d,n}})$. -->

By way of background, the Dirichlet distribution\index{distribution!Dirichlet} is a multivariate generalization of the beta distribution that is commonly used as a prior for categorical and multinomial variables.\index{distribution!beta} If there are just two categories, then the Dirichlet and the beta distributions are the same. In the special case of a symmetric Dirichlet distribution with $\eta=1$, it is equivalent to a uniform distribution. If $\eta<1$, then the distribution is sparse and concentrated on a smaller number of the values, and this number decreases as $\eta$ decreases. A hyperparameter, in this usage, is a parameter of a prior distribution.
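The effect of $\eta$ can be seen by simulation. The following is a minimal sketch in base R that draws from a symmetric Dirichlet distribution by normalizing independent gamma draws, which is a standard way to sample from it. The function name `draw_dirichlet()` and the choices of ten categories and $\eta$ values of 0.1 and 10 are illustrative assumptions.

```{r}
set.seed(853)

# Illustrative helper: one draw from a symmetric Dirichlet distribution,
# obtained by normalizing independent gamma draws
draw_dirichlet <- function(number_of_categories, eta) {
  gamma_draws <- rgamma(n = number_of_categories, shape = eta, rate = 1)
  gamma_draws / sum(gamma_draws)
}

sparse_draw <- draw_dirichlet(number_of_categories = 10, eta = 0.1)
even_draw <- draw_dirichlet(number_of_categories = 10, eta = 10)

round(sparse_draw, 2)
round(even_draw, 2)
```

With a small $\eta$ we would typically expect most of the probability to sit on one or two categories, while with a large $\eta$ the probabilities tend to be close to equal.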

<!-- Given this set-up, the joint distribution for the variables is [@blei2012, p.6]: -->
<!-- $$p(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N}, w_{1:D, 1:N}) = \prod^{K}_{i=1}p(\beta_i) \prod^{D}_{d=1}p(\theta_d) \left(\prod^N_{n=1}p(z_{d,n}|\theta_d)p\left(w_{d,n}|\beta_{1:K},z_{d,n}\right) \right).$$ -->

<!-- Based on this document generation process the analysis problem, discussed in the next section, is to compute a posterior over $\beta_{1:K}$ and $\theta_{1:D}$, given $w_{1:D, 1:N}$. This is intractable directly, but can be approximated [@GriffithsSteyvers2004; @blei2012]. -->

After the documents are created, they are all that we can analyze. The term usage in each document is observed, but the topics are hidden, or "latent". We do not know the topics of each document, nor how terms define the topics. That is, we do not know the probability distributions of @fig-topicsoverdocuments or @fig-topicsoverterms. In a sense we are trying to reverse the document generation process: we have the terms, and we would like to discover the topics.
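
It can help to simulate the document generation process that we are trying to reverse. The following is a sketch under stated assumptions: the nine-term vocabulary, the three topics, the five documents of 20 terms each, and the hyperparameter values are all invented for illustration, and `rdirichlet_sketch()` is a hypothetical helper rather than a function from a package.

```{r}
#| echo: true
#| eval: false

set.seed(853)

# Hypothetical helper: one draw from a symmetric Dirichlet
rdirichlet_sketch <- function(K, concentration) {
  draw <- rgamma(K, shape = concentration, rate = 1)
  draw / sum(draw)
}

# Assumed toy settings
vocabulary <- c("tax", "budget", "deficit", "health", "hospital",
                "nurse", "pipeline", "oil", "climate")
number_of_topics <- 3
number_of_documents <- 5
terms_per_document <- 20

# Each topic is a distribution over the terms (one row per topic)
topic_term_probs <-
  t(replicate(number_of_topics,
              rdirichlet_sketch(length(vocabulary), concentration = 0.1)))

# Each document is a distribution over the topics (one row per document)
doc_topic_probs <-
  t(replicate(number_of_documents,
              rdirichlet_sketch(number_of_topics, concentration = 0.5)))

# For each term in each document, draw a topic and then a term from it
simulated_documents <-
  sapply(seq_len(number_of_documents), function(d) {
    topics <- sample(number_of_topics, size = terms_per_document,
                     replace = TRUE, prob = doc_topic_probs[d, ])
    terms <- sapply(topics, function(k) {
      sample(vocabulary, size = 1, prob = topic_term_probs[k, ])
    })
    paste(terms, collapse = " ")
  })

simulated_documents
```

In practice we only observe something like `simulated_documents`, and we try to recover estimates of `topic_term_probs` and `doc_topic_probs`.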

If we observe the terms in each document, then we can obtain estimates of the topics [@SteyversGriffiths2006]. The outcomes of the LDA process are probability distributions. It is these distributions that define the topics. Each term will be given a probability of being a member of a particular topic, and each document will be given a probability of being about a particular topic.

<!-- That is, we are trying to calculate the posterior distribution of the topics given the terms observed in each document (@blei2012, p.7):  -->
<!-- $$p(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N} | w_{1:D, 1:N}) = \frac{p\left(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N}, w_{1:D, 1:N}\right)}{p(w_{1:D, 1:N})}.$$ -->

The initial practical step when implementing LDA given a corpus of documents is usually to remove stop words, although, as mentioned earlier, this is not necessary and may be better done after the groups are created. We often also remove punctuation and capitalization. We then construct our document-feature matrix using `dfm()` from `quanteda`.

After the dataset is ready, `stm` can be used to implement LDA and approximate the posterior.
<!-- It does this using Gibbs sampling or the variational expectation-maximization algorithm. Following @SteyversGriffiths2006 and @Darling2011, the Gibbs sampling  -->
The process attempts to find a topic for a particular term in a particular document, given the topics of all other terms for all other documents. Broadly, it does this by first assigning every term in every document to a random topic, specified by Dirichlet priors.
<!-- with $\alpha = \frac{50}{K}$ and $\eta = 0.1$  (@SteyversGriffiths2006 recommends $\eta = 0.01$), where $\alpha$ refers to the distribution over topics and $\eta$ refers to the distribution over terms (@Grun2011, p.7).  -->
It then selects a particular term in a particular document and assigns it to a new topic based on the conditional distribution where the topics for all other terms in all documents are taken as given [@Grun2011, p.6].
<!-- $$p(z_{d, n}=k | w_{1:D, 1:N}, z'_{d, n}) \propto \frac{\lambda'_{n\rightarrow k}+\eta}{\lambda'_{.\rightarrow k}+V\eta} \frac{\lambda'^{(d)}_{n\rightarrow k}+\alpha}{\lambda'^{(d)}_{-i}+K\alpha} $$ -->
<!-- where $z'_{d, n}$ refers to all other topic assignments; $\lambda'_{n\rightarrow k}$ is a count of how many other times that term has been assigned to topic $k$; $\lambda'_{.\rightarrow k}$ is a count of how many other times that any term has been assigned to topic $k$; $\lambda'^{(d)}_{n\rightarrow k}$ is a count of how many other times that term has been assigned to topic $k$ in that particular document; and $\lambda'^{(d)}_{-i}$ is a count of how many other times that term has been assigned in that document.  -->
Once this has been estimated, then estimates for the distribution of words into topics and topics into documents can be backed out.

The conditional distribution assigns topics depending on how often a term has been assigned to that topic previously, and how common the topic is in that document [@SteyversGriffiths2006]. The initial random allocation of topics means that the results of early passes through the corpus of documents are poor, but given enough time the algorithm converges to an appropriate estimate.
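
The process just described can be sketched in a few lines of base R. This is a toy collapsed Gibbs sampler written only to make the steps concrete. It is not the code that `stm` runs, and the five short documents, the two topics, the hyperparameter values, and the 200 passes are all assumptions made for illustration.

```{r}
#| echo: true
#| eval: false

set.seed(853)

# Assumed toy corpus: each document is a vector of terms
documents <- list(
  c("tax", "budget", "deficit", "tax", "budget"),
  c("budget", "deficit", "tax", "deficit"),
  c("hospital", "nurse", "health", "nurse"),
  c("health", "hospital", "nurse", "health", "hospital"),
  c("tax", "health", "budget", "hospital")
)
vocabulary <- sort(unique(unlist(documents)))
V <- length(vocabulary)
D <- length(documents)
K <- 2
alpha <- 0.5
eta <- 0.1

# Store terms as positions in the vocabulary
term_ids <- lapply(documents, match, table = vocabulary)

# Start by assigning every term in every document to a random topic
assignments <-
  lapply(term_ids, function(ids) sample(K, length(ids), replace = TRUE))

# Count how often each term is in each topic, and each topic in each document
topic_term <- matrix(0, nrow = K, ncol = V)
doc_topic <- matrix(0, nrow = D, ncol = K)
for (d in seq_len(D)) {
  for (n in seq_along(term_ids[[d]])) {
    k <- assignments[[d]][n]
    v <- term_ids[[d]][n]
    topic_term[k, v] <- topic_term[k, v] + 1
    doc_topic[d, k] <- doc_topic[d, k] + 1
  }
}

for (pass in 1:200) {
  for (d in seq_len(D)) {
    for (n in seq_along(term_ids[[d]])) {
      k_old <- assignments[[d]][n]
      v <- term_ids[[d]][n]
      # Take this term out of the counts so that all others are as given
      topic_term[k_old, v] <- topic_term[k_old, v] - 1
      doc_topic[d, k_old] <- doc_topic[d, k_old] - 1
      # How often the term is in each topic, times how common each topic
      # is in this document
      weights <-
        (topic_term[, v] + eta) / (rowSums(topic_term) + V * eta) *
        (doc_topic[d, ] + alpha)
      k_new <- sample(K, size = 1, prob = weights)
      assignments[[d]][n] <- k_new
      topic_term[k_new, v] <- topic_term[k_new, v] + 1
      doc_topic[d, k_new] <- doc_topic[d, k_new] + 1
    }
  }
}

# Back out the estimated distributions from the final counts
terms_in_topics <- (topic_term + eta) / (rowSums(topic_term) + V * eta)
colnames(terms_in_topics) <- vocabulary
topics_in_documents <- (doc_topic + alpha) / (rowSums(doc_topic) + K * alpha)

round(terms_in_topics, 2)
round(topics_in_documents, 2)
```

Each row of `terms_in_topics` is a distribution over the vocabulary for one topic, and each row of `topics_in_documents` is a distribution over the topics for one document.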

The choice of the number of topics, $K$, affects the results, and must be specified *a priori*. If there is a strong reason for a particular number, then this can be used. Otherwise, one way to choose an appropriate number is to use a test and training set process. Essentially, this means running the process on a variety of possible values for $K$ and then picking a value that performs well on the test set.
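
One way to do this comparison is with `searchK()` from `stm`, which holds out part of the documents and reports diagnostics, including held-out likelihood, for each candidate number of topics. The following is a sketch under stated assumptions: it uses `data_corpus_inaugural`, a small corpus that comes with `quanteda`, as a stand-in so that it runs quickly, and the candidate values of three, five, and eight are arbitrary.

```{r}
#| echo: true
#| eval: false

library(quanteda)
library(stm)

# Stand-in corpus (an assumption): US presidential inaugural addresses
inaugural_dfm <-
  data_corpus_inaugural |>
  tokens(remove_punct = TRUE, remove_symbols = TRUE) |>
  dfm() |>
  dfm_trim(min_termfreq = 2, min_docfreq = 2) |>
  dfm_remove(stopwords(source = "snowball"))

# searchK() needs the documents and vocabulary in the format used by stm
inaugural_stm <- convert(inaugural_dfm, to = "stm")

set.seed(853)

candidate_k <- c(3, 5, 8)

k_search <-
  searchK(
    documents = inaugural_stm$documents,
    vocab = inaugural_stm$vocab,
    K = candidate_k
  )

# One row of diagnostics per candidate number of topics
k_search$results
```

These diagnostics are a guide rather than a decision rule, and we should still look at whether the topics themselves are interpretable.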

One weakness of the LDA method is that it considers a "bag of words" where the order of those words does not matter [@blei2012]. It is possible to extend the model to reduce the impact of the bag-of-words assumption and add conditionality to word order. Additionally, alternatives to the Dirichlet distribution can be used to extend the model to allow for correlation between topics.

### What is talked about in the Canadian parliament?

Following the example of the British, the written record of what is said in the Canadian parliament is called "Hansard". It is not completely verbatim, but is very close. It is available in CSV format from [LiPaD](https://www.lipad.ca), which was constructed by @BeelenEtc2017.\index{Hansard}\index{Canada!Hansard}

We are interested in what was talked about in the Canadian parliament in 2018.\index{Canada!parliament}\index{political science} To get started we can download the entire corpus from [here](https://www.lipad.ca/data/), and then discard all of the years apart from 2018. If the datasets are in a folder called "2018", we can use `read_csv()` to read and combine all the CSVs.

```{r}
#| echo: true
#| eval: false

files_of_interest <-
  dir_ls(path = "2018/", glob = "*.csv", recurse = 2)

hansard_canada_2018 <-
  read_csv(
    files_of_interest,
    col_types = cols(
      basepk = col_integer(),
      speechdate = col_date(),
      speechtext = col_character(),
      speakerparty = col_character(),
      speakerriding = col_character(),
      speakername = col_character()
    ),
    col_select =
      c(basepk, speechdate, speechtext, speakername, speakerparty,
        speakerriding)) |>
  filter(!is.na(speakername))

hansard_canada_2018
```

```{r}
#| echo: false
#| eval: true

files_of_interest <-
  dir_ls(
    path = "inputs/data/2018/",
    glob = "*.csv",
    recurse = 2
  )

hansard_canada_2018 <-
  read_csv(
    files_of_interest,
    col_types = cols(
      basepk = col_integer(),
      speechdate = col_date(),
      speechtext = col_character(),
      speakerparty = col_character(),
      speakerriding = col_character(),
      speakername = col_character()
    ),
    col_select = c(
      basepk,
      speechdate,
      speechtext,
      speakername,
      speakerparty,
      speakerriding
    )
  ) |>
  filter(!is.na(speakername))

hansard_canada_2018
```

The use of `filter()` at the end is needed because sometimes aspects such as "directions" and similar non-speech aspects are included in the Hansard. For instance, if we do not include that `filter()` then the first line is "The House resumed from November 9, 2017, consideration of the motion." We can then construct a corpus.

```{r}
hansard_canada_2018_corpus <-
  corpus(hansard_canada_2018,
         docid_field = "basepk",
         text_field = "speechtext")

hansard_canada_2018_corpus
```

We use the tokens in the corpus to construct a document-feature matrix. To make our life a little easier, computationally, we remove any word that does not occur at least twice, and any word that does not occur in at least two documents.

```{r}
#| message: false
#| warning: false

hansard_dfm <-
  hansard_canada_2018_corpus |>
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE
  ) |>
  dfm() |>
  dfm_trim(min_termfreq = 2, min_docfreq = 2) |>
  dfm_remove(stopwords(source = "snowball"))

hansard_dfm
```
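
To see what the trimming does, we can apply the same `dfm_trim()` call to a tiny corpus. This is an illustrative sketch only: the three "speeches" are made up for the example.

```{r}
#| echo: true
#| eval: false

library(quanteda)

# Made-up speeches (an assumption for illustration)
toy_dfm <-
  c(
    speech_1 = "tax tax budget",
    speech_2 = "tax health",
    speech_3 = "pipeline pipeline"
  ) |>
  tokens() |>
  dfm()

featnames(toy_dfm)

toy_dfm_trimmed <-
  dfm_trim(toy_dfm, min_termfreq = 2, min_docfreq = 2)

# Only "tax" occurs at least twice and in at least two speeches
featnames(toy_dfm_trimmed)
```

Here "budget" and "health" are dropped because they occur only once, and "pipeline" is dropped because, although it occurs twice, it occurs in only one speech.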

At this point we can use `stm()` from `stm` to implement an LDA model.\index{latent Dirichlet allocation} We need to specify a document-feature matrix and the number of topics. Topic models are essentially just summaries. Instead of a document being a collection of words, it becomes a collection of topics with some probability associated with each topic. But because the model only identifies collections of words that tend to be used together, rather than any actual underlying meaning, it cannot decide for itself how many topics there are, and so we need to specify the number of topics that we are interested in. This decision will have a big impact, and we should consider a few different numbers.

```{r}
#| echo: true
#| eval: false

hansard_topics <- stm(documents = hansard_dfm, K = 10)

beepr::beep()

write_rds(
  hansard_topics,
  file = "hansard_topics.rda"
)
```

```{r}
#| echo: false
#| eval: false

# INTERNAL

hansard_topics <- stm(documents = hansard_dfm, K = 10)

beep()

write_rds(
  hansard_topics,
  file = "outputs/hansard_topics.rda"
)
```

This will take some time, likely 15-30 minutes, so it is useful to use `beep()` from `beepr` to get a notification when the model has finished, and then to save it using `write_rds()`. We could then read the results back in with `read_rds()`.

```{r}
#| echo: true
#| eval: false

hansard_topics <- read_rds(
  file = "hansard_topics.rda"
)
```

```{r}
#| echo: false
#| eval: true

hansard_topics <- read_rds(
  file = "outputs/hansard_topics.rda"
)
```

We can look at the words in each topic with `labelTopics()`.

::: {.content-visible when-format="pdf"}
```{r}
#| message: false
#| eval: false

labelTopics(hansard_topics)
```
:::

::: {.content-visible unless-format="pdf"}
```{r}
labelTopics(hansard_topics)
```
:::
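
As well as the words in each topic, we may be interested in the topics of each document. A fitted `stm` model stores the document-topic proportions in its `theta` element, with one row per document and one column per topic. The following is a minimal sketch: `most_likely_topic()` is a hypothetical helper written for this illustration, and `toy_theta` is a made-up matrix used to show what the helper returns.

```{r}
#| echo: true
#| eval: false

# Hypothetical helper: for each document, find the topic with the highest
# estimated proportion
most_likely_topic <- function(theta) {
  data.frame(
    document = seq_len(nrow(theta)),
    topic = max.col(theta, ties.method = "first"),
    proportion = apply(theta, 1, max)
  )
}

# Made-up proportions for two documents and three topics
toy_theta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.1, 0.8),
  nrow = 2,
  byrow = TRUE
)

toy_summary <- most_likely_topic(toy_theta)
toy_summary

# With the fitted model this would be:
# most_likely_topic(hansard_topics$theta)
```

A document's most likely topic is only a summary, and it is often worth looking at the full row of proportions, because a speech can be about several topics.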


## Exercises

### Practice {.unnumbered}

1. *(Plan)* Consider the following scenario: *You run a news website and are trying to understand whether to allow anonymous comments. You decide to do an A/B test, where you keep everything the same, but only allow anonymous comments on one version of the site. All you will have to inform your decision is the text data that you obtain from the test.* Please sketch out what that dataset could look like and then sketch a graph that you could build to show all observations.
2. *(Simulate)* Please further consider the scenario described and simulate the situation. Please include at least ten tests based on the simulated data.
3. *(Acquire)* Please describe one possible source of such a dataset.
4. *(Explore)* Please use `ggplot2` to build the graph that you sketched. Use `rstanarm` to build a model.
5. *(Communicate)* Please write two paragraphs about what you did.

### Quiz {.unnumbered}

1. Which argument to `str_replace_all()` would remove punctuation?
    a.  "[:punct:]"
    b. "[:digit:]"
    c. "[:alpha:]"
    d. "[:lower:]"
2. Change `stopwords(source = "snowball")[1:10]` to find the ninth stopword in the "nltk" list.
    a. "her"
    b. "my"
    c.  "you"
    d. "i"
3. Which function from `quanteda` will tokenize a corpus?
    a. `tokenizer()`
    b. `token()`
    c. `tokenize()`
    d.  `tokens()`
4. Which argument to `dfm_trim()` should be used if we want to only include terms that occur at least twice?
    a. "min_wordfreq"
    b.  "min_termfreq"
    c. "min_term_occur"
    d. "min_occurrence"
5. What is your favorite example of a tri-gram?
6. What is the second-most common word used in the zodiac signs for Cancer?
    a. to
    b. your
    c. the
    d.  you
7. What is the sixth-most common word used in the zodiac signs for Pisces, that is unique to that sign?
    a. shoes
    b.  prayer
    c. fishes
    d. pisces
8. Re-run the Canadian topic model, but only including five topics. Looking at the words in each topic, how would you describe what each of them is about?


### Class activities {.unnumbered}

- Do children learn "dog", "cat", or "bird" first? Use the Wordbank database.

### Task {.unnumbered}

Please follow the code of @hvitfeldt2021supervised in *Supervised Machine Learning for Text Analysis in R*, Chapter 5.2 "Understand word embeddings by finding them yourself", freely available [here](https://smltar.com/embeddings.html), to implement your own word embeddings for one year's worth of data from [LiPaD](https://www.lipad.ca).