---
title: "Lab 2 Final Report"
author: "Lucas Brossi, Amar Chatterjee, Daniel Chow, Sandip Panesar"
date: "12/08/2020"
output:
pdf_document: default
word_document: default
html_document:
df_print: paged
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r library imports, include = FALSE}
library(tidyverse)
library(lmtest)
library(sandwich)
library(stargazer)
library(knitr)
library(GGally)
library(naniar)
library(gridExtra)
```
```{r data schema, include = FALSE}
schema <- cols(
state = "c",
cases_total = "i",
cases_last_7_days = "i",
case_rate = "n",
case_rate_last_7_days = "n",
deaths_total = "i",
deaths_last_7_days = "i",
death_rate = "n",
death_rate_last_7_days = "n",
tests_total = "i",
tests_positive = col_factor(
levels = c("0-5%", "6-10%", "11-20%"),
ordered = TRUE
),
test_rate = "i",
white_cases = "i",
white_pop = "i",
black_cases = "i",
black_pop = "i",
hispanic_cases = "i",
hispanic_pop = "i",
other_cases = "i",
other_pop = "i",
white_deaths = "i",
black_deaths = "i",
hispanic_deaths = "i",
other_deaths = "i",
emerg_date = col_date(format = "%d/%m/%Y"),
beg_bus_close_date = col_date(format = "%d/%m/%Y"),
end_bus_close_date = col_date(format = "%d/%m/%Y"),
bus_close_days = "i",
beg_shelter_date = col_date(format = "%d/%m/%Y"),
end_shelter_date = col_date(format = "%d/%m/%Y"),
shelter_days = "i",
mask_date = col_date(format = "%d/%m/%Y"),
mask_use = "l",
mask_legal = "l",
beg_maskbus_date = col_date(format = "%d/%m/%Y"),
end_maskbus_date = col_date(format = "%d/%m/%Y"),
maskbus_use = "l",
gov_party = col_factor(
levels = c("R", "D"),
ordered = FALSE
),
pop_dens = "n",
pop_total = "i",
pre_cond_total = "i",
serious_illness_pct = "n",
all_cause_deaths_total = "i",
homeless_total = "i",
medicaid_pct = "i",
life_expectancy = "n",
unemployment_rate = "n",
poverty_rate = "n",
weekly_UI_max_amount = "i",
household_income = "i",
age_0_18 = "i",
age_19_25 = "i",
age_26_34 = "i",
age_35_54 = "i",
age_55_64 = "i",
age_65 = "i",
mob_RR = "i",
mob_GP = "i",
mob_PK = "i",
mob_TS = "i",
mob_WP = "i",
mob_RS = "i"
)
```
```{r csv import, include = FALSE}
df <- read_delim(
file = "covid_19_clean.csv",
delim = ";",
col_names = TRUE,
col_types = schema,
na = ""
)
```
# Introduction
As of October 2020, more than 10 million Americans have been infected with the novel coronavirus (COVID-19) and more than 240,000 have perished. Currently, the United States leads the globe in the number of cases and the number of deaths.
A relatively simple way to control spread of this respiratory pathogen is with personal protective equipment, specifically face masks. When combined with social distancing face masks are an effective way of reducing spread within communities, workplaces and other public areas. In the absence of sophisticated testing, containment, and contact tracing techniques, the adoption of face masks in the United States has become an essential strategic component in the fight against COVID-19. While it is hard to quantify the exact efficacy, mask use not only reduces the chance of the wearer becoming infected, but also reduces the potential for transmission of virus-containing respiratory secretions to others (see diagram below).
On April 3, 2020, the Centers for Disease Control and Prevention (CDC) issued an official recommendation advising all persons to wear a cloth face mask or covering in public to help slow the spread of the coronavirus. Many states went on to enact policies requiring people to wear face masks at all times in public settings; however, some states did not, despite mounting evidence in favor of such an order.
![](images/WearMask_3.png){width=85%}
Despite all of these recommendations, the use of face masks has become politicized and undermined by large swaths of the country’s population. Conflicting messages and the spread of misinformation have resulted in a loss of credibility and trust in the CDC. As such, some states, counties, districts and individuals continue to ignore scientific advice by neglecting to enforce, or abide by, mask use mandates. This may be one of the reasons behind the seemingly unimpeded rise of COVID-19 case rates in the US when, in comparison, other countries with stricter regulations (and better public compliance) have managed to control disease spread.
Accordingly, as a team we have decided to leverage the provided dataset in order to inform public health guidance and answer the following question:
**Does the implementation of a mandatory face mask policy for all individuals aid in reducing the case rate of COVID-19 in the United States?**
We hypothesize that a face mask mandate does indeed have a measurable and causal impact on containing the spread of COVID-19, even when taking into account age and socioeconomic demographics, social distancing adherence, and other related policies. This is visually represented in the causal diagram below. Our measurement goal is to assess the statistical and practical significance of statewide mandatory face mask usage policies in reducing the COVID-19 case rate. We will methodically include incremental covariates from the aforementioned categories in our regression models, in an effort to isolate the portion of case rate variability actually explained by the implementation of a mandatory face mask policy. These other covariates, while important, will help absorb some of the “noise” not specifically associated with the implementation of a face mask policy.
![](images/intro_causal_diagram.png)
\newpage
## Data
The COVID-19 dataset is an aggregation of data from a variety of sources and updated at various dates. The final dataset with its most recent update is shown in the table below:
| Data | Source | Last Updated |
|-------------------------------------|----------------------------------|--------------|
| COVID-19 Cases, Deaths, and Testing | Centers for Disease Control and Prevention (CDC) | 10/30/2020 |
| COVID-19 Cases by Race/Ethnicity | Kaiser Family Foundation (KFF) | 10/25/2020 |
| COVID-19 Deaths by Race/Ethnicity | KFF | 10/25/2020 |
| Policies | Raifman J, et. al. | 10/26/2020 |
| State Characteristics | CDC, KFF, and Raifman | N/A |
| Demographics | KFF | 2018 |
| Human Mobility | Google | 10/23/2020 |
The Google Human mobility data is updated daily, and includes information on the amount of time spent at various public locations compared to baseline data. Below are the adjustments made to variables that were either created or supplemented.
There are a total of 6 data types in the dataset: character, numeric, integer, factor, dates, and logical. Any variable with “date” in the name is read in as a date. Logical variables include 'mask_use,' 'mask_legal,' and 'maskbus_use.' Variables read in as factors include 'gov_party,' and 'tests_positive.' The only character variable is the state name. Finally, all other variables are read in as either numeric or an integer. All numeric and integer values are real and nonnegative.
Some of the data was missing from the initial table and was manually populated using an external website$^1$ so we could calculate the number of days of shelter in place. Although some of the states had not yet ended their shelter in place policies, we chose 10/30/2020 as the absolute cutoff for our analysis. The table on the next page shows the data populated manually (empty cells were left as is):
| State | beg_shelter_date | end_shelter_date |
|-------------|------------------|------------------|
| California | | 10/30/2020 |
| Connecticut | 3/23/2020 | |
| New Mexico | | 10/30/2020 |
| Oklahoma | 3/30/2020 | |
| Texas | 4/2/2020 | |
### Variable Operationalization
**Mask Use**
This binary/logical variable was created by assigning a 1 if the state had a mask mandate, and 0 if it did not (based on the mask mandate for all individuals date column).
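A minimal sketch of this derivation (not evaluated here, since `mask_use` is already present in the cleaned CSV; treating `mask_date` as the source column is an assumption) would be:
```{r mask use derivation sketch, eval = FALSE}
# Hypothetical derivation: TRUE if a statewide mandate date exists, FALSE otherwise
df <- df %>%
  mutate(mask_use = !is.na(mask_date))
```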
**Percent Age Below 25**
This column was created by combining the 0-18 and 19-25 age groups (rationale discussed in the Model 2 section). No other adjustments were made.
**Days in Shelter-in-Place**
The number of days each state was under the Shelter-in-Place (SIP) mandate. There was some missing data, which was supplemented from external sources (outlined above). The column was created by subtracting the start date from the end date.
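The calculation is equivalent to the following sketch (not evaluated here, since `shelter_days` is already present in the cleaned CSV):
```{r shelter days derivation sketch, eval = FALSE}
# Hypothetical derivation: whole days between the start and end of the SIP order
df <- df %>%
  mutate(shelter_days = as.integer(end_shelter_date - beg_shelter_date))
```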
**Days Businesses Closed**
The number of days each state closed non-essential businesses. No other adjustments were made.
**Percentage of Population: Black**
We interpreted observations of "< 0.01" to mean that the percentage of the population was negligible for the practical purposes of our analysis. We subsequently rounded these down to 0 so they could be treated as numeric.
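A sketch of this cleaning step (illustrative values only; the raw column is assumed to arrive as character before this treatment) might look like:
```{r black pop cleaning sketch, eval = FALSE}
# Hypothetical cleaning: treat "< 0.01" entries as 0, then parse as numeric
raw_black_pop <- c("12", "< 0.01", "32")   # illustrative values only
black_pop_clean <- as.numeric(if_else(str_detect(raw_black_pop, "<"), "0", raw_black_pop))
black_pop_clean
```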
## Model 1
### Objective
Model 1 is the simplest model. It is aimed at measuring the strength of the relationship between the presence of a mandatory mask use policy and the COVID-19 case rate per 100,000 in a state. It has no other covariates, with the exception of test rate, which we include as a way to control for the impact that different test availabilities might have on the case rate by state (rationalized and discussed further below).
```{r initialize model 1}
df_mod1 <- df %>%
select(case_rate, mask_use, test_rate)
```
### Exploratory Data Analysis
First, we assessed the data provided for the variables of interest. This analysis specifically pertained to detection of spurious values and missing data. We discuss the distribution of the data in the following dedicated sections.
#### Case Rate
For the main outcome variable, we can see that there are no missing values in the columns of interest:
```{r fig.height = 3, fig.width = 5}
vis_miss(df %>% select(case_rate)) + theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
```
We can then analyze the summary statistics for the case rate from which we can see there are no negative or observably spurious values:
```{r summary stats model 1}
summary(df_mod1$case_rate)
```
Together with the summary statistics presented in the previous section, we can further isolate the case rate variable and visualize its distribution:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df %>%
ggplot(aes(x = case_rate)) +
geom_histogram(bins = 50) +
labs(title = "Histogram of Case Rate per 100,000",
y = "Count",
x = "Case Rate per 100,000")
```
From the above histogram, we can see that the case rate is approximately normally distributed, though there are some states that are obvious outliers, with very high and very low case rates per 100,000.
Next, we can visually assess the case rate across the 50 states by plotting a bar graph:
```{r include=FALSE}
get_quantiles <- function(column) {
quantiles <- data.frame(
name = factor(c("3rd Quartile", "Mean", "1st Quartile"), levels = c("3rd Quartile", "Mean", "1st Quartile")),
value = c(quantile(column, 0.75), mean(column), quantile(column, 0.25))
)
return(quantiles)
}
```
```{r warning=FALSE, fig.height = 4, message=FALSE}
quantiles <- get_quantiles(df$case_rate)
df %>%
ggplot(aes(x = state, y = case_rate)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_hline(data = quantiles, aes(yintercept = value, color = name)) +
labs(title = "Case Rate Across 50 States",
y = "Case Rate/100,000",
x = "State",
color = "Legend")
```
We can see that a substantial number of states have case rates below the mean (2,749); for reference, the 1st and 3rd quartiles are 2,040 and 3,516, respectively.
We now look at the correlation between case rate and our causal variables of interest.
#### Case Rate vs. Mandatory Mask Policy
Before conducting a comparison between the dependent and main independent variable, we must ensure that there are no spurious or missing data in the mandatory mask policy category. As we can see below, there is no missing data for this variable:
```{r fig.height = 3, fig.width = 5}
vis_miss(df %>% select(mask_use)) + theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
```
As the mask use policy variable is binary, we only need to ensure that there are 51 observations (51 as the District of Columbia is included in the dataset):
```{r}
summary(df_mod1$mask_use)
```
We can also see that 1/3 of all the states surveyed did not have a mask use policy in place by October 30th, 2020.
Based on our hypothesis, implementation of a statewide mask-use policy should lead to a decrease in the number of COVID cases by preventing their spread. From the box plot comparison below we can visually discern an apparent negative association between the presence of a mask use policy and case rates. This broadly falls in line with our hypothesis, as we can see that the mean case rate in the 'True' (i.e. enforced mask policy) category (2,473/100,000) is substantially lower than that of the 'False' (i.e. no enforced mask policy) category (3,303/100,000):
```{r warning=FALSE, message=FALSE, fig.height = 4}
df_mod1 %>%
ggplot(aes(y = case_rate, x = mask_use)) +
geom_boxplot() +
labs(
title = "Mandatory Mask Use vs Case Rate by State",
x = "Mandatory Mask Use",
y = "Case Rate per 100,000")
```
\newpage
#### Case Rate vs. Test Rate
First, we assess for any potential missing data for the test rate variable. As we can see from below, there is none:
```{r fig.height = 3, fig.width = 5}
vis_miss(df %>% select(test_rate)) + theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
```
Next, we ensure there are no spurious values by calculating summary statistics:
```{r}
summary(df_mod1$test_rate)
```
Together with the summary statistics presented previously, we can visualize the distribution of test rates using a histogram:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df %>%
ggplot(aes(x = test_rate)) +
geom_histogram(bins = 50) +
labs(title = "Histogram of Test Rate per 100,000",
y = "Count",
x = "Test Rate per 100,000")
```
From the above histogram, we can see that there is a definite positive skew to the data, meaning that the "worst performing" states had test rates approximately 1/3 of those of the "top performing" states in terms of tests administered per 100,000 of population. This is an interesting observation, and warrants a comparison between the test rates and case rates to see if there is any potential relationship, which we perform next.
An important control variable in our causal models is the test rate per 100,000 of the population: We hypothesize that the more tests are performed, the higher the overall case rate. Though this seems like an obvious assumption, the issue is particularly complex, and the relationship not as straightforward as initially apparent$^2$. Consequently, we believe that including the test rate per 100,000 is potentially an important control variable in our model:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df_mod1 %>%
ggplot(aes(y = case_rate, x = test_rate)) +
geom_point() +
geom_smooth(method = "lm", level = 0) +
labs(
title = "Case Rate vs Test Rate by State",
x = "Test Rate per 100,000",
y = "Case Rate per 100,000")
```
The above diagram visualizes the relationship between test and case rates per 100,000. It confirms the simpler hypothesis of a positive relationship between case rate and number of tests. This relationship is slight, however (as evidenced by the slope of the line). Due to the prior reasoning and the demonstrable positive relationship between the variables, we choose to include test rate in all subsequent models.
### Causal Diagram for Model 1
Our initial assumptions have been validated by a cursory analysis and we are confident that the causal diagram below holds true. Colored in blue at the center is the main dependent variable, case rate per 100,000. In green, pointing to the case rate (signifying a causal effect on the dependent variable) is the main independent variable, mask use policy. In yellow is the test rate per 100,000 as the first control variable. Finally, in red is the error term that contains all other variables.
![](images/model1_causal_diagram.png)
### Model specification
The first regression model uses **COVID-19 Case Rate per 100,000** as the outcome variable and two covariates: the variable of interest (**Mandatory Mask Use**) and **Test Rate per 100,000**.
**Test Rate** has been included because there is an evident positive relationship between test rates and case rates. It should therefore be retained as a control variable in all of our models in order to prevent potential misattribution of its effect to other included variables.
Model 1 is subsequently defined as:
$$
\begin{aligned}
\mathrm{case\ rate\ per\ 100,000\ pop.} = \beta_0 + \beta_1(\mathrm{mandatory\ mask\ use\ policy}) + \beta_2(\mathrm{test\ rate\ per\ 100,000\ pop.})
\end{aligned}
$$
### Model summary
```{r model_1 summary, results = "asis", warning = FALSE}
model_1 <- lm(case_rate ~ mask_use + test_rate, data = df)
std_errors = sqrt(diag(vcovHC(model_1)))
stargazer(model_1, se = std_errors, title = "Model 1 Summary", header = FALSE)
```
\newpage
### Overall model significance (F-test)
This compares the null hypothesis, where $H_0: \beta_1 = \beta_2 = 0$, against an alternative hypothesis where $H_a: \beta_1 \ne 0$ or $\beta_2 \ne 0$ at a significance level of 0.05:
```{r model_1 F-test}
model_0 <- lm(case_rate ~ 1, data = df)
anova(model_0, model_1, test = "F")
```
The F-statistic is 7.416 and the p-value is < 0.01, with an adjusted R-squared of 0.204. From the F-test, we can reject the null hypothesis ($H_0$) in favor of the more complete model ($H_a$), which includes the covariates **Mandatory Mask Use** and **Test Rate per 100,000**. This is the base model we will henceforth build upon.
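The adjusted R-squared quoted above is reported in the model summary table; it can also be pulled directly from the fitted model object, as in this sketch:
```{r model 1 adj r squared sketch, eval = FALSE}
# Adjusted R-squared for Model 1 (quoted above as 0.204)
summary(model_1)$adj.r.squared
```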
### Coefficient significance (t-test)
In order to assess the performance of the model, we can look at the other model coefficients:
```{r model_1 coefficient test}
coeftest(model_1, vcovHC)
```
Under a significance level of 0.05, we reject the null hypothesis in favor of the alternative $H_{a}: \beta_1 \ne 0$, which means **Mandatory Mask Use** explains at least a part of the variability observed in the **COVID-19 Case Rate per 100,000**.
On the other hand, for **Test Rate per 100,000** we fail to reject the null hypothesis that $H_{0}: \beta_2 = 0$. It seems this variable does not absorb a significant part of the variability observed in the outcome, **COVID-19 Case Rate per 100,000**.
Our estimate for $\beta_1$ (the coefficient of our variable of interest) is $\tilde \beta_1 \approx-990.5$, with a standard error of ~324.8 and a p-value of 0.004.
### Practical significance
According to Model 1, enforcement of a mask use policy would be expected to reduce the case rate by ~990.5 cases/100,000 (roughly 1% of a state's population), *ceteris paribus*. Given that the median COVID-19 case rate among US states is 2,633 per 100,000, the coefficient estimate has practical significance, with an effect size corresponding to a reduction of 37.6% of the median case rate among states.
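The effect-size figure quoted above can be reproduced with a quick calculation (a sketch; the coefficient name `mask_useTRUE` assumes R's default naming for a logical predictor):
```{r model 1 effect size sketch, eval = FALSE}
# Mask use coefficient expressed as a share of the median state case rate (~0.376)
abs(coef(model_1)["mask_useTRUE"]) / median(df$case_rate)
```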
\newpage
## Model 2
### Objective
Model 2 is designed to be the optimal model, which strikes a balance between accuracy and parsimony. It is intended to reflect the best understanding of relationships between key variables. It includes the same covariates used in Model 1, in addition to new covariates related to structural demographics and behavioral differences between states that might partly correlate with the variability observed in the case rate.
It is difficult to make *a priori* assumptions regarding the variables we will utilize for Model 2. We hope to select one causal variable from each of the three broader categories that we believe align with our hypotheses regarding the factors influencing COVID-19 case rates. The broad categories include:
- Age demographics
- Socioeconomic demographics
- Social distancing adherence
Model selection will be based upon our EDA. For each one of these categories, we will look for variables that correlate more strongly with case rate, but without a high degree of collinearity with the other variables already included in the model.
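As a rough illustration of this screening idea (a sketch only, using an arbitrary subset of candidate columns; the pairs plots below do the actual work), candidate predictors can be ranked by their absolute correlation with the case rate:
```{r covariate correlation screening sketch, eval = FALSE}
# Rank a few candidate predictors by absolute correlation with case_rate
cors <- df %>%
  select(case_rate, age_0_18, age_19_25, black_pop, poverty_rate, mob_TS) %>%
  cor(use = "pairwise.complete.obs")
sort(abs(cors[, "case_rate"]), decreasing = TRUE)
```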
```{r initialize model 2}
df_race <- df %>%
select(state, case_rate, white_pop, black_pop, hispanic_pop, other_pop)
df_socio <- df %>%
select(state, case_rate, homeless_total, poverty_rate, household_income,
life_expectancy, unemployment_rate, black_pop)
df_dist <- df %>%
  select(state, case_rate, mob_RR, mob_GP, mob_PK, mob_TS, mob_WP, mob_RS)
df_age <- df %>%
select(state, case_rate, age_0_18, age_19_25, age_26_34, age_35_54, age_55_64, age_65)
df_socio$life_expectancy <- df_socio$life_expectancy/10
```
### Exploratory Data Analysis
#### Case Rate vs. Age Demographics
In order to select the most appropriate age groups (i.e. the percentage of a state's population in a particular age group) to include in our model, we must conduct a detailed exploratory analysis of the individual independent variables, their relationship with the dependent variable, and their relationships with one another.
First we ensure that there is no missing data in any of the age groups. The below figure confirms that there is no missing data:
```{r fig.height = 3, fig.width = 5}
df_age %>%
vis_miss() +
theme(axis.text.x = element_text(angle = 90))
```
Next we ensure there is no spurious data contained in any of the age categories. From below, we can see that all the categories look to be well behaved in terms of the data they contain:
```{r summary age}
df_age %>%
select(where(is.numeric)) %>%
summary()
```
Now that we have ensured the integrity of our data, we examine how each particular age group is related to the dependent variable and to each of the other age variables:
```{r warning=FALSE, fig.height = 6, message=FALSE}
ggpairs(df_age[,-1])
```
From the above scatterplot matrix we can see that the age groups that have the highest correlation with our dependent variable (case rate) are 19-25 (r ≈ 0.6) and 0-18 (r ≈ 0.5). These might both be valuable causal factors in terms of case rate and can be combined to enhance the model, while preventing the collinearity that would occur if the two variables were added separately (as they also strongly correlate with each other, r ≈ 0.5). We can subsequently look at our newly transformed "<25" category in more detail:
First we can plot a scatterplot to see what percentage of the population <25 exists in each state:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df_age$age_below_25 = df$age_0_18 + df$age_19_25
df$age_below_25 = df$age_0_18 + df$age_19_25
quantiles <- get_quantiles(df$age_below_25)
df %>%
ggplot(aes(x = state, y = age_below_25)) +
geom_point(stat="identity") +
ylim(0, 45) +
labs(title = "% of Population Under 25 Years Old Across States",
y = "%",
x = "State",
color = "Legend") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_hline(data = quantiles, aes(yintercept = value, color = name))
```
As we can see from the above, all of the states have similar proportions of their population under 25, with the interquartile range being relatively small, indicating a generally uniform age distribution across states.
We can subsequently move on to the comparison between the new 'age <25' category and the dependent variable:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df_age %>%
select(case_rate, age_below_25) %>%
ggplot(aes(y = case_rate, x = age_below_25)) +
geom_point() +
geom_smooth(method = "lm", level = 0) +
labs(
title = "Case Rate vs Percent Age Below 25",
x = "Percent Age Below 25",
y = "Case Rate per 100,000")
```
As we can see from the above, there is a definite positive relationship between the percent of people in the <25 age category and the case rate per 100,000. As such, it is unlikely that further transformation to this variable is required, and we can utilize it as a representation of an age demographic factor in our causal model moving forward.
#### Case Rate vs. Socioeconomic Demographics
It is well known and documented that COVID-19 has affected different ethnic groups differently, with figures demonstrating that case rates and death rates are higher among certain minority groups. This may be partially attributable to genetic factors. It may also be related to socioeconomic factors, such as poverty, inability to work from home, and education, among many other factors that are also related to race.
Again, we begin by confirming the data has been recorded properly and that there are no missing values:
```{r fig.height = 3, fig.width = 5}
df_race[,c(-1,-2)] %>%
vis_miss() +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))
```
Next we ensure there is no spurious data contained in any of the race categories. From below, we can see that all the categories look to be well behaved in terms of the data they contain:
```{r summary race}
df_race[,c(-1,-2)] %>%
select(where(is.numeric)) %>%
summary()
```
Now that we have ensured the integrity of our data, we examine how the percentage of each racial category in a state might be related to the dependent variable and to the other racial groups:
```{r warning=FALSE, fig.height = 6, message=FALSE}
ggpairs(df_race[,-1])
```
As we can see from the above scatterplot matrix, the relationships between case rate per 100,000 and the percentage of each racial group per state are generally weak; in fact, the majority show a negative relationship. Nevertheless, the strongest relationship in absolute value is positive and is demonstrated by the percentage of black population per state, which correlates with case rate with an r of ~0.3. We can visualize the black population as a percentage of state population using a bar graph:
```{r warning=FALSE, fig.height = 4, message=FALSE}
quantiles <- get_quantiles(df$black_pop)
df %>%
ggplot(aes(x = state, y = black_pop)) +
geom_bar(stat="identity") +
labs(title = "% Black Population",
y = "%",
x = "State",
color = "Legend") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_hline(data = quantiles, aes(yintercept = value, color = name))
```
As we can observe from the above graph, there is wide variation in the percentage of black population by state. Nevertheless, as linear regression does not require the predictors to be normally distributed, we can continue to utilize this variable.
As mentioned in the introduction, race may potentially act as a proxy for socioeconomic factors such as poverty, access to healthcare and education, among others$^3$. Consequently, if there is collinearity between the dedicated socioeconomic categories in the dataset and a particular racial category, including both as separate independent variables might cause problems with the model. As such, we have included the percentage of black population per state in the following scatterplot analysis of the relationship between various socioeconomic factors and case rate. Our reasoning is that we already know from above that the black population variable has the strongest correlation with case rate among the racial categories, and we want to study this relationship further.
First we assess for missing data in the socioeconomic categories we will be assessing:
```{r fig.height = 3, fig.width = 5}
df_socio[,c(-1,-2)] %>%
vis_miss() +
theme(axis.text.x = element_text(angle = 90))
```
Next we do an analysis to ensure there are no spurious values in the data for any of the variables:
```{r summary stats race}
df_socio[,c(-1,-2)] %>%
select(where(is.numeric)) %>%
summary()
```
We can identify a few missing values in the household income and life expectancy columns. Nevertheless, the rest of the variables look to be well behaved and contain summary values concordant with what one would expect for these data types.
Next we can create the scatterplot matrix for the socioeconomic variables of interest (plus the percentage of black population):
```{r warning=FALSE, fig.height = 6, message=FALSE}
ggpairs(df_socio[,-1])
```
From the pairs plot above, we see that poverty rate and household income have the highest absolute correlation with case rate per 100,000. They are comparable to the correlation of percentage of black population variable seen above. Additionally, the black population variable has high collinearity with household income, poverty rate, life expectancy, and unemployment rate. Because of this, the percent black population variable may act as a proxy that can control for many of these factors, while also being representative of racial factors.
We explore the bivariate relationship between case rate and percentage of black population below:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df_race %>%
ggplot(aes(y = case_rate, x = black_pop)) +
geom_point() +
geom_smooth(method = "lm", level = 0) +
labs(
title = "Case Rate vs % Black by state",
x = "% Black",
y = "Case Rate per 100,000")
```
The untransformed relationship shows a rather weak positive correlation with case rate. Moreover, the points seem to become more widely spaced as the value of X increases. Disregarding the straight regression line, we can observe a somewhat non-linear relationship between the data points. As such, this variable could benefit from transformation. Because some states have an effectively 0% black population (due to rounding down of the "<0.01" category), a plain log transformation cannot be applied (the log of 0 is undefined). To avoid this, each value is increased by 1 prior to the log transform:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df_race %>%
ggplot(aes(y = case_rate, x = log1p(black_pop))) +
geom_point() +
geom_smooth(method = "lm", level = 0) +
labs(
title = "Case Rate vs log plus 1 (% Black by state)",
x = "log plus 1(% Black)",
y = "Case Rate per 100,000")
```
The graph above demonstrates that the transformation distributes the points more evenly along the X axis, making the potential linear relationship more discernible.
#### Case Rate vs. Social Distancing Adherence
The Google Human mobility data includes information on the amount of time spent at various public locations compared to Google's baseline data. The values are recorded as percentage changes with possible values ranging from -100 to 100. In general this data is a proxy for adherence to social distancing regulations. We would expect to see reductions in mobility upon enforcement of shelter-in-place, work-from-home or quarantine orders. Moreover, we would expect to see drastic reductions at transit stations (e.g. busy commuter hubs). As social distancing measures are designed to reduce the transmission of COVID-19, we would expect there to be a positive relationship between mobility and case rate, i.e. reductions in mobility are linked to lower case rates and vice versa. Nevertheless, there are several different social distancing metrics to choose from, which we deal with in this section.
We can see that there are no missing data points in this category:
```{r fig.height = 3, fig.width = 5}
df_dist[,c(-1,-2)] %>%
vis_miss() +
theme(axis.text.x = element_text(angle = 90))
```
A look at the summary statistics fails to show any significant errors in the values:
```{r}
df_dist %>%
select(where(is.numeric)) %>%
summary()
```
Below is the pairs plot for the mobility data and case rate per 100,000:
```{r warning=FALSE, fig.height = 6, message=FALSE}
ggpairs(df_dist[,-1])
```
All the mobility data are highly collinear with one another, so for the model, we will select "mob_TS" (i.e. change in mobility at transit stations) as it has the highest correlation with case rate, with an r of 0.237. We can visualize the changes in mobility by State to see if there are any interesting trends worth exploring:
```{r warning=FALSE, fig.height = 4, message=FALSE}
quantiles <- get_quantiles(df$mob_TS)
df %>%
ggplot(aes(x = state, y = mob_TS)) +
geom_point(stat="identity") +
labs(title = "% Change in Mobility",
y = "% Change",
x = "State",
color = "Legend") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
geom_hline(data = quantiles, aes(yintercept = value, color = name))
```
As expected, the majority of states have witnessed a decline in mobility at their respective transit centers. In general, the data is congruent with expectations, for example the District of Columbia has seen the largest decrease in transit station mobility at >70%. Interestingly, there are 3 states that have seen increases, with Wyoming seeing a substantial increase in transit station mobility. Nevertheless, these trends are unlikely to affect our further analysis using the variable in our model.
The figure below shows the bivariate relationship between case rate variable and the change in mobility at transit stations:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df_dist %>%
ggplot(aes(y = case_rate, x = mob_TS)) +
geom_point() +
geom_smooth(method = "lm", level = 0) +
labs(
title = "Relationship Between Mobility: Transit Stations and Case Rate by State",
x = "Percentage Change in Mobility: Transit Stations",
y = "Case Rate per 100,000")
```
From the above figure, we can discern a clear positive relationship between the X and Y variables. As the points are approximately evenly distributed within the axes, there is no clear and explainable transformation that can be applied. The variable will be left as is for the second model.
### Causal Diagram for Model 2
The causal diagram for Model 2 more closely resembles the diagram from the introduction section. We have now explored the effect of age, race, socioeconomic factors, and mobility as they pertain to mask use policies and the COVID-19 case rate per 100,000. For the age category, the percentage of the population under 25 had the strongest correlation with case rate. The percentage of black population per state served as the strongest proxy to control for socioeconomic and racial characteristics. Using Google's mobility data as a proxy for adherence to social distancing measures, we found that the percentage change in traffic through transit stations had the strongest correlation with case rate per 100,000.
![](images/model2_causal_diagram.png)
### Model specification
The second regression model has **COVID-19 Case Rate per 100,000** as the outcome variable and 5 covariates: The variable of interest (**Mandatory Mask Use**), and 4 control variables: **Test Rate per 100,000**, **Percentage of Population Below 25 Years Old**, **Log of Percentage of Black Ethnicity in Total Population + 1**, and **Human Mobility Change in Transit Stations**.
The variable of interest continues to be **Mandatory Mask Use** and the primary measurement goal remains to assess the significance and practical impact of **Mandatory Mask Use** on the **COVID-19 Case Rate per 100,000**. We predict this model (with more control variables) to be better at capturing the actual significance and practical relevance of the **Mandatory Mask Use** on **Case Rate**.
Model 2 has the format:
$$
\begin{aligned}
\mathrm{case\ rate\ per\ 100,000\ pop.} = \beta_0 + \beta_1(\mathrm{mandatory\ mask\ use\ policy}) +
\beta_2(\mathrm{test\ rate\ per\ 100,000\ pop.}) + \\ \beta_3(\mathrm{\% \ pop.\ <25}) + \beta_4(\mathrm{\ln (\%\ black\ pop. + 1)}) + \beta_5(\mathrm{mobility\ \Delta\ transit\ stations})
\end{aligned}
$$
\newpage
### Model Summary
```{r model_2 summary, results = "asis", warning = FALSE}
model_2 <- lm(case_rate ~ mask_use + test_rate + age_below_25 +
log(black_pop + 1) + mob_TS, data = df)
std_errors = list(
sqrt(diag(vcovHC(model_1))),
sqrt(diag(vcovHC(model_2))))
stargazer(model_1, model_2, se = std_errors, title = "Model 2 Summary", header = FALSE)
```
\newpage
### Overall model significance (F-test)
This compares the null hypothesis, where $H_0:$ **Model 1**, against an alternative hypothesis where $H_a:$ **Model 2** at a significance level of 0.05:
```{r model_2 F-test}
anova(model_1, model_2, test = "F")
```
From this F-test, we can reject the null hypothesis (**Model 1**) in favor of our optimized **Model 2**, which now includes the covariates **Mandatory Mask Use**, **Test Rate per 100,000**, **% of Age <25**, **ln(% of Black Population + 1)**, and **% Mobility Change at Transit Stations**. The F-statistic is 19.733 and the p-value is < 0.01. Model 2 has an adjusted R-squared of 0.633.
### Coefficient significance (t-test)
Under a significance level of 0.05, we can accept all the alternative hypotheses: $H_{a1}: \beta_1 \ne 0$, $H_{a2}: \beta_2 \ne 0$, $H_{a3}: \beta_3 \ne 0$, $H_{a4}: \beta_4 \ne 0$, and $H_{a5}: \beta_5 \ne 0$. This means all 5 covariates help explain a part of the variability observed in the **COVID-19 Case Rate per 100,000**.
Our estimate for $\beta_1$ (the coefficient of our variable of interest) is $\tilde \beta_1 = -919.3$, with a standard error of 227.0 and a p-value of 0.0002. It continues to be statistically significant, with an estimated value that did not change much from **Model 1** (-990.5) to **Model 2** (-919.3).
```{r model_2 coefficient test}
coeftest(model_2, vcovHC)
```
### Practical significance
Model 2 suggests that enforcing a mandatory mask use policy would be expected to reduce the COVID-19 case rate by ~919.3 cases per 100,000, with all other variables held constant. Given that the median case rate across states is 2,633 per 100,000, the coefficient estimate has practical significance, with an effect size corresponding to a reduction of 34.9% of the median state case rate.
## Model 3
### Objective
Model 3 includes all the previous covariates plus several additional ones, erring on the side of inclusion. A key purpose of this model is to demonstrate the robustness of our coefficient for mandatory mask use. We do not expect model performance to increase significantly compared to Model 2.
Subsequently, we will include other COVID-related measures adopted by states, which might also contribute towards the variability of the dependent variable, i.e. case rate per 100,000. Moreover, some of these variables may be collinear with our primary independent variable, statewide mandated mask use, which may reduce the overall explanatory performance of our model. We want to verify that even under these harsh conditions our coefficient for mandatory mask use remains statistically significant, with a practical significance close to that of Model 2.
Model 3 therefore acts as an *acid test* to further validate Model 2 as the most optimized model.
```{r model 3 variable initialization}
df_mod3 <- df %>%
select(state, case_rate, bus_close_days, shelter_days, mask_legal, maskbus_use)
```
#### Case Rate vs. Other COVID Related Policies
The variables selected in the exploratory phase of Model 3 relate to other COVID-19-specific policies. We use the number of days businesses had to close, the duration of shelter-in-place, and whether there was legal enforcement of mask wearing.
From the summary statistics below, we can see that the numbers mostly make sense. There are no negative values, and the maximum number of days in shelter-in-place (counting back from October 30th) would have the shelter-in-place starting on March 19th, 2020, a full 8 days after the WHO declared COVID-19 a pandemic.
First we can assess the integrity of our data, and confirm that there are no missing datapoints:
```{r fig.height = 3, fig.width = 5}
df_mod3[,c(-1,-2)] %>%
vis_miss() +
theme(axis.text.x = element_text(angle = 90))
```
Next we can ensure that the data are well behaved and do not contain any spurious values:
```{r summary stats - other covid policies}
df_mod3[,c(-1,-2)] %>%
select(where(is.numeric)) %>%
summary()
```
The continuous variables corresponding to business closure days and days of shelter-in-place seem to contain data concordant with what we would expect. Next, we can assess the column containing data pertaining to legal enforcement of mask wearing, which we can see contains 51 values, with no missing data:
```{r}
summary(df_mod3$mask_legal)
```
We can utilize a scatterplot matrix to visualize the relationships between the case rate and other COVID-19 related policies:
```{r warning=FALSE, fig.height = 6, message=FALSE}
ggpairs(df_mod3[,-1], upper = list(combo = "points"))
```
Because a Pearson correlation cannot quantify the relationship between a categorical variable and a continuous variable, we instead conduct a point-biserial test to assess the relationship between the legal enforcement of mask use and case rate:
```{r warning=FALSE, fig.height = 4, message=FALSE}
df_mod3 <- df_mod3 %>%
mutate(binary_legal = as.integer(as.logical(mask_legal)))
cor.test(df_mod3$case_rate, df_mod3$binary_legal)
```
As we can see from the above scatterplot matrix, the variables pertaining to business closure days and shelter days are both negatively correlated with case rate, with Pearson r of ~-0.2 and ~-0.3, respectively. They are also collinear with each other, with a correlation of r ≈ 0.4. Because these variables are negatively related to case rate (whereas the previous models have exclusively utilized variables that were positively related to case rate) and collinear, adding them will be a useful test of our model, introducing variables that might interact strongly with the dependent and independent variables. Moreover, as the point-biserial estimate of the correlation between case rate and legal enforcement of mask wearing is ~-0.2, it does not seem to be any more useful than the business closure and shelter-in-place variables, and will henceforth be discarded.
As these two variables are specifically chosen to challenge our existing variables, we elect not to do any further exploratory data analysis on them for the sake of brevity. We have already confirmed that they do not contain spurious values or a substantial amount of missing data.
### Causal Diagram for Model 3
The causal diagram for Model 3 represents the most "complete" model, and is thus identical to the diagram included in the introduction. We believe that Model 2 is the most optimal in regards to containing variables proxying for the complex demographic, socioeconomic, social, policy and other factors that contribute to case rate per 100,000 of population. The additional variables pertaining to shelter-in-place days and enforced business closures are both negatively correlated with case rate (as opposed to positively correlated for all other variables) and collinear with each other. This is essentially an *acid test* of Model 2 to ensure that it is robust.
![](images/final_causal_diagram.png)
### Model specification
Model 3 has **COVID-19 Case Rate per 100,000** as the primary outcome variable and seven covariates: the primary variable of interest (**Mandatory Mask Use**), **Test Rate per 100,000**, **Percentage of Population Below 25 Years Old**, **Log of Percentage of Black Ethnicity in Total Population + 1**,
**Human Mobility Change in Transit Stations**, **Number of Days of Shelter in Place**, and **Number of Days of Non-Essential Businesses Closure**.
**Model 3** will demonstrate the robustness of the **Model 2** coefficient of interest ($\tilde \beta_1$). The new variables in **Model 3** represent other common policies states have adopted to combat the spread of the virus. They have some collinearity with mask use, as would be expected, since states typically enact a set of policies against COVID-19 concurrently.
Despite losing some explanatory power due to the inclusion of the new variables, the result we would like to highlight is that the coefficient of interest ($\tilde \beta_1$) continues to be statistically significant, with an estimated value that has practical significance in terms of informing public policy in combating the virus.
Model 3 has the format:
$$
\begin{aligned}
\mathrm{case\ rate\ per\ 100,000\ pop.} = \beta_0 + \beta_1(\mathrm{mandatory\ mask\ use\ policy}) +
\beta_2(\mathrm{test\ rate\ per\ 100,000\ pop.}) + \\ \beta_3(\mathrm{\% \ pop.\ <25}) + \beta_4(\mathrm{\ln (\%\ black\ pop. + 1)}) + \beta_5(\mathrm{mobility\ \Delta\ transit\ stations})+\\ \beta_6(\mathrm{number\ days\ shelter\ in\ place}) + \beta_7(\mathrm{number\ days\ businesses\ closed})
\end{aligned}
$$
\newpage
### Model summary
```{r model_3 summary, results = "asis", warning=FALSE}
model_3 <- lm(case_rate ~ mask_use + test_rate + age_below_25
+ log(black_pop + 1) + mob_TS + shelter_days
+ bus_close_days, data = df)
std_errors = list(
sqrt(diag(vcovHC(model_1))),
sqrt(diag(vcovHC(model_2))),
sqrt(diag(vcovHC(model_3))))
stargazer(model_1, model_2, model_3, se = std_errors, title = "Model 3 Summary", header = FALSE)
```
### Overall model significance (F-test)
```{r model_3 F-test}
anova(model_2, model_3, test = "F")
```
At the significance level of 0.05, we cannot reject the null hypothesis ($H_0:$ **Model 2**) in favor of the fuller $H_a:$ **Model 3**, which adds the covariates **Number of Days of Shelter in Place** and **Number of Days of Non-Essential Businesses Closure**. The residual standard error remains almost unchanged, even with the inclusion of these new variables, and the explained variability of the outcome variable did not improve; this suggests collinearity between the new variables and those already in **Model 2**. The adjusted R-squared of **Model 3** decreased to 0.616, which would be expected given the addition of variables that do not add explanatory power.
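One quick way to see how little the residual standard error moves (a sketch, not evaluated here; the values are also reported in the model summary table above) is to compare `sigma()` across the two models:
```{r residual se comparison sketch, eval = FALSE}
# Residual standard errors for Models 2 and 3
c(model_2 = sigma(model_2), model_3 = sigma(model_3))
```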
As such we have demonstrated that **Model 3** serves as a reliable acid test for the robustness of **Model 2** and the coefficient of its primary explanatory variable pertaining to mask use ($\tilde \beta_1$).
### Coefficient significance (t-test)
```{r model_3 coefficient test}
coeftest(model_3, vcovHC)
```
At the 0.05 significance level, we reject the null hypothesis in favor of the alternative hypotheses $H_{a1}: \beta_1 \ne 0$, $H_{a3}: \beta_3 \ne 0$, and $H_{a4}: \beta_4 \ne 0$, meaning that only 3 out of the 7 Model 3 covariates help explain part of the variability observed in the case rate per 100,000.
The estimate for $\beta_1$ (the coefficient for our variable of interest) is $\tilde \beta_1 = -913.8$, with a standard error of 271.8 and a p-value of 0.0016. It continues to be statistically significant, with an estimated value that changes little from **Model 2** (-919.3) to **Model 3** (-913.8), indicating that mask use policies are a reliable explanatory factor for differences in case rates between the states.
### Practical significance
According to Model 3, states that have adopted mandatory mask use would be expected to have ~913.8 fewer COVID cases per 100,000, all other variables held constant. Given that the median case rate among US states is 2,633 per 100,000, the coefficient estimate has practical significance, with an effect size corresponding to a reduction of 34.7% of the median rate.
## CLM Assumptions & Limitations
In theory, our EDA process should have helped us choose the most appropriate variables (and transformations) for our models. Nevertheless, we must assess how well our models explain the causal relationship between the dependent variable (case rate per 100,000) and our primary independent variable of interest (state implementation of a mask mandate), or whether they require further modification and optimization. As such, we will consider whether they meet the 5 key assumptions of the Classical Linear Model (CLM). Moreover, we can also use CLM assessment techniques to demonstrate how our iterative model building process has improved our causal model.
### 1) Independent & Identically Distributed Random Variables
As the data is aggregated by state, it may not be independent and identically distributed (IID).
**a. Clustering Effect**
States in close proximity to each other may have similar population characteristics. There may also be frequent movements of populations between neighboring states. Moreover, these states may have similar population demographics (ethnicities, ages) or geographical characteristics (e.g. climate, see Omitted Variable Bias section) which lead to a clustering effect in terms of case rates.
**b. Strategic Effect**
Similar to clustering, the socioeconomic and behavioral characteristics of populations may affect public health policies and case rates. Moreover, adjacent states, or states with similar population characteristics (and behaviors), may be encouraged to adopt similar public health policies such as shelter-in-place orders, quarantines, mask use mandates and other regulations.
As such, the dataset may be amenable to statistical methods that specifically deal with non-IID data. Given the relatively small number of observations, however, such techniques may not offer better performance than simple OLS regression. Moreover, as we do not know exactly how the data were compiled, nor exactly how states arrived at their policy decisions or demographic characteristics, for the purposes of this exercise we have proceeded as if the data were IID and have used model performance metrics to guide our assessment of model efficacy.
### 2) Linear Conditional Expectation
The linear regression model assumes that the conditional expectation of the outcome is a linear function of the predictors.
**Residuals vs. Fitted**
```{r, fig.height=4}
model_1_residuals = resid(model_1)
model_2_residuals = resid(model_2)
model_3_residuals = resid(model_3)
model_1_predicteds = predict(model_1)
model_2_predicteds = predict(model_2)
model_3_predicteds = predict(model_3)
plot_1_predicts <- model_1 %>%
ggplot(aes(model_1_predicteds, model_1_residuals)) +
geom_point() +
stat_smooth(color="red") +
labs(
title = "Model 1: Residuals vs. Fitted",
x = "Fitted Values",
y = "Residual Values")
plot_2_predicts <- model_2 %>%
ggplot(aes(model_2_predicteds, model_2_residuals)) +
geom_point() +
stat_smooth(color="blue") +
labs(
title = "Model 2: Residuals vs. Fitted",
x = "Fitted Values",
y = "Residual Values")
plot_3_predicts <- model_3 %>%
ggplot(aes(model_3_predicteds, model_3_residuals)) +
geom_point() +
stat_smooth(color="green") +
labs(
title = "Model 3: Residuals vs. Fitted",
x = "Fitted Values",
y = "Residual Values")
```
Looking at the fitted vs. residuals plot for Model 1 (below), we can see that the residuals demonstrate an element of non-linearity (evidenced especially at fitted values > 3,500), indicating problems with the model:
```{r}
plot_1_predicts
```
In Model 2, with the addition of several more control variables, we can see that the plot assumes a more linear pattern, with the line much closer to 0. There appears to be very little difference (almost indiscernible to the naked eye) between the fitted vs. residual lines of Models 2 and 3, indicating that Model 3, despite the addition of several more variables, does little to improve the overall model:
```{r}
plot_2_predicts
plot_3_predicts
```
Moreover, comparing Model 1 to Model 2, the residuals seem to be more evenly (i.e. randomly) distributed about the line in the latter model, with fewer outliers. There is almost no change in the residual distribution about the line between Models 2 and 3. We will discuss homoscedasticity in a subsequent section.
Altogether we can see that compared to Model 1, Model 2 does a better job at meeting the fundamental assumption that the error term has a conditional mean of 0 across all the X-support. Moreover, Model 3 does not seem to contribute further to meeting this assumption, despite the addition of further variables.
### 3) No Perfect Collinearity
The linear regression model assumes that the variables are not perfectly (or even near perfectly) collinear.
**Dropped Coefficients**
We can assess for perfect collinearity by looking to see if there were any dropped coefficients in any of the 3 models. As we can see in the preceding model sections, none of the variables have been dropped from any of the models, indicating that there is no perfect collinearity between any of the independent variables.
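This check can also be done programmatically (a sketch): `lm` reports `NA` for any coefficient it drops due to perfect collinearity.
```{r dropped coefficients check sketch, eval = FALSE}
# TRUE would indicate that lm dropped at least one coefficient in any of the models
any(is.na(coef(model_1)), is.na(coef(model_2)), is.na(coef(model_3)))
```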
**Variance Inflation Factor**
We can quantify the degree of collinearity between the independent variables by conducting a variance inflation factor (VIF) test. The VIF is the ratio of the variance of a coefficient estimate in a model with multiple terms to its variance in a model containing only that term. It is given by the formula:
$$
\begin{aligned}
\mathrm{VIF} = \frac{1}{1-R_i^2}
\end{aligned}
$$
where $R^2_i$ is the coefficient of determination of a regression equation with $X_i$ on the left-hand side and all other predictor variables on the right-hand side. This produces a VIF value for the coefficient estimator of the particular variable being analyzed ($\mathrm{VIF}(\hat{\beta_i})$). According to several sources, $\mathrm{VIF}(\hat{\beta_i}) > 10$ is considered indicative of multicollinearity (some also say values > 5 indicate high collinearity).
```{r}
car::vif(model_1)
```
The VIF values for both variables in Model 1 are <2, indicating no high degree of collinearity between the 2 independent variables.
```{r}
car::vif(model_2)
```
The VIF values for all variables in Model 2 are <2, indicating no high degree of multicollinearity between the 5 independent variables.
```{r}
car::vif(model_3)
```
The VIF values for all variables in Model 3 are <2, indicating no high degree of multicollinearity between the 7 independent variables.
From the two tests above, we can safely say our models meet the assumption of having no substantial amount of collinearity or multicollinearity between the independent variables.
### 4) Homoscedastic Errors
There are two methods we can employ to test for homoscedasticity of the error terms:
**Scale-Location Plots**
This is a method to visually assess for homoscedasticity of the error terms.
```{r, fig.height=3}
plot_1_sl <- model_1 %>%
ggplot(aes(x = model_1_predicteds,
y = sqrt(abs(model_1_residuals/sd(model_1_residuals))))) +
geom_point() +
stat_smooth(color="red", se=FALSE) +
labs(
title = "Scale-Location Plot: Model 1",
x = "Fitted Values",
y = "sqrt(|Standardized Residuals|)")
plot_2_sl <- model_2 %>%
ggplot(aes(x = model_2_predicteds,
y = sqrt(abs(model_2_residuals/sd(model_2_residuals))))) +
geom_point() +
stat_smooth(color="blue", se=FALSE) +
labs(
title = "Scale-Location Plot: Model 2",
x = "Fitted Values",
y = "sqrt(|Standardized Residuals|)")
plot_3_sl <- model_3 %>%
ggplot(aes(x = model_3_predicteds,
y = sqrt(abs(model_3_residuals/sd(model_3_residuals))))) +
geom_point() +
stat_smooth(color="green", se=FALSE) +
labs(
title = "Scale-Location Plot: Model 3", x = "Fitted Values",
y = "sqrt(|Standardized Residuals|)")
```
In Model 1, there seems to be a greater concentration of points below the line, which demonstrates evident curvature at the lower and upper extremes of the X-scale, with a steeper upward slope at fitted values > 3,500. It is difficult to visually assess heteroscedasticity in this plot:
```{r}
plot_1_sl
```
In Model 2, the errors seem much more evenly distributed above and below the line, which is also straighter than that of Model 1. However, it seems that the error magnitude increases as X increases. There is almost no discernible change in any of the visualized error parameters between Models 2 and 3, indicating that the extra variables do little to further help our model meet the CLM assumptions:
```{r}
plot_2_sl
plot_3_sl
```
As it is difficult to discern homoscedasticity in the scale-location plot for Model 1, and the errors appear to increase in magnitude in both Models 2 and 3, we can also perform a quantitative assessment for non-constant variance in the form of the Breusch-Pagan test. The null hypothesis is that there is no evidence of heteroscedastic variance.
**Breusch-Pagan Test**
Running the test for Model 1, we observe a high p-value of 0.7. While we cannot assertively state that there is no heteroscedastic variance, we fail to reject the null hypothesis:
```{r}
lmtest::bptest(model_1)
```
For Model 2, we now observe a very low p-value of ~0.008, and can reject the null hypothesis. Although this indicates that the errors are heteroscedastic, it is important to note that we are utilizing robust standard errors rather than classical standard errors in all of our models and associated tests. This means that despite the presence of heteroscedasticity, the t-tests we performed for the Model 2 coefficients are still valid:
```{r}