---
title: "Project assignment for the course 'Exploratory Data Analysis with the R Language'"
subtitle: "Dataset: Students Performance in Exams"
author: "Antonov Grigorii"
date: "14.06.2023"
output:
  html_document:
    toc: true
    toc_float: true
    theme: united
    highlight: zenburn
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.align = "center")
```
# Description of the dataset
This dataset contains fictitious information about students' grades in the following subjects: `math`, `reading`, and `writing`.
**Data reference**: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams
```{r}
library(tidyverse)
library(ggpubr)
library(patchwork)
scores <- read_csv("http://roycekimmons.com/system/generate_data.php?dataset=exams&n=1000")
colnames(scores)
head(scores)
```
The dataset contains 1000 records and 8 characteristics (columns).
## Data cleaning
The dataset contains no missing values, and no incorrectly entered values were found.
```{r}
colSums(is.na(scores))
```
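Beyond missing values, incorrectly entered data can be checked explicitly. The sketch below is illustrative only: it uses a small made-up data frame `toy` instead of the real `scores`, and it assumes that valid scores lie on a 0-100 scale, which the data source does not state.

```r
# Hypothetical toy data standing in for `scores`; all values are made up.
toy <- data.frame(
  gender = c("male", "female", "female", "male"),
  math   = c(55, 101, 72, NA)   # 101 is out of range, NA is missing
)

# Missing values per column (the same check as in the chunk above)
na_counts <- colSums(is.na(toy))

# Rows whose score falls outside the assumed 0-100 range (NA rows are skipped)
out_of_range <- which(toy$math < 0 | toy$math > 100)

# Category labels that are not in the expected set
unexpected <- setdiff(unique(toy$gender), c("male", "female"))

na_counts
out_of_range
unexpected
```

The same three checks could be applied to each column of `scores` after the variables are renamed.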
## Variable `gender`
Character variable `gender` with two values:
- `male`
- `female`
Since there are only two values, it is convenient to convert this variable to a factor.
```{r}
# conversion to a factor variable, with levels ordered by frequency
scores$gender <- fct_infreq(as.factor(scores$gender))
table(scores$gender)
scores |>
ggplot(aes(x = gender, fill = gender)) +
geom_bar(alpha = 0.9) +
labs(title = "Gender representation", y = "Number of people", x = "Gender") +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
This study involved roughly equal numbers of boys and girls, with a slight skew toward the male side.
## Variable `race/ethnicity`
Character variable `race/ethnicity` containing 5 variants:
- `group A`
- `group B`
- `group C`
- `group D`
- `group E`
Given its structure, it is convenient to convert this variable to a factor.
For ease of reference, I will rename the variable from `race/ethnicity` to `group`.
```{r}
scores <- rename(scores, "group" = "race/ethnicity")
scores$group <- fct_infreq(as.factor(scores$group))
table(scores$group)
scores |>
ggplot(aes(x = group, fill = group)) +
geom_bar(alpha = 0.9) +
labs(title = "Ethnic groups distribution",
y = "Number of people", x = "Ethnic group") +
scale_fill_brewer(palette = "Set3") +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
The predominant group in this study is the `C` group.
```{r}
scores |>
ggplot(aes(x = group, fill = gender)) +
geom_bar(position = "dodge", alpha = 0.9) +
labs(title = "Ethnic groups distribution by gender",
y = "Number of people", x = "Ethnic group", fill = "Gender") +
scale_fill_brewer(palette = "Set3") +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
Group `B` is the only group in which more girls than boys took the test.
## Variable `parental level of education`
Character variable denoting the parents' level of education and having the following options:
- `associate's degree`
- `bachelor's degree`
- `some college`
- `some high school`
- `high school`
- `master's degree`
For convenience, this characteristic is converted to a factor.
For ease of reference, I will rename the variable `parental level of education` to `parent_ed`.
```{r}
scores <- rename(scores, "parent_ed" = "parental level of education")
scores$parent_ed <- fct_infreq(as.factor(scores$parent_ed))
table(scores$parent_ed)
scores |>
ggplot(aes(x = parent_ed, fill = parent_ed)) +
geom_bar(alpha = 0.9) +
labs(title = "Parental level of education",
y = "Number of people", x = "Level of education") +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
The data show that most parents of the tested children completed secondary education or received some higher education. A smaller share of parents completed higher education (`bachelor's degree` or `master's degree`).
## Variable `lunch`
Character variable indicating the type of lunch the student receives. The variable has two levels:
- `standard`
- `free/reduced`
```{r}
scores$lunch <- fct_infreq(as.factor(scores$lunch))
table(scores$lunch)
scores |>
ggplot(aes(x = lunch, fill = lunch)) +
geom_bar(alpha = 0.9) +
labs(title = "Lunch provision",
y = "Number of people", x = "Lunch type") +
scale_fill_brewer(palette = "Set3") +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
Most students received a standard lunch at the institution.
## Variable `test preparation course`
Character variable indicating whether or not the examinee had specialized exam preparation.
The column is characterized by two values:
- `completed`
- `none`
```{r}
scores <- rename(scores, "prep_level" = "test preparation course")
scores$prep_level <- fct_infreq(as.factor(scores$prep_level))
table(scores$prep_level)
scores |>
ggplot(aes(x = prep_level, fill = prep_level)) +
geom_bar(alpha = 0.9) +
labs(title = "Test preparation course",
y = "Number of people", x = "Preparation type") +
scale_fill_brewer(palette = "Set2") +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
Most of the test takers did not complete the test preparation course.
## Variable `math score`
Numerical variable reflecting the students' results on the math test. Scores range from *13* to *100*. The average score is *66.4*.
```{r}
scores <- rename(scores, "math" = "math score")
summary(scores$math)
math <- scores |>
ggplot(aes(x = math)) +
geom_histogram(aes(fill = after_stat(x)), alpha = 0.9) +
# the factor 3000 rescales the density curve to the count axis
geom_density(aes(y = after_stat(density) * 3000),
color = "#676b75", fill = "#676b75", linewidth = 1, alpha = 0.1) +
labs(title = "Math test score distribution",
y = "Number of people", x = "Score") +
scale_fill_gradient(low = "#824141", high = "#418252") +
scale_y_continuous(
name = "Number of people",
sec.axis = sec_axis(~ . / 3000, name = "Density")) +
theme(axis.text = element_text(size = 12)) +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
math
```
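The factor 3000 in `geom_density()` rescales the density curve to the count axis. A histogram count is approximately density × number of observations × bin width, so with 1000 observations a factor of 3000 corresponds to a bin width of about 3 score points. The sketch below checks this relationship with base `hist()` on simulated values; the simulated data and the fixed bin width of 3 are assumptions for illustration, not taken from the real dataset.

```r
set.seed(1)
x <- rnorm(1000, mean = 66, sd = 15)   # hypothetical scores, not the real data

binwidth <- 3
# Breaks in steps of 3 that cover the whole range of x
breaks <- seq(floor(min(x)), ceiling(max(x)) + binwidth, by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# Factor that converts a density value into a count per bin
scale_factor <- length(x) * binwidth   # 1000 * 3 = 3000

# Counts and rescaled densities agree bin by bin
all.equal(h$counts, h$density * scale_factor)
```

If the score range or the number of rows changes, the factor would need to be recomputed in the same way.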
## Variable `reading score`
Numerical variable reflecting the students' results on the reading test. Scores range from *27* to *100*. The average score is *69*.
```{r}
scores <- rename(scores, "reading" = "reading score")
summary(scores$reading)
reading <- scores |>
ggplot(aes(x = reading)) +
geom_histogram(aes(fill = after_stat(x)), alpha = 0.9) +
geom_density(aes(y = after_stat(density) * 3000),
color = "#676b75", fill = "#676b75", linewidth = 1, alpha = 0.1) +
labs(title = "Reading test score distribution",
y = "Number of people", x = "Score") +
scale_fill_gradient(low = "#824141", high = "#418252") +
scale_y_continuous(
name = "Number of people",
sec.axis = sec_axis(~ . / 3000, name = "Density")) +
theme(axis.text = element_text(size = 12)) +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
reading
```
## Variable `writing score`
Numerical variable reflecting the students' results on the writing test. Scores range from *23* to *100*. The average score is *67.74*.
```{r}
scores <- rename(scores, "writing" = "writing score")
summary(scores$writing)
writing <- scores |>
ggplot(aes(x = writing)) +
geom_histogram(aes(fill = after_stat(x)), alpha = 0.9) +
geom_density(aes(y = after_stat(density) * 3000),
color = "#676b75", fill = "#676b75", linewidth = 1, alpha = 0.1) +
labs(title = "Writing test score distribution",
y = "Number of people", x = "Score") +
scale_fill_gradient(low = "#824141", high = "#418252") +
scale_y_continuous(
name = "Number of people",
sec.axis = sec_axis(~ . / 3000, name = "Density")) +
theme(axis.text = element_text(size = 12)) +
theme(legend.position = "none",
plot.title = element_text(size = rel(1.2), hjust = 0.5))
writing
```
## A comparison of results for the three subjects
On average, students scored highest on the `reading` test and lowest on the `math` test.
```{r, fig.width = 11, fig.height = 5}
(math + reading + writing) +
plot_annotation(title = "Comparison of results in three subjects",
tag_levels = "I",
theme = theme(plot.title = element_text(size = rel(1.3), hjust = 0.5))) +
plot_layout(ncol = 3)
```
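The ranking of the three subjects by average score can also be read off numerically. A minimal sketch on a made-up data frame `toy` (the values are invented for illustration; with the real data the same call would be applied to the `math`, `reading` and `writing` columns of `scores`):

```r
# Hypothetical scores for three students; values are made up.
toy <- data.frame(
  math    = c(60, 70, 65),
  reading = c(72, 68, 70),
  writing = c(66, 69, 69)
)

# Mean score per subject, sorted from highest to lowest
subject_means <- colMeans(toy)
sort(subject_means, decreasing = TRUE)
```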
# Data Exploration (Objectives and Goals)
After reviewing the individual variables, it makes sense to see how they are related. In addition, it is worth answering the following questions:
1. How effective is the preparation for the exam?
2. Which variables are associated with the exam results?
3. Which variables are associated with the students' lunch type?
4. Is there a correlation between the results of the different exams?
## Relationship between numerical variables
```{r}
cor1 <- scores |>
ggplot(aes(x = math, y = reading, color = prep_level)) +
geom_point(size = 2) +
labs(title = "Correlation between results",
subtitle = "math ~ reading",
y = "Reading score", x = "Math score", color = "Preparation") +
theme(plot.title = element_text(size = rel(1.1), hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
stat_cor(aes(color = NULL), show.legend = FALSE, method = "pearson") +
geom_smooth(aes(group = 1), method = "lm", color = "#676b75") +
scale_color_brewer(palette = "Set3")
cor1
```
```{r}
cor2 <- scores |>
ggplot(aes(x = reading, y = writing, color = prep_level)) +
geom_point(size = 2) +
labs(title = "Correlation between results",
subtitle = "reading ~ writing",
y = "Writing score", x = "Reading score", color = "Preparation") +
theme(plot.title = element_text(size = rel(1.1), hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
stat_cor(aes(color = NULL), show.legend = FALSE, method = "pearson") +
geom_smooth(aes(group = 1), method = "lm", color = "#676b75") +
scale_color_brewer(palette = "Set3")
cor2
```
```{r}
cor3 <- scores |>
ggplot(aes(x = writing, y = math, color = prep_level)) +
geom_point(size = 2) +
labs(title = "Correlation between results",
subtitle = "writing ~ math",
y = "Math score", x = "Writing score", color = "Preparation") +
theme(plot.title = element_text(size = rel(1.1), hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
stat_cor(aes(color = NULL), show.legend = FALSE, method = "pearson") +
geom_smooth(aes(group = 1), method = "lm", color = "#676b75") +
scale_color_brewer(palette = "Set3")
cor3
```
Note the strong correlation between the numerical variables: the higher the score in one subject, the higher the score in the others. That is, students who do well in one subject tend to do well in all of them.
The correlation is highest between the reading and writing results. A plausible explanation is that reading and writing are closely related skills, possibly linked at the level of higher nervous activity.
In addition, most of the yellow dots, which mark students who completed the preparation course, lie closer to the upper right corner of the plot, in the area of high scores. This suggests that exam preparation is associated with better results.
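The correlation coefficients printed on the plots can also be computed directly with base R's `cor.test()`. The sketch below uses simulated vectors `reading_toy` and `writing_toy`; these names and values are hypothetical and only illustrate the call, they are not the real scores.

```r
set.seed(2)
# Simulated, strongly related scores (hypothetical, not the real data)
reading_toy <- round(rnorm(200, mean = 69, sd = 14))
writing_toy <- round(reading_toy + rnorm(200, mean = -1, sd = 5))

# Pearson correlation with its p-value
ct <- cor.test(reading_toy, writing_toy, method = "pearson")
ct$estimate   # Pearson's r
ct$p.value
```

With the real data, the two arguments would be `scores$reading` and `scores$writing`.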
```{r, fig.width = 11, fig.height = 5}
(cor1 + cor2 + cor3) +
plot_annotation(title = "Relationship between numerical variables",
tag_levels = "I",
theme = theme(plot.title = element_text(size = rel(1.3), hjust = 0.5))) +
plot_layout(ncol = 3)
```
## How training affects results
```{r}
scores |>
ggplot(aes(x = math)) +
geom_density(aes(y = ..density.., fill = prep_level), alpha = 0.5) +
labs(title = "Math test score distribution depending on preparation",
y = "Density", x = "Score", fill = "Preparation") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
```{r}
scores |>
ggplot(aes(x = reading)) +
geom_density(aes(y = ..density.., fill = prep_level), alpha = 0.5) +
labs(title = "Reading test score distribution depending on preparation",
y = "Density", x = "Score", fill = "Preparation") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
```{r}
scores |>
ggplot(aes(x = writing)) +
geom_density(aes(y = ..density.., fill = prep_level), alpha = 0.5) +
labs(title = "Writing test score distribution depending on preparation",
y = "Density", x = "Score", fill = "Preparation") +
scale_fill_brewer(palette = "Set2") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
For all three tests, the score distribution of students who completed the preparation course is shifted to the right relative to that of students who did not.
### Statistical verification of training effectiveness
#### Data normality check
```{r}
shapiro.test(pull(scores[scores$prep_level == "completed", 6]))
# math and preparation completed
shapiro.test(pull(scores[scores$prep_level == "none", 6]))
# math and no preparation
shapiro.test(pull(scores[scores$prep_level == "completed", 7]))
# reading and preparation completed
shapiro.test(pull(scores[scores$prep_level == "none", 7]))
# reading and no preparation
shapiro.test(pull(scores[scores$prep_level == "completed", 8]))
# writing and preparation completed
shapiro.test(pull(scores[scores$prep_level == "none", 8]))
# writing and no preparation
```
The data did not pass the normality test, so nonparametric statistical tests are needed, such as the **Mann-Whitney** U test or its generalizations.
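The normality checks above select columns by position (6, 7, 8), which silently breaks if the column order changes. A sketch of the same workflow with columns addressed by name, followed by a Mann-Whitney U test (`wilcox.test()` in R), is shown below on a simulated data frame `toy`; the data are hypothetical.

```r
set.seed(3)
# Hypothetical data: 40 students with preparation, 60 without
toy <- data.frame(
  prep_level = rep(c("completed", "none"), times = c(40, 60)),
  math = c(rnorm(40, mean = 70, sd = 10), rnorm(60, mean = 64, sd = 10))
)

# Shapiro-Wilk p-value per group, addressing the column by name
shapiro_p <- tapply(toy$math, toy$prep_level,
                    function(x) shapiro.test(x)$p.value)
shapiro_p

# Mann-Whitney U test comparing the two groups
mw <- wilcox.test(math ~ prep_level, data = toy)
mw$p.value
```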
```{r, fig.width = 11, fig.height = 5}
scores |>
select(prep_level, math, reading, writing) |>
pivot_longer(cols = c(math, reading, writing),
names_to = "test_name", values_to = "score") |>
ggplot(aes(y = score, x = test_name, fill = prep_level)) +
geom_boxplot(alpha = 0.9) +
labs(title = "Test score distribution depending on preparation",
y = "Score", x = "Test type", fill = "Preparation") +
scale_fill_brewer(palette = "Set2") +
stat_compare_means(size = 4, vjust = -0.5) +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
In this case I used the **Kruskal-Wallis test**, a generalization of the Mann-Whitney U test to two or more independent groups that does not assume normality. Within each test type it compares the two preparation groups. Judging by the results, I can reject the null hypothesis that the scores of the two groups come from the same distribution. Together with the boxplots, this suggests that preparation for the test has a positive effect on its results.
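For two groups, the Kruskal-Wallis test and the Mann-Whitney U test (in its normal approximation without continuity correction) lead to the same p-value. A small sketch on simulated data illustrates this; the data frame `toy` and its values are hypothetical.

```r
set.seed(4)
# Hypothetical scores for two preparation groups of 50 students each
toy <- data.frame(
  prep_level = factor(rep(c("completed", "none"), each = 50)),
  score = c(rnorm(50, mean = 70, sd = 10), rnorm(50, mean = 65, sd = 10))
)

# Mann-Whitney U test: normal approximation, no continuity correction
mw <- wilcox.test(score ~ prep_level, data = toy,
                  exact = FALSE, correct = FALSE)

# Kruskal-Wallis test on the same two groups
kw <- kruskal.test(score ~ prep_level, data = toy)

c(mann_whitney = mw$p.value, kruskal_wallis = kw$p.value)
```

So for a two-group comparison the choice between the two tests should not change the conclusion.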
## The effect of gender on test results
#### Checking for data normality
```{r}
shapiro.test(pull(scores[scores$gender == "male", 6]))
# math and male
shapiro.test(pull(scores[scores$gender == "female", 6]))
# math and female
shapiro.test(pull(scores[scores$gender == "male", 7]))
# reading and male
shapiro.test(pull(scores[scores$gender == "female", 7]))
# reading and female
shapiro.test(pull(scores[scores$gender == "male", 8]))
# writing and male
shapiro.test(pull(scores[scores$gender == "female", 8]))
# writing and female
```
All samples except the boys' math results failed the normality test, so it is better not to use the t-test here.
```{r, fig.width = 11, fig.height = 5}
scores |>
select(gender, math, reading, writing) |>
pivot_longer(cols = c(math, reading, writing),
names_to = "test_name", values_to = "score") |>
ggplot(aes(y = score, x = test_name, fill = gender)) +
geom_boxplot(alpha = 0.9) +
labs(title = "Test score distribution depending on gender",
y = "Score", x = "Test type", fill = "Gender") +
scale_fill_brewer(palette = "Set2") +
stat_compare_means(size = 4, vjust = -0.5) +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
Based on the results, the boys did better, on average, on the math test, and the girls did better on the reading and writing tests.
## Effects of parental education on test results
```{r, fig.width = 11, fig.height = 5}
scores |>
select(parent_ed, math, reading, writing) |>
pivot_longer(cols = c(math, reading, writing),
names_to = "test_name", values_to = "score") |>
ggplot(aes(y = score, x = test_name, fill = parent_ed)) +
geom_boxplot(alpha = 0.9) +
labs(title = "Test score distribution depending on parental education level",
y = "Score", x = "Test type", fill = "Parental education") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
Let's combine the children into two groups, `higher` and `no_higher`: those whose parents completed higher education and those whose parents did not.
#### Check for normality of data
```{r}
shapiro.test(pull(scores[scores$parent_ed %in% c("bachelor's degree",
"master's degree"), 6]))
# math and higher education
shapiro.test(pull(scores[scores$parent_ed %in% c("associate's degree",
"some college", "some high school", "high school"), 6]))
# math and no higher education
shapiro.test(pull(scores[scores$parent_ed %in% c("bachelor's degree",
"master's degree"), 7]))
# reading and higher education
shapiro.test(pull(scores[scores$parent_ed %in% c("associate's degree",
"some college", "some high school", "high school"), 7]))
# reading and no higher education
shapiro.test(pull(scores[scores$parent_ed %in% c("bachelor's degree",
"master's degree"), 8]))
# writing and higher education
shapiro.test(pull(scores[scores$parent_ed %in% c("associate's degree",
"some college", "some high school", "high school"), 8]))
# writing and no higher education
```
None of the samples passed the normality test.
```{r, fig.width = 11, fig.height = 5}
scores |>
mutate(new_ed = fct_collapse(parent_ed,
higher = c("bachelor's degree", "master's degree"),
no_higher = c("associate's degree", "some college",
"some high school", "high school"))) |>
select(new_ed, math, reading, writing) |>
pivot_longer(cols = c(math, reading, writing),
names_to = "test_name", values_to = "score") |>
ggplot(aes(y = score, x = test_name, fill = new_ed)) +
geom_boxplot(alpha = 0.9) +
stat_compare_means(size = 4, vjust = -0.5) +
labs(title = "Test score distribution depending on parental education level",
y = "Score", x = "Test type", fill = "Parental education") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
The average scores of students whose parents have completed higher education are higher than those of students whose parents have not. The scores are highest for students whose parents have a master's degree.
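The comparison between the two collapsed groups can be summarized numerically as well. The sketch below uses base R on a made-up data frame `toy`; the scores are invented, and `ifelse()` stands in for the `fct_collapse()` call used above.

```r
# Hypothetical data: one student per education level, made-up scores
toy <- data.frame(
  parent_ed = c("master's degree", "bachelor's degree", "some college",
                "high school", "associate's degree", "some high school"),
  math = c(80, 74, 66, 60, 68, 58)
)

# Collapse the six levels into two groups
toy$new_ed <- ifelse(
  toy$parent_ed %in% c("bachelor's degree", "master's degree"),
  "higher", "no_higher"
)

# Median score per collapsed group
med <- aggregate(math ~ new_ed, data = toy, FUN = median)
med
```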
## Effect of the student's ethnic group on the test results
```{r, fig.width = 11, fig.height = 5}
scores |>
select(group, math, reading, writing) |>
pivot_longer(cols = c(math, reading, writing),
names_to = "test_name", values_to = "score") |>
ggplot(aes(y = score, x = test_name, fill = group)) +
geom_boxplot(alpha = 0.9) +
labs(title = "Test score distribution depending on ethnic group",
y = "Score", x = "Test type", fill = "Ethnic group") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
The graph shows that some groups score higher than others. Let's combine the top two groups, `D` and `E`, into `group_DE`, and the bottom three, `A`, `B` and `C`, into `group_ABC`, to perform a statistical test on them.
#### Data normality check
```{r}
shapiro.test(pull(scores[scores$group %in% c("group D", "group E"), 6]))
# math and groups D and E
shapiro.test(pull(scores[scores$group %in% c("group A", "group B",
"group C"), 6]))
# math and groups A, B and C
shapiro.test(pull(scores[scores$group %in% c("group D", "group E"), 7]))
# reading and groups D and E
shapiro.test(pull(scores[scores$group %in% c("group A", "group B",
"group C"), 7]))
# reading and groups A, B and C
shapiro.test(pull(scores[scores$group %in% c("group D", "group E"), 8]))
# writing and groups D and E
shapiro.test(pull(scores[scores$group %in% c("group A", "group B",
"group C"), 8]))
# writing and groups A, B and C
```
None of the samples except the first (the math test for groups `D` and `E`) passed the normality test.
```{r, fig.width = 11, fig.height = 5}
scores |>
mutate(new_group = fct_collapse(group,
group_DE = c("group D", "group E"),
group_ABC = c("group A", "group B", "group C"))) |>
select(new_group, math, reading, writing) |>
pivot_longer(cols = c(math, reading, writing),
names_to = "test_name", values_to = "score") |>
ggplot(aes(y = score, x = test_name, fill = new_group)) +
geom_boxplot(alpha = 0.9) +
stat_compare_means(size = 4, vjust = -0.5) +
labs(title = "Test score distribution depending on ethnic group",
y = "Score", x = "Test type", fill = "Ethnic group") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
On average, students in groups `D` and `E` do better than students in other groups, especially in `math`.
## Effect of having a complete lunch on test results
#### Check for normality of data
```{r}
shapiro.test(pull(scores[scores$lunch == "free/reduced", 6]))
# math and free or reduced lunch
shapiro.test(pull(scores[scores$lunch == "standard", 6]))
# math and standard lunch
shapiro.test(pull(scores[scores$lunch == "free/reduced", 7]))
# reading and free or reduced lunch
shapiro.test(pull(scores[scores$lunch == "standard", 7]))
# reading and standard lunch
shapiro.test(pull(scores[scores$lunch == "free/reduced", 8]))
# writing and free or reduced lunch
shapiro.test(pull(scores[scores$lunch == "standard", 8]))
# writing and standard lunch
```
Interestingly, for students with a free/reduced lunch the hypothesis of normality cannot be rejected, while for students with a standard lunch the scores are not normally distributed.
```{r, fig.width = 11, fig.height = 5}
scores |>
select(lunch, math, reading, writing) |>
pivot_longer(cols = c(math, reading, writing),
names_to = "test_name", values_to = "score") |>
ggplot(aes(y = score, x = test_name, fill = lunch)) +
geom_boxplot(alpha = 0.9) +
stat_compare_means(size = 4, vjust = -0.5) +
labs(title = "Test score distribution depending on lunch type",
y = "Score", x = "Test type", fill = "Lunch type") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text = element_text(size = 12)) +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
According to the test results, we can fairly confidently reject the hypothesis that the score distributions are the same for both types of `lunch`. At the same time, the type of lunch probably depends on family income, i.e. on the standard of living. In addition, a well-fed person tends to think better than a hungry one.
## The influence of ethnicity on the type of lunch
```{r, fig.width = 11, fig.height = 5}
scores |>
ggplot(aes(x = group, fill = lunch)) +
geom_bar(position = "dodge", alpha = 0.8) +
labs(title = "Lunch type depending on ethnic group",
y = "Number of people", x = "Ethnic group", fill = "Lunch type") +
scale_fill_brewer(palette = "Set2") +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5))
```
The data show that in every group more students receive a standard lunch. Interestingly, in group `E` the children are split almost equally between the two lunch types, while the average score of this group is among the highest. This calls into question the idea that the type of lunch is directly related to students' average test scores.
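Because the groups differ in size, shares are easier to compare than raw counts. A minimal sketch with `table()` and `prop.table()` on a made-up data frame `toy` (the rows are invented for illustration):

```r
# Hypothetical data: 3 students in group A, 4 in group E
toy <- data.frame(
  group = c("group A", "group A", "group A",
            "group E", "group E", "group E", "group E"),
  lunch = c("standard", "standard", "free/reduced",
            "standard", "standard", "free/reduced", "free/reduced")
)

# Counts per group and lunch type
counts <- table(toy$group, toy$lunch)

# Share of each lunch type within a group (each row sums to 1)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```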
## Influence of parents' education level on the type of lunch
```{r, fig.width = 11, fig.height = 5}
scores |>
ggplot(aes(x = parent_ed, fill = lunch)) +
geom_bar(position = "dodge", alpha = 0.8) +
labs(title = "Lunch type depending on parental education",
y = "Number of people", x = "Parental education", fill = "Lunch type") +
scale_fill_brewer(palette = "Set2") +
theme(plot.title = element_text(size = rel(1.2), hjust = 0.5),
legend.title = element_text(size = rel(1)))
```
In addition, the parents' level of education does not appear to have much influence on the ratio of children with standard and free/reduced lunches.
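A visual impression of this kind can be backed by a chi-squared test of independence between two categorical variables. The sketch below runs `chisq.test()` on a hypothetical table of counts; the numbers are invented and only show the mechanics, they say nothing about the real data.

```r
# Hypothetical counts of lunch type by parental education (made up)
counts <- matrix(
  c(120, 60,
    130, 70,
    110, 65),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    parent_ed = c("high school", "some college", "bachelor's degree"),
    lunch = c("standard", "free/reduced")
  )
)

# Null hypothesis: lunch type is independent of parental education
res <- chisq.test(counts)
res$statistic
res$p.value
```

With the real data, the input would be `table(scores$parent_ed, scores$lunch)`; a large p-value would be consistent with the observation that the two variables are unrelated.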
# Conclusions
1. `Test preparation course` has a positive effect on test results.
2. Children with a standard `lunch` do better, on average, on the tests. Perhaps the underlying factor here is income level, which determines the type of lunch.
3. Students whose parents have completed higher education do better, on average, on tests in the subjects presented.
4. In this population, girls do better on the `reading` and `writing` tests, and boys do better on the `math` test.
5. Generally, children from groups `D` and `E` perform best.
6. In the data presented, the ethnic group and the parents' educational level have almost no effect on the type of lunch.