---
knit: bookdown::preview_chapter
---
# Wrangling Data
## Overview
In this chapter we will learn about actions on data that are useful for data analysis. These are the verbs in the tidyverse that are commonly used to operate on data (the nouns):
- `filter`
- `select`
- `mutate`
- `group_by`
- `summarise`
- `arrange`
- `count`, `tally`
## Side-note: Package system
We have been loading the `tidyverse` package. It's actually a suite of packages, and you can learn more about the individual packages at https://www.tidyverse.org. You could also load each package individually.
Because so many people contribute packages to R, the package system is both a blessing and a curse: the best techniques are available, but there can be conflicts between function names. When you load the tidyverse it prints a helpful summary of the conflicts it knows about between its functions and others.
```{r message=TRUE}
library(tidyverse)
```
For example, there is a `filter()` function in the `stats` package that comes with the R distribution. This can cause confusion when you want to use the `filter()` function from `dplyr` (part of the tidyverse). To be sure the function you use is the one you want, you can prefix it with the package name: `dplyr::filter()`.
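As a small illustration of the difference, these two calls are entirely unrelated despite sharing a name (`mtcars` is a built-in data set):

```{r eval=FALSE}
stats::filter(1:10, rep(1/3, 3))  # a moving-average filter for time series
dplyr::filter(mtcars, cyl == 6)   # choose rows of a data frame
```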
## Example: french fries

- Load the data
- Gather into long form
```{r}
load("data/french_fries.rda")
ff_long <- french_fries %>%
  gather(type, rating, -subject, -time, -treatment, -rep) %>%
  mutate(type = as.factor(type))
```
### The story of the french fries data....
This was an actual experiment in Food Sciences at Iowa State University. The goal was to find out whether some cheaper oils could be used to make hot chips, that is, whether people would be unable to distinguish chips fried in the new oils from chips fried in the current market-leading oil.
Twelve tasters were recruited to sample two chips from each batch, over a period of ten weeks. The same oil was kept for the full ten weeks, which may have been a bit gross by the end!
This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data and the plyr tools.
## filter: choose observations
Filter records for subject 10.
```{r}
ff_long_subj10 <- ff_long %>%
  filter(subject == 10)
ff_long_subj10
```
`==` does a logical equality check. This works nicely for characters or factors, but it can be unreliable for numerical data because of floating point arithmetic.
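A quick sketch of the problem, and of `dplyr::near()`, which compares within a small tolerance:

```{r}
sqrt(2)^2 == 2      # FALSE, because of floating point representation
near(sqrt(2)^2, 2)  # TRUE
```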
To filter you need to do some comparison to find the subset of observations you are interested in. For example:
- `subject != 10` finds rows corresponding to all subjects except subject 10
- `x > 10` finds all rows where variable `x` has values bigger than 10
- `x >= 10` finds all rows where variable `x` has values bigger than *or equal to* 10
- `class %in% c("A", "B")` finds all records where variable `class` is either A or B
- `!is.na(y)` finds all the records that do *NOT* have a missing value for variable `y`
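Conditions can be combined: arguments to `filter()` separated by commas are combined with AND, and `|` gives OR. A small sketch on the french fries data:

```{r}
# subject 10's week 1 ratings, dropping missing values
ff_long %>%
  filter(subject == 10, time == 1, !is.na(rating))
```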
#### Your turn
Filter the french fries data to have:
- only week 1
- weeks 1-4 only
- oil type 1 (oil type is called treatment)
- oil types 1 and 3 but not 2
```{r eval=FALSE, echo=FALSE}
ff_long %>%
  filter(time == 1)
ff_long %>%
  filter(time %in% 1:4)
ff_long %>%
  filter(treatment == 1)
ff_long %>%
  filter(treatment %in% c(1, 3))
```
## select
`select` chooses which variables to keep in the data set. It is useful when there are a lot of variables but you only need some of them for an analysis.
```{r}
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
  select(country, year, starts_with("new_sp_"))
tb %>% top_n(20)
```
- Generally, you give a comma-separated list of the variables, by name.
- You can use text matching on the names with helpers like `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()`, `one_of()` or `everything()`
- You can use `:` to choose a range of variables in column order, e.g. `select(df, V4:V6)`
- And you can drop variables by prefixing their names with `-`
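A couple of variations on the TB data; note that the exact column names (e.g. the `new_sp_m` prefix for the male-stratified counts) are an assumption about this particular file:

```{r eval=FALSE}
tb %>% select(country:year)              # a range of adjacent columns
tb %>% select(-starts_with("new_sp_m"))  # drop columns by name prefix
```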
#### Your turn
For the french fries data, use select to do these tasks:
- choose time, treatment and rep
- choose subject through to rating
- drop subject
```{r eval=FALSE, echo=FALSE}
ff_long %>%
  select(time, treatment, rep)
ff_long %>%
  select(subject:rating)
ff_long %>%
  select(-subject)
```
## mutate
`mutate` is used to create a new variable.
```{r}
load("data/pisa_au.rda")
pisa_au <- pisa_au %>%
  select(PV1MATH:PV10MATH)
pisa_au %>%
  mutate(math = (PV1MATH + PV2MATH + PV3MATH) / 3)
```
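`mutate` can create several variables in one call, and later definitions can refer to earlier ones. A minimal sketch (`math_ctr` is a made-up name for illustration):

```{r}
pisa_au %>%
  mutate(math = (PV1MATH + PV2MATH + PV3MATH) / 3,  # average of three plausible values
         math_ctr = math - mean(math))              # uses the math column just created
```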
#### Your turn
- Compute a standardised math score by subtracting the mean from PV1MATH and dividing by the standard deviation
- For the french fries data, compute a new variable called lrating by taking a log of the rating
```{r eval=FALSE, echo=FALSE}
pisa_au %>%
  mutate(math_std = (PV1MATH - mean(PV1MATH)) / sd(PV1MATH))
ff_long %>%
  mutate(lrating = log10(rating))
```
## summarise and group_by
`summarise` generates a summary of a variable. It is most useful when working on subsets of the data, and `group_by` can automate this.
```{r}
ff_long %>%
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
```
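`summarise` can also compute several statistics at once, and `n()` counts the observations in each group:

```{r}
ff_long %>%
  group_by(type) %>%
  summarise(n = n(),
            m = mean(rating, na.rm = TRUE),
            s = sd(rating, na.rm = TRUE))
```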
#### Your turn
- Compute the average rating by subject
- Compute the average rancid rating per week
```{r eval=FALSE, echo=FALSE}
ff_long %>%
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
ff_long %>%
  filter(type == "rancid") %>%
  group_by(time) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
```
## arrange
`arrange` orders a table by a given variable. It's useful primarily for displaying results.
```{r}
ff_long %>%
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
```
#### Your turn
- Arrange the average rating by type in decreasing order
- Arrange the average subject rating from lowest to highest.
```{r echo=FALSE, eval=FALSE}
ff_long %>%
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(rating))
ff_long %>%
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
```
## count, tally
These are convenience functions for counting the number of observations in each category.
```{r}
ff_long %>% count(type, sort=TRUE)
```
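`count(type)` is shorthand for a `group_by(type)` followed by `tally()`, which is in turn shorthand for `summarise(n = n())`; this produces the same table:

```{r}
ff_long %>% group_by(type) %>% tally(sort = TRUE)
```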
## Putting it together to problem solve
### Are ratings similar?
```{r}
ff_long %>%
  group_by(type) %>%
  summarise(m = mean(rating, na.rm = TRUE),
            sd = sd(rating, na.rm = TRUE)) %>%
  arrange(desc(m))
```
The scales of the ratings are quite different: mostly the chips are rated highly on potato'y, but low on grassy.
Make a picture too!
```{r}
ff_long %>%
  ggplot(aes(x = type, y = rating)) +
  geom_boxplot()
```
### Are reps like each other?
```{r}
ff_long %>%
  spread(rep, rating) %>%
  summarise(r = cor(`1`, `2`, use = "complete.obs"))
```
Make a picture, too!
```{r}
ff_long %>%
  spread(rep, rating) %>%
  ggplot(aes(x = `1`, y = `2`)) + geom_point()
```
This data is of poor quality: the replicates do not look like each other!
### Replicates by rating type
```{r}
ff_long %>%
  spread(rep, rating) %>%
  group_by(type) %>%
  summarise(r = cor(`1`, `2`, use = "complete.obs"))
```
Make a picture, too!
```{r}
ff_long %>%
  spread(rep, rating) %>%
  ggplot(aes(x = `1`, y = `2`)) +
  geom_point() +
  facet_wrap(~type, ncol = 5)
```
Potato'y and buttery have better replication than the other scales, but there is still a lot of variation from rep 1 to 2.
## Exercise
This question is about the 2015 PISA results. The data was downloaded from [http://www.oecd.org/pisa/data/2015database/](http://www.oecd.org/pisa/data/2015database/). The SPSS format "Student questionnaire data file (419MB)" was downloaded and processed with this code to extract the results for Australia:
```{r eval=FALSE}
library(haven)
pisa_2015 <- read_sav(file.choose())
pisa_au <- pisa_2015 %>% filter(CNT == "AUS")
save(pisa_au, file="pisa_au.rda")
```
You don't need to do this, because the Australia data is extracted and saved for you. Your task is to answer these questions about Australia. At times it may be helpful to examine the data dictionary, which is provided as an Excel file (you can also download it directly from the OECD PISA site).
A large amount of pre-processing is done on the data by this code:
```{r}
load("data/pisa_au.rda")
pisa_au <- pisa_au %>%
  mutate(state = as.character(substr(STRATUM, 4, 5)),
         schtype_yr = as.character(substr(STRATUM, 6, 7))) %>%
  mutate(state = recode(state, "01"="ACT", "02"="NSW", "03"="VIC",
                        "04"="QLD", "05"="SA", "06"="WA", "07"="TAS", "08"="NT")) %>%
  mutate(schtype_yr = recode(schtype_yr,
                             "01"="Catholic_Y10", "02"="Catholic_noY10",
                             "03"="Gov_Y10", "04"="Gov_noY10",
                             "05"="Ind_Y10", "06"="Ind_noY10",
                             "07"="Catholic_Y10", "08"="Catholic_noY10",
                             "09"="Gov_Y10", "10"="Gov_noY10",
                             "11"="Ind_Y10", "12"="Ind_noY10",
                             "13"="Catholic_Y10", "14"="Catholic_noY10",
                             "15"="Gov_Y10", "16"="Gov_noY10",
                             "17"="Ind_Y10", "18"="Ind_noY10",
                             "19"="Catholic_Y10", "20"="Catholic_noY10",
                             "21"="Gov_Y10", "22"="Gov_noY10",
                             "23"="Ind_Y10", "24"="Ind_noY10",
                             "25"="Catholic_Y10", "26"="Catholic_noY10",
                             "27"="Gov_Y10", "28"="Gov_noY10",
                             "29"="Ind_Y10", "30"="Ind_noY10",
                             "31"="Catholic_Y10", "32"="Catholic_noY10",
                             "33"="Gov_Y10", "34"="Gov_noY10",
                             "35"="Ind_Y10", "36"="Ind_noY10",
                             "37"="Catholic_Y10", "38"="Catholic_noY10",
                             "39"="Gov_Y10", "40"="Gov_noY10",
                             "41"="Ind_Y10", "42"="Ind_noY10",
                             "43"="Catholic_Y10", "44"="Catholic_noY10",
                             "45"="Gov_Y10", "46"="Gov_noY10",
                             "47"="Ind_Y10", "48"="Ind_noY10")) %>%
  separate(schtype_yr, c("schtype", "yr")) %>%
  rename(birthmonth=ST003D02T, birthyr=ST003D03T,
         gender=ST004D01T, desk=ST011Q01TA,
         room=ST011Q02TA, computer=ST011Q04TA, internet=ST011Q06TA,
         solarpanels=ST011D17TA, tvs=ST012Q01TA, cars=ST012Q02TA,
         music_instr=ST012Q09NA, books=ST013Q01TA, birthcnt=ST019AQ01T,
         mother_birthcnt=ST019BQ01T, father_birthcnt=ST019CQ01T,
         test_anxiety=ST118Q01NA, ambitious=ST119Q04NA,
         prefer_team=ST082Q01NA, make_friends_easy=ST034Q02TA,
         tardy=ST062Q03TA, science_fun=ST094Q01NA, breakfast=ST076Q01NA,
         work_pay=ST078Q10NA, sport=ST078Q11NA, internet_use=IC006Q01TA,
         install_software=IC015Q02NA,
         outhours_study=OUTHOURS, math_time=MMINS, read_time=LMINS,
         science_time=SMINS, belong=BELONG,
         anxtest=ANXTEST, motivat=MOTIVAT, language=LANGN,
         home_edres=HEDRES, home_poss=HOMEPOS, wealth=WEALTH,
         stuweight=W_FSTUWT) %>%
  mutate(math = (PV1MATH+PV2MATH+PV3MATH+PV4MATH+PV5MATH+
                 PV6MATH+PV7MATH+PV8MATH+PV9MATH+PV10MATH)/10,
         science = (PV1SCIE+PV2SCIE+PV3SCIE+PV4SCIE+PV5SCIE+
                    PV6SCIE+PV7SCIE+PV8SCIE+PV9SCIE+PV10SCIE)/10,
         read = (PV1READ+PV2READ+PV3READ+PV4READ+PV5READ+
                 PV6READ+PV7READ+PV8READ+PV9READ+PV10READ)/10) %>%
  select(state, schtype, yr, birthmonth, birthyr, gender, desk, room,
         computer, internet, solarpanels, tvs, cars, music_instr, books,
         birthcnt, mother_birthcnt, father_birthcnt, test_anxiety,
         ambitious, prefer_team, make_friends_easy, tardy, science_fun,
         breakfast, work_pay, sport, internet_use, install_software,
         outhours_study, math_time, read_time, science_time, belong,
         anxtest, motivat, language, home_edres, home_poss, wealth,
         stuweight, math, science, read) %>%
  mutate(gender = factor(gender, levels=1:2, labels=c("female", "male"))) %>%
  mutate(birthmonth = factor(birthmonth, levels=1:12,
                             labels=c("jan", "feb", "mar", "apr", "may", "jun",
                                      "jul", "aug", "sep", "oct", "nov", "dec")))
```
a. Explain how the STRATUM variable is processed to create the three new variables `state`, `schtype` and `yr`.
b. Compute the average of math scores by state. Which state does best, on average, on math? (You should use the stuweight variable to compute a weighted average. This is survey data, and the weights indicate how representative each individual is of the population. A sketch of one possible answer is hidden after this list.)
c. Compute the difference in average male and female math scores by state. Which state has the smallest average gender difference?
d. Does test anxiety have an effect on math score? (Use the variable `anxtest`, and a simple regression model, to answer this question.)
e. Explain what the `rename` operation is doing.
f. Come up with two more questions as a group, based on the data description. Do the wrangling to answer these questions. Discuss these with a tutor.
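A sketch of one possible answer to (b) is hidden below, like the other solution chunks; it assumes the `math`, `state` and `stuweight` variables created by the pre-processing code above.

```{r eval=FALSE, echo=FALSE}
# weighted average math score by state, highest first
pisa_au %>%
  group_by(state) %>%
  summarise(math = weighted.mean(math, w = stuweight, na.rm = TRUE)) %>%
  arrange(desc(math))
```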
## Relational data, and joins
It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called *relational data* because it is the relations, not just the individual datasets, that are important.
```{r}
library(tidyverse)
library(nycflights13)
```
The `nycflights13` package contains a small example data set: flights that departed NYC in 2013, taken from a public database of all commercial airline flights in the USA, https://www.transtats.bts.gov. It has five tables:
```{r}
flights
airlines
airports
planes
weather
```
### Keys
Keys are variables used to connect the records in one table with those in another. In the `nycflights13` data,
- flights connects to planes by a single variable `tailnum`
- flights connects to airlines by a single variable `carrier`
- flights connects to airports by two variables, `origin` and `dest`
- flights connects to weather using five variables: `origin`, `year`, `month`, `day` and `hour`.
#### Your turn
The `Lahman` package contains multiple tables of baseball data. What key(s) connect the batting table with the salary table?
```{r}
library(Lahman)
glimpse(Batting)
glimpse(Salaries)
```
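A possible solution is hidden below; it assumes the `Lahman` tables as documented, where the two tables share `playerID`, `yearID`, `teamID` and `lgID`.

```{r eval=FALSE, echo=FALSE}
Batting %>%
  left_join(Salaries, by = c("playerID", "yearID", "teamID", "lgID"))
```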
### Joins
The textbook calls these "mutating joins": they add variables from one table to another. There is also always a decision about which observations are copied into the new table.
#### Types of joins
- **Inner join**: intersection between the two tables, only the observations that are in both
- **Outer (full) join**: union of the two tables, all observations from both, and missing values might get added
- **Left join**: All observations from the "left" table, but only the observations from the "right" table that match those in the left.
- **Right join**: same as a left join, with the roles of the two tables reversed.
And a really simple explanation of joins with animations can be found [here](https://twitter.com/grrrck/status/1029567123029467136).
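To make the four types concrete, here is a minimal sketch with two toy tables (`df1` and `df2` are made up for illustration):

```{r}
df1 <- tibble(id = 1:3, x = c("a", "b", "c"))
df2 <- tibble(id = 2:4, y = c("B", "C", "D"))
inner_join(df1, df2, by = "id")  # ids 2, 3: rows in both tables
full_join(df1, df2, by = "id")   # ids 1-4: all rows, NAs fill the gaps
left_join(df1, df2, by = "id")   # ids 1-3: all of df1
right_join(df1, df2, by = "id")  # ids 2-4: all of df2
```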
#### Example: airlines
If you want to add the full airline name to the flights data, you can combine the `airlines` and `flights` data frames with `left_join()`:
```{r}
flights %>% left_join(airlines, by="carrier") %>% glimpse()
flights %>% left_join(airports, by=c("origin"="faa")) %>% glimpse()
```
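A multi-variable key works the same way; for example, to attach the hourly weather record to each flight:

```{r}
flights %>%
  left_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
  glimpse()
```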
### Airline travel, on-time data
```{r}
load("data/plane_N4YRAA.rda")
plane_N4YRAA %>% glimpse()
```
#### Airline travel, airport location
```{r}
airport <- read_csv("data/airports.csv")
airport %>%
  select(AIRPORT, LATITUDE, LONGITUDE, AIRPORT_STATE_NAME) %>%
  glimpse()
```
#### Joining the two tables
- The purpose is to show flight movement on a map
- The key is the airport three-letter code,
    - called ORIGIN or DEST in the plane_N4YRAA table
    - called AIRPORT in the airport table
- One table, plane_N4YRAA, has fewer airports than the other
- We only want to keep the rows of the airport table for airports that appear in the plane_N4YRAA table
```{r}
airport <- airport %>%
  select(AIRPORT, LATITUDE, LONGITUDE, AIRPORT_IS_LATEST, DISPLAY_AIRPORT_NAME) %>%
  filter(AIRPORT_IS_LATEST == 1) %>%
  select(-AIRPORT_IS_LATEST)
N4YRAA_latlon <- left_join(plane_N4YRAA, airport,
                           by = c("ORIGIN" = "AIRPORT")) %>%
  rename("ORIGIN_LATITUDE" = "LATITUDE",
         "ORIGIN_LONGITUDE" = "LONGITUDE")
N4YRAA_latlon %>%
  select(ORIGIN, ORIGIN_LATITUDE, ORIGIN_LONGITUDE,
         DISPLAY_AIRPORT_NAME)
```
The variables ORIGIN_LATITUDE, ORIGIN_LONGITUDE and DISPLAY_AIRPORT_NAME are added to the corresponding rows of the plane_N4YRAA table.
#### Add destination locations
- We have added the spatial coordinates (lat, lon) for the origin airport
- The same needs to be done for the destination airport
- Then the airports can be drawn over a map
```{r}
N4YRAA_latlon <- left_join(N4YRAA_latlon, airport,
                           by = c("DEST" = "AIRPORT")) %>%
  rename("DEST_LATITUDE" = "LATITUDE",
         "DEST_LONGITUDE" = "LONGITUDE")
N4YRAA_latlon <- N4YRAA_latlon %>% arrange(FL_DATE, DEP_TIME)
```
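A left join keeps every row of N4YRAA_latlon even when no airport matches, filling the location with missing values, so a quick sanity check for unmatched destination codes is worthwhile:

```{r}
N4YRAA_latlon %>%
  filter(is.na(DEST_LATITUDE)) %>%
  distinct(DEST)
```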
#### Map it
```{r}
library(lubridate)
library(ggthemes)
library(ggmap)
#register_google(XXX)
#map_plane <- get_map(c(lon=-92.20562, lat=36.20259), zoom=5)
load("data/map_plane.rda")
# one day's flights
ggmap(map_plane) +
  geom_segment(data=filter(N4YRAA_latlon,
                           FL_DATE == ymd("2017-05-06")),
               aes(x=ORIGIN_LONGITUDE, xend=DEST_LONGITUDE,
                   y=ORIGIN_LATITUDE, yend=DEST_LATITUDE),
               color="navyblue") +
  geom_point(data=filter(N4YRAA_latlon,
                         FL_DATE == ymd("2017-05-06")),
             aes(x=ORIGIN_LONGITUDE, y=ORIGIN_LATITUDE),
             color="orange", alpha=0.3, size=3) +
  geom_point(data=filter(N4YRAA_latlon,
                         FL_DATE == ymd("2017-05-06")),
             aes(x=DEST_LONGITUDE, y=DEST_LATITUDE),
             color="red", alpha=0.3, size=1) +
  theme_map()
# all days, one panel per date
ggmap(map_plane) +
  geom_segment(data=N4YRAA_latlon,
               aes(x=ORIGIN_LONGITUDE, xend=DEST_LONGITUDE,
                   y=ORIGIN_LATITUDE, yend=DEST_LATITUDE),
               color="navyblue") +
  geom_point(data=N4YRAA_latlon,
             aes(x=ORIGIN_LONGITUDE, y=ORIGIN_LATITUDE),
             color="orange", alpha=0.3, size=3) +
  geom_point(data=N4YRAA_latlon,
             aes(x=DEST_LONGITUDE, y=DEST_LATITUDE),
             color="red", alpha=0.3, size=1) +
  theme_map() +
  facet_wrap(~FL_DATE)
```
## Exercise
Complete these exercises about the `nycflights13` data using wrangling operations, an appropriate join, and a plot.
1. Make a map that shows the origin-to-destination trips made by Delta flights from La Guardia airport for the month of August. You will need to
    a. Filter the flights data to contain just Delta flights, for August
    b. Add the airport locations (lat, long) of the origin and destination to the flights data
    c. Pull a google map, and plot it
    d. Draw lines connecting origin to destination airports on the map
2. Does wind direction affect operations at the airport when the wind speed is stronger? Generally, cross winds affect airport operations: if the wind is reasonably strong and blowing across the runway, there are likely to be more delays. It could be helpful to find maps of the three NYC airports to check how many runways they have and their orientation. You will need to
    a. Join the weather data to the flights data
    b. Filter by airport and higher wind speeds
    c. Plot delay against wind direction, perhaps focusing on a restricted range of delay or using only a smoother instead of all the points
```{r echo=FALSE, eval=FALSE}
airports %>% filter(faa == "LGA")
delta <- flights %>%
  filter(carrier == "DL", origin == "LGA", month == 8) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  mutate(orig_lon = -73.9, orig_lat = 40.8)
#us_map <- get_map(c(lon=-92.20562, lat=36.20259), zoom=4)
load("data/us_map.rda")
ggmap(us_map) +
  geom_segment(data=delta,
               aes(x=orig_lon, xend=lon,
                   y=orig_lat, yend=lat),
               color="navyblue", alpha=0.2)
flgt_weath <- flights %>%
  filter(origin == "LGA") %>%
  left_join(weather, by = c("origin", "time_hour")) %>%
  filter(wind_speed > 25)
ggplot(flgt_weath, aes(x=wind_dir, y=dep_delay)) +
  #geom_point(alpha=0.1) +
  geom_smooth(se=FALSE)
```
```