---
knit: bookdown::preview_chapter
---
# Wrangling Data
## Overview
In this chapter we will learn about actions on data that are useful for data analysis. These are the verbs in the tidyverse that are commonly used to operate on data (the nouns):
- `filter`
- `select`
- `mutate`
- `group_by`
- `summarise`
- `arrange`
- `count`, `tally`
## Side-note: Package system
We have been loading the `tidyverse` package. It's actually a suite of packages, and you can learn more about the individual packages at https://www.tidyverse.org. You could also load each package individually.
Because so many people contribute packages to R, the package system is both a blessing and a curse: the best techniques are available, but there can be conflicts between function names. When you load the tidyverse it prints a helpful summary of the conflicts it knows about between its functions and others.
```{r message=TRUE}
library(tidyverse)
```
For example, there is a `filter()` function in the `stats` package that comes with the R distribution. This can cause confusion when you want to use the `filter()` function from `dplyr` (part of the tidyverse). To be sure the function you use is the one you want, you can prefix it with the package name: `dplyr::filter()`.
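As a small illustration of the difference, these two calls are entirely unrelated despite sharing a name (`mtcars` is a built-in data set):

```{r eval=FALSE}
stats::filter(1:10, rep(1/3, 3))  # a moving-average filter for time series
dplyr::filter(mtcars, cyl == 6)   # choose rows of a data frame
```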
## Example: french fries

- Load the data
- Gather into long form
```{r}
load("data/french_fries.rda")
ff_long <- french_fries %>%
  gather(type, rating, -subject, -time, -treatment, -rep) %>%
  mutate(type = as.factor(type))
```
### The story of the french fries data....
This was an actual experiment in Food Sciences at Iowa State University. The goal was to find out whether some cheaper oils could be used to make hot chips, that is, whether people would be unable to distinguish chips fried in the new oils from chips fried in the current market-leading oil.
Twelve tasters were recruited to sample two chips from each batch, over a period of ten weeks. The same oil was kept for the full ten weeks, which may have been a bit gross by the end!
This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data and the plyr tools.
## filter: choose observations
Filter records for subject 10.
```{r}
ff_long_subj10 <- ff_long %>%
  filter(subject == 10)
ff_long_subj10
```
`==` does a logical equality check. This works nicely for characters or factors, but it can be unreliable for numerical data because of floating point arithmetic.
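A quick sketch of the problem, and of `dplyr::near()`, which compares within a small tolerance:

```{r}
sqrt(2)^2 == 2      # FALSE, because of floating point representation
near(sqrt(2)^2, 2)  # TRUE
```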
To filter you need to do some comparison to find the subset of observations you are interested in. For example:
- `subject != 10` finds rows corresponding to all subjects except subject 10
- `x > 10` finds all rows where variable `x` has values bigger than 10
- `x >= 10` finds all rows where variable `x` has values bigger than *or equal to* 10
- `class %in% c("A", "B")` finds all records where variable `class` is either A or B
- `!is.na(y)` finds all the records that do *NOT* have a missing value for variable `y`
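Conditions can be combined: arguments to `filter()` separated by commas are combined with AND, and `|` gives OR. A small sketch on the french fries data:

```{r}
# subject 10's week 1 ratings, dropping missing values
ff_long %>%
  filter(subject == 10, time == 1, !is.na(rating))
```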
#### Your turn
Filter the french fries data to have:
- only week 1
- weeks 1-4 only
- oil type 1 (oil type is called treatment)
- oil types 1 and 3 but not 2
```{r eval=FALSE, echo=FALSE}
ff_long %>%
  filter(time == 1)
ff_long %>%
  filter(time %in% 1:4)
ff_long %>%
  filter(treatment == 1)
ff_long %>%
  filter(treatment %in% c(1, 3))
```
## select
`select` chooses which variables to keep in the data set. It is useful when there are a lot of variables but you only need some of them for an analysis.
```{r}
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
  select(country, year, starts_with("new_sp_"))
tb %>% top_n(20)
```
- Generally, you give a comma-separated list of the variables, by name.
- You can use text matching on the names with helpers like `starts_with()`, `ends_with()`, `contains()`, `matches()`, `num_range()`, `one_of()` or `everything()`
- You can use `:` to choose a range of variables in column order, e.g. `select(df, V4:V6)`
- And you can drop variables by prefixing their names with `-`
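A couple of variations on the TB data; note that the exact column names (e.g. the `new_sp_m` prefix for the male-stratified counts) are an assumption about this particular file:

```{r eval=FALSE}
tb %>% select(country:year)              # a range of adjacent columns
tb %>% select(-starts_with("new_sp_m"))  # drop columns by name prefix
```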
#### Your turn
For the french fries data, use select to do these tasks:
- choose time, treatment and rep
- choose subject through to rating
- drop subject
```{r eval=FALSE, echo=FALSE}
ff_long %>%
  select(time, treatment, rep)
ff_long %>%
  select(subject:rating)
ff_long %>%
  select(-subject)
```
## mutate
`mutate` is used to create a new variable.
```{r}
load("data/pisa_au.rda")
pisa_au <- pisa_au %>%
  select(PV1MATH:PV10MATH)
pisa_au %>%
  mutate(math = (PV1MATH + PV2MATH + PV3MATH) / 3)
```
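`mutate` can create several variables in one call, and later definitions can refer to earlier ones. A minimal sketch (`math_ctr` is a made-up name for illustration):

```{r}
pisa_au %>%
  mutate(math = (PV1MATH + PV2MATH + PV3MATH) / 3,  # average of three plausible values
         math_ctr = math - mean(math))              # uses the math column just created
```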
#### Your turn
- Compute a standardised math score by subtracting the mean from PV1MATH and dividing by the standard deviation
- For the french fries data, compute a new variable called lrating by taking a log of the rating
```{r eval=FALSE, echo=FALSE}
pisa_au %>%
  mutate(math_std = (PV1MATH - mean(PV1MATH)) / sd(PV1MATH))
ff_long %>%
  mutate(lrating = log10(rating))
```
## summarise and group_by
`summarise` generates a summary of a variable. It is most useful when working on subsets of the data, and `group_by` can automate this.
```{r}
ff_long %>%
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
```
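`summarise` can also compute several statistics at once, and `n()` counts the observations in each group:

```{r}
ff_long %>%
  group_by(type) %>%
  summarise(n = n(),
            m = mean(rating, na.rm = TRUE),
            s = sd(rating, na.rm = TRUE))
```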
#### Your turn
- Compute the average rating by subject
- Compute the average rancid rating per week
```{r eval=FALSE, echo=FALSE}
ff_long %>%
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
ff_long %>%
  filter(type == "rancid") %>%
  group_by(time) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
```
## arrange
`arrange` orders a table by a given variable. It's useful primarily for displaying results.
```{r}
ff_long %>%
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
```
#### Your turn
- Arrange the average rating by type in decreasing order
- Arrange the average subject rating from lowest to highest.
```{r echo=FALSE, eval=FALSE}
ff_long %>%
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(rating))
ff_long %>%
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
```
## count, tally
These are convenience functions for counting the number of observations in each category.
```{r}
ff_long %>% count(type, sort=TRUE)
```
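`count(type)` is shorthand for a `group_by(type)` followed by `tally()`, which is in turn shorthand for `summarise(n = n())`; this produces the same table:

```{r}
ff_long %>% group_by(type) %>% tally(sort = TRUE)
```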
## Putting it together to problem solve
### Are ratings similar?
```{r}
ff_long %>%
  group_by(type) %>%
  summarise(m = mean(rating, na.rm = TRUE),
            sd = sd(rating, na.rm = TRUE)) %>%
  arrange(desc(m))
```
The scales of the ratings are quite different: mostly the chips are rated highly on potato'y, but low on grassy.
Make a picture too!
```{r}
ff_long %>%
  ggplot(aes(x = type, y = rating)) +
  geom_boxplot()
```
### Are reps like each other?
```{r}
ff_long %>%
  spread(rep, rating) %>%
  summarise(r = cor(`1`, `2`, use = "complete.obs"))
```
Make a picture, too!
```{r}
ff_long %>%
  spread(rep, rating) %>%
  ggplot(aes(x = `1`, y = `2`)) + geom_point()
```
This data is of poor quality: the replicates do not look like each other!
### Replicates by rating type
```{r}
ff_long %>%
  spread(rep, rating) %>%
  group_by(type) %>%
  summarise(r = cor(`1`, `2`, use = "complete.obs"))
```
Make a picture, too!
```{r}
ff_long %>%
  spread(rep, rating) %>%
  ggplot(aes(x = `1`, y = `2`)) +
  geom_point() +
  facet_wrap(~type, ncol = 5)
```
Potato'y and buttery have better replication than the other scales, but there is still a lot of variation from rep 1 to 2.
## Exercise
This question is about the 2015 PISA results. The data was downloaded from [http://www.oecd.org/pisa/data/2015database/](http://www.oecd.org/pisa/data/2015database/). The SPSS format "Student questionnaire data file (419MB)" was downloaded and processed with this code to extract the results for Australia:
```{r eval=FALSE}
library(haven)
pisa_2015 <- read_sav(file.choose())
pisa_au <- pisa_2015 %>% filter(CNT == "AUS")
save(pisa_au, file="pisa_au.rda")
```
You don't need to do this, because the Australia data is extracted and saved for you. Your task is to answer these questions about Australia. At times it may be helpful to examine the data dictionary, which is provided as an Excel file (you can also download it directly from the OECD PISA site).
A large amount of pre-processing is done on the data by this code:
```{r}
load("data/pisa_au.rda")
pisa_au <- pisa_au %>%
  mutate(state = as.character(substr(STRATUM, 4, 5)),
         schtype_yr = as.character(substr(STRATUM, 6, 7))) %>%
  mutate(state = recode(state, "01"="ACT", "02"="NSW", "03"="VIC",
                        "04"="QLD", "05"="SA", "06"="WA", "07"="TAS", "08"="NT")) %>%
  mutate(schtype_yr = recode(schtype_yr,
                             "01"="Catholic_Y10", "02"="Catholic_noY10",
                             "03"="Gov_Y10", "04"="Gov_noY10",
                             "05"="Ind_Y10", "06"="Ind_noY10",
                             "07"="Catholic_Y10", "08"="Catholic_noY10",
                             "09"="Gov_Y10", "10"="Gov_noY10",
                             "11"="Ind_Y10", "12"="Ind_noY10",
                             "13"="Catholic_Y10", "14"="Catholic_noY10",
                             "15"="Gov_Y10", "16"="Gov_noY10",
                             "17"="Ind_Y10", "18"="Ind_noY10",
                             "19"="Catholic_Y10", "20"="Catholic_noY10",
                             "21"="Gov_Y10", "22"="Gov_noY10",
                             "23"="Ind_Y10", "24"="Ind_noY10",
                             "25"="Catholic_Y10", "26"="Catholic_noY10",
                             "27"="Gov_Y10", "28"="Gov_noY10",
                             "29"="Ind_Y10", "30"="Ind_noY10",
                             "31"="Catholic_Y10", "32"="Catholic_noY10",
                             "33"="Gov_Y10", "34"="Gov_noY10",
                             "35"="Ind_Y10", "36"="Ind_noY10",
                             "37"="Catholic_Y10", "38"="Catholic_noY10",
                             "39"="Gov_Y10", "40"="Gov_noY10",
                             "41"="Ind_Y10", "42"="Ind_noY10",
                             "43"="Catholic_Y10", "44"="Catholic_noY10",
                             "45"="Gov_Y10", "46"="Gov_noY10",
                             "47"="Ind_Y10", "48"="Ind_noY10")) %>%
  separate(schtype_yr, c("schtype", "yr")) %>%
  rename(birthmonth=ST003D02T, birthyr=ST003D03T,
         gender=ST004D01T, desk=ST011Q01TA,
         room=ST011Q02TA, computer=ST011Q04TA, internet=ST011Q06TA,
         solarpanels=ST011D17TA, tvs=ST012Q01TA, cars=ST012Q02TA,
         music_instr=ST012Q09NA, books=ST013Q01TA, birthcnt=ST019AQ01T,
         mother_birthcnt=ST019BQ01T, father_birthcnt=ST019CQ01T,
         test_anxiety=ST118Q01NA, ambitious=ST119Q04NA,
         prefer_team=ST082Q01NA, make_friends_easy=ST034Q02TA,
         tardy=ST062Q03TA, science_fun=ST094Q01NA, breakfast=ST076Q01NA,
         work_pay=ST078Q10NA, sport=ST078Q11NA, internet_use=IC006Q01TA,
         install_software=IC015Q02NA,
         outhours_study=OUTHOURS, math_time=MMINS, read_time=LMINS,
         science_time=SMINS, belong=BELONG,
         anxtest=ANXTEST, motivat=MOTIVAT, language=LANGN,
         home_edres=HEDRES, home_poss=HOMEPOS, wealth=WEALTH,
         stuweight=W_FSTUWT) %>%
  mutate(math = (PV1MATH+PV2MATH+PV3MATH+PV4MATH+PV5MATH+
                 PV6MATH+PV7MATH+PV8MATH+PV9MATH+PV10MATH)/10,
         science = (PV1SCIE+PV2SCIE+PV3SCIE+PV4SCIE+PV5SCIE+
                    PV6SCIE+PV7SCIE+PV8SCIE+PV9SCIE+PV10SCIE)/10,
         read = (PV1READ+PV2READ+PV3READ+PV4READ+PV5READ+
                 PV6READ+PV7READ+PV8READ+PV9READ+PV10READ)/10) %>%
  select(state, schtype, yr, birthmonth, birthyr, gender, desk, room,
         computer, internet, solarpanels, tvs, cars, music_instr, books,
         birthcnt, mother_birthcnt, father_birthcnt, test_anxiety,
         ambitious, prefer_team, make_friends_easy, tardy, science_fun,
         breakfast, work_pay, sport, internet_use, install_software,
         outhours_study, math_time, read_time, science_time, belong,
         anxtest, motivat, language, home_edres, home_poss, wealth,
         stuweight, math, science, read) %>%
  mutate(gender = factor(gender, levels=1:2, labels=c("female", "male"))) %>%
  mutate(birthmonth = factor(birthmonth, levels=1:12,
                             labels=c("jan", "feb", "mar", "apr", "may", "jun",
                                      "jul", "aug", "sep", "oct", "nov", "dec")))
```
a. Explain how the STRATUM variable is processed to create the three new variables `state`, `schtype` and `yr`.
b. Compute the average of math scores by state. Which state does best, on average, on math? (You should use the stuweight variable to compute a weighted average. This is survey data, and the weights indicate how representative each individual is of the population. A sketch of one possible answer is hidden after this list.)
c. Compute the difference in average male and female math scores by state. Which state has the smallest average gender difference?
d. Does test anxiety have an effect on math score? (Use the variable `anxtest`, and a simple regression model, to answer this question.)
e. Explain what the `rename` operation is doing.
f. Come up with two more questions as a group, based on the data description. Do the wrangling to answer these questions. Discuss these with a tutor.
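A sketch of one possible answer to (b) is hidden below, like the other solution chunks; it assumes the `math`, `state` and `stuweight` variables created by the pre-processing code above.

```{r eval=FALSE, echo=FALSE}
# weighted average math score by state, highest first
pisa_au %>%
  group_by(state) %>%
  summarise(math = weighted.mean(math, w = stuweight, na.rm = TRUE)) %>%
  arrange(desc(math))
```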
## Relational data, and joins
It’s rare that a data analysis involves only a single table of data. Typically you have many tables of data, and you must combine them to answer the questions that you’re interested in. Collectively, multiple tables of data are called *relational data* because it is the relations, not just the individual datasets, that are important.
```{r}
library(tidyverse)
library(nycflights13)
```
The `nycflights13` package contains a small example data set: flights that departed NYC in 2013, taken from a public database of all commercial airline flights in the USA, https://www.transtats.bts.gov. It has five tables:
```{r}
flights
airlines
airports
planes
weather
```
### Keys
Keys are variables used to connect the records in one table with those in another. In the `nycflights13` data,
- flights connects to planes by a single variable `tailnum`
- flights connects to airlines by a single variable `carrier`
- flights connects to airports by two variables, `origin` and `dest`
- flights connects to weather using five variables: `origin`, `year`, `month`, `day` and `hour`.
#### Your turn
The `Lahman` package contains multiple tables of baseball data. What key(s) connect the batting table with the salary table?
```{r}
library(Lahman)
glimpse(Batting)
glimpse(Salaries)
```
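A possible solution is hidden below; it assumes the `Lahman` tables as documented, where the two tables share `playerID`, `yearID`, `teamID` and `lgID`.

```{r eval=FALSE, echo=FALSE}
Batting %>%
  left_join(Salaries, by = c("playerID", "yearID", "teamID", "lgID"))
```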
### Joins
The textbook calls these "mutating joins": they add variables from one table to another. There is also always a decision about which observations are copied into the new table.
#### Types of joins
- **Inner join**: intersection between the two tables, only the observations that are in both
- **Outer (full) join**: union of the two tables, all observations from both, and missing values might get added
- **Left join**: All observations from the "left" table, but only the observations from the "right" table that match those in the left.
- **Right join**: same as a left join, with the roles of the two tables reversed.
And a really simple explanation of joins with animations can be found [here](https://twitter.com/grrrck/status/1029567123029467136).
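To make the four types concrete, here is a minimal sketch with two toy tables (`df1` and `df2` are made up for illustration):

```{r}
df1 <- tibble(id = 1:3, x = c("a", "b", "c"))
df2 <- tibble(id = 2:4, y = c("B", "C", "D"))
inner_join(df1, df2, by = "id")  # ids 2, 3: rows in both tables
full_join(df1, df2, by = "id")   # ids 1-4: all rows, NAs fill the gaps
left_join(df1, df2, by = "id")   # ids 1-3: all of df1
right_join(df1, df2, by = "id")  # ids 2-4: all of df2
```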
#### Example: airlines
If you want to add the full airline name to the flights data, you can combine the `airlines` and `flights` data frames with `left_join()`:
```{r}
flights %>% left_join(airlines, by="carrier") %>% glimpse()
flights %>% left_join(airports, by=c("origin"="faa")) %>% glimpse()
```
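A multi-variable key works the same way; for example, to attach the hourly weather record to each flight:

```{r}
flights %>%
  left_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
  glimpse()
```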
### Airline travel, on-time data
```{r}
load("data/plane_N4YRAA.rda")
plane_N4YRAA %>% glimpse()
```
#### Airline travel, airport location
```{r}
airport <- read_csv("data/airports.csv")
airport %>%
  select(AIRPORT, LATITUDE, LONGITUDE, AIRPORT_STATE_NAME) %>%
  glimpse()
```
#### Joining the two tables
- The purpose is to show flight movement on a map
- The key is the airport three-letter code,
    - called ORIGIN or DEST in the plane_N4YRAA table
    - called AIRPORT in the airport table
- One table, plane_N4YRAA, has fewer airports than the other
- We only want to keep the rows of the airport table for airports that appear in the plane_N4YRAA table
```{r}
airport <- airport %>%
  select(AIRPORT, LATITUDE, LONGITUDE, AIRPORT_IS_LATEST, DISPLAY_AIRPORT_NAME) %>%
  filter(AIRPORT_IS_LATEST == 1) %>%
  select(-AIRPORT_IS_LATEST)
N4YRAA_latlon <- left_join(plane_N4YRAA, airport,
                           by = c("ORIGIN" = "AIRPORT")) %>%
  rename("ORIGIN_LATITUDE" = "LATITUDE",
         "ORIGIN_LONGITUDE" = "LONGITUDE")
N4YRAA_latlon %>%
  select(ORIGIN, ORIGIN_LATITUDE, ORIGIN_LONGITUDE,
         DISPLAY_AIRPORT_NAME)
```
The variables ORIGIN_LATITUDE, ORIGIN_LONGITUDE and DISPLAY_AIRPORT_NAME are added to the corresponding rows of the plane_N4YRAA table.
#### Add destination locations
- We have added the spatial coordinates (lat, lon) for the origin airport
- The same needs to be done for the destination airport
- Then the airports can be drawn over a map
```{r}
N4YRAA_latlon <- left_join(N4YRAA_latlon, airport,
                           by = c("DEST" = "AIRPORT")) %>%
  rename("DEST_LATITUDE" = "LATITUDE",
         "DEST_LONGITUDE" = "LONGITUDE")
N4YRAA_latlon <- N4YRAA_latlon %>% arrange(FL_DATE, DEP_TIME)
```
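A left join keeps every row of N4YRAA_latlon even when no airport matches, filling the location with missing values, so a quick sanity check for unmatched destination codes is worthwhile:

```{r}
N4YRAA_latlon %>%
  filter(is.na(DEST_LATITUDE)) %>%
  distinct(DEST)
```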
#### Map it
```{r}
library(lubridate)
library(ggthemes)
library(ggmap)
#register_google(XXX)
#map_plane <- get_map(c(lon=-92.20562, lat=36.20259), zoom=5)
load("data/map_plane.rda")
# one day's flights
ggmap(map_plane) +
  geom_segment(data=filter(N4YRAA_latlon,
                           FL_DATE == ymd("2017-05-06")),
               aes(x=ORIGIN_LONGITUDE, xend=DEST_LONGITUDE,
                   y=ORIGIN_LATITUDE, yend=DEST_LATITUDE),
               color="navyblue") +
  geom_point(data=filter(N4YRAA_latlon,
                         FL_DATE == ymd("2017-05-06")),
             aes(x=ORIGIN_LONGITUDE, y=ORIGIN_LATITUDE),
             color="orange", alpha=0.3, size=3) +
  geom_point(data=filter(N4YRAA_latlon,
                         FL_DATE == ymd("2017-05-06")),
             aes(x=DEST_LONGITUDE, y=DEST_LATITUDE),
             color="red", alpha=0.3, size=1) +
  theme_map()
# all days, one panel per date
ggmap(map_plane) +
  geom_segment(data=N4YRAA_latlon,
               aes(x=ORIGIN_LONGITUDE, xend=DEST_LONGITUDE,
                   y=ORIGIN_LATITUDE, yend=DEST_LATITUDE),
               color="navyblue") +
  geom_point(data=N4YRAA_latlon,
             aes(x=ORIGIN_LONGITUDE, y=ORIGIN_LATITUDE),
             color="orange", alpha=0.3, size=3) +
  geom_point(data=N4YRAA_latlon,
             aes(x=DEST_LONGITUDE, y=DEST_LATITUDE),
             color="red", alpha=0.3, size=1) +
  theme_map() +
  facet_wrap(~FL_DATE)
```
## Exercise
Complete these exercises about the `nycflights13` data using wrangling operations, an appropriate join, and a plot.
1. Make a map that shows the origin-to-destination trips made by Delta flights from La Guardia airport for the month of August. You will need to
    a. Filter the flights data to contain just Delta flights, for August
    b. Add the airport locations (lat, long) of the origin and destination to the flights data
    c. Pull a google map, and plot it
    d. Draw lines connecting origin to destination airports on the map
2. Does wind direction affect operations at the airport when the wind speed is stronger? Generally, cross winds affect airport operations: if the wind is reasonably strong and blowing across the runway, there are likely to be more delays. It could be helpful to find maps of the three NYC airports to check how many runways they have and their orientation. You will need to
    a. Join the weather data to the flights data
    b. Filter by airport and higher wind speeds
    c. Plot delay against wind direction, perhaps focusing on a restricted range of delay or using only a smoother instead of all the points
```{r echo=FALSE, eval=FALSE}
airports %>% filter(faa == "LGA")
delta <- flights %>%
  filter(carrier == "DL", origin == "LGA", month == 8) %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  mutate(orig_lon = -73.9, orig_lat = 40.8)
#us_map <- get_map(c(lon=-92.20562, lat=36.20259), zoom=4)
load("data/us_map.rda")
ggmap(us_map) +
  geom_segment(data=delta,
               aes(x=orig_lon, xend=lon,
                   y=orig_lat, yend=lat),
               color="navyblue", alpha=0.2)
flgt_weath <- flights %>%
  filter(origin == "LGA") %>%
  left_join(weather, by = c("origin", "time_hour")) %>%
  filter(wind_speed > 25)
ggplot(flgt_weath, aes(x=wind_dir, y=dep_delay)) +
  #geom_point(alpha=0.1) +
  geom_smooth(se=FALSE)
```
```