Skip to content

Latest commit

 

History

History
114 lines (88 loc) · 3.91 KB

spread_gather.md

File metadata and controls

114 lines (88 loc) · 3.91 KB

Spread and Gather with tidyr

So far we have seen how to do some manipulation of the data, but we didn't really do too much with the structure of that data frame. In some cases we might need to have data that are stored in rows, as columns or vice-versa. Two of the function I use the most in tidyr, spread() and gather(), will accomplish this for us. They are somewhat similar to pivot tables in spreadsheets and allow us combine columns together or spread them back out. I'll admit it still sometimes feels a bit like magic. So, abracadabra!

Load up the library:

library(tidyr)

gather

Let's build an untidy data frame. This is made up, but let's say we want to grab some monthly stats on some variable (e.g., average number of Boston Red Sox Hats, in thousands) per state. This time we will use the tibble function. This is similar to data.frame but it makes good assumptions about the data and has some nice display options built in. It comes from the dplyr package.

dirty_df <- tibble(state = c("CT","RI","MA"), jan = c(3.1,8.9,9.3), feb = c(1.0,4.2,8.6), march = c(2.9,3.1,12.5), april = c(4.4,5.6,14.2))
dirty_df
## # A tibble: 3 × 5
##   state   jan   feb march april
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CT      3.1   1     2.9   4.4
## 2 RI      8.9   4.2   3.1   5.6
## 3 MA      9.3   8.6  12.5  14.2

What would be a tidy way to represent this same data set?

They way I would do this is to gather the month columns into two new columns that represent month and number of visitors.

tidy_df <- gather(dirty_df, month, vistors, jan:april) 
tidy_df
## # A tibble: 12 × 3
##    state month vistors
##    <chr> <chr>   <dbl>
##  1 CT    jan       3.1
##  2 RI    jan       8.9
##  3 MA    jan       9.3
##  4 CT    feb       1  
##  5 RI    feb       4.2
##  6 MA    feb       8.6
##  7 CT    march     2.9
##  8 RI    march     3.1
##  9 MA    march    12.5
## 10 CT    april     4.4
## 11 RI    april     5.6
## 12 MA    april    14.2

spread

Here's another possibility from a water quality example. We have data collected at multiple sampling locations and we are measuring multiple water quality parameters.

long_df <- tibble(station = rep(c("A","A","B","B"),3), 
                      month = c(rep("june",4),rep("july",4),rep("aug", 4)), 
                      parameter = rep(c("chla","temp"), 6),
                      value = c(18,23,3,22,19.5,24,3.5,22.25,32,26.7,4.2,23))
long_df
## # A tibble: 12 × 4
##    station month parameter value
##    <chr>   <chr> <chr>     <dbl>
##  1 A       june  chla       18  
##  2 A       june  temp       23  
##  3 B       june  chla        3  
##  4 B       june  temp       22  
##  5 A       july  chla       19.5
##  6 A       july  temp       24  
##  7 B       july  chla        3.5
##  8 B       july  temp       22.2
##  9 A       aug   chla       32  
## 10 A       aug   temp       26.7
## 11 B       aug   chla        4.2
## 12 B       aug   temp       23

We might want to have this in a wide as opposed to long format. That can be accomplished with spread()

wide_df <- spread(long_df,parameter,value)
wide_df
## # A tibble: 6 × 4
##   station month  chla  temp
##   <chr>   <chr> <dbl> <dbl>
## 1 A       aug    32    26.7
## 2 A       july   19.5  24  
## 3 A       june   18    23  
## 4 B       aug     4.2  23  
## 5 B       july    3.5  22.2
## 6 B       june    3    22

While these two simple examples showcase the general ideas, deciding on a given tidy structure for your data will depend on many things and the result will differ based on your task (i.e. data entry, visualization, modeling, etc.). A couple of nice reads about this are: