
Commit be255b8

committed
added more explanation at the start of the wrangling slides
1 parent d94c6eb commit be255b8

2 files changed: +318 -9 lines changed

2.2-wrangling/index.Rmd

+131 -3
@@ -43,9 +43,137 @@ library(gridExtra)
4343
- dplyr package: motivation, functions, chaining
4444
- purrr and broom: working with lists, vectors of data frames
4545

46-
#Working with lots of models
46+
## dplyr verbs
4747

48-
## Why would we even do that???
48+
There are five primary dplyr **verbs**, representing distinct data analysis tasks:
49+
50+
- Filter: Keep only the rows of a data frame that meet a condition, producing subsets
51+
- Arrange: Reorder the rows of a data frame
52+
- Select: Select particular columns of a data frame
53+
- Mutate: Add new columns that are functions of existing columns
54+
- Summarise: Create collapsed summaries of a data frame
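
Of the five verbs listed here, Mutate is perhaps the easiest to show on a tiny example. The following is a minimal sketch, assuming dplyr is installed; the data frame `toy` and the new column `total` are hypothetical names introduced only for illustration and are not part of the french fries data.

```r
library(dplyr)

# Hypothetical toy data: two rating columns for three subjects
toy <- data.frame(
  subject = c(1, 2, 3),
  potato  = c(2.9, 14.0, 11.0),
  buttery = c(0.0, 0.0, 6.4)
)

# mutate() keeps every existing column and adds a new one
# computed from the existing columns
toy_total <- toy %>%
  mutate(total = potato + buttery)

toy_total
```

The original columns are left untouched, so the result has one more column than the input.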
55+
56+
57+
## Filter
58+
59+
```{r}
60+
data(french_fries, package = "reshape2")
61+
french_fries %>%
62+
filter(subject == 3, time == 1)
63+
```
64+
65+
## Arrange
66+
67+
```{r}
68+
french_fries %>%
69+
arrange(desc(rancid)) %>%
70+
head
71+
```
72+
73+
## Select
74+
75+
```{r}
76+
french_fries %>%
77+
select(time, treatment, subject, rep, potato) %>%
78+
head
79+
```
80+
81+
## Summarise
82+
83+
```{r}
84+
french_fries %>%
85+
group_by(time, treatment) %>%
86+
summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
87+
```
88+
89+
## Let's use these tools
90+
91+
to answer these french fry experiment questions:
92+
93+
- Is the design complete?
94+
- Are replicates like each other?
95+
- How do the ratings on the different scales differ?
96+
- Are raters giving different scores on average?
97+
- Do ratings change over the weeks?
98+
99+
## Completeness
100+
If the data is complete, it should have 12 x 10 x 3 x 2 = 720 records, that is, 6 records for each person at each time point. (This assumes that each person rated on all scales.)
101+
102+
To check this, we want to tabulate the number of records for each subject, time and treatment: select the appropriate columns, count the records, and spread the counts out to give a compact table.
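
The expected counts follow directly from the design. The sketch below works through that arithmetic, assuming the 12 x 10 x 3 x 2 figures stand for 12 subjects, 10 weeks, 3 treatments and 2 replicates. That labelling is an assumption, and `design`, `expected` and `per_subject_week` are hypothetical names used only here.

```r
# Design dimensions taken from the text (assumed labels, not read from the data)
design <- c(subjects = 12, weeks = 10, treatments = 3, reps = 2)

# Total number of records in a complete design
expected <- prod(design)

# Records per subject per week: treatments x reps
per_subject_week <- design[["treatments"]] * design[["reps"]]

expected          # 720
per_subject_week  # 6
```

Comparing `expected` with the number of rows actually present in the data is a quick first check before building the full table.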
103+
104+
##
105+
106+
```{r}
107+
french_fries %>%
108+
select(subject, time, treatment) %>%
109+
tbl_df() %>%
110+
count(subject, time) %>%
111+
spread(time, n)
112+
```
113+
114+
## Check completeness with different scales, too
115+
116+
```{r}
117+
french_fries %>%
118+
gather(type, rating, -subject, -time, -treatment, -rep) %>%
119+
select(subject, time, treatment, type) %>%
120+
tbl_df() %>%
121+
count(subject, time) %>%
122+
spread(time, n)
123+
```
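
The `gather()` call in the chunk above reshapes the rating columns from wide to long form, so that each rating scale becomes a value of `type`. This is a minimal sketch of that reshaping on a toy data frame, assuming tidyr and dplyr are installed; `wide` and `long` are hypothetical names and the values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per rating scale
wide <- data.frame(
  subject = c(1, 2),
  potato  = c(2.9, 14.0),
  rancid  = c(5.5, 1.1)
)

# gather() stacks every column except subject into
# a key column (type) and a value column (rating)
long <- wide %>%
  gather(type, rating, -subject)

long
```

Each subject now contributes one row per rating scale, which is why the counts in the completeness table grow when the scales are included.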
124+
125+
## Change in ratings over weeks, relative to experimental design
126+
127+
```{r fig.show='hide'}
128+
ff.m <- french_fries %>%
129+
gather(type, rating, -subject, -time, -treatment, -rep)
130+
ggplot(data=ff.m, aes(x=time, y=rating, colour=treatment)) +
131+
geom_point() +
132+
facet_grid(subject~type)
133+
```
134+
135+
##
136+
137+
```{r echo=FALSE, fig.width=10, fig.height=6}
138+
ggplot(data=ff.m, aes(x=time, y=rating, colour=treatment)) +
139+
geom_point() +
140+
facet_grid(subject~type)
141+
```
142+
143+
## Add means over reps, and connect the dots
144+
145+
```{r fig.show='hide'}
146+
ff.m.av <- ff.m %>%
147+
group_by(subject, time, type, treatment) %>%
148+
summarise(rating=mean(rating))
149+
ggplot(data=ff.m, aes(x=time, y=rating, colour=treatment)) +
150+
facet_grid(subject~type) +
151+
geom_line(data=ff.m.av, aes(group=treatment))
152+
```
153+
154+
##
155+
156+
```{r echo=FALSE, fig.width=10, fig.height=6}
157+
ggplot(data=ff.m, aes(x=time, y=rating, colour=treatment)) +
158+
facet_grid(subject~type) +
159+
geom_line(data=ff.m.av, aes(group=treatment))
160+
```
161+
162+
## Your turn
163+
164+
![](lorikeets.png)
165+
166+
Write an answer to each of the questions:
167+
168+
- Is the design complete?
169+
- Are replicates like each other?
170+
- How do the ratings on the different scales differ?
171+
- Are raters giving different scores on average?
172+
- Do ratings change over the weeks?
173+
174+
## Working with lots of models
175+
176+
Why would we even do that???
49177

50178
- Hans Rosling can explain that really well in his [TED talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en)
51179

@@ -223,7 +351,7 @@ qplot(year+1950, lifeExp, data=subset(country_all, between(r.squared, 0.45, 0.7
223351

224352
## Your turn
225353

226-
![](rainbow-lorikeet.png)
354+
![](lorikeets.png)
227355

228356
- extract residuals for each of the models and store it in a dataset together with country and continent information
229357

2.2-wrangling/index.html

+187 -6
Large diffs are not rendered by default.
