Skip to content

Commit 93c95d9

Browse files
committed
chapter on missing values
1 parent d60fb1e commit 93c95d9

File tree

3 files changed

+386
-0
lines changed

3 files changed

+386
-0
lines changed

03c-data-missings.Rmd

+386
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,386 @@
1+
---
2+
knit: bookdown::preview_chapter
3+
---
4+
5+
# Handling Missings
6+
7+
## Overview
8+
9+
- The story of missing flights
10+
- Data set overviews
11+
- Missingness patterns
12+
- Common ways to impute
13+
- Checking imputed values
14+
15+
## A story of missing flights
16+
17+
![](images/plane.jpg)
18+
19+
Do you know what a *ghost flight* is?
20+
21+
When we were were first analysing the US airline traffic data, we found that it quite often occurred that a plane (identified by its tail number) landed at one airport but took off from another airport, e.g.
22+
23+
```
24+
Year Month DayofMonth DepTime ArrTime Origin Dest Diverted
25+
1995 3 8 1102 1256 PIT CVG 0
26+
1995 3 8 1311 NA CVG PIT 1
27+
1995 3 8 1913 2050 RIC PIT 0
28+
1995 3 8 2134 2300 PIT MSY 0
29+
```
30+
31+
The plane landed in Pittsburgh but took off in Richmond.
32+
33+
This happens when a plane is needed at a location where there's a shortage. A plane flies empty and it doesn't get listed in the database. When it happens a lot, it indicates that the company is operating inefficiently.
34+
35+
An example was AirTran in March 2003. 46 flights, for a total of 31510 miles.
36+
37+
![](images/airtran.png)
38+
39+
Airtran repeatedly had financial difficulties in the following years, and finally ceased operations in 2014.
40+
41+
**Missing values are not always obvious in data.**
42+
43+
## Data set overview
44+
45+
```{r}
46+
library(tidyverse)
47+
library(visdat)
48+
library(naniar)
49+
tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>%
50+
select(country, iso3, year, new_sp_m04:new_sp_fu) %>%
51+
gather(stuff, count, new_sp_m04:new_sp_fu) %>%
52+
separate(stuff, c("stuff1", "stuff2", "genderage")) %>%
53+
select(-stuff1, -stuff2) %>%
54+
mutate(gender=substr(genderage, 1, 1),
55+
age=substr(genderage, 2, length(genderage))) %>%
56+
select(-genderage)
57+
58+
tb_au <- tb %>%
59+
filter(country == "Australia")
60+
vis_dat(tb_au)
61+
```
62+
63+
Overview of the tb incidence in Australia.
64+
65+
This type of display is called a "heatmap", displays the data table, with cells coloured according to some other information. In this case it is type of variable, and missingness status. What do we learn?
66+
67+
- Most of the variables are `character` (text) variables, or `integer` variables.
68+
- Missing values are located only in the counts, but it is in blocks, so perhaps corresponds to some category levels of the other variables.
69+
70+
## Missing value patterns
71+
72+
West Pacific Tropical Atmosphere Ocean Data, 1993 & 1997, for improved detection, understanding and prediction of El Nino and La Nina, collected from [http://www.pmel.noaa.gov/tao/index.shtml](http://www.pmel.noaa.gov/tao/index.shtml)
73+
74+
```{r}
75+
glimpse(oceanbuoys)
76+
```
77+
78+
The data has 736 observations, and 8 variables. All of the variables are numeric (including year).
79+
80+
### Missingness map
81+
82+
```{r}
83+
vis_miss(oceanbuoys, sort_miss=TRUE) + theme(aspect.ratio=1)
84+
```
85+
86+
What do we learn?
87+
88+
- Two variables, air temperature and humidity, have a large number of missings.
89+
- Year, latitude and longitude have no missings
90+
- Sea temperature has a couple of missings
91+
- Some rows have a lot of missings
92+
93+
Overall statistics, 3% of possible values are missing. That's not much. BUT, both air temperature and humidity have more than 10% missing which is too much to ignore.
94+
95+
### Numerical summaries
96+
- Proportion of observations missing:
97+
98+
```{r}
99+
s_miss <- miss_summary(oceanbuoys)
100+
s_miss$miss_df_prop
101+
```
102+
103+
- Proportion of variables missing:
104+
105+
```{r}
106+
s_miss$miss_var_prop
107+
```
108+
109+
- How many observations have $k$ missings?
110+
111+
```{r}
112+
s_miss$miss_case_table
113+
```
114+
115+
- By group
116+
117+
```{r}
118+
s_miss_group <- oceanbuoys %>%
119+
group_by(year) %>% miss_summary()
120+
s_miss_group$miss_case_table
121+
```
122+
123+
You can call these separately with:
124+
125+
- `prop_miss` - The proportion of missings
126+
- `miss_var_prop` - The proportion of variables with a missing
127+
- `miss_case_table` - The proportion of cases with $k$ missings
128+
129+
### Tend to get ignored by most software
130+
131+
and this is a problem, because results computed on data with missing values might be biased. for example, `ggplot` ignores them, but at least tells you its ignoring them:
132+
133+
```{r warning=TRUE}
134+
ggplot(oceanbuoys,
135+
aes(x = sea_temp_c,
136+
y = humidity)) +
137+
geom_point() + theme(aspect.ratio=1)
138+
```
139+
140+
### Add them to plot
141+
142+
143+
```{r}
144+
ggplot(oceanbuoys,
145+
aes(x = sea_temp_c,
146+
y = humidity)) +
147+
scale_colour_brewer(palette="Dark2") +
148+
geom_miss_point() + theme(aspect.ratio=1)
149+
```
150+
151+
By year
152+
153+
```{r}
154+
ggplot(oceanbuoys,
155+
aes(x = sea_temp_c,
156+
y = humidity)) +
157+
geom_miss_point() +
158+
scale_colour_brewer(palette="Dark2") +
159+
facet_wrap(~year) +
160+
theme(aspect.ratio=1)
161+
```
162+
163+
### Understanding missing dependencies
164+
165+
```{r}
166+
ggplot(oceanbuoys,
167+
aes(x = sea_temp_c,
168+
y = air_temp_c)) +
169+
geom_miss_point() +
170+
scale_colour_brewer(palette="Dark2") +
171+
facet_wrap(~year) +
172+
theme(aspect.ratio=1)
173+
```
174+
175+
What do we learn?
176+
177+
- There is a different missingness pattern in each of the years
178+
- Year needs to be accounted for in finding good substitute values.
179+
180+
## Shadow matrix
181+
182+
The shadow table is created to mark the location of missings in the original data table. Here, this gets appended to the original data, and then can be used to submet, or colour by different types of missingness.
183+
184+
```{r}
185+
as_shadow(oceanbuoys)
186+
tao_shadow <- bind_shadow(oceanbuoys)
187+
tao_shadow
188+
glimpse(tao_shadow)
189+
```
190+
191+
### Relationship with other variables
192+
193+
```{r}
194+
ggplot(data = tao_shadow,
195+
aes(x = wind_ew,
196+
y=wind_ns,
197+
colour=air_temp_c_NA)) +
198+
scale_colour_brewer(palette="Dark2") +
199+
geom_point(alpha=0.7) + theme(aspect.ratio=1)
200+
```
201+
202+
What do we learn?
203+
204+
- The lowest values of east-west winds have no missing values. Maybe it is less likely to have air temperature missing values when there are light east-west winds?
205+
206+
### Two minute challenge
207+
208+
- Generate the shadow matrix and make a plot of the winds, coloured by missingness on humidity.
209+
210+
```{r}
211+
ggplot(data = tao_shadow,
212+
aes(x = wind_ew,
213+
y=wind_ns,
214+
colour=humidity_NA)) +
215+
scale_colour_brewer(palette="Dark2") +
216+
geom_point(alpha=0.7) + theme(aspect.ratio=1)
217+
```
218+
219+
## Strategy
220+
221+
- An small fraction of cases have several missings, drop the cases
222+
- A variable or two, out of many, have a lot of missings, drop the variables
223+
- If missings are small in number, but located in many cases and variables, you need to impute these values, to do most analyses
224+
- Designing the imputation should take into account dependencies that you have seen between missingness and existing variables.
225+
- For the ocean buoys data this means imputation needs to be done separately by year
226+
227+
## Common ways to impute values
228+
229+
- Simple parametric: use the mean or median of the complete cases for each variable
230+
- Simple non-parametric: find the $k$ nearest neighbours with a complete value and average these
231+
- Multiple imputation: Use a statistical distribution, e.g. normal model and simulate a value (or set of values, hot deck imputation) for the missings
232+
233+
### Mean
234+
235+
ignoring year.
236+
237+
```{r}
238+
tao_shadow <- tao_shadow %>%
239+
impute_mean_at(vars(sea_temp_c,
240+
air_temp_c))
241+
242+
ggplot(tao_shadow,
243+
aes(x = sea_temp_c,
244+
y = air_temp_c,
245+
colour = air_temp_c_NA)) +
246+
geom_point(alpha = 0.7) +
247+
facet_wrap(~year) +
248+
scale_colour_brewer(palette = "Dark2") +
249+
theme(aspect.ratio = 1)
250+
```
251+
252+
What do we learn?
253+
254+
- Oh, this is so wrong!
255+
- The imputed values are nothing like the complete case values
256+
257+
### by year
258+
259+
```{r}
260+
tao_shadow <- bind_shadow(oceanbuoys)
261+
tao_shadow <- tao_shadow %>%
262+
group_by(year) %>%
263+
impute_mean_at(vars(sea_temp_c,
264+
air_temp_c))
265+
266+
ggplot(tao_shadow,
267+
aes(x = sea_temp_c,
268+
y = air_temp_c,
269+
colour=air_temp_c_NA)) +
270+
geom_point(alpha=0.7) +
271+
facet_wrap(~year) +
272+
scale_colour_brewer(palette="Dark2") +
273+
theme(aspect.ratio=1)
274+
```
275+
276+
277+
What do we learn?
278+
279+
- The imputed values are closer to the complete case values
280+
- However, they form a rigid line, mismatching the variation
281+
- and they extend outside the range of complete values. This is a problem in that the imputed air temperature value for these high sea temperature cases is lower than we would expect, and thus possibly impeding good model fitting
282+
283+
### Two minute challenge
284+
285+
- Change the code to plot sea temperature against humidity, with colour representing missing humidity values. What do you learn about the imputations?
286+
287+
```{r}
288+
tao_shadow <- bind_shadow(oceanbuoys)
289+
tao_shadow <- tao_shadow %>%
290+
group_by(year) %>%
291+
impute_mean_at(vars(sea_temp_c,
292+
air_temp_c,
293+
humidity))
294+
ggplot(tao_shadow,
295+
aes(x = sea_temp_c,
296+
y = humidity,
297+
colour=humidity_NA)) +
298+
geom_point(alpha=0.7) +
299+
facet_wrap(~year) +
300+
scale_colour_brewer(palette="Dark2") +
301+
theme(aspect.ratio=1)
302+
```
303+
304+
### Nearest neighbors imputation
305+
306+
The package `impute` from Bioconductor has a nice nearest neighbor imputation implementation.
307+
308+
```{r eval=FALSE}
309+
#source("https://bioconductor.org/biocLite.R")
310+
#biocLite("impute")
311+
library(impute)
312+
tao_impute <- bind_shadow(oceanbuoys) %>%
313+
arrange(year, sea_temp_c, air_temp_c) %>%
314+
select(sea_temp_c, air_temp_c)
315+
tao_impute <- impute.knn(as.matrix(tao_impute), 5)
316+
tao_shadow <- bind_shadow(oceanbuoys) %>%
317+
arrange(year, sea_temp_c, air_temp_c) %>%
318+
mutate(sea_temp_c = tao_impute$data[,1],
319+
air_temp_c = tao_impute$data[,2])
320+
ggplot(tao_shadow,
321+
aes(x = sea_temp_c,
322+
y = air_temp_c,
323+
colour=air_temp_c_NA)) +
324+
geom_point(alpha=0.7) +
325+
facet_wrap(~year) +
326+
scale_colour_brewer(palette="Dark2") +
327+
theme(aspect.ratio=1)
328+
```
329+
330+
### by year
331+
332+
```{r eval=FALSE}
333+
library(impute)
334+
tao_impute_93 <- bind_shadow(oceanbuoys) %>%
335+
arrange(year, sea_temp_c, air_temp_c) %>%
336+
filter(year=="1993") %>%
337+
select(sea_temp_c, air_temp_c)
338+
tao_impute_93 <- impute.knn(as.matrix(tao_impute_93), 5)
339+
tao_impute_97 <- bind_shadow(oceanbuoys) %>%
340+
arrange(year, sea_temp_c, air_temp_c) %>%
341+
filter(year=="1997") %>%
342+
select(sea_temp_c, air_temp_c)
343+
tao_impute_97 <- impute.knn(as.matrix(tao_impute_97), 5)
344+
tao_impute <- rbind(tao_impute_93$data, tao_impute_97$data)
345+
tao_shadow <- bind_shadow(oceanbuoys) %>%
346+
arrange(year, sea_temp_c, air_temp_c) %>%
347+
mutate(sea_temp_c = tao_impute[,1],
348+
air_temp_c = tao_impute[,2])
349+
ggplot(tao_shadow,
350+
aes(x = sea_temp_c,
351+
y = air_temp_c,
352+
colour=air_temp_c_NA)) +
353+
geom_point(alpha=0.5) +
354+
facet_wrap(~year) +
355+
scale_colour_brewer(palette="Dark2") +
356+
theme(aspect.ratio=1)
357+
```
358+
359+
### Two minute challenge
360+
361+
- Change the code to plot sea temperature against humidity, with colour representing missing humidity values. What do you learn about the imputations?
362+
363+
## Exercise
364+
365+
We are going to examine Melbourne house prices, using a data set available at https://www.kaggle.com/anthonypino/melbourne-housing-market). It was constructed by scraping the auction reports over several years, and adding more information on property details by scraping the web pages on the properties at domain.com.
366+
367+
1. Make an overview plot of the full data. Which variables have missings?
368+
2. Focus ONLY on the variables Price, Rooms, Type, Distance, Bedroom2, Bathroom. Remove the observations that have missing values on Price, because this is the response variable that we want to ultimately predict. We can't build a stable model for price if we don't know the price.
369+
3. Make a missing values summary of all the data - what proportion of observations are missing on Price?
370+
4. Make the shadow matrix, and plot Bathroom vs Bedroom2 coloured by missingness on Bedroom2. Why don't any missing values show up?
371+
5. Impute the missing values for Bedroom2 and Bathroom, by using mean imputation. Make a plot of the two variables, with the imputed values coloured. Describe the pattern that you see.
372+
6. Impute the missing values for Bedroom2 and Bathroom, using a linear model on th variable Rooms. Make a plot of the two variables, with the imputed values coloured. Is this better or worse than the mean value imputed values? Explain your thinking. (You can use the code below to help do this type of imputation. )
373+
374+
```{r echo=TRUE, eval=FALSE}
375+
houses_sub <- houses %>% select(Price, Rooms, Type, Distance, Bedroom2, Bathroom)
376+
houses_sub <- houses_sub %>% bind_shadow()
377+
br2 <- lm(Bedroom2~Rooms, data=houses_sub)
378+
ba <- lm(Bathroom~Rooms, data=houses_sub)
379+
houses_sub <- houses_sub %>%
380+
mutate(Bedroom2=ifelse(is.na(Bedroom2), predict(br2, new=houses_sub), Bedroom2),
381+
Bathroom=ifelse(is.na(Bathroom), predict(br2, new=houses_sub), Bathroom))
382+
ggplot(houses_sub, aes(x=Bedroom2, y=Bathroom,
383+
colour=Bedroom2_NA)) +
384+
geom_point(alpha=0.2)
385+
```
386+

images/airtran.png

504 KB
Loading

images/plane.jpg

17.2 KB
Loading

0 commit comments

Comments
 (0)