|
| 1 | +--- |
| 2 | +knit: bookdown::preview_chapter |
| 3 | +--- |
| 4 | + |
| 5 | +# Handling Missings |
| 6 | + |
| 7 | +## Overview |
| 8 | + |
| 9 | +- The story of missing flights |
| 10 | +- Data set overviews |
| 11 | +- Missingness patterns |
| 12 | +- Common ways to impute |
| 13 | +- Checking imputed values |
| 14 | + |
| 15 | +## A story of missing flights |
| 16 | + |
| 17 | + |
| 18 | + |
| 19 | +Do you know what a *ghost flight* is? |
| 20 | + |
| 21 | +When we were were first analysing the US airline traffic data, we found that it quite often occurred that a plane (identified by its tail number) landed at one airport but took off from another airport, e.g. |
| 22 | + |
| 23 | +``` |
| 24 | +Year Month DayofMonth DepTime ArrTime Origin Dest Diverted |
| 25 | +1995 3 8 1102 1256 PIT CVG 0 |
| 26 | +1995 3 8 1311 NA CVG PIT 1 |
| 27 | +1995 3 8 1913 2050 RIC PIT 0 |
| 28 | +1995 3 8 2134 2300 PIT MSY 0 |
| 29 | +``` |
| 30 | + |
| 31 | +The plane landed in Pittsburgh but took off in Richmond. |
| 32 | + |
| 33 | +This happens when a plane is needed at a location where there's a shortage. A plane flies empty and it doesn't get listed in the database. When it happens a lot, it indicates that the company is operating inefficiently. |
| 34 | + |
| 35 | +An example was AirTran in March 2003. 46 flights, for a total of 31510 miles. |
| 36 | + |
| 37 | + |
| 38 | + |
| 39 | +Airtran repeatedly had financial difficulties in the following years, and finally ceased operations in 2014. |
| 40 | + |
| 41 | +**Missing values are not always obvious in data.** |
| 42 | + |
| 43 | +## Data set overview |
| 44 | + |
| 45 | +```{r} |
| 46 | +library(tidyverse) |
| 47 | +library(visdat) |
| 48 | +library(naniar) |
| 49 | +tb <- read_csv("data/TB_notifications_2018-03-18.csv") %>% |
| 50 | + select(country, iso3, year, new_sp_m04:new_sp_fu) %>% |
| 51 | + gather(stuff, count, new_sp_m04:new_sp_fu) %>% |
| 52 | + separate(stuff, c("stuff1", "stuff2", "genderage")) %>% |
| 53 | + select(-stuff1, -stuff2) %>% |
| 54 | + mutate(gender=substr(genderage, 1, 1), |
| 55 | + age=substr(genderage, 2, length(genderage))) %>% |
| 56 | + select(-genderage) |
| 57 | +
|
| 58 | +tb_au <- tb %>% |
| 59 | + filter(country == "Australia") |
| 60 | +vis_dat(tb_au) |
| 61 | +``` |
| 62 | + |
| 63 | +Overview of the tb incidence in Australia. |
| 64 | + |
| 65 | +This type of display is called a "heatmap", displays the data table, with cells coloured according to some other information. In this case it is type of variable, and missingness status. What do we learn? |
| 66 | + |
| 67 | +- Most of the variables are `character` (text) variables, or `integer` variables. |
| 68 | +- Missing values are located only in the counts, but it is in blocks, so perhaps corresponds to some category levels of the other variables. |
| 69 | + |
| 70 | +## Missing value patterns |
| 71 | + |
| 72 | +West Pacific Tropical Atmosphere Ocean Data, 1993 & 1997, for improved detection, understanding and prediction of El Nino and La Nina, collected from [http://www.pmel.noaa.gov/tao/index.shtml](http://www.pmel.noaa.gov/tao/index.shtml) |
| 73 | + |
| 74 | +```{r} |
| 75 | +glimpse(oceanbuoys) |
| 76 | +``` |
| 77 | + |
| 78 | +The data has 736 observations, and 8 variables. All of the variables are numeric (including year). |
| 79 | + |
| 80 | +### Missingness map |
| 81 | + |
| 82 | +```{r} |
| 83 | +vis_miss(oceanbuoys, sort_miss=TRUE) + theme(aspect.ratio=1) |
| 84 | +``` |
| 85 | + |
| 86 | +What do we learn? |
| 87 | + |
| 88 | +- Two variables, air temperature and humidity, have a large number of missings. |
| 89 | +- Year, latitude and longitude have no missings |
| 90 | +- Sea temperature has a couple of missings |
| 91 | +- Some rows have a lot of missings |
| 92 | + |
| 93 | +Overall statistics, 3% of possible values are missing. That's not much. BUT, both air temperature and humidity have more than 10% missing which is too much to ignore. |
| 94 | + |
| 95 | +### Numerical summaries |
| 96 | +- Proportion of observations missing: |
| 97 | + |
| 98 | +```{r} |
| 99 | +s_miss <- miss_summary(oceanbuoys) |
| 100 | +s_miss$miss_df_prop |
| 101 | +``` |
| 102 | + |
| 103 | +- Proportion of variables missing: |
| 104 | + |
| 105 | +```{r} |
| 106 | +s_miss$miss_var_prop |
| 107 | +``` |
| 108 | + |
| 109 | +- How many observations have $k$ missings? |
| 110 | + |
| 111 | +```{r} |
| 112 | +s_miss$miss_case_table |
| 113 | +``` |
| 114 | + |
| 115 | +- By group |
| 116 | + |
| 117 | +```{r} |
| 118 | +s_miss_group <- oceanbuoys %>% |
| 119 | + group_by(year) %>% miss_summary() |
| 120 | +s_miss_group$miss_case_table |
| 121 | +``` |
| 122 | + |
| 123 | +You can call these separately with: |
| 124 | + |
| 125 | +- `prop_miss` - The proportion of missings |
| 126 | +- `miss_var_prop` - The proportion of variables with a missing |
| 127 | +- `miss_case_table` - The proportion of cases with $k$ missings |
| 128 | + |
| 129 | +### Tend to get ignored by most software |
| 130 | + |
| 131 | +and this is a problem, because results computed on data with missing values might be biased. for example, `ggplot` ignores them, but at least tells you its ignoring them: |
| 132 | + |
| 133 | +```{r warning=TRUE} |
| 134 | +ggplot(oceanbuoys, |
| 135 | + aes(x = sea_temp_c, |
| 136 | + y = humidity)) + |
| 137 | + geom_point() + theme(aspect.ratio=1) |
| 138 | +``` |
| 139 | + |
| 140 | +### Add them to plot |
| 141 | + |
| 142 | + |
| 143 | +```{r} |
| 144 | +ggplot(oceanbuoys, |
| 145 | + aes(x = sea_temp_c, |
| 146 | + y = humidity)) + |
| 147 | + scale_colour_brewer(palette="Dark2") + |
| 148 | + geom_miss_point() + theme(aspect.ratio=1) |
| 149 | +``` |
| 150 | + |
| 151 | +By year |
| 152 | + |
| 153 | +```{r} |
| 154 | +ggplot(oceanbuoys, |
| 155 | + aes(x = sea_temp_c, |
| 156 | + y = humidity)) + |
| 157 | + geom_miss_point() + |
| 158 | + scale_colour_brewer(palette="Dark2") + |
| 159 | + facet_wrap(~year) + |
| 160 | + theme(aspect.ratio=1) |
| 161 | +``` |
| 162 | + |
| 163 | +### Understanding missing dependencies |
| 164 | + |
| 165 | +```{r} |
| 166 | +ggplot(oceanbuoys, |
| 167 | + aes(x = sea_temp_c, |
| 168 | + y = air_temp_c)) + |
| 169 | + geom_miss_point() + |
| 170 | + scale_colour_brewer(palette="Dark2") + |
| 171 | + facet_wrap(~year) + |
| 172 | + theme(aspect.ratio=1) |
| 173 | +``` |
| 174 | + |
| 175 | +What do we learn? |
| 176 | + |
| 177 | +- There is a different missingness pattern in each of the years |
| 178 | +- Year needs to be accounted for in finding good substitute values. |
| 179 | + |
| 180 | +## Shadow matrix |
| 181 | + |
| 182 | +The shadow table is created to mark the location of missings in the original data table. Here, this gets appended to the original data, and then can be used to submet, or colour by different types of missingness. |
| 183 | + |
| 184 | +```{r} |
| 185 | +as_shadow(oceanbuoys) |
| 186 | +tao_shadow <- bind_shadow(oceanbuoys) |
| 187 | +tao_shadow |
| 188 | +glimpse(tao_shadow) |
| 189 | +``` |
| 190 | + |
| 191 | +### Relationship with other variables |
| 192 | + |
| 193 | +```{r} |
| 194 | +ggplot(data = tao_shadow, |
| 195 | + aes(x = wind_ew, |
| 196 | + y=wind_ns, |
| 197 | + colour=air_temp_c_NA)) + |
| 198 | + scale_colour_brewer(palette="Dark2") + |
| 199 | + geom_point(alpha=0.7) + theme(aspect.ratio=1) |
| 200 | +``` |
| 201 | + |
| 202 | +What do we learn? |
| 203 | + |
| 204 | +- The lowest values of east-west winds have no missing values. Maybe it is less likely to have air temperature missing values when there are light east-west winds? |
| 205 | + |
| 206 | +### Two minute challenge |
| 207 | + |
| 208 | +- Generate the shadow matrix and make a plot of the winds, coloured by missingness on humidity. |
| 209 | + |
| 210 | +```{r} |
| 211 | +ggplot(data = tao_shadow, |
| 212 | + aes(x = wind_ew, |
| 213 | + y=wind_ns, |
| 214 | + colour=humidity_NA)) + |
| 215 | + scale_colour_brewer(palette="Dark2") + |
| 216 | + geom_point(alpha=0.7) + theme(aspect.ratio=1) |
| 217 | +``` |
| 218 | + |
| 219 | +## Strategy |
| 220 | + |
| 221 | +- An small fraction of cases have several missings, drop the cases |
| 222 | +- A variable or two, out of many, have a lot of missings, drop the variables |
| 223 | +- If missings are small in number, but located in many cases and variables, you need to impute these values, to do most analyses |
| 224 | +- Designing the imputation should take into account dependencies that you have seen between missingness and existing variables. |
| 225 | +- For the ocean buoys data this means imputation needs to be done separately by year |
| 226 | + |
| 227 | +## Common ways to impute values |
| 228 | + |
| 229 | +- Simple parametric: use the mean or median of the complete cases for each variable |
| 230 | +- Simple non-parametric: find the $k$ nearest neighbours with a complete value and average these |
| 231 | +- Multiple imputation: Use a statistical distribution, e.g. normal model and simulate a value (or set of values, hot deck imputation) for the missings |
| 232 | + |
| 233 | +### Mean |
| 234 | + |
| 235 | +ignoring year. |
| 236 | + |
| 237 | +```{r} |
| 238 | +tao_shadow <- tao_shadow %>% |
| 239 | + impute_mean_at(vars(sea_temp_c, |
| 240 | + air_temp_c)) |
| 241 | +
|
| 242 | +ggplot(tao_shadow, |
| 243 | + aes(x = sea_temp_c, |
| 244 | + y = air_temp_c, |
| 245 | + colour = air_temp_c_NA)) + |
| 246 | + geom_point(alpha = 0.7) + |
| 247 | + facet_wrap(~year) + |
| 248 | + scale_colour_brewer(palette = "Dark2") + |
| 249 | + theme(aspect.ratio = 1) |
| 250 | +``` |
| 251 | + |
| 252 | +What do we learn? |
| 253 | + |
| 254 | +- Oh, this is so wrong! |
| 255 | +- The imputed values are nothing like the complete case values |
| 256 | + |
| 257 | +### by year |
| 258 | + |
| 259 | +```{r} |
| 260 | +tao_shadow <- bind_shadow(oceanbuoys) |
| 261 | +tao_shadow <- tao_shadow %>% |
| 262 | + group_by(year) %>% |
| 263 | + impute_mean_at(vars(sea_temp_c, |
| 264 | + air_temp_c)) |
| 265 | +
|
| 266 | +ggplot(tao_shadow, |
| 267 | + aes(x = sea_temp_c, |
| 268 | + y = air_temp_c, |
| 269 | + colour=air_temp_c_NA)) + |
| 270 | + geom_point(alpha=0.7) + |
| 271 | + facet_wrap(~year) + |
| 272 | + scale_colour_brewer(palette="Dark2") + |
| 273 | + theme(aspect.ratio=1) |
| 274 | +``` |
| 275 | + |
| 276 | + |
| 277 | +What do we learn? |
| 278 | + |
| 279 | +- The imputed values are closer to the complete case values |
| 280 | +- However, they form a rigid line, mismatching the variation |
| 281 | +- and they extend outside the range of complete values. This is a problem in that the imputed air temperature value for these high sea temperature cases is lower than we would expect, and thus possibly impeding good model fitting |
| 282 | + |
| 283 | +### Two minute challenge |
| 284 | + |
| 285 | +- Change the code to plot sea temperature against humidity, with colour representing missing humidity values. What do you learn about the imputations? |
| 286 | + |
| 287 | +```{r} |
| 288 | +tao_shadow <- bind_shadow(oceanbuoys) |
| 289 | +tao_shadow <- tao_shadow %>% |
| 290 | + group_by(year) %>% |
| 291 | + impute_mean_at(vars(sea_temp_c, |
| 292 | + air_temp_c, |
| 293 | + humidity)) |
| 294 | +ggplot(tao_shadow, |
| 295 | + aes(x = sea_temp_c, |
| 296 | + y = humidity, |
| 297 | + colour=humidity_NA)) + |
| 298 | + geom_point(alpha=0.7) + |
| 299 | + facet_wrap(~year) + |
| 300 | + scale_colour_brewer(palette="Dark2") + |
| 301 | + theme(aspect.ratio=1) |
| 302 | +``` |
| 303 | + |
| 304 | +### Nearest neighbors imputation |
| 305 | + |
| 306 | +The package `impute` from Bioconductor has a nice nearest neighbor imputation implementation. |
| 307 | + |
| 308 | +```{r eval=FALSE} |
| 309 | +#source("https://bioconductor.org/biocLite.R") |
| 310 | +#biocLite("impute") |
| 311 | +library(impute) |
| 312 | +tao_impute <- bind_shadow(oceanbuoys) %>% |
| 313 | + arrange(year, sea_temp_c, air_temp_c) %>% |
| 314 | + select(sea_temp_c, air_temp_c) |
| 315 | +tao_impute <- impute.knn(as.matrix(tao_impute), 5) |
| 316 | +tao_shadow <- bind_shadow(oceanbuoys) %>% |
| 317 | + arrange(year, sea_temp_c, air_temp_c) %>% |
| 318 | + mutate(sea_temp_c = tao_impute$data[,1], |
| 319 | + air_temp_c = tao_impute$data[,2]) |
| 320 | +ggplot(tao_shadow, |
| 321 | + aes(x = sea_temp_c, |
| 322 | + y = air_temp_c, |
| 323 | + colour=air_temp_c_NA)) + |
| 324 | + geom_point(alpha=0.7) + |
| 325 | + facet_wrap(~year) + |
| 326 | + scale_colour_brewer(palette="Dark2") + |
| 327 | + theme(aspect.ratio=1) |
| 328 | +``` |
| 329 | + |
| 330 | +### by year |
| 331 | + |
| 332 | +```{r eval=FALSE} |
| 333 | +library(impute) |
| 334 | +tao_impute_93 <- bind_shadow(oceanbuoys) %>% |
| 335 | + arrange(year, sea_temp_c, air_temp_c) %>% |
| 336 | + filter(year=="1993") %>% |
| 337 | + select(sea_temp_c, air_temp_c) |
| 338 | +tao_impute_93 <- impute.knn(as.matrix(tao_impute_93), 5) |
| 339 | +tao_impute_97 <- bind_shadow(oceanbuoys) %>% |
| 340 | + arrange(year, sea_temp_c, air_temp_c) %>% |
| 341 | + filter(year=="1997") %>% |
| 342 | + select(sea_temp_c, air_temp_c) |
| 343 | +tao_impute_97 <- impute.knn(as.matrix(tao_impute_97), 5) |
| 344 | +tao_impute <- rbind(tao_impute_93$data, tao_impute_97$data) |
| 345 | +tao_shadow <- bind_shadow(oceanbuoys) %>% |
| 346 | + arrange(year, sea_temp_c, air_temp_c) %>% |
| 347 | + mutate(sea_temp_c = tao_impute[,1], |
| 348 | + air_temp_c = tao_impute[,2]) |
| 349 | +ggplot(tao_shadow, |
| 350 | + aes(x = sea_temp_c, |
| 351 | + y = air_temp_c, |
| 352 | + colour=air_temp_c_NA)) + |
| 353 | + geom_point(alpha=0.5) + |
| 354 | + facet_wrap(~year) + |
| 355 | + scale_colour_brewer(palette="Dark2") + |
| 356 | + theme(aspect.ratio=1) |
| 357 | +``` |
| 358 | + |
| 359 | +### Two minute challenge |
| 360 | + |
| 361 | +- Change the code to plot sea temperature against humidity, with colour representing missing humidity values. What do you learn about the imputations? |
| 362 | + |
| 363 | +## Exercise |
| 364 | + |
| 365 | +We are going to examine Melbourne house prices, using a data set available at https://www.kaggle.com/anthonypino/melbourne-housing-market). It was constructed by scraping the auction reports over several years, and adding more information on property details by scraping the web pages on the properties at domain.com. |
| 366 | + |
| 367 | +1. Make an overview plot of the full data. Which variables have missings? |
| 368 | +2. Focus ONLY on the variables Price, Rooms, Type, Distance, Bedroom2, Bathroom. Remove the observations that have missing values on Price, because this is the response variable that we want to ultimately predict. We can't build a stable model for price if we don't know the price. |
| 369 | +3. Make a missing values summary of all the data - what proportion of observations are missing on Price? |
| 370 | +4. Make the shadow matrix, and plot Bathroom vs Bedroom2 coloured by missingness on Bedroom2. Why don't any missing values show up? |
| 371 | +5. Impute the missing values for Bedroom2 and Bathroom, by using mean imputation. Make a plot of the two variables, with the imputed values coloured. Describe the pattern that you see. |
| 372 | +6. Impute the missing values for Bedroom2 and Bathroom, using a linear model on th variable Rooms. Make a plot of the two variables, with the imputed values coloured. Is this better or worse than the mean value imputed values? Explain your thinking. (You can use the code below to help do this type of imputation. ) |
| 373 | + |
| 374 | +```{r echo=TRUE, eval=FALSE} |
| 375 | +houses_sub <- houses %>% select(Price, Rooms, Type, Distance, Bedroom2, Bathroom) |
| 376 | +houses_sub <- houses_sub %>% bind_shadow() |
| 377 | +br2 <- lm(Bedroom2~Rooms, data=houses_sub) |
| 378 | +ba <- lm(Bathroom~Rooms, data=houses_sub) |
| 379 | +houses_sub <- houses_sub %>% |
| 380 | + mutate(Bedroom2=ifelse(is.na(Bedroom2), predict(br2, new=houses_sub), Bedroom2), |
| 381 | + Bathroom=ifelse(is.na(Bathroom), predict(br2, new=houses_sub), Bathroom)) |
| 382 | +ggplot(houses_sub, aes(x=Bedroom2, y=Bathroom, |
| 383 | + colour=Bedroom2_NA)) + |
| 384 | + geom_point(alpha=0.2) |
| 385 | +``` |
| 386 | + |
0 commit comments