linearRegressionAnalysis/fav3.rmd at master · jmatias/linearRegressionAnalysis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
---
title: "Final Project"
author: "Hari Manan (NetID: nfnh2), Javier Matias (NetID: javierm4), Francisco Vargas (NetID: fav3)"
date: 'August 4th, 2018'
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---

## Students names

- Hari Manan (NetID: nfnh2)
- Javier Matias (NetID: javierm4)
- Francisco Vargas (NetID: fav3)

## Title

Prediction of House Prices in Ames, Iowa Using Regression Analysis Techniques

## Introduction

The main goal of this project is to develop a model to predict the sale price of a house using a dataset gathered on Ames, Iowa related to houses sold between 2006 and 2010. The model not only should be good at predicting the price, but also avoid using all the features provided in order to reduce the complexity of explaining it. In other words, we are aiming to get a model with a balance between prediction and explanation.

There are also practical questions that will be answered which can help a seller or buyer to understand specific characteristics that are important when determining the price of a house. We decided to also work on this special practical questions to explore other techniques seen in the course and make our analysis even more interesting (for us and for the reader).

The dataset originally contains 2930 observations with 81 features (23 nominal/categorical, 23 ordinal, 14 discrete, and 21 continuous).

The features can be roughly divided into 6 categories. We include some examples of each category:

- **House characteristics**: e.g. number of bedrooms, number of bathrooms, electrical system
- **Space measurements**: e.g. lot area, living area
- **Quality ratings**: e.g. overall condition, kitchen condition
- **Construction characteristics and materials**: e.g. roof material
- **Zone and location classification**: e.g. zoning classification, neighborhood
- **Terms of sale**: e.g. month and year sold, sale price

This dataset was compiled by Dean DeCock in 2011 to be used for Data Science education and research. It was an effort towards expanding a famous dataset about the housing market in Boston by adding variables to those already provided in the Boston Housing dataset.

The link to the original paper can be found here: https://ww2.amstat.org/publications/jse/v19n3/decock.pdf

The dataset is also openly available here: http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls

There are several reasons why the team decided to use this dataset and why this is interesting to us. Most of them are related to research and personal motivations and are enumerated below:

1. **Model simplification**: This dataset offers many variables. Even though accuracy tends to improve as the number of predictors increases, larger models also increase complexity and become more difficult to interpret. Complex models are also highly succeptible to overfitting. The goal of this project is to derive a model using a subset of the predictor variables while achieving a comparable level of performance.
2. **Cleanness**: The dataset is well curated and does not require extensive cleansing. This will allow the team to focus on analyzing the data itself.
3. **Questions**: From a personal interest perspective, this dataset is related to a frequent activity: buying a house. This is a well-known activity that allows for some interesting questions:
    - **How effectively can we predict the price of a house given a set of explanatory variables?** A test/training set evaluation process can help us measure the accuracy under acceptable error ranges. This could give us some insight to answer other questions such as: *"Does the time of year affect the sale price of the house?"*
    - **Can we identify unusual observations that might be affecting the model?** Understanding leverage, outliers, and influential points in the model will allow us to identify unusual observations in our regressions model.
    - **Are there any interactions between the variables affecting the price of the houses?** Experimenting with dummy variables and interaction models might help answer questions such as: *"Do some neighborhoods value having a pool or a garage more than others?"*
    - **What are the most influential predictors?** Hypothesis tests might be the best path in this case.
    - **Can we build a reduced version of the model without sacrificing accuracy?** Use variable selection procedures, collinearity, and ANOVA evaluation to find a good model for the housing dataset from a set of possible models.
    - **Are transformations necessary to improve the accuracy of the model?** Transformation techniques and assumption tests might lead us to the answer.

## Method

This section is devoted to execute all the calculations and operations necessary to achieve the goals exposed in the introductory section.

The first step is to load the original data that is attached to this project in the `data` directory:

```{r message=FALSE}
library(readr)

# Load the data
housing_data = read_csv("data/AmesHousing.csv")
```

We are using a 'csv' file that was derived from the original 'xls' file just to use the well known `read_csv` function.

### Cleaning the data

There are two useless variables. `PID` is a unique identifier of each observation and `Order` is a counter for each observation. Let's get rid of those:

```{r}
housing_data = subset(housing_data, select = -Order)
housing_data = subset(housing_data, select = -PID)
```

Let's start by understanding the structure of the original data and clean as needed.

```{r}
# Display structure of dataset
str(housing_data)
```

From the previous output one can see that this dataset is composed by 2930 observations and 82 variables (80 when removing the two useless variables). There are only `chr` and `int` type variables. A description of each variable can be found here https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt.

It's important to know whether there are `NA` values in the dataset and decide how to manage them. Let's see if there are variables containing `NA` values:

```{r}
hasNA = colSums(is.na(housing_data))[colSums(is.na(housing_data)) > 0]
hasNA
```

As you can see `r length(hasNA)` variables contain `NA` values. Some of those variables have too many `NA`s, therefore we decided to remove any variable with more than 400 `NA`s:

```{r}
# Keep only those variables that has less than 400 NAs
housing_data = subset(housing_data, select = colSums(is.na(housing_data)) < 400)
```

After filtering, we now have `r dim(housing_data)[1]` observations and `r dim(housing_data)[2]` variables.

In order to avoid any issue with the remaining observations that have `NA`s we are removing those as well:

```{r}
housing_data = na.omit(housing_data)
```

After filtering, we now have `r dim(housing_data)[1]` observations and `r dim(housing_data)[2]` variables. These are enough observations to work on our model without any problem.

There are 5 unusual observations that were pointed out by the author of the dataset. These are easily observable by plotting `SalePrice` (price of the house) vs `Gr Liv Area` (above ground living area square feet) as follows:

```{r}
plot(housing_data$SalePrice ~ housing_data$`Gr Liv Area`,
     col = ifelse(housing_data$`Gr Liv Area` > 4000, "orange", "dodgerblue"),
     xlab = "Ground Living Area (square feet)",
     ylab = "Sale Price (dollars)",
     main = "Sale Price vs Ground Living Area",
     pch = 20
     )
```

As you can see there are 3 points that have an extremely low price for a huge ground living area, this doesn't make sense. The other two points seem to be priced correctly but those are unusual sales. Let's remove them:

```{r}
housing_data = housing_data[housing_data$`Gr Liv Area` < 4000, ]
```

During the analysis of the models we found out that there are two neighborhoods with 2 or less observations. We are removing those points since we don't have enough samples to make sure the data is evenly distributed on training and test.

```{r}
housing_data = housing_data[-which(housing_data$Neighborhood %in% c("GrnHill","Landmrk")),]
```

Finally, let's coerce all character columns as factors to fit the categorical predictors of the model without problem:

```{r}
# Determine variables that are of type character
char_var = lapply(housing_data, class) == "character"
# Coerce those columns to factor
housing_data[, char_var] = lapply(housing_data[, char_var], as.factor)
```

### Baseline analysis

As stated in the `Introduction` the goal of the project is to obtain a model for predicting sales price, but trying to keep it simple. In other words, finding a good balance between prediction and explanation.

Before playing around with model selection, let's divide the data into a training and a test set. That will help us evaluate the final model chosen. We'll use the seed `420` for any random process so we can reproduce the results if needed.

```{r}
set.seed(420)

# Get training indexes randomly (80% of the observations)
train_index <- sample(seq_len(nrow(housing_data)), size = floor(0.8 * nrow(housing_data)))

# Divide the data
train_hd <- housing_data[train_index, ]
test_hd <- housing_data[-train_index, ]
```

After splitting the data we now have `r dim(train_hd)[1]` observations for training and `r dim(test_hd)[1]` observations for test.

In order to assess our models we are going to use the set of auxiliary functions defined on week's 9 assignment. We are also defining a macro function that calls all those auxiliary functions and show the results in a single call. This overview encompasses the p-value of the normality and constant variance assumptions through Shapiro and Breusch test, the Fitted vs Residuals plot, the normal Q-Q plot, the number of parameters used (betas), the LOOCV RMSE, and the adjusted $R^2$.

```{r}
# Library required for Breusch-Pagan test
library(lmtest)

# Plot Fitted vs Residuals
plot_fitted_resid = function(model) {
  plot(fitted(model), resid(model),
       col = "dodgerblue", pch = 20, cex = 1.5,
       xlab = "Fitted", ylab = "Residuals",
       main = "Fitted vs residuals plot")
  abline(h = 0, col = "darkorange", lwd = 1)
}

# Plot Normal Q-Q
plot_qq = function(model) {
  qqnorm(resid(model), col = "dodgerblue", pch = 20, cex = 1.5)
  qqline(resid(model), col = "darkorange", lwd = 1)
}

# Breusch-Pagan test (constant variance)
get_bp_decision = function(model) {
  bptest(model)$p.value
}

# Shapiro-Wilk test (normality)
get_sw_decision = function(model) {
  shapiro.test(resid(model))$p.value
}

# Number of parameters
get_num_params = function(model) {
  length(coef(model))
}

# LOOCV RMSE
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

# Adjusted R^2
get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}

# Function that combines the previously defined functions
get_overview = function(model) {
  plot_fitted_resid(model)
  plot_qq(model)
  loocv_rmse = get_loocv_rmse(model)
  adj_r2 = get_adj_r2(model)
  bp_p_value = get_bp_decision(model)
  shapiro_p_value = get_sw_decision(model)
  betas = get_num_params(model)
  list(loocv_rmse = loocv_rmse, adj_r2 = adj_r2, bp_p_value = bp_p_value, shapiro_p_value = shapiro_p_value, betas = betas)
}
```

We can start the process of finding our model by using an additive model as a baseline:

```{r}
model_add = lm(SalePrice ~ ., data = train_hd)
```

Let's run the overview function to assess this baseline model:

```{r}
get_overview(model_add)
```

From the results above you can note that there are troubles on every resulting parameter, except the adjusted $R^2$. LOOCV RMSE is `inf` (because there are hat values equal to 1), both assumptions are violated (p-values are very low), and the number of parameters used is huge (because of the categorical variables, we end up having more betas than the number of predictors available). The Fitted vs Residuals plot also seems to violate both, the 0 mean at any fitted value and the constant variance at any fitted value (no linearity and no constant variance).

Based on the normal Q-Q plot is evident that there is a huge deviation at the edges of the plot. We can apply a log transformation on the response and check whether this minimizes the effect:

```{r}
model_add_log = lm(log(SalePrice) ~ ., data = train_hd)
get_overview(model_add_log)
```

The Shapiro's p-value is worse, but the normal Q-Q plot looks better. This can be due to some outliers that can be observed on the plot.

The outliers can be identified by using `plot` function:

```{r, warning = FALSE}
plot(model_add_log)
```

From the resulting plots we can see that observations `1682`, `578`, and `213` seem to be outliers. Let's remove them:

```{r}
train_hd = train_hd[-c(1682, 578, 213),]
model_add_log = lm(log(SalePrice) ~ ., data = train_hd)
get_overview(model_add_log)
```

It now looks a bit better, but there is still a noticeable violation of normality at the edges. The Fitted vs Residuals plot also looks better, especially the 0 mean at any fitted value.

We tried to find outliers or influential points, but using the heuristics of the book is not possible to find any:

```{r}
# Number of influential points in the transformed model
sum(cooks.distance(model_add_log) > 4 / length(cooks.distance(model_add_log)))

# Number of outliers in the transformed model
sum(abs(rstandard(model_add_log)) > 2)
```

We can also see the positive effect of the `log` function in the response by looking at its histogram. Without `log`:

```{r}
hist(train_hd$SalePrice,
     breaks = 30,
     main = "Histogram of Sale Price without transformation",
     xlab = "Sale Price",
     col = "gray")
```

As you can see this plot is right skewed. This can be normalized by using `log`:

```{r}
hist(log(train_hd$SalePrice),
     breaks = 30,
     main = "Histogram of Sale Price with log transformation",
     xlab = "Sale Price",
     col = "gray")
```

We can notice that the obtained plot is much better and the `log transformation` makes sense for the response.

Let's use this transformed additive model as our baseline. There are still many paramaters that can be improved.

### Selection methods

There are many predictors to play with. We can try to choose a smaller base of "relevant" numerical predictors and then assess the addition of categorical variables through a `forward` process.

One way of doing this is by selecting those numerical predictors that are highly correlated to the response.

```{r}
# Get correlation list of all numerical variables against Sale Price
sale_price_cor = cor(train_hd[sapply(train_hd, is.numeric)])[,"SalePrice"]
sale_price_cor
```

We can then remove those values that are low in magnitude. We'll choose a threshold of 0.4:

```{r}
# Filter correlations with magnitude smaller than a specified threshold
sale_price_cor = sale_price_cor[abs(sale_price_cor) > 0.4]
sale_price_cor
```

We now have a smaller group of numerical predictors to choose from. By reading the documentation of this set of parameters one can see that some of them are the result of the others or are very related. This obviously leads to collinearity issues.

Year built and year of remodelation are repetitive. If there was no remodelation the year built is used on `Year Remod/Add`. Let's keep `Year Remod/Add` only.

`BsmtFin SF 1` and `Total Bsmt SF` are also related. Let's keep `Total Bsmt SF` which has a higher correlation with SalePrice. Same happens with `1st Flr SF` and `Gr Liv Area`, we'll keep `Gr Liv Area`.

There are also 3 numerical variables related to the garage. The `Garage Cars` seems to be the most relevant.

We can now simplify our `sale_price_cor` vector with the chosen variables:

```{r}
# Auxiliary vector with names
var = names(sale_price_cor)

sale_price_cor = sale_price_cor[var != "Year Built" & var != "BsmtFin SF 1" & var != "1st Flr SF" & var != "Garage Yr Blt" & var != "Garage Area"]

sale_price_cor
```

Let's fit a model with the chosen numerical variables and finish the collinearity study using `vif` analysis. Don't forget that we are transforming the response using `log`.

```{r}
# Fit a simple additive model with the chosen numerical variables
model_add_num = lm(log(SalePrice) ~ ., data = train_hd[, names(sale_price_cor)])

# Run vif test
library(faraway)
vif(model_add_num)
```

There are no values over 5 which is the heuristic used in the course book that should cause concern. It seems that we removed the right numerical variables.

We can apply the `get_overview` function to evaluate assumptions as well:

```{r}
get_overview(model_add_num)
```

Everything looks similar to the additive model using all the predictors. This time the LOOCV RMSE is not `inf` which is also a good signal (no hat values of 1).

Let's now check whether transforming any of these predictors makes sense. A visual inspection of `pairs` plot might suggest possible transformations.

```{r}
pairs(train_hd[, names(sale_price_cor)])
```

Based on this output, it seems that trying a `log` transformation on `Gr Liv Area` and `Total Bsmt SF` can make sense to soften the effect ot extreme values:

```{r}
model_add_num_log = lm(log(SalePrice) ~ `Overall Qual` + `Year Remod/Add` + `Mas Vnr Area` + log(`Total Bsmt SF`) + log(`Gr Liv Area`) + `Full Bath`  + `TotRms AbvGrd` + `Fireplaces` + `Garage Cars`, data = train_hd)
get_overview(model_add_num_log)
```

After the transformation the assumption parameters improved and the rest remain almost the same. We can keep these changes.

Let's now try to simplify the model using a backward selection procedure and AIC.

```{r}
model_add_num_log_back = step(model_add_num_log, direction = "backward", trace = 0)
model_add_num_log_back$coefficients
get_overview(model_add_num_log_back)
```

The number of rooms above the ground were removed from the model. Let's confirm that this is not a significant removal using an ANOVA test:

```{r}
p_val = anova(model_add_num_log_back, model_add_num_log)$"Pr(>F)"[2]
p_val
```

We failed to reject the null hypothesis for any rational significance level, which means that the simplified model can be used and the number of rooms above ground were not significant. We now have only 9 numerical variables.

Let's see whether this model can be improved by adding categorical variables. This can be done using a forward selection procedure and BIC to get the smaller possible model:

```{r}
# The scope includes all the filtered numerical variables plus all the available categorical variables
model_add_mix = step(model_add_num_log_back, direction = "forward", scope = SalePrice ~ `Overall Qual` + `Year Remod/Add` + `Mas Vnr Area` + log(`Total Bsmt SF`) + log(`Gr Liv Area`) + `Full Bath` + `Fireplaces` + `Garage Cars` + Street + `MS SubClass` + `MS Zoning` + `Lot Shape` + `Land Contour` + Utilities + `Lot Config` + `Land Slope` + Neighborhood + `Condition 1` + `Condition 2` + `Bldg Type` + `House Style` + `Roof Style` + `Roof Matl` + `Exterior 1st` + `Exterior 2nd` + `Mas Vnr Type` + `Exter Qual` + `Exter Cond` + Foundation + `Bsmt Qual` + `Bsmt Cond` + `Bsmt Exposure` + `BsmtFin Type 1` + `BsmtFin Type 2` + Heating + `Heating QC` + `Central Air` + Electrical + `Kitchen Qual` + Functional + `Garage Type` + `Garage Finish` + `Garage Qual` + `Garage Cond` + `Paved Drive` + `Sale Type` + `Sale Condition`, k = log(length(resid(model_add_num_log_back))), trace = 0)
```

We can now run our diagnostics for the obtained model:

```{r}
get_overview(model_add_mix)
```

The assumptions plots remain the same. There is a noticeable improvement on adjusted $R^2$ and our LOOCV RMSE is `inf` again. This is due to some hat values being 1. Let's explore how many we have:

```{r}
sum(hatvalues(model_add_mix) == 1)
```

Since the number of conflicting hat_values is so low we can simply ignore them and calculate LOOCV RMSE manually:

```{r}
loocv_rmse_final_mod = sqrt(mean((resid(model_add_mix)[-which(hatvalues(model_add_mix) == 1)] / (1 - hatvalues(model_add_mix)[-which(hatvalues(model_add_mix) == 1)])) ^ 2))
loocv_rmse_final_mod
```

This is a much better value than our previous model, which is super good for prediction purposes. The `model_add_mix` is our final choice. We'll assess it against the test set in the `results` section. We'll also discuss the chosen model in general.

### Practical questions derived from the dataset

## Results

### Predictive model chosen

The chosen model to predict Sale Price for the given dataset was stored in `model_add_mix`. It uses 20 predictors out of 80 originally available, which is aligned to our initial commitment towards not only getting a good model for prediction, but also a balanced model between prediction and explanation. 12 of those predictors are categorical and 8 numerical.

Below you can see the formula of the chosen model, which includes all the predictors and transformations used:

```{r}
model_add_mix$call
```

Since some of the predictors are categorical, then the number of coefficients in the model is bigger than the number of predictors used (due to the use of dummy variables). The model has in total `r length(model_add_mix$coefficients)` coefficients.

To assess the final model, we can first focus on the prediction capabilities and then around explanatory capabilities.

From a prediction perspective we can first plot the fitted values versus the actual values to get a visual look of how the chosen model performs on the training set:

```{r}
plot(train_hd$SalePrice, exp(fitted(model_add_mix)),
     xlab = "Actual price (dollars)",
     ylab = "Predicted price (dollars)",
     main = "Predicted vs actual price in dollars for training set",
     col = "orange")
abline(0, 1, lwd = 2, col = "dodgerblue")
```

The obtained plot is self-explanatory. The points are very close to the identity line which means that the predictions are relatively similar to the actual values. It seems that there is a constant variance around the identity line (perhaps the epsilon of our model) which increases as the actual price increases. A possible explanation for this is that, in practical terms, expensive houses are not easy to sell so the price can fluctuate a lot depending on the negotiations terms of the sale, therefore, the price is not as 'easy' to predict than lower price homes.

Let's do the same for the test set. The predicted values are first stored in a variable so we can use it later to calculate more results.

```{r}
# We are removing a conflicting observation for "Kitchen Qual" that is only present on the test set
# This is because there is only one observation with that value
test_hd = test_hd[-which(test_hd$`Kitchen Qual` == "Po"),]

# Predict values
predicted = exp(predict(model_add_mix, test_hd))
```

We can now plot for the test set. These are unbiased results that can tell us a lot about the chosen model:

```{r}
plot(test_hd$SalePrice, predicted,
     xlab = "Actual price (dollars)",
     ylab = "Predicted price (dollars)",
     main = "Predicted vs actual price in dollars for test set",
     col = "orange")
abline(0, 1, lwd = 2, col = "dodgerblue")
```

We can see that the behavior of this plot is really similar to the plot of the training set. The model seems to be doing a great job on predicting prices. For higher values, the deviation is bigger which confirms the idea that the price of an expensive house is difficult to predict. More iterations of the model can be included in order to improve the prediction of those high priced houses.

Quantitatively, the chosen model has an adjusted $R^2$ of `r summary(model_add_mix)$adj.r.squared`. This is a decent value which can be interpreted as the model being able to explain a big part of the variation of the response. In terms of the test set we can calculate the RMSE which will simplify in a number what we saw in the previous plot:

```{r}
rmse_test = sqrt(mean((test_hd$SalePrice - predicted)^2))
rmse_test
```

The obtained RMSE for the test set is `r rmse_test`. Let's compare it with the training RMSE. This will give us a clue around overfitting issues. Both results are put together in a table:

```{r}
rmse_train = sqrt(mean((train_hd$SalePrice - exp(fitted(model_add_mix)))^2))
rmse_train

library("knitr")
results = data.frame('Training' = c(rmse_train), 'Test' = c(rmse_test))
rownames(results) <- c("RMSE")
kable(results, format = "markdown")
```

The test RMSE is lower than the training RMSE. That's a good signal that the model is not overfitting the data. This has a positive impact on the predictive goal of our model.

To sustain our theory that our model is better on predicting lower prices, we can calculate the RMSE for house prices less than 200000 dollars:

```{r}
sqrt(mean((test_hd$SalePrice[test_hd$SalePrice < 200000] - predicted[which(test_hd$SalePrice < 200000)])^2))
```

Notice how the test RMSE is significantly reduced.

One good way to assess our model against others in terms of prediction is by checking Kaggle's competition. There is a public leaderboard that scores models by RMSLE over a test set. All those models are using the same dataset that we used for our project. The RMSLE of our model is given by:

```{r}
sqrt(mean((log(test_hd$SalePrice) - log(predicted))^2))
```

As of today, this score would place our model on position 43 out of 4741, not bad at all. The highest score by the date this report was submitted is 0.08397.

Let's now focus on the explanatory capabilities of our model. Even though we were able to siginificantly reduced the amount of predictors used (20 out of 80), the model is not passing the assumptions tests, and there is an evident deviance on the normal Q-Q plot. Let's review this information:

```{r}
get_overview(model_add_mix)
```

As you can see both assumption tests (Shapiro and Breusch) are returning really small values which is a clear violation of normality and constant variance. This means that any interpretation coming from the coefficients is suspect and this model shouldn't be used for explanatory purposes as it is now. Moreover, even if the assumptions weren't violated, the use of 20 predictors and 89 coefficients would make an explanation almost impossible to digest for a human.

### Practical questions derived from the dataset

## Discussion

This section will be focused on analyze the usefulness of the chosen model based on the results shown on the previous section, discuss the answers provided for the practical questions defined in the proposal, and some lessons learned while playing around with the data itself. Let's enumarate them by category:

- **Prediction**:
    - The model was chosen by significantly reducing the number of predictors and combining several techniques studied in the course (forward selection process, correlation analysis, statistical tests, unusual observations analysis, etc) without sacrificing prediction usefulness. The chosen model returned the lowest LOOCV RMSE of all our analysis compared to the other models tested (`r loocv_rmse_final_mod`, remember that this LOOCV RMSE value is calculated using the log of the response). This is a nice metric which prevents overfitting, penalizes larger models implicitly, and focuses on errors which increase the confidence around prediction capabilities.
    - Two plots were provided to compare predicted vs actual sale prices for both, the training set and the test set. There is a nice trend close to the identity line and an obvious increase of variance for expensive houses. From these results we can conclude that our model is better on predicting average priced houses. More adjustments can be considered to improve the prediction for those houses, or maybe those are simply difficult to predict due to the elevated price and negotiation conditions.
    - RMSE for both, training and test set, was provided. The RMSE of the test set is lower than the training set which give us confidence around not having any overfitting issue.
    - RMSE is radically improved by filtering house prices lower than 200000 dollars. This confirms our conclusion around the model being better on predicting lower priced houses.
    - RMSLE for our test set was calculated. We did this to relatively compare our model to thousands of other models submitted on a Kaggle competition that used exactly the same dataset. As of today, our model would be on position 43 out of more than four thousand models.

- **Explanation**:
    - The normality and constant variance assumptions are violated by the model. Even though the goal of creating a model with a reduced amount of predictors was achieved, we couldn't improve the assumption metrics. In other words, the coefficients shouldn't be used to try to explain any relationship. The practical questions that we tried to answer are better explanation examples than the chosen model itself.

- **Practical questions**: TODO: ADD HERE.

- **General**:
    - The process of cleaning the data is not trivial at all. Even though this dataset was cured previously by the author, there are still many `NA`s and subtleties to play with. From removing useless predictors (like identifiers or observation counters) to finding outliers and remove observations that can break prediction when dividing the test and the training set (e.g. some categories with only one observation).
    - Exhaustive selection methods are not always an option. We started by trying using the `regsubsets` function with all the predictors, but the processing time was insane. We then decided on using a different approach combinind transformations, a forward selection method, and a collinearity and correlation analysis. In other words, there is no a single recipe to find a 'good' model.
    - A general exploration data is key before starting a model selection process. You need to know your data, check the description of the predictors, and use a bit of expertise and intuition to set a baseline model and iterate from there.

## Appendix

Here is a list of libraries that must be installed in order to run the RMD file attached to this assignment. Please make sure you have all of them installed:

```{r, eval = FALSE}
install.packages("readr")
install.packages("lmtest")
install.packages("faraway")
install.packages("knitr")
install.packages("jtools")
```

## References

1. De Cock, Dean. (2011). *Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project*. Journal of Statistics Education.
2. *House Prices: Advanced Regression Techniques*. Obtained on July 9th, 2018, from: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
3. Dalpiaz, D. (2018). *Applied Statistics with R*. Obtained on July 9th, 2018, from: https://daviddalpiaz.github.io/appliedstats/
4. De Cock, Dean. (2011). *Ames, Iowa house price dataset*. Obtained on July 9th, 2018, from: http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls
5. De Cock, Dean. (2011). *Ames, Iowa house price dataset description*. Obtained on July 9th, 2018, from: https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt