linearRegressionAnalysis/proposal.Rmd at master · fvargaspiedra/linearRegressionAnalysis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
---
title: "Proposal - Final Project"
author: "Hari Manan (NetID: nfnh2), Javier Matias (NetID: javierm4), Francisco Vargas (NetID: fav3)"
date: 'July 17th, 2018'
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---

## Students names

- Hari Manan (NetID: nfnh2)
- Javier Matias (NetID: javierm4)
- Francisco Vargas (NetID: fav3)

## Tentative title

Prediction of House Prices in Ames, Iowa Using Regression Analysis Techniques

## Dataset description

The Ames Housing dataset describes houses sold between 2006 and 2010 in Ames, Iowa. It contains 1460 observations with 81 features (23 nominal/categorical, 23 ordinal, 14 discrete, and 21 continuous).

The features can be roughly divided into 6 categories. We include some examples of each category:


- **House characteristics**: e.g. number of bedrooms, number of bathrooms, electrical system
- **Space measurements**: e.g. lot area, living area
- **Quality ratings**: e.g. overall condition, kitchen condition
- **Construction characteristics and materials**: e.g. roof material
- **Zone and location classification**: e.g. zoning classification, neighborhood
- **Terms of sale**: e.g. month and year sold, sale price

The following link has a description of all the predictors: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

The dataset complies with all the requirements for the final project on STAT-420.

## Background

This dataset was compiled by Dean DeCock in 2011 to be used for Data Science education and research. It was an effort towards expanding a famous dataset about the housing market in Boston by adding variables to those already provided in the Boston Housing dataset.

The link to the original paper can be found here: https://ww2.amstat.org/publications/jse/v19n3/decock.pdf

The dataset is also openly available here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

## Motivation

The main goal of this project is to develop a simple model using a reduced number of variables that will allow the team to predict the sale price of a house. There are several reasons why the team decided to use this dataset for the final project. Most of them are related to research and personal motivations and are enumerated below:

1. **Model simplification**: This dataset offers many variables. Even though accuracy tends to improve as the number of predictors increases, larger models also increase complexity and become more difficult to interpret. Complex models are also highly succeptible to overfitting. The goal of this project is to derive a model using a subset of the predictor variables while achieving a comparable level of performance.
2. **Cleanness**: The dataset is well curated and does not require extensive cleansing. This will allow the team to focus on analyzing the data itself.
3. **Questions**: From a personal interest perspective, this dataset is related to a frequent activity: buying a house. This is a well-known activity that allows for some interesting questions:
    - **How effectively can we predict the price of a house given a set of explanatory variables?** A test/training set evaluation process can help us measure the accuracy under acceptable error ranges. This could give us some insight to answer other questions such as: *"Does the time of year affect the sale price of the house?"*
    - **Can we identify unusual observations that might be affecting the model?** Understanding leverage, outliers, and influential points in the model will allow us to identify unusual observations in our regressions model.
    - **Are there any interactions between the variables affecting the price of the houses?** Experimenting with dummy variables and interaction models might help answer questions such as: *"Do some neighborhoods value having a pool or a garage more than others?"*
    - **What are the most influential predictors?** Hypothesis tests might be the best path in this case.
    - **Can we build a reduced version of the model without sacrificing accuracy?** Use variable selection procedures, collinearity, and ANOVA evaluation to find a good model for the housing dataset from a set of possible models.
    - **Are transformations necessary to improve the accuracy of the model?** Transformation techniques and assumption tests might lead us to the answer.


The questions and motivations exposed are not limited to this proposal. During the execution of the project some of them might change. Our goal is to try to answer as much as we can.

## Load proof

Below you can find some simple `R` code that shows how to load the data and the first 10 lines of the response variable (sale price) for this project:

```{r message=FALSE}
library(readr)

housing_data = read_csv("data/houseprices.csv")
head(housing_data$SalePrice, 10)
```

## References

1. De Cock, Dean. (2011). *Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project*. Journal of Statistics Education.
2. *House Prices: Advanced Regression Techniques*. Obtained on July 9th, 2018, from: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
3. Dalpiaz, D. (2018). *Applied Statistics with R*. Obtained on July 9th, 2018, from: https://daviddalpiaz.github.io/appliedstats/