-
Notifications
You must be signed in to change notification settings - Fork 15
/
Copy pathProject.Rmd
223 lines (182 loc) · 9.34 KB
/
Project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
---
title: "Assessing Health and Economic Impact of Weather Events"
output:
html_document:
keep_md: true
---
## Synopsis of Study Results
This study U.S. National Oceanic and Atmospheric Administration's
(NOAA) storm database. This database tracks characteristics of major
storms and weather events in the United States, including when and
where they occur, as well as estimates of any fatalities, injuries,
and property damage.
Wind events, which include tornadoes and hurricanes, are by the far
the most harmful in aggregate, causing over 100,000 injuries and 90
deaths over the course of this study. Though less frequent, severe
heat events have the highest incidence of deaths and injuries per
event. This study finds that severe rain and wind events are by far
the most costly in terms of dollars spent to replace property and crop
damage.
## Questions this study considers
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
1. Across the United States, which types of events have the greatest economic consequences?
## Notes about the environment used
This study was done using the following tools, including OS and Programming language versions
The study was conducted on a 64-bit Windows 7 machine with 4 cores.
R language was `r R.version.string`
For publishing to `rpubs.com`, I used RStudio version 0.98.1103
The full project may be found on Github at `https://github.com/gregoryg/05-reproducible-research-assignment-2`
## Data Processing
### Set libraries used in this analysis
```{r loadLibraries, warning=FALSE, results="hide", message=FALSE}
library(stringr)
library(lubridate)
library(sqldf)
library(ggplot2)
library(reshape2)
library(gridExtra)
```
### Loading the data
```{r loadData}
dseturl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dsetzip <- "data/StormData.csv.bz2"
dsetrds <- "data/StormData.RDS"
if (!file.exists(dsetzip)) {
download.file(url = dseturl,
destfile = dsetzip,
method = "curl")
}
## For faster processing, check for R Data Set save file
## To force data cleaning, delete .RDS file before running/knitting/weaving
RDSloaded <- FALSE
if (!file.exists(dsetrds)) {
d <- read.csv(file = bzfile(dsetzip), strip.white = TRUE)
} else {
d <- readRDS(dsetrds)
RDSloaded <- TRUE
}
```
## Cleaning the data
Because we want to determine the most costly disasters, we have to
examine the the property damage (`PROPDMG`) and crop damage
(`CROPDMG`) values. These are simple integers which must be
multiplied by an exponent given in another field (`PROPDMGEXP` and
`CROPDMGEXP` respectively. Unfortunately, not all of the exponent
values are valid; many appear to be rounding or entry errors from
older entry systems.
For the purposes of this study, we accept only the following values
for exponent:
* H hundred (x100)
* K thousand (x1,000)
* M million (x1,000,000)
* B billion (x1,000,000,000)
```{r addUSDcol}
if(!RDSloaded) {
calcUSD <- function(dmg, dmgexp) dmg * switch(toupper(dmgexp), H=100, K=1000, M=1000000, B=1000000000, 1)
d$pdmgUSD <- mapply(calcUSD, d$PROPDMG, d$PROPDMGEXP)
d$cdmgUSD <- mapply(calcUSD, d$CROPDMG, d$CROPDMGEXP)
}
```
To assist in date analysis, we will convert dates to POSIXct format
```{r addPOSIXdate}
if(!RDSloaded)
d$BEGIN_UTC <- mdy(str_extract(d$BGN_DATE, "[^ ]+"))
```
The manual entry nature of the data causes huge difficulties in
categorizing the weather events. For example, you will find high wind
events entered in completely arbitrary ways, mixing terminology,
abbreviations, upper and lower case etc. (`thunderstorm`, `gusty
thunderstorm wind`, `gusty wind/rain`, `marine tstm wind`). We are
going to attempt to categorize the most impactful events by looking
for common words and abbreviations in a relative handful of weather
categories.
```{r addWeatherCategory}
if (!RDSloaded) {
generateEvent <- function(evt) {
evt <- tolower(evt)
ifelse(grepl("lightning", evt), "lightning",
ifelse(grepl("hail", evt), "hail",
ifelse(grepl("rain|flood|wet|fld", evt), "rain",
ifelse(grepl("snow|winter|wintry|blizzard|sleet|cold|ice|freeze|avalanche|icy", evt), "winter",
ifelse(grepl("thunder|tstm|tornado|wind|hurricane|funnel|tropical +storm", evt), "wind",
ifelse(grepl("fire", evt), "fire",
ifelse(grepl("fog|visibility|dark|dust", evt), "low visibility",
ifelse(grepl("surf|surge|tide|tsunami|current", evt), "ocean surge",
ifelse(grepl("heat|high +temp|record +temp|warm|dry", evt), "heat",
ifelse(grepl("volcan", evt), "volcanic activity",
"uncategorized"
))))))))))
}
d$weatherCategory <- mapply(generateEvent, d$EVTYPE)
}
```
For purposes of this study, USA is defined as the 50 states in the
continental US, plus District of Columbia, Hawaii and
Alaska. territories, protectorates, and military regions are excluded
```{r addUSAcol}
if (!RDSloaded)
d$isUSA <- mapply( function(st) st %in% state.abb, d$STATE )
```
In the interest of performance, we will save the data frame as an R
data set (`.RDS` file). After reading, immediately subset the data to
only those records originating in the US.
```{r saveRDS}
if (!RDSloaded) {
saveRDS(d, file=dsetrds)
d <- readRDS(dsetrds)
RDSloaded <- TRUE
}
## subset to only USA data
d <- d[d$isUSA == TRUE,]
```
## Results
### Question 1: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Refer to the result of the query below, which groups US mortalities and injuries by weather category.
```{r findHarmfulEvents, message=FALSE}
harm <- sqldf("select sum(FATALITIES) as deaths, sum(INJURIES) as injuries, weatherCategory,count(*) as sumrecs from d group by weatherCategory ")
harm$weatherCategory <- factor(harm$weatherCategory, levels=harm[order(harm$injuries), "weatherCategory"])
## ggplot loves long form, which means a type of normalization using melt()
hdat <- melt(harm, id.vars=c("weatherCategory", "sumrecs"), measure.vars=c("deaths", "injuries"))
hdat <- sqldf("select *,(value/sumrecs)*100 as pctPerEvent from hdat")
plot1 <- ggplot(hdat, aes(x=weatherCategory, y=value, fill=variable)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Total Casualties per Weather Type") + xlab("Weather type") + ylab("# Casualties")
plot2 <- ggplot(hdat, aes(x=weatherCategory, y=pctPerEvent, fill=variable)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Average Casualties per Incident") + xlab("Weather type") + ylab("% Casualties Reported")
grid.arrange(plot1, plot2, ncol=1, sub="Figure 1")
```
Wind events -- including tornadoes and hurricanes -- have the highest
impact on health in terms of absolute numbers reported. Heat events
-- including fires and heat waves -- show the highest percentage of
casualties per event.
### Question 2: Across the United States, which types of events have the greatest economic consequences?
```{r}
crop <- sqldf("select sum(pdmgUSD) as propertyDmgUSD, sum(cdmgUSD) as cropDmgUSD, count(*) as sumrecs, weatherCategory from d group by weatherCategory")
crop <- sqldf("select *, propertyDmgUSD + cropDmgUSD as totalCost from crop")
## create long form on crop vs property damage columns
cdat <- melt(crop, id.vars=c("weatherCategory", "sumrecs", "totalCost"), measure.vars=c("propertyDmgUSD", "cropDmgUSD"))
cdat <- sqldf("select *, (value/sumrecs) as costPerEvent from cdat")
plot3 <- ggplot(cdat, aes(x=weatherCategory, y=value/1000000000, fill=variable)) + geom_bar(stat="identity") + coord_flip() + ggtitle("Damages per Weather Type") + xlab("Weather type") + ylab("Damages in Billions USD")
grid.arrange(plot3, ncol=1, sub="Figure 2")
```
Rain and wind events are the most costly weather types, both in terms
of property and damage to crops. There is a very significant crop
damage cost in the "uncategorized" weather events. This bears
examination, and possibly some re-evaluation of the categorization
used in this study.
## Appendix
### Utility functions
Some functions not used in the published analysis that may be useful
```{r}
## timeconv() is used to take inconsistent times and convert them to a standard format
## times will either be 24-hour HHMM ("1330") or 12-hour strings with AM/PM ("01:30:00 PM")
## functions returns format of 24-hour HH:MM:SS ("13:30:00")
timeconv <- function(x) {
if (nchar(x) == 4) {
paste(substr(x,1,2), substr(x,3,4), "00", sep=":")
} else {
timeadd <- 1
if (substr(x,nchar(x)-1,nchar(x)) == "PM") timeadd <- 12
paste0(
formatC(as.integer(substr(x,1,2)) + timeadd, width=2, format="d", flag="0"),
substr(x,3,nchar(x)-3))
}
}
```