---
title: "21-explore-austin"
output:
  pdf_document:
    toc: yes
    toc_depth: '3'
  html_notebook:
    toc: yes
    toc_depth: 3
    toc_float: yes
    number_sections: yes
---
## Load libs, common functions
```{r Load, results='hide', message=FALSE, warning=FALSE}
# Purl the load notebook into an R script, source it, then delete the script.
source(knitr::purl("11-load-austin.Rmd"))
fs::file_delete("11-load-austin.R")
p_load(ggplot2, OpenStreetMap, rgdal, leaflet, sp, Hmisc)
```
## Basic EDA
First we can take a look at the dimensions of the two datasets and see right away that the residential demolition permit set is tiny compared to the overall permit set.
``` {r Dims}
dim(austin_permits)
dim(res_demo_permits)
```
From there I was interested in exploring the range of time covered by the permits. Surprisingly, it goes all the way back to 1921, though as the histogram shows, the frequency increases dramatically once we hit the 80s. Only 2,265 permits predate 1980, so if necessary, we can effectively ignore those.
```{r Years}
austin_permits %>%
  filter(calendar_year_issued < 1980) %>%
  nrow()
austin_permits %>%
  summarize_if(is.numeric, range, na.rm = TRUE)
austin_permits %>%
  ggplot(aes(x = calendar_year_issued)) +
  geom_bar(position = 'dodge')
```
Most of the columns were almost entirely full, but I wanted to take a look at the columns with a low density of non-NA values.
I separated them into three distinct categories:

- Medium density: columns with a substantial share of missing values, but also a substantial amount of data.
- Low density sqft: square-footage columns that are almost entirely empty.
- Low density valuation: valuation columns that are almost entirely empty.
``` {r Densities}
med_density_columns = c("number_of_floors", "housing_units")
low_density_columns_sqft = c("total_existing_bldg_sqft", "remodel_repair_sqft", "total_new_add_sqft")
low_density_columns_valuation = c(
  "total_valuation_remodel", "total_job_valuation", "building_valuation",
  "building_valuation_remodel", "electrical_valuation", "electrical_valuation_remodel",
  "mechanical_valuation", "mechanical_valuation_remodel", "plumbing_valuation",
  "plumbing_valuation_remodel", "med_gas_valuation_remodel"
)
```
From there we can analyze how full those columns are. For the sqft and valuation categories I used the inverse densities (the fraction of non-NA values) so that I could sum, across the columns in a category, the fraction of rows with a value present.
For the valuation columns, this sum comes out to almost exactly 1, leading me to believe there is almost exactly one valuation per sub-permit (sub-permits are discussed later), but it's not exact. I'll have to do a little more research to determine exactly what the slight discrepancy is.
``` {r Density Calculations}
med_column_densities = select(austin_permits, all_of(med_density_columns)) %>%
  map(~sum(is.na(.)) / length(.))
med_column_densities
inverse_valuation_column_densities = select(austin_permits, all_of(low_density_columns_valuation)) %>%
  map(~(1 - sum(is.na(.)) / length(.)))
inverse_valuation_column_densities
Reduce("+", inverse_valuation_column_densities)
inverse_sqft_column_densities = select(austin_permits, all_of(low_density_columns_sqft)) %>%
  map(~(1 - sum(is.na(.)) / length(.)))
inverse_sqft_column_densities
Reduce("+", inverse_sqft_column_densities)
```
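One way to chase down that slight discrepancy is to count, for each row, how many of the valuation columns are non-NA; if there really were exactly one valuation per sub-permit, every count would be exactly 1. A minimal sketch of that check, using the column list defined above:
```{r Valuation Count Check}
# Count the non-NA valuation columns on each row. Under a strict "one
# valuation per sub-permit" pattern, this table would contain only 1s.
valuation_counts <- rowSums(!is.na(select(austin_permits, all_of(low_density_columns_valuation))))
table(valuation_counts)
```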
## Map
This still needs some touching up, but it is intended to help visualize the locations of permits on a map. To keep computation reasonable, I'm currently plotting a random sample of 100,000 permits and using marker clustering to reduce the visual load. This gives an interactive way to roughly see where permits are coming from.
In the future, I would like to update this map to show how the locations of the permits have changed through the years.
```{r Visualize Map, results = 'hide', message=FALSE, warning=FALSE}
# Download an OSM basemap covering the Austin area. This is only needed for
# the commented-out base-R plot below; the leaflet map fetches its own tiles.
map <- openmap(c(30.0000, -98.00000),
               c(31.05, -97.00009), zoom = 10, 'osm')
```
``` {r Map Part 2}
# Sample 100,000 permits with non-missing coordinates and promote them to a
# SpatialPointsDataFrame so sp/leaflet can use them.
dropped <- sample_n(drop_na(austin_permits, latitude, longitude), 100000)
coordinates(dropped) <- ~longitude + latitude
# proj4string(dropped) <- CRS("+init=epsg:4326")
# plot(map)
# points(spTransform(head(dropped), osm()), cex = 1000)
leaflet(dropped) %>% addTiles() %>% addMarkers(
  clusterOptions = markerClusterOptions()
)
```
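As a starting point for that year-based view, here is a minimal sketch (left unevaluated) that colors a smaller sample of permits by the decade they were issued. The sample size, palette, and styling are placeholder choices, not settled design decisions:
```{r Map By Decade, eval=FALSE}
# Sketch: color a 10,000-permit sample by decade issued, assuming the same
# latitude/longitude/calendar_year_issued columns used above.
decade_sample <- austin_permits %>%
  drop_na(latitude, longitude, calendar_year_issued) %>%
  sample_n(10000) %>%
  mutate(decade = factor(calendar_year_issued %/% 10 * 10))
pal <- colorFactor("viridis", decade_sample$decade)
leaflet(decade_sample) %>%
  addTiles() %>%
  addCircleMarkers(~longitude, ~latitude, color = ~pal(decade),
                   radius = 3, stroke = FALSE, fillOpacity = 0.5) %>%
  addLegend(pal = pal, values = ~decade, title = "Decade issued")
```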
## Unique Permits
There are many repeated permit numbers within the data set, likely because different elements of the same job require separate permits. This section attempts to answer questions about the nature of those repeats. The main data set contains mostly unique permit numbers and maxes out at 5 permits per permit number, which suggests a job site may have up to one permit of each of the 5 permit types. The demolition set has entirely unique permit numbers.
``` {r Examine Unique Permits}
num_total_ids <- length(austin_permits$permit_num_clean)
num_unique_ids <- length(unique(austin_permits$permit_num_clean))
num_total_ids - num_unique_ids
unique_ids <- unique(austin_permits$permit_num_clean)
austin_permits$permit_num_clean <- as.numeric(austin_permits$permit_num_clean)
austin_permits_repeat_count = austin_permits %>% select(permit_num_clean) %>%
  group_by(permit_num_clean) %>%
  mutate(count = n()) %>%
  distinct() %>%
  arrange(-count)
austin_permits_repeat_count
hist(austin_permits_repeat_count$count)
```
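One way to test the "one permit of each type" reading directly is to check whether any repeated permit number ever repeats a permit type. This sketch is left unevaluated because the column name `permit_type_desc` is an assumption; substitute whichever column actually holds the permit type:
```{r Types Per Permit, eval=FALSE}
# If every repeated permit number carries distinct permit types, this returns
# zero rows. `permit_type_desc` is a hypothetical column name.
austin_permits %>%
  group_by(permit_num_clean) %>%
  summarise(n_rows = n(), n_types = n_distinct(permit_type_desc)) %>%
  filter(n_rows != n_types)
```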
This performs the same calculations as the previous chunk, this time on the demolition set. The results are much less interesting, though, since all of the permit numbers are unique.
``` {r Demos Unique}
demo_unique_ids <- unique(res_demo_permits$permit_num_clean)
res_demo_permits$permit_num_clean <- as.numeric(res_demo_permits$permit_num_clean)
demo_permits_repeat_count = res_demo_permits %>% select(permit_num_clean) %>%
  group_by(permit_num_clean) %>%
  mutate(count = n()) %>%
  distinct() %>%
  arrange(-count)
demo_permits_repeat_count
hist(demo_permits_repeat_count$count)
```
## Construction vs. Demolition
This section is intended to answer some of our questions about permit counts in relation to construction vs. demolition. Currently it contains information and visualizations related to the number of permits of different types; it will be updated with debris information when that is received.
This examines total residential demolitions versus partial residential demolitions, to give an idea of the scale of demolitions as well as to make it easier to interface with other data sets that draw this distinction in different ways.
``` {r Residential Demolitions}
num_residential_demolitions <- nrow(res_demo_permits)
num_total_residential_demolitions <- length(res_demo_permits$demolition_category[res_demo_permits$demolition_category == "Total Demolition"])
num_partial_residential_demolitions <- length(res_demo_permits$demolition_category[res_demo_permits$demolition_category == "Partial Demolition"])
res_demo_points <- rbind(c(num_total_residential_demolitions, num_partial_residential_demolitions))
res_demo_labels <- c("Complete Residential Demolitions", "Partial Residential Demolitions")
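# These two objects look intended for a barplot like the ones in the later
# sections (an assumption on my part); plotting them for completeness:
barplot(res_demo_points, names.arg = res_demo_labels)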
res_demo_permits %>%
  ggplot(aes(x = calendar_year_issued, fill = demolition_category)) +
  geom_bar(position = 'dodge')
```
Next we can start pulling overall information about construction versus demolition. This will help answer the recurring questions we've had about how large a share of the debris demolitions account for. I anticipate this information being much more useful when the debris data arrives.
``` {r CVD}
# demo_classes <- demo_unique_table$work_class
demo_classes <- c("Demolition", "Demo", "Interior Demo Non-Structural")
demolitions <- austin_permits[austin_permits$work_class %in% demo_classes,]
num_demolitions <- nrow(demolitions)
num_commercial_demolitions <- length(demolitions$permit_class_mapped[demolitions$permit_class_mapped == "Commercial"])
num_constructions <- nrow(austin_permits) - num_demolitions
total_points <- rbind(c(num_demolitions, num_constructions))
total_labels <- c("Total Demolition Permits", "Total Construction Permits")
barplot(total_points, names.arg=total_labels)
```
Finally, we analyze the breakdown of commercial versus residential demolitions, since we expect this to be an important factor in how much debris a project generates.
``` {r RVC}
demo_points <- rbind(c(num_demolitions, num_commercial_demolitions, num_residential_demolitions))
demo_labels <- c("Total Demolitions", "Commercial Demolitions", "Residential Demolitions")
barplot(demo_points, names.arg = demo_labels)
# num_demolitions - (num_commercial_demolitions + num_residential_demolitions)
```
## Examine Permit Statuses
Examine the breakdown of permit statuses over the years and attempt to extract meaningful or useful information from them. Notably, this shouldn't have any effect for non-Nashville data sets, since the final permit status is only useful for determining whether or not a permit generated any waste at all, which will already be known information.
``` {r Permit Statuses}
selected_statuses <- table(unlist(austin_permits$status_current))
selected_statuses
barplot(prop.table(selected_statuses))
```
Here I adapted some of Adam's code to visualize permit status by year. In this case it is much less useful due to the overwhelming number of statuses, though I did reduce it slightly by keeping only statuses with more than 1,000 permits.
``` {r By Year}
austin_permits %>%
  group_by(status_current) %>%
  filter(n() > 1000) %>%
  group_by(calendar_year_issued, status_current) %>%
  summarise(counts = n()) %>%
  filter(calendar_year_issued > 1980) %>%
  ggplot(aes(x = calendar_year_issued, y = counts, fill = status_current)) +
  geom_bar(stat = 'identity', position = 'dodge')
```
## Examine Work Classes
Examine the breakdown of work classes overall.
``` {r Work Classes}
selected_work_classes <- table(unlist(austin_permits$work_class))
selected_work_classes
austin_permits %>%
  ggplot(aes(x = work_class)) +
  geom_bar(aes(y = ..prop.., group = 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Same as above, but a breakdown of work classes by year this time. Unfortunately, the work classes are more evenly spread, so a readable visualization can only contain the most common work classes; everything else is lumped into 'other' (the code below keeps year/work-class groups with more than 10,000 permits).
``` {r work class by year}
austin_permits %>%
  filter(calendar_year_issued > 1980) %>%
  group_by(calendar_year_issued, work_class) %>%
  mutate(work_class_abbrev = ifelse(n() > 10000, work_class, 'other')) %>%
  ggplot(aes(x = calendar_year_issued, fill = work_class_abbrev)) +
  geom_bar(width = 0.5, position = 'dodge')
```
## Valuations
This section explores combining valuations. I'm not really sure how legitimate these valuations are, as a fairly substantial portion of them contain extremely large values. We will need to talk to the team at Nashville Metro to see what the reasonable bounds on valuations are.
``` {r valuations}
no_valuations <- austin_permits %>%
  filter_at(low_density_columns_valuation, all_vars(is.na(.)))
no_valuations
austin_permits$total_valuation <- rowSums(austin_permits[, low_density_columns_valuation], na.rm = TRUE)
sum(austin_permits$total_valuation == 0)
combined_valuations = aggregate(austin_permits$total_valuation, by = list(Category = austin_permits$permit_num_clean), FUN = sum)
max(combined_valuations$x)
combined_valuations %>%
  filter(x > 1000000) %>%
  nrow()
# Fraction of combined valuations that are exactly zero
sum(combined_valuations$x == 0) / nrow(combined_valuations)
```
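To put a number on how heavy that upper tail is, here is a quick look at the quantiles of the combined valuations. This is purely descriptive; it doesn't decide what bounds are reasonable:
```{r Valuation Quantiles}
# Median and upper-tail quantiles of the combined per-permit valuations.
quantile(combined_valuations$x, probs = c(0.5, 0.9, 0.99, 0.999, 1), na.rm = TRUE)
```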
## Feather
``` {r feather}
write_feather(austin_permits, expand_boxpath("Austin, Tx/austin_permits.feather"))
write_feather(res_demo_permits, expand_boxpath("Austin, Tx/austin_res_demo_permits.feather"))
```
## Austin Tonnages
``` {r explore tonnages}
tonnages <- read_feather(expand_boxpath("Austin, Tx/austin_tonnages.feather"))
# First off, there are multiple forms of tonnage listed, so I want to check here to make sure there aren't any discrepancies between them.
# From this, we can see that "Total Tons" and "Total Tons5" are identical (as,
# most likely, are their respective Diversion and Landfill columns, though
# those aren't relevant to our modeling), but "Commingled Total Tons" is
# slightly different. It differs in only 30 rows, and most of those rows are 0s
# in the commingled column with actual values in the "Total Tons" column, so
# I'm going to treat "Total Tons" as the source of truth for the rest of this
# process.
identical(tonnages[['Total Tons']],tonnages[['Total Tons5']])
identical(tonnages[['Commingled Total Tons']],tonnages[['Total Tons5']])
identical(tonnages[['Commingled Total Tons']],tonnages[['Total Tons']])
comparing <- tonnages %>%
  select(`Commingled Total Tons`, `Total Tons`) %>%
  filter(`Commingled Total Tons` != `Total Tons`)
comparing
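# Quick check on the claim above (how many of the mismatched rows are zeros in
# the commingled column); this is descriptive only, not a filter.
sum(comparing$`Commingled Total Tons` == 0)
nrow(comparing)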
# Check repeat permit numbers and get a histogram of the number of repeats. It
# is clear that very few permit numbers are repeated, and even then they are
# only repeated a single additional time. This is why my original analysis
# yielded a larger data set than expected.
tonnages$permit_num_clean <- as.numeric(tonnages$permit_num_clean)
tonnages_repeat_count = tonnages %>% select(permit_num_clean) %>%
  group_by(permit_num_clean) %>%
  mutate(count = n()) %>%
  distinct() %>%
  arrange(-count)
tonnages_repeat_count
hist(tonnages_repeat_count$count)
# Next we look at repeat tonnage values, as a rough gauge of the ACTUAL number
# of repeated entries. This is not used to filter or change the data in any
# way, just to get a rough idea of how many repeats to expect. As the
# histogram shows, there is a significant number of repeats across various
# values.
repeat_count <- tonnages %>% select(`Total Tons`) %>%
  filter(`Total Tons` > 0) %>%
  group_by(`Total Tons`) %>%
  mutate(count = n()) %>%
  distinct() %>%
  arrange(-count)
repeat_count
hist(repeat_count$count)
# Now we can pull out only the reference permits and see if that roughly aligns with what we are expecting
filter_reference <- tonnages %>%
  select(permit_num_clean, `Total Tons`, `Submitted Permit Type`) %>%
  filter(`Submitted Permit Type` == "Reference")
# Then, just for safety, we collapse repeated permit numbers (keeping the
# larger tonnage via FUN = max) and drop permits with 0 tonnage, so we know we
# are keeping unique ids.
filter_reference$permit_num_clean <- as.numeric(filter_reference$permit_num_clean)
aggregate_filtered <- aggregate(filter_reference$`Total Tons`, by=list(Category=filter_reference$permit_num_clean), FUN=max)
aggregate_renamed <- aggregate_filtered %>%
  filter(x != 0) %>%
  rename(permit_num_clean = Category,
         `Total Tons` = x)
# Here we check the same numbers as before, this time on the filtered set.
# There are now some, but very few, repeat tonnages. This is good, because we
# expect a small number of tonnages to coincide by chance.
repeat_count_filtered <- aggregate_renamed %>% select(`Total Tons`) %>%
  filter(`Total Tons` > 0) %>%
  group_by(`Total Tons`) %>%
  mutate(count = n()) %>%
  distinct() %>%
  arrange(-count)
hist(repeat_count_filtered$count)
# Finally we check the number of unique permit numbers; it matches the total
# length of the data set, showing that we no longer have any repeat permit
# numbers.
unique_check <- unique(aggregate_renamed$permit_num_clean)
length(unique_check)
nrow(aggregate_renamed)
# Let's also take a look at the distribution of the remaining tonnages.
hist(aggregate_renamed$`Total Tons`)
# Well, that wasn't super helpful. Let's zoom in on the range where the vast
# majority of values fall. Even with all of the zeroes filtered out, the
# majority of the data is between 0 and 50 tons. Also of note: only 50 of the
# 535 permits fall above the 500-ton cutoff used below.
large_values_removed <- aggregate_renamed %>%
  filter(`Total Tons` < 500)
nrow(aggregate_renamed) - nrow(large_values_removed)
hist(large_values_removed$`Total Tons`)
# Here is the original histogram on a log scale, just so that it is a bit more
# readable.
hist(log(aggregate_renamed$`Total Tons`))
```