-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path40-modeling.Rmd
354 lines (296 loc) · 16.7 KB
/
40-modeling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
---
title: "40-modeling"
output: html_notebook
---
```{r Imports}
source(knitr::purl("41-modeling-forecast.Rmd", output="tmp-40.R", quiet=TRUE))
source(knitr::purl("42-synthetic-permits.Rmd", output="tmp-40.R", quiet=TRUE))
source(knitr::purl("43-tonnage-predictions.Rmd", output="tmp-40.R", quiet=TRUE))
fs::file_delete("tmp-40.R")
```
This notebook combines the pieces of the modeling pipeline found in `41`, `42`, and `43`. The only functions with which the user should interact are `forecast_tonnage` and (possibly) `forecast_npermits`. The former will make a user-specified number of sets of synthetic permits and provide tonnage/debris predictions for each. The latter is mostly a wrapper around the functions in `42` that can be used to generate forecasts of the number of permits that will be filed in Nashville of various subtypes. This forecast is fed into `forecast_tonnage`. (See function docs for more details.)
# Main function
```{r forecast tonnage}
forecast_tonnage <- function(nsets = 1, city = "nashville", forecast_fname = NULL,
forecast_cols = NULL, model_fname = NULL, seed = 2021,
use_mean = FALSE, ind_num = FALSE){
#' Forecast tons of debris generated in Nashville
#'
#' Creates a forecast of the waste generated by construction and demolition projects in Nashville for some number of fiscal years into the future. This is a "driver" function which interacts with and merges together several independent pieces of the modeling pipeline.
#'
#' @param nsets (num, default=1) Number of sets of synthetic permits to generate.
#' @param city (chr, default="nashville") Name of city whose historical permits are used to generate synthetic permit features.
#' @param forecast_fname (chr, default=NULL) Name of file containing monthly forecasts of permits generated of various subtypes. True default value is specified in `load_forecast`.
#' @param forecast_cols (chr array, default=NULL) Names of columns used to make up subtypes in forecast_fname. While these could be inferred from values in forecast_fname, we instead allow the user to specify values and check that these columns were used to construct forecast_fname. A warning will print if no values are supplied, but the values will be set to the default of `c("comm_v_res", "project_type")`.
#' @param model_fname (chr, default=NULL) Name of an R-data file containing the tonnage-prediction model we wish to use. The saved object should be a named list output by `create_model`.
#' @param seed (num, default=2021) RNG seed affecting draws of synthetic permit features.
#' @param use_mean (bool, default=FALSE) If TRUE, use mean of monthly npermits forecast for each realization. If FALSE, draw a value of npermits from the Gaussian forecast.
#' @param ind_num (bool, default=FALSE) If TRUE, independently draw numeric features of permits. (This is currently broken and will only operates if FALSE. See `gen_syn_permits` for more details.)
#'
#' @return (tibble) `nsets` sets of future synthetic permits. Contains both permit features and predicted tonnages.
#' @export
#'
#' @examples forecast_tonnage(nsets = 100)
# Set seed once here
set.seed(seed)
# Load model and extract features
model <- load_tonnage_model(model_fname)
all_cols <- get_features(model, "all")
# We won't sample dates. We handle them separately, so extract them here
date_mask <- str_detect(all_cols, "date")
date_cols <- all_cols[date_mask]
all_cols <- all_cols[!date_mask]
# Load source dataframe for permit construction
df_source <- load_source_df(city) %>%
select(all_of(all_cols)) %>%
drop_na()
if(!all(all_cols %in% colnames(df_source))){
stop(glue("Not all columns in {all_cols} were found in source dataframe."))
}
# Load forecasts of npermits
df_forecast <- load_forecast(fname = forecast_fname) %>%
filter(data=="prediction") %>%
select(!data)
# Assume default forecast_cols if user doesn't specify
if(is.null(forecast_cols)){
forecast_cols <- c("comm_v_res", "project_type")
print(glue("Warning - `forecast_cols` not provided. We will assume default of {forecast_cols}"))
print("Explicitly pass `forecast_cols` to silence this warning.")
}
# Check that forecast columns are in forecast
for(cnames in list(colnames(df_forecast), colnames(df_source), all_cols)){
if(!all(forecast_cols %in% cnames)){
stop(glue("ERROR - some of `forecast_cols` {forecast_cols} not found in either df_forecast, df_source, or model"))
}
}
# Columns that we'll sample in synthetic permits
extra_cols <- all_cols[!(all_cols %in% forecast_cols)]
# Generate synthetic permits
permits <- df_forecast %>%
forecast_to_permits(df_source, nsets, forecast_cols, use_mean = use_mean,
extra_cols = extra_cols, ind_num = ind_num) %>%
add_dates(date_cols)
# Get predictions
predictions <- permits %>%
predict_tonnage(model) %>%
add_fy_info()
return(predictions)
}
```
## Part 1 - Forecast npermits
```{r load forecast from file}
load_forecast <- function(fname = NULL){
#' Load forecast of number of permits
#'
#' Load a forecast (output of `forecast_npermits_sub` or `forecast_npermits`) of the number of permits of various subtypes from feather file. If no filename is supplied, a default is assumed.
#'
#' @param fname (chr, default = NULL) Path to file containing forecast. If none is passed, the file used is "forecasts/best_forecast.feather".
#'
#' @return(tibble) Tibble containing forecasted number of permits for some specific timeframe and subtypes.
if(is.null(fname)){
fname <- expand_boxpath("forecasts/best_forecast.feather")
print(glue("Forecast file not specified. Reading from {fname}"))
}
forecast <- read_feather(fname)
return(forecast)
}
```
Note - This is the function which we currently use to make forecasts that enter into the pipeline. At the moment, it forces monthly forecasts, despite the functions called within having the ability to make yearly forecasts. This is intentional as there is some desire to get quarterly projections, which would be impossibly with yearly forecasts. This is possibly bad design to place this here instead of in `41`.
```{r forecast npermits}
forecast_npermits <- function(forecast_cols, nyr = 6, city = "nashville"){
#' Forecast number of permits
#'
#' Forecast the number of permits for each desired subtype on a monthly basis.
#'
#' @param forecast_cols (char array) Columns to construct subtypes we wish to forecast
#' @param nyr (num, default=6) Number of years of permits to generate. Permits are generated on a monthly basis, starting with the first month after what
#' @param city (chr, default="nashville") Name of city from which forecasts are built. (only "nashville" works for now)
#'
#' @return (tibble) Tibble containing npermits forecasts for specified subtypes.
#' @export
#'
#' @examples forecast_npermits(forecast_cols = c("comm_v_res", "project_type"))
load_forecast_df(city = city) %>%
forecast_npermits_sub(forecast_cols = forecast_cols, forecast_yrs = nyr, freq = 12) %>%
rename(forecast = count) %>%
mutate(fy = if_else(month>"Jun", year+1, year)) %>%
add_fy_info()
}
```
## Part 2 - Create sets of synthetic permits
This function interacts with `42` to generate synthetic permits. It is set up to generate several sets of permits, which will be identified via the column `sample`.
```{r from forecasts to synthetic permits}
forecast_to_permits <- function(df_forecast, df_source, nsets, forecast_cols, use_mean = FALSE, extra_cols = NULL, ind_num = FALSE){
#' Convert forecast to permits
#'
#' Generate sets of synthetic permits from a tibble of Gaussian forecasts. This is not intended as a standalone function but rather as a piece of `forecast_tonnage`.
#'
#' @param df_forecast (tibble) Gaussian forecasts output from `forecast_npermits`.
#' @param df_source (tibble) Historic permits from which features will be sampled in `gen_syn_permits`.
#' @param nsets (num) Number of sets of permits to generate. Each set will have randomness associated with it due to the bootstrap generation of features as well as the Gaussian sampling of `num_permits_sub` (the latter if `use_mean` is set to FALSE).
#' @param forecast_cols (chr array) Names of categorical columns that make up desired subtypes for which forecasts are built.
#' @param use_mean (bool, default = FALSE) If FALSE, sample from Gaussian forecast. If TRUE, use mean prediction from forecast.
#' @param extra_cols (chr array, default = NULL) Names of extra features (beyond `forecast_cols`) to sample from `df_source`. If NULL, all features will be sampled.
#' @param ind_num (bool, default = FALSE) If TRUE, sample each numeric column independently when generating synthetic permits. If FALSE, bootstrap resampling is performed on entire rows of the dataframe.
#'
#' @return (tibble) Tibble containing synthetic permits with columns `year`, `sample`, and everything in `forecast_cols` and `extra_cols`.
# Sample npermits from monthly Gaussian forecasts nsets times. Aggregate draws by sample, fiscal year, subtype
npermit_draws <- df_forecast %>%
draw_npermits(nsets, use_mean = use_mean) %>%
select(!c(forecast, sd))
# Get total number of permits of each subtype across all years, samples
num_permits_sub <- npermit_draws %>%
group_by(across(all_of(forecast_cols))) %>%
summarise(count = sum(draw)) %>%
ungroup()
# Generate features of synthetic permits; fy, sample is lost here so arrange and add unique id
syn_permits <- num_permits_sub %>%
gen_syn_permits(df_source = df_source, extra_cols, ind_num = ind_num) %>%
arrange(across(all_of(forecast_cols))) %>% # arrange by forecast_cols, so this df matches one below
mutate(permit_number = as.character(row_number())) # add id for join (overwrites sampled permit number)
# Join dataframe containing year, sample to synthetic permits with features
npermit_draws %>%
expandRows(count = "draw", count.is.col = TRUE) %>%
arrange(across(all_of(forecast_cols))) %>% # arrange by forecast_cols to match syn_permits
select(sample, month, year, fy) %>% # cut out forecast_cols to prep for join
mutate(permit_number = as.character(row_number())) %>% # add id for join (overwrites sampled permit number)
left_join(syn_permits, by=c("permit_number" = "permit_number"), keep=FALSE)
}
```
```{r add dates from month and year}
add_dates <- function(df, date_cols){
#' Add dates to synthetic permits
#'
#' Add several date columns to synthetic permits. Dates should come from the forecast, but also might be sampled when building the synthetic permits. This will overwrite any incorrect sampled dates with the forecast dates.
#'
#' @param df (tibble) Synthetic permits with one or more "date" columns (e.g. "issued_date")
#' @param date_cols (chr array) All columns in df containing dates
#'
#' @return (tibble) Synthetic permits with correct dates (to nearest month).
#' @export
#'
#' @examples add_dates(syn_permits, date_cols = c("issued_date"))
if(length(date_cols)!=0){
for(d in date_cols){
df[,d] = as.Date(paste('01', df$month, df$year), format='%d %b %Y')
}
}
return(df)
}
```
```{r add fy info}
add_fy_info <- function(df){
#' Add fiscal-year info
#'
#' Add quarter, fiscal_year, and fiscal_quarter (the latter two as strings ready for plotting purposes) based on the values of month and fy. This can be applied to both the forecast and the synthetic permits.
#'
#' @param df (tibble) Synthetic permits or forecast tibble (needs "month" and "fy")
#'
#' @return (tibble) Original tibble with quarter, fiscal_year, and fiscal_quarter added in.
df %>% mutate(quarter = as.integer(month)) %>%
mutate(quarter = case_when(quarter < 4L ~ 3L,
quarter < 7L ~ 4L,
quarter < 10L ~ 1L,
quarter < 13L ~ 2L,
TRUE ~ NA_integer_)) %>%
mutate(fiscal_year = factor(str_c("FY", as.character(fy %% 2000), sep=" "))) %>%
mutate(fiscal_quarter = factor(str_c(as.character(fiscal_year), " - Q", as.character(quarter), sep="")))
}
```
## Part 3 - Predict tonnage from permits
```{r load tonnage model}
load_tonnage_model <- function(fname = NULL){
#' Load tonnage model
#'
#' Load a trained model that predicts debris generated (tonnage) at the permit level. Models are the output of `create_model` in `43`.
#'
#' @param fname (chr, default = NULL) Path to file containing saved model. If NULL, default model is `expand_boxpath("models/best_model.rds")`.
#'
#' @return (named list) Trained model created in `create_model` in `43`
if(is.null(fname)){
fname <- expand_boxpath("models/best_model.rds")
print(glue("No model file supplied. Using {fname}"))
}
model <- readRDS(fname)
return(model)
}
```
Note -- the column named `total_debris` below should be extracted from the `model` instead of being hard-coded. This is lazy programming on my part. The user can definitely extract this from the model with some awareness of where it's buried within `model$model`.
```{r predict tonnage of waste generated}
predict_tonnage <- function(permits, model){
#' Predict tonnage for synthetic permits
#'
#' Get tonnage predictions for synthetic permits using trained model.
#'
#' @param permits (tibble) synthetic permits
#' @param model (named list) Trained model created in `create_model` in `43`
#'
#' @return (tibble) Synthetic permits with added column of `total_debris`
predictions <- get_prediction(model, permits) %>%
rename(total_debris = prediction)
return(predictions)
}
```
## Extra functions
### Draw npermits
```{r draw values of npermits}
draw_npermits <- function(df_forecast, nsets, use_mean = FALSE){
#' Draw values of npermits from Gaussian forecasts
#'
#' Using the tibble of Gaussian forecasts, sample exact values of `npermits` from Gaussian distribution. Because `npermits` must be an integer, sampled values are rounded. The value of `nsets` specifies how many draws of `npermits` (i.e. samples) are performed.
#'
#' @param df_forecast (tibble) Gaussian forecasts
#' @param nsets (num) Number of draws from each Gaussian distribution (i.e., number of samples to make)
#' @param use_mean (bool, default = FALSE) If TRUE, use mean of each distribution instead of drawing a random value of `npermits`
#'
#' @return (tibble) Tibble containing draws of `npermits` from Gaussian forecast. `nsets` draws (samples) exist for each row in `df_forecast`
results <- df_forecast %>%
expandRows(count = nsets, count.is.col = FALSE) %>% # replicate rows for number of sets we desire
arrange(year, month) %>% # arrange so that sample id assignment makes sense
mutate(sample = row_number()%%nsets) %>% # assign each sample an id number `sample`
rowwise() %>%
mutate(draw = round(rnorm(1, mean=forecast, sd=sd), 0)) %>%
mutate(draw = if_else(draw<0, 0, draw)) %>% # draws can be negative; fix this
ungroup()
# It's somewhat inefficient to always do the above draws, but oh well
if(use_mean){
results <- results %>%
mutate(draw = round(forecast, 0))
}
return(results)
}
```
### Plotting
NOTE -- I think all plotting functions should exist in one notebook (maybe 45?), but for now let's place them here.
I'm not sure exactly what type of plots we want. For now, I'm just going to make a simple plot with Gaussian errors. Note that I never actually checked if the errors on tonnage are Gaussian. But for now I'm just plotting the mean prediction and one- and two- sigma.
```{r plot results}
plot_predictions <- function(predictions){
predictions %>%
group_by(sample,fy) %>%
summarise(ntot=sum(total_debris, na.rm=TRUE)) %>%
ungroup() %>%
group_by(fy) %>%
summarise(total_debris = mean(ntot), sd = stats::sd(ntot)) %>%
ungroup() %>%
ggplot(aes(fy, total_debris)) +
geom_ribbon(aes(x=fy, ymax=total_debris+2*sd, ymin=total_debris-2*sd), fill="blue", alpha=.25)+
geom_ribbon(aes(x=fy, ymax=total_debris+sd, ymin=total_debris-sd), fill="blue", alpha=.25)+
geom_line(aes(y = total_debris), colour = 'blue')
}
```
# Test
```{r test predictions, purl=FALSE}
predictions <- forecast_tonnage(forecast_cols = c("project_type", "comm_v_res"), nsets = 10)
predictions %>%
filter(fy>2021 & fy<2027) %>%
plot_predictions()
forecast_aggregate <- predictions %>%
group_by(sample,fy) %>%
summarise(ntot = sum(total_debris, na.rm = TRUE)) %>%
ungroup() %>%
group_by(fy) %>%
summarise(total_debris = mean(ntot), sd = stats::sd(ntot)) %>%
ungroup()
forecast_aggregate
predictions
```