40-modeling.Rmd

---
title: "40-modeling"
output: html_notebook
---

```{r Imports}
source(knitr::purl("41-modeling-forecast.Rmd", output="tmp-40.R", quiet=TRUE))
source(knitr::purl("42-synthetic-permits.Rmd", output="tmp-40.R", quiet=TRUE))
source(knitr::purl("43-tonnage-predictions.Rmd", output="tmp-40.R", quiet=TRUE))
fs::file_delete("tmp-40.R")
```

This notebook combines the pieces of the modeling pipeline found in `41`, `42`, and `43`. The only functions with which the user should interact are `forecast_tonnage` and (possibly) `forecast_npermits`. The former will make a user-specified number of sets of synthetic permits and provide tonnage/debris predictions for each. The latter is mostly a wrapper around the functions in `42` that can be used to generate forecasts of the number of permits that will be filed in Nashville of various subtypes. This forecast is fed into `forecast_tonnage`. (See function docs for more details.)

# Main function
```{r forecast tonnage}
forecast_tonnage <- function(nsets = 1, city = "nashville", forecast_fname = NULL, 
                             forecast_cols = NULL, model_fname = NULL, seed = 2021, 
                             use_mean = FALSE, ind_num = FALSE){
  #' Forecast tons of debris generated in Nashville
  #'
  #' Creates a forecast of the waste generated by construction and demolition projects in Nashville for some number of fiscal years into the future. This is a "driver" function which interacts with and merges together several independent pieces of the modeling pipeline.
  #'
  #' @param nsets (num, default=1) Number of sets of synthetic permits to generate.
  #' @param city (chr, default="nashville") Name of city whose historical permits are used to generate synthetic permit features.
  #' @param forecast_fname (chr, default=NULL) Name of file containing monthly forecasts of permits generated of various subtypes. True default value is specified in `load_forecast`.
  #' @param forecast_cols (chr array, default=NULL) Names of columns used to make up subtypes in forecast_fname. While these could be inferred from values in forecast_fname, we instead allow the user to specify values and check that these columns were used to construct forecast_fname. A warning will print if no values are supplied, but the values will be set to the default of `c("comm_v_res", "project_type")`.
  #' @param model_fname (chr, default=NULL) Name of an R-data file containing the tonnage-prediction model we wish to use. The saved object should be a named list output by `create_model`.
  #' @param seed (num, default=2021) RNG seed affecting draws of synthetic permit features.
  #' @param use_mean (bool, default=FALSE) If TRUE, use mean of monthly npermits forecast for each realization. If FALSE, draw a value of npermits from the Gaussian forecast.
  #' @param ind_num (bool, default=FALSE) If TRUE, independently draw numeric features of permits. (This is currently broken and will only operates if FALSE. See `gen_syn_permits` for more details.)
  #'
  #' @return (tibble) `nsets` sets of future synthetic permits. Contains both permit features and predicted tonnages.
  #' @export
  #'
  #' @examples forecast_tonnage(nsets = 100)
  
  # Set seed once here
  set.seed(seed)
  
  # Load model and extract features
  model <- load_tonnage_model(model_fname)
  all_cols <- get_features(model, "all")
  
  # We won't sample dates. We handle them separately, so extract them here
  date_mask <- str_detect(all_cols, "date")
  date_cols <- all_cols[date_mask]
  all_cols <- all_cols[!date_mask]
  
  # Load source dataframe for permit construction
  df_source <- load_source_df(city) %>% 
    select(all_of(all_cols)) %>% 
    drop_na()
  
  if(!all(all_cols %in% colnames(df_source))){
    stop(glue("Not all columns in {all_cols} were found in source dataframe."))
  }
  
  # Load forecasts of npermits
  df_forecast <- load_forecast(fname = forecast_fname) %>% 
    filter(data=="prediction") %>%
    select(!data)
  
  # Assume default forecast_cols if user doesn't specify
  if(is.null(forecast_cols)){
    forecast_cols <- c("comm_v_res", "project_type")
    print(glue("Warning - `forecast_cols` not provided. We will assume default of {forecast_cols}"))
    print("Explicitly pass `forecast_cols` to silence this warning.")
  }
  
  # Check that forecast columns are in forecast
  for(cnames in list(colnames(df_forecast), colnames(df_source), all_cols)){
    if(!all(forecast_cols %in% cnames)){
      stop(glue("ERROR - some of `forecast_cols` {forecast_cols} not found in either df_forecast, df_source, or model"))
    }
  }
  
  # Columns that we'll sample in synthetic permits
  extra_cols <- all_cols[!(all_cols %in% forecast_cols)]
  
  # Generate synthetic permits
  permits <- df_forecast %>% 
    forecast_to_permits(df_source, nsets, forecast_cols, use_mean = use_mean, 
                        extra_cols = extra_cols, ind_num = ind_num) %>% 
    add_dates(date_cols)
  
  # Get predictions
  predictions <- permits %>% 
    predict_tonnage(model) %>% 
    add_fy_info()
  
  return(predictions)
}
```

## Part 1 - Forecast npermits

```{r load forecast from file}
load_forecast <- function(fname = NULL){
  #' Load forecast of number of permits
  #'
  #' Load a forecast (output of `forecast_npermits_sub` or `forecast_npermits`) of the number of permits of various subtypes from feather file. If no filename is supplied, a default is assumed.
  #'
  #' @param fname (chr, default = NULL) Path to file containing forecast. If none is passed, the file used is "forecasts/best_forecast.feather".
  #'
  #' @return(tibble) Tibble containing forecasted number of permits for some specific timeframe and subtypes.

  if(is.null(fname)){
    fname <- expand_boxpath("forecasts/best_forecast.feather")
    print(glue("Forecast file not specified. Reading from {fname}"))
  }
  forecast <- read_feather(fname)
  return(forecast)
}
```

Note - This is the function which we currently use to make forecasts that enter into the pipeline. At the moment, it forces monthly forecasts, despite the functions called within having the ability to make yearly forecasts. This is intentional as there is some desire to get quarterly projections, which would be impossibly with yearly forecasts. This is possibly bad design to place this here instead of in `41`.

```{r forecast npermits}
forecast_npermits <- function(forecast_cols, nyr = 6, city = "nashville"){
  #' Forecast number of permits
  #'
  #' Forecast the number of permits for each desired subtype on a monthly basis.
  #'
  #' @param forecast_cols (char array) Columns to construct subtypes we wish to forecast
  #' @param nyr (num, default=6) Number of years of permits to generate. Permits are generated on a monthly basis, starting with the first month after what 
  #' @param city (chr, default="nashville") Name of city from which forecasts are built. (only "nashville" works for now)
  #'
  #' @return (tibble) Tibble containing npermits forecasts for specified subtypes.
  #' @export
  #'
  #' @examples forecast_npermits(forecast_cols = c("comm_v_res", "project_type"))
  
  load_forecast_df(city = city) %>% 
    forecast_npermits_sub(forecast_cols = forecast_cols, forecast_yrs = nyr, freq = 12) %>% 
    rename(forecast = count) %>% 
    mutate(fy = if_else(month>"Jun", year+1, year)) %>% 
    add_fy_info()
}
```

## Part 2 - Create sets of synthetic permits

This function interacts with `42` to generate synthetic permits. It is set up to generate several sets of permits, which will be identified via the column `sample`.

```{r from forecasts to synthetic permits}
forecast_to_permits <- function(df_forecast, df_source, nsets, forecast_cols, use_mean = FALSE, extra_cols = NULL, ind_num = FALSE){
  #' Convert forecast to permits
  #'
  #' Generate sets of synthetic permits from a tibble of Gaussian forecasts. This is not intended as a standalone function but rather as a piece of `forecast_tonnage`.
  #'
  #' @param df_forecast (tibble) Gaussian forecasts output from `forecast_npermits`.
  #' @param df_source (tibble) Historic permits from which features will be sampled in `gen_syn_permits`.
  #' @param nsets (num) Number of sets of permits to generate. Each set will have randomness associated with it due to the bootstrap generation of features as well as the Gaussian sampling of `num_permits_sub` (the latter if `use_mean` is set to FALSE).
  #' @param forecast_cols (chr array) Names of categorical columns that make up desired subtypes for which forecasts are built.
  #' @param use_mean (bool, default = FALSE) If FALSE, sample from Gaussian forecast. If TRUE, use mean prediction from forecast.
  #' @param extra_cols (chr array, default = NULL) Names of extra features (beyond `forecast_cols`) to sample from `df_source`. If NULL, all features will be sampled.
  #' @param ind_num (bool, default = FALSE) If TRUE, sample each numeric column independently when generating synthetic permits. If FALSE, bootstrap resampling is performed on entire rows of the dataframe.
  #'
  #' @return (tibble) Tibble containing synthetic permits with columns `year`, `sample`, and everything in `forecast_cols` and `extra_cols`.
  
  # Sample npermits from monthly Gaussian forecasts nsets times. Aggregate draws by sample, fiscal year, subtype
  npermit_draws <- df_forecast %>% 
    draw_npermits(nsets, use_mean = use_mean) %>% 
    select(!c(forecast, sd))
  
  # Get total number of permits of each subtype across all years, samples
  num_permits_sub <- npermit_draws %>% 
    group_by(across(all_of(forecast_cols))) %>% 
    summarise(count = sum(draw)) %>% 
    ungroup()
  
  # Generate features of synthetic permits; fy, sample is lost here so arrange and add unique id
  syn_permits <- num_permits_sub %>% 
    gen_syn_permits(df_source = df_source, extra_cols, ind_num = ind_num) %>% 
    arrange(across(all_of(forecast_cols))) %>%   # arrange by forecast_cols, so this df matches one below
    mutate(permit_number = as.character(row_number()))         # add id for join (overwrites sampled permit number)
  
  # Join dataframe containing year, sample to synthetic permits with features
  npermit_draws %>% 
    expandRows(count = "draw", count.is.col = TRUE) %>% 
    arrange(across(all_of(forecast_cols))) %>%   # arrange by forecast_cols to match syn_permits
    select(sample, month, year, fy) %>%          # cut out forecast_cols to prep for join
    mutate(permit_number = as.character(row_number())) %>%     # add id for join (overwrites sampled permit number)
    left_join(syn_permits, by=c("permit_number" = "permit_number"), keep=FALSE)
}
```


```{r add dates from month and year}
add_dates <- function(df, date_cols){
  #' Add dates to synthetic permits
  #'
  #' Add several date columns to synthetic permits. Dates should come from the forecast, but also might be sampled when building the synthetic permits. This will overwrite any incorrect sampled dates with the forecast dates.
  #'
  #' @param df (tibble) Synthetic permits with one or more "date" columns (e.g. "issued_date")
  #' @param date_cols (chr array) All columns in df containing dates
  #'
  #' @return (tibble) Synthetic permits with correct dates (to nearest month).
  #' @export
  #'
  #' @examples add_dates(syn_permits, date_cols = c("issued_date"))
  
  if(length(date_cols)!=0){
    for(d in date_cols){
      df[,d] = as.Date(paste('01', df$month, df$year), format='%d %b %Y')
    }
  }
  return(df)
}
```


```{r add fy info}
add_fy_info <- function(df){
  #' Add fiscal-year info
  #'
  #' Add quarter, fiscal_year, and fiscal_quarter (the latter two as strings ready for plotting purposes) based on the values of month and fy. This can be applied to both the forecast and the synthetic permits.
  #'
  #' @param df (tibble) Synthetic permits or forecast tibble (needs "month" and "fy")
  #'
  #' @return (tibble) Original tibble with quarter, fiscal_year, and fiscal_quarter added in.
  
  df %>% mutate(quarter = as.integer(month)) %>% 
    mutate(quarter = case_when(quarter < 4L ~ 3L,
                               quarter < 7L ~ 4L,
                               quarter < 10L ~ 1L,
                               quarter < 13L ~ 2L,
                               TRUE ~ NA_integer_)) %>% 
    mutate(fiscal_year = factor(str_c("FY", as.character(fy %% 2000), sep=" "))) %>% 
    mutate(fiscal_quarter = factor(str_c(as.character(fiscal_year), " - Q", as.character(quarter), sep="")))
}
```


## Part 3 - Predict tonnage from permits

```{r load tonnage model}
load_tonnage_model <- function(fname = NULL){
  #' Load tonnage model
  #'
  #' Load a trained model that predicts debris generated (tonnage) at the permit level. Models are the output of `create_model` in `43`.
  #'
  #' @param fname (chr, default = NULL) Path to file containing saved model. If NULL, default model is `expand_boxpath("models/best_model.rds")`.
  #'
  #' @return (named list) Trained model created in `create_model` in `43`
  
  if(is.null(fname)){
    fname <- expand_boxpath("models/best_model.rds")
    print(glue("No model file supplied. Using {fname}"))
  }
  model <- readRDS(fname)
  
  return(model)
}
```

Note -- the column named `total_debris` below should be extracted from the `model` instead of being hard-coded. This is lazy programming on my part. The user can definitely extract this from the model with some awareness of where it's buried within `model$model`.

```{r predict tonnage of waste generated}
predict_tonnage <- function(permits, model){
  #' Predict tonnage for synthetic permits
  #'
  #' Get tonnage predictions for synthetic permits using trained model.
  #'
  #' @param permits (tibble) synthetic permits
  #' @param model (named list) Trained model created in `create_model` in `43`
  #'
  #' @return (tibble) Synthetic permits with added column of `total_debris`

  predictions <- get_prediction(model, permits) %>% 
    rename(total_debris = prediction)

  return(predictions)
}
```


## Extra functions

### Draw npermits

```{r draw values of npermits}
draw_npermits <- function(df_forecast, nsets, use_mean = FALSE){
  #' Draw values of npermits from Gaussian forecasts
  #'
  #' Using the tibble of Gaussian forecasts, sample exact values of `npermits` from Gaussian distribution. Because `npermits` must be an integer, sampled values are rounded. The value of `nsets` specifies how many draws of `npermits` (i.e. samples) are performed.
  #'
  #' @param df_forecast (tibble) Gaussian forecasts
  #' @param nsets (num) Number of draws from each Gaussian distribution (i.e., number of samples to make)
  #' @param use_mean (bool, default = FALSE) If TRUE, use mean of each distribution instead of drawing a random value of `npermits`
  #'
  #' @return (tibble) Tibble containing draws of `npermits` from Gaussian forecast. `nsets` draws (samples) exist for each row in `df_forecast`
  
  results <- df_forecast %>% 
    expandRows(count = nsets, count.is.col = FALSE) %>%   # replicate rows for number of sets we desire
    arrange(year, month) %>%                              # arrange so that sample id assignment makes sense
    mutate(sample = row_number()%%nsets) %>%              # assign each sample an id number `sample`
    rowwise() %>%
    mutate(draw = round(rnorm(1, mean=forecast, sd=sd), 0)) %>% 
    mutate(draw = if_else(draw<0, 0, draw)) %>%           # draws can be negative; fix this
    ungroup()
  
  # It's somewhat inefficient to always do the above draws, but oh well
  if(use_mean){
    results <- results %>% 
      mutate(draw = round(forecast, 0))
  }
  
  return(results)
}
```

### Plotting
NOTE -- I think all plotting functions should exist in one notebook (maybe 45?), but for now let's place them here.

I'm not sure exactly what type of plots we want. For now, I'm just going to make a simple plot with Gaussian errors. Note that I never actually checked if the errors on tonnage are Gaussian. But for now I'm just plotting the mean prediction and one- and two- sigma.

```{r plot results}
plot_predictions <- function(predictions){
  predictions %>% 
    group_by(sample,fy) %>% 
    summarise(ntot=sum(total_debris, na.rm=TRUE)) %>% 
    ungroup() %>% 
    group_by(fy) %>% 
    summarise(total_debris = mean(ntot), sd = stats::sd(ntot)) %>% 
    ungroup() %>% 
    ggplot(aes(fy, total_debris)) +
    geom_ribbon(aes(x=fy, ymax=total_debris+2*sd, ymin=total_debris-2*sd), fill="blue", alpha=.25)+
    geom_ribbon(aes(x=fy, ymax=total_debris+sd, ymin=total_debris-sd), fill="blue", alpha=.25)+
    geom_line(aes(y = total_debris), colour = 'blue')
}
```


# Test
```{r test predictions, purl=FALSE}
predictions <- forecast_tonnage(forecast_cols = c("project_type", "comm_v_res"), nsets = 10)

predictions %>% 
  filter(fy>2021 & fy<2027) %>% 
  plot_predictions()

forecast_aggregate <- predictions %>%
    group_by(sample,fy) %>% 
        summarise(ntot = sum(total_debris, na.rm = TRUE)) %>% 
        ungroup() %>% 
        group_by(fy) %>% 
        summarise(total_debris = mean(ntot), sd = stats::sd(ntot)) %>% 
        ungroup()

forecast_aggregate
predictions
```