Skip to content

Address some edge and not-so-edge cases for get_test_data; tweak/clean code. #52

Open
@brookslogan

Description

@brookslogan

Robustify get_test_data to work against various things that could be thrown at it, and make sure it does what we want:

  • Design decision: remove x parameter and use recipe's data, or default x to be the recipe's data, or explain in docstring why not.
  • Doc tweak: remove "The step will be added to the sequence of operations for this recipe" (or implement things to make it true).
  • Unsorted time_values: either arrange in this function, or change some other function to guarantee that it will be pre-arranged, and note that we rely on this within a comment.
  • Design decision: Nonconsecutive time_values: slice_tail doesn't care about the spacing between time_values; should it?
  • Design decision: attr(x, "metadata", exact=TRUE)[["as_of"]] is ignored; should it be?
  • Robustness: raise a user-friendly error message if we don't find any lag steps
  • Design decision / code tweak: use inherits(step, "step_epi_lag") rather than "lag" %in% names(step)
  • Code tweak: "insufficient training data" -> "insufficient time_values available to form the test instance"
  • Robustness: make sure this doesn't break or do the wrong thing if the user has already prep'd. The mix between term and var info probably only works now because we haven't prepd and the term info is initialized to be equal to the var info (why does recipes provide this initial value??). If we only want filtering on raw term info, can this instead just be all based on the var info? If we have to use the term data, we need to check whether a lot of transformation steps could cause all the raw terms to disappear: then we have to double-check that our logic is right + worry about dplyr::if_any incorrect behavior given zero columns (it gives TRUE rather than FALSE on zero cols)
  • Design decision: if the NA raw term / var filter does do anything, is it what we want? Filtering based on if_any(, ~ !is.na(.x)) only ensures that we have at least one non-NA variable in the kept rows, not, for example, that we have all non-NA predictors at the kept rows. Additionally, if the original time values are consecutive or evenly spaced, and this filter makes them nonconsecutive, is this okay? If it filters out the latest time_values which the user might expect to be the basis of the test instance, is this okay? (If the answer to the last question is no, then we probably need to take a parameter like test_ref_time_value or maybe infer something from the as_of metadata.)
  • Code tweak (maybe): try using tidyverse helpers (see ?purrr::map_int, ?purrr::reduce, vctrs::vec_unchop, etc.) rather than a for and append to calculate the max_lags / max-max-lag? This may have some robustness benefits here in case the step object is malformed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions