Open
Description
Robustify get_test_data
to work against various things that could be thrown at it, and make sure it does what we want:
- Design decision: remove
x
parameter and userecipe
's data, or defaultx
to be the recipe's data, or explain in docstring why not. - Doc tweak: remove "The step will be added to the sequence of operations for this recipe" (or implement things to make it true).
- Unsorted
time_value
s: eitherarrange
in this function, or change some other function to guarantee that it will be pre-arrange
d, and note that we rely on this within a comment. - Design decision: Nonconsecutive
time_value
s:slice_tail
doesn't care about the spacing betweentime_value
s; should it? - Design decision:
attr(x, "metadata", exact=TRUE)[["as_of"]]
is ignored; should it be? - Robustness: raise a user-friendly error message if we don't find any lag steps
- Design decision / code tweak: use
inherits(step, "step_epi_lag")
rather than"lag" %in% names(step)
- Code tweak: "insufficient training data" -> "insufficient
time_value
s available to form the test instance" - Robustness: make sure this doesn't break or do the wrong thing if the user has already
prep
'd. The mix between term and var info probably only works now because we haven'tprep
d and the term info is initialized to be equal to the var info (why doesrecipes
provide this initial value??). If we only want filtering on raw term info, can this instead just be all based on the var info? If we have to use the term data, we need to check whether a lot of transformation steps could cause all the raw terms to disappear: then we have to double-check that our logic is right + worry aboutdplyr::if_any
incorrect behavior given zero columns (it gives TRUE rather than FALSE on zero cols) - Design decision: if the NA raw term / var filter does do anything, is it what we want? Filtering based on
if_any(, ~ !is.na(.x))
only ensures that we have at least one non-NA variable in the kept rows, not, for example, that we have all non-NA predictors at the kept rows. Additionally, if the original time values are consecutive or evenly spaced, and this filter makes them nonconsecutive, is this okay? If it filters out the latesttime_value
s which the user might expect to be the basis of the test instance, is this okay? (If the answer to the last question is no, then we probably need to take a parameter liketest_ref_time_value
or maybe infer something from theas_of
metadata.) - Code tweak (maybe): try using tidyverse helpers (see
?purrr::map_int
,?purrr::reduce
,vctrs::vec_unchop
, etc.) rather than afor
andappend
to calculate themax_lags
/ max-max-lag? This may have some robustness benefits here in case the step object is malformed.