Address some edge and not-so-edge cases for  `get_test_data`; tweak/clean code.

Robustify `get_test_data` to work against various things that could be thrown at it, and make sure it does what we want:
- [ ] Design decision: remove `x` parameter and use `recipe`'s data, or default `x` to be the recipe's data, or explain in docstring why not.
- [ ] Doc tweak: remove "The step will be added to the sequence of operations for this recipe" (or implement things to make it true).
- [ ] Unsorted `time_value`s: either `arrange` in this function, or change some other function to guarantee that it will be pre-`arrange`d, and note that we rely on this within a comment.
- [ ] Design decision: Nonconsecutive `time_value`s: `slice_tail` doesn't care about the spacing between `time_value`s; should it?
- [ ] Design decision: `attr(x, "metadata", exact=TRUE)[["as_of"]]` is ignored; should it be?
- [ ] Robustness: raise a user-friendly error message if we don't find any lag steps
- [ ] Design decision / code tweak: use `inherits(step, "step_epi_lag")` rather than `"lag" %in% names(step)`
- [ ] Code tweak: "insufficient training data" -> "insufficient `time_value`s available to form the test instance"
- [ ] Robustness: make sure this doesn't break or do the wrong thing if the user has already `prep`'d.  The mix between term and var info probably only works now because we haven't `prep`d and the term info is initialized to be equal to the var info (why does `recipes` provide this initial value??).  If we only want filtering on raw term info, can this instead just be all based on the var info?  If we have to use the term data, we need to check whether a lot of transformation steps could cause all the raw terms to disappear: then we have to double-check that our logic is right + worry about `dplyr::if_any` incorrect behavior given zero columns (it gives TRUE rather than FALSE on zero cols) 
- [ ] Design decision: if the NA raw term / var filter does do anything, is it what we want?  Filtering based on `if_any(, ~ !is.na(.x))` only ensures that we have at least one non-NA variable in the kept rows, not, for example, that we have all non-NA predictors at the kept rows.  Additionally, if the original time values are consecutive or evenly spaced, and this filter makes them nonconsecutive, is this okay?  If it filters out the latest `time_value`s which the user might expect to be the basis of the test instance, is this okay?  (If the answer to the last question is no, then we probably need to take a parameter like `test_ref_time_value` or maybe infer something from the `as_of` metadata.)
- [ ] Code tweak (maybe): try using tidyverse helpers (see `?purrr::map_int`, `?purrr::reduce`, `vctrs::vec_unchop`, etc.) rather than a `for` and `append` to calculate the `max_lags` / max-max-lag?  This may have some robustness benefits here in case the step object is malformed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Address some edge and not-so-edge cases for `get_test_data`; tweak/clean code. #52

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Address some edge and not-so-edge cases for get_test_data; tweak/clean code. #52

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Address some edge and not-so-edge cases for `get_test_data`; tweak/clean code. #52