Checks fail with duckplyr #5

Open
krlmlr opened this issue Nov 3, 2024 · 1 comment

krlmlr commented Nov 3, 2024

The duckplyr package aims to be a drop-in replacement for dplyr, with fully compatible behavior. To verify that, I'm running checks with a modified version of dplyr. This package fails its checks in this scenario.

Details: https://github.com/krlmlr/dplyr/blob/f-revdep-duckplyr/revdep/problems.md .

Learn more about duckplyr: https://duckplyr.tidyverse.org/ .

From the error message, I can't immediately tell what causes the failure. I'd appreciate your help: could you please help distill a reproducible example that shows how duckplyr behaves differently from dplyr in your use case?

The modified dplyr version can be installed with any of:

pak::pak("krlmlr/dplyr@f-revdep-duckplyr")
# remotes::install_github("krlmlr/dplyr@f-revdep-duckplyr")
# devtools::install_github("krlmlr/dplyr@f-revdep-duckplyr")

Thanks a lot for your help! Please let me know if you have any questions.

Tracker: tidyverse/duckplyr#297.

@mattheaphy
Owner

I'm not sure exactly what's happening behind the scenes, but I was able to distill the problem down to an interaction between duckplyr and a tidymodels workflow using xgboost when step_rename() or step_mutate() is included in the recipe.

The example below doesn't rely on offsetreg, so I plan to close this issue, assuming we agree that the problem lies elsewhere (in recipes or somewhere else in the tidymodels ecosystem).

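library(tidymodels)
library(duckplyr)
# Replace dplyr's data frame methods with duckplyr's implementations
methods_overwrite()

# Recipe with a single rename step; this step is enough to trigger the problem
rec <- recipe(mpg ~ wt, mtcars) |>
  step_rename(wgt = wt)

# One tree, one predictor, full training set: the fit should be deterministic
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(boost_tree(learn_rate = 1,
                       sample_size = 1,
                       mtry = 1,
                       min_n = 1,
                       tree_depth = 1,
                       trees = 1) |>
              set_engine("xgboost") |>
              set_mode("regression")) |>
  fit(data = mtcars)

# Sum of the predictions; this value changes between runs
wf |> predict(mtcars) |> sum()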

Observations

  • Running this example multiple times produces different results from run to run. This shouldn't be possible, because a single tree is fit on one predictor across the full training set. Sometimes an error is returned instead:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) : 
  [08:08:35] src/data/data.cc:461: Check failed: valid: Label contains NaN, infinity or a value too large.
  • If methods_overwrite() is removed, the last line returns the same result every time.
  • If the recipe is simplified to rec <- recipe(mpg ~ wt, mtcars) then everything works fine.
  • I experimented with adding other recipe steps like step_log() and didn't notice any issues.
  • If either step_rename() or step_mutate() is added to a recipe, results are volatile. Interestingly, this happens even when these steps are called with no arguments.
  • I haven't been able to recreate this issue with other models or engines besides xgboost.
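
The "same result every time" check in the observations above can be automated rather than eyeballed. The following is a minimal base-R sketch, not part of the original report: the helper names run_repeatedly() and is_reproducible() are hypothetical, and the two stand-in functions only imitate the stable and volatile cases so that the block runs without tidymodels, duckplyr, or xgboost installed.

```r
# Hypothetical helper: run a zero-argument function `f` `n` times and
# collect one numeric result per run. Errors (such as the xgboost
# "Label contains NaN" failure) are recorded as NA instead of aborting.
run_repeatedly <- function(f, n = 10) {
  vapply(
    seq_len(n),
    function(i) tryCatch(as.numeric(f()), error = function(e) NA_real_),
    numeric(1)
  )
}

# Hypothetical helper: TRUE only if no run failed (or returned NA/NaN)
# and every run returned exactly the same value.
is_reproducible <- function(results) {
  !anyNA(results) && length(unique(results)) == 1
}

# Stand-in for the well-behaved case: same value on every call.
stable <- function() sum(mtcars$mpg)

# Stand-in for the volatile case: the value drifts on every call.
counter <- 0
volatile <- function() {
  counter <<- counter + 1
  sum(mtcars$mpg) + counter
}

is_reproducible(run_repeatedly(stable))
is_reproducible(run_repeatedly(volatile))
```

To apply this to the example above, one option is to wrap everything from recipe() through sum() in a zero-argument function and pass it to run_repeatedly(), once in a session that calls methods_overwrite() and once in a fresh session that does not. Exact equality via unique() is deliberate here, since the fit is expected to be fully deterministic.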
