
WIP: [R-package] Add support for specifying training indices in lgb.cv() #3989

Closed
wants to merge 10 commits

Conversation

@julioasotodv (Contributor) commented Feb 16, 2021

fixes #3924

@StrikerRUS changed the title from "Add support for specifying training indices in lgb.cv()" to "[R-package] Add support for specifying training indices in lgb.cv()" Feb 16, 2021
@ghost commented Feb 16, 2021

CLA assistant check
All CLA requirements met.

@julioasotodv (Contributor, Author)

No idea why Azure pipelines are unhappy. It looks like it was an issue with the instances while building the containers... Shall we force rebuild?

@StrikerRUS (Collaborator)

@julioasotodv

It looks like it was an issue with the instances while building the containers... Shall we force rebuild?

Thanks, done!

@jameslamb (Collaborator) left a comment


Thanks for taking the time to contribute! This is a really nice addition.

Could you please add a unit test in the lgb.cv() section, checking that this behavior works as expected? It would be great if you can set up a small time-series cross validation example as the test. https://github.com/microsoft/LightGBM/blob/master/R-package/tests/testthat/test_basic.R#L262

Review comments on R-package/R/lgb.cv.R (outdated, resolved)
@jameslamb (Collaborator)

I changed the PR's description to say "fixes" instead of "as seen in". That way when we merge this, the issue will be closed automatically.

@julioasotodv changed the title from "[R-package] Add support for specifying training indices in lgb.cv()" to "wip: [R-package] Add support for specifying training indices in lgb.cv()" Feb 16, 2021
@julioasotodv changed the title from "wip: [R-package] Add support for specifying training indices in lgb.cv()" to "WIP: [R-package] Add support for specifying training indices in lgb.cv()" Feb 16, 2021
@julioasotodv (Contributor, Author)

Thanks for taking the time to contribute! This is a really nice addition.

Could you please add a unit test in the lgb.cv() section, checking that this behavior works as expected? It would be great if you can set up a small time-series cross validation example as the test. https://github.com/microsoft/LightGBM/blob/master/R-package/tests/testthat/test_basic.R#L262

Sure! I will try to come up with a meaningful test.

@jameslamb (Collaborator)

@julioasotodv are you still interested in contributing this? I'm available to help if you have any questions.

@julioasotodv (Contributor, Author) commented Mar 12, 2021

@jameslamb Indeed! Sorry, I had a couple of busy weeks...

Yes, I was thinking about how to write a test for this, and it doesn't seem straightforward... Since lgb.cv() basically returns the results list, the only approach I could think of is to seed everything (the model through params, and nfold) and check that the result changes with and without train_folds.

But perhaps there is a better way I did not think of...
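For reference, the seeded comparison described above could look roughly like this (a hypothetical sketch: `train_folds` is the argument proposed in this PR, not an existing `lgb.cv()` parameter, and the `record_evals` access pattern is an assumption about the returned booster):

```r
# Hypothetical sketch of the seeded comparison described above.
# `train_folds` is the argument proposed in this PR; everything else is
# seeded/fixed so only the custom training indices can change the result.
library(lightgbm)
library(testthat)

set.seed(42L)
X <- matrix(rnorm(1000L), ncol = 10L)
y <- rnorm(100L)
dtrain <- lgb.Dataset(data = X, label = y)

params <- list(objective = "regression", metric = "l2", seed = 42L)
folds <- list(1:33, 34:66, 67:100)  # fixed validation folds

cv_default <- lgb.cv(params, dtrain, nrounds = 5L, folds = folds)
cv_custom <- lgb.cv(
    params, dtrain, nrounds = 5L, folds = folds
    # proposed argument: train on only part of each fold's complement
  , train_folds = list(34:66, 67:100, 1:33)
)

# with everything else fixed, restricting the training rows
# should change the evaluation results
expect_false(identical(
    cv_default$record_evals$valid$l2$eval
  , cv_custom$record_evals$valid$l2$eval
))
```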

@jameslamb (Collaborator)

No problem, and no rush at all!

Let me know if it's too simplistic, but what about this?

  1. Construct a dataset for a regression task where the relationship between the features and target should be easy for LightGBM to figure out. Like maybe where you have a few features that are all some_constant * y + some_noise()
  2. Construct a second, much larger dataset with the same number of features that is all random noise
  3. Set folds to only slices of the "real" data (so you aren't evaluating on any of the noise data)
  4. Run lgb.cv() twice, once without train_folds and once with train_folds set to only contain indices from the "real data". Make num_iterations and num_leaves fairly small (like num_iterations = 5L, num_leaves = 5L).
  5. Compare model performance between the two cases, evaluated only against the "real" data. The model trained on only the "real data" (where you specified train_folds) should have much better performance than the one trained on mostly noise.
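Following the steps above, a sketch of that test might look something like this (hypothetical: `train_folds` is the argument proposed in this PR, and the `record_evals` access pattern is an assumption about the returned `lgb.CVBooster`):

```r
# Hypothetical sketch of the suggested unit test, in the style of
# R-package/tests/testthat/test_basic.R. `train_folds` is the argument
# proposed in this PR, not an existing lgb.cv() parameter.
library(lightgbm)
library(testthat)

set.seed(708L)
n_real <- 500L
n_noise <- 5000L

# 1. "real" data: each feature is some_constant * y + some_noise()
y_real <- runif(n_real)
X_real <- cbind(
    2.0 * y_real + rnorm(n_real, sd = 0.1)
  , -1.5 * y_real + rnorm(n_real, sd = 0.1)
)

# 2. a much larger block of pure noise with the same number of features
X_noise <- matrix(rnorm(n_noise * 2L), ncol = 2L)
y_noise <- runif(n_noise)

dtrain <- lgb.Dataset(
    data = rbind(X_real, X_noise)
  , label = c(y_real, y_noise)
)

# 3. validation folds drawn only from the "real" rows
folds <- split(seq_len(n_real), rep(1L:5L, length.out = n_real))

params <- list(
    objective = "regression"
  , metric = "l2"
  , num_leaves = 5L
)

# 4. run twice: training on everything vs. only on "real" rows
cv_all <- lgb.cv(params, dtrain, nrounds = 5L, folds = folds)
cv_real <- lgb.cv(
    params, dtrain, nrounds = 5L, folds = folds
  , train_folds = lapply(folds, function(f) setdiff(seq_len(n_real), f))
)

# 5. the model trained only on real data should score much better
best_all <- min(unlist(cv_all$record_evals$valid$l2$eval))
best_real <- min(unlist(cv_real$record_evals$valid$l2$eval))
expect_lt(best_real, best_all)
```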

@jameslamb (Collaborator)

Gently pinging @julioasotodv: is there anything I can do to help you with this?

@jameslamb removed the request for review from Laurae2 May 4, 2021 16:20
@jameslamb (Collaborator) commented Sep 1, 2021

For now, I'm going to close this PR due to lack of response. @julioasotodv, thank you again for your interest in LightGBM and a great issue write-up in #3924! If you have the time and interest to contribute this in the future, we'd welcome a new pull request.

I think providing better support for customization of the cross-validation process (like for cases mentioned in #3924 where you want to do time-based splits) is a valuable contribution to the R package, but not one that will be done by maintainers before the next release (#4310).

I'm going to lock discussion on this PR for now to focus all discussion of this feature back on #3924. Anyone who is interested in contributing this feature (or wants to ask maintainers to give it higher priority and implement it) is welcome to comment on #3924.

@jameslamb closed this Sep 1, 2021
@microsoft locked and limited conversation to collaborators Sep 1, 2021