[python] Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? #4965
Comments
Thanks for using LightGBM! Can you please share a minimal, reproducible example? For instance, @TremaMiguel's example in #4951 (comment), which uses one of the freely-available datasets, is a great model of a small, self-contained example used to demonstrate an issue.
Thanks very much for updating the description with a reproducible example! Excellent write-up, we really appreciate it. I can confirm that on the most recent published version of
Note that "constructed" has a special meaning in LightGBM. It doesn't mean "called
LightGBM does some preprocessing like binning continuous features into histograms, dropping unsplittable features, encoding categorical features, and more. That preprocessing is what this project refers to as "constructing" a Dataset.
When you initially call
Line 1071 in f85dfa2
That code initializes a LightGBM
Once that
LightGBM/python-package/lightgbm/basic.py Lines 1805 to 1806 in f85dfa2
So, back to your example...when you first run
LightGBM/python-package/lightgbm/basic.py Line 2577 in f85dfa2
So if you didn't call

Example code showing this:
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size))
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}
# confirm that Dataset handle is None
assert lgb_train.handle is None
model = lgb.train(train_params, lgb_train, num_boost_round=10)
# now the Dataset holds a pointer to a constructed Dataset on the C++ side
print(lgb_train.handle)
# c_void_p(140426868894112)

So what should be done about this?
For now, to re-use the same Dataset for training continuation, I think you'll have to set
Looks like that is exactly what this project does in its tests for training continuation.
LightGBM/tests/python_package_test/test_engine.py Lines 900 to 911 in ce486e5
But I think in the future, LightGBM should support the pattern you've described above. I'm not exactly sure where to make changes, but it makes sense to me that you might want to perform continued training on the same Dataset like this.
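To make the explanation above concrete, here is a small pure-Python toy model of the behaviour. It is only a sketch under stated assumptions, not LightGBM's actual code: `ToyDataset`, `toy_train`, and `set_predictor` are hypothetical names invented for illustration, and `RuntimeError` stands in for `LightGBMError`. It assumes that the predictor from `init_model` is attached before the Dataset is constructed, and that the raw data is dropped at construction time when `free_raw_data=True`.

```python
class ToyDataset:
    """Toy stand-in for lgb.Dataset. NOT LightGBM's actual implementation."""

    def __init__(self, data, free_raw_data=True):
        self.data = data
        self.free_raw_data = free_raw_data
        self.handle = None     # nothing constructed yet
        self.predictor = None  # set when training continues from a model

    def construct(self):
        # Construction is lazy: it happens on first use, not in __init__.
        if self.handle is None:
            self.handle = object()  # stands in for the C++ pointer
            if self.free_raw_data:
                self.data = None    # raw data is dropped after construction
        return self

    def set_predictor(self, predictor):
        if predictor is self.predictor:
            return self
        # Once constructed with the raw data freed, a new predictor
        # can no longer be applied to the data.
        if self.handle is not None and self.data is None:
            raise RuntimeError(
                "Cannot set predictor after freed raw data, "
                "set free_raw_data=False when construct Dataset to avoid this."
            )
        self.predictor = predictor
        return self


def toy_train(dataset, init_model=None):
    # Hypothetical helper mimicking the assumed order of operations.
    if init_model is not None:
        dataset.set_predictor(init_model)
    dataset.construct()
    return object()  # stands in for a trained Booster


# First run constructs the Dataset and frees the raw data.
ds = ToyDataset([[1.0], [2.0]])
first = toy_train(ds)
try:
    toy_train(ds, init_model=first)
    reuse_failed = False
except RuntimeError:
    reuse_failed = True

# Keeping the raw data allows continued training on the same object.
ds_keep = ToyDataset([[1.0], [2.0]], free_raw_data=False)
first_keep = toy_train(ds_keep)
second_keep = toy_train(ds_keep, init_model=first_keep)
```

In this sketch the first run succeeds because the Dataset is not constructed yet when training starts. The second run fails because the Dataset has already been constructed and its raw data is gone, which mirrors the error message reported in this issue.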
Are there any updates on this task? I've run into the same issue, and it's a problem for me too!
Thanks for using LightGBM. If you're interested in working on this or in providing specific details that would help us understand why addressing this is valuable, that would be welcome! But please avoid generic "any updates?" comments. Those just generate notifications for people subscribed to the issue and don't help to advance this project.
Description
If the training dataset was constructed with free_raw_data = True, it can be used only once. Trying to continue training (using the init_model parameter) leads to an error:
LightGBMError: Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.
How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset (not after its first use)?
Reproducible example
Environment info
LightGBM version: 3.3.1
Python version: 3.9.7