
[python] Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? #4965

iwanko opened this issue Jan 21, 2022 · 4 comments

@iwanko

iwanko commented Jan 21, 2022

Description

If the training dataset was constructed with free_raw_data = True, it is possible to use it only once. Trying to continue training (using the init_model parameter) leads to an error:

LightGBMError: Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.

How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after its first usage.)

Reproducible example

import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size))
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

model = lgb.train(train_params, lgb_train, num_boost_round=10)
model = lgb.train(train_params, lgb_train, num_boost_round=10, init_model = model)

Environment info

LightGBM version: 3.3.1
Python version: 3.9.7

@jameslamb
Collaborator

Thanks for using LightGBM!

Can you please share a minimal, reproducible example? For example, using one of the freely-available datasets from sklearn.datasets in scikit-learn.

@TremaMiguel's example in #4951 (comment) is a great model of the kind of small, self-contained example used to demonstrate an issue.

@jameslamb jameslamb changed the title Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? [python] Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? Jan 21, 2022
@jameslamb
Collaborator

Thanks very much for updating the description with a reproducible example! Excellent write-up, we really appreciate it.

I can confirm that on the most recent published version of lightgbm (3.3.2) and on the latest commit on master (f85dfa2), the provided code raises the following error:

Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.


How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after its first usage.)

Note that "constructed" has a special meaning in LightGBM. It doesn't mean "called lgb.Dataset()".

LightGBM does some preprocessing like binning continuous features into histograms, dropping unsplittable features, encoding categorical features, and more. That preprocessing is what this project refers to as "constructing" a Dataset.

When you initially call lgb.Dataset() in the Python package, the returned Python object holds information like the raw data and the parameters to use in that preprocessing. When the .construct() method is called on that object, LightGBM passes the raw data and parameters to C++ code like LGBM_DatasetCreateFromMat():

int LGBM_DatasetCreateFromMat(const void* data,

That code initializes a LightGBM Dataset object in memory and returns a pointer to it, which is stored in Dataset.handle on the Python side.

Once that Dataset object has been constructed, LightGBM no longer needs your raw input data (e.g. the numpy array passed into lgb.Dataset()). So, by default, it removes its copy of that data.

if self.free_raw_data:
    self.data = None
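
A minimal sketch of that behaviour (my own illustration, not from the original report), assuming Dataset.construct() returns the constructed object as it does in current versions:

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# default free_raw_data=True: the raw numpy copy is dropped once the Dataset is constructed
ds_default = lgb.Dataset(X, y).construct()
print(ds_default.data)     # None

# free_raw_data=False: the raw data stays attached and can be reused later
ds_kept = lgb.Dataset(X, y, free_raw_data=False).construct()
print(ds_kept.data.shape)  # (569, 30) for the breast_cancer data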

So, back to your example...when you first run lgb_train = lgb.Dataset(...), you've created a Dataset object on the Python side, but it hasn't been "constructed" yet. The first time you use that object for training, LightGBM will "construct" it.

train_set.construct()

So if you didn't call the .construct() method on the Dataset before training, then its first usage is also when it's constructed.

Example code showing this:
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size))
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

# confirm that Dataset handle is None
assert lgb_train.handle is None

model = lgb.train(train_params, lgb_train, num_boost_round=10)

# now the Dataset holds a pointer to a constructed Dataset on the C++ side
print(lgb_train.handle)
# c_void_p(140426868894112)

So what should be done about this?

For now, to re-use the same Dataset for training continuation, I think you'll have to set free_raw_data=False when first calling lgb.Dataset().

Looks like that is exactly what this project does in its tests for training continuation.

def test_continue_train_reused_dataset():
    X, y = make_synthetic_regression()
    params = {
        'objective': 'regression',
        'verbose': -1
    }
    lgb_train = lgb.Dataset(X, y, free_raw_data=False)
    init_gbm = lgb.train(params, lgb_train, num_boost_round=5)
    init_gbm_2 = lgb.train(params, lgb_train, num_boost_round=5, init_model=init_gbm)
    init_gbm_3 = lgb.train(params, lgb_train, num_boost_round=5, init_model=init_gbm_2)
    gbm = lgb.train(params, lgb_train, num_boost_round=5, init_model=init_gbm_3)
    assert gbm.current_iteration() == 20
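
Applying the same workaround to the reproducer from this issue (a sketch I haven't run against 3.3.1, but it mirrors the test above):

import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# keep the raw data so LightGBM can set a predictor for continued training
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size), free_raw_data=False)
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

model = lgb.train(train_params, lgb_train, num_boost_round=10)
# continued training on the same Dataset should now succeed instead of raising
model = lgb.train(train_params, lgb_train, num_boost_round=10, init_model=model)
print(model.current_iteration())  # expected: 20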

But I think in the future, LightGBM should support the pattern you've described above. I'm not exactly sure where to make changes, but it makes sense to me that you might want to perform continued training on the same Dataset like this.

Linking some relevant discussions: #2899, #2906

@jamespinkerton

jamespinkerton commented Feb 9, 2025

Are there any updates on this task? I've run into the same issue, and it's a problem for me too!

@jameslamb
Collaborator

Thanks for using LightGBM. If you're interested in working on this or in providing specific details that would help us understand why addressing this is valuable, that would be welcome!

But please avoid generic "any updates?" comments. Those just generate notifications for people subscribed to the issue and don't help to advance this project.
