
[python] Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? #4965

iwanko opened this issue Jan 21, 2022 · 4 comments

@iwanko

iwanko commented Jan 21, 2022

Description

If the training dataset was constructed with free_raw_data = True, it is possible to use it only once. Trying to continue training (using the init_model parameter) leads to an error:

LightGBMError: Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.

How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after its first usage.)

Reproducible example

import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size))
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

model = lgb.train(train_params, lgb_train, num_boost_round=10)
model = lgb.train(train_params, lgb_train, num_boost_round=10, init_model = model)

Environment info

LightGBM version: 3.3.1
Python version: 3.9.7

@jameslamb
Collaborator

Thanks for using LightGBM!

Can you please share a minimal, reproducible example? For example, using one of the freely-available datasets from sklearn.datasets in scikit-learn.

@TremaMiguel's example in #4951 (comment) is a great model of the kind of small, self-contained example used to demonstrate an issue.

@jameslamb jameslamb changed the title Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? [python] Reusing dataset constructed with free_raw_data = True isn't possible. Intended behaviour? Jan 21, 2022
@jameslamb
Collaborator

Thanks very much for updating the description with a reproducible example! Excellent write-up, we really appreciate it.

I can confirm that on the most recent published version of lightgbm (3.3.2) and on the latest commit on master (f85dfa2), the provided code raises the following error:

Cannot set predictor after freed raw data, set free_raw_data=False when construct Dataset to avoid this.


How is continued training different from the first run, if the description states that the original dataset is freed after constructing the dataset? (Not after its first usage.)

Note that "constructed" has a special meaning in LightGBM. It doesn't mean "called lgb.Dataset()".

LightGBM does some preprocessing like binning continuous features into histograms, dropping unsplittable features, encoding categorical features, and more. That preprocessing is what this project refers to as "constructing" a Dataset.

When you initially call lgb.Dataset() in the Python package, the returned Python object holds information like the raw data and the parameters to use in that preprocessing. When the .construct() method is called on that object, LightGBM passes the raw data and parameters to C++ code like LGBM_DatasetCreateFromMat():

int LGBM_DatasetCreateFromMat(const void* data,

That code initializes a LightGBM Dataset object in memory and returns a pointer to it, which is stored in Dataset.handle on the Python side.

Once that Dataset object has been constructed, LightGBM no longer needs your raw input data (e.g. the numpy array passed into lgb.Dataset()). So, by default, it removes its copy of that data.

if self.free_raw_data:
    self.data = None
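
A minimal sketch of that behaviour (my own illustration, not from the original report), assuming Dataset.construct() returns the constructed object as it does in current versions:

import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# default free_raw_data=True: the raw numpy copy is dropped once the Dataset is constructed
ds_default = lgb.Dataset(X, y).construct()
print(ds_default.data)     # None

# free_raw_data=False: the raw data stays attached and can be reused later
ds_kept = lgb.Dataset(X, y, free_raw_data=False).construct()
print(ds_kept.data.shape)  # (569, 30) for the breast_cancer data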

So, back to your example...when you first run lgb_train = lgb.Dataset(...), you've created a Dataset object on the Python side, but it hasn't been "constructed" yet. The first time you use that object for training, LightGBM will "construct" it.

train_set.construct()

So if you didn't call the .construct() method on the Dataset before training, then its first usage is also when it's constructed.

Example code showing this:
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size))
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

# confirm that Dataset handle is None
assert lgb_train.handle is None

model = lgb.train(train_params, lgb_train, num_boost_round=10)

# now the Dataset holds a pointer to a constructed Dataset on the C++ side
print(lgb_train.handle)
# c_void_p(140426868894112)

So what should be done about this?

For now, to re-use the same Dataset for training continuation, I think you'll have to set free_raw_data=False when first calling lgb.Dataset().

Looks like that is exactly what this project does in its tests for training continuation.

def test_continue_train_reused_dataset():
    X, y = make_synthetic_regression()
    params = {
        'objective': 'regression',
        'verbose': -1
    }
    lgb_train = lgb.Dataset(X, y, free_raw_data=False)
    init_gbm = lgb.train(params, lgb_train, num_boost_round=5)
    init_gbm_2 = lgb.train(params, lgb_train, num_boost_round=5, init_model=init_gbm)
    init_gbm_3 = lgb.train(params, lgb_train, num_boost_round=5, init_model=init_gbm_2)
    gbm = lgb.train(params, lgb_train, num_boost_round=5, init_model=init_gbm_3)
    assert gbm.current_iteration() == 20
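
Applying the same workaround to the reproducer from this issue (a sketch I haven't run against 3.3.1, but it mirrors the test above):

import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# keep the raw data so LightGBM can set a predictor for continued training
lgb_train = lgb.Dataset(X, y, init_score=np.zeros(y.size), free_raw_data=False)
train_params = {
    'objective': 'binary',
    'verbose': -1,
    'seed': 42
}

model = lgb.train(train_params, lgb_train, num_boost_round=10)
# continued training on the same Dataset should now succeed instead of raising
model = lgb.train(train_params, lgb_train, num_boost_round=10, init_model=model)
print(model.current_iteration())  # expected: 20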

But I think in the future, LightGBM should support the pattern you've described above. I'm not exactly sure where to make changes, but it makes sense to me that you might want to perform continued training on the same Dataset like this.

Linking some relevant discussions: #2899, #2906

@jamespinkerton

jamespinkerton commented Feb 9, 2025

Are there any updates on this task? I've run into the same issue, and it's a problem for me too!

@jameslamb
Collaborator

Thanks for using LightGBM. If you're interested in working on this or in providing specific details that would help us understand why addressing this is valuable, that would be welcome!

But please avoid generic "any updates?" comments. Those just generate notifications for people subscribed to the issue and don't help to advance this project.
