
How does your code perform on sparse data? #16

Open
Sandy4321 opened this issue Aug 6, 2023 · 2 comments

@Sandy4321

It seems that GBMs are bad on sparse data for classification.
How does your code perform on sparse data?
NLP one-hot data is very sparse, say 98% of the values are zeros.

@btbpanda (Contributor) commented Aug 7, 2023

Hi @Sandy4321

Thanks for your question. You are right, all GBMs by design are probably not the best choice for dealing with sparse data. Even though some SotA implementations such as LightGBM or XGBoost support the sparse format and implement specific features for this data type, their performance may still fall short of neural networks or even linear models. But it actually depends on the task: each problem is individual, and only an experiment will show you what works best. Unfortunately, py-boost has no built-in support for sparse arrays; to handle them, you need to convert to a dense array manually. We have a plan to support sparsity for both features and targets, but don't expect it to be released soon. Still, some optimizations can be made here. All of them save memory and prevent overfitting, and sometimes that is enough to fit into GPU memory if the dataset is not large (see the sketch after the list):

  1. limit max_bin to 8-16 or even 4
  2. limit colsample to 0.1-0.2
  3. limit max_depth to 3-4
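
A minimal sketch of how these settings might look, assuming py-boost's `GradientBoosting` accepts the `max_bin`, `colsample`, and `max_depth` keyword arguments named above (check the exact constructor signature and the `'bce'` loss name against the py-boost docs):

```python
import numpy as np
from scipy import sparse
from py_boost import GradientBoosting

# Toy sparse binary-classification data, ~98% zeros as in the question.
rng = np.random.default_rng(0)
X_sparse = sparse.random(10_000, 500, density=0.02, format="csr", random_state=0)
y = rng.integers(0, 2, size=10_000).astype(np.float32)

# py-boost has no built-in sparse support, so densify manually first.
X_dense = np.asarray(X_sparse.todense(), dtype=np.float32)

# Hypothetical configuration following the three suggestions above.
model = GradientBoosting(
    "bce",          # binary cross-entropy loss for classification
    max_bin=16,     # coarse histogram quantization saves GPU memory
    colsample=0.1,  # sample a small fraction of the mostly-zero columns per tree
    max_depth=3,    # shallow trees help against overfitting on sparse features
)
model.fit(X_dense, y)
```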

But in general, the common approach to training a GBM over a sparse representation is dimensionality reduction (via SVD, for example) before training, or using a representation other than BoW/tf-idf. I would typically expect better performance from both approaches regardless of the GBM implementation, especially for NLP tasks, where we have a lot of pretrained language models. A sketch of the SVD route follows.
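
For illustration, here is a minimal sketch of the dimensionality-reduction route using scikit-learn's `TruncatedSVD`, which operates directly on a sparse BoW/tf-idf matrix and produces a small dense array that any GBM can consume (the shapes and component count here are arbitrary):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in for a sparse BoW/tf-idf matrix: 10k documents, 20k terms, ~98% zeros.
X_sparse = sparse.random(10_000, 20_000, density=0.02, format="csr", random_state=0)

# TruncatedSVD (unlike PCA) accepts sparse input directly and never densifies
# the full matrix; the output is a compact dense array.
svd = TruncatedSVD(n_components=128, random_state=0)
X_reduced = svd.fit_transform(X_sparse).astype(np.float32)  # shape (10000, 128)

# X_reduced can now be passed to py-boost or any other GBM implementation.
```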

@Sandy4321 (Author) commented Aug 17, 2023

"You are right, all GBMs by design are probably not the best choice for dealing with the sparse data."
can you share some link
i glad you understand it, may support your opinion by evidence ,
since as rule people do not aware about such an issue

i can not find any serious web link to persuade these people, that GBM is bad on sparse data ?
