
How does your code perform on sparse data? #16

Open
Sandy4321 opened this issue Aug 6, 2023 · 2 comments

@Sandy4321

It seems that GBMs are bad on sparse data for classification.
How does your code perform on sparse data?
NLP one-hot data is very sparse, say 98% of the values are zeros.

@btbpanda (Contributor) commented Aug 7, 2023

Hi @Sandy4321

Thanks for your question. You are right, all GBMs by design are probably not the best choice for dealing with sparse data. Even though some SotA implementations such as LightGBM or XGBoost support the sparse format and implement specific features for this data type, their performance may still fall short of neural networks or even linear models. But it actually depends on the task: each problem is individual, and only an experiment will show you what works best. Unfortunately, py-boost has no built-in support for sparse arrays; to handle them, you need to convert to a dense array manually. We have a plan to support sparsity for both features and targets, but don't expect it to be released soon. Still, some optimizations can be made here. All of them save memory and prevent overfitting, and sometimes that is enough to fit into GPU memory if the dataset is not large (see the sketch after the list):

  1. limit max_bin to 8-16 or even 4
  2. limit colsample to 0.1-0.2
  3. limit max_depth to 3-4
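
A minimal sketch of how these settings might look, assuming py-boost's `GradientBoosting` accepts the `max_bin`, `colsample`, and `max_depth` keyword arguments named above (check the exact constructor signature and the `'bce'` loss name against the py-boost docs):

```python
import numpy as np
from scipy import sparse
from py_boost import GradientBoosting

# Toy sparse binary-classification data, ~98% zeros as in the question.
rng = np.random.default_rng(0)
X_sparse = sparse.random(10_000, 500, density=0.02, format="csr", random_state=0)
y = rng.integers(0, 2, size=10_000).astype(np.float32)

# py-boost has no built-in sparse support, so densify manually first.
X_dense = np.asarray(X_sparse.todense(), dtype=np.float32)

# Hypothetical configuration following the three suggestions above.
model = GradientBoosting(
    "bce",          # binary cross-entropy loss for classification
    max_bin=16,     # coarse histogram quantization saves GPU memory
    colsample=0.1,  # sample a small fraction of the mostly-zero columns per tree
    max_depth=3,    # shallow trees help against overfitting on sparse features
)
model.fit(X_dense, y)
```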

But in general, the common approach to training a GBM over a sparse representation is dimensionality reduction (via SVD, for example) before training, or using a representation other than BoW/tf-idf. I would typically expect better performance from both approaches regardless of the GBM implementation, especially for NLP tasks, where we have a lot of pretrained language models. A sketch of the SVD route follows.
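
For illustration, here is a minimal sketch of the dimensionality-reduction route using scikit-learn's `TruncatedSVD`, which operates directly on a sparse BoW/tf-idf matrix and produces a small dense array that any GBM can consume (the shapes and component count here are arbitrary):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in for a sparse BoW/tf-idf matrix: 10k documents, 20k terms, ~98% zeros.
X_sparse = sparse.random(10_000, 20_000, density=0.02, format="csr", random_state=0)

# TruncatedSVD (unlike PCA) accepts sparse input directly and never densifies
# the full matrix; the output is a compact dense array.
svd = TruncatedSVD(n_components=128, random_state=0)
X_reduced = svd.fit_transform(X_sparse).astype(np.float32)  # shape (10000, 128)

# X_reduced can now be passed to py-boost or any other GBM implementation.
```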

@Sandy4321 (Author) commented Aug 17, 2023

"You are right, all GBMs by design are probably not the best choice for dealing with the sparse data."
can you share some link
i glad you understand it, may support your opinion by evidence ,
since as rule people do not aware about such an issue

i can not find any serious web link to persuade these people, that GBM is bad on sparse data ?
