how your code performing on sparse data #16
Hi @Sandy4321, thanks for your question. You are right: by design, GBMs are probably not the best choice for dealing with sparse data. Even though some state-of-the-art implementations such as LightGBM or XGBoost support the sparse format and implement specific features for this data type, performance may still be worse than that of neural networks or even linear models. But it really depends on the task: each problem is individual, and only experiment will show what works best.

Unfortunately, py-boost has no built-in support for sparse arrays. To handle them, you should manually convert to a dense array. We plan to support sparsity for both features and targets, but don't expect it to be released soon. Still, some optimizations can be made here. All of them can save memory and help prevent overfitting; sometimes that alone is enough to fit into GPU memory, if the dataset is not too large.
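For illustration, a minimal sketch of the manual dense conversion, assuming the features arrive as a scipy CSR matrix (e.g. a BoW/tf-idf output). Casting to `float32` halves memory versus the default `float64`, which is one of the simple savings mentioned above; the matrix here is randomly generated just for the example.

```python
import numpy as np
from scipy import sparse

# Hypothetical sparse feature matrix (CSR), standing in for a BoW/tf-idf output.
X_sparse = sparse.random(1000, 500, density=0.02, format="csr", random_state=0)

# Densify and downcast to float32 before passing the array to the booster.
X_dense = np.ascontiguousarray(X_sparse.toarray(), dtype=np.float32)

print(X_dense.shape, X_dense.dtype)  # (1000, 500) float32
```

The resulting dense `float32` array is what you would then feed to the model's `fit`.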
But in general, the common approach to training a GBM over a sparse representation is dimensionality reduction (via SVD, for example) before training, or using a representation other than BoW/tf-idf. I would typically expect better performance from both approaches regardless of the GBM implementation, especially for NLP tasks, where we have a lot of pretrained language models.
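The SVD route above can be sketched with scikit-learn's `TruncatedSVD`, which works directly on sparse matrices; the matrix and the component count here are illustrative, not recommendations.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse tf-idf-like matrix: 2000 documents, 1000 terms, ~1% nonzero.
X = sparse.random(2000, 1000, density=0.01, format="csr", random_state=0)

# Collapse the 1000 sparse columns into 64 dense components before boosting.
svd = TruncatedSVD(n_components=64, random_state=0)
X_reduced = svd.fit_transform(X)  # dense array of shape (2000, 64)
```

`X_reduced` is already a dense NumPy array, so no extra conversion step is needed before training the GBM on it.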
"You are right, all GBMs by design are probably not the best choice for dealing with the sparse data." I cannot find any serious web link to persuade these people that GBMs are bad on sparse data?
It seems that GBMs are bad on sparse data for classification.
How does your code perform on sparse data? NLP one-hot data is very sparse, let's say 98% of the values are zeros.