Check failed: (best_split_info.left_count) > (0) #4946
Comments
Thanks very much for using LightGBM! I suspect that this is the same as other issues that have been reported (e.g. #4739, #3679), but it's hard to say without more details. Are you able to provide any (and hopefully all) of the following?
That would be very helpful. Without such information and with only an error message, significant investigation will probably be required to figure out why you encountered this error.
Thanks for your reply! I built the source code with this command: Sorry, I can't paste my code because it is on my company machine. I can't take screenshots or copy the code, but I can describe it:
That's ok, completely understand that the code might be sensitive. Are you able to share the values for
params = {
Hello, can you find a GPU with more than 17 GB of memory (like a V100-32G or A100-40G) and generate a random dataset to reproduce my code?
I personally don't have easy access to hardware like that. I might try at some point to get a VM from a cloud provider and work on some of the open GPU-specific issues in this project, but can't commit to that. Maybe some other maintainer or contributor will be able to help you.
I have an A100-40G and the above code fails.
Shared some details on the issue in another topic: #2793 (comment)
I was able to reproduce this. Details are below, but running @chixujohnny's sample code:
dies with the same check failure reported in this issue. If I switch from OpenCL to CUDA (…), the outcome differs. Details (for the curious):
Hi @jameslamb and others, It is possible for me to deterministically recreate this error. I was wondering if you had any pointers about how to go about debugging this further.
Thanks in advance for any advice!
code for above
I was able to successfully run the large dataset with a change to src/treelearner/ocl/histogram256.cl. I wanted to see if there was some sort of type difference between C++ and OpenCL.
However, if you redefine ordered_gradients as __global const * ordered_gradients, the context will fill in the type, and the large training set runs. At first, I thought score_t was defined differently in the OpenCL code and in C++, but I verified that both are floats. It's not yet clear to me why this change allows the program to finish, and I am still investigating.
Thank you so much for the help! Whenever you feel you've identified the root cause, if you'd like to open a pull request we'd appreciate it, and can help with the contribution and testing process.
I have the same issue. I changed the histogram256.cl file as advised in this thread, but the issue still persists.
That's interesting. I'll see if I can mimic that later this week when I have more bandwidth. I did trace down where that parameter is used: it's in binary_objective.hpp, line 93: if (is_unbalance_ && cnt_positive > 0 && cnt_negative > 0) {
I still get the same error without modifying the cl file and using this config. Let me know if yours differs @Bhuvanamitra : task = train
This error seems to occur during the tree building process, indicating a situation where a split was found but the left child of the split doesn't contain any data. As a workaround, I found that setting min_split_gain to 1 avoids this issue. However, I understand that this solution might not be suitable for all use cases, as it could potentially make the model more conservative about creating new splits, which might not yield the best results in all scenarios. I wanted to bring this to your attention and see if there might be a more general solution to this issue. Any insights or suggestions would be greatly appreciated.
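For illustration, a minimal sketch of that workaround; everything here other than `min_split_gain` (objective, data, device settings) is an assumed, illustrative setup rather than the original code:

```python
import numpy as np
import lightgbm as lgb

# Small synthetic data just to keep the sketch self-contained; the failures
# in this thread involve much larger GPU-resident datasets.
rng = np.random.default_rng(0)
X = rng.random((10_000, 50))
y = rng.integers(0, 2, size=10_000)

params = {
    "objective": "binary",    # assumed objective
    "device_type": "gpu",     # OpenCL-based GPU build discussed in this thread
    "min_split_gain": 1.0,    # the workaround: the default is 0.0, so splits must show real gain
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)
```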
Hi @tolleybot
Note that this time it complains about
Same issue. Is there any way to avoid it, like changing some parameters?
I had this problem and resolved it by setting min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. I received the error when they were both set to 0.
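A sketch of those settings; only the two leaf-constraint parameters come from the comment above, the rest is an assumed setup and X_train stands for whatever training matrix you are using:

```python
# Workaround reported above: give the leaf constraints small non-zero values
# instead of 0. X_train is a placeholder for your own training matrix.
params = {
    "objective": "binary",                        # assumed
    "device_type": "gpu",
    "min_child_samples": 1,                       # was 0 when the error appeared
    "min_child_weight": 1.0 / X_train.shape[0],   # was 0 when the error appeared
}
```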
I set min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. That doesn't solve this error for me.
Is this problem solved in any version 4.0.0 or above?
I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem? |
I updated from 3.3.2 (the version in which I had this problem) to 4.3.0 and did not recompile anything (so my previous GPU-related builds come from 3.3.2, and I used CMake). I just ran the previous code directly and the problem disappeared; not sure what happened exactly.
@jameslamb |
I will give it a try to see if it resolves the issue and get back to you soon. This problem has been a long-standing source of frustration for me.
Do you have an NVIDIA GPU? If so, please try the CUDA version of LightGBM instead. Instructions for that build:
To use it, set the device type to CUDA in your training parameters (see the sketch below). That version is more actively maintained and faster, and might not suffer from this issue.
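As a rough sketch, once LightGBM has been compiled with CUDA support, switching from the OpenCL build is just a parameter change; the data and other parameters here are synthetic and only for illustration:

```python
import numpy as np
import lightgbm as lgb

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((100_000, 100))
y = rng.integers(0, 2, size=100_000)

params = {
    "objective": "binary",
    "device_type": "cuda",   # CUDA tree learner; the OpenCL build uses "gpu"
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```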
The OpenCL-based GPU version of LightGBM is effectively unmaintained right now.
For those "many people" watching this, here's how you could help:
Thank you very much, I'll give it a try |
Hi, I found a bug when training with large X_train.
lgb-gpu version: 3.3.2
CUDA=11.1
CentOS
ram=2TB
GPU=A100-40G
When X_train is larger than about (18,000,000, 1000), lgb-gpu hits a bug like this:
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at LightGBM/src/treelearner/serial_tree_learner.cpp, line 686
When I use LGB==3.2.1, I have the same problem as #4480: when GPU memory usage exceeds 8.3 GB, I get a Memory Object Allocation Failure.
With LGB version 3.3.2, LightGBM can't load more than 17 GB of GPU memory (the GPU has 40 GB). It seems like some problem occurs in the tree-split step, and it only happens when more than 17 GB of GPU memory is loaded.
Another colleague has the same problem; his lgb version is 3.3.1.
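Since the original code could not be shared, here is a rough sketch of what a reproduction might look like, assuming a synthetic float32 matrix of roughly the reported shape and a plain binary objective; all parameter values are assumptions, not the reporter's actual settings. Note that an (18,000,000 x 1000) float32 matrix alone needs roughly 72 GB of host RAM, so this only fits on a machine like the one described above:

```python
import numpy as np
import lightgbm as lgb

# Roughly the reported scale: ~18 million rows x 1000 columns.
# float32 keeps the raw matrix at ~72 GB of host RAM (the reported machine has 2 TB).
n_rows, n_cols = 18_000_000, 1000
rng = np.random.default_rng(0)
X = rng.random((n_rows, n_cols), dtype=np.float32)
y = rng.integers(0, 2, size=n_rows)

params = {
    "objective": "binary",   # assumed; the original objective was not shared
    "device_type": "gpu",    # OpenCL-based GPU build, as in the report
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
```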