60x Slowdown Using Concurrency #6728
Comments
Hey @JohanLoknaQC, thanks for using LightGBM. By default, LightGBM uses all available threads on the machine unless you tell it otherwise. So in your example you're submitting n tasks while assigning only n - 1 cores, so the threads have to fight each other for CPU time. I think the easiest way to fix this is by doing something like …
Thanks a lot for the answer! However, after adding the suggested fix (see code above), the runtime remains virtually unchanged. It does seem like something else might be causing this additional runtime.
Sorry, I think that only works if provided through the command line. Can you please set it in the params instead?

```python
params = {
    "objective": "regression",
    "metric": "rmse",
    "num_leaves": 31,  # the default value
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "verbose": 0,
    "num_threads": n - 1,  # <- set this
}
```
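For context, here is a minimal sketch of how that fix might look when training several models concurrently. The process pool, synthetic data, and worker counts below are illustrative assumptions, not taken from the original report:

```python
import multiprocessing as mp

import lightgbm as lgb
import numpy as np


def train_one(seed, num_threads):
    # Synthetic data purely for illustration; the reporter's dataset was not shown.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((10_000, 20))
    y = X @ rng.standard_normal(20)
    params = {
        "objective": "regression",
        "verbose": -1,
        # Cap LightGBM's own threading so concurrent workers don't
        # oversubscribe the cores available to each process.
        "num_threads": num_threads,
    }
    return lgb.train(params, lgb.Dataset(X, y), num_boost_round=50)


if __name__ == "__main__":
    n_workers = 4  # illustrative; choose based on your core count
    threads_per_worker = mp.cpu_count() // n_workers
    with mp.Pool(n_workers) as pool:
        boosters = pool.starmap(
            train_one,
            [(seed, threads_per_worker) for seed in range(n_workers)],
        )
```

The key design point is that the total threads across workers (n_workers * threads_per_worker) should not exceed the cores actually available to the process.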
Thank you very much, this solved the issue. Just for reference, it also worked when the affinities were set quite arbitrarily, e.g. …
Description
There seems to be a clear issue related to how `lightgbm` handles resource sharing: when restricting the number of cores associated with a process, the runtime increases significantly.

In the example provided below, the runtime using all cores (0-15) is about 1.821 seconds. When restricting the process to all cores but one (0-14), the runtime increases to 109.31 seconds, more than a 60x increase. This only happens if the resource restriction is done from within the Python script. If the affinity is set beforehand using `taskset -c 0-14`, the runtime is approximately the same, 1.796 seconds.

This makes training multiple `lightgbm` models in parallel undesirable, at least if the subprocesses are called from within a Python script. As this is a common pattern for implementing concurrency, this appears to be a limitation which can hopefully be easily addressed and fixed. Thanks!
Reproducible example
lgbm_affinity.py
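The script contents were not captured in this export. As a rough sketch of a reproduction along the lines described above, assuming `os.sched_setaffinity` is the in-Python mechanism used to restrict cores (the data shapes and boosting rounds are illustrative, not from the report):

```python
import os
import time

import lightgbm as lgb
import numpy as np

# Restrict this process to cores 0-14 from within Python. This is the step
# the report associates with the ~60x slowdown; setting the same affinity
# beforehand (e.g. `taskset -c 0-14 python lgbm_affinity.py`) reportedly
# does not trigger it.
os.sched_setaffinity(0, set(range(15)))

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 50))  # illustrative size
y = X @ rng.standard_normal(50)

start = time.perf_counter()
lgb.train(
    {"objective": "regression", "verbose": -1},
    lgb.Dataset(X, y),
    num_boost_round=100,
)
print(f"elapsed: {time.perf_counter() - start:.3f}s")
```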
lgbm_affinity.sh
Output
Environment info
LightGBM version or commit hash:
Command(s) you used to install LightGBM:
Other used packages:
The example was run on an AWS instance (`ml.m5.4xlarge`) with 16 cores.
Additional Comments