Latest NCCL 2.24.3 might crash XGBoost. #11154

Open
trivialfis opened this issue Jan 9, 2025 · 6 comments

Comments

@trivialfis
Member

trivialfis commented Jan 9, 2025

Workaround:

export NCCL_RAS_ENABLE=0
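
When the training job is launched from Python rather than from a shell, the same workaround could be applied to the child process environment. The following is a minimal sketch, not part of the original report: the helper names `env_with_ras_disabled` and `run_training` and the script path are hypothetical, and only the variable `NCCL_RAS_ENABLE=0` comes from this issue.

```python
import os
import subprocess
import sys


def env_with_ras_disabled(base_env=None):
    """Return a copy of the environment with NCCL_RAS_ENABLE set to "0".

    The input mapping is copied, so the caller's environment is not mutated.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_RAS_ENABLE"] = "0"
    return env


def run_training(script_path):
    """Launch a (hypothetical) training script with the workaround applied."""
    return subprocess.run(
        [sys.executable, script_path],
        env=env_with_ras_disabled(),
        check=True,
    )
```

Setting `os.environ["NCCL_RAS_ENABLE"] = "0"` inside the training process itself may also work, on the assumption that it runs before NCCL is initialized. With distributed frameworks, the variable would need to be set in every worker process, not only in the client.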
@trivialfis
Member Author

This should not affect the conda build.

@jakirkham
Contributor

How do things look with NCCL 2.25.1-1?

@hcho3
Collaborator

hcho3 commented Feb 4, 2025

@jakirkham I just tried 2.25.1-1 (as part of #11202) and got the same error, so I still had to set the environment variable NCCL_RAS_ENABLE=0.

@jakirkham
Contributor

Thanks Hyunsu! 🙏

Is this with conda, pip, or both?

hcho3 added a commit to hcho3/xgboost that referenced this issue Feb 4, 2025
@hcho3
Collaborator

hcho3 commented Feb 4, 2025

@jakirkham The issue only arises when NCCL is installed from pip. It does not arise if either of the following holds:

  1. NCCL is installed from Conda
  2. XGBoost was built with the CMake flag -DUSE_DLOPEN_NCCL=OFF (don't use dlopen for NCCL)

So this issue won't arise for the Conda package of XGBoost.
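
To decide at runtime whether the workaround is likely needed, one could check for a pip-installed NCCL wheel. This is a rough sketch under stated assumptions: the distribution names `nvidia-nccl-cu12` and `nvidia-nccl-cu11` are assumed to be the pip packages in use, the `(2, 24, 3)` threshold is an assumption based only on the versions reported in this thread, and the check cannot see whether XGBoost was built with `-DUSE_DLOPEN_NCCL=OFF`. The function names are hypothetical.

```python
from importlib import metadata

# Assumed pip distribution names for NCCL wheels; adjust for your CUDA version.
PIP_NCCL_PACKAGES = ("nvidia-nccl-cu12", "nvidia-nccl-cu11")

# Assumption: 2.24.3 is the first affected release (per this thread's title).
FIRST_AFFECTED = (2, 24, 3)


def pip_nccl_version(lookup=metadata.version):
    """Return (name, version) of the first pip-installed NCCL wheel, or None."""
    for name in PIP_NCCL_PACKAGES:
        try:
            return name, lookup(name)
        except metadata.PackageNotFoundError:
            continue
    return None


def needs_ras_workaround(lookup=metadata.version):
    """Guess whether NCCL_RAS_ENABLE=0 should be set for a pip-installed NCCL."""
    found = pip_nccl_version(lookup)
    if found is None:
        # No pip NCCL wheel found (for example, NCCL comes from Conda).
        return False
    _, version = found
    # Compare only the leading numeric components, e.g. "2.25.1" -> (2, 25, 1).
    parts = tuple(int(p) for p in version.split(".")[:3] if p.isdigit())
    return parts >= FIRST_AFFECTED
```

The `lookup` parameter exists only so the check can be exercised without the wheels installed; in normal use, `needs_ras_workaround()` would be called with no arguments.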

hcho3 added a commit to hcho3/xgboost that referenced this issue Feb 5, 2025
@trivialfis
Member Author

We need to remove the CI workarounds once the new NCCL is released.
