added logic to create graph depending on driver cuda version #413
base: main
Conversation
Hi @SergiMuac, thanks for your contribution! I checked the Newton code and they don't do such checking either. I'd like to err on the side of simplicity and just warn the user that they should use 12.4+ for graph capture support, something we do in fact currently document. Do you feel strongly about having this in the code?
Hi @kevinzakka, short answer: yes! I have identified a bug. Allow me to explain, to the best of my understanding, what is happening.

CUDA is backward compatible across versions in the sense that kernel modules can be recompiled on demand using instructions compatible with older versions. However, the CUDA Runtime itself has version-dependent limitations, and certain features are simply unavailable in older runtimes.

The second issue concerns CUDA graph capture. This feature is only available starting from CUDA Runtime 12.4, but this requirement is not checked anywhere in the code. As a result, the code starts normally but fails at runtime. The proposed check reads the driver version and only enables graph creation when it is supported:

```python
driver_ver = wp.context.runtime.driver_version
driver_ver = float(f"{driver_ver[0]}.{driver_ver[1]}")
self.use_cuda_graph = (
    self.wp_device.is_cuda
    and wp.is_mempool_enabled(self.wp_device)
    and driver_ver >= _MIN_DRIVER_FOR_CONDITIONAL_GRAPHS
)
```

With these two changes, execution would be robust across any CUDA 12.x version. If, after this explanation, you still prefer not to introduce these changes, I strongly believe that, at a minimum, the relevant CUDA graph or module-loading exceptions should be properly caught, so that the code does not keep failing silently when a CUDA version prior to 12.4 is used.

After further investigation, I have simplified the checks and reduced the additional code to a minimum. I would be interested to hear your thoughts on this approach.
src/mjlab/sim/sim.py (Outdated)
```python
if self.use_cuda_graph:
    print("Warming up CUDA kernels...")
    mjwarp.step(self.wp_model, self.wp_data)
    wp.synchronize()
```
Is this part necessary?
I cannot confirm that this step is unnecessary, as no dedicated setup is available to validate the scenario. In theory, if the CUDA driver runtime accepts CUDA graph capture but does not support lazy module loading, preloading all modules before enabling graph capture would prevent a potential crash; however, it is unclear whether any released CUDA driver versions actually exhibit this behavior.
Empirically, the CUDA versions that satisfy the initial runtime check also appear to support lazy module loading, suggesting that this potential failure mode is already implicitly covered.
Tests performed on the available machines indicate that the system functions correctly without the warm-up step; therefore, it can be removed, with the understanding that a separate pull request can be opened in the future if needed. The changes will be pushed accordingly.
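For reference, the warm-up idea discussed above can be sketched abstractly; the `step` and `synchronize` callables stand in for `mjwarp.step` and `wp.synchronize`, so this is an illustration of the control flow, not the PR's actual code:

```python
# Illustration only: the warm-up runs one step eagerly so that any lazy
# kernel-module loading happens *before* CUDA graph capture starts.
# `step` / `synchronize` are stand-ins for mjwarp.step / wp.synchronize.
def warm_up(use_cuda_graph, step, synchronize):
    if use_cuda_graph:
        step()         # triggers module compilation/loading outside capture
        synchronize()  # wait until loading has finished
        return True    # warm-up performed
    return False       # nothing to do when graphs are disabled

calls = []
warm_up(True, lambda: calls.append("step"), lambda: calls.append("sync"))
assert calls == ["step", "sync"]  # step runs, then synchronize
```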
It appears that the pipeline crashes when checking the CUDA version because GitHub runners do not have CUDA drivers installed. I'll add a try/except mechanism to handle this scenario gracefully.
Need to take a look at the failing tests.
Hi @SergiMuac! I've refactored the CUDA graph checking logic to make it cleaner and fix a scope bug. Could you cherry-pick this commit into your PR?

```shell
git fetch https://github.com/mujocolab/mjlab.git feat/support_cuda122
git cherry-pick 9baedc7
```
kevinzakka left a comment
See last comment.
Starting with CUDA 12.4, loading kernel modules during a graph is supported. However, it does not work on older CUDA versions that are still actively used on GPU clusters (such as 12.2).

This PR adds logic to check the CUDA driver version, which may differ from the toolkit version, so that the variable enabling graph creation is set properly.
This fixes the following error:

```
Warp CUDA error 900: operation not permitted when stream is capturing (in function wp_cuda_load_module, /builds/omniverse/warp/warp/native/warp.cu:4389)
```

Tested on a PC with an RTX 5090 (CUDA 13.0) and on an H100 cluster (CUDA 12.2).