Conversation

@SergiMuac commented Dec 12, 2025

Starting with CUDA 12.4, loading kernel modules while a graph is being captured is supported. However, this does not work on older CUDA versions that are still actively used on GPU clusters (such as 12.2).

This PR adds logic to check the CUDA driver version, which may differ from the toolkit version, so that the variable enabling graph creation is set correctly.

This fixes the following error:
Warp CUDA error 900: operation not permitted when stream is capturing (in function wp_cuda_load_module, /builds/omniverse/warp/warp/native/warp.cu:4389)

Tested on a PC with an RTX 5090 (CUDA 13.0) and on an H100 cluster (CUDA 12.2).

@kevinzakka (Collaborator)

Hi @SergiMuac, thanks for your contribution! I checked the Newton code and they don't do this kind of check either. I'd like to err on the side of simplicity and just warn the user that they need CUDA 12.4+ for graph capture support, something we do in fact currently document. Do you feel strongly about having this in the code?

@SergiMuac (Author)

Hi @kevinzakka,

Short answer: Yes!

I have identified a bug in sim.py. The condition

self.use_cuda_graph = self.wp_device.is_cuda and wp.is_mempool_enabled(self.wp_device)

is necessary but not sufficient. While it is true that this logic works for all CUDA 12.4+ versions, and it would be possible to simply state in the documentation that MJLab requires CUDA 12.4+, that does not seem to be the best approach.

Allow me to explain, to the best of my understanding, what is happening. CUDA is backward compatible across versions in the sense that kernel modules can be recompiled on demand with instructions compatible with older versions. However, the CUDA Runtime itself has version-dependent limitations, and certain features are simply unavailable in older runtimes. In the case of sim.py, I have identified two issues that cause the code to fail. First, when using an older CUDA Runtime, kernel modules are recompiled as described above, but the script enables graph capture before all modules are fully loaded, which leads to an error. This can be addressed by a simple warm-up of the modules, for example by calling step once at the beginning, as sketched below.
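For concreteness, the warm-up from the diff under review (quoted again in the inline comment further down) amounts to:

print("Warming up CUDA kernels...")
mjwarp.step(self.wp_model, self.wp_data)  # one eager step forces every kernel module to load
wp.synchronize()  # block until module loading and the step have finished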

The second issue concerns CUDA graph capture. This feature is only available starting from CUDA Runtime 12.4, but the requirement is not checked anywhere in the code. As a result, the code starts normally but fails when create_graph is called. This could be fixed by adding an additional condition that explicitly checks the driver version, for example:

# Minimum driver version for loading modules during graph capture (CUDA 12.4+).
# Comparing (major, minor) tuples avoids float pitfalls such as 12.10 < 12.4.
_MIN_DRIVER_FOR_CONDITIONAL_GRAPHS = (12, 4)

driver_ver = wp.context.runtime.driver_version  # (major, minor); may differ from the toolkit version
self.use_cuda_graph = (
    self.wp_device.is_cuda
    and wp.is_mempool_enabled(self.wp_device)
    and driver_ver >= _MIN_DRIVER_FOR_CONDITIONAL_GRAPHS
)

With these two changes, execution would be robust across any CUDA 12.x version. If, after this explanation, you still prefer not to introduce them, I strongly believe that at a minimum the relevant CUDA graph or module-loading exceptions should be caught, so that the failure is reported clearly, rather than surfacing as an opaque CUDA error, when a version prior to 12.4 is used.
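As a sketch of that fallback, assuming Warp's capture_begin/capture_end API and the attribute names used above (self.graph is hypothetical here):

try:
    wp.capture_begin()
    mjwarp.step(self.wp_model, self.wp_data)
    self.graph = wp.capture_end()  # returns the captured graph
except Exception as e:
    # Pre-12.4 drivers reject module loading while a stream is capturing
    # (CUDA error 900); fall back to eager stepping instead of crashing.
    print(f"CUDA graph capture unavailable, falling back to eager mode: {e}")
    self.use_cuda_graph = False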

After further investigation, I have simplified the checks and reduced the additional code to a minimum. I would be interested to hear your thoughts on this approach.

Comment on lines 142 to 145

if self.use_cuda_graph:
    print("Warming up CUDA kernels...")
    mjwarp.step(self.wp_model, self.wp_data)
    wp.synchronize()
Collaborator

Is this part necessary?

@SergiMuac (Author)

I cannot confirm that this step is unnecessary, as no dedicated setup is available to validate the scenario. In theory, if the CUDA driver accepts graph capture but does not support lazy module loading, preloading all modules before enabling graph capture would prevent a potential crash; however, it is unclear whether any released driver version actually exhibits this behavior.
Empirically, the CUDA versions that satisfy the initial driver check also appear to support lazy module loading, suggesting that this potential failure mode is already implicitly covered.
Tests on the available machines indicate that the code works correctly without the warm-up step, so it can be removed, with the understanding that a separate pull request can be opened in the future if needed. I will push the changes accordingly.

@SergiMuac (Author) commented Dec 22, 2025

It appears that the pipeline crashes when checking the CUDA version because GitHub runners do not have CUDA drivers installed. I'll add a try/except mechanism to handle this scenario gracefully.
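A rough sketch of that guard (the helper name is hypothetical; on a machine without a CUDA driver the driver_version query may fail or return None):

import warp as wp

_MIN_DRIVER_FOR_CONDITIONAL_GRAPHS = (12, 4)

def _driver_supports_graph_capture() -> bool:
    # Treat any failure to query the driver (e.g. on driverless CI runners)
    # as "graph capture not supported".
    try:
        driver_ver = wp.context.runtime.driver_version  # assumed (major, minor)
        return driver_ver is not None and tuple(driver_ver) >= _MIN_DRIVER_FOR_CONDITIONAL_GRAPHS
    except Exception:
        return False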

@kevinzakka (Collaborator)

Need to take a look at the failing tests

@kevinzakka (Collaborator)

Hi @SergiMuac! I've refactored the CUDA graph checking logic to make it cleaner and fix a scope bug. Could you cherry-pick this commit into your PR?

git fetch https://github.com/mujocolab/mjlab.git feat/support_cuda122
git cherry-pick 9baedc7

@kevinzakka (Collaborator) left a review comment

See last comment.
