Looks like this issue can be closed after the fix was merged, @andywag.
I still wonder if we can remove device_request completely from local_docker to let it default to compute,utility.
🐛 Bug
nvidia Docker images require adding libraries like libnvidia-ml that are part of the utility capability. TorchX currently only adds compute here.
There are two solutions I verified for this issue, not sure which one is better:
[This PR] add utility next to compute. Similar fixes here and here
[My personal preference] remove device capabilities from docker_scheduler and rely on default values OR let the user customize via nvidia-container-toolkit, e.g. we don't set this for aws_batch_scheduler and it works fine here. Not sure if it's backward compatible with old versions of nvidia-container-runtime though.
NOTE: nvidia-container-runtime has been superseded by nvidia-container-toolkit.
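Adding utility next to compute can be sketched as a GPU device request in the shape the Docker Engine API accepts under HostConfig.DeviceRequests. This is a minimal illustration and not TorchX's actual code: the helper name gpu_device_request is hypothetical, and only the Count and Capabilities fields are shown.

```python
from typing import Any, Dict, Sequence


def gpu_device_request(
    num_gpus: int,
    capabilities: Sequence[str] = ("compute", "utility"),
) -> Dict[str, Any]:
    """Build one DeviceRequests entry for a container (hypothetical helper)."""
    if num_gpus == 0 or num_gpus < -1:
        raise ValueError("num_gpus must be positive, or -1 for all GPUs")
    return {
        "Count": num_gpus,  # -1 requests every available GPU
        # Capabilities is an OR-list of AND-lists: a single inner list
        # means the device must offer every capability named in it.
        "Capabilities": [list(capabilities)],
    }


if __name__ == "__main__":
    print(gpu_device_request(1))
```

Passing capabilities=("compute",) reproduces the configuration described in this report, where libnvidia-ml is not made available inside the container.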
Module (check all that applies):
torchx.spec
torchx.component
torchx.apps
torchx.runtime
torchx.cli
torchx.schedulers
torchx.pipelines
torchx.aws
torchx.examples
other
To Reproduce
Steps to reproduce the behavior:
This results in a crash with:
Expected behavior
libnvidia-ml and other libraries should be added to the container.
Environment
(conda, pip, source, docker):
Additional context