🐛 Bug
NVIDIA Docker images require adding libraries like `libnvidia-ml` that are part of the `utility` capability. TorchX currently only adds `compute` here.
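For context, `utility` is the NVIDIA driver capability that makes the container runtime mount NVML bits such as `libnvidia-ml.so.1` and `nvidia-smi`, while `compute` only exposes the CUDA/compute libraries. A minimal sketch of the difference using docker-py (assumes a host with the NVIDIA container toolkit and at least one GPU; the image tag is illustrative, not the one from this report):

```python
# Sketch: compare what the NVIDIA runtime mounts for "compute" alone vs.
# "compute" + "utility". Assumes docker-py, the NVIDIA container toolkit,
# and at least one GPU on the host; the image tag is only an example.
import docker
from docker.errors import DockerException
from docker.types import DeviceRequest

client = docker.from_env()

for caps in (["gpu", "compute"], ["gpu", "compute", "utility"]):
    try:
        out = client.containers.run(
            "ubuntu:22.04",
            "nvidia-smi -L",  # needs libnvidia-ml.so.1, i.e. the "utility" capability
            device_requests=[DeviceRequest(count=-1, capabilities=[caps])],
            remove=True,
        )
        print(caps, "->", out.decode().strip())
    except DockerException as exc:
        print(caps, "-> failed:", exc)
```

With only `compute`, the `nvidia-smi` call fails because neither the binary nor `libnvidia-ml.so.1` is mounted; adding `utility` makes both appear.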
There are two solutions I verified for this issue, not sure which one is better:

- [This PR] add `utility` next to `compute` (see the sketch below). Similar fixes here and here.
- [My personal preference] remove device capabilities from `docker_scheduler` and rely on default values OR let the user customize via nvidia-container-toolkit, e.g. we don't set this for `aws_batch_scheduler` and it works fine here. Not sure if it's backward compatible with old versions of nvidia-container-runtime though.
NOTE: nvidia-container-runtime has been superseded by nvidia-container-toolkit.
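For reference, a minimal sketch of what the first option boils down to in docker-py terms (the function name and its location are illustrative, not a quote of the TorchX `docker_scheduler` source):

```python
# Sketch of option 1: ask the NVIDIA device driver for "utility" in addition
# to "compute". Names are illustrative; the real change belongs in torchx's
# docker_scheduler.
from docker.types import DeviceRequest

def gpu_device_request(num_gpus: int) -> DeviceRequest:
    return DeviceRequest(
        count=num_gpus,
        capabilities=[["compute", "utility"]],  # previously [["compute"]]
    )
```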
Module (check all that apply):

- torchx.spec
- torchx.component
- torchx.apps
- torchx.runtime
- torchx.cli
- torchx.schedulers
- torchx.pipelines
- torchx.aws
- torchx.examples
- other
To Reproduce
Steps to reproduce the behavior:
This results in a crash with:
```
train/0 [0]:[rank: 0] Global seed set to 10
train/0 [0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:distributed_backend=nccl
train/0 [0]:All distributed processes registered. Starting with 1 processes
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:
train/0 [0]:Error executing job with overrides: ['++trainer.max_steps=10']
train/0 [0]:Traceback (most recent call last):
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
train/0 [0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
train/0 [0]: return function(*args, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
train/0 [0]: self._run(model, ckpt_path=ckpt_path)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
train/0 [0]: self.__setup_profiler()
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1069, in __setup_profiler
train/0 [0]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1192, in log_dir
train/0 [0]: dirpath = self.strategy.broadcast(dirpath)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 292, in broadcast
train/0 [0]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]: return func(*args, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2597, in broadcast_object_list
train/0 [0]: broadcast(object_sizes_tensor, src=src, group=group)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]: return func(*args, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1900, in broadcast
train/0 [0]: work = default_pg.broadcast([tensor], opts)
train/0 [0]:torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
train/0 [0]:ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
train/0 [0]:Last error:
train/0 [0]:Failed to open libnvidia-ml.so.1
```
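A quick way to tell whether a container is affected (not part of the original repro, just a sanity check) is to probe for NVML directly before NCCL gets involved:

```python
# Sanity check (not from the original report): if the "utility" capability is
# missing, NVML cannot be loaded inside the container, which is what NCCL
# eventually trips over with "Failed to open libnvidia-ml.so.1".
import ctypes

try:
    ctypes.CDLL("libnvidia-ml.so.1")
    print("libnvidia-ml.so.1 is present")
except OSError as exc:
    print("libnvidia-ml.so.1 is missing:", exc)
```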
Expected behavior
`libnvidia-ml` and other libraries should be added to the container.
Environment
- torchx version (e.g. 0.1.0rc1):
- Python version:
- OS (e.g., Linux):
- How you installed torchx (`conda`, `pip`, source, `docker`):
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information:
```
PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.26
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.201-191.748.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.588
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.02
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
...
```
Additional context