🐛 Bug
NVIDIA Docker images require adding libraries like `libnvidia-ml` that are part of the `utility` capability. TorchX currently only adds `compute` here.
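For context, `utility` is the NVIDIA driver capability that makes the container runtime mount NVML bits such as `libnvidia-ml.so.1` and `nvidia-smi`, while `compute` only exposes the CUDA/compute libraries. A minimal sketch of the difference using docker-py (assumes a host with the NVIDIA container toolkit and at least one GPU; the image tag is illustrative, not the one from this report):

```python
# Sketch: compare what the NVIDIA runtime mounts for "compute" alone vs.
# "compute" + "utility". Assumes docker-py, the NVIDIA container toolkit,
# and at least one GPU on the host; the image tag is only an example.
import docker
from docker.errors import DockerException
from docker.types import DeviceRequest

client = docker.from_env()

for caps in (["gpu", "compute"], ["gpu", "compute", "utility"]):
    try:
        out = client.containers.run(
            "ubuntu:22.04",
            "nvidia-smi -L",  # needs libnvidia-ml.so.1, i.e. the "utility" capability
            device_requests=[DeviceRequest(count=-1, capabilities=[caps])],
            remove=True,
        )
        print(caps, "->", out.decode().strip())
    except DockerException as exc:
        print(caps, "-> failed:", exc)
```

With only `compute`, the `nvidia-smi` call fails because neither the binary nor `libnvidia-ml.so.1` is mounted; adding `utility` makes both appear.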
There are two solutions I verified for this issue, not sure which one is better:

- [This PR] add `utility` next to `compute` (see the sketch below). Similar fixes here and here.
- [My personal preference] remove device capabilities from `docker_scheduler` and rely on default values OR let the user customize via nvidia-container-toolkit, e.g. we don't set this for `aws_batch_scheduler` and it works fine here. Not sure if it's backward compatible with old versions of nvidia-container-runtime though.
NOTE: nvidia-container-runtime has been superseded by nvidia-container-toolkit.
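For reference, a minimal sketch of what the first option boils down to in docker-py terms (the function name and its location are illustrative, not a quote of the TorchX `docker_scheduler` source):

```python
# Sketch of option 1: ask the NVIDIA device driver for "utility" in addition
# to "compute". Names are illustrative; the real change belongs in torchx's
# docker_scheduler.
from docker.types import DeviceRequest

def gpu_device_request(num_gpus: int) -> DeviceRequest:
    return DeviceRequest(
        count=num_gpus,
        capabilities=[["compute", "utility"]],  # previously [["compute"]]
    )
```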
Module (check all that apply):

- torchx.spec
- torchx.component
- torchx.apps
- torchx.runtime
- torchx.cli
- torchx.schedulers
- torchx.pipelines
- torchx.aws
- torchx.examples
- other
To Reproduce
Steps to reproduce the behavior:
This results in a crash with:
```
train/0 [0]:[rank: 0] Global seed set to 10
train/0 [0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:distributed_backend=nccl
train/0 [0]:All distributed processes registered. Starting with 1 processes
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:
train/0 [0]:Error executing job with overrides: ['++trainer.max_steps=10']
train/0 [0]:Traceback (most recent call last):
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
train/0 [0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
train/0 [0]: return function(*args, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
train/0 [0]: self._run(model, ckpt_path=ckpt_path)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
train/0 [0]: self.__setup_profiler()
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1069, in __setup_profiler
train/0 [0]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1192, in log_dir
train/0 [0]: dirpath = self.strategy.broadcast(dirpath)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 292, in broadcast
train/0 [0]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]: return func(*args, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2597, in broadcast_object_list
train/0 [0]: broadcast(object_sizes_tensor, src=src, group=group)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]: return func(*args, **kwargs)
train/0 [0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1900, in broadcast
train/0 [0]: work = default_pg.broadcast([tensor], opts)
train/0 [0]:torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
train/0 [0]:ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
train/0 [0]:Last error:
train/0 [0]:Failed to open libnvidia-ml.so.1
```
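A quick way to tell whether a container is affected (not part of the original repro, just a sanity check) is to probe for NVML directly before NCCL gets involved:

```python
# Sanity check (not from the original report): if the "utility" capability is
# missing, NVML cannot be loaded inside the container, which is what NCCL
# eventually trips over with "Failed to open libnvidia-ml.so.1".
import ctypes

try:
    ctypes.CDLL("libnvidia-ml.so.1")
    print("libnvidia-ml.so.1 is present")
except OSError as exc:
    print("libnvidia-ml.so.1 is missing:", exc)
```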
Expected behavior
`libnvidia-ml` and other libraries should be added to the container.
Environment
- torchx version (e.g. 0.1.0rc1):
- Python version:
- OS (e.g., Linux):
- How you installed torchx (`conda`, `pip`, source, `docker`):
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information:
```
PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.26
Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.201-191.748.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 2699.588
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4600.02
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
...
```
Additional context