
local_docker does not add utility nvidia libraries to containers #906

Closed · clumsy opened this issue May 6, 2024 · 2 comments

@clumsy (Contributor) commented May 6, 2024

🐛 Bug

NVIDIA Docker images require libraries such as libnvidia-ml that are part of the utility device capability.

TorchX currently only adds compute here.

I verified two solutions for this issue, and I am not sure which one is better:

  • [This PR] Add utility next to compute (see the sketch after the note below). Similar fixes here and here.
  • [My personal preference] Remove device capabilities from docker_scheduler and rely on the default values, or let the user customize them via nvidia-container-toolkit. For example, we do not set this for aws_batch_scheduler and it works fine here. I am not sure whether this is backward compatible with old versions of nvidia-container-runtime, though.

NOTE: nvidia-container-runtime has been superseded by nvidia-container-toolkit.
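
To make the first option concrete, here is a minimal sketch of the capability change using the docker Python SDK (docker-py), which the docker scheduler builds on. This is not the actual torchx docker_scheduler code; the image tag and GPU count are illustrative, and the only point is the extra "utility" entry in the DeviceRequest capabilities.

```python
# Minimal sketch (not the actual torchx code): request both "compute" and
# "utility" so the NVIDIA Container Toolkit also mounts libnvidia-ml/nvidia-smi.
import docker
from docker.types import DeviceRequest

client = docker.from_env()

output = client.containers.run(
    "nvidia/cuda:12.1.1-base-ubuntu22.04",  # illustrative image tag
    command="nvidia-smi",                   # nvidia-smi needs libnvidia-ml.so.1
    device_requests=[
        DeviceRequest(
            count=-1,  # all GPUs; torchx would pass the requested GPU count
            capabilities=[["compute", "utility"]],  # "utility" is the missing piece
        )
    ],
    remove=True,
)
print(output.decode())
```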

Module (check all that apply):

  • [ ] torchx.spec
  • [ ] torchx.component
  • [ ] torchx.apps
  • [ ] torchx.runtime
  • [ ] torchx.cli
  • [x] torchx.schedulers
  • [ ] torchx.pipelines
  • [ ] torchx.aws
  • [ ] torchx.examples
  • [ ] other

To Reproduce

Steps to reproduce the behavior:

Running a GPU training job through the local_docker scheduler results in a crash with:

train/0 [0]:[rank: 0] Global seed set to 10
train/0 [0]:Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:distributed_backend=nccl
train/0 [0]:All distributed processes registered. Starting with 1 processes
train/0 [0]:----------------------------------------------------------------------------------------------------
train/0 [0]:
train/0 [0]:Error executing job with overrides: ['++trainer.max_steps=10']
train/0 [0]:Traceback (most recent call last):
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
train/0 [0]:    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
train/0 [0]:    return function(*args, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
train/0 [0]:    self._run(model, ckpt_path=ckpt_path)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 939, in _run
train/0 [0]:    self.__setup_profiler()
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1069, in __setup_profiler
train/0 [0]:    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1192, in log_dir
train/0 [0]:    dirpath = self.strategy.broadcast(dirpath)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 292, in broadcast
train/0 [0]:    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]:    return func(*args, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2597, in broadcast_object_list
train/0 [0]:    broadcast(object_sizes_tensor, src=src, group=group)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
train/0 [0]:    return func(*args, **kwargs)
train/0 [0]:  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1900, in broadcast
train/0 [0]:    work = default_pg.broadcast([tensor], opts)
train/0 [0]:torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
train/0 [0]:ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
train/0 [0]:Last error:
train/0 [0]:Failed to open libnvidia-ml.so.1
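
The exact command was not included above; as a stand-in, the following hedged sketch isolates the same failure with the docker Python SDK by requesting only the compute capability (it assumes docker-py, nvidia-container-toolkit, and an NVIDIA GPU on the host; the image tag is illustrative):

```python
# Stand-in reproducer: with only the "compute" capability the container cannot
# load libnvidia-ml.so.1, matching the NCCL error above.
import docker
from docker.errors import ContainerError
from docker.types import DeviceRequest

client = docker.from_env()
try:
    client.containers.run(
        "python:3.10-slim",  # illustrative image; NVML must be mounted from the host
        command=["python3", "-c", "import ctypes; ctypes.CDLL('libnvidia-ml.so.1')"],
        device_requests=[DeviceRequest(count=-1, capabilities=[["compute"]])],
        remove=True,
    )
except ContainerError as err:
    print(err)  # OSError: libnvidia-ml.so.1: cannot open shared object file
```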

Expected behavior

libnvidia-ml and the other utility libraries should be added to the container.

Environment

  • torchx version (e.g. 0.1.0rc1):
  • Python version:
  • OS (e.g., Linux):
  • How you installed torchx (conda, pip, source, docker):
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:
PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17)
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.26

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.10.201-191.748.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             2699.588
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.02
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
...

Additional context

clumsy pushed a commit to clumsy/torchx that referenced this issue May 6, 2024
@clumsy (Contributor, Author) commented May 6, 2024

Please advise @kiukchung, @d4l3k

facebook-github-bot pushed a commit that referenced this issue May 7, 2024
Differential Revision: D57034038

Pull Request resolved: #907
@clumsy (Contributor, Author) commented May 8, 2024

Looks like this issue can be closed now that the fix has been merged, @andywag.
I still wonder whether we can remove device_request from local_docker entirely and let it default to compute,utility (see the sketch below).
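
For reference, a hedged sketch of that alternative using the docker Python SDK: request the GPUs with only the generic gpu capability and let nvidia-container-toolkit apply its default driver capabilities (compute,utility on recent versions, per the discussion above), optionally overriding them per container via the NVIDIA_DRIVER_CAPABILITIES environment variable. Whether older nvidia-container-runtime versions use the same defaults is the open question from the issue description.

```python
# Sketch of the "rely on defaults" alternative: no explicit compute/utility pinning.
# The NVIDIA_DRIVER_CAPABILITIES override below is commented out and purely illustrative.
import docker
from docker.types import DeviceRequest

client = docker.from_env()
output = client.containers.run(
    "python:3.10-slim",  # illustrative image
    command=["python3", "-c",
             "import ctypes; ctypes.CDLL('libnvidia-ml.so.1'); print('NVML loaded')"],
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    # environment={"NVIDIA_DRIVER_CAPABILITIES": "compute,utility"},
    remove=True,
)
print(output.decode())
```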

clumsy closed this as completed May 8, 2024