Describe the Bug
1. Bug details:
While a user is running [starccm+] on 3 of the 7 GPUs, paddle is pointed at an idle card and the process hangs.
Python 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0314 06:42:40.373276 10607 header_generator.cc:52] Unable to open file : /paddle/paddle/cinn/runtime/cuda/float16.h
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
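The report does not show how the idle card was selected. The sketch below is one common way to do it, offered only as an assumed reproduction step: hide the busy cards with `CUDA_VISIBLE_DEVICES` before importing paddle, then address the single visible card as `gpu:0`. The helper names `pin_visible_gpu` and `select_idle_gpu`, and any physical card index passed to them, are hypothetical.

```python
import os


def pin_visible_gpu(physical_index, env=None):
    """Return a copy of `env` with CUDA_VISIBLE_DEVICES limited to one GPU.

    The variable is read when CUDA is initialised, so setting it before
    `import paddle` in the same process is the safe choice.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return new_env


def select_idle_gpu(physical_index):
    """Assumed reproduction step: hide the busy cards, then use the visible one."""
    os.environ.update(pin_visible_gpu(physical_index))
    import paddle  # imported late so the environment variable takes effect

    paddle.device.set_device("gpu:0")  # index 0 among the *visible* devices
    return paddle.device.get_device()
```

Calling `select_idle_gpu(3)` inside the container would attempt the selection on physical card 3 (an arbitrary example index). If the hang already occurs at `import paddle`, the call never reaches `set_device`.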
strace output of the process (excerpt):
[pid 1832041] read(10, "rchar: 17688289\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10) = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=984884}, ru_stime={tv_sec=5, tv_usec=658737}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=997976000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 19/9035 10832\n", 1024) = 32
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17688778\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10) = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=985649}, ru_stime={tv_sec=5, tv_usec=658737}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=997963000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 19/9031 10832\n", 1024) = 32
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17689267\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10) = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986212}, ru_stime={tv_sec=5, tv_usec=658908}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=998007000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 20/9033 10832\n", 1024) = 32
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17689756\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10) = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986266}, ru_stime={tv_sec=5, tv_usec=659638}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=997967000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 19/9031 10832\n", 1024) = 32
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17690245\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10) = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10) = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986266}, ru_stime={tv_sec=5, tv_usec=660400}, ...}) = 0
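One possible reading of the trace above, not a confirmed diagnosis: thread 1832041 sleeps for about one second (`clock_nanosleep` with roughly 0.998 s), re-reads the same `/proc` and `/sys` files, calls `getrusage`, and repeats. That pattern looks like a periodic monitoring loop rather than GPU work. The sketch below totals the process CPU time reported by the five `getrusage(RUSAGE_SELF, ...)` lines to show how little CPU is consumed between iterations. The function names are illustrative only.

```python
import re

# Matches the user/system CPU time fields that strace prints for getrusage().
RUSAGE = re.compile(
    r"ru_utime=\{tv_sec=(\d+), tv_usec=(\d+)\}, "
    r"ru_stime=\{tv_sec=(\d+), tv_usec=(\d+)\}"
)


def total_cpu_us(line):
    """Return user + system CPU time in microseconds, or None if absent."""
    m = RUSAGE.search(line)
    if m is None:
        return None
    u_sec, u_usec, s_sec, s_usec = map(int, m.groups())
    return (u_sec + s_sec) * 1_000_000 + u_usec + s_usec


def cpu_deltas_us(lines):
    """CPU time consumed between consecutive getrusage() samples."""
    totals = [t for t in map(total_cpu_us, lines) if t is not None]
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


# The five getrusage lines copied from the trace, in order.
SAMPLE = [
    "getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=984884}, ru_stime={tv_sec=5, tv_usec=658737}, ...}) = 0",
    "getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=985649}, ru_stime={tv_sec=5, tv_usec=658737}, ...}) = 0",
    "getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986212}, ru_stime={tv_sec=5, tv_usec=658908}, ...}) = 0",
    "getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986266}, ru_stime={tv_sec=5, tv_usec=659638}, ...}) = 0",
    "getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986266}, ru_stime={tv_sec=5, tv_usec=660400}, ...}) = 0",
]

print(cpu_deltas_us(SAMPLE))  # [765, 734, 784, 762]
```

Each one-second iteration accounts for well under a millisecond of CPU time across the whole process, which is consistent with the process being blocked rather than busy. The trace does not show which call the main thread is blocked in.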
2. Additional Supplementary Information
Container:
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0rc1-gpu-cuda12.3-cudnn9.0-trt8.6
Paddle version:
paddlepaddle-gpu 3.0.0.dev20250310
Docker version:
Docker version 27.3.1, build ce12230