paddle hangs on A800 #71664

@wangguan1995

Describe the Bug

1. Bug details:

While the user's 【starccm+】 job occupies 3/7 GPUs, paddle is pointed at an idle card and the process hangs.

Python 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import paddle
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0314 06:42:40.373276 10607 header_generator.cc:52] Unable to open file : /paddle/paddle/cinn/runtime/cuda/float16.h
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
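To check whether the hang depends on which card is visible, one option is to run the import in a child process with a timeout. This is a minimal sketch, not part of the original report: the helper name `run_with_visible_gpus` and the GPU index in the usage note are hypothetical, and it assumes the idle card is selected through the standard `CUDA_VISIBLE_DEVICES` environment variable.

```python
import os
import subprocess
import sys


def run_with_visible_gpus(code, gpu_ids, timeout_s=60.0):
    """Run `code` in a fresh Python process that only sees `gpu_ids`.

    Returns (finished, output). `finished` is False when the child did not
    exit within `timeout_s` seconds, which suggests a hang.
    """
    env = dict(os.environ)
    # Restrict the CUDA runtime in the child to the listed device indices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False, ""
    return True, proc.stdout + proc.stderr
```

For example, `run_with_visible_gpus("import paddle; paddle.utils.run_check()", [3])` would probe a card with index 3 (a made-up index here); a `False` first element means the child was still running when the timeout expired.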

[pid 1832041] read(10, "rchar: 17688289\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10)                 = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=984884}, ru_stime={tv_sec=5, tv_usec=658737}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=997976000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 19/9035 10832\n", 1024) = 32
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17688778\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10)                 = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=985649}, ru_stime={tv_sec=5, tv_usec=658737}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=997963000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 19/9031 10832\n", 1024) = 32
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17689267\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10)                 = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986212}, ru_stime={tv_sec=5, tv_usec=658908}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=998007000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 20/9033 10832\n", 1024) = 32
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17689756\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10)                 = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986266}, ru_stime={tv_sec=5, tv_usec=659638}, ...}) = 0
[pid 1832041] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=997967000}, NULL) = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "10759 (python) S 10287 10759 102"..., 1024) = 308
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/fd", O_RDONLY|O_DIRECTORY) = 10
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 13 entries */, 512) = 312
[pid 1832041] getdents64(10, 0x7f00b0de4b44 /* 0 entries */, 512) = 0
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/statm", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "2053228 149934 95481 743 0 73501"..., 1024) = 36
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/loadavg", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 1832041] read(10, "20.85 18.73 18.23 19/9031 10832\n", 1024) = 32
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/proc/self/io", O_RDONLY) = 10
[pid 1832041] fstat(10, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
[pid 1832041] read(10, "rchar: 17690245\nwchar: 367314\nsy"..., 1024) = 107
[pid 1832041] close(10)                 = 0
[pid 1832041] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 10
[pid 1832041] read(10, "0-127\n", 8192) = 6
[pid 1832041] close(10)                 = 0
[pid 1832041] getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1, tv_usec=986266}, ru_stime={tv_sec=5, tv_usec=660400}, ...}) = 0
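
The trace above repeats the same group of `/proc` and `/sys` reads, separated by a `clock_nanosleep` of roughly 0.998 s, so this thread appears to be polling rather than making progress. A small sketch for summarizing a longer capture is shown here; the function name `count_opened_paths` and the file name in the usage note are hypothetical, and the regular expression only assumes the `openat(AT_FDCWD, "...")` line shape seen in this trace.

```python
import re
from collections import Counter

# Matches lines such as:
# [pid 1832041] openat(AT_FDCWD, "/proc/self/stat", O_RDONLY) = 10
OPENAT_RE = re.compile(r'openat\(AT_FDCWD, "([^"]+)"')


def count_opened_paths(strace_text):
    """Count how often each path appears in openat() calls of an strace log."""
    return Counter(OPENAT_RE.findall(strace_text))
```

Calling `count_opened_paths(open("trace.log").read()).most_common()` on a saved capture (here a hypothetical `trace.log`) lists the most frequently opened paths first, which makes a tight polling loop easy to spot.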

2. Additional Supplementary Information

Container:
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddle:3.0.0rc1-gpu-cuda12.3-cudnn9.0-trt8.6

paddle version
paddlepaddle-gpu 3.0.0.dev20250310

Docker version
Docker version 27.3.1, build ce12230
