Nightly regression: gfx1151/gfx1150 libhsa-runtime64 (7.14.0a20260609+) segfaults in GpuAgent::InitDma #5763

@jimw567

Summary

Starting with the 7.14.0a20260609 nightly tarball, libhsa-runtime64 segfaults during hsa_init on gfx1151 (Radeon 8060S, Strix Halo). The 7.14.0a20260608 nightly works fine. The crash happens deep in rocr::AMD::GpuAgent::InitDma() (control jumps to a bogus address 0x100000001), so any HIP program that enumerates devices dies immediately.

Affected builds

Tarballs are from https://therock-nightly-tarball.s3.amazonaws.com/therock-dist-linux-<target>-<version>.tar.gz. Builds are distinguished by the libamdhip64.so build hash, because the version string in the release notes is unreliable:

Nightly version    libamdhip64 build hash    gfx1151 result
7.14.0a20260607    d34cbb6409                works
7.14.0a20260608    d34cbb6409                works
7.14.0a20260609    1b2a555677                segfault
7.14.0a20260610    1b2a555677                segfault

The regression boundary is precisely June 8 → June 9.
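As a quick sanity check on the table, the boundary can be recomputed from the four observations. This is only an illustrative sketch: the data is copied from the rows above, and the helper names `regression_boundary` and `results_by_hash` are made up for this example.

```python
# Observations copied from the table above:
# (nightly version, libamdhip64 build hash, works on gfx1151?)
observations = [
    ("7.14.0a20260607", "d34cbb6409", True),
    ("7.14.0a20260608", "d34cbb6409", True),
    ("7.14.0a20260609", "1b2a555677", False),
    ("7.14.0a20260610", "1b2a555677", False),
]


def regression_boundary(rows):
    """Return (last_good, first_bad) for date-sorted rows, or None if there is no good-to-bad flip."""
    for prev, cur in zip(rows, rows[1:]):
        if prev[2] and not cur[2]:
            return prev, cur
    return None


def results_by_hash(rows):
    """Group outcomes by build hash to check that each hash maps to a single result."""
    grouped = {}
    for _version, build_hash, works in rows:
        grouped.setdefault(build_hash, set()).add(works)
    return grouped


last_good, first_bad = regression_boundary(observations)
print(last_good[0], "->", first_bad[0])
print(results_by_hash(observations))
```

Each build hash maps to exactly one outcome here, which is consistent with using the hash rather than the version string to tell builds apart.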

Environment

  • GPU: AMD Radeon 8060S Graphics, gfx1151 (Strix Halo). Also reproduced on gfx1150.
  • Kernel: 6.14.0-1019-oem (in-kernel amdgpu/amdkfd, no amdgpu-dkms), /dev/kfd present.
  • System ROCm: 7.14.0.
  • ROCm userspace under test: TheRock nightly S3 tarballs (userspace-only, no kernel components).

Reproduction

LD_LIBRARY_PATH=$PWD/lib ./llama-cli -m Qwen3-0.6B-Q4_0.gguf -ngl 99 -p "hi" -n 8 -no-cnv

(Any HIP program that calls hipGetDeviceCount / triggers hsa_init will crash the same way.)
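Since the crash is reported inside hsa_init, a smaller reproducer that skips llama.cpp and HIP may be possible. The sketch below is an assumption rather than something verified on affected hardware: it loads the runtime with ctypes and calls `hsa_init()` (and `hsa_shut_down()` on success) in a child process, so a segfault is reported as a signal instead of killing the caller. `hsa_init` and `hsa_shut_down` are public HSA runtime entry points; the helper names `probe_hsa_init` and `describe` are hypothetical.

```python
import signal
import subprocess
import sys

# Body of the child process: load the runtime given as argv[1] and call hsa_init().
# HSA_STATUS_SUCCESS is 0; any other status is reported as exit code 1.
CHILD_CODE = """
import ctypes, sys
lib = ctypes.CDLL(sys.argv[1])
lib.hsa_init.restype = ctypes.c_int
status = lib.hsa_init()
if status == 0:
    lib.hsa_shut_down()
sys.exit(0 if status == 0 else 1)
"""


def describe(returncode):
    """Translate a subprocess return code into a short verdict."""
    if returncode == 0:
        return "hsa_init succeeded"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "failed with exit code %d" % returncode


def probe_hsa_init(lib_path):
    """Call hsa_init() from lib_path in a child process and return the child's return code."""
    proc = subprocess.run([sys.executable, "-c", CHILD_CODE, lib_path])
    return proc.returncode
```

Assuming the tarball is extracted in the current directory, `describe(probe_hsa_init("./lib/libhsa-runtime64.so.1"))` would give the verdict for that build. LD_LIBRARY_PATH still needs to point at the tarball's lib directory so the runtime's own dependencies resolve from the same build.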

gdb backtrace

Crash in libhsa-runtime64.so.1 — control transfers to 0x100000001
  rocr::AMD::GpuAgent::InitDma()
  hsa_init
  hipGetDeviceCount
  ggml_cuda_init

Isolation

  • Swapping only libhsa-runtime64.so* from the 20260608 (or system 7.14.0) build into the 20260609 tarball fixes the crash.
  • Swapping only libamdhip64.so* does not fix it.

So the regression is isolated to libhsa-runtime64, between the 20260608 and 20260609 builds (libamdhip64 build hashes d34cbb6409 and 1b2a555677, respectively).
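For anyone repeating the swap experiment, it can be scripted roughly as follows. This is a minimal sketch, assuming both tarballs are extracted side by side and that the libraries live in each tree's lib directory; the function name `swap_library` and the directory arguments are hypothetical, and it is best run on a copy of the failing tree.

```python
import glob
import os
import shutil


def swap_library(good_lib_dir, bad_lib_dir, pattern="libhsa-runtime64.so*"):
    """Replace files matching pattern in bad_lib_dir with those from good_lib_dir.

    Symlinks are copied as symlinks so the usual .so -> .so.1 -> .so.1.x chain survives.
    Returns the sorted list of file names that were copied.
    """
    # Remove the suspect copies first so no stale versioned file is left behind.
    for old in glob.glob(os.path.join(bad_lib_dir, pattern)):
        os.remove(old)

    copied = []
    for src in sorted(glob.glob(os.path.join(good_lib_dir, pattern))):
        dst = os.path.join(bad_lib_dir, os.path.basename(src))
        shutil.copy2(src, dst, follow_symlinks=False)
        copied.append(os.path.basename(src))
    return copied
```

Calling it with `pattern="libamdhip64.so*"` instead covers the second experiment, where only the HIP runtime is swapped.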

Expectation

hsa_init should succeed on gfx1151/gfx1150 with the in-kernel amdkfd on kernel 6.14, as it did through 20260608.
