TSAN timeout in iree/async/platform/io_uring/cts/core_tests on ROCm Linux CI runner #19

@stellaraccident

The Linux TSAN CI configuration is consistently timing out in the IREE async io_uring CTS core test on the managed ROCm Linux runner.

Failing run:
https://github.com/ROCm/hrx-system/actions/runs/26672062101?pr=18

Failing job:
Linux / CMake / CI (TSAN, tsan, true, true, false) / TSAN

Runner/job details from the log:

  • Runner: aws-linux-scale-rocm-prod-vzddl-runner-9k42x
  • Workflow ref: refs/pull/18/merge
  • Merge commit: bcd69bc
  • Config:
    • HRX_SANITIZER=tsan
    • HRX_ASSERTIONS=true
    • HRX_BUILD_TYPE=RelWithDebInfo
    • HRX_TEST_GPU=false
    • HRX_PACKAGE=false
  • Test command:
ctest --test-dir /__w/hrx-system/hrx-system/build/linux/install/composed/share/hrx-system/tests \
    --output-on-failure \
    --parallel 48

Failure:

Total Test time (real) = 60.06 sec

The following tests FAILED:
    2 - iree/async/platform/io_uring/cts/core_tests (Timeout)

The timeout occurs inside:

CTS/TsanBridgeTest.MultiThreadReuse/io_uring

The last emitted test output before the 60s CTest timeout is:

[ RUN      ] CTS/TsanBridgeTest.SingleThreadReuse/io_uring_minimal
[       OK ] CTS/TsanBridgeTest.SingleThreadReuse/io_uring_minimal (0 ms)
[ RUN      ] CTS/TsanBridgeTest.MultiThreadReuse/io_uring
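
In a longer log, the hung case can be picked out by looking for `[ RUN ]` lines that never get a matching result line. The following is a minimal sketch of that idea in Python. The function name `unfinished_tests` is hypothetical and is not part of IREE or GoogleTest, and the regular expressions assume the standard GoogleTest progress format shown above.

```python
import re

# GoogleTest progress lines, e.g. "[ RUN      ] Suite.Case/param".
_RUN = re.compile(r"^\[ RUN\s+\] (\S+)")
# Result lines, e.g. "[       OK ] Suite.Case/param (0 ms)" or "[  FAILED  ] ...".
_DONE = re.compile(r"^\[\s+(?:OK|FAILED)\s+\] (\S+)")


def unfinished_tests(log_text):
    """Return tests that printed [ RUN ] but no [ OK ] / [ FAILED ] line."""
    started = []
    for raw in log_text.splitlines():
        line = raw.strip()
        m = _RUN.match(line)
        if m:
            started.append(m.group(1))
            continue
        m = _DONE.match(line)
        if m:
            # Parameterized failures may print "Name, where GetParam() = ...".
            name = m.group(1).rstrip(",")
            if name in started:
                started.remove(name)
    return started
```

Applied to the three log lines above, this returns only `CTS/TsanBridgeTest.MultiThreadReuse/io_uring`.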

Expected behavior:
iree/async/platform/io_uring/cts/core_tests should complete under TSAN, or the specific TSAN-incompatible io_uring case should be disabled with an issue reference.

Local investigation:

  • Built the same TSAN configuration locally using ROCm /srv/vm-shared/shared/rocm-7.14.0a20260527.
  • Local host: Fedora 43, Linux 6.19.6-200.fc43.x86_64, 192 logical CPUs.
  • Full local TSAN installed-test run used --parallel 96; it did not reproduce this timeout. io_uring/cts/core_tests completed.
  • Repeated isolated local runs of iree/async/platform/io_uring/cts/core_tests also completed.
  • Repeated local runs of the whole iree/async/platform/io_uring group completed.
  • Local TSAN did expose unrelated ROCr/HSA TSAN reports in AMDGPU tests on the GPU host, but those are distinct from this CI timeout.
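
For anyone repeating the isolated runs, the loop can be scripted with a per-run timeout so that a hang is recorded rather than blocking the shell. This is only a sketch: `run_repeatedly` and `SELF_CHECK_CMD` are hypothetical names, and the 60 second value in the usage comment mirrors the CTest timeout seen in CI rather than anything the test itself defines.

```python
import subprocess
import sys

# Trivial command for trying the helper without a build tree (assumption:
# replace with the real test invocation when reproducing).
SELF_CHECK_CMD = [sys.executable, "-c", "pass"]


def run_repeatedly(cmd, runs, timeout_s):
    """Run cmd `runs` times and return 'ok', 'fail' or 'timeout' per run."""
    outcomes = []
    for _ in range(runs):
        try:
            proc = subprocess.run(
                cmd,
                timeout=timeout_s,  # child is killed when this expires
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            outcomes.append("ok" if proc.returncode == 0 else "fail")
        except subprocess.TimeoutExpired:
            outcomes.append("timeout")
    return outcomes


# Example usage (paths are placeholders, not taken from the CI log):
#   cmd = ["ctest", "--test-dir", "<tests dir>",
#          "-R", "io_uring/cts/core_tests", "--output-on-failure"]
#   print(run_repeatedly(cmd, runs=50, timeout_s=60))
```

Counting the `timeout` entries gives a rough reproduction rate that can be compared between the local host and the CI runner.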

Current hypothesis:
This is likely a Linux/kernel/runtime interaction specific to the current ROCm CI runner environment for io_uring under TSAN. The hang is in the TSAN bridge test’s MultiThreadReuse/io_uring parameterization, not in the whole TSAN suite generally.
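
One way to test this hypothesis would be to capture the same host facts on the CI runner and on the local machine and diff them. The sketch below is an assumption about which facts might matter (kernel release, CPU count, the locked-memory limit, and the `kernel.io_uring_disabled` sysctl found on newer kernels). The issue does not establish that any of them is the cause, and `io_uring_env_snapshot` is a hypothetical helper name.

```python
import os
import platform
import resource


def _read_or_none(path):
    """Return the stripped file contents, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def io_uring_env_snapshot():
    """Collect host facts that may differ between the CI runner and a dev box."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return {
        "kernel_release": platform.release(),
        "logical_cpus": os.cpu_count(),
        "memlock_soft": soft,
        "memlock_hard": hard,
        # Absent on older kernels; None then means "no such sysctl here".
        "io_uring_disabled": _read_or_none("/proc/sys/kernel/io_uring_disabled"),
    }
```

Printing the returned dictionary in a CI step and again locally would show at a glance whether the two environments differ in any of these respects.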
