Problem Description
The PyTorch team is trying to migrate from Roctracer to Rocprofiler-sdk, but we are running into an issue where internal tests fail because of a segfault during initialization. Link to PR
During init, rocprofiler_configure is called correctly and begins toolInit. During this init, a call to rocprofiler_query_available_agents is made here, which causes a segfault.
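For reference, the registration path looks roughly like the sketch below. This is our own minimal illustration modeled on the rocprofiler-sdk sample tools, not the actual PyTorch tool code; tool_init, tool_fini, and the client name are placeholders, and the agent query that crashes is only indicated by a comment.

#include <rocprofiler-sdk/registration.h>
#include <rocprofiler-sdk/rocprofiler.h>

namespace {
// Placeholder for the tool's real init; in the PyTorch tool this is where
// rocprofiler_query_available_agents is called and where the crash occurs.
int tool_init(rocprofiler_client_finalize_t /*fini_func*/, void* /*tool_data*/) {
  // ... rocprofiler_query_available_agents(...) would be called here ...
  return 0;
}

void tool_fini(void* /*tool_data*/) {}
}  // namespace

extern "C" rocprofiler_tool_configure_result_t* rocprofiler_configure(
    uint32_t /*version*/,
    const char* /*runtime_version*/,
    uint32_t /*priority*/,
    rocprofiler_client_id_t* id) {
  id->name = "pytorch-rocprofiler-example";  // illustrative client name
  static auto cfg = rocprofiler_tool_configure_result_t{
      sizeof(rocprofiler_tool_configure_result_t), &tool_init, &tool_fini,
      nullptr};
  return &cfg;
}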
To isolate this issue, we copied the get_gpu_device_agents routine into a standalone script and ran it, which still produces the same segfault. The backtrace of the crash is posted below:
(lldb) bt
* thread #1, name = 'hip_example', stop reason = signal SIGSEGV: address not mapped to object (fault address=0x0)
  * frame #0: 0x00007ffede7b6698 libc.so.6`__strlen_evex at strlen-evex.S:77
    frame #1: 0x00007ffedea3be3b librocprofiler-sdk.so.0`rocprofiler::agent::(anonymous namespace)::read_topology() + 9483
    frame #2: 0x00007ffedea3d58d librocprofiler-sdk.so.0`rocprofiler::agent::get_agents() + 365
    frame #3: 0x00007ffedea3ec4f librocprofiler-sdk.so.0`rocprofiler_query_available_agents + 79
    frame #4: 0x000000000021ffd1 hip_example`main at simple_hip.cpp:48
    frame #5: 0x00007ffede62c657 libc.so.6`__libc_start_call_main(main=(hip_example`main at simple_hip.cpp:19), argc=1, argv=0x00007fffffffd3b8) at libc_start_call_main.h:58:16
    frame #6: 0x00007ffede62c718 libc.so.6`__libc_start_main_impl(main=(hip_example`main at simple_hip.cpp:19), argc=1, argv=0x00007fffffffd3b8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffd3a8) at libc-start.c:409:3
    frame #7: 0x000000000021fe71 hip_example`_start at start.S:116
Based on this, it looks like the issue is occurring in the read_topology() function defined here. From talks with the Rocprofiler team, it seems this routine works internally at AMD. My suspicion is that read_topology() makes some assumption that holds in the AMD environment but not in the one we are running in. This is preventing us from migrating to Rocprofiler-sdk, so if any additional information is needed, please let us know!
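Since read_topology() appears to parse the KFD topology under /sys/class/kfd/kfd/topology/nodes and the faulting frame is strlen on a null pointer, one way to compare environments is to dump what that sysfs tree actually contains on the failing machine. The sketch below is our own diagnostic, not part of rocprofiler-sdk, and the particular per-node files it inspects (name, gpu_id, properties) are an assumption about what the SDK consumes.

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
  namespace fs = std::filesystem;
  const fs::path nodes{"/sys/class/kfd/kfd/topology/nodes"};
  if (!fs::exists(nodes)) {
    std::cerr << nodes << " does not exist\n";
    return 1;
  }
  for (const auto& node : fs::directory_iterator{nodes}) {
    std::cout << "node " << node.path().filename().string() << "\n";
    // Files the SDK plausibly reads per node (assumption); print the first
    // line of each so empty or missing entries stand out.
    for (const char* file : {"name", "gpu_id", "properties"}) {
      const auto p = node.path() / file;
      std::string shown;
      if (!fs::exists(p)) {
        shown = "<missing>";
      } else {
        std::ifstream ifs{p};
        std::getline(ifs, shown);
        if (shown.empty()) shown = "<empty>";
      }
      std::cout << "  " << file << ": " << shown << "\n";
    }
  }
  return 0;
}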
Operating System
CentOS Stream Version 9
CPU
AMD EPYC 9654 96-Core Processor
GPU
AMD Instinct MI300X
ROCm Version
6.2.1
ROCm Component
rocprofiler
Steps to Reproduce
#include <hip/hip_runtime.h>
#include <rocprofiler-sdk/context.h>
#include <rocprofiler-sdk/cxx/name_info.hpp>
#include <rocprofiler-sdk/fwd.h>
#include <rocprofiler-sdk/marker/api_id.h>
#include <rocprofiler-sdk/registration.h>
#include <rocprofiler-sdk/rocprofiler.h>

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  int deviceCount = 0;
  std::vector<rocprofiler_agent_v0_t> agents;

  // Callback used by rocprofiler_query_available_agents to return the
  // agents on the system. This can include CPU agents as well; for this
  // standalone repro we keep every agent (the GPU-only filter, i.e.
  // type == ROCPROFILER_AGENT_TYPE_GPU, is left commented out below).
  rocprofiler_query_available_agents_cb_t iterate_cb =
      [](rocprofiler_agent_version_t agents_ver,
         const void** agents_arr,
         size_t num_agents,
         void* udata) {
        if (agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
          throw std::runtime_error{"unexpected rocprofiler agent version"};
        auto* agents_v =
            static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
        for (size_t i = 0; i < num_agents; ++i) {
          const auto* agent =
              static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
          // if (agent->type == ROCPROFILER_AGENT_TYPE_GPU)
          //   agents_v->emplace_back(*agent);
          agents_v->emplace_back(*agent);
        }
        return ROCPROFILER_STATUS_SUCCESS;
      };

  // Query the agents; only a single callback is made, and it contains an
  // array of all agents.
  rocprofiler_query_available_agents(
      ROCPROFILER_AGENT_INFO_VERSION_0,
      iterate_cb,
      sizeof(rocprofiler_agent_t),
      const_cast<void*>(static_cast<const void*>(&agents)));

  std::cout << "Main function\n";
  hipError_t err = hipGetDeviceCount(&deviceCount);
  std::cout << "got device count\n";
  if (err != hipSuccess) {
    std::cerr << "Failed to get device count: " << hipGetErrorString(err)
              << std::endl;
    return 1;
  }
  std::cout << "Number of HIP devices: " << deviceCount << std::endl;
  return 0;
}
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Thanks for reporting the issue! I gave it a try on an MI300X system on both the latest ROCm version (6.4.1) and the version you're running (6.2.1), and haven't been able to reproduce the issue. Could you please provide the following information:
Where you are running the reproducer (bare-metal or a container)
Your kernel and dkms version
How you are compiling the reproducer (hipcc test.cpp -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lrocprofiler-sdk)
Output of rocminfo
Output of rocm-smi
The output of this python script that prints the node info: printNodes.txt
Also, could you try running in the latest pytorch container: docker pull rocm/pytorch:latest? Thanks!
@darren-amd It looks like the previous machine that was encountering this issue was deprovisioned for faulty hardware. Let me try running this on another machine and if it still has issues I will provide the information you asked for.