Problem Description
The PyTorch team is trying to migrate from Roctracer to Rocprofiler-sdk, but we are running into an issue where internal tests fail because of a segfault during initialization. Link to PR
During init, rocprofiler_configure is called correctly and begins toolInit. During this init, a call to rocprofiler_query_available_agents is made here, which causes a segfault.
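For reference, the registration path looks roughly like the sketch below. This is our own minimal illustration modeled on the rocprofiler-sdk sample tools, not the actual PyTorch tool code; tool_init, tool_fini, and the client name are placeholders, and the agent query that crashes is only indicated by a comment.

#include <rocprofiler-sdk/registration.h>
#include <rocprofiler-sdk/rocprofiler.h>

namespace {
// Placeholder for the tool's real init; in the PyTorch tool this is where
// rocprofiler_query_available_agents is called and where the crash occurs.
int tool_init(rocprofiler_client_finalize_t /*fini_func*/, void* /*tool_data*/) {
  // ... rocprofiler_query_available_agents(...) would be called here ...
  return 0;
}

void tool_fini(void* /*tool_data*/) {}
}  // namespace

extern "C" rocprofiler_tool_configure_result_t* rocprofiler_configure(
    uint32_t /*version*/,
    const char* /*runtime_version*/,
    uint32_t /*priority*/,
    rocprofiler_client_id_t* id) {
  id->name = "pytorch-rocprofiler-example";  // illustrative client name
  static auto cfg = rocprofiler_tool_configure_result_t{
      sizeof(rocprofiler_tool_configure_result_t), &tool_init, &tool_fini,
      nullptr};
  return &cfg;
}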
To isolate this issue, we copied the get_gpu_device_agents routine into a standalone script and ran it, which still produces the same segfault. The backtrace of the crash is posted below:
(lldb) bt
* thread #1, name = 'hip_example', stop reason = signal SIGSEGV: address not mapped to object (fault address=0x0)
  * frame #0: 0x00007ffede7b6698 libc.so.6`__strlen_evex at strlen-evex.S:77
    frame #1: 0x00007ffedea3be3b librocprofiler-sdk.so.0`rocprofiler::agent::(anonymous namespace)::read_topology() + 9483
    frame #2: 0x00007ffedea3d58d librocprofiler-sdk.so.0`rocprofiler::agent::get_agents() + 365
    frame #3: 0x00007ffedea3ec4f librocprofiler-sdk.so.0`rocprofiler_query_available_agents + 79
    frame #4: 0x000000000021ffd1 hip_example`main at simple_hip.cpp:48
    frame #5: 0x00007ffede62c657 libc.so.6`__libc_start_call_main(main=(hip_example`main at simple_hip.cpp:19), argc=1, argv=0x00007fffffffd3b8) at libc_start_call_main.h:58:16
    frame #6: 0x00007ffede62c718 libc.so.6`__libc_start_main_impl(main=(hip_example`main at simple_hip.cpp:19), argc=1, argv=0x00007fffffffd3b8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffd3a8) at libc-start.c:409:3
    frame #7: 0x000000000021fe71 hip_example`_start at start.S:116
Based on this, it looks like the issue is occurring in the read_topology() function defined here. From talks with the Rocprofiler team, it seems this routine works internally at AMD. My suspicion is that read_topology() makes some assumption that holds in the AMD environment but not in the one we are running in. This is preventing us from migrating to Rocprofiler-sdk, so if any additional information is needed, please let us know!
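Since read_topology() appears to parse the KFD topology under /sys/class/kfd/kfd/topology/nodes and the faulting frame is strlen on a null pointer, one way to compare environments is to dump what that sysfs tree actually contains on the failing machine. The sketch below is our own diagnostic, not part of rocprofiler-sdk, and the particular per-node files it inspects (name, gpu_id, properties) are an assumption about what the SDK consumes.

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

int main() {
  namespace fs = std::filesystem;
  const fs::path nodes{"/sys/class/kfd/kfd/topology/nodes"};
  if (!fs::exists(nodes)) {
    std::cerr << nodes << " does not exist\n";
    return 1;
  }
  for (const auto& node : fs::directory_iterator{nodes}) {
    std::cout << "node " << node.path().filename().string() << "\n";
    // Files the SDK plausibly reads per node (assumption); print the first
    // line of each so empty or missing entries stand out.
    for (const char* file : {"name", "gpu_id", "properties"}) {
      const auto p = node.path() / file;
      std::string shown;
      if (!fs::exists(p)) {
        shown = "<missing>";
      } else {
        std::ifstream ifs{p};
        std::getline(ifs, shown);
        if (shown.empty()) shown = "<empty>";
      }
      std::cout << "  " << file << ": " << shown << "\n";
    }
  }
  return 0;
}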
Operating System
CentOS Stream Version 9
CPU
AMD EPYC 9654 96-Core Processor
GPU
AMD Instinct MI300X
ROCm Version
6.2.1
ROCm Component
rocprofiler
Steps to Reproduce
#include <hip/hip_runtime.h>
#include <rocprofiler-sdk/context.h>
#include <rocprofiler-sdk/cxx/name_info.hpp>
#include <rocprofiler-sdk/fwd.h>
#include <rocprofiler-sdk/marker/api_id.h>
#include <rocprofiler-sdk/registration.h>
#include <rocprofiler-sdk/rocprofiler.h>

#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
  int deviceCount = 0;
  std::vector<rocprofiler_agent_v0_t> agents;

  // Callback used by rocprofiler_query_available_agents to return the
  // agents on the system. This can include CPU agents as well; for this
  // standalone repro we keep every agent (the GPU-only filter, i.e.
  // type == ROCPROFILER_AGENT_TYPE_GPU, is left commented out below).
  rocprofiler_query_available_agents_cb_t iterate_cb =
      [](rocprofiler_agent_version_t agents_ver,
         const void** agents_arr,
         size_t num_agents,
         void* udata) {
        if (agents_ver != ROCPROFILER_AGENT_INFO_VERSION_0)
          throw std::runtime_error{"unexpected rocprofiler agent version"};
        auto* agents_v =
            static_cast<std::vector<rocprofiler_agent_v0_t>*>(udata);
        for (size_t i = 0; i < num_agents; ++i) {
          const auto* agent =
              static_cast<const rocprofiler_agent_v0_t*>(agents_arr[i]);
          // if (agent->type == ROCPROFILER_AGENT_TYPE_GPU)
          //   agents_v->emplace_back(*agent);
          agents_v->emplace_back(*agent);
        }
        return ROCPROFILER_STATUS_SUCCESS;
      };

  // Query the agents; only a single callback is made, and it contains an
  // array of all agents.
  rocprofiler_query_available_agents(
      ROCPROFILER_AGENT_INFO_VERSION_0,
      iterate_cb,
      sizeof(rocprofiler_agent_t),
      const_cast<void*>(static_cast<const void*>(&agents)));

  std::cout << "Main function\n";
  hipError_t err = hipGetDeviceCount(&deviceCount);
  std::cout << "got device count\n";
  if (err != hipSuccess) {
    std::cerr << "Failed to get device count: " << hipGetErrorString(err)
              << std::endl;
    return 1;
  }
  std::cout << "Number of HIP devices: " << deviceCount << std::endl;
  return 0;
}
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
Thanks for reporting the issue! I gave it a try on an MI300X system on both the latest ROCm version (6.4.1) and the version you're running (6.2.1), and haven't been able to reproduce the issue. Could you please provide the following information:
Where you are running the reproducer (bare-metal or a container)
Your kernel and dkms version
How you are compiling the reproducer (hipcc test.cpp -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lrocprofiler-sdk)
Output of rocminfo
Output of rocm-smi
The output of this python script that prints the node info: printNodes.txt
Also, could you try running in the latest pytorch container: docker pull rocm/pytorch:latest? Thanks!
@darren-amd It looks like the previous machine that was encountering this issue was deprovisioned for faulty hardware. Let me try running this on another machine and if it still has issues I will provide the information you asked for.