NVBit-1.7
NVBit 1.7 contains many changes, to both the NVBit core and the NVBit tools, to support CUDA 12. Please read the change log carefully and follow the migration guide to port your pre-CUDA 12 NVBit tools to this release; otherwise your tools are very likely not to work in a CUDA 12 environment.
Changes and migration guide:
- Added Orin (`SM_87`), Ada Lovelace (`SM_89`), and Hopper (`SM_90`) support.
- Due to a potential deadlock during application initialization, NVBit disables module lazy loading by default: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#possible-issues-when-adopting-lazy-loading. If desired, users can set NO_EAGER_LOAD=1 to enable module lazy loading.
- NVBit tools can no longer use syscalls in instrumentation functions, so printf() and assert() are no longer allowed in injected functions. Any use of printf() or assert() will prevent your tool from loading and cause an application error. As a result, the mem_printf example has been removed. Tool writers now need to format and transfer their messages on their own. A skeleton example is provided as mem_printf2, which is built on top of mem_trace and requires tool writers to add a string formatter.
- Revised nvbit_at_ctx_init()/nvbit_at_ctx_term() callback rules:
a. CUDA API calls are no longer allowed in the nvbit_at_ctx_init() callback function; please use them in the new nvbit_tool_init() callback function instead. With CUDA 12+, the CUDA driver already holds a lock at context creation time, when nvbit_at_ctx_init() is invoked, and CUDA API calls take that same lock. In contrast, nvbit_tool_init() is invoked before the first CUDA kernel launch, without the lock held. Failure to make this change will result in your tool deadlocking. NVBit will warn you about this change; set ACK_CTX_INIT_LIMITATION=1 to acknowledge and disable the warning.
b. Launching a kernel or allocating device or managed memory is no longer allowed in the nvbit_at_ctx_term() callback function, due to a similar locking issue. Failure to make this change will result in your tool deadlocking.
- Rewrote the mem_trace example to adapt to the CUDA 12 changes by following the new nvbit_at_ctx_init()/nvbit_at_ctx_term() callback rules above. Please read the changes in mem_trace carefully and adapt your tool accordingly if it uses ChannelDev and ChannelHost from utils/channel.hpp.
- Added support for cudaLaunchKernelEx (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1gb9c891eb6bb8f4089758e64c9c976db9) API for all tools. Please update your tools to catch all kernel launches during instrumentation.
- NVBit tools are now compiled with arch=all by default so that they can run on all GPU architectures. To reduce tool compilation time and binary size, run `make ARCH=sm_XX` if you plan to run your tool only on the sm_XX GPU architecture.
- ppc64le support is dropped.
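
Since printf() is no longer available in injected functions, one possible approach is to have the device side fill a fixed-size record and do all string formatting on the host after the record is received. The following is a minimal sketch of such a host-side formatter. The `MsgRecord` layout and the `format_record` name are assumptions made for illustration; they are not the types used by mem_trace or mem_printf2.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical record that a device-side instrumentation function could fill
// in and send to the host. The field layout is an assumption for this sketch,
// not the struct defined by mem_trace or mem_printf2.
struct MsgRecord {
    uint32_t cta_id_x;   // block index (x) of the reporting thread
    uint32_t warp_id;    // warp index within the block
    uint32_t opcode_id;  // tool-defined opcode identifier
    uint64_t addr;       // memory address touched by the instruction
};

// Host-side formatter: converts one received record into a printable line.
// All formatting happens on the host, so no printf() is needed on the device.
std::string format_record(const MsgRecord& r) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "CTA %u warp %u opcode %u addr 0x%016llx",
                  r.cta_id_x, r.warp_id, r.opcode_id,
                  static_cast<unsigned long long>(r.addr));
    return std::string(buf);
}
```

In a real tool, the host receiving thread would call a formatter like this on each record it receives and print or log the resulting string.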