
Implement CUDA EP Plugin profiling API #28216

Open
yuslepukhin wants to merge 9 commits into main from yuslepukhin/cuda_ep_plugin_profiling

Conversation

@yuslepukhin
Member

This pull request adds support for CUPTI-based GPU profiling to the CUDA plugin execution provider (EP) in ONNX Runtime. Profiling is now available in the plugin EP when built with the onnxruntime_ENABLE_CUDA_PROFILING CMake flag, enabling detailed GPU activity tracing and integration with ORT's profiling system. The implementation introduces a new CudaPluginEpProfiler that bridges between ORT's profiling API and CUPTI, and updates the build system, plugin interface, and documentation accordingly.

CUDA Plugin Profiling Integration:

  • Added a new CudaPluginEpProfiler class (cuda_profiler_plugin.h/.cc) that implements the OrtEpProfilerImpl interface, delegates to a CUPTIManager singleton for GPU activity tracing, and provides callbacks for the profiling lifecycle and event correlation.
  • Updated the plugin EP interface in cuda_ep.h/cuda_ep.cc to conditionally provide a CreateProfilerImpl callback when profiling is enabled, wiring up the new profiler implementation.
  • Modified the CMake build (onnxruntime_providers_cuda_plugin.cmake) to conditionally link against CUDA::cupti and define the necessary compile-time flags for profiling support.

Documentation Updates:

  • Expanded the design documentation (cuda_plugin_ep_design.md) to describe the profiling and observability architecture, CUPTI integration, correlation ID flow, event collection, and differences from the in-tree CUDA EP profiler. Build configuration and relevant source files are also documented.

Miscellaneous:

  • Included the new profiler header in the plugin EP implementation.
  • Minor test and import adjustments (e.g., test_cuda_plugin_ep.py).

These changes enable the CUDA plugin EP to participate fully in ORT's profiling system, allowing users to observe GPU kernel and memory activity in conjunction with CPU-side events when profiling is enabled.

Contributor

@github-actions bot left a comment


You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Fixed
Contributor

Copilot AI left a comment


Pull request overview

This PR adds CUPTI-backed GPU profiling support to the CUDA plugin Execution Provider so GPU kernel/memcpy activity can be emitted into ONNX Runtime’s profiling JSON when onnxruntime_ENABLE_CUDA_PROFILING is enabled.

Changes:

  • Introduces a plugin-side CudaPluginEpProfiler implementing OrtEpProfilerImpl, using CUPTIManager to collect GPU activity and report it via OrtProfilingEventsContainer.
  • Wires CudaEp::CreateProfiler in the CUDA plugin EP behind ENABLE_CUDA_PROFILING.
  • Updates the CUDA plugin CMake to link CUDA::cupti and adds a Python test + design doc updates for profiling.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • onnxruntime/test/python/transformers/test_cuda_plugin_ep.py: Adds a session profiling test that validates basic trace JSON structure and (when enabled) checks for GPU “Kernel” events/metadata.
  • onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h: Declares CudaPluginEpProfiler (plugin-side OrtEpProfilerImpl).
  • onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc: Implements the profiling lifecycle, CUPTI correlation, and event conversion to Ort::ProfilingEvent.
  • onnxruntime/core/providers/cuda/plugin/cuda_ep.h: Adds the CreateProfilerImpl declaration behind ENABLE_CUDA_PROFILING.
  • onnxruntime/core/providers/cuda/plugin/cuda_ep.cc: Wires the CreateProfiler callback and implements CreateProfilerImpl.
  • docs/cuda_plugin_ep/cuda_plugin_ep_design.md: Documents the profiling/observability architecture and build configuration.
  • cmake/onnxruntime_providers_cuda_plugin.cmake: Conditionally links CUPTI and defines compile-time flags for the profiling build.


Comment thread onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc Outdated
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Outdated
Comment thread docs/cuda_plugin_ep/cuda_plugin_ep_design.md Outdated
@yuslepukhin yuslepukhin requested a review from tianleiwu April 24, 2026 18:33
Contributor

@tianleiwu tianleiwu left a comment


Wiring of OrtEp::CreateProfiler and the OrtEpProfilerImpl callbacks looks correct: C-API boundary handling (EXCEPTION_TO_STATUS, null-checks, *profiler = nullptr on entry) and the TimePoint{} trick in StartEvent to avoid the double epoch-offset from PushCorrelation are right. Two items worth addressing before relying on this as validated CUPTI support, plus one pre-existing issue worth tracking.

Pre-existing issue worth surfacing

onnxruntime/core/providers/cuda/cupti_manager.cc ProcessActivityBuffers subtracts a CPU-epoch start_time_ns from kernel->start without calling NormalizeGPUTimestampToCPUEpoch, while the memcpy branch does normalize. CUPTI activity timestamps are in the CUPTI/GPU timestamp domain, so the kernel branch is computing ts from mixed domains. This PR is the first consumer from the plugin path, and the new Python test does not check that kernel ts/dur land near the corresponding ORT node event, so a misaligned timeline would still pass. Please either normalize both branches consistently (preferred), or add a timeline-plausibility assertion in the new test. cupti_manager.cc is outside this PR's diff so I am mentioning it here rather than inline.

Inline concerns

See inline comments on the CMake compile-definition scope and on the test that treats zero-kernel-events as success.

Comment thread cmake/onnxruntime_providers_cuda_plugin.cmake Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread docs/cuda_plugin_ep/cuda_plugin_ep_design.md
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.



Comment thread onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc
Comment thread cmake/onnxruntime_providers_cuda_plugin.cmake
Comment thread onnxruntime/core/providers/cuda/cupti_manager.h
Comment thread onnxruntime/core/providers/cuda/cupti_manager.cc
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up fixes here. The CMake USE_CUDA coupling, output-parameter handling, and doc wording from the earlier review look addressed on the current head.

I found one remaining validation gap: the new CUPTI-backed CUDA plugin profiling path still does not appear to be compiled or exercised by the plugin CI jobs, so the PR can stay green while the feature-specific code path is broken. Please wire one CUDA plugin validation path to build with profiling enabled and run the env-gated assertion.

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
@tianleiwu
Contributor

There is a build error: cuda_profiler_plugin.h:9:10: fatal error: cupti_manager.h: No such file or directory
The CMake settings need to be updated to add the CUPTI include path to the CUDA plugin EP build.

Contributor

@tianleiwu tianleiwu left a comment


LGTM

Contributor

@tianleiwu tianleiwu left a comment


Clean implementation that correctly reuses existing CUPTI infrastructure, properly handles correlation IDs across the C API boundary, and includes thorough testing and documentation.

Positives:

  • Good delegation to CUPTIManager — the plugin profiler doesn't duplicate CUPTI buffer management or correlation tracking.
  • Correct use of TimePoint{} (epoch) in PushCorrelation to avoid double-offsetting, since the bridge already converts relative ORT event IDs to absolute epoch-based IDs.
  • Conditional CreateProfiler = nullptr when profiling is disabled explicitly communicates intent.
  • CI workflows now enable --enable_cuda_profiling and set ORT_CUDA_PROFILING_ENABLED=1 on both Linux and Windows.
  • Design doc Section 14 is thorough, especially the in-tree vs plugin comparison table (14.5).

Note (pre-existing, not introduced by this PR): In CUPTIManager::ProcessActivityBuffers, kernel events use raw kernel->start (CUPTI clock domain) while memcpy events correctly call NormalizeGPUTimestampToCPUEpoch(mmcpy->start). Since the plugin now inherits this behavior, consider documenting it in the design doc's known-limitations section.

Suggestions and nitpicks are inline — all non-blocking.

struct CudaPluginEpProfiler : OrtEpProfilerImpl {
  const OrtEpApi& ep_api;
  uint64_t client_handle_ = 0;
  int64_t ep_profiling_start_offset_ns_ = 0;
Contributor


Suggestion: ep_profiling_start_offset_ns_ is stored in StartProfilingImpl but never read afterward — only ort_profiling_start_ (derived from it) is actually used in EndProfilingImplConsume(). Same for start_time_point_. If these aren't reserved for future use, removing them avoids confusion.

Also, since all access goes through the static *Impl methods that static_cast<CudaPluginEpProfiler*>, consider making the data members private (switch to class with the static methods as friend or add accessors) to prevent accidental external mutation.


// Reconstruct the approximate ORT profiling start time so that GPU event
// timestamps (computed by CUPTIManager::Consume) are relative to ORT's start.
self->ort_profiling_start_ = self->start_time_point_ -
Contributor


Nit: The reconstructed ort_profiling_start_ equals profiling_start_time + (plugin_now - bridge_now), so the skew equals the cross-DLL call latency (typically < 1µs). Acceptable for profiling, but a one-line comment noting the approximation would help readers understand why this isn't an exact reconstruction.

if (onnxruntime_ENABLE_CUDA_PROFILING)
  target_link_libraries(onnxruntime_providers_cuda_plugin PRIVATE CUDA::cupti)
  target_compile_definitions(onnxruntime_providers_cuda_plugin PRIVATE ENABLE_CUDA_PROFILING)
endif()
Contributor


Nitpick: trailing whitespace on the blank line after endif().

session_end_us = max(e["ts"] + e["dur"] for e in cpu_events)
# Allow a small margin for GPU-side clock skew (100ms).
margin_us = 100_000
for event in kernel_events:
Contributor


Nitpick: 100ms margin is quite generous for a simple MatMul — GPU kernels should complete well under 1ms. A tighter margin (e.g., 10ms) would catch gross timestamp-domain errors more reliably while still accommodating clock skew on slow CI machines. Though this is a judgment call — erring toward no flaky tests is reasonable.
