
Implement CUDA EP Plugin profiling API #28216

Open
yuslepukhin wants to merge 9 commits into main from yuslepukhin/cuda_ep_plugin_profiling

Conversation

@yuslepukhin
Member

This pull request adds support for CUPTI-based GPU profiling to the CUDA plugin execution provider (EP) in ONNX Runtime. Profiling is now available in the plugin EP when built with the onnxruntime_ENABLE_CUDA_PROFILING CMake flag, enabling detailed GPU activity tracing and integration with ORT's profiling system. The implementation introduces a new CudaPluginEpProfiler that bridges between ORT's profiling API and CUPTI, and updates the build system, plugin interface, and documentation accordingly.

CUDA Plugin Profiling Integration:

  • Added a new CudaPluginEpProfiler class (cuda_profiler_plugin.h/.cc) that implements the OrtEpProfilerImpl interface, delegates to a CUPTIManager singleton for GPU activity tracing, and provides callbacks for the profiling lifecycle and event correlation.
  • Updated the plugin EP interface in cuda_ep.h/cuda_ep.cc to conditionally provide a CreateProfilerImpl callback when profiling is enabled, wiring up the new profiler implementation.
  • Modified the CMake build (onnxruntime_providers_cuda_plugin.cmake) to conditionally link against CUDA::cupti and define the necessary compile-time flags for profiling support.

Documentation Updates:

  • Expanded the design documentation (cuda_plugin_ep_design.md) to describe the profiling and observability architecture, CUPTI integration, correlation ID flow, event collection, and differences from the in-tree CUDA EP profiler. Build configuration and relevant source files are also documented.

Miscellaneous:

  • Included the new profiler header in the plugin EP implementation.
  • Minor test and import adjustments (e.g., test_cuda_plugin_ep.py).

These changes enable the CUDA plugin EP to participate fully in ORT's profiling system, allowing users to observe GPU kernel and memory activity in conjunction with CPU-side events when profiling is enabled.

Contributor

@github-actions bot left a comment


You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Fixed
Contributor

Copilot AI left a comment


Pull request overview

This PR adds CUPTI-backed GPU profiling support to the CUDA plugin Execution Provider so GPU kernel/memcpy activity can be emitted into ONNX Runtime’s profiling JSON when onnxruntime_ENABLE_CUDA_PROFILING is enabled.

Changes:

  • Introduces a plugin-side CudaPluginEpProfiler implementing OrtEpProfilerImpl, using CUPTIManager to collect GPU activity and report it via OrtProfilingEventsContainer.
  • Wires CudaEp::CreateProfiler in the CUDA plugin EP behind ENABLE_CUDA_PROFILING.
  • Updates the CUDA plugin CMake to link CUDA::cupti and adds a Python test + design doc updates for profiling.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • onnxruntime/test/python/transformers/test_cuda_plugin_ep.py: Adds a session profiling test that validates basic trace JSON structure and (when enabled) checks for GPU “Kernel” events/metadata.
  • onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h: Declares CudaPluginEpProfiler (plugin-side OrtEpProfilerImpl).
  • onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc: Implements the profiling lifecycle, CUPTI correlation, and event conversion to Ort::ProfilingEvent.
  • onnxruntime/core/providers/cuda/plugin/cuda_ep.h: Adds the CreateProfilerImpl declaration behind ENABLE_CUDA_PROFILING.
  • onnxruntime/core/providers/cuda/plugin/cuda_ep.cc: Wires the CreateProfiler callback and implements CreateProfilerImpl.
  • docs/cuda_plugin_ep/cuda_plugin_ep_design.md: Documents the profiling/observability architecture and build configuration.
  • cmake/onnxruntime_providers_cuda_plugin.cmake: Conditionally links CUPTI and defines compile-time flags for the profiling build.


Comment thread onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc Outdated
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.h
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_ep.cc Outdated
Comment thread docs/cuda_plugin_ep/cuda_plugin_ep_design.md Outdated
@yuslepukhin yuslepukhin requested a review from tianleiwu April 24, 2026 18:33
Contributor

@tianleiwu tianleiwu left a comment


Wiring of OrtEp::CreateProfiler and the OrtEpProfilerImpl callbacks looks correct: C-API boundary handling (EXCEPTION_TO_STATUS, null-checks, *profiler = nullptr on entry) and the TimePoint{} trick in StartEvent to avoid the double epoch-offset from PushCorrelation are right. Two items worth addressing before relying on this as validated CUPTI support, plus one pre-existing issue worth tracking.

Pre-existing issue worth surfacing

onnxruntime/core/providers/cuda/cupti_manager.cc ProcessActivityBuffers subtracts a CPU-epoch start_time_ns from kernel->start without calling NormalizeGPUTimestampToCPUEpoch, while the memcpy branch does normalize. CUPTI activity timestamps are in the CUPTI/GPU timestamp domain, so the kernel branch is computing ts from mixed domains. This PR is the first consumer from the plugin path, and the new Python test does not check that kernel ts/dur land near the corresponding ORT node event, so a misaligned timeline would still pass. Please either normalize both branches consistently (preferred), or add a timeline-plausibility assertion in the new test. cupti_manager.cc is outside this PR's diff so I am mentioning it here rather than inline.

Inline concerns

See inline comments on the CMake compile-definition scope and on the test that treats zero-kernel-events as success.

Comment thread cmake/onnxruntime_providers_cuda_plugin.cmake Outdated
Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py Outdated
Comment thread docs/cuda_plugin_ep/cuda_plugin_ep_design.md
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.



Comment thread onnxruntime/core/providers/cuda/plugin/cuda_profiler_plugin.cc
Comment thread cmake/onnxruntime_providers_cuda_plugin.cmake
Comment thread onnxruntime/core/providers/cuda/cupti_manager.h
Comment thread onnxruntime/core/providers/cuda/cupti_manager.cc
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up fixes here. The CMake USE_CUDA coupling, output-parameter handling, and doc wording from the earlier review look addressed on the current head.

I found one remaining validation gap: the new CUPTI-backed CUDA plugin profiling path still does not appear to be compiled or exercised by the plugin CI jobs, so the PR can stay green while the feature-specific code path is broken. Please wire one CUDA plugin validation path to build with profiling enabled and run the env-gated assertion.

Comment thread onnxruntime/test/python/transformers/test_cuda_plugin_ep.py
@tianleiwu
Contributor

There is a build error: cuda_profiler_plugin.h:9:10: fatal error: cupti_manager.h: No such file or directory
The CMake settings need to be updated to add the CUPTI include path to the CUDA plugin EP build.

Contributor

@tianleiwu tianleiwu left a comment


LGTM

Contributor

@tianleiwu tianleiwu left a comment


Clean implementation that correctly reuses existing CUPTI infrastructure, properly handles correlation IDs across the C API boundary, and includes thorough testing and documentation.

Positives:

  • Good delegation to CUPTIManager — the plugin profiler doesn't duplicate CUPTI buffer management or correlation tracking.
  • Correct use of TimePoint{} (epoch) in PushCorrelation to avoid double-offsetting, since the bridge already converts relative ORT event IDs to absolute epoch-based IDs.
  • Conditional CreateProfiler = nullptr when profiling is disabled explicitly communicates intent.
  • CI workflows now enable --enable_cuda_profiling and set ORT_CUDA_PROFILING_ENABLED=1 on both Linux and Windows.
  • Design doc Section 14 is thorough, especially the in-tree vs plugin comparison table (14.5).

Note (pre-existing, not introduced by this PR): In CUPTIManager::ProcessActivityBuffers, kernel events use raw kernel->start (CUPTI clock domain) while memcpy events correctly call NormalizeGPUTimestampToCPUEpoch(mmcpy->start). Since the plugin now inherits this behavior, consider documenting it in the design doc's known-limitations section.

Suggestions and nitpicks are inline — all non-blocking.

struct CudaPluginEpProfiler : OrtEpProfilerImpl {
  const OrtEpApi& ep_api;
  uint64_t client_handle_ = 0;
  int64_t ep_profiling_start_offset_ns_ = 0;
Contributor


Suggestion: ep_profiling_start_offset_ns_ is stored in StartProfilingImpl but never read afterward — only ort_profiling_start_ (derived from it) is actually used in EndProfilingImplConsume(). Same for start_time_point_. If these aren't reserved for future use, removing them avoids confusion.

Also, since all access goes through the static *Impl methods that static_cast<CudaPluginEpProfiler*>, consider making the data members private (switch to class with the static methods as friend or add accessors) to prevent accidental external mutation.


// Reconstruct the approximate ORT profiling start time so that GPU event
// timestamps (computed by CUPTIManager::Consume) are relative to ORT's start.
self->ort_profiling_start_ = self->start_time_point_ -
Contributor


Nit: The reconstructed ort_profiling_start_ equals profiling_start_time + (plugin_now - bridge_now), so the skew equals the cross-DLL call latency (typically < 1µs). Acceptable for profiling, but a one-line comment noting the approximation would help readers understand why this isn't an exact reconstruction.

if (onnxruntime_ENABLE_CUDA_PROFILING)
  target_link_libraries(onnxruntime_providers_cuda_plugin PRIVATE CUDA::cupti)
  target_compile_definitions(onnxruntime_providers_cuda_plugin PRIVATE ENABLE_CUDA_PROFILING)
endif()
Contributor


Nitpick: trailing whitespace on the blank line after endif().

session_end_us = max(e["ts"] + e["dur"] for e in cpu_events)
# Allow a small margin for GPU-side clock skew (100ms).
margin_us = 100_000
for event in kernel_events:
Contributor


Nitpick: 100ms margin is quite generous for a simple MatMul — GPU kernels should complete well under 1ms. A tighter margin (e.g., 10ms) would catch gross timestamp-domain errors more reliably while still accommodating clock skew on slow CI machines. Though this is a judgment call — erring toward no flaky tests is reasonable.
