Whisper Redesigned Solution #1229

kunal-vaishnavi · 2025-02-05T18:26:39Z

Description

This PR re-designs how Whisper is created and supported in ONNX Runtime GenAI. The new solution is designed to be used in conjunction with this work in ONNX Runtime.

Some of the added changes include:

- Re-designed GenAI config that separates the encoder model and decoder model
  - Removes the encoder-decoder-init section
  - Creates a new encoder section
  - Separates session options, EP options, and model properties to be per-model instead of re-using the decoder's options for all components
  - Re-assigns pre-computed cross-attention KV caches as outputs to encoder model instead of inputs to decoder model
- Re-designed runtime support that makes the states and steps much clearer
  - Creates AudioEncoder, WhisperDecoder (i.e. TextDecoder), and WhisperState as separate states
  - Creates AudioFeatures class that can be re-used for other speech models
  - Adds generic support for FP32 CPU, FP32 CUDA, FP16 CUDA, and any quantized versions
  - Removes temporary workarounds for past-present buffer sharing due to restrictions from both the exported ONNX model and ONNX Runtime
  - Handles models with and without the following: buffer sharing, DecoderMaskedMultiHeadAttention, and alignment heads
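The split described in the list above can be pictured with a toy sketch: the encoder runs once per audio input and emits the cross-attention KV caches as outputs, and each decoder step only reads them. Everything here (`ToyWhisper`, `run_encoder`, `run_decoder_step`, the arithmetic) is a hypothetical stand-in for illustration, not the ONNX Runtime GenAI API or its state classes.

```python
class ToyWhisper:
    """Toy encoder-decoder loop; all names and math are illustrative only."""

    def __init__(self, num_layers=2):
        self.num_layers = num_layers
        self.encoder_runs = 0
        self.decoder_runs = 0

    def run_encoder(self, audio_features):
        # Runs once per audio input and returns per-layer cross-attention
        # (K, V) pairs as outputs, mirroring the "encoder outputs" design.
        self.encoder_runs += 1
        hidden = [x * 0.5 for x in audio_features]
        return [
            ([h + layer for h in hidden], [h - layer for h in hidden])
            for layer in range(self.num_layers)
        ]

    def run_decoder_step(self, token, cross_kv, self_kv):
        # Reads the pre-computed cross KV (never recomputes it) and grows
        # the self-attention cache by one entry per generated token.
        self.decoder_runs += 1
        self_kv.append(token)
        score = sum(keys[0] for keys, _ in cross_kv) + len(self_kv)
        return int(score) % 7  # toy "next token"

    def generate(self, audio_features, start_token, max_steps):
        cross_kv = self.run_encoder(audio_features)  # computed exactly once
        self_kv = []
        tokens = [start_token]
        for _ in range(max_steps):
            tokens.append(self.run_decoder_step(tokens[-1], cross_kv, self_kv))
        return tokens
```

In this sketch the same decoder function serves every step, and the encoder is never re-run during generation.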

Known Issues

- This branch still has to be synced with the latest changes in the main branch of ONNX Runtime GenAI and active dev branches that will materially change this PR.
- The cross QK kernels do not have parity with the alternative, more-accurate approach to compute the cross QKs as a separate inference pass. Currently, it is recommended to use the alternative approach for calculating word-level timestamps.
- The cross QK kernels are only supported for CUDA.
- The end-to-end working example is still under development here. Once working, a copy of those scripts will be added as a sub-folder in the Python examples.

Motivation and Context

The original implementation of Whisper was added in ONNX Runtime GenAI to create an initial foundation. This new approach is more flexible and more customizable for users. It also introduces an encoder-decoder architecture setup that can be used for other encoder-decoder models or other speech models.

br is already batch_beam_size; fix offset of cache_indir accordingly.

commit acba52c Author: Ryan Hill <[email protected]> Date: Mon Feb 3 15:24:33 2025 -0800 Update src/models/model.h Co-authored-by: aciddelgado <[email protected]>
commit 0765339 Merge: 4f2f084 6da4195 Author: Ryan Hill <[email protected]> Date: Fri Jan 31 16:17:42 2025 -0800 Merge remote-tracking branch 'origin/main' into ryanunderhill/providers
commit 4f2f084 Author: Ryan Hill <[email protected]> Date: Thu Jan 30 16:02:48 2025 -0800 Refactor device_type
commit e6b77f2 Author: Ryan Hill <[email protected]> Date: Thu Jan 30 01:24:37 2025 -0800 Device check simplifications
commit 198e8f8 Author: Ryan Hill <[email protected]> Date: Wed Jan 29 22:40:37 2025 -0800 Remove accidental change
commit f8ed9ce Author: Ryan Hill <[email protected]> Date: Wed Jan 29 18:16:08 2025 -0800 Previous change also added device interfaces for webgpu & qnn Lint
commit e804697 Author: Ryan Hill <[email protected]> Date: Wed Jan 29 18:14:28 2025 -0800 Clean up allocators, now everything is through p_device_* interfaces.
commit 53c666c Author: Ryan Hill <[email protected]> Date: Wed Jan 29 12:51:08 2025 -0800 Edward gave me ideas.
commit 45dad2b Author: Ryan Hill <[email protected]> Date: Tue Jan 28 11:51:15 2025 -0800 Review feedback
commit c11704f Author: Ryan Hill <[email protected]> Date: Sun Jan 26 18:13:35 2025 -0800 Type tweak
commit a011fe0 Author: Ryan Hill <[email protected]> Date: Sun Jan 26 18:06:18 2025 -0800 Leftover #ifdef fix
commit 6736517 Author: Ryan Hill <[email protected]> Date: Fri Jan 24 20:59:24 2025 -0800 Android tweak
commit 2df5fe1 Author: Ryan Hill <[email protected]> Date: Fri Jan 24 20:44:57 2025 -0800 Fix iOS break
commit d87807c Author: Ryan Hill <[email protected]> Date: Fri Jan 24 20:40:43 2025 -0800 Don't load cuda library outside of linux & windows
commit 0303592 Author: Ryan Hill <[email protected]> Date: Fri Jan 24 18:17:04 2025 -0800 Undefined behavior fix in startup
commit 67d914c Merge: fd788d7 0636ce3 Author: Ryan Hill <[email protected]> Date: Fri Jan 24 13:57:58 2025 -0800 Merge with main
commit fd788d7 Author: Ryan Hill <[email protected]> Date: Fri Jan 24 13:48:33 2025 -0800 Extra debug logging
commit 2bc83eb Author: Ryan Hill <[email protected]> Date: Fri Jan 24 13:46:49 2025 -0800 Crash investigation
commit 1734f5c Author: Ryan Hill <[email protected]> Date: Fri Jan 24 03:36:49 2025 -0800 Test instrumenting
commit afecf1d Author: Ryan Hill <[email protected]> Date: Wed Jan 22 16:19:11 2025 -0800 Test theory
commit 0e7064c Merge: b079b74 8bfd286 Author: Ryan Hill <[email protected]> Date: Wed Jan 22 16:14:33 2025 -0800 Merge with main
commit b079b74 Author: Ryan Hill <[email protected]> Date: Tue Jan 21 20:51:21 2025 -0800 Try again to fix C# test
commit 133d5a0 Author: Ryan Hill <[email protected]> Date: Tue Jan 21 16:56:59 2025 -0800 Fix C# unit tests
commit d3db2f6 Author: Ryan Hill <[email protected]> Date: Tue Jan 21 14:06:51 2025 -0800 Fix input_ids issue from merge
commit 49b51ef Author: Ryan Hill <[email protected]> Date: Thu Jan 16 21:33:46 2025 -0800 Build fix
commit 5244049 Author: Ryan Hill <[email protected]> Date: Thu Jan 16 21:21:24 2025 -0800 Build fix
commit 0f2ea36 Merge: 0bc39a5 ee318f1 Author: Ryan Hill <[email protected]> Date: Thu Jan 16 20:20:54 2025 -0800 Merge with main
commit 0bc39a5 Author: Ryan Hill <[email protected]> Date: Thu Jan 16 20:15:19 2025 -0800 Build fixes
commit 66321dd Author: Ryan Hill <[email protected]> Date: Wed Jan 15 23:08:47 2025 -0800 Formatting
commit bdbb09c Author: Ryan Hill <[email protected]> Date: Wed Jan 15 16:04:50 2025 -0800 Fix merge build issues
commit 237fb1e Merge: 41b462a 014c5f6 Author: Ryan Hill <[email protected]> Date: Wed Jan 15 15:43:42 2025 -0800 Merge with main
commit 41b462a Author: Ryan Hill <[email protected]> Date: Wed Jan 15 15:34:15 2025 -0800 Finish refactoring model processing Remove as many #if USE_CUDA/USE_DML as possible
commit 3823664 Author: Ryan Hill <[email protected]> Date: Sun Dec 15 23:52:16 2024 -0800 Summary: Remove #ifdefs for providers and go through device interface. Details: Add a DML DeviceInterface and DML DeviceBuffer handler. Remove #if blocks that are doing memory copies between device/cpu memory and use the DeviceSpan interface.
commit 35e79ce Merge: 34381af c5745fd Author: Ryan Hill <[email protected]> Date: Mon Nov 25 17:00:03 2024 -0800 Merge remote-tracking branch 'origin/main' into ryanunderhill/providers
commit 34381af Merge: 7e4668b 4819a8c Author: Ryan Hill <[email protected]> Date: Fri Nov 22 17:24:43 2024 -0800 Merge remote-tracking branch 'origin/main' into ryanunderhill/providers
commit 7e4668b Author: Ryan Hill <[email protected]> Date: Fri Nov 22 17:24:35 2024 -0800 Use DeviceInterface for debugging

### Description

This PR re-designs how Whisper is created and supported in ONNX Runtime. The new solution leverages [previous optimization work](#15473), and it is designed to be used in conjunction with [this work](microsoft/onnxruntime-genai#1229) in ONNX Runtime GenAI.

Some of the added changes include:

- Re-designed export that creates new ONNX models without needing a `WhisperBeamSearch` op
  - Creates one encoder model that also pre-computes the cross-attention KV caches (since they only need to be calculated once)
  - Creates one decoder model that can be used during pre-fill and token generation
  - Creates one jump-times model that can be used for word-level timestamps
  - Removes need for a `WhisperBeamSearch` op to chain the encoder and decoder subgraphs
  - Removes need to duplicate decoder's weights in memory
    - Previous solution with the `WhisperBeamSearch` op created an encoder-decoder-init model and decoder-with-past model. The decoder was duplicated twice, one in each.
  - Removes need for separate logic to export the PyTorch model coming from OpenAI vs. the PyTorch model coming from Hugging Face
- Re-factors common parameters and logic used in CPU and CUDA attention kernels
  - Adds `DUMP_STRING` to enable easy logging of intermediate information when running in debug mode to debug a problem. This info is not printed in release mode so it will not impact performance.
  - Integrates `DecoderMaskedMultiHeadAttention` into `MultiHeadAttention`
  - Enables past-present buffer sharing in the `MultiHeadAttention` op for improved performance
  - Adds `cache_indirection` and `past_sequence_length` as new optional inputs to `MultiHeadAttention`
  - Adds `output_qk` as new optional output to `MultiHeadAttention`
  - Enables calculating `output_qk` tensor with FP16 or FP32 precision, regardless of the model's precision
- CI tests that run end-to-end across various flag combinations that are used by many customers internally and externally

The existing solutions are still available if desired.

### Known Issues

- The FP32 CPU model with the `WhisperBeamSearch` op and output QK is currently disabled. This is because ONNX Runtime doesn't currently support output QK kernels on CPU, only on CUDA.
- The `DecoderMaskedMultiHeadAttention` CPU kernel has a parity mismatch with the `DecoderMaskedMultiHeadAttention` CUDA kernel.
- Using `DecoderMaskedMultiHeadAttention` for the FP32 CPU model is not enabled. Currently, it uses `MultiHeadAttention` to avoid the parity mismatch issue.

### Motivation and Context

Using the beam search op has made it more difficult to debug and fix errors that are encountered. This new approach is more flexible and more customizable for users (e.g. by running with ONNX Runtime GenAI). It also helps [this issue](#18216).

---------

Co-authored-by: mindest <[email protected]>
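As a rough illustration of what a `cache_indirection` input encodes when beams share one KV buffer, here is a plain-Python sketch. It assumes an indirection layout of `[batch, beam, time]` and a cache laid out with one row per `batch * num_beams + beam`; the function name `gather_keys` and the string "keys" are hypothetical, and this is not the kernel code from the PR.

```python
def gather_keys(k_cache, cache_indirection, batch, beam, num_beams, past_len):
    """Collect the past keys that one beam should attend to.

    k_cache[row][t] holds the key written at time step t by the beam stored
    in row (row = batch * num_beams + beam_index).
    cache_indirection[batch][beam][t] names which beam's row holds step t of
    this beam's history, so the cache never has to be physically reordered.
    """
    keys = []
    for t in range(past_len):
        src_beam = cache_indirection[batch][beam][t]
        keys.append(k_cache[batch * num_beams + src_beam][t])
    return keys


# One batch entry, two beams, three past steps.
k_cache = [
    ["a0", "a1", "a2"],  # row for beam 0
    ["b0", "b1", "b2"],  # row for beam 1
]
cache_indirection = [[[0, 1, 0], [1, 1, 0]]]
beam0_keys = gather_keys(k_cache, cache_indirection, 0, 0, 2, 3)
beam1_keys = gather_keys(k_cache, cache_indirection, 0, 1, 2, 3)
```

The row offset `batch * num_beams` is the part that must agree with how the caller indexes batch and beam, which is the kind of detail an offset fix for the indirection buffer has to get right.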

* Quant tool: Add `nodes_to_exclude` in `get_qnn_qdq_config` (#23779)

* [ORT/CI_Pipeline] Use --enable_generic_interface in ORT builds for EP testing (#23801)

Summary of changes:
- Changed the OpenVINO test case to use --enable_generic_interface
- Changed the TensorRT test case to use --enable_generic_interface
- Fixed ORT builds to USE_FULL_PROTOBUF, as OpenVINO/TensorRT require it
- Fixed a pre-processor macro definition that was accidentally removed when ORT is built without an EP

Co-authored-by: Karim Vadsariya <[email protected]>

* Increase npm package pipeline ReactNative_CI_iOS timeout to 120 mins (#23825)

### Description
Increase the [npm package pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1080&_a=summary) ReactNative_CI_iOS timeout to 120 mins.

* [Mlas] Unblock hardcoded matmul blocking size (#23815)

### Description
In GemmBatch, the target matrix is cut into blocks that are dispatched to multiple threads for intra-op parallelism. Currently the blocking size is hard-coded to 16, so if the CPU has more than 16 cores, the cores are not fully utilized in one op. This change removes the hard-coded blocking limit in various MatMul implementations.
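To see why a fixed limit of 16 leaves cores idle, consider a minimal sketch of row partitioning. This is not the MLAS code: `partition_rows` and its `max_blocks` parameter are hypothetical, and the sketch assumes the hard-coded value acts as a cap on how many blocks one op can be split into.

```python
def partition_rows(num_rows, num_threads, max_blocks=None):
    """Split num_rows into contiguous (start, end) blocks, one per worker.

    max_blocks models a hard-coded cap on the number of blocks; None means
    the block count is limited only by the thread count and the row count.
    """
    if num_rows <= 0 or num_threads <= 0:
        return []
    blocks = min(num_threads, num_rows)
    if max_blocks is not None:
        blocks = min(blocks, max_blocks)
    base, extra = divmod(num_rows, blocks)
    ranges = []
    start = 0
    for i in range(blocks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges


capped = partition_rows(1536, num_threads=64, max_blocks=16)
uncapped = partition_rows(1536, num_threads=64)
```

With the cap, only 16 of the 64 requested threads receive work for this op; without it, every thread gets a block.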
__Benchmark results__

Model: llmlingua-2-bert-base-multilingual-cased-meetingbank--add-force-token-100--max-seq-len-512-CPU-INT8.onnx
Setup: 96-core x86 Linux

Before:

Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.485097 s
First inference time cost: 356 ms
Total inference time cost: 17.731 s
Total inference requests: 50
__Average inference time cost: 354.619 ms__
Total inference run time: 17.7312 s
Number of inferences per second: 2.81989
Avg CPU usage: 65 %
Peak working set size: 542265344 bytes
Avg CPU usage:65
Peak working set size:542265344

After:

Setting intra_op_num_threads to 32
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.523394 s
First inference time cost: 316 ms
Total inference time cost: 12.2739 s
Total inference requests: 50
__Average inference time cost: 245.478 ms__
Total inference run time: 12.2741 s
Number of inferences per second: 4.07362
Avg CPU usage: 33 %
Peak working set size: 611241984 bytes
Avg CPU usage:33
Peak working set size:611241984

Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.497698 s
First inference time cost: 289 ms
Total inference time cost: 9.49205 s
Total inference requests: 50
__Average inference time cost: 189.841 ms__
Total inference run time: 9.49226 s
Number of inferences per second: 5.26745
Avg CPU usage: 65 %
Peak working set size: 548470784 bytes
Avg CPU usage:65
Peak working set size:548470784
Runs:50

### Motivation and Context
This issue was reported by the M365 research team.

* Revert changes on mac-react-native-ci-pipeline.yml (#23845)

* Fix flash attention for GQA (Phi4) (#23850)

### Description
This change fixes GQA for Flash Attention on Nvidia GPUs. The root cause appears to be the `k_start + capped_sg_id < seq_causal_length` check. This is either because:

a. seq_causal_length varies per lane, so the check becomes non-uniform control flow, which interacts with subgroupShuffle; or
b. the check itself is incorrect and wipes out values of v based on the source lane's seq_causal_length, whereas the values of v actually need to be causal with respect to the lane that is going to multiply them with qkt.

qkt is already causal because earlier values of qk for out-of-bounds k are set to min_value, and exp(<-4) are 0.

This fix works by removing that causal check and relying on the qk being wiped out earlier. The documentation of the causality behavior for GQA is missing, so it is not possible to determine which of these is the true reason.

Prior to this, prompts with sequence length > 16 < 32 or 1k would break with Phi 4, but smaller prompts would work.

Tested on Intel Alderlake, Nvidia 4070.

* Model Builder API (#23223)

### Description
Supports creating a model programmatically using the ORT C or C++ API. Supports augmenting an existing model to add nodes.

* Fix typo: change `Upample` to `Upsample`. (#23838)

### Description
Fixed a typo in function names related to the Upsample CUDA kernel. Changed the incorrect spelling Upample to Upsample across relevant functions.

### Motivation and Context
This change is necessary to maintain consistency and prevent potential confusion caused by incorrect function names.

* [doc] Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/ (#23848)

### Description
Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/

* Quant tool: Consistent `get_qdq_config` and `get_qnn_qdq_config` behavior (#23856)

* Change the logic to generate the default ep context file name (#23788)

### Description
Applies to all EPs: replace the .onnx extension with _ctx.onnx, instead of directly appending the extra string _ctx.onnx to the existing model path.
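The naming change can be sketched in a few lines. The function names below are hypothetical and this is not the ORT implementation; in particular, what happens when the model path does not end in `.onnx` is an assumption made here (the suffix is simply appended).

```python
def old_default_ep_context_path(model_path: str) -> str:
    """Previous behavior as described: append _ctx.onnx to the whole path."""
    return model_path + "_ctx.onnx"


def new_default_ep_context_path(model_path: str) -> str:
    """New behavior as described: swap the .onnx extension for _ctx.onnx."""
    suffix = ".onnx"
    if model_path.endswith(suffix):
        return model_path[: -len(suffix)] + "_ctx.onnx"
    # Assumption: for any other extension, fall back to appending the suffix.
    return model_path + "_ctx.onnx"
```

For example, `model.onnx` previously produced `model.onnx_ctx.onnx` and now produces `model_ctx.onnx`.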
In QNN EP, also make the context binary .bin file name shorter by removing QNNExecutionProvider_ from the file name.

* Make Nuget QNN package pipeline 1ES compliant (#23805)

### Description
Make [QNN_Nuget_Windows](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1234) 1ES compliant

* [js/common] allows using Uint16Array as data for float16 tensor (#23827)

### Description
Resolve #23817

* [js/webgpu] Reland the optimization of ConvTranspose (#23858)

This PR fixes the errors in the ConvTranspose optimization and adds tests to ensure the correctness of the implementation.

* [OpenVINO] Fix a build warning (#23877)

### Description
Fix a warning with std::move usage

### Motivation and Context
Possibly allow building without the --compile_no_warning_as_error flag

* Change gsl::byte to std::byte (#23872)

To be compatible with the latest GSL library. Without this fix we will get:

```
onnxruntime\core\providers\cpu\controlflow\loop.cc(247): error C4996: 'gsl::byte': Use std::byte instead.
```

* Allow using extended minimal build for several EPs (#23834)

### Description

#### Background

From code search, the following EPs use `onnxruntime::GetCpuPreferredNodes()` in their `GetCapabilities()` methods:
- CANN
- CUDA
- DML
- JS
- ROCM
- WebGPU

However, the source file that implements `onnxruntime::GetCpuPreferredNodes()` is excluded when minimal build is ON: https://github.com/microsoft/onnxruntime/blob/6df0973e58ba5399fcaa98686f70ed9a9e59aaef/cmake/onnxruntime_framework.cmake#L38-L42

This means that none of the EPs mentioned above can compile with a minimal build.

#### Solution

The excluded file `core/framework/fallback_cpu_capability.cc` cannot build in a minimal build because some of its dependencies are not included in the minimal build. However, in extended minimal build mode, all dependencies are available.

This PR loosens the restriction and allows this file to be compiled in an extended minimal build.
After this change, those EPs are able to compile in an extended minimal build.

* Add dawn to ThirdPartyNotices (#23876)

### Description
Add `dawn` to ThirdPartyNotices.

* Enable QNN EP weight sharing generation using public API (#23702)

### Description
Enable QNN EP weight sharing generation using the public API instead of internal interfaces, so that users can integrate it into their own toolchain. The change is to share the QnnBackendManager across ORT sessions if ep.share_ep_contexts is enabled. There is also an extra option to end the sharing, so that we know when to remove the shared QnnBackendManager from the singleton.

Change the tool name from onnxruntime_qnn_ctx_gen to ep_weight_sharing_ctx_gen, so that it can be shared with other EPs.

* [QNN-EP]: Fix inference failures while running with htp_shared_memory (#23892)

### Description
When using the enable_htp_shared_memory feature, we see that the address of the buffer passed to rpcmem_free is incorrect. As a result, the rpc buffers are not freed, leading to memory exhaustion.

### Motivation and Context
When using the enable_htp_shared_memory_allocator feature for QNN in GenAI extensions, it leads to inference failures during the second prompt. As GenAI memory demands are higher, the problem surfaces sooner in GenAI use cases.

Co-authored-by: Ashish Garg <[email protected]>

* Fix enable_pix_capture build for WebGPU (#23857)

The build option --enable_pix_capture is broken. This fixes the problem.

---------

Co-authored-by: wp <[email protected]>

* [WebGPU-EP Native] Add ReduceMean (#23860)

* [WebGPU EP] introduce BiasAdd contrib op (#23861)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Dynamo export and improve benchmark script for SAM2 encoder (#23887)

### Description
* Add dynamo export for Sam2 image encoder
* Verify fp32 onnx model with CPU EP (to avoid error message from TRT EP).
* Update benchmark script:
  - output ORT profiling
  - output torch compiled code and unique kernel name for compiled kernel
  - add an option for nightly package installation
  - uninstall existing ort packages before installing

The node metadata of a dynamo-exported model can help map nodes in the onnx model back to the pytorch modeling script. Currently, graph optimization is not done on the dynamo-exported model, so it is experimental right now.

### Motivation and Context
To support profiling of torch compiled CUDA kernels.

* [js/web] improve workaround for bundlers (#23902)

### Description
This PR improves the workaround for bundlers in onnxruntime-web. Specifically, the following changes have been made:

- Use [this workaround](https://github.com/xenova/onnxruntime/commit/9c50aa2c63bad4cb73ad77ff1c43e0c43da0907f) as suggested by @xenova in https://github.com/huggingface/transformers.js/pull/1161#issuecomment-2695785730
- Use `url > "file:" && url < "file;"` instead of `url.startsWith("file:")` to allow minifiers to remove dead code correctly.

This change allows removing unnecessary dependencies on the file parsed from `new URL("ort.bundle.min.js", import.meta.url)` in Vite, and optimizing code like `if("file://filepath.js".startsWith("file:")) {do_sth1(); } else {do_sth2();}` into `do_sth1()` for webpack/terser usages.

Resolves https://github.com/huggingface/transformers.js/pull/1161

* [webgpu] Restore MatMulNBits workgroup size for Phi-3.5 (#23349)

### Description
This change restores the MatMulNBits workgroup size from (8, 8, 1) back to (16, 8, 1) to resolve a performance regression observed on Intel iGPUs during token generation (M=1).

### Motivation and Context
As above.

Signed-off-by: Jianhui Dai <[email protected]>

* [webgpu] support Pad operator (#23141)

* [WebNN] Accept Float16Array for float16 data type if it is available (#23894)

Float16Array is now shipping and the WebNN Chromium implementation has accepted it.
We should allow it in WebNN EP as well.

* Ensure that the 'cmake_minimum_required' is version 3.5 or greater (#23888)

### Description
CMake 4.0 release candidate 2.0 is available, and it cannot compile all of OnnxRuntime out-of-the-box. There are portions of the OnnxRuntime codebase that specify a `cmake_minimum_required` version of 3.0, and CMake 4.0 has removed support for compatibility with CMake < 3.5 - the following error is reported:

```
CMake Error at winml_sdk_helpers.cmake:4 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```

Since CMake 3.5 appears to have shipped in 2016, it seems reasonable to set that as a minimum version to fix the error. The root CMakeLists.txt does ask for a minimum version of 3.28, so we could snap to that, but I'm still ramping up on the build, so wanted to propose a minimally sufficient fix.

### Motivation and Context
Being able to build with the latest CMake - when it ships - reduces the barrier to entry to building OnnxRuntime, and allows OnnxRuntime to leverage the latest and greatest tooling.

* WebGPU: Remove deprecated subgroups-f16 from WebGPU native and JS EP (#23898)

This PR removes the deprecated subgroups-f16 from WebGPU native and JS EP, and also removes the unused deviceInfo in WebGPU JS EP.

* [JSEP/WebGPU] Fixed error in softmax dispatch. (#23906)

### Description
Fixed an error in softmax dispatch

### Motivation and Context
Produce expected results for LLaMA model

* enable WebGPU EP in WebAssembly build (#23913)

### Description
This PR is the first step for migrating the webgpu backend of onnxruntime-web from JSEP based to WebGPU EP based. In this change, we enable building WebGPU EP in a wasm build (i.e. `--build_wasm` `--use_webgpu` `--use_jsep`). However, the old build flags should still keep the previous behavior.

* Adding OpenVINO Windows CI Pipeline (#23919)

### Description
Enable an OpenVINO Windows CI pipeline. This includes:
- Downloading the OpenVINO toolkit for Windows from an external source.
- Setting up OpenVINO environment variables.
- Building the ONNX Runtime OpenVINO Execution Provider.
- Running unit tests.

### Motivation and Context
This change is required to run checks on precommit and commit in the ONNX Runtime project. It ensures that the code is tested with the OpenVINO toolkit on Windows, improving the reliability and compatibility of the project.

* [WebGPU EP] SoftMax Implementation (#23538)

Increase coverage for WebGPU Op

* Exclude MAUI projects from GPU C# packaging builds (#23923)

### Description
Use 'desktop only' solution in GPU C# packaging builds. We don't need to include any MAUI support for those builds.

* Support all block sizes that are multiples of 32 for DP4A (#23907)

### Description
Simple change
1. The DP4A shader actually supports all block sizes that are multiples of 32, so this relaxes the restriction and makes a small tweak to support sizes other than 32.
2. Moved the shader to a separate file for maintainability.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Example custom op with output type inferencing (#23916)

### Description
Add an example of a custom op that is required to do type inference for the output type for the model load to work. It also acts as an example of how to override an ONNX op with a custom implementation.

### Motivation and Context
#23891

* Enabling L2+ Optimizations for EPs (#23517)

There are some requirements to modify the graph which are specific to the EP/hardware. ORT has a hardcoded EP list for optimizations, but that can't scale and it's hard to extend to enable EP custom optimizations.
Here is the prototype to enable L2+ optimizations for EPs (the original overview is provided by @skottmckay) as well as the TRT EP implementation for the ConstantFoldingDQ optimization.

Signatures for selection and optimization functions:

````
- Selection: std::function<std::vector<std::unique_ptr<ComputeCapability>>(const GraphViewer&, const KeyValueConfig&)>
- Optimization: std::function<Status(const Graph&, const ComputeCapability& this_optimization, ComputeCapability& cc_to_update)>
````

GetCapability
- call (new) provider bridge API to lookup pre-defined optimizer by name and get selection function
- ComputeCapability.optimize_func, i.e. optimization function, would be set by the optimizer to the function that does the optimization
- EP has to update the returning ComputeCapability to include the optimization ComputeCapability in nodes_to_optimize, so that later ORT can perform optimization/transformation accordingly.

GraphPartitioner
- After assigning the ComputeCapability to the EP and prior to Compile, if the ComputeCapability has nodes_to_optimize, iterate that list
- optimization function needs to be called with
  - a mutable Graph instance
  - the ComputeCapability for the individual optimization
  - the overall ComputeCapability so it can be updated

* fix binplace file in web pipeline (#23930)

* Updated run_CIs_for_external_pr.py to support the Windows OpenVINO CI pipeline (#23931)

* Fix ConvInteger handling of optional inputs. (#23935)

### Description
Fix ConvInteger handling of optional inputs. Need to check Exists() and not just the number of inputs.

### Motivation and Context
#23927

* Updated ov version in pipeline (#595) (#23882)

### Description
This PR updates the OpenVINO version used in the pipeline from 2024.5.0 to 2025.0.0

Co-authored-by: jatinwadhwa921 <[email protected]>

* [AIX] External data handling (#23859)

### Description
On big-endian (BE) systems, model tensor data coming from an external file is not handled properly.
This was found while debugging https://github.com/microsoft/onnxruntime-genai/issues/1104.

This PR performs the endianness conversion of data loaded from an external file on BE systems.

* Create a packaging pipeline for a custom nuget package (#23918)

* Fix license in example test code. (#23936)

* replace usage of gsl::narrow and gsl::narrow_cast in WebGPU EP (#23926)

### Description
`gsl::narrow` does not work in a no-exception build.
- use `onnxruntime::narrow` if necessary;
- or change to `static_cast` if it's obviously safe.

Also apply the changes to usages of `gsl::narrow_cast`, which does not apply checks.

* VCPKG improvement: set VCPKG_OSX_DEPLOYMENT_TARGET (#23933)

### Description
1. Set VCPKG_OSX_DEPLOYMENT_TARGET for macOS targets
2. Enable VCPKG in more pipelines.

* Allow using a different version of flatbuffers when building with vcpkg (#23946)

### Description
Allow using a different version of flatbuffers when building with vcpkg, so that users do not need to pin flatbuffer's version, which provides more flexibility in the build process.

Delete utf8_range from the dependencies, because it is an indirect dependency of protobuf, which is already included in the build process.

* Make python package pipeline 1ES compliant (#23800)

### Description
Make [Python packaging pipeline](https://aiinfra.visualstudio.com/530acbc4-21bc-487d-8cd8-348ff451d2ff/_build?definitionId=841) 1ES compliant

### Checklist
- [x] Make Onnxruntime-QNNEP-Windows-2022-CPU stateless

* Delete ROCM Nuget Publishing Pipeline (#23948)

* Bump SixLabors.ImageSharp from 2.1.9 to 2.1.10 in /csharp/sample/Microsoft.ML.OnnxRuntime.FasterRcnnSample (#23924)

Bumps [SixLabors.ImageSharp](https://github.com/SixLabors/ImageSharp) from 2.1.9 to 2.1.10.
<details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/SixLabors/ImageSharp/releases">SixLabors.ImageSharp's releases</a>.</em></p> <blockquote> <h2>v2.1.10</h2> <h2>What's Changed</h2> <ul> <li>Backport <a href="https://redirect.github.com/SixLabors/ImageSharp/issues/2859">#2859</a> to release/2.1.x by <a href="https://github.com/antonfirsov"><code>@antonfirsov</code></a> in <a href="https://redirect.github.com/SixLabors/ImageSharp/pull/2890">SixLabors/ImageSharp#2890</a></li> <li>Backport <a href="https://redirect.github.com/SixLabors/ImageSharp/issues/2701">#2701</a> to 2.1.x [copy] by <a href="https://github.com/antonfirsov"><code>@antonfirsov</code></a> in <a href="https://redirect.github.com/SixLabors/ImageSharp/pull/2891">SixLabors/ImageSharp#2891</a></li> </ul> <p><strong>Full Changelog</strong>: <a href="https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10">https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10</a></p> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/SixLabors/ImageSharp/commit/d133ef99e8becfc3b924b0bb4315e63b8681d307"><code>d133ef9</code></a> Set lang version</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/5dfe5a800367581239de442cc18de659da6e9b1d"><code>5dfe5a8</code></a> Missed cache action update</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/4d3a85112b03c89d2cb8616a5b747684b6e73730"><code>4d3a851</code></a> Use latest cache action</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/4cb9f40a722ab2b837157862f0320c6a652da4d0"><code>4cb9f40</code></a> Merge pull request <a href="https://redirect.github.com/SixLabors/ImageSharp/issues/2891">#2891</a> from SixLabors/af/backport-2701</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/bb82f79db0197166271d4355b5fb5ceda370a906"><code>bb82f79</code></a> <a 
href="https://redirect.github.com/SixLabors/ImageSharp/issues/2701">#2701</a> to 2.1.x [copy]</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/627b5f721f30f6d529acb50bd81f92bd3db754eb"><code>627b5f7</code></a> Merge pull request <a href="https://redirect.github.com/SixLabors/ImageSharp/issues/2890">#2890</a> from SixLabors/af/backport-2859</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/67f7848d6e975e7956c8056823555de49a5fdf6d"><code>67f7848</code></a> try to fix LFS for *.BMP</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/44d294e06606111195152ead3006452357ef1bb9"><code>44d294e</code></a> 8.0.x is not needed</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/adb85d9e66aa3a588a86f4a4ef9a0539a8502117"><code>adb85d9</code></a> Another attempt for a Linux-specific skip</li> <li><a href="https://github.com/SixLabors/ImageSharp/commit/efc3fc4ee15eec4e523c26f7130e786541b00df2"><code>efc3fc4</code></a> Disable BmpDecoder_CanDecode_Os2BitmapArray on Linux</li> <li>Additional commits viewable in <a href="https://github.com/SixLabors/ImageSharp/compare/v2.1.9...v2.1.10">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=SixLabors.ImageSharp&package-manager=nuget&previous-version=2.1.9&new-version=2.1.10)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Make python CUDA package pipeline 1ES compliant (#23802)

### Description

Make [Python-Cuda-Publishing-Pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1311&_a=summary) 1ES compliant

### Motivation and Context

* Migrate yarn to npm (#22116)

### Description

This PR changes all references to yarn to npm.

### Motivation and Context

This PR is needed to address the Component Governance issues that ORT is facing.

### Current issue

- [x] `use_react_native!(:path => config["reactNativePath"])` returns nil
- [x] For the error `CocoaPods could not find compatible versions for pod "RCTRequired"`, we might need to increase the iOS target version from 13.0 to a higher version.
- [x] For 'react-native' >= 0.73.x, the react-native/react.gradle file is no longer used
- [x] We need to update to gradle 7.6 or above to upgrade RN. The gradlew version 7.3.3 that we currently use does not work on RN 71+.
- [x] The instructions for integrating React Native have changed since [0.72](https://reactnative.dev/docs/integration-with-existing-apps).
- [x] Error `The new Java toolchain feature cannot be used at the project level in combination with source and/or target compatibility` from gradle.
- [x] Duplicate class `com.facebook.react.PackageList`. Solution: remove `apply from: file("../../node_modules/@react-native-community/cli-platform-android/native_modules.gradle"); applyNativeModulesAppBuildGradle(project)` from the bottom of android/app/build.gradle
- [x] Need to update the OnnxruntimeModuleTest because `ReactApplicationContext` is now an abstract class.
---------

Co-authored-by: Edward Chen <[email protected]>

* [WebGPU/JSEP] Support group query attention do_rotary attribute (#23524)

### Description

### Motivation and Context

* Fix npm audit in js/react-native/e2e (#23975)

* Suppress some warnings in WebGPU EP generated by GCC 13 (#23984)

### Description

Replace #23445, resolve conflicts and add one new file.

---------

Co-authored-by: Changming Sun <[email protected]>

* Fix NPM audit in js/react-native (#23974)

### Description

### Motivation and Context

* Bump axios from 1.7.9 to 1.8.2 in /js/node (#23963)

* GCC 14: fix insert_or_assign() call (#23955)

Resolve #23954

* ADD emsdk env vars to VCPKG_KEEP_ENV_VARS (#23997)

### Description

The vars are set by cmake\external\emsdk\emsdk_env.bat

### Motivation and Context

By default they are filtered by vcpkg to make build reproducible. However, emscripten's cmake toolchain file needs this information. emcc.bat has the following code:

```
@set EM_PY=%EMSDK_PYTHON%
@if "%EM_PY%"=="" (
  set EM_PY=python
)
```

Actually, it doesn't work as expected. The line

```
set EM_PY=python
```

should be changed to

```
set EM_PY=python.exe
```

We haven't hit this issue because usually the var EM_PY is set.
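
For illustration only, the interaction described above can be modeled in a few lines of Python. This is a rough sketch, not vcpkg's or emsdk's actual code: `filtered_env` and `resolve_em_py` are hypothetical helpers, and the interpreter path is a made-up value. It only shows why a filtered-out `EMSDK_PYTHON` makes emcc.bat take its fallback branch.

```python
def filtered_env(env, keep):
    """Keep only the variables named in `keep` (hypothetical stand-in for vcpkg's env filtering)."""
    return {name: value for name, value in env.items() if name in keep}


def resolve_em_py(env):
    """Mirror the emcc.bat logic: use EMSDK_PYTHON if set and non-empty, else fall back to 'python'."""
    return env.get("EMSDK_PYTHON") or "python"


# Assumed example environment; the path is illustrative only.
full_env = {"PATH": "/usr/bin", "EMSDK_PYTHON": "/emsdk/python/python.exe"}

# Variable filtered out: the fallback branch is taken.
without_keep = resolve_em_py(filtered_env(full_env, keep={"PATH"}))

# Variable on the keep list: the emsdk interpreter is found.
with_keep = resolve_em_py(filtered_env(full_env, keep={"PATH", "EMSDK_PYTHON"}))
```

Adding the emsdk variables to the keep list corresponds to the second call, so the fallback branch is never reached in practice.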
* Fix ONNX Runtime Python Test Pipeline (#23990)

### Description

[Fix ONNX Runtime Python Test Pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1164&_a=summary)

### Motivation and Context

* [webgpu] Fix the continuation issue (#23999)

### Description

### Motivation and Context

* [WebGPU EP] Implements Gelu, BiasSplitGelu, and QuickGelu (#23981)

Increases WebGPU operator coverage

* [Native WebGPU] Added ReduceMax and ReduceSum (#23934)

### Description

Added ReduceMax and ReduceSum

### Motivation and Context

* Convert Windows CPU CI Pipeline to Github Actions (#23996)

* [Fix] Dependencies find_package Eigen error (#23939)

### Description

Fixes the CMake configuration error that occurs when a dependency brought in via FetchContent uses find_package(Eigen3 REQUIRED).

Major Changes:
- enable EIGEN_BUILD_CMAKE_PACKAGE
- [optional] rename eigen to Eigen3

### Motivation and Context

The following build error occurs when dependencies use find_package(Eigen3 REQUIRED):

```
By not providing "FindEigen3.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "Eigen3", but
CMake did not find one.

Could not find a package configuration file provided by "Eigen3" with any
of the following names:

  Eigen3Config.cmake
  eigen3-config.cmake

Add the installation prefix of "Eigen3" to CMAKE_PREFIX_PATH or set
"Eigen3_DIR" to a directory containing one of the above files.  If "Eigen3"
provides a separate development package or SDK, be sure it has been
installed.
```

Eigen needs **EIGEN_BUILD_CMAKE_PACKAGE** enabled when it is brought in via FetchContent so that **Eigen3Config.cmake** is generated:
https://gitlab.com/libeigen/eigen/-/blob/master/CMakeLists.txt?ref_type=heads#L213

In addition, Eigen's project name is "Eigen3" and the CMake configuration file it provides is "Eigen3Config.cmake":
https://gitlab.com/libeigen/eigen/-/blob/master/CMakeLists.txt?ref_type=heads#L36
https://gitlab.com/libeigen/eigen/-/blob/master/CMakeLists.txt?ref_type=heads#L252

So I think it's best for the FetchContent_Declare name to be consistent with the project name to avoid potential errors.

Co-authored-by: mingyue <[email protected]>

* Update onnxruntime_c_api.h to work with MinGW (#24006)

### Description

Same as #23169

### Motivation and Context

Same as #23169

* Add DNNL github workflow (#24011)

### Description

Add a DNNL github workflow, migrated from the "Windows CPU CI pipeline" in Azure DevOps. This PR also adds "--build_nuget" to test the C# part. However, I then hit an error when building the tests in "test\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp.csproj". The error message was:

```
D:\a\_work\onnxruntime\onnxruntime\csharp\test\Microsoft.ML.OnnxRuntime.Tests.Common\TrainingTest.cs(34,81): error CS0103: The name 'CheckpointState' does not exist in the current context [D:\a\_work\onnxruntime\onnxruntime\csharp\test\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp\Microsoft.ML.OnnxRuntime.Tests.NetCoreApp.csproj]
```

Then I checked the code. I couldn't understand how it worked before. In this build, `__TRAINING_ENABLED_NATIVE_BUILD__` is not defined. But the "CheckpointState" class is defined in https://github.com/microsoft/onnxruntime/blob/main/csharp/src/Microsoft.ML.OnnxRuntime/Training/CheckpointState.shared.cs#L21 and that file is empty when `__TRAINING_ENABLED_NATIVE_BUILD__` is not defined. So I don't understand how it could work in a normal build without dnnl.
Here is my build command:

```
python tools\ci_build\build.py --config RelWithDebInfo --build_dir dnnlbuild --skip_submodule_sync --build_csharp --parallel --use_binskim_compliant_compile_flags --cmake_generator "Visual Studio 17 2022" --build_shared_lib --enable_onnx_tests --build_wheel --msbuild_extra_options "IncludeMobileTargets=false" --build_nuget --use_vcpkg --use_vcpkg_ms_internal_asset_cache --use_dnnl
```

This PR removes the failed test.

* Qnn weight sharing improvement (#23945)

### Description

Improves Qnn weight sharing so that only the last session in the weight sharing group (the session that has both share_ep_contexts and stop_share_ep_contexts enabled) generates the .bin file. The .bin file name is decided by the 1st session, and all generated `*_ctx.onnx` models point to this single .bin file, which avoids post-processing work.

Previously, each session generated a `_ctx.onnx` model with its own .bin file. That required post-processing to go through the generated `*_ctx.onnx` models, find the last generated `*_ctx.bin` file, update all `*_ctx.onnx` models to point to that same .bin file, and remove the unused .bin files.

* Correct generated cmake syntax (#24016)

### Description

Previously, the build failed with: CMake Error at build/Android/intermediates/armeabi-v7a/vcpkg/buildtrees/0.vcpkg_dep_info.cmake:15: Parse error. Expected a newline, got identifier with text "set".

* [webgpu] allow to specify UseIndicesTypeAlias for Indices (#24019)

### Description

Allow to specify `UseIndicesTypeAlias` for `AddIndices` in `ShaderHelper`.

* [webgpu] allow overloads to Program::AddIndices (#24021)

### Description

This change allows more overloads for the `Program::AddIndices` method, and makes use of r-value references for parameters when possible.
Also fixed the implementation of the `AddInputs` and `AddOutputs` methods to use r-value references for the parameters.

* fix test for RotaryEmbedding (#24022)

### Description

The `BaseTester::Run` function signature is:

```c++
void BaseTester::Run(ExpectResult expect_result,
                     const std::string& expected_failure_string,
                     const std::unordered_set<std::string>& excluded_provider_types,
                     const RunOptions* run_options,
                     std::vector<std::unique_ptr<IExecutionProvider>>* execution_providers,
                     ExecutionMode execution_mode,
                     const Graph::ResolveOptions& options);
```

Its behavior is:
- if the parameter `execution_providers` is empty, it will try to aggregate all execution providers available in the build, and for each EP, create an inference session and perform the test.
- if the parameter `execution_providers` is not empty, it will run a single inference session, using the passed-in `execution_providers` as session options, and perform the test.

The old code may put multiple EPs into a single inference session, but at runtime only one EP runs the test. Specifically, WebGPU EP comes after CPU EP in this case, so the test never ran on WebGPU EP.

**To reviewers**: if you see **a lot of** changes, click the "setting" button next to the "Jump to",
<img width="277" alt="image" src="https://github.com/user-attachments/assets/e8947ffb-f230-4c59-a5b7-36c0aedd2b7c" />
and check the "Hide Whitespace" and load it again.
<img width="137" alt="{4D60F676-35F4-4546-B8E1-E2F42411A9E6}" src="https://github.com/user-attachments/assets/f4c58e6e-c290-49f7-aca7-c413db1e3c77" />

* Fix attention bias broadcast (#24017)

### Description

* Fix broadcast on attention bias dim 1.
* Increase test cases in test_mha.py in pipeline to cover the testing.

### Motivation and Context

This feature was added in https://github.com/microsoft/onnxruntime/pull/21710. There was a bug in computing the offset when the attention bias is broadcast on dim 1 only, in both the CUDA and CPU kernels.
It can be triggered when the attention bias shape is like [batch_size, 1, sequence_length, total_sequence_length] and batch_size > 1 when the unfused kernel is selected. Note that cudnn flash attention and cutlass fused attention also support attention bias, so the bug in the unfused kernel was not discovered previously.

* Remove unused parameter in csharp InferenceTest (#24031)

### Description

Fix a warning from analyzers:

```
Theory method 'CanRunInferenceOnAModelDotnetTensors' on test class 'InferenceTest' does not use parameter 'enableParallelExecution'. Use the parameter, or remove the parameter and associated data. (https://xunit.net/xunit.analyzers/rules/xUnit1026
```

### Motivation and Context

* [TensorRT EP] Call cudaSetDevice at compute function for handling multithreading scenario (#24010)

The GPU device is set again in the compute function (at compute time) to handle multithreading scenarios.

Consider the following:

Users can create multiple threads to initialize separate inference sessions on different devices (not just the default device 0).
Later, additional threads may be spawned to execute inference_session.Run(), which calls this compute function.

Since new threads default to using device 0, it's necessary to explicitly set the correct device to ensure computations run on the intended GPU.

Example code:

````python
import threading  # needed for threading.Thread below

import onnxruntime as ort  # needed for ort.SessionOptions / ort.InferenceSession

# One provider list per GPU; sessions alternate between device 0 and device 1.
provider = [
    [
        ('TensorrtExecutionProvider', {
            'device_id': 0,
        }),
    ],
    [
        ('TensorrtExecutionProvider', {
            'device_id': 1,
        }),
    ]
]

class ThreadObj():
    def __init__(self, model_path: str, iterations: int, idx: int):
        ...
        sess_opt = ort.SessionOptions()
        self.inference_session = ort.InferenceSession(model_path, sess_opt, provider[idx % 2])

    def warmup(self):
        self.inference_session.run(None, self.input)

    def run(self, thread_times, threads_complete):
        for _ in range(self.iterations):
            self.inference_session.run(None, self.input)

def thread_target(obj, thread_times, threads_complete):
    obj.run(thread_times, threads_complete)

...

iterations = 500
num_threads = 13
t_obj_list = []
thread_list = []

# Sessions are created and warmed up on the main thread ...
for tidx in range(num_threads):
    obj = ThreadObj(model_path, iterations, tidx)
    t_obj_list.append(obj)
    obj.warmup()

# ... and then run from newly spawned threads, which default to device 0.
for t_obj in t_obj_list:
    thread = threading.Thread(target=thread_target, daemon=True, args=(t_obj, thread_times, threads_complete,))
    thread.start()
    thread_list.append(thread)

...
````

Note: Based on our measurements (using cuda event) on the A100 GPU with CUDA 12, the execution time for `cudaSetDevice` is approximately 0.004 ms, which is negligible and does not impact runtime performance.

* Increase timeout for ARM64-Xcode16-targeting-iphonesimulator (#24030)

* Support tvOS build (#24000)

* [TensorRT EP] Stop enforcing oss parser during Windows debug build (#24036)

### Description

Reverting, as this issue disappeared after adopting the newer TRT API.

This has been validated by building ORT 1.20.1/1.21.0 debug builds and testing on FRCNN/resnet50 models.

### Motivation and Context

* Set CMAKE_POLICY_DEFAULT_CMP0069 to NEW to ensure that IPO flags are added for dependencies. (#24034)

Set CMAKE_POLICY_DEFAULT_CMP0069 to NEW to ensure that interprocedural optimization (IPO) flags are added for dependencies. If the OLD behavior is used, the IPO flags are only added for the Intel compiler on Linux.

* Make Cuda packaging pipeline 1ES compliant (#23806)

### Description

Make [Cuda packaging pipeline](https://dev.azure.com/aiinfra/Lotus/_build?definitionId=1287&_a=summary) 1ES compliant

### Motivation and Context

### Check List

- [x] pool `onnxruntime-Win-CPU-2022` not found

* [webgpu/wasm] allow runtime switch between WebGPUEP and JSEP (#24032)

### Description

Add `--webgpu-ep=runtime` to allow building ort-web with both WebGPUEP and JSEP, while at runtime using `globalThis.WEBGPU_EP` to switch between them.

This change makes perf comparison between WebGPU EP and JSEP much easier.

* Move call to MLAS_CPUIDINFO::GetCPUIDInfo() out of MlasSQNBitGemmDispatchNeon initialization. (#24018)

Move call to `MLAS_CPUIDINFO::GetCPUIDInfo()` out of `MlasSQNBitGemmDispatchNeon` initialization.

Reduce binary size when MatMulNBits op is not included in the build.

I believe the side effect of `MLAS_CPUIDINFO::GetCPUIDInfo()` (e.g., initializing a static object) prevents the linker from discarding the code in a build where the associated MLAS functions are unused.

* [webgpu] fix the wrong dispatch size in flash_attention (#24020)

### Description

### Motivation and Context

Co-authored-by: Yulong Wang <[email protected]>

* avoid copy unnecessary files for nodejs pkg (#23992)

### Description

Remove duplicated file in nodejs package. #23956

* Add support for custom position ids and attention bias to GQA CPU operator (#23944)

### Description

- Added support for custom position ids and attention masks to the GQA CPU operator (fp32 and fp16)
- Added MLAS eltwise add kernel for mask application for FP32 and FP16
- Added unit tests for the added eltwise add MLAS kernel
- Modified python tests to test the new GQA inputs

### Motivation and Context

Custom position ids and attention mask are required in order to implement speculative decoding in PhiSilica

### Benchmarks

All the benchmarks are executed on the GQA op configuration which will be used in the PhiSilica speculative decoding scenario. The configuration is as follows:

- num_heads: 32
- kv_num_heads: 32
- do_rotary: 1
- local_window_size: -1
- head_size: 96
- sequence_length: 6
- packed_qkv: True

Benchmarks were executed on Cadmus with Snapdragon(R) X 12-core X1E80100 @ 3.40 GHz

In the tables below, the column headers are the total sequence length values used for benchmarking, and the rows indicate whether the attention bias was used. Values are average inference time in ms over 100000 runs.
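
As a reading aid for the tables that follow, the "Runtime increase" row appears to be the relative change of the "With bias" time against the "Without bias" time. The helper name below is hypothetical, and the formula is inferred from the published numbers rather than taken from the benchmark script:

```python
def runtime_increase_pct(without_bias_ms: float, with_bias_ms: float) -> float:
    """Relative change in average inference time when the attention bias is used, in percent."""
    return (with_bias_ms - without_bias_ms) / without_bias_ms * 100.0


# Values copied from the first column (total sequence length 50) of each table.
fp16_len50 = runtime_increase_pct(0.284054, 0.250926)
fp32_len50 = runtime_increase_pct(0.259049, 0.261631)

print(f"fp16, length 50: {fp16_len50:+.2f}%")
print(f"fp32, length 50: {fp32_len50:+.2f}%")
```

A negative value means the run with bias was faster, which at these timescales is most likely measurement noise rather than a real speedup.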
#### Fp16 results

| Total sequence length | 50 | 100 | 250 | 500 | 750 | 1000 | 1500 | 2000 | 2500 | 3000 | 3500 | 4000 |
|:-----------------|:---------|:---------|:---------|:---------|:---------|:---------|:---------|:--------|:--------|:--------|:--------|:--------|
| Without bias | 0.284054 | 0.257449 | 0.275806 | 0.334123 | 0.458324 | 0.614133 | 0.912791 | 1.38585 | 1.92186 | 2.39203 | 2.88808 | 3.46262 |
| With bias | 0.250926 | 0.253072 | 0.279724 | 0.337774 | 0.499058 | 0.585388 | 0.914316 | 1.40701 | 1.87311 | 2.47475 | 3.3906 | 3.47474 |
| Runtime increase | -11.66% | -1.7% | +1.42% | +1.09% | +8.89% | -4.68% | +0.17% | +1.53% | -2.54% | +3.46% | +17.4% | +0.35% |

#### Fp32 results

| Total sequence length | 50 | 100 | 250 | 500 | 750 | 1000 | 1500 | 2000 | 2500 | 3000 | 3500 | 4000 |
|:-----------------|:---------|:---------|:---------|:---------|:---------|:---------|:--------|:--------|:--------|:--------|:--------|:--------|
| Without bias | 0.259049 | 0.270541 | 0.304583 | 0.376708 | 0.554013 | 0.633217 | 1.20696 | 1.65985 | 1.95169 | 2.45807 | 3.05637 | 4.05169 |
| With bias | 0.261631 | 0.268002 | 0.300853 | 0.370452 | 0.529865 | 0.735216 | 1.43493 | 1.4385 | 1.99028 | 2.3858 | 2.99425 | 4.80197 |
| Runtime increase | +1.0% | -0.94% | -1.22% | -1.66% | -4.36% | +16.11% | +18.89% | -13.34% | +1.98% | -2.94% | -2.03% | +18.52% |

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* [WebNN] Better int64 integration (#23831)

This PR adds some workarounds to enable int64 support for some WebNN backends which don't support int64 data type.

- Do not fallback ops that are specifically due to the int64 limitation.
- Convert all int64 initializer and input values to int32 and handle potential overflow errors.
- Register all int64 model inputs and outputs as int32 ml-tensor.
- Handle ONNX ops that need inputs or outputs conversion between int64 and int32. e.g. ArgMax, ArgMin, Cast, etc.
- Convert int32 output data back to int64.
- Disallow int64 outputs as 'ml-tensor' preferredOutputLocation.

Fixed #21401

* Convert Windows GPU pipelines and Windows OpenVino pipeline to Github Actions (#24029)

### Description

Convert Windows GPU pipelines and Windows OpenVino pipeline to Github Actions

* [ARM CPU] Fix fp16 const initialization on no-fp16 platform (#23978)

### Description

Fix fp16 const initialization on no-fp16 platform [such as Raspberry PI](https://github.com/microsoft/onnxruntime/issues/23957)

### Motivation and Context

Resolve #23957

* [Native WebGPU EP] Add packedQKV and do_rotary attribute support to GroupQueryAttention operator (#23386)

### Description

Add Packed QKV inputs and do_rotary attribute to GQA.

### Motivation and Context

Packed QKV inputs and do_rotary attribute are required for certain models.

* Whisper Redesigned Solution (#23549)

### Description

This PR re-designs how Whisper is created and supported in ONNX Runtime. The new solution leverages [previous optimization work](https://github.com/microsoft/onnxruntime/pull/15473), and it is designed to be used in conjunction with [this work](https://github.com/microsoft/onnxruntime-genai/pull/1229) in ONNX Runtime GenAI.

Some of the added changes include:

- Re-designed export that creates new ONNX models without needing a `WhisperBeamSearch` op
- Creates one encoder model that also pre-computes the cross-attention KV caches (since they only need to be calculated once)
- Creates one decoder model that can be used during pre-fill and token generation
- Creates one jump-times model that can be used for word-level timestamps
- Removes need for a `WhisperBeamSearch` op to chain the encoder and decoder subgraphs
- Removes need to duplicate decoder's weights in memory
- Previous solution with the `WhisperBeamSearch` op created an encoder-decoder-init model and decoder-with-past model. The decoder was duplicated, with one copy in each.
- Removes need for separate logic to export the PyTorch model coming from OpenAI vs. the PyTorch model coming from Hugging Face
- Re-factors common parameters and logic used in CPU and CUDA attention kernels
- Adds `DUMP_STRING` to enable easy logging of intermediate information when running in debug mode to debug a problem. This info is not printed in release mode so it will not impact performance.
- Integrates `DecoderMaskedMultiHeadAttention` into `MultiHeadAttention`
- Enables past-present buffer sharing in the `MultiHeadAttention` op for improved performance
- Adds `cache_indirection` and `past_sequence_length` as new optional inputs to `MultiHeadAttention`
- Adds `output_qk` as new optional output to `MultiHeadAttention`
- Enables calculating `output_qk` tensor with FP16 or FP32 precision, regardless of the model's precision
- CI tests that run end-to-end across various flag combinations that are used by many customers internally and externally

The existing solutions are still available if desired.

### Known Issues

- The FP32 CPU model with the `WhisperBeamSearch` op and output QK is currently disabled. This is because ONNX Runtime doesn't currently support output QK kernels on CPU, only on CUDA.
- The `DecoderMaskedMultiHeadAttention` CPU kernel has a parity mismatch with the `DecoderMaskedMultiHeadAttention` CUDA kernel.
- Using `DecoderMaskedMultiHeadAttention` for the FP32 CPU model is not enabled. Currently, it uses `MultiHeadAttention` to avoid the parity mismatch issue.

### Motivation and Context

Using the beam search op has made it more difficult to debug and fix errors that are encountered. This new approach is more flexible and more customizable for users (e.g. by running with ONNX Runtime GenAI). It also helps [this issue](https://github.com/microsoft/onnxruntime/issues/18216).
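
To make the encoder/decoder split above more concrete, here is a toy, pure-Python sketch of why the cross-attention KV caches only need to be computed once. Everything in it is an assumption made for illustration: `ToyEncoder`, `decoder_step`, the 2x2 weights, and the single-head attention are hypothetical and do not reflect the exported ONNX graphs or any ONNX Runtime API.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix multiply, to keep the sketch dependency-free."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class ToyEncoder:
    """Hypothetical encoder that also emits the cross-attention K/V for the decoder."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.kv_projections = 0  # counts how many times K/V were projected

    def run(self, audio_features):
        hidden = audio_features  # stand-in for the real encoder stack
        self.kv_projections += 1
        return {"cross_k": matmul(hidden, self.w_k), "cross_v": matmul(hidden, self.w_v)}


def decoder_step(query, cross_kv):
    """Single-head cross-attention, softmax(q . K^T) . V, reusing the cached K/V."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in cross_kv["cross_k"]]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cross_kv["cross_v"][0])
    return [
        sum(w / total * value[i] for w, value in zip(weights, cross_kv["cross_v"]))
        for i in range(dim)
    ]


identity = [[1.0, 0.0], [0.0, 1.0]]
encoder = ToyEncoder(w_k=identity, w_v=identity)

# The encoder runs once per audio input and produces the cross-attention cache ...
cross_kv = encoder.run([[1.0, 0.0], [0.0, 1.0]])

# ... which every decoding step then reuses without recomputing it.
outputs = [decoder_step(query, cross_kv) for query in ([1.0, 0.0], [0.0, 1.0], [0.5, 0.5])]
```

Because the K/V tensors depend only on the audio, emitting them as encoder outputs lets a single decoder model serve both pre-fill and token generation.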
---------

Co-authored-by: mindest <[email protected]>

* Windows: Show more useful DLL load errors to say exactly what DLL is missing (#24053)

### Description

When we fail to load a provider shared DLL on Windows, the error is not very specific. Users have to figure out whether the onnxruntime file is missing, a cuda file is missing, or cudnn is not installed (and perhaps others). And this is just the cuda provider. It would be far more useful if the error said exactly which file is missing so the user can fix the actual problem.

Plus, this will likely result in many fewer github issues regarding this problem, and any that are still filed will be much easier to resolve.

This fix adds a function that will try loading a dll and its dependencies recursively to figure out which file is missing. It uses the OS dbghelp library to do it and is not very complex.

This also fixes a many-year-old bug that was introduced in the change to use FormatMessage in env.cc, where the system error would always be an empty string `error 126 ""` due to passing 0 as the format buffer length. We will now see the more useful `The specified module could not be found.` style error messages.

### Motivation and Context

Previously, if we failed to load the cuda provider, the error would look like this, which is limited:

`unknown file: error: C++ exception with description " onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll"`

Now it will look like this if cudnn is not installed:

`unknown file: error: C++ exception with description onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : Error loading "C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll" which depends on "cudnn64_9.dll" which is missing. (Error 126: "The specified module could not be found.")`

If cuda is not installed:

`unknown file: error: C++ exception with description onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : Error loading "C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll" which depends on "cudart64_12.dll" which is missing. (Error 126: "The specified module could not be found.")`

And if onnxruntime_providers_cuda.dll is not installed:

`unknown file: error: C++ exception with description onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : Error loading "C:\Example\Path\To\Library\onnxruntime_providers_cuda.dll" which is missing. (Error 126: "The specified module could not be found.")`

* Extend CMAKE_CUDA_FLAGS with all Blackwell compute capacity (#23928)

### Description

* Update range to build SASS on all arch and PTX on highest arch
* when cuda>=12.8, build all arch (including latest blackwell)

### Motivation and Context

https://cmake.org/cmake/help/latest/prop_tgt/CUDA_ARCHITECTURES.html
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

* [WebGPU] Reduce staging buffers for uploading intializers (#23968)

This change reduces the number of staging buffers used for uploading initializers to the GPU. First, we release the upload staging buffers early. Second, we use the BufferMapExtendedUsages feature of Dawn on UMA GPUs, which allows us to write directly into the destination GPU buffer without the need for a staging buffer. To achieve this, we need to ensure the UMA GPU buffers are mapped at creation. We make BufferManager aware of OnSessionInitializationEnd(), so that it can handle buffer Create() and Upload() calls properly.

Credits to @fs-eire for the overall design of the implementation.

* [WebGPU EP] Implement Remaining Reduction Ops (#24045)

### Description

Adds naive implementations of ReduceMin, ReduceProd, ReduceL1, ReduceL2, ReduceLogSum, ReduceSumSquare, and ReduceLogSumExp.
Will optimize to use shared memory in a later PR.

### Motivation and Context

Increases WebGPU EP operator coverage.

* add bool support to EPContext schema to unblock some models (#24065)

### Description

add bool support to EPContext schema to unblock some models

* [WebGPU EP] fix for reduce min/max error on MacOS CI (#24077)

### Error

```Traceback
/onnxruntime/onnxruntime/core/providers/webgpu/reduction/reduction_ops.cc:146 [allow_multi_axes = true] Axes values must be in the range [-rank, rank-1]. Got: 446098880
```

* Upgrade current MacOS-13 to 14 (#23293)

### Description

Upgrade current MacOS-13 to 14

### Motivation and Context

- [x] Update the RN to 0.73.x+ to have the newer version of boost

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Fix CUDA EP Abs and Sign bfloat16 support (#23914)

### Description

Abs and Sign had bfloat16 kernels created but not registered with the CUDA EP. Additionally Sign bfloat16 didn't work.

* register bfloat16 kernels with CUDA EP
* fix incorrectly named macro by adding 'X' as they add bfloat16 registration
* add specialization for bfloat16 to _Sign
  * copied existing pattern. not sure if there's a better way
* update tests

### Motivation and Context

#23875

* Improve typing for OrtValue and other public Python interfaces (#24086)

### Description

Improve the OrtValue interface typing and change `staticmethod` to `classmethod` for constructors to follow python conventions (https://google.github.io/styleguide/pyguide.html#2174-decision).

* [webgpu] Limit that K must be divisible by 128 to apply dp4a matmul (#24078)

The DP4AMatMulQuantize shader needs to make sure that K is divisible by 128. Otherwise, we would need to align the scale to have shape [M, ceil(K / 128)]. To simplify the shader, we require K to be divisible by 128 to apply dp4a matmul.

* Add macOS ARM64 pipeline for webgpu (#24060)

### Description

Add macOS ARM64 pipeline for webgpu.

This pipeline is a temporary one.
I created this pipeline because the current code already fails on macOS ARM64 for WebGPU EP. Adding this pipeline makes it possible to check the status of the fix, and eventually, when the build passes, this pipeline will be merged with the existing macOS arm64 pipeline.

* [WebNN/WebGPU JS] Fix shared Module methods overriding each other (#23998)

- Renamed all conflicting WebNN methods from `jsep*` to `webnn*`.
- WebNN doesn't need flush(), therefore it doesn't need to set `jsepBackend`.

This PR addresses issue microsoft/webnn-developer-preview#78

* Enable multithreading on FP16 to FP32 cast operator (#23619)

### Description

Enables multithreading on FP16 to FP32 cast operator.

### Motivation and Context

Improves CPU performance on FP16 models that require casting to FP32.

* Move Android CI Pipeline to Github Actions (#24094)

### Description

Move Android CI Pipeline to Github Actions

* Cleanup CoreML EP's code to remove COREML_ENABLE_MLPROGRAM (#23490)

### Description

Clean up CoreML EP's code to remove the COREML_ENABLE_MLPROGRAM macro.

Also, increase MINIMUM_COREML_VERSION (the first version we support) to 5.

* webgpu ep support for argmax/argmin (#24089)

* [mobile/reactnative] Remove namespace from AndroidManifest.XML to resolve warning (#23847)

### Description

Removes namespace from AndroidManifest.XML

### Motivation and Context

- Resolves #21681

* [WebGPU EP] fix implementation of Pow (#24088)

### Description

Use custom implementation for Pow to fix test failures.

* Increase timeout to 90min for ARM64-Xcode16-targeting-iphonesimulator (#24091)

### Description

The pipeline still times out occasionally, so further extend the timeout to 90 minutes for ARM64-Xcode16-targeting-iphonesimulator.

It takes quite a while if all build cache is missing.

### Motivation and Context

The pipeline sometimes failed because of a timeout. A previous PR, #24030, increased the timeout from 60 min to 75 min, but that turned out not to be enough.
* [WebGPU] fix test failure in Reduce operators on macOS ARM64 (#24108)

### Description

fix test failure in Reduce operators on macOS ARM64

```
[E:onnxruntime:ReduceL1, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running ReduceL1 node. Name:'node1' Status Message: webgpu_context.cc:259 Run Uniform variable[0] (output_size) data type mismatch in program "ReduceL1", Expected: u32, Actual: i32
```

* [WebGPU EP] Implements CumSum Operator (#24047)

Increases WebGPU EP op coverage.

* [webgpu] Use 1d dispatch group size (#24084)

This PR uses a 1d dispatch group size and uses workgroup_idx instead of workgroup.x|workgroup.y in case they are normalized.

* [WebGPU] fix test failure in MatMulNBits on macOS ARM64 (#24109)

### Description

abs_error is slightly loosened from 0.02 to 0.03 to allow test cases on macOS arm64 to pass.

* [QNN-EP] Add support for Sum operator with 2 inputs (#24098)

### Description

* Add Sum to op builder in QNN-EP
* For now, support is limited to Sum with 2 inputs.

### Motivation and Context

* Enhance QNN-EP support for Sum with two inputs

* [WebNN] Replace narrow with SafeInt for consistently in integer handling (#24059)

Also removes redundant header files.

* [QNN-EP] Add Lora Support with offline QNN context binary (#24026)

### Description

- Add the new run option called lora_config to feed the information from lora binary
- Parse and apply the lora binary in OnRunStart

### Motivation and Context

- Support Lora Adapter Binary with QNN Context Binary Usage

* [TensorRT EP] support TensorRT 10.9-GA (#23905)

### Description

* Update to trt10.9
* oss parser tested (here's testing method https://onnxruntime.ai/docs/build/eps.html#note-to-ort-1210-open-sourced-parser-users)

### Motivation and Context

* [webgpu] Apply dp4a for generation shader (#24064)

This PR applies DP4A to the generation shader. It also supports any block_size where block_size % 32 == 0.
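
For readers unfamiliar with DP4A, the arithmetic it performs can be sketched as below. This is only a Python model of the operation (a dot product of four packed int8 values accumulated into an integer), not the shader code from the PR; `pack_int8x4`, `dp4a`, and `block_dot` are hypothetical names, and the real implementation relies on the GPU's packed dot-product instruction.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def dp4a(a_packed, b_packed, acc=0):
    """Dot product of two packed int8x4 words, added to an integer accumulator."""
    a = struct.unpack("<4b", struct.pack("<I", a_packed))
    b = struct.unpack("<4b", struct.pack("<I", b_packed))
    return acc + sum(x * y for x, y in zip(a, b))


def block_dot(a_block, b_block):
    """Dot product over one quantized block, consuming four elements per dp4a call."""
    # Assumption taken from the PR text: the block size is a multiple of 32.
    assert len(a_block) == len(b_block) and len(a_block) % 32 == 0
    acc = 0
    for i in range(0, len(a_block), 4):
        acc = dp4a(pack_int8x4(a_block[i:i + 4]), pack_int8x4(b_block[i:i + 4]), acc)
    return acc


# Arbitrary int8 test data for one block of 32 elements.
a = [(i % 7) - 3 for i in range(32)]
b = [(i % 5) - 2 for i in range(32)]
result = block_dot(a, b)
```

A block of 32 int8 values maps to eight packed words, which is one reason a multiple-of-32 block size is convenient for this kind of kernel.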
* [CUDA] Support slide window in cutlass fused attention (#24072) ### Description Add slide window support in cutlass fused attention ### Motivation and Context The change was previously created by Ye: https://github.com/microsoft/onnxruntime/pull/21926 I merged the change and resolved some conflictions. Also reversed some Ye's change in kernel_forward.h, so that our code is consistent with pytorch code. * [MIGraphX EP] rename HIPPinnedAllocator to MIGraphXPinnedAllocator (#24103) ### Description Rename class HIPPinnedAllocator to MIGraphXPinnedAllocator ### Motivation and Context To align allocators' naming for the MIGraphX EP * [MIGraphX EP] check POLICY CMP0144 availability before used (#24104) ### Description For a newer CMake, suppress warnings about incorrect letter cases in package names. ### Motivation and Context To avoid reporting for newer CMake that a package name contains capital letters when small letters are required. * [JSEP] handles edge case in gridsample operator (#24121) fix for https://github.com/microsoft/onnxruntime/issues/24070 * [OpenVINO]Session Options Appended After AppendExecutionProvider (#23852) Description To honor SessionOption API Contract the ordering of AddConfigOption and AppendExecutionProvider_OpenVINO should not matter. This PR is fixing that issue Motivation and Context This PR fixes a regression happened during last PR in ordering of SessionOptions. * [webgpu]Add MaxPool and AveragePool (#23714) This adds Max and Average pool operators for webgpu-native. Basically, this is a rewrite of the corresponding JSEP operators with some improvements: 1) 'dilations' support 2) Pooling with kernelShape.length > 2 for NHWC format 3) code cleanup However, there are still a few missing features: 1) ceil 'ceil_mode' 2) column major 'storage_order' 3) 'Indices' output for Max pools. * [webgpu EP] put GetMaxComponents and SumVector to one place. (#24122) ### Description put `GetMaxComponents` and `SumVector` to one place. 
fix a bug in `SumVector`: ```diff - return "(" + x + ".x + " + x + ".y + " + x + ".w + " + x + ".z" + ")"; + return "(" + x + ".x + " + x + ".y + " + x + ".z + " + x + ".w" + ")"; ``` * skip MOE python test when MPI is not installed (#24116) ### Description It is not common that dev machine have MPI installed. Skip the test if MPI is not installed. ### Motivation and Context Make it easy to run pytest in dev machine without the need to skip the test manually. * Integrate KleidiAI for MatMulNBits via MlasQNBitGemm (#23627) ### Description This PR integrates Arm® KleidiAI™ to provide optimized assembly kernels for matrix multiplication with 4-bit quantized weights. These changes target the MlasQNBitGemm functions, and can be utilized via the MatMulNBits operator. * add test cases for webgpu ep in web (#24117) ### Description This PR enables web tests (NPM suite tests) for WebGPU EP. There are some test failures expected, so the specific job is marked as "continueOnError". ### Motivation and Context  * Refactor Webnn IsSupported*() to use constant initializers. (#24118) ### Description  This PR continues the work started at https://github.com/microsoft/onnxruntime/pull/19401. ### Motivation and Context An overridable initializer should not have a fixed value included in an WebNN model as it could be changed at runtime. The current check doesn't include validating that the initializer is constant. * Deleted the constant SKIP_CUDA_TEST_WITH_DML (#24113) ### Description Deleted the constant SKIP_CUDA_TEST_WITH_DML. It does not seem to be used anywhere. ### Motivation and Context The constant SKIP_CUDA_TEST_WITH_DML prohibits onnxruntime to be compiled when both of the flags -use_cuda and -use_dml are set. Co-authored-by: Andreas Hussing <[email protected]> * Update T5 Onnx Export and Optimization (#23949) Previously, the encoder onnx model adds extra initialization for decoder to generate kv cache from prompt. It is not necessary. 
Here we redesign onnx export for T5 model to output two separate models for encode and decoder. Move Linear that generates cross features based on encoder_hidden_states to encoder onnx model. In this way, the encoder does not need output encoder_hidden_states, and only need output the features for cross attention used in decoder. Major changes: -[x] update t5 onnx export script -[x] update convert_generation script -[x] update beam search to support changes of inputs and outputs (detail can be found below). -[x] add a tiny t5 model, and enable t…

kunal-vaishnavi and others added 18 commits September 30, 2024 18:17

Rename Whisper encoder input to audio features

c2c8745

Initial commit for new export

1d5f4f0

Fix KV cache initialization and runtime bugs

5bf4628

Add another check for alignment heads input

3cb936e

Dump logits in ORT GenAI

b648f58

Fix cross QK update

2a5b762

Fix finalize cross QK

e24db74

Save checkpoint for working solution

e4c838e

Clean up code

3a548a1

Remove unneeded template instantiations

4d9af67

Fixes: update crossQK copy for first step;

1d9161d

br is already batch_beam_size; fix offset of cache_indir accordingly.

Enable getting model inputs to user

97be76a

Add additional check for cache indirection

1bcd264

Add audio processing unit test

c35a73d

Fix Whisper GenAI config

1d5da61

Save checkpoint for working solution

efd0199

Merge branch 'main' into kvaishnavi/whisper

fbebe68

Merge branch 'main' into kvaishnavi/whisper

ef955e7

kunal-vaishnavi mentioned this pull request Feb 5, 2025

Whisper Redesigned Solution microsoft/onnxruntime#23549

Merged

kunal-vaishnavi force-pushed the kvaishnavi/whisper branch from 53640f2 to ef955e7 Compare February 6, 2025 00:35

kunal-vaishnavi added 3 commits February 6, 2025 00:45

Initial changes to work with main

32c48d2

Merge branch 'main' into kvaishnavi/whisper

e4a8b5f

kunal-vaishnavi added 4 commits March 24, 2025 23:31

Merge branch 'main' into kvaishnavi/whisper

323028a

Resolving build errors after merging main

7756a86

Fix prompt length and get input

a167add

Merge branch 'main' into kvaishnavi/whisper

8782b47
