Whisper Redesigned Solution #23549
base: main
Conversation
return model


def fix_past_sequence_length(model: ModelProto):
Check notice: Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns (Note)
onnxruntime/python/tools/transformers/models/whisper/whisper_decoder.py (dismissed)
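A minimal sketch of how this note could be resolved by making the fall-through path return explicitly. The function body below is hypothetical and only illustrates the pattern, not the real logic in whisper_decoder.py:

```python
from onnx import ModelProto


def fix_past_sequence_length(model: ModelProto) -> ModelProto:
    # Hypothetical body for illustration only.
    for graph_input in model.graph.input:
        if graph_input.name == "past_sequence_length":
            # ... adjust the input as needed ...
            return model
    # An explicit return on the fall-through path satisfies the CodeQL rule and
    # makes the contract obvious: the (possibly unmodified) model is always returned.
    return model
```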
diff = np.abs(pt_outputs[i] - ort_outputs[i])
logger.warning(f"Comparing {output_name}...")
logger.warning(f"Max diff: {np.max(diff)}")
except:  # noqa: E722
Check notice: Code scanning / CodeQL
Except block handles 'BaseException' (Note)
onnxruntime/python/tools/transformers/models/whisper/whisper_encoder_decoder_init.py (two alerts, dismissed)
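A hedged sketch of how the bare `except:` above could be narrowed so that `KeyboardInterrupt` and `SystemExit` are not swallowed. The helper name and fallback message are assumptions; only the comparison lines come from the diff:

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def compare_outputs(output_name: str, pt_output: np.ndarray, ort_output: np.ndarray) -> None:
    """Hypothetical helper mirroring the parity check in the diff above."""
    try:
        diff = np.abs(pt_output - ort_output)
        logger.warning(f"Comparing {output_name}...")
        logger.warning(f"Max diff: {np.max(diff)}")
    except Exception:
        # Catching Exception rather than using a bare except lets Ctrl+C and
        # interpreter shutdown propagate while still tolerating shape/dtype mismatches.
        logger.warning(f"Could not compare {output_name}")
```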
@@ -0,0 +1,195 @@
import numpy as np

import onnxruntime as ort

Check notice: Code scanning / CodeQL
Module is imported with 'import' and 'import from' (Note, test)
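A brief sketch of the consistent-import style the note points toward, assuming the test module currently mixes `import onnxruntime as ort` with `from onnxruntime import ...`:

```python
# Pick one import style per module: keep the aliased module import and reference
# members through it, instead of also importing names with `from onnxruntime import ...`.
import numpy as np
import onnxruntime as ort

session_options = ort.SessionOptions()  # instead of `from onnxruntime import SessionOptions`
dummy_input = np.zeros((1, 80, 3000), dtype=np.float32)  # shape is illustrative only
```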
@@ -3,6 +3,7 @@
#include "contrib_ops/cpu/bert/attention_base.h"
#include "contrib_ops/cpu/bert/multihead_attention_helper.h"
#include "contrib_ops/cpu/utils/dump_tensor.h"

Reviewer comment: This include might not be needed. I did not see code in this file that dumps tensors.
auto* tp = context->GetOperatorThreadPool();

int total_sequence_length = past_sequence_length + 1;
Reviewer comment: In general, total_sequence_length = past_sequence_length + sequence_length, so this function assumes sequence_length is 1? Maybe add a comment that it is for decoder masked MHA.
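A tiny illustrative sketch (names are mine, not the kernel's) of the relationship the comment describes: hard-coding `past_sequence_length + 1` encodes the decoder-masked-MHA assumption that exactly one new token is processed per step:

```python
def total_sequence_length(past_sequence_length: int, sequence_length: int = 1) -> int:
    # For decoder masked MHA during token-by-token generation, sequence_length == 1,
    # which reduces this to past_sequence_length + 1 as in the kernel code above.
    return past_sequence_length + sequence_length


assert total_sequence_length(past_sequence_length=31) == 32  # decoding one token
assert total_sequence_length(8, 4) == 12                     # general (multi-token) case
```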
if (output_qk != nullptr) {
  const ptrdiff_t attention_probs_size = SafeInt<ptrdiff_t>(batch_size * num_heads_ * sequence_length * total_sequence_length);
  const ptrdiff_t attention_probs_bytes = attention_probs_size * sizeof(T);
  memcpy(output_qk, attention_probs, attention_probs_bytes);
Reviewer comment: Is this duplicated, since line 316 has already copied the data?
// If optional outputs aren't needed, present_key, present_value, and output_qk will be null
std::vector<int64_t> present_key_shape({static_cast<int64_t>(batch_size),
                                        static_cast<int64_t>(num_heads_),
                                        static_cast<int64_t>(parameters.max_sequence_length),
Reviewer comment: I understand that this is to support a shared buffer for past and present. Previously, this dimension was the total sequence length (when there is no shared buffer). Is this change compatible with that, or is max_sequence_length set to the total sequence length when there is no shared buffer?
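A hedged sketch of the compatibility question, with all names hypothetical: size the present KV to max_sequence_length only when past and present share a buffer, and fall back to total_sequence_length (the previous convention) otherwise:

```python
def present_kv_shape(batch_size: int,
                     num_heads: int,
                     head_size: int,
                     total_sequence_length: int,
                     max_sequence_length: int,
                     past_present_share_buffer: bool) -> tuple[int, int, int, int]:
    # With buffer sharing, present reuses the preallocated max-length buffer; without it,
    # keep the previous convention of sizing to total_sequence_length (equivalently,
    # set max_sequence_length = total_sequence_length in that case).
    seq_dim = max_sequence_length if past_present_share_buffer else total_sequence_length
    return (batch_size, num_heads, seq_dim, head_size)
```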
)
if reshape_qkv_2_path is None:
    if reshape_qkv_path[-1].input[0] != matmul_qkv.output[0]:
Reviewer comment: reshape_qkv_2_path might be None.
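A small illustrative guard for the hazard the comment flags, assuming the pattern-match helper returns None when no match is found. The helper below is hypothetical and is not the fusion code itself:

```python
from typing import Optional

from onnx import NodeProto


def path_feeds_from(path: Optional[list], producer: NodeProto) -> bool:
    """Check whether the last node of a matched path consumes `producer`'s output,
    treating a failed match (None) explicitly instead of indexing it."""
    if path is None:
        return False
    return path[-1].input[0] == producer.output[0]
```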
@@ -541,6 +541,7 @@ def create_multihead_attention_node(
    output: str,
    key_padding_mask: str = "",
    add_qk: str = "",
    unidirectional: bool = False,
Reviewer comment: Please check whether this function is called in derived classes, and update the references if needed.
# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-large-v3-turbo --output whisper-turbo --precision fp16 --provider cuda --use_gpu --use_external_data_format --optimize_onnx --no_beam_search_op --output_cross_qk
```

## Exporting Whisper with Beam Search
Reviewer comment: Is this still supported, deprecated, or should this variant be removed?
  rm -rf wtiny-fp16-cuda-oai ; \
  popd ; \
  '
displayName: 'Test Whisper export flag combinations'
Reviewer comment: Do you have, or plan to add, any functional tests that verify the forward-pass output of these converted models actually matches what is expected?
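A hedged sketch of the kind of functional check being asked about: run a converted model with ONNX Runtime and compare its outputs against precomputed reference outputs (for example, from the PyTorch model) within a tolerance. The model path, feeds, and tolerances are placeholders, not values from the pipeline:

```python
import numpy as np
import onnxruntime as ort


def check_parity(model_path: str,
                 feeds: dict[str, np.ndarray],
                 expected: dict[str, np.ndarray],
                 rtol: float = 1e-3,
                 atol: float = 1e-3) -> None:
    """Compare ONNX Runtime outputs for `model_path` against reference outputs."""
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    output_names = [o.name for o in session.get_outputs()]
    results = session.run(output_names, feeds)
    for name, value in zip(output_names, results):
        if name in expected:
            np.testing.assert_allclose(value, expected[name], rtol=rtol, atol=atol)
```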
Description
This PR re-designs how Whisper is created and supported in ONNX Runtime. The new solution leverages previous optimization work, and it is designed to be used in conjunction with this work in ONNX Runtime GenAI.
Some of the added changes include:
- Re-designed export of the encoder and decoder models without the `WhisperBeamSearch` op. Previously, the `WhisperBeamSearch` op was used to chain the encoder and decoder subgraphs, and exporting with the `WhisperBeamSearch` op created an encoder-decoder-init model and a decoder-with-past model, so the decoder was duplicated twice, one in each.
- Added `DUMP_STRING` to enable easy logging of intermediate information when running in debug mode to debug a problem. This info is not printed in release mode so it will not impact performance.
- Integrated `DecoderMaskedMultiHeadAttention` into the `MultiHeadAttention` op for improved performance.
- Added `cache_indirection` and `past_sequence_length` as new optional inputs to `MultiHeadAttention`.
- Added `output_qk` as a new optional output of `MultiHeadAttention`.
- Ability to produce the `output_qk` tensor with FP16 or FP32 precision, regardless of the model's precision.

The existing solutions are still available if desired.
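A hedged sketch of how the separately exported encoder and decoder models could be inspected and driven with plain ONNX Runtime sessions; the file names are placeholders, and the intended consumer of these models is ONNX Runtime GenAI as noted above:

```python
import onnxruntime as ort

# Placeholder paths for models produced by convert_to_onnx with --no_beam_search_op;
# the actual file names depend on the export settings used.
encoder = ort.InferenceSession("whisper_encoder.onnx", providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession("whisper_decoder.onnx", providers=["CPUExecutionProvider"])

# Inspect the exported I/O instead of hard-coding names, since the redesigned export
# defines the graph signatures (KV caches, cache_indirection, past_sequence_length, etc.).
print([i.name for i in encoder.get_inputs()])
print([i.name for i in decoder.get_inputs()])
print([o.name for o in decoder.get_outputs()])
```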
Known Issues
- Exporting with the `WhisperBeamSearch` op and output QK is currently disabled. This is because ONNX Runtime doesn't currently support output QK kernels on CPU, only on CUDA.
- There is an issue involving `Neg --> Shape` in the jump times model when exporting the model to contain the `WhisperBeamSearch` op.
- The `DecoderMaskedMultiHeadAttention` CPU kernel has a parity mismatch with the `DecoderMaskedMultiHeadAttention` CUDA kernel.
- `DecoderMaskedMultiHeadAttention` for the FP32 CPU model is not enabled. Currently, it uses `MultiHeadAttention` to avoid the parity mismatch issue.

Motivation and Context
Using the beam search op has made it more difficult to debug and fix errors that are encountered. This new approach is more flexible and more customizable for users (e.g. by running with ONNX Runtime GenAI). It also helps this issue.