8 changes: 6 additions & 2 deletions .github/workflows/test_onnxruntime.yml
@@ -24,9 +24,11 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.11]
python-version: [3.12]
runs-on: [ubuntu-24.04]
transformers_version: [latest, 4.36.*, 4.45.*, 4.56.*]
# transformers_version: [latest, 4.36.*, 4.45.*, 4.56.*, 5.2.*]
# temporarily disabled to prioritize fixing 5.2
transformers_version: [5.2.*]
test_file:
[
test_decoder.py,
@@ -73,6 +75,8 @@ jobs:
uv pip install "transformers==4.45.*"
elif [ "${{ matrix.transformers_version }}" == '4.56.*' ]; then
uv pip install "transformers==4.56.*"
elif [ "${{ matrix.transformers_version }}" == '5.2.*' ]; then
uv pip install "transformers==5.2.*"
elif [ "${{ matrix.transformers_version }}" != 'latest' ]; then
uv pip install "transformers==${{ matrix.transformers_version }}"
fi
45 changes: 38 additions & 7 deletions docs/source/onnx/usage_guides/export_a_model.mdx
@@ -70,13 +70,17 @@ The Optimum ONNX export can be used through Optimum command-line:
```bash
optimum-cli export onnx --help

usage: optimum-cli export onnx [-h] -m MODEL [--task TASK] [--opset OPSET] [--device DEVICE] [--dtype {fp32,fp16,bf16}] [--optimize {O1,O2,O3,O4}] [--monolith]
[--no-post-process] [--variant VARIANT] [--framework {pt}] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
[--pad_token_id PAD_TOKEN_ID] [--library-name {transformers,diffusers,timm,sentence_transformers}] [--model-kwargs MODEL_KWARGS]
[--no-dynamic-axes] [--no-constant-folding] [--slim] [--dynamo] [--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
[--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT] [--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE]
[--nb_max_frames NB_MAX_FRAMES] [--audio_sequence_length AUDIO_SEQUENCE_LENGTH] [--point_batch_size POINT_BATCH_SIZE]
[--nb_points_per_image NB_POINTS_PER_IMAGE] [--visual_seq_length VISUAL_SEQ_LENGTH]
usage: optimum-cli export onnx [-h] -m MODEL [--task TASK] [--opset OPSET] [--device DEVICE] [--dtype {fp32,fp16,bf16}]
[--optimize {O1,O2,O3,O4}] [--monolith] [--no-post-process] [--variant VARIANT]
[--framework {pt}] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
[--pad_token_id PAD_TOKEN_ID] [--library-name {transformers,diffusers,timm,sentence_transformers}]
[--model-kwargs MODEL_KWARGS] [--no-dynamic-axes] [--no-constant-folding] [--slim]
[--dynamo] [--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
[--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT]
[--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE]
[--nb_max_frames NB_MAX_FRAMES] [--audio_sequence_length AUDIO_SEQUENCE_LENGTH]
[--point_batch_size POINT_BATCH_SIZE] [--nb_points_per_image NB_POINTS_PER_IMAGE]
[--visual_seq_length VISUAL_SEQ_LENGTH]
output

options:
@@ -298,6 +302,33 @@ for any opset >= 18. It is based on [torch.export.export](https://docs.pytorch.o
The following page [ONNX Operators](https://onnx.ai/onnx/operators/index.html)
highlights which operators are available in every opset.

A simple example:

```bash
optimum-cli export onnx -m arnir0/Tiny-LLM --dynamo --dtype fp16 --opset 24 tiny-llm
```

```
Loading weights: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 2084.21it/s, Materializing param=model.norm.weight]
[torch.onnx] Obtain model graph for `LlamaForCausalLM([...]` with `torch.export.export(..., strict=False)`...
[torch.onnx] Obtain model graph for `LlamaForCausalLM([...]` with `torch.export.export(..., strict=False)`... ✅
[torch.onnx] Run decompositions...
[torch.onnx] Run decompositions... ✅
[torch.onnx] Translate the graph into ONNX...
[torch.onnx] Translate the graph into ONNX... ✅
[torch.onnx] Optimize the ONNX graph...
Applied 34 of general pattern rewrite rules.
[torch.onnx] Optimize the ONNX graph... ✅
-[x] values not close enough, max diff: 0.02734375 (atol: 1e-05)
-[x] values not close enough, max diff: 0.0078125 (atol: 1e-05)
-[x] values not close enough, max diff: 0.00048828125 (atol: 1e-05)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 0.02734375
- present.0.key: max diff = 0.0078125
- present.0.value: max diff = 0.00048828125.
The exported model was saved at: cli-tiny-llm.onnx
```
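
The warning that closes the log comes from comparing the reference outputs with the exported model's outputs against an absolute tolerance. A minimal sketch of that kind of check follows, in plain Python. The names `max_abs_diff` and `check_outputs` are hypothetical, the values are illustrative, and this is not Optimum's validation code.

```python
def max_abs_diff(reference, exported):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(r - e) for r, e in zip(reference, exported))


def check_outputs(reference_outputs, exported_outputs, atol=1e-5):
    """Return {output name: max diff} for every output outside the tolerance."""
    failures = {}
    for name, reference in reference_outputs.items():
        diff = max_abs_diff(reference, exported_outputs[name])
        if diff > atol:
            failures[name] = diff
    return failures


# Illustrative values only, not taken from a real export.
reference = {"logits": [0.5, 1.0, -2.0], "present.0.key": [0.25, 0.75]}
exported = {"logits": [0.5, 1.0078125, -2.0], "present.0.key": [0.25, 0.75]}

failures = check_outputs(reference, exported)
for name, diff in failures.items():
    print(f"- {name}: max diff = {diff}")
```

With `--dtype fp16`, differences in the 1e-3 to 1e-2 range are plausible, since half precision has a machine epsilon of roughly 9.8e-4. The `--atol` option listed in the usage string is presumably the knob to loosen the tolerance in that case.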

## Custom export of Transformers models

### Customize the export of official Transformers models
26 changes: 13 additions & 13 deletions docs/source/onnxruntime/usage_guides/gpu.mdx
@@ -196,21 +196,21 @@ And here is a summary for the saving time with different sequence lengths (32 /
Environment:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 28C P8 8W / 70W | 0MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-------------------------------------------------------------------------------+
| NVIDIA-SMI 580.102.01 | Driver Version: 581.57 | CUDA Version: 13.0 |
|-------------------------------+------------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 28C P8 8W / 70W | 0MiB / 15109MiB | 0% Default |
+-------------------------------+------------------------+----------------------+

- Platform: Linux-5.4.0-1089-aws-x86_64-with-glibc2.29
- Python version: 3.8.10
- `transformers` version: 4.24.0
- `optimum` version: 1.5.0
- PyTorch version: 1.12.0+cu113
- Python version: 3.12.3
- `transformers` version: 5.2.0
- `optimum` version: 2.1.0
- PyTorch version: 2.10.0+cu130
```

Note that previous experiments are run with __vanilla ONNX__ models exported directly from the exporter. If you are interested in __further acceleration__, with `ORTOptimizer` you can optimize the graph and convert your model to FP16 if you have a GPU with mixed precision capabilities.
36 changes: 28 additions & 8 deletions optimum/exporters/onnx/convert.py
@@ -414,7 +414,10 @@ def _run_validation(
atol_msg = f"The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance {atol}:\n{msg}"

if isinstance(config, SpeechT5OnnxConfig):
atol_msg += "\nIMPORTANT NOTE: SpeechT5 uses a dropout at inference and the output validation of ONNX Runtime inference vs PyTorch is expected to fail. Reference: https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/models/speecht5/modeling_speecht5.py#L727"
atol_msg += (
"\nIMPORTANT NOTE: SpeechT5 uses a dropout at inference and the output validation of ONNX Runtime inference vs PyTorch "
"is expected to fail. Reference: https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/models/speecht5/modeling_speecht5.py#L727"
)
raise AtolError(atol_msg)


@@ -941,7 +944,9 @@ def onnx_export_from_model(
dynamo: bool = False,
**kwargs_shapes,
):
"""Full-suite ONNX export function, exporting **from a pre-loaded PyTorch model**. This function is especially useful in case one needs to do modifications on the model, as overriding a forward call, before exporting to ONNX.
"""Full-suite ONNX export function, exporting **from a pre-loaded PyTorch model**.
This function is especially useful when the model needs to be modified,
for example by overriding a forward call, before exporting to ONNX.

Args:
> Required parameters
@@ -978,10 +983,13 @@ def onnx_export_from_model(
in case, for example, the model inputs/outputs are changed (for example, if
`model_kwargs={"output_attentions": True}` is passed).
custom_onnx_configs (`Optional[Dict[str, OnnxConfig]]`, defaults to `None`):
Experimental usage: override the default ONNX config used for the given model. This argument may be useful for advanced users that desire a finer-grained control on the export. An example is available [here](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model).
Experimental usage: override the default ONNX config used for the given model.
This argument may be useful for advanced users that desire a finer-grained control on the export.
An example is available [here](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model).
fn_get_submodels (`Optional[Callable]`, defaults to `None`):
Experimental usage: Override the default submodels that are used at the export. This is
especially useful when exporting a custom architecture that needs to split the ONNX (e.g. encoder-decoder). If unspecified with custom models, optimum will try to use the default submodels used for the given task, with no guarantee of success.
especially useful when exporting a custom architecture that needs to split the ONNX (e.g. encoder-decoder).
If unspecified with custom models, optimum will try to use the default submodels used for the given task, with no guarantee of success.
use_subprocess (`bool`, defaults to `False`):
Do the ONNX exported model validation in subprocesses. This is especially useful when
exporting on CUDA device, where ORT does not release memory at inference session
@@ -1030,7 +1038,9 @@ def onnx_export_from_model(
task = TasksManager._infer_task_from_model_or_model_class(model=model)
except (ValueError, KeyError) as e:
raise RuntimeError(
f"The model task could not be automatically inferred in `onnx_export_from_model`. Please provide the argument `task` with the relevant task from {', '.join(TasksManager.get_all_tasks())}. Detailed error: {e}"
f"The model task could not be automatically inferred in `onnx_export_from_model`. "
f"Please provide the argument `task` with the relevant task from "
f"{', '.join(TasksManager.get_all_tasks())}. Detailed error: {e}"
)

if (
@@ -1045,7 +1055,13 @@ def onnx_export_from_model(

logger.info(f"Automatic task detection to: {task}.")

dtype = get_parameter_dtype(model) if isinstance(model, torch.nn.Module) and get_parameter_dtype else model.dtype
if isinstance(model, torch.nn.Module) and get_parameter_dtype:
    dtype = get_parameter_dtype(model)
elif hasattr(model, "dtype"):
    dtype = model.dtype
else:
    # Fall back to the default dtype.
    dtype = torch.float32

if "bfloat16" in str(dtype):
float_dtype = "bf16"
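
The dtype fallback in this hunk can be sketched without PyTorch. The helper names `resolve_dtype` and `to_float_dtype` are hypothetical, dtypes are plain strings here, and the `fp16`/`fp32` branches are assumptions modelled on the `--dtype {fp32,fp16,bf16}` choices, since the hunk only shows the `bf16` case. The sketch also drops the `isinstance(model, torch.nn.Module)` check to stay dependency-free.

```python
def resolve_dtype(model, get_parameter_dtype=None, default="float32"):
    """Pick a dtype: helper first, then the model attribute, then a default."""
    if get_parameter_dtype is not None:
        return get_parameter_dtype(model)
    if hasattr(model, "dtype"):
        return model.dtype
    # Nothing to inspect: fall back to the default.
    return default


def to_float_dtype(dtype):
    """Map a dtype to a short tag (fp16/fp32 branches are assumptions)."""
    name = str(dtype)
    # "bfloat16" contains "float16", so the bf16 test has to come first.
    if "bfloat16" in name:
        return "bf16"
    if "float16" in name:
        return "fp16"
    return "fp32"


class WithDtype:
    dtype = "torch.bfloat16"


class Bare:
    pass


print(to_float_dtype(resolve_dtype(WithDtype())))
print(to_float_dtype(resolve_dtype(Bare())))
```

The third branch matters for objects that are neither `torch.nn.Module` instances nor expose a `dtype` attribute. The one-line conditional this hunk replaces would raise `AttributeError` on such objects.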
@@ -1057,7 +1073,10 @@ def onnx_export_from_model(
# TODO: support onnx_config.py in the model repo
if custom_architecture and custom_onnx_configs is None:
raise ValueError(
f"Trying to export a {model_type} model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type {model_type} to be supported natively in the ONNX export."
f"Trying to export a {model_type} model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. "
f"Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models "
f"for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues "
f"if you would like the model type {model_type} to be supported natively in the ONNX export."
)

if task.startswith("text-generation") and model.config.is_encoder_decoder:
@@ -1189,7 +1208,8 @@ def onnx_export_from_model(

if float_dtype == "bf16":
logger.warning(
f"Exporting the model {model.__class__.__name__} in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail."
f"Exporting the model {model.__class__.__name__} in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession "
f"with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail."
)

_, onnx_outputs = export_models(
11 changes: 10 additions & 1 deletion optimum/exporters/onnx/model_patcher.py
@@ -1234,11 +1234,20 @@ def __init__(
def qwen3_moe_forward_patched(self, hidden_states: torch.Tensor) -> torch.Tensor:
batch_size, sequence_length, hidden_dim = hidden_states.shape
hidden_states = hidden_states.view(-1, hidden_dim)

# router_logits: (batch * sequence_length, n_experts)
router_logits = self.gate(hidden_states)

# gate returns a tuple with transformers>=5.0, this patch is no longer accurate.
# _, routing_weights, selected_experts = self.gate(hidden_states)
if isinstance(router_logits, tuple):
    router_logits = router_logits[2]

routing_weights = torch.nn.functional.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
if hasattr(self, "top_k"):
    routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
else:
    routing_weights, selected_experts = torch.topk(routing_weights, self.config.num_experts_per_tok, dim=-1)
if self.norm_topk_prob:  # only diff with mixtral sparse moe block!
    routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
# we cast back to the input dtype
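
For readers unfamiliar with the routing step being patched, here is a dependency-free sketch of softmax followed by top-k selection and optional renormalization for a single token. The function names are hypothetical and the code uses plain lists instead of tensors, so it illustrates the arithmetic only, not the patched `forward`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k, norm_topk_prob=True):
    """Return (weights, expert indices) for the top_k experts of one token."""
    probs = softmax(router_logits)
    # Indices of the top_k largest probabilities, largest first.
    selected = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in selected]
    if norm_topk_prob:
        # Renormalize so the kept weights sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights, selected


weights, experts = route_token([2.0, 0.5, 1.0, -1.0], top_k=2)
raw_weights, _ = route_token([2.0, 0.5, 1.0, -1.0], top_k=2, norm_topk_prob=False)
print(experts, weights)
```

Without renormalization the kept weights sum to less than 1, because the probability mass of the discarded experts is dropped.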
40 changes: 40 additions & 0 deletions optimum/onnx/utils.py
@@ -15,6 +15,8 @@

from pathlib import Path

from transformers import AutoFeatureExtractor, AutoProcessor, AutoTokenizer

import onnx
from onnx.external_data_helper import ExternalDataInfo, _get_initializer_tensors, uses_external_data

@@ -94,3 +96,41 @@ def has_onnx_input(model: onnx.ModelProto | Path | str, input_name: str) -> bool
model = onnx.load(model, load_external_data=False)

return any(input.name == input_name for input in model.graph.input)


def get_preprocessor(model_name: str) -> AutoTokenizer | AutoFeatureExtractor | AutoProcessor | None:
    """Gets a preprocessor (tokenizer, feature extractor or processor) that is available for `model_name`.

    Args:
        model_name (`str`): Name of the model for which a preprocessor is loaded.

    Returns:
        `Optional[Union[AutoTokenizer, AutoFeatureExtractor, AutoProcessor]]`:
            If a processor is found, it is returned. Otherwise, if a tokenizer or a feature extractor exists, it is
            returned. If both a tokenizer and a feature extractor exist, an error is raised. The function returns
            `None` if no preprocessor is found.

    From PR `transformers#41700 <https://github.com/huggingface/transformers/pull/41700>`_.
    """
    try:
        return AutoProcessor.from_pretrained(model_name)
    except (ValueError, OSError, KeyError):
        try:
            tokenizer = AutoTokenizer.from_pretrained(model_name)
        except (OSError, KeyError):
            tokenizer = None
        try:
            feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
        except (OSError, KeyError):
            feature_extractor = None

        if tokenizer is not None and feature_extractor is not None:
            raise ValueError(
                f"Couldn't auto-detect preprocessor for {model_name}. Found both a tokenizer and a feature extractor."
            )
        elif tokenizer is None and feature_extractor is None:
            return None
        elif tokenizer is not None:
            return tokenizer
        else:
            return feature_extractor
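
The fallback order in `get_preprocessor` can be exercised offline by injecting the loaders. In this sketch `pick_preprocessor`, `missing` and the `fake_*` loaders are hypothetical stand-ins for the `Auto*.from_pretrained` calls. It mirrors the control flow above but is not part of the PR.

```python
def pick_preprocessor(model_name, load_processor, load_tokenizer, load_feature_extractor):
    """Processor first; otherwise exactly one of tokenizer / feature extractor."""
    try:
        return load_processor(model_name)
    except (ValueError, OSError, KeyError):
        pass

    def try_load(loader):
        # A missing component is reported as None instead of an exception.
        try:
            return loader(model_name)
        except (OSError, KeyError):
            return None

    tokenizer = try_load(load_tokenizer)
    feature_extractor = try_load(load_feature_extractor)
    if tokenizer is not None and feature_extractor is not None:
        raise ValueError(f"Found both a tokenizer and a feature extractor for {model_name}.")
    return tokenizer if tokenizer is not None else feature_extractor


def missing(name):
    """Stand-in for a loader that finds nothing on the Hub."""
    raise OSError(f"nothing found for {name}")


def fake_tokenizer(name):
    return f"tokenizer:{name}"


def fake_extractor(name):
    return f"extractor:{name}"


print(pick_preprocessor("demo", missing, fake_tokenizer, missing))
print(pick_preprocessor("demo", missing, missing, missing))
```

Passing the loaders as arguments keeps the decision logic testable without network access.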