8 changes: 6 additions & 2 deletions .github/workflows/test_onnxruntime.yml
@@ -24,9 +24,11 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.11]
python-version: [3.12]
runs-on: [ubuntu-24.04]
transformers_version: [latest, 4.36.*, 4.45.*, 4.56.*]
# transformers_version: [latest, 4.36.*, 4.45.*, 4.56.*, 5.2.*]
# temporarily disabled to prioritize fixing 5.2
transformers_version: [5.2.*]
test_file:
[
test_decoder.py,
@@ -73,6 +75,8 @@ jobs:
uv pip install "transformers==4.45.*"
elif [ "${{ matrix.transformers_version }}" == '4.56.*' ]; then
uv pip install "transformers==4.56.*"
elif [ "${{ matrix.transformers_version }}" == '5.2.*' ]; then
uv pip install "transformers==5.2.*"
elif [ "${{ matrix.transformers_version }}" != 'latest' ]; then
uv pip install "transformers==${{ matrix.transformers_version }}"
fi
45 changes: 38 additions & 7 deletions docs/source/onnx/usage_guides/export_a_model.mdx
@@ -70,13 +70,17 @@ The Optimum ONNX export can be used through Optimum command-line:
```bash
optimum-cli export onnx --help

usage: optimum-cli export onnx [-h] -m MODEL [--task TASK] [--opset OPSET] [--device DEVICE] [--dtype {fp32,fp16,bf16}] [--optimize {O1,O2,O3,O4}] [--monolith]
[--no-post-process] [--variant VARIANT] [--framework {pt}] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
[--pad_token_id PAD_TOKEN_ID] [--library-name {transformers,diffusers,timm,sentence_transformers}] [--model-kwargs MODEL_KWARGS]
[--no-dynamic-axes] [--no-constant-folding] [--slim] [--dynamo] [--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
[--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT] [--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE]
[--nb_max_frames NB_MAX_FRAMES] [--audio_sequence_length AUDIO_SEQUENCE_LENGTH] [--point_batch_size POINT_BATCH_SIZE]
[--nb_points_per_image NB_POINTS_PER_IMAGE] [--visual_seq_length VISUAL_SEQ_LENGTH]
usage: optimum-cli export onnx [-h] -m MODEL [--task TASK] [--opset OPSET] [--device DEVICE] [--dtype {fp32,fp16,bf16}]
[--optimize {O1,O2,O3,O4}] [--monolith] [--no-post-process] [--variant VARIANT]
[--framework {pt}] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
[--pad_token_id PAD_TOKEN_ID] [--library-name {transformers,diffusers,timm,sentence_transformers}]
[--model-kwargs MODEL_KWARGS] [--no-dynamic-axes] [--no-constant-folding] [--slim]
[--dynamo] [--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
[--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT]
[--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE]
[--nb_max_frames NB_MAX_FRAMES] [--audio_sequence_length AUDIO_SEQUENCE_LENGTH]
[--point_batch_size POINT_BATCH_SIZE] [--nb_points_per_image NB_POINTS_PER_IMAGE]
[--visual_seq_length VISUAL_SEQ_LENGTH]
output

options:
Expand Down Expand Up @@ -298,6 +302,33 @@ for any opset >= 18. It is based on [torch.export.export](https://docs.pytorch.o
The following page [ONNX Operators](https://onnx.ai/onnx/operators/index.html)
highlights which operators are available in every opset.

A simple example:

```bash
optimum-cli export onnx -m arnir0/Tiny-LLM --dynamo --dtype fp16 --opset 24 tiny-llm
```

```
Loading weights: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 2084.21it/s, Materializing param=model.norm.weight]
[torch.onnx] Obtain model graph for `LlamaForCausalLM([...]` with `torch.export.export(..., strict=False)`...
[torch.onnx] Obtain model graph for `LlamaForCausalLM([...]` with `torch.export.export(..., strict=False)`... ✅
[torch.onnx] Run decompositions...
[torch.onnx] Run decompositions... ✅
[torch.onnx] Translate the graph into ONNX...
[torch.onnx] Translate the graph into ONNX... ✅
[torch.onnx] Optimize the ONNX graph...
Applied 34 of general pattern rewrite rules.
[torch.onnx] Optimize the ONNX graph... ✅
-[x] values not close enough, max diff: 0.02734375 (atol: 1e-05)
-[x] values not close enough, max diff: 0.0078125 (atol: 1e-05)
-[x] values not close enough, max diff: 0.00048828125 (atol: 1e-05)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 0.02734375
- present.0.key: max diff = 0.0078125
- present.0.value: max diff = 0.00048828125.
The exported model was saved at: cli-tiny-llm.onnx
```
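
The warning in the log above compares the largest element-wise absolute difference between the reference and exported outputs against `atol`. The following is a minimal sketch of that kind of check, assuming NumPy arrays with made-up values; the helper `max_abs_diff` is hypothetical and is not Optimum's validation code. With `--dtype fp16`, differences well above `1e-05` are plausible because of the reduced precision of half floats.

```python
import numpy as np


def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Return the largest element-wise absolute difference (hypothetical helper)."""
    # Compare in float64 so the subtraction itself adds no rounding error.
    return float(np.max(np.abs(reference.astype(np.float64) - candidate.astype(np.float64))))


atol = 1e-5  # same tolerance as reported in the log above

# Made-up values standing in for a reference output and an exported-model output.
reference = np.array([1.0, 2.0, 3.0], dtype=np.float32)
candidate = np.array([1.0, 2.0078125, 3.0], dtype=np.float32)

diff = max_abs_diff(reference, candidate)
within_tolerance = diff <= atol

if not within_tolerance:
    print(f"values not close enough, max diff: {diff} (atol: {atol})")
```

If the discrepancy is expected for a lower-precision export, the `--atol` option listed in the usage above can be used to set a looser tolerance.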

## Custom export of Transformers models

### Customize the export of official Transformers models
26 changes: 13 additions & 13 deletions docs/source/onnxruntime/usage_guides/gpu.mdx
@@ -196,21 +196,21 @@ And here is a summary for the saving time with different sequence lengths (32 /
Environment:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 28C P8 8W / 70W | 0MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-------------------------------------------------------------------------------+
| NVIDIA-SMI 580.102.01 | Driver Version: 581.57 | CUDA Version: 13.0 |
|-------------------------------+------------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+========================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 28C P8 8W / 70W | 0MiB / 15109MiB | 0% Default |
+-------------------------------+------------------------+----------------------+

- Platform: Linux-5.4.0-1089-aws-x86_64-with-glibc2.29
- Python version: 3.8.10
- `transformers` version: 4.24.0
- `optimum` version: 1.5.0
- PyTorch version: 1.12.0+cu113
- Python version: 3.12.3
- `transformers` version: 5.2.0
- `optimum` version: 2.1.0
- PyTorch version: 2.10.0+cu130
```

Note that previous experiments are run with __vanilla ONNX__ models exported directly from the exporter. If you are interested in __further acceleration__, with `ORTOptimizer` you can optimize the graph and convert your model to FP16 if you have a GPU with mixed precision capabilities.
36 changes: 28 additions & 8 deletions optimum/exporters/onnx/convert.py
@@ -414,7 +414,10 @@ def _run_validation(
atol_msg = f"The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance {atol}:\n{msg}"

if isinstance(config, SpeechT5OnnxConfig):
atol_msg += "\nIMPORTANT NOTE: SpeechT5 uses a dropout at inference and the output validation of ONNX Runtime inference vs PyTorch is expected to fail. Reference: https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/models/speecht5/modeling_speecht5.py#L727"
atol_msg += (
"\nIMPORTANT NOTE: SpeechT5 uses a dropout at inference and the output validation of ONNX Runtime inference vs PyTorch "
"is expected to fail. Reference: https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/models/speecht5/modeling_speecht5.py#L727"
)
raise AtolError(atol_msg)


@@ -941,7 +944,9 @@ def onnx_export_from_model(
dynamo: bool = False,
**kwargs_shapes,
):
"""Full-suite ONNX export function, exporting **from a pre-loaded PyTorch model**. This function is especially useful in case one needs to do modifications on the model, as overriding a forward call, before exporting to ONNX.
"""Full-suite ONNX export function, exporting **from a pre-loaded PyTorch model**.
This function is especially useful in case one needs to do modifications on the model,
such as overriding a forward call, before exporting to ONNX.

Args:
> Required parameters
@@ -978,10 +983,13 @@ def onnx_export_from_model(
in case, for example, the model inputs/outputs are changed (for example, if
`model_kwargs={"output_attentions": True}` is passed).
custom_onnx_configs (`Optional[Dict[str, OnnxConfig]]`, defaults to `None`):
Experimental usage: override the default ONNX config used for the given model. This argument may be useful for advanced users that desire a finer-grained control on the export. An example is available [here](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model).
Experimental usage: override the default ONNX config used for the given model.
This argument may be useful for advanced users that desire a finer-grained control on the export.
An example is available [here](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model).
fn_get_submodels (`Optional[Callable]`, defaults to `None`):
Experimental usage: Override the default submodels that are used at the export. This is
especially useful when exporting a custom architecture that needs to split the ONNX (e.g. encoder-decoder). If unspecified with custom models, optimum will try to use the default submodels used for the given task, with no guarantee of success.
especially useful when exporting a custom architecture that needs to split the ONNX (e.g. encoder-decoder).
If unspecified with custom models, optimum will try to use the default submodels used for the given task, with no guarantee of success.
use_subprocess (`bool`, defaults to `False`):
Do the ONNX exported model validation in subprocesses. This is especially useful when
exporting on CUDA device, where ORT does not release memory at inference session
@@ -1030,7 +1038,9 @@ def onnx_export_from_model(
task = TasksManager._infer_task_from_model_or_model_class(model=model)
except (ValueError, KeyError) as e:
raise RuntimeError(
f"The model task could not be automatically inferred in `onnx_export_from_model`. Please provide the argument `task` with the relevant task from {', '.join(TasksManager.get_all_tasks())}. Detailed error: {e}"
f"The model task could not be automatically inferred in `onnx_export_from_model`. "
f"Please provide the argument `task` with the relevant task from "
f"{', '.join(TasksManager.get_all_tasks())}. Detailed error: {e}"
)

if (
@@ -1045,7 +1055,13 @@ def onnx_export_from_model(

logger.info(f"Automatic task detection to: {task}.")

dtype = get_parameter_dtype(model) if isinstance(model, torch.nn.Module) and get_parameter_dtype else model.dtype
if isinstance(model, torch.nn.Module) and get_parameter_dtype:
dtype = get_parameter_dtype(model)
elif hasattr(model, "dtype"):
dtype = model.dtype
else:
# Let's pick the default.
dtype = torch.float32

if "bfloat16" in str(dtype):
float_dtype = "bf16"
@@ -1057,7 +1073,10 @@ def onnx_export_from_model(
# TODO: support onnx_config.py in the model repo
if custom_architecture and custom_onnx_configs is None:
raise ValueError(
f"Trying to export a {model_type} model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type {model_type} to be supported natively in the ONNX export."
f"Trying to export a {model_type} model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. "
f"Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models "
f"for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues "
f"if you would like the model type {model_type} to be supported natively in the ONNX export."
)

if task.startswith("text-generation") and model.config.is_encoder_decoder:
@@ -1189,7 +1208,8 @@ def onnx_export_from_model(

if float_dtype == "bf16":
logger.warning(
f"Exporting the model {model.__class__.__name__} in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail."
f"Exporting the model {model.__class__.__name__} in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession "
f"with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail."
)

_, onnx_outputs = export_models(
11 changes: 10 additions & 1 deletion optimum/exporters/onnx/model_patcher.py
@@ -1234,11 +1234,20 @@ def __init__(
def qwen3_moe_forward_patched(self, hidden_states: torch.Tensor) -> torch.Tensor:
batch_size, sequence_length, hidden_dim = hidden_states.shape
hidden_states = hidden_states.view(-1, hidden_dim)

# router_logits: (batch * sequence_length, n_experts)
router_logits = self.gate(hidden_states)

# gate returns a tuple with transformers>=5.0, this patch is no longer accurate.
# _, routing_weights, selected_experts = self.gate(hidden_states)
if isinstance(router_logits, tuple):
router_logits = router_logits[2]

routing_weights = torch.nn.functional.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
if hasattr(self, "top_k"):
routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
else:
routing_weights, selected_experts = torch.topk(routing_weights, self.config.num_experts_per_tok, dim=-1)
if self.norm_topk_prob: # only diff with mixtral sparse moe block!
routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
# we cast back to the input dtype
40 changes: 40 additions & 0 deletions optimum/onnx/utils.py
@@ -15,6 +15,8 @@

from pathlib import Path

from transformers import AutoFeatureExtractor, AutoProcessor, AutoTokenizer

import onnx
from onnx.external_data_helper import ExternalDataInfo, _get_initializer_tensors, uses_external_data

@@ -94,3 +96,41 @@ def has_onnx_input(model: onnx.ModelProto | Path | str, input_name: str) -> bool
model = onnx.load(model, load_external_data=False)

return any(input.name == input_name for input in model.graph.input)


def get_preprocessor(model_name: str) -> AutoTokenizer | AutoFeatureExtractor | AutoProcessor | None:
"""Gets a preprocessor (tokenizer, feature extractor or processor) that is available for `model_name`.

Args:
model_name (`str`): Name of the model for which a preprocessor is loaded.

Returns:
`Optional[Union[AutoTokenizer, AutoFeatureExtractor, AutoProcessor]]`:
If a processor is found, it is returned. Otherwise, if a tokenizer or a feature extractor exists, it is
returned. If both a tokenizer and a feature extractor exist, an error is raised. The function returns
`None` if no preprocessor is found.

From PR `transformers#41700 <https://github.com/huggingface/transformers/pull/41700>`_.
"""
try:
return AutoProcessor.from_pretrained(model_name)
except (ValueError, OSError, KeyError):
try:
tokenizer = AutoTokenizer.from_pretrained(model_name)
except (OSError, KeyError):
tokenizer = None
try:
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
except (OSError, KeyError):
feature_extractor = None

if tokenizer is not None and feature_extractor is not None:
raise ValueError(
f"Couldn't auto-detect preprocessor for {model_name}. Found both a tokenizer and a feature extractor."
)
elif tokenizer is None and feature_extractor is None:
return None
elif tokenizer is not None:
return tokenizer
else:
return feature_extractor
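
The fallback order used by `get_preprocessor` (processor first, then tokenizer or feature extractor, with an error when both exist) can be exercised without network access. The sketch below is only an illustration of that control flow: `pick_preprocessor`, `try_load`, and the stand-in loaders are hypothetical names, and plain callables replace the `from_pretrained` calls.

```python
from typing import Callable, Optional


def pick_preprocessor(
    load_processor: Callable[[], object],
    load_tokenizer: Callable[[], object],
    load_feature_extractor: Callable[[], object],
) -> Optional[object]:
    """Hypothetical mirror of the fallback order, with injectable loaders."""
    # 1. A processor, when it loads, wins outright.
    try:
        return load_processor()
    except (ValueError, OSError, KeyError):
        pass

    def try_load(loader: Callable[[], object]) -> Optional[object]:
        # A loader that raises OSError or KeyError counts as "not available".
        try:
            return loader()
        except (OSError, KeyError):
            return None

    # 2. Otherwise try the tokenizer and the feature extractor independently.
    tokenizer = try_load(load_tokenizer)
    feature_extractor = try_load(load_feature_extractor)

    # 3. Finding both is ambiguous; finding neither yields None.
    if tokenizer is not None and feature_extractor is not None:
        raise ValueError("Found both a tokenizer and a feature extractor.")
    return tokenizer if tokenizer is not None else feature_extractor


def missing() -> object:
    """Stand-in loader that behaves like a missing preprocessor file."""
    raise OSError("not available")


# Only a tokenizer is available, so it is returned.
result = pick_preprocessor(missing, lambda: "tokenizer", missing)
```

Passing the loaders as arguments keeps the sketch testable offline; the function in the diff calls the `Auto*` classes directly instead.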