huggingface · xadupre · Feb 19, 2026
diff --git a/.github/workflows/test_onnxruntime.yml b/.github/workflows/test_onnxruntime.yml
@@ -24,9 +24,11 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.11]
+        python-version: [3.12]
         runs-on: [ubuntu-24.04]
-        transformers_version: [latest, 4.36.*, 4.45.*, 4.56.*]
+        # transformers_version: [latest, 4.36.*, 4.45.*, 4.56.*, 5.2.*]
+        # temporarily disabled to prioritize fixing 5.2
+        transformers_version: [5.2.*]
         test_file:
           [
             test_decoder.py,
@@ -73,6 +75,8 @@ jobs:
             uv pip install "transformers==4.45.*"
           elif [ "${{ matrix.transformers_version }}" == '4.56.*' ]; then
             uv pip install "transformers==4.56.*"
+          elif [ "${{ matrix.transformers_version }}" == '5.2.*' ]; then
+            uv pip install "transformers==5.2.*"
           elif [ "${{ matrix.transformers_version }}" != 'latest' ]; then
             uv pip install "transformers==${{ matrix.transformers_version }}"            
           fi

diff --git a/docs/source/onnx/usage_guides/export_a_model.mdx b/docs/source/onnx/usage_guides/export_a_model.mdx
@@ -70,13 +70,17 @@ The Optimum ONNX export can be used through Optimum command-line:
 ```bash
 optimum-cli export onnx --help
 
-usage: optimum-cli export onnx [-h] -m MODEL [--task TASK] [--opset OPSET] [--device DEVICE] [--dtype {fp32,fp16,bf16}] [--optimize {O1,O2,O3,O4}] [--monolith]
-                               [--no-post-process] [--variant VARIANT] [--framework {pt}] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
-                               [--pad_token_id PAD_TOKEN_ID] [--library-name {transformers,diffusers,timm,sentence_transformers}] [--model-kwargs MODEL_KWARGS]
-                               [--no-dynamic-axes] [--no-constant-folding] [--slim] [--dynamo] [--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
-                               [--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT] [--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE]
-                               [--nb_max_frames NB_MAX_FRAMES] [--audio_sequence_length AUDIO_SEQUENCE_LENGTH] [--point_batch_size POINT_BATCH_SIZE]
-                               [--nb_points_per_image NB_POINTS_PER_IMAGE] [--visual_seq_length VISUAL_SEQ_LENGTH]
+usage: optimum-cli export onnx [-h] -m MODEL [--task TASK] [--opset OPSET] [--device DEVICE] [--dtype {fp32,fp16,bf16}]
+                               [--optimize {O1,O2,O3,O4}] [--monolith] [--no-post-process] [--variant VARIANT]
+                               [--framework {pt}] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
+                               [--pad_token_id PAD_TOKEN_ID] [--library-name {transformers,diffusers,timm,sentence_transformers}]
+                               [--model-kwargs MODEL_KWARGS] [--no-dynamic-axes] [--no-constant-folding] [--slim]
+                               [--dynamo] [--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
+                               [--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT]
+                               [--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE]
+                               [--nb_max_frames NB_MAX_FRAMES] [--audio_sequence_length AUDIO_SEQUENCE_LENGTH]
+                               [--point_batch_size POINT_BATCH_SIZE] [--nb_points_per_image NB_POINTS_PER_IMAGE]
+                               [--visual_seq_length VISUAL_SEQ_LENGTH]
                                output
 
 options:
@@ -298,6 +302,33 @@ for any opset >= 18. It is based on [torch.export.export](https://docs.pytorch.o
 The following page [ONNX Operators](https://onnx.ai/onnx/operators/index.html)
 highlights which operators are available in every opset.
 
+A simple example:
+
+```bash
+optimum-cli export onnx -m arnir0/Tiny-LLM --dynamo --dtype fp16 --opset 24 tiny-llm
+```
+
+```
+Loading weights: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 2084.21it/s, Materializing param=model.norm.weight]
+[torch.onnx] Obtain model graph for `LlamaForCausalLM([...]` with `torch.export.export(..., strict=False)`...
+[torch.onnx] Obtain model graph for `LlamaForCausalLM([...]` with `torch.export.export(..., strict=False)`... ✅
+[torch.onnx] Run decompositions...
+[torch.onnx] Run decompositions... ✅
+[torch.onnx] Translate the graph into ONNX...
+[torch.onnx] Translate the graph into ONNX... ✅
+[torch.onnx] Optimize the ONNX graph...
+Applied 34 of general pattern rewrite rules.
+[torch.onnx] Optimize the ONNX graph... ✅
+                -[x] values not close enough, max diff: 0.02734375 (atol: 1e-05)
+                -[x] values not close enough, max diff: 0.0078125 (atol: 1e-05)
+                -[x] values not close enough, max diff: 0.00048828125 (atol: 1e-05)
+The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
+- logits: max diff = 0.02734375
+- present.0.key: max diff = 0.0078125
+- present.0.value: max diff = 0.00048828125.
+ The exported model was saved at: cli-tiny-llm.onnx
+```
+
 ## Custom export of Transformers models
 
 ### Customize the export of official Transformers models
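
The warning in the sample log above is the result of comparing the reference model's outputs with the exported model's outputs against an absolute tolerance. The sketch below illustrates that kind of check in plain Python. `max_abs_diff` and `check_outputs` are hypothetical helper names, not part of Optimum, and the numbers are made up for illustration.

```python
def max_abs_diff(reference, exported):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(reference) != len(exported):
        raise ValueError("outputs must have the same length")
    return max(abs(r - e) for r, e in zip(reference, exported))


def check_outputs(reference_outputs, exported_outputs, atol=1e-5):
    """Return {output_name: max_diff} for every output whose difference exceeds atol."""
    failures = {}
    for name, reference in reference_outputs.items():
        diff = max_abs_diff(reference, exported_outputs[name])
        if diff > atol:
            failures[name] = diff
    return failures


# Illustrative values only (not taken from a real export).
reference = {"logits": [0.5, 1.0, -2.0], "present.0.key": [0.25, 0.75]}
exported = {"logits": [0.5, 1.03125, -2.0], "present.0.key": [0.25, 0.75]}

failures = check_outputs(reference, exported)
print(failures)
```

With `--dtype fp16`, differences far larger than `1e-05` are plausible because half precision carries roughly three significant decimal digits, which is consistent with the export succeeding with a warning rather than failing outright.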

diff --git a/docs/source/onnxruntime/usage_guides/gpu.mdx b/docs/source/onnxruntime/usage_guides/gpu.mdx
@@ -196,21 +196,21 @@ And here is a summary for the saving time with different sequence lengths (32 /
 Environment:
 
 ```
-+-----------------------------------------------------------------------------+
-| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.3     |
-|-------------------------------+----------------------+----------------------+
-| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
-| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
-|===============================+======================+======================|
-|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
-| N/A   28C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
-+-------------------------------+----------------------+----------------------+
++-------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.102.01         | Driver Version: 581.57 | CUDA Version: 13.0   |
+|-------------------------------+------------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id          Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|           Memory-Usage | GPU-Util  Compute M. |
+|===============================+========================+======================|
+|   0  Tesla T4            On   |   00000000:00:1E.0 Off |                    0 |
+| N/A   28C    P8     8W /  70W |        0MiB / 15109MiB |      0%      Default |
++-------------------------------+------------------------+----------------------+
 
 - Platform: Linux-5.4.0-1089-aws-x86_64-with-glibc2.29
-- Python version: 3.8.10
-- `transformers` version: 4.24.0
-- `optimum` version: 1.5.0
-- PyTorch version: 1.12.0+cu113
+- Python version: 3.12.3
+- `transformers` version: 5.2.0
+- `optimum` version: 2.1.0
+- PyTorch version: 2.10.0+cu130
 ```
 
 Note that previous experiments are run with __vanilla ONNX__ models exported directly from the exporter. If you are interested in __further acceleration__, with `ORTOptimizer` you can optimize the graph and convert your model to FP16 if you have a GPU with mixed precision capabilities.

diff --git a/optimum/exporters/onnx/convert.py b/optimum/exporters/onnx/convert.py
@@ -414,7 +414,10 @@ def _run_validation(
         atol_msg = f"The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance {atol}:\n{msg}"
 
         if isinstance(config, SpeechT5OnnxConfig):
-            atol_msg += "\nIMPORTANT NOTE: SpeechT5 uses a dropout at inference and the output validation of ONNX Runtime inference vs PyTorch is expected to fail. Reference: https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/models/speecht5/modeling_speecht5.py#L727"
+            atol_msg += (
+                "\nIMPORTANT NOTE: SpeechT5 uses a dropout at inference and the output validation of ONNX Runtime inference vs PyTorch "
+                "is expected to fail. Reference: https://github.com/huggingface/transformers/blob/v4.33.2/src/transformers/models/speecht5/modeling_speecht5.py#L727"
+            )
         raise AtolError(atol_msg)
 
 
@@ -941,7 +944,9 @@ def onnx_export_from_model(
     dynamo: bool = False,
     **kwargs_shapes,
 ):
-    """Full-suite ONNX export function, exporting **from a pre-loaded PyTorch model**. This function is especially useful in case one needs to do modifications on the model, as overriding a forward call, before exporting to ONNX.
+    """Full-suite ONNX export function, exporting **from a pre-loaded PyTorch model**.
+    This function is especially useful when the model needs to be modified,
+    for example by overriding a forward call, before exporting to ONNX.
 
     Args:
         > Required parameters
@@ -978,10 +983,13 @@ def onnx_export_from_model(
             in case, for example, the model inputs/outputs are changed (for example, if
             `model_kwargs={"output_attentions": True}` is passed).
         custom_onnx_configs (`Optional[Dict[str, OnnxConfig]]`, defaults to `None`):
-            Experimental usage: override the default ONNX config used for the given model. This argument may be useful for advanced users that desire a finer-grained control on the export. An example is available [here](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model).
+            Experimental usage: override the default ONNX config used for the given model.
+            This argument may be useful for advanced users who want finer-grained control over the export.
+            An example is available [here](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model).
         fn_get_submodels (`Optional[Callable]`, defaults to `None`):
             Experimental usage: Override the default submodels that are used at the export. This is
-            especially useful when exporting a custom architecture that needs to split the ONNX (e.g. encoder-decoder). If unspecified with custom models, optimum will try to use the default submodels used for the given task, with no guarantee of success.
+            especially useful when exporting a custom architecture that needs to split the ONNX (e.g. encoder-decoder).
+            If unspecified with custom models, optimum will try to use the default submodels used for the given task, with no guarantee of success.
         use_subprocess (`bool`, defaults to `False`):
             Do the ONNX exported model validation in subprocesses. This is especially useful when
             exporting on CUDA device, where ORT does not release memory at inference session
@@ -1030,7 +1038,9 @@ def onnx_export_from_model(
             task = TasksManager._infer_task_from_model_or_model_class(model=model)
         except (ValueError, KeyError) as e:
             raise RuntimeError(
-                f"The model task could not be automatically inferred in `onnx_export_from_model`. Please provide the argument `task` with the relevant task from {', '.join(TasksManager.get_all_tasks())}. Detailed error: {e}"
+                f"The model task could not be automatically inferred in `onnx_export_from_model`. "
+                f"Please provide the argument `task` with the relevant task from "
+                f"{', '.join(TasksManager.get_all_tasks())}. Detailed error: {e}"
             )
 
         if (
@@ -1045,7 +1055,13 @@ def onnx_export_from_model(
 
         logger.info(f"Automatic task detection to: {task}.")
 
-    dtype = get_parameter_dtype(model) if isinstance(model, torch.nn.Module) and get_parameter_dtype else model.dtype
+    if isinstance(model, torch.nn.Module) and get_parameter_dtype:
+        dtype = get_parameter_dtype(model)
+    elif hasattr(model, "dtype"):
+        dtype = model.dtype
+    else:
+        # Fall back to float32 as the default dtype.
+        dtype = torch.float32
 
     if "bfloat16" in str(dtype):
         float_dtype = "bf16"
@@ -1057,7 +1073,10 @@ def onnx_export_from_model(
     # TODO: support onnx_config.py in the model repo
     if custom_architecture and custom_onnx_configs is None:
         raise ValueError(
-            f"Trying to export a {model_type} model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type {model_type} to be supported natively in the ONNX export."
+            f"Trying to export a {model_type} model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. "
+            f"Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models "
+            f"for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues "
+            f"if you would like the model type {model_type} to be supported natively in the ONNX export."
         )
 
     if task.startswith("text-generation") and model.config.is_encoder_decoder:
@@ -1189,7 +1208,8 @@ def onnx_export_from_model(
 
     if float_dtype == "bf16":
         logger.warning(
-            f"Exporting the model {model.__class__.__name__} in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail."
+            f"Exporting the model {model.__class__.__name__} in bfloat16 float dtype. After the export, ONNX Runtime InferenceSession "
+            f"with CPU/CUDA execution provider likely does not implement all operators for the bfloat16 data type, and the loading is likely to fail."
         )
 
     _, onnx_outputs = export_models(
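
The dtype change in this file replaces a one-line conditional with an explicit fallback chain. The sketch below shows the same idea with strings standing in for torch dtypes, so it runs without PyTorch. `resolve_dtype` and `float_dtype_label` are hypothetical names; only the `bf16` branch is visible in the diff, so the `fp16` and `fp32` labels are an assumption based on the `--dtype {fp32,fp16,bf16}` choices of the CLI.

```python
class ModelWithDtype:
    """Stand-in for a model exposing a `dtype` attribute."""
    dtype = "torch.bfloat16"


class ModelWithoutDtype:
    """Stand-in for a model that has no `dtype` attribute at all."""


def resolve_dtype(model, default="torch.float32"):
    # Mirrors the `hasattr(model, "dtype")` branch with a float32 fallback.
    return getattr(model, "dtype", default)


def float_dtype_label(dtype):
    text = str(dtype)
    # "bfloat16" contains "float16", so the bf16 test has to come first.
    if "bfloat16" in text:
        return "bf16"
    if "float16" in text:
        return "fp16"
    return "fp32"  # assumed default label


print(float_dtype_label(resolve_dtype(ModelWithDtype())))
print(float_dtype_label(resolve_dtype(ModelWithoutDtype())))
```

The previous one-liner would raise `AttributeError` for an object without `dtype`; the fallback trades that error for a silent float32 assumption.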

diff --git a/optimum/exporters/onnx/model_patcher.py b/optimum/exporters/onnx/model_patcher.py
@@ -1234,11 +1234,20 @@ def __init__(
 def qwen3_moe_forward_patched(self, hidden_states: torch.Tensor) -> torch.Tensor:
     batch_size, sequence_length, hidden_dim = hidden_states.shape
     hidden_states = hidden_states.view(-1, hidden_dim)
+
     # router_logits: (batch * sequence_length, n_experts)
     router_logits = self.gate(hidden_states)
 
+    # With transformers>=5.0 the gate returns a tuple, so this patch is no longer accurate.
+    # _, routing_weights, selected_experts = self.gate(hidden_states)
+    if isinstance(router_logits, tuple):
+        router_logits = router_logits[2]
+
     routing_weights = torch.nn.functional.softmax(router_logits, dim=1, dtype=torch.float)
-    routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
+    if hasattr(self, "top_k"):
+        routing_weights, selected_experts = torch.topk(routing_weights, self.top_k, dim=-1)
+    else:
+        routing_weights, selected_experts = torch.topk(routing_weights, self.config.num_experts_per_tok, dim=-1)
     if self.norm_topk_prob:  # only diff with mixtral sparse moe block!
         routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
     # we cast back to the input dtype
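
For readers unfamiliar with the routing step being patched, here is a minimal pure-Python sketch of softmax, top-k selection and renormalization for a single token, plus the `top_k` lookup fallback introduced in this hunk. `route_token` and `resolve_top_k` are hypothetical names, and `SimpleNamespace` objects stand in for the MoE block and its config; this is not the Transformers implementation.

```python
import math
from types import SimpleNamespace


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def resolve_top_k(block):
    # Same fallback as in the patch: prefer `top_k`, else read it from the config.
    if hasattr(block, "top_k"):
        return block.top_k
    return block.config.num_experts_per_tok


def route_token(router_logits, top_k, norm_topk_prob=True):
    """Return (weights, expert_indices) for one token."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = ranked[:top_k]
    weights = [probs[i] for i in selected]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]  # selected weights sum to 1
    return weights, selected


block = SimpleNamespace(config=SimpleNamespace(num_experts_per_tok=2))
weights, experts = route_token([2.0, 0.5, 1.0, -1.0], resolve_top_k(block))
print(experts, weights)
```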

diff --git a/optimum/onnx/utils.py b/optimum/onnx/utils.py
@@ -15,6 +15,8 @@
 
 from pathlib import Path
 
+from transformers import AutoFeatureExtractor, AutoProcessor, AutoTokenizer
+
 import onnx
 from onnx.external_data_helper import ExternalDataInfo, _get_initializer_tensors, uses_external_data
 
@@ -94,3 +96,41 @@ def has_onnx_input(model: onnx.ModelProto | Path | str, input_name: str) -> bool
         model = onnx.load(model, load_external_data=False)
 
     return any(input.name == input_name for input in model.graph.input)
+
+
+def get_preprocessor(model_name: str) -> AutoTokenizer | AutoFeatureExtractor | AutoProcessor | None:
+    """Gets a preprocessor (tokenizer, feature extractor or processor) that is available for `model_name`.
+
+    Args:
+        model_name (`str`): Name of the model for which a preprocessor is loaded.
+
+    Returns:
+        `Optional[Union[AutoTokenizer, AutoFeatureExtractor, AutoProcessor]]`:
+            If a processor is found, it is returned. Otherwise, if a tokenizer or a feature extractor exists, it is
+            returned. If both a tokenizer and a feature extractor exist, an error is raised. The function returns
+            `None` if no preprocessor is found.
+
+    From PR `transformers#41700 <https://github.com/huggingface/transformers/pull/41700>`_.
+    """
+    try:
+        return AutoProcessor.from_pretrained(model_name)
+    except (ValueError, OSError, KeyError):
+        try:
+            tokenizer = AutoTokenizer.from_pretrained(model_name)
+        except (OSError, KeyError):
+            tokenizer = None
+        try:
+            feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
+        except (OSError, KeyError):
+            feature_extractor = None
+
+        if tokenizer is not None and feature_extractor is not None:
+            raise ValueError(
+                f"Couldn't auto-detect preprocessor for {model_name}. Found both a tokenizer and a feature extractor."
+            )
+        elif tokenizer is None and feature_extractor is None:
+            return None
+        elif tokenizer is not None:
+            return tokenizer
+        else:
+            return feature_extractor
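
The selection order in `get_preprocessor` can be exercised without downloading anything by passing the loaders in as callables. The sketch below is an illustration under that assumption: `pick_preprocessor` and the stub loaders are hypothetical and do not exist in Optimum or Transformers. One small difference from the diff: here a processor loader that returns `None` falls through to the tokenizer and feature extractor, whereas the original returns whatever `AutoProcessor.from_pretrained` yields.

```python
def pick_preprocessor(load_processor, load_tokenizer, load_feature_extractor):
    """Apply the fallback order: processor, then tokenizer or feature extractor."""

    def try_load(loader, errors):
        try:
            return loader()
        except errors:
            return None

    processor = try_load(load_processor, (ValueError, OSError, KeyError))
    if processor is not None:
        return processor

    tokenizer = try_load(load_tokenizer, (OSError, KeyError))
    feature_extractor = try_load(load_feature_extractor, (OSError, KeyError))

    if tokenizer is not None and feature_extractor is not None:
        raise ValueError("Found both a tokenizer and a feature extractor.")
    # At most one of the two is set here; returns None when neither is.
    return tokenizer if tokenizer is not None else feature_extractor


def missing():
    """Stub loader that behaves like a failed `from_pretrained` call."""
    raise OSError("not found")


print(pick_preprocessor(missing, lambda: "tokenizer", missing))
print(pick_preprocessor(missing, missing, missing))
```

Injecting the loaders keeps the decision logic testable offline, which may be useful if this helper gets unit tests of its own.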