Draft
128 commits
e03fdbb
add ParakeetForRNNT
eustlb Jun 1, 2026
328c94e
ParakeetForRNNT config
eustlb Jun 1, 2026
178a33b
generation mixin for RNNT
eustlb Jun 1, 2026
f406106
conversion script update
eustlb Jun 1, 2026
38d72d8
auto mappings
eustlb Jun 1, 2026
1efefcf
tests
eustlb Jun 1, 2026
e78d8f8
draft PR
eustlb Jun 1, 2026
39182c7
add chunking logic to the feature extractor
eustlb Jun 2, 2026
783f67e
on top of #46331
eustlb Jun 2, 2026
6d6dfa9
better streaming design
eustlb Jun 4, 2026
aee4bb7
RNNT as the main class
eustlb Jun 4, 2026
29f7542
rnn-t as main config
eustlb Jun 4, 2026
a90b87d
nit
eustlb Jun 4, 2026
4ac9006
wip: streaming encoder cache (pre-merge snapshot)
eustlb Jun 4, 2026
721d951
Merge remote-tracking branch 'origin/parakeet-rnnt-from-scratch' into…
eustlb Jun 4, 2026
f962d4e
prop changes to modular
eustlb Jun 4, 2026
851a5d7
nit
eustlb Jun 4, 2026
f38f751
update test with reproducers
eustlb Jun 5, 2026
a1c0785
add RNN-T loss
eustlb Jun 5, 2026
b7e43d0
fix generate
eustlb Jun 5, 2026
299f149
correct loss function usage
eustlb Jun 5, 2026
23a01a9
processing handle ctc/rnnt/tdt diffs
eustlb Jun 6, 2026
4bdf5c8
udpate doc + fix pipeline
eustlb Jun 6, 2026
9119f8c
working commit
eustlb Jun 8, 2026
d873962
cleaner
eustlb Jun 8, 2026
17356f0
nits
eustlb Jun 8, 2026
fb76350
fix
eustlb Jun 8, 2026
d0d78c2
fix
eustlb Jun 8, 2026
eb35831
nit
eustlb Jun 8, 2026
63fc55b
add nemotron asr processor
eustlb Jun 8, 2026
8908100
tmp commit
eustlb Jun 9, 2026
8c3e43a
update loss reduction to match NeMo
eustlb Jun 10, 2026
f4f2c77
udpate expected value
eustlb Jun 10, 2026
a67da59
loss reduction
eustlb Jun 10, 2026
ddcb571
loss reduction to tdt loss
eustlb Jun 10, 2026
182f4d5
conversion udpate
eustlb Jun 10, 2026
9b3ecc7
fix
eustlb Jun 10, 2026
a8e121b
update checkpoint
eustlb Jun 10, 2026
bc77d82
Merge branch 'main' into parakeet-rnnt-from-scratch
eustlb Jun 10, 2026
31a1ea7
nit
eustlb Jun 10, 2026
3ae2ee4
nit
eustlb Jun 10, 2026
9d6e6a6
fix tdt loss
eustlb Jun 10, 2026
309fe6e
fix loss
eustlb Jun 10, 2026
9965000
fix loss test
eustlb Jun 10, 2026
6d041fb
make
eustlb Jun 10, 2026
cdfb7cc
use correct checkpoint
eustlb Jun 10, 2026
bfb2eec
Merge branch 'main' into parakeet-rnnt-from-scratch
eustlb Jun 10, 2026
d5b552a
AutoModelForRNNT in auto.md
eustlb Jun 10, 2026
5d14079
Merge branch 'parakeet-rnnt-from-scratch' of github.com:huggingface/t…
eustlb Jun 10, 2026
1e39472
add reproducable tests + other small fixes
eustlb Jun 10, 2026
7632aa1
fix
eustlb Jun 10, 2026
e98c3dc
Merge branch 'main' into parakeet-rnnt-from-scratch
eustlb Jun 10, 2026
4207b26
add revision
eustlb Jun 10, 2026
b784e39
Merge branch 'parakeet-rnnt-from-scratch' of github.com:huggingface/t…
eustlb Jun 10, 2026
2443e68
Merge branch 'parakeet-rnnt-from-scratch' into nemotron-asr
eustlb Jun 10, 2026
9790901
nit
eustlb Jun 10, 2026
eb41299
add loss
eustlb Jun 10, 2026
cc5e007
fix
eustlb Jun 10, 2026
f9339d5
nit
eustlb Jun 10, 2026
8871730
better formulated forward
eustlb Jun 10, 2026
f257e52
add groups to VoxtralRealtimeCausalConv1d
eustlb Jun 10, 2026
7517d5d
updates
eustlb Jun 10, 2026
08bbd13
init commit
eustlb Jun 11, 2026
6f21c7e
clean up
eustlb Jun 11, 2026
49eaf96
add nemotron_asr to mapping for pipeline
eustlb Jun 11, 2026
806895b
update doc
eustlb Jun 11, 2026
d4daefe
use inheritance on generate
eustlb Jun 11, 2026
dfc8f2e
draft streaming
eustlb Jun 16, 2026
53e68c0
Merge remote-tracking branch 'origin/main' into nemotron-asr
eustlb Jun 16, 2026
12297f9
nits
eustlb Jun 17, 2026
4485016
fixes
eustlb Jun 17, 2026
c6e12ad
update num_lookahead_tokens API
eustlb Jun 17, 2026
5d3c299
add streaming latency
eustlb Jun 17, 2026
1d230bf
nit
eustlb Jun 17, 2026
18fab33
improve doc
eustlb Jun 17, 2026
0ac1fbf
update
eustlb Jun 18, 2026
a1eda37
doc update
eustlb Jun 18, 2026
97dffe4
cleaning generation loop
eustlb Jun 18, 2026
aead605
fix
eustlb Jun 18, 2026
57e8cea
Merge branch 'main' into nemotron-asr
eustlb Jun 18, 2026
227cbf0
use hub checkpoints
eustlb Jun 18, 2026
c7e4df0
udpate tests
eustlb Jun 18, 2026
03ebf54
NemotronAsrConfig update
eustlb Jun 18, 2026
cecd05b
Merge branch 'nemotron-asr' of github.com:huggingface/transformers in…
eustlb Jun 18, 2026
3a063fc
update doc
eustlb Jun 18, 2026
486494c
NemotronAsr -> NemotronAsrStreaming
eustlb Jun 18, 2026
bdd2e69
update doc
eustlb Jun 18, 2026
1fb78a4
update license
eustlb Jun 18, 2026
50c2293
nit
eustlb Jun 18, 2026
2ca01f7
test update
eustlb Jun 18, 2026
08f2d22
Merge branch 'main' into nemotron-asr
eustlb Jun 18, 2026
cd6f28a
Merge branch 'nemotron-asr' into add-nemotron-3.5-asr
eustlb Jun 18, 2026
fb2a580
make
eustlb Jun 18, 2026
4c1ef7f
simplify a bit modular
eustlb Jun 18, 2026
9a5f8f0
remove cached_property
eustlb Jun 18, 2026
7790f13
nit
eustlb Jun 18, 2026
4abedd9
improve comment
eustlb Jun 19, 2026
ca3c035
refacto NemotronAsrStreamingEncoderSubsamplingConv2D
eustlb Jun 19, 2026
083beaa
address comment
eustlb Jun 19, 2026
2667dea
add all_masked_rows as a kwarg
eustlb Jun 19, 2026
31cd80d
Merge branch 'main' into nemotron-asr
eustlb Jun 19, 2026
5b86bf1
rename fixtures
eustlb Jun 19, 2026
c2be332
check-repo
eustlb Jun 19, 2026
7a518c1
Merge branch 'main' into nemotron-asr
eustlb Jun 19, 2026
41ca3bc
Merge branch 'nemotron-asr' into add-nemotron-3.5-asr
eustlb Jun 19, 2026
567f3dd
Migrate Nemotron3_5Asr onto NemotronAsrStreaming; reuse its generatio…
eustlb Jun 19, 2026
343a6fb
updates
eustlb Jun 19, 2026
9cd9758
use nemotron asr feature extractor
eustlb Jun 19, 2026
ef012da
update
eustlb Jun 19, 2026
232aad9
aMerge remote-tracking branch 'origin/main' into add-nemotron-3.5-asr
eustlb Jun 19, 2026
c5aab47
correct auto mappings
eustlb Jun 23, 2026
fff796c
simplify
eustlb Jun 23, 2026
43cae5d
correct type
eustlb Jun 23, 2026
5b8782e
fix
eustlb Jun 23, 2026
b534ce6
update
eustlb Jun 23, 2026
4695da6
Merge branch 'main' into add-nemotron-3.5-asr
eustlb Jun 23, 2026
66a2936
update tests
eustlb Jun 23, 2026
367aa62
update
eustlb Jun 24, 2026
b62ce3d
nit
eustlb Jun 24, 2026
61535b2
nits
eustlb Jun 24, 2026
237df38
fix typing
eustlb Jun 24, 2026
9e227dc
Merge branch 'main' into add-nemotron-3.5-asr
eustlb Jun 24, 2026
ae81387
Merge branch 'nemotron-asr-plain-proc' into add-nemotron-3.5-asr
eustlb Jun 24, 2026
92702bf
update proc
eustlb Jun 24, 2026
00a144d
Merge branch 'nemotron-asr-plain-proc' into add-nemotron-3.5-asr
eustlb Jun 24, 2026
3e3d0ef
nits
eustlb Jun 24, 2026
b955cde
nit
eustlb Jun 24, 2026
824abe3
tests udpates
eustlb Jun 24, 2026
74 changes: 74 additions & 0 deletions docs/source/en/model_doc/nemotron3_5_asr.md
@@ -0,0 +1,74 @@
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was contributed to Hugging Face Transformers on 2026-06-24.*

# Nemotron3_5Asr

## Overview

Nemotron3_5Asr (`nvidia/nemotron-3.5-asr-streaming-0.6b`) is the **multilingual** extension of [NemotronAsr](./nemotron_asr).
It reuses the entire cache-aware streaming [Fast Conformer](https://huggingface.co/papers/2305.05084) encoder, RNN-T
(Recurrent Neural Network Transducer) head, feature extraction, and streaming generation of [`NemotronAsrStreamingForRNNT`], and adds
**language-ID prompt conditioning** so that a single model transcribes 40 language-locales.

The target language is turned into a one-hot vector, broadcast across the encoder time axis, concatenated with the
encoder output, and fused back to the encoder hidden size by a small MLP (`prompt_kernel`) before the joint network.
Pass the language through the processor's `language` argument (Whisper-style): a locale such as `"en-US"` or `"de-DE"`, a
bare code such as `"de"`, or `"auto"` for automatic language detection. In `auto` mode, the model appends an `<xx-XX>`
language tag after the transcript's terminal punctuation. The tag is a special token, so `decode`/`batch_decode` with the
default `skip_special_tokens=True` strips it and returns a clean transcript; pass `skip_special_tokens=False` to keep it for language labeling.
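
The fusion step described above can be sketched without any dependencies. This is a minimal illustration with toy sizes, not the model's implementation: the helper names (`one_hot`, `linear`, `fuse_language_prompt`, `random_matrix`) are hypothetical, the weights are random, and the MLP shape (`Linear -> ReLU -> Linear`) is taken from the `prompt_intermediate_size` description in the config docstring.

```python
import random


def one_hot(index: int, size: int) -> list[float]:
    """Return a one-hot vector of length `size` with 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear(x: list[float], weight: list[list[float]], bias: list[float]) -> list[float]:
    """Apply y = W x + b, with `weight` stored as (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_language_prompt(encoder_frames, language_index, num_prompts, w1, b1, w2, b2):
    """Concatenate the same one-hot language vector to every frame, then apply the MLP."""
    prompt = one_hot(language_index, num_prompts)  # identical for every time step (the broadcast)
    fused = []
    for frame in encoder_frames:
        hidden = linear(frame + prompt, w1, b1)  # Linear(hidden + num_prompts -> intermediate)
        hidden = [max(0.0, v) for v in hidden]  # ReLU
        fused.append(linear(hidden, w2, b2))  # Linear(intermediate -> hidden)
    return fused


def random_matrix(rows: int, cols: int, rng: random.Random) -> list[list[float]]:
    """Hypothetical helper: a (rows, cols) matrix of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes for illustration only (assumption); the config defaults are much larger.
rng = random.Random(0)
hidden_size, num_prompts, intermediate_size, num_frames = 4, 3, 8, 5
encoder_frames = random_matrix(num_frames, hidden_size, rng)
w1, b1 = random_matrix(intermediate_size, hidden_size + num_prompts, rng), [0.0] * intermediate_size
w2, b2 = random_matrix(hidden_size, intermediate_size, rng), [0.0] * hidden_size

fused = fuse_language_prompt(encoder_frames, 1, num_prompts, w1, b1, w2, b2)
print(len(fused), len(fused[0]))  # 5 4: same time length and hidden size as the encoder output
```

The output keeps the encoder's time length and hidden size, which is why the joint network downstream does not need to change.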

## Usage

```python
from transformers import AutoProcessor, Nemotron3_5AsrForRNNT
from datasets import load_dataset, Audio

model_id = "nvidia/nemotron-3.5-asr-streaming-0.6b"
processor = AutoProcessor.from_pretrained(model_id)
model = Nemotron3_5AsrForRNNT.from_pretrained(model_id).eval()

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

# Condition on a known language ...
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16000, language="en-US")
generated = model.generate(**inputs)
print(processor.batch_decode(generated.sequences, skip_special_tokens=True))

# ... or let the model detect it and keep the emitted language tag.
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16000, language="auto")
generated = model.generate(**inputs)
print(processor.batch_decode(generated.sequences, skip_special_tokens=False))
```
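
When decoding with `skip_special_tokens=False`, the language tag can be separated from the transcript in post-processing. The sketch below is one possible approach and not part of the library: `split_language_tag` is a hypothetical helper, the sample strings are made up, and it assumes the tag appears literally as `<xx-XX>` at the very end of the decoded string, with no other special tokens after it.

```python
import re
from typing import Optional

# Assumed tag shape: "<xx-XX>" at the end of the decoded string, e.g. "<en-US>".
LANGUAGE_TAG = re.compile(r"\s*<([a-z]{2,3}-[A-Z]{2})>\s*$")


def split_language_tag(decoded: str) -> tuple[str, Optional[str]]:
    """Return (transcript, locale); locale is None when no trailing tag is found."""
    match = LANGUAGE_TAG.search(decoded)
    if match is None:
        return decoded.strip(), None
    return decoded[: match.start()].strip(), match.group(1)


print(split_language_tag("Guten Morgen. <de-DE>"))  # ('Guten Morgen.', 'de-DE')
print(split_language_tag("Good morning."))  # ('Good morning.', None)
```

If the decoded string can contain other special tokens, the pattern would need to be adapted accordingly.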

## Nemotron3_5AsrConfig

[[autodoc]] Nemotron3_5AsrConfig

## Nemotron3_5AsrProcessor

[[autodoc]] Nemotron3_5AsrProcessor

## Nemotron3_5AsrRNNTOutput

[[autodoc]] Nemotron3_5AsrRNNTOutput

## Nemotron3_5AsrForRNNT

[[autodoc]] Nemotron3_5AsrForRNNT
- forward
- generate
2 changes: 2 additions & 0 deletions src/transformers/models/auto/auto_mappings.py
@@ -402,6 +402,7 @@
("mvp", "MvpConfig"),
("nanochat", "NanoChatConfig"),
("nemotron", "NemotronConfig"),
("nemotron3_5_asr", "Nemotron3_5AsrConfig"),
("nemotron_asr_streaming", "NemotronAsrStreamingConfig"),
("nemotron_asr_streaming_encoder", "NemotronAsrStreamingEncoderConfig"),
("nemotron_h", "NemotronHConfig"),
@@ -1047,6 +1048,7 @@
("musicflamingo", "MusicFlamingoProcessor"),
("musicgen", "MusicgenProcessor"),
("musicgen_melody", "MusicgenMelodyProcessor"),
("nemotron3_5_asr", "Nemotron3_5AsrProcessor"),
("nemotron_asr_streaming", "NemotronAsrStreamingProcessor"),
("nougat", "NougatProcessor"),
("omdet-turbo", "OmDetTurboProcessor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/feature_extraction_auto.py
@@ -49,6 +49,7 @@
("moonshine", "Wav2Vec2FeatureExtractor"),
("moshi", "EncodecFeatureExtractor"),
("musicgen", "EncodecFeatureExtractor"),
("nemotron3_5_asr", "NemotronAsrStreamingFeatureExtractor"),
("nemotron_asr_streaming_encoder", "NemotronAsrStreamingFeatureExtractor"),
("parakeet_ctc", "ParakeetFeatureExtractor"),
("parakeet_encoder", "ParakeetFeatureExtractor"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -338,6 +338,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("mvp", "MvpModel"),
("nanochat", "NanoChatModel"),
("nemotron", "NemotronModel"),
("nemotron3_5_asr", "Nemotron3_5AsrForRNNT"),
("nemotron_asr_streaming", "NemotronAsrStreamingForRNNT"),
("nemotron_asr_streaming_encoder", "NemotronAsrStreamingEncoder"),
("nemotron_h", "NemotronHModel"),
@@ -1726,6 +1727,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
MODEL_FOR_RNNT_MAPPING_NAMES = OrderedDict(
[
# Model for RNN Transducer (RNN-T) mapping.
("nemotron3_5_asr", "Nemotron3_5AsrForRNNT"),
("nemotron_asr_streaming", "NemotronAsrStreamingForRNNT"),
("parakeet_rnnt", "ParakeetForRNNT"),
]
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -233,6 +233,7 @@
("musicgen_melody", "T5Tokenizer" if is_tokenizers_available() else None),
("mvp", "MvpTokenizer" if is_tokenizers_available() else None),
("myt5", "MyT5Tokenizer"),
("nemotron3_5_asr", "ParakeetTokenizer" if is_tokenizers_available() else None),
("nemotron_asr_streaming", "ParakeetTokenizer" if is_tokenizers_available() else None),
("nezha", "BertTokenizer" if is_tokenizers_available() else None),
("nllb", "NllbTokenizer" if is_tokenizers_available() else None),
28 changes: 28 additions & 0 deletions src/transformers/models/nemotron3_5_asr/__init__.py
@@ -0,0 +1,28 @@
# Copyright 2026 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
    from .configuration_nemotron3_5_asr import *
    from .modeling_nemotron3_5_asr import *
    from .processing_nemotron3_5_asr import *
else:
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
@@ -0,0 +1,97 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from huggingface_hub.dataclasses import strict

from ...configuration_utils import PreTrainedConfig
from ...utils import auto_docstring
from ..nemotron_asr_streaming.configuration_nemotron_asr_streaming import NemotronAsrStreamingEncoderConfig


@auto_docstring(checkpoint="nvidia/nemotron-3.5-asr-streaming-0.6b")
@strict
class Nemotron3_5AsrConfig(PreTrainedConfig):
    r"""
    vocab_size (`int`, *optional*, defaults to 13088):
        Vocabulary size of the joint network output (including the blank token).
    decoder_hidden_size (`int`, *optional*, defaults to 640):
        Hidden size of the LSTM prediction network (NeMo's `pred_hidden`).
    num_decoder_layers (`int`, *optional*, defaults to 2):
        Number of LSTM layers in the prediction network.
    hidden_act (`str`, *optional*, defaults to `"relu"`):
        Activation in the joint network.
    max_symbols_per_step (`int`, *optional*, defaults to 10):
        Maximum number of non-blank symbols emitted per encoder time step during greedy decoding.
    encoder_config (`Union[dict, NemotronAsrStreamingEncoderConfig]`, *optional*):
        The config object or dictionary of the encoder. Reuses [`NemotronAsrStreamingEncoderConfig`] directly,
        since the encoder is identical to [`NemotronAsrStreaming`]'s.
    blank_token_id (`int`, *optional*, defaults to 13087):
        Blank token id for RNN-T decoding.
    joint_hidden_size (`int`, *optional*, defaults to 640):
        Hidden size of the joint network's encoder/decoder projections (NeMo's `joint_hidden`).
    durations (`list[int]`, *optional*, defaults to `()`):
        Pinned to the empty tuple for RNN-T: no token durations are predicted, so the joint head outputs
        only `vocab_size` logits.
    num_prompts (`int`, *optional*, defaults to 128):
        Number of language-prompt slots. The target language is encoded as a one-hot vector of this
        size, broadcast across the encoder time axis and concatenated with the encoder output before
        the `prompt_kernel` fusion MLP.
    prompt_intermediate_size (`int`, *optional*, defaults to 2048):
        Hidden size of the `prompt_kernel` fusion MLP (`Linear(hidden + num_prompts -> intermediate)
        -> ReLU -> Linear(intermediate -> hidden)`).

    Example:
    ```python
    >>> from transformers import Nemotron3_5AsrForRNNT, Nemotron3_5AsrConfig

    >>> configuration = Nemotron3_5AsrConfig()
    >>> model = Nemotron3_5AsrForRNNT(configuration)
    >>> configuration = model.config
    ```
    """

    model_type = "nemotron3_5_asr"
    # The encoder is identical to NemotronAsrStreaming's, so reuse its config class directly.
    sub_configs = {"encoder_config": NemotronAsrStreamingEncoderConfig}

    vocab_size: int = 13088
    decoder_hidden_size: int = 640
    num_decoder_layers: int = 2
    hidden_act: str = "relu"
    max_symbols_per_step: int = 10
    encoder_config: dict | PreTrainedConfig | None = None
    pad_token_id: int = 0
    blank_token_id: int = 13087
    is_encoder_decoder: bool = True
    joint_hidden_size: int = 640
    durations: list[int] | tuple[int, ...] = ()
    num_prompts: int = 128
    prompt_intermediate_size: int = 2048

    def __post_init__(self, **kwargs):
        if self.decoder_hidden_size != self.joint_hidden_size:
            raise ValueError(
                "Nemotron3_5AsrConfig currently requires decoder_hidden_size == joint_hidden_size "
                f"(got {self.decoder_hidden_size} and {self.joint_hidden_size})."
            )
        # The decoder starts on the blank token at frame 0 (NeMo's blank_as_pad convention).
        kwargs.setdefault("decoder_start_token_id", self.blank_token_id)
        if isinstance(self.encoder_config, dict):
            self.encoder_config = NemotronAsrStreamingEncoderConfig(**self.encoder_config)
        elif self.encoder_config is None:
            self.encoder_config = NemotronAsrStreamingEncoderConfig()
        self.initializer_range = self.encoder_config.initializer_range
        super().__post_init__(**kwargs)


__all__ = ["Nemotron3_5AsrConfig"]