Merged
129 commits
3c8391d
init inference folders
Sep 15, 2025
1b1cb59
added base asr inference
Sep 15, 2025
25c57a8
add ctc and rnnt inference classes
Sep 15, 2025
cb2c7f3
small changes for ctc/rnnt inference
Sep 15, 2025
ab47f96
add cache aware ctc/rnnt inference classes
Sep 15, 2025
c59bc2b
finilize asr inference part
Sep 15, 2025
7c30d09
add word class
Sep 16, 2025
76ff0c4
add enums file
Sep 16, 2025
b3aeb99
add alignment preserving itn
Sep 16, 2025
cd4eb39
add punctuation/capitalization model
Sep 17, 2025
3a5bf54
add audio_io and progressbar files
Sep 17, 2025
397a950
add framing and buffering files
Sep 17, 2025
08f5203
mv common/inference/utils into asr/inference/utils
Sep 18, 2025
2f0717d
add StreamingState objects
Sep 18, 2025
f00102c
temporary rm enhancement stuff
Sep 18, 2025
77d5ffb
rm common/inference
Sep 18, 2025
6e83d6b
add greedy decoders for CTC/RNNT
Sep 18, 2025
2cbdf2d
add endpointing files
Sep 18, 2025
c19bf72
add text processing
Sep 18, 2025
4800a7c
mv itn_utils into utils
Sep 19, 2025
194a507
add bpe_decoder, context_manager for cache aware, recognizer_utils
Sep 19, 2025
5c6e97f
add base_recognizer and recognizer interface files
Sep 19, 2025
9ee2364
add recognizers
Sep 19, 2025
8f65ebf
add factory
Sep 19, 2025
af6e1ef
add inference example and asr_client.py
Sep 19, 2025
e401e6f
minor fix
Sep 19, 2025
18a3e3b
minor fixes
Sep 19, 2025
2da1769
add example usage
Sep 22, 2025
13ff6ec
add jsonl support
Sep 22, 2025
da48a7a
rm niva prefix
Sep 22, 2025
68502ee
fix docstrings
Sep 22, 2025
55b020e
mv RequestType into enums.py
Sep 22, 2025
6f3fed1
rm redundant setters
Sep 22, 2025
42f738f
add a log_level to config.yaml
Sep 22, 2025
010213c
setup log_level in RecognizerBuilder
Sep 22, 2025
a6b9c19
add comments in multi stream and fix docstrings in buffering
Sep 23, 2025
be51fc7
conditional import for diskcache
Sep 23, 2025
6360dd0
set log level to INFO
Sep 23, 2025
72e3115
add MPS device support
Sep 24, 2025
1ccd6e5
add tests
Sep 24, 2025
1f2d381
move inference into examples/asr/asr_chunked_inference/ctc
Sep 25, 2025
e68107e
rm duplicated create_partial_transcript method
Sep 25, 2025
a85520a
Apply isort and black reformatting
naymaraq Sep 26, 2025
efd06b2
resolve flake8 errors
Sep 26, 2025
fa57b30
resolve return type
Sep 26, 2025
022f4eb
fix imports in tests
Sep 27, 2025
f3e0099
optimize bpe_decoder
Sep 28, 2025
d780841
optimize log prob normalization
Sep 30, 2025
d7d7b74
optimize split_text function
Sep 30, 2025
929a9ab
fix parital batching, improved GPU utilization
Oct 4, 2025
2ba57fa
simplify ctc greedy decoder
Oct 4, 2025
971d2b5
add a method to perform ITN on a list of texts
Oct 4, 2025
088a7c3
remove duplicated code in enums
Oct 4, 2025
2678ad9
remove unnecessary pad_to logging
Oct 4, 2025
4570eb1
modified update_punctuation_and_language_tokens_timestamps function t…
Oct 4, 2025
c9aaff0
Apply isort and black reformatting
naymaraq Oct 4, 2025
7309846
[refactor: segment-level output] conditional import for pynini and ne…
Oct 6, 2025
ade13e5
[refactor: segment-level output] fix configs, added asr_output_granul…
Oct 6, 2025
baae7eb
[refactor: segment-level output] write segment/word level output into…
Oct 6, 2025
b3cf0c0
[refactor: segment-level output] add output granuality to request opt…
Oct 6, 2025
637094e
[refactor: segment-level output] add segment related fields to state
Oct 6, 2025
b88d22c
[refactor: segment-level output] add remove repeated punctuation func…
Oct 6, 2025
bbe9020
[refactor: segment-level output] add TextSegment class
Oct 6, 2025
1da433a
[refactor: segment-level output] update bpe decoder to support text s…
Oct 6, 2025
75926e1
[refactor: segment-level output] update recognizers
Oct 6, 2025
80f780b
[refactor: segment-level output] update text processing to support se…
Oct 6, 2025
e2f229a
rm unused and duplicated code
Oct 7, 2025
e2997b0
Apply isort and black reformatting
naymaraq Oct 7, 2025
9a4dbaa
code cleanup
Oct 7, 2025
b54d7c2
Apply isort and black reformatting
naymaraq Oct 7, 2025
557b66b
rm unused code and code cleanup
Oct 7, 2025
14d671c
Merge branch 'dkaramyan/inference' of https://github.com/NVIDIA-NeMo/…
Oct 7, 2025
e776d55
Set num_slots to 1024 and add a num_slots parameter to the config files
Oct 8, 2025
80b43fe
removed hyp.alignment processing codes
Oct 8, 2025
ca4bae8
disable amp
Oct 9, 2025
565ba2d
mv diskcache req into requirements_asr.txt
Oct 9, 2025
6b59370
set use_amp to true and make typing consistent
Oct 9, 2025
0348ea3
use match/case for readability
Oct 9, 2025
5e86413
rm lambdas from punctuation_capitalization_config.py
Oct 9, 2025
04f49da
rm detect_eou method from RNNTGreedyEndpointing
Oct 9, 2025
0ba2e71
reuse read_manifest from manifest_utils
Oct 9, 2025
07999dd
use librosa instead of soundfile
Oct 9, 2025
51bcec0
unfreeze ASRRequestOptions dataclass
Oct 9, 2025
4682688
set use_amp to false for buffered CTC/RNNT recognizers, improved thro…
Oct 9, 2025
a946876
change matmul precision to high for cache aware models
Oct 10, 2025
f236413
optimized audio buffer shifting
Oct 11, 2025
e34ed1e
Move running scripts and YAML files out of the ctc folder
Oct 12, 2025
d5537b3
reorganize file structure
Oct 12, 2025
5a7f6b3
Apply isort and black reformatting
naymaraq Oct 12, 2025
50b8de2
Minor code simplifications
Oct 13, 2025
60a6d78
rm duplicated initializations from recognizers
Oct 13, 2025
f56ad1a
remove package version for diskcache
Oct 13, 2025
ef02c1a
move tqdm import to the top
Oct 13, 2025
e366e63
simplify millisecond_to_frames function
Oct 13, 2025
b144d04
raise a ValueError in case of stream_id > n_audio_files
Oct 13, 2025
49ee2f1
fix return types
Oct 13, 2025
5bce5c1
use list/dict/... instead of List/Dict/...
Oct 13, 2025
793c90d
use keyword argument passing to create CacheFeatureBufferer
Oct 13, 2025
42d9aec
clean up state resetting logic
Oct 13, 2025
1ec2c4b
reuse normalize_batch
Oct 13, 2025
68d549b
rename verbatim_transcripts and automatic_punctuation
Oct 14, 2025
3a17281
rename recognizers to pipelines
Oct 14, 2025
5b63a25
rename asr/*_inference -> model_wrappers/*_inference_wrapper
Oct 14, 2025
c45b6aa
Apply isort and black reformatting
naymaraq Oct 14, 2025
c8e5b35
reorgonize pnc, itn, text_processing params
Oct 14, 2025
4a3585a
improved code readability in pipeline initializations
Oct 15, 2025
7dd54fa
Apply isort and black reformatting
naymaraq Oct 15, 2025
5a7a1ce
add CI script for testing
Oct 15, 2025
cb5bf78
add output_dir in CI test
Oct 15, 2025
02d5b48
move python running script into new folder
Oct 15, 2025
9044aa5
renamed asr_streaming_infer -> asr_streaming_inference
Oct 15, 2025
65f2007
correct path in CI test
Oct 15, 2025
66573f0
Merge remote-tracking branch 'origin/main' into dkaramyan/inference
Oct 27, 2025
444733f
fix: variable may be used before it is initialized
Oct 27, 2025
bc9042f
fix docstring in itn/ folder
Oct 27, 2025
525b167
fix docstring in model_wrappers/ folder
Oct 27, 2025
a2cf86d
fix docstring in utils/ folder
Oct 27, 2025
a54196d
fix docstring in pipelines/ folder
Oct 27, 2025
b9eb2ec
fix docstring in streaming/ folder
Oct 27, 2025
d0510c4
remove PnC codes since nlp models are no longer supported
Oct 28, 2025
e0a0de3
minor changes
Oct 29, 2025
3eba8d8
return step output from transcribe_step method
Oct 30, 2025
b1369c2
Apply isort and black reformatting
naymaraq Oct 30, 2025
511fd38
Merge branch 'main' into dkaramyan/inference
naymaraq Oct 30, 2025
5299973
fix functional_test
Oct 31, 2025
d55ad43
Merge branch 'main' into dkaramyan/inference
naymaraq Oct 31, 2025
fdc168c
increase timeout for L0_Unit_Tests_CPU_ASR
Oct 31, 2025
0328dbf
Merge branch 'main' into dkaramyan/inference
naymaraq Oct 31, 2025
5cb1a65
rm cache aware inference from functional test
Oct 31, 2025
2 changes: 2 additions & 0 deletions .github/workflows/cicd-main-speech.yml
@@ -129,6 +129,8 @@ jobs:
script: L2_Speech_Transcription_Speech_to_Text_Streaming_Infer
- runner: self-hosted-azure
script: L2_Speech_Transcription_Speech_to_Text_Cache_Aware_Infer
- runner: self-hosted-azure
script: L2_Speech_Transcription_Streaming_Inference
- runner: self-hosted-azure
script: L2_Speech_Transcription_Canary_Transcribe_Full_Manifest
- runner: self-hosted-azure
1 change: 1 addition & 0 deletions examples/asr/asr_chunked_inference/README.md
@@ -13,3 +13,4 @@ On the other hand, if you increase your chunk size, then the delay between spoke
## Chunked Inference

For MultitaskAED models, we provide a script to perform chunked inference. This script will split the input audio into non-overlapping chunks and perform inference on each chunk. The script will then concatenate the results to provide the final transcript.
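The non-overlapping chunking described above can be sketched in a few lines. This is an illustration only, not the provided script: `split_into_chunks`, `transcribe_chunked`, and the `transcribe_fn` callback are hypothetical names, and the audio is represented as a plain list of samples.

```python
from typing import Callable, Sequence


def split_into_chunks(samples: Sequence[float], chunk_len: int) -> list:
    """Split samples into consecutive, non-overlapping chunks (the last one may be shorter)."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    return [samples[i : i + chunk_len] for i in range(0, len(samples), chunk_len)]


def transcribe_chunked(
    samples: Sequence[float],
    chunk_len: int,
    transcribe_fn: Callable[[Sequence[float]], str],
) -> str:
    """Run a (hypothetical) per-chunk transcriber and join the partial transcripts."""
    texts = [transcribe_fn(chunk) for chunk in split_into_chunks(samples, chunk_len)]
    # Skip empty partial transcripts so the result has no doubled spaces.
    return " ".join(text for text in texts if text)
```

Because the chunks do not overlap, words that straddle a chunk boundary may be split; the chunk length is therefore a trade-off between memory use and boundary errors.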

11 changes: 11 additions & 0 deletions examples/asr/asr_streaming_inference/README.md
@@ -0,0 +1,11 @@
# Universal Streaming Inference

The `asr_streaming_infer.py` script enables streaming inference for both buffered (CTC/RNNT/TDT) and cache-aware (CTC/RNNT) ASR models. It supports processing a single audio file, a directory of audio files, or a manifest file.

Beyond streaming ASR, the script also supports:

* **Inverse Text Normalization (ITN)**
* **End-of-Utterance (EoU) Detection**
* **Word-level and Segment-level Output**

All related configurations can be found in the `../conf/asr_streaming_inference/` directory.
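For readers who want to consume the results, the sketch below parses output records of the form given in the docstring of `asr_streaming_infer.py` (`audio_filepath`, `text`, `json_filepath`). It assumes one JSON object per line (JSONL) and that all three keys are present in every record; `parse_output_lines` and `REQUIRED_KEYS` are hypothetical helpers, not part of this PR.

```python
import json

# Keys taken from the example record in the script docstring (assumed to be always present).
REQUIRED_KEYS = ("audio_filepath", "text", "json_filepath")


def parse_output_lines(lines):
    """Parse JSONL output records and check that the expected keys exist."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise KeyError(f"record is missing keys: {missing}")
        records.append(record)
    return records


example = (
    '{"audio_filepath": "path/to/audio/file", '
    '"text": "transcription of the audio file", '
    '"json_filepath": "path/to/json/file"}'
)
records = parse_output_lines([example])
```

In practice the lines would come from the file passed as `output_filename`, for example `parse_output_lines(open(path, encoding="utf-8"))`.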
96 changes: 96 additions & 0 deletions examples/asr/asr_streaming_inference/asr_streaming_infer.py
@@ -0,0 +1,96 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script serves as the entry point for local ASR inference, supporting buffered CTC/RNNT/TDT and cache-aware CTC/RNNT inference.

The script performs the following steps:
    (1) Accepts as input a single audio file, a directory of audio files, or a manifest file.
        - Note: Input audio files must be 16 kHz, mono-channel WAV files.
    (2) Creates a pipeline object to perform inference.
    (3) Runs inference on the input audio files.
    (4) Writes the transcriptions to an output json/jsonl file. Word/Segment level output is written to a separate JSON file.

Example usage:
python asr_streaming_infer.py \
        --config-path=../conf/asr_streaming_inference/ \
        --config-name=config.yaml \
        audio_file=<path to audio file, directory of audio files, or manifest file> \
        output_filename=<path to output JSON file> \
        lang=en \
        enable_pnc=False \
        enable_itn=True \
        asr_output_granularity=segment \
        ...
        # See ../conf/asr_streaming_inference/*.yaml for all available options

Note:
    The output file is a json file with the following structure:
    {"audio_filepath": "path/to/audio/file", "text": "transcription of the audio file", "json_filepath": "path/to/json/file"}
"""


from time import time

import hydra

from nemo.collections.asr.inference.factory.pipeline_builder import PipelineBuilder
from nemo.collections.asr.inference.utils.manifest_io import calculate_duration, dump_output, get_audio_filepaths
from nemo.collections.asr.inference.utils.progressbar import TQDMProgressBar
from nemo.utils import logging

# disable nemo_text_processing logging
try:
    from nemo_text_processing.utils import logger as nemo_text_logger

    nemo_text_logger.propagate = False
except ImportError:
    # NB: nemo_text_processing requires pynini, which is tricky to install on MacOS
    # since nemo_text_processing is not necessary for ASR, wrap the import
    logging.warning("NeMo text processing library is unavailable.")


@hydra.main(version_base=None)
def main(cfg):

    # Set the logging level
    logging.setLevel(cfg.log_level)

    # Reading audio filepaths
    audio_filepaths = get_audio_filepaths(cfg.audio_file, sort_by_duration=True)
    logging.info(f"Found {len(audio_filepaths)} audio files")

    # Build the pipeline
    pipeline = PipelineBuilder.build_pipeline(cfg)
    progress_bar = TQDMProgressBar()

    # Run the pipeline
    start = time()
    output = pipeline.run(audio_filepaths, progress_bar=progress_bar)
    exec_dur = time() - start

    # Calculate RTFX
    data_dur = calculate_duration(audio_filepaths)
    rtfx = data_dur / exec_dur if exec_dur > 0 else float('inf')
    logging.info(f"RTFX: {rtfx:.2f} ({data_dur:.2f}s / {exec_dur:.2f}s)")

    # Dump the transcriptions to an output file
    dump_output(output, cfg.output_filename, cfg.output_dir)
    logging.info(f"Transcriptions written to {cfg.output_filename}")
    logging.info("Done!")


if __name__ == "__main__":
    main()
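The RTFX figure logged by the script is the inverse real-time factor: seconds of audio processed per second of wall-clock time. The standalone sketch below mirrors that arithmetic, including the guard against a zero elapsed time; `compute_rtfx` is a hypothetical helper name used only for illustration.

```python
def compute_rtfx(audio_seconds: float, wall_seconds: float) -> float:
    """Return audio duration divided by processing time (higher is faster)."""
    return audio_seconds / wall_seconds if wall_seconds > 0 else float("inf")


# 120 s of audio transcribed in 4 s of wall-clock time.
print(f"RTFX: {compute_rtfx(120.0, 4.0):.2f}")  # prints "RTFX: 30.00"
```

An RTFX above 1 means the pipeline runs faster than real time. Note that the script starts its timer after the pipeline is built, so model loading is excluded from the measurement.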
80 changes: 80 additions & 0 deletions examples/asr/conf/asr_streaming_inference/buffered_ctc.yaml
@@ -0,0 +1,80 @@
# ================================
# ASR Configuration
# ================================
asr:
  model_name: nvidia/parakeet-ctc-1.1b # Pre-trained CTC/hybrid model from NGC/HuggingFace or local .nemo file path
  device: cuda # Device for inference: 'cuda' or 'cpu'
  device_id: 0 # GPU device ID
  compute_dtype: bfloat16 # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32'
  use_amp: false # Enable Automatic Mixed Precision


# ==========================================
# Inverse Text Normalization Configuration
# ==========================================
itn:
  input_case: lower_cased # Input text case handling: 'lower_cased', 'cased'
  whitelist: null # Custom whitelist for ITN processing
  overwrite_cache: false # Whether to overwrite existing cache files
  max_number_of_permutations_per_split: 729 # Maximum permutations allowed per text split during ITN processing
  left_padding_size: 4 # Padding size (#spans) for ITN context
  batch_size: 32 # Batch size for ITN inference
  n_jobs: 16 # Number of parallel jobs for ITN processing


# ========================
# Confidence estimation
# ========================
confidence:
  exclude_blank: true # Exclude blank tokens when calculating confidence
  aggregation: mean # Aggregation method for confidence across time steps
  method_cfg:
    name: entropy # Confidence estimation method: 'max_prob' or 'entropy'
    entropy_type: tsallis
    alpha: 0.5
    entropy_norm: exp


# ========================
# Endpointing settings
# ========================
endpointing:
  stop_history_eou: 800 # Time window (ms) for evaluating EoU
  residue_tokens_at_end: 2 # Number of residual tokens used for EoU


# ========================
# Streaming configuration
# ========================
streaming:
  sample_rate: 16000 # Audio sample rate in Hz
  batch_size: 256 # Number of audio frames per batch
  left_padding_size: 1.6 # Left padding duration in seconds
  right_padding_size: 1.6 # Right padding duration in seconds
  chunk_size: 4.8 # Audio chunk size in seconds
  word_boundary_tolerance: 4 # Tolerance for word boundaries
  request_type: feature_buffer # Type of request: frame or feature_buffer
  padding_mode: right # Padding mode: left or right. How to pad frames to match the required buffer length


# ========================
# Pipeline settings
# ========================
matmul_precision: high # Matrix multiplication precision: highest, high, medium
log_level: 20 # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL)
pipeline_type: buffered # Pipeline type: buffered, cache_aware
asr_decoding_type: ctc # Decoding method: ctc or rnnt


# ========================
# Arguments supplied at runtime via the command line
# ========================
audio_file: null # Path to audio file, directory, or manifest JSON
output_filename: null # Path to output transcription JSON file
output_dir: null # Directory to save time-aligned output
enable_pnc: false # Whether to apply punctuation & capitalization
enable_itn: false # Whether to apply inverse text normalization
asr_output_granularity: segment # Output granularity: word or segment
cache_dir: null # Directory to store cache (e.g., .far files)
lang: null # Language code for ASR model
return_tail_result: false # Whether to return the tail labels left in the right padded side of the buffer
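To make the buffered streaming settings above concrete, the sketch below converts the durations in the `streaming` section into sample counts. It assumes the model sees a buffer made of left padding, the chunk itself, and right padding laid end to end; that layout and the `buffer_layout` helper are assumptions for illustration, not taken from the pipeline code.

```python
def buffer_layout(
    sample_rate: int,
    chunk_size: float,
    left_padding_size: float,
    right_padding_size: float,
) -> dict:
    """Convert chunk and padding durations (seconds) into sample counts."""
    buffer_secs = left_padding_size + chunk_size + right_padding_size
    return {
        "buffer_secs": buffer_secs,
        # round() guards against floating-point error in the products.
        "buffer_samples": round(buffer_secs * sample_rate),
        "chunk_samples": round(chunk_size * sample_rate),
    }


# Values from buffered_ctc.yaml: 1.6 s + 4.8 s + 1.6 s of audio at 16 kHz.
layout = buffer_layout(16000, chunk_size=4.8, left_padding_size=1.6, right_padding_size=1.6)
print(layout["buffer_samples"], layout["chunk_samples"])  # prints "128000 76800"
```

Under this assumption, each step emits text for 4.8 s of new audio while the model attends to roughly 8 s, so larger padding buys more context at the cost of extra computation per step.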
83 changes: 83 additions & 0 deletions examples/asr/conf/asr_streaming_inference/buffered_rnnt.yaml
@@ -0,0 +1,83 @@
# ================================
# ASR Configuration
# ================================
asr:
  model_name: nvidia/parakeet-rnnt-1.1b # Pre-trained RNNT/hybrid model from NGC/HuggingFace or local .nemo file path
  device: cuda # Device for inference: 'cuda' or 'cpu'
  device_id: 0 # GPU device ID
  compute_dtype: bfloat16 # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32'
  use_amp: false # Enable Automatic Mixed Precision
  ngram_lm_model: "" # Path to ngram language model
  ngram_lm_alpha: 0.0 # Alpha for language model


# ==========================================
# Inverse Text Normalization Configuration
# ==========================================
itn:
  input_case: lower_cased # Input text case handling: 'lower_cased', 'cased'
  whitelist: null # Custom whitelist for ITN processing
  overwrite_cache: false # Whether to overwrite existing cache files
  max_number_of_permutations_per_split: 729 # Maximum permutations allowed per text split during ITN processing
  left_padding_size: 4 # Padding size (#spans) for ITN context
  batch_size: 32 # Batch size for ITN inference
  n_jobs: 16 # Number of parallel jobs for ITN processing


# ========================
# Confidence estimation
# ========================
confidence:
  exclude_blank: true # Exclude blank tokens when calculating confidence
  aggregation: mean # Aggregation method for confidence across time steps
  method_cfg:
    name: entropy # Confidence estimation method: 'max_prob' or 'entropy'
    entropy_type: tsallis
    alpha: 0.5
    entropy_norm: exp


# ========================
# Endpointing settings
# ========================
endpointing:
  stop_history_eou: 800 # Time window (ms) for evaluating EoU
  residue_tokens_at_end: 2 # Number of residual tokens used for EoU


# ========================
# Streaming configuration
# ========================
streaming:
  sample_rate: 16000 # Audio sample rate in Hz
  batch_size: 256 # Number of audio frames per batch
  left_padding_size: 1.6 # Left padding duration in seconds
  right_padding_size: 1.6 # Right padding duration in seconds
  chunk_size: 4.8 # Audio chunk size in seconds
  word_boundary_tolerance: 4 # Tolerance for word boundaries
  request_type: feature_buffer # Type of request: frame or feature_buffer
  stateful: true # Whether to use stateful processing
  padding_mode: right # Padding mode: left or right. How to pad frames to match the required buffer length


# ========================
# Pipeline settings
# ========================
matmul_precision: high # Matrix multiplication precision: highest, high, medium
log_level: 20 # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL)
pipeline_type: buffered # Pipeline type: buffered, cache_aware
asr_decoding_type: rnnt # Decoding method: ctc or rnnt


# ========================
# Arguments supplied at runtime via the command line
# ========================
audio_file: null # Path to audio file, directory, or manifest JSON
output_filename: null # Path to output transcription JSON file
output_dir: null # Directory to save time-aligned output
enable_pnc: false # Whether to apply punctuation & capitalization
enable_itn: false # Whether to apply inverse text normalization
asr_output_granularity: segment # Output granularity: word or segment
cache_dir: null # Directory to store cache (e.g., .far files)
lang: null # Language code for ASR model
return_tail_result: false # Whether to return the tail labels left in the right padded side of the buffer
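The `stop_history_eou` window above is given in milliseconds, whereas end-of-utterance checks presumably operate on encoder frames. The sketch below shows one way such a conversion could look; `ms_to_frames` is a hypothetical helper, and the 80 ms frame duration is an assumption based on the 0.08 s FastConformer step mentioned in the cache-aware config.

```python
def ms_to_frames(duration_ms: int, frame_ms: int = 80) -> int:
    """Number of whole frames needed to cover duration_ms (rounded up)."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    if duration_ms < 0:
        raise ValueError("duration_ms must be non-negative")
    # Integer ceiling division avoids floating-point rounding issues.
    return (duration_ms + frame_ms - 1) // frame_ms


# stop_history_eou: 800 ms with an assumed 80 ms frame.
print(ms_to_frames(800))  # prints "10"
```

Whether the actual pipeline rounds up or down is not visible in this diff; rounding up is chosen here so the window never covers less time than configured.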
80 changes: 80 additions & 0 deletions examples/asr/conf/asr_streaming_inference/cache_aware_ctc.yaml
@@ -0,0 +1,80 @@
# ================================
# ASR Configuration
# ================================
asr:
  model_name: stt_en_fastconformer_hybrid_large_streaming_multi # Pre-trained CTC/hybrid model from NGC/HuggingFace or local .nemo file path
  device: cuda # Device for inference: 'cuda' or 'cpu'
  device_id: 0 # GPU device ID
  compute_dtype: bfloat16 # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32'
  use_amp: true # Enable Automatic Mixed Precision


# ==========================================
# Inverse Text Normalization Configuration
# ==========================================
itn:
  input_case: lower_cased # Input text case handling: 'lower_cased', 'cased'
  whitelist: null # Custom whitelist for ITN processing
  overwrite_cache: false # Whether to overwrite existing cache files
  max_number_of_permutations_per_split: 729 # Maximum permutations allowed per text split during ITN processing
  left_padding_size: 4 # Padding size (#spans) for ITN context
  batch_size: 32 # Batch size for ITN inference
  n_jobs: 16 # Number of parallel jobs for ITN processing


# ========================
# Confidence estimation
# ========================
confidence:
  exclude_blank: true # Exclude blank tokens when calculating confidence
  aggregation: mean # Aggregation method for confidence across time steps
  method_cfg:
    name: entropy # Confidence estimation method: 'max_prob' or 'entropy'
    entropy_type: tsallis
    alpha: 0.5
    entropy_norm: exp


# ========================
# Endpointing settings
# ========================
endpointing:
  stop_history_eou: 800 # Time window (ms) for evaluating EoU
  residue_tokens_at_end: 2 # Number of residual tokens used for EoU


# ========================
# Streaming configuration
# ========================
streaming:
  sample_rate: 16000 # Audio sample rate in Hz
  batch_size: 256 # Number of audio frames per batch
  word_boundary_tolerance: 4 # Tolerance for word boundaries
  att_context_size: [70,13] # Attention context size: [70,13],[70,6],[70,1],[70,0]
  use_cache: true # Whether to use cache for streaming
  use_feat_cache: true # Whether to cache mel-spec features, set false to re-calculate all mel-spec features in audio buffer
  chunk_size_in_secs: null # Amount of audio to load for each streaming step, e.g., 0.08s for FastConformer. Set to `null` for using default size equal to 1+lookahead frames.
  request_type: frame # Type of request: frame, only frame is supported for cache-aware streaming
  num_slots: 1024 # Number of slots in the context manager: must be >= batch_size


# ========================
# Pipeline settings
# ========================
matmul_precision: high # Matrix multiplication precision: highest, high, medium
log_level: 20 # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL)
pipeline_type: cache_aware # Pipeline type: buffered, cache_aware
asr_decoding_type: ctc # Decoding method: ctc or rnnt

# ========================
# Arguments supplied at runtime via the command line
# ========================
audio_file: null # Path to audio file, directory, or manifest JSON
output_filename: null # Path to output transcription JSON file
output_dir: null # Directory to save time-aligned output
enable_pnc: false # Whether to apply punctuation & capitalization
enable_itn: false # Whether to apply inverse text normalization
asr_output_granularity: segment # Output granularity: word or segment
cache_dir: null # Directory to store cache (e.g., .far files)
lang: null # Language code for ASR model
return_tail_result: false # Whether to return the tail labels left in the right padded side of the buffer
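Two of the cache-aware settings above carry arithmetic worth spelling out: the default chunk size of "1+lookahead frames" when `chunk_size_in_secs` is `null`, and the constraint that `num_slots` must be at least `batch_size`. The sketch below illustrates both. The helper names are hypothetical, the second entry of `att_context_size` is assumed to be the lookahead in frames, and the 0.08 s frame duration is the FastConformer value quoted in the config comment.

```python
def default_chunk_secs(att_context_size, frame_secs: float = 0.08) -> float:
    """Default streaming step: one current frame plus the lookahead frames."""
    _left_context, lookahead = att_context_size
    return (1 + lookahead) * frame_secs


def validate_slots(num_slots: int, batch_size: int) -> None:
    """Enforce the documented constraint num_slots >= batch_size."""
    if num_slots < batch_size:
        raise ValueError(f"num_slots ({num_slots}) must be >= batch_size ({batch_size})")


# Values from cache_aware_ctc.yaml.
validate_slots(num_slots=1024, batch_size=256)
for context in ([70, 13], [70, 6], [70, 1], [70, 0]):
    print(context, f"{default_chunk_secs(context):.2f} s")
```

Under these assumptions, the listed context sizes correspond to steps of 1.12 s, 0.56 s, 0.16 s, and 0.08 s, which is the latency range the multi-lookahead model trades against accuracy.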