Unified inference of streaming ASR #14817
Merged
129 commits
3c8391d
init inference folders
1b1cb59
added base asr inference
25c57a8
add ctc and rnnt inference classes
cb2c7f3
small changes for ctc/rnnt inference
ab47f96
add cache aware ctc/rnnt inference classes
c59bc2b
finilize asr inference part
7c30d09
add word class
76ff0c4
add enums file
b3aeb99
add alignment preserving itn
cd4eb39
add punctuation/capitalization model
3a5bf54
add audio_io and progressbar files
397a950
add framing and buffering files
08f5203
mv common/inference/utils into asr/inference/utils
2f0717d
add StreamingState objects
f00102c
temporary rm enhancement stuff
77d5ffb
rm common/inference
6e83d6b
add greedy decoders for CTC/RNNT
2cbdf2d
add endpointing files
c19bf72
add text processing
4800a7c
mv itn_utils into utils
194a507
add bpe_decoder, context_manager for cache aware, recognizer_utils
5c6e97f
add base_recognizer and recognizer interface files
9ee2364
add recognizers
8f65ebf
add factory
af6e1ef
add inference example and asr_client.py
e401e6f
minor fix
18a3e3b
minor fixes
2da1769
add example usage
13ff6ec
add jsonl support
da48a7a
rm niva prefix
68502ee
fix docstrings
55b020e
mv RequestType into enums.py
6f3fed1
rm redundant setters
42f738f
add a log_level to config.yaml
010213c
setup log_level in RecognizerBuilder
a6b9c19
add comments in multi stream and fix docstrings in buffering
be51fc7
conditional import for diskcache
6360dd0
set log level to INFO
72e3115
add MPS device support
1ccd6e5
add tests
1f2d381
move inference into examples/asr/asr_chunked_inference/ctc
e68107e
rm duplicated create_partial_transcript method
a85520a
Apply isort and black reformatting
naymaraq efd06b2
resolve flake8 errors
fa57b30
resolve return type
022f4eb
fix imports in tests
f3e0099
optimize bpe_decoder
d780841
optimize log prob normalization
d7d7b74
optimize split_text function
929a9ab
fix parital batching, improved GPU utilization
2ba57fa
simplify ctc greedy decoder
971d2b5
add a method to perform ITN on a list of texts
088a7c3
remove duplicated code in enums
2678ad9
remove unnecessary pad_to logging
4570eb1
modified update_punctuation_and_language_tokens_timestamps function t…
c9aaff0
Apply isort and black reformatting
naymaraq 7309846
[refactor: segment-level output] conditional import for pynini and ne…
ade13e5
[refactor: segment-level output] fix configs, added asr_output_granul…
baae7eb
[refactor: segment-level output] write segment/word level output into…
b3cf0c0
[refactor: segment-level output] add output granuality to request opt…
637094e
[refactor: segment-level output] add segment related fields to state
b88d22c
[refactor: segment-level output] add remove repeated punctuation func…
bbe9020
[refactor: segment-level output] add TextSegment class
1da433a
[refactor: segment-level output] update bpe decoder to support text s…
75926e1
[refactor: segment-level output] update recognizers
80f780b
[refactor: segment-level output] update text processing to support se…
e2f229a
rm unused and duplicated code
e2997b0
Apply isort and black reformatting
naymaraq 9a4dbaa
code cleanup
b54d7c2
Apply isort and black reformatting
naymaraq 557b66b
rm unused code and code cleanup
14d671c
Merge branch 'dkaramyan/inference' of https://github.com/NVIDIA-NeMo/…
e776d55
Set num_slots to 1024 and add a num_slots parameter to the config files
80b43fe
removed hyp.alignment processing codes
ca4bae8
disable amp
565ba2d
mv diskcache req into requirements_asr.txt
6b59370
set use_amp to true and make typing consistent
0348ea3
use match/case for readability
5e86413
rm lambdas from punctuation_capitalization_config.py
04f49da
rm detect_eou method from RNNTGreedyEndpointing
0ba2e71
reuse read_manifest from manifest_utils
07999dd
use librosa instead of soundfile
51bcec0
unfreeze ASRRequestOptions dataclass
4682688
set use_amp to false for buffered CTC/RNNT recognizers, improved thro…
a946876
change matmul precision to high for cache aware models
f236413
optimized audio buffer shifting
e34ed1e
Move running scripts and YAML files out of the ctc folder
d5537b3
reorganize file structure
5a7f6b3
Apply isort and black reformatting
naymaraq 50b8de2
Minor code simplifications
60a6d78
rm duplicated initializations from recognizers
f56ad1a
remove package version for diskcache
ef02c1a
move tqdm import to the top
e366e63
simplify millisecond_to_frames function
b144d04
raise a ValueError in case of stream_id > n_audio_files
49ee2f1
fix return types
5bce5c1
use list/dict/... instead of List/Dict/...
793c90d
use keyword argument passing to create CacheFeatureBufferer
42d9aec
clean up state resetting logic
1ec2c4b
reuse normalize_batch
68d549b
rename verbatim_transcripts and automatic_punctuation
3a17281
rename recognizers to pipelines
5b63a25
rename asr/*_inference -> model_wrappers/*_inference_wrapper
c45b6aa
Apply isort and black reformatting
naymaraq c8e5b35
reorgonize pnc, itn, text_processing params
4a3585a
improved code readability in pipeline initializations
7dd54fa
Apply isort and black reformatting
naymaraq 5a7a1ce
add CI script for testing
cb5bf78
add output_dir in CI test
02d5b48
move python running script into new folder
9044aa5
renamed asr_streaming_infer -> asr_streaming_inference
65f2007
correct path in CI test
66573f0
Merge remote-tracking branch 'origin/main' into dkaramyan/inference
444733f
fix: variable may be used before it is initialized
bc9042f
fix docstring in itn/ folder
525b167
fix docstring in model_wrappers/ folder
a2cf86d
fix docstring in utils/ folder
a54196d
fix docstring in pipelines/ folder
b9eb2ec
fix docstring in streaming/ folder
d0510c4
remove PnC codes since nlp models are no longer supported
e0a0de3
minor changes
3eba8d8
return step output from transcribe_step method
b1369c2
Apply isort and black reformatting
naymaraq 511fd38
Merge branch 'main' into dkaramyan/inference
naymaraq 5299973
fix functional_test
d55ad43
Merge branch 'main' into dkaramyan/inference
naymaraq fdc168c
increase timeout for L0_Unit_Tests_CPU_ASR
0328dbf
Merge branch 'main' into dkaramyan/inference
naymaraq 5cb1a65
rm cache aware inference from functional test
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| # Universal Streaming Inference | ||
|
|
||
| The `asr_streaming_infer.py` script enables streaming inference for both buffered (CTC/RNNT/TDT) and cache-aware (CTC/RNNT) ASR models. It supports processing a single audio file, a directory of audio files, or a manifest file. | ||
|
|
||
| Beyond streaming ASR, the script also supports: | ||
|
|
||
| * **Inverse Text Normalization (ITN)** | ||
| * **End-of-Utterance (EoU) Detection** | ||
| * **Word-level and Segment-level Output** | ||
|
|
||
| All related configurations can be found in the `../conf/asr_streaming_inference/` directory. |
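The three input modes described in the README (single file, directory, manifest) can be sketched as follows. This is a hypothetical helper, not the PR's `get_audio_filepaths`; the name `resolve_audio_inputs`, the `*.wav` glob, and the manifest layout (one JSON object per line with an `audio_filepath` key) are assumptions made for illustration.

```python
import json
from pathlib import Path


def resolve_audio_inputs(audio_file: str) -> list[str]:
    """Hypothetical helper: expand a file, directory, or manifest into audio paths."""
    path = Path(audio_file)
    if path.is_dir():
        # Directory: collect every WAV file directly inside it (assumed extension).
        return sorted(str(p) for p in path.glob("*.wav"))
    if path.suffix.lower() in {".json", ".jsonl"}:
        # Manifest (assumed layout): one JSON object per line with "audio_filepath".
        with path.open(encoding="utf-8") as f:
            return [json.loads(line)["audio_filepath"] for line in f if line.strip()]
    # Anything else is treated as a single audio file.
    return [str(path)]
```

Whatever the input mode, the result is a flat list of paths, which is the shape the script's `audio_file` argument is documented to accept.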
96 changes: 96 additions & 0 deletions
examples/asr/asr_streaming_inference/asr_streaming_infer.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """ | ||
| This script serves as the entry point for local ASR inference, supporting buffered CTC/RNNT/TDT and cache-aware CTC/RNNT inference. | ||
|
|
||
| The script performs the following steps: | ||
| (1) Accepts as input a single audio file, a directory of audio files, or a manifest file. | ||
| - Note: Input audio files must be 16 kHz, mono-channel WAV files. | ||
| (2) Creates a pipeline object to perform inference. | ||
| (3) Runs inference on the input audio files. | ||
| (4) Writes the transcriptions to an output json/jsonl file. Word/Segment level output is written to a separate JSON file. | ||
|
|
||
| Example usage: | ||
| python asr_streaming_infer.py \ | ||
| --config-path=../conf/asr_streaming_inference/ \ | ||
| --config-name=config.yaml \ | ||
| audio_file=<path to audio file, directory of audio files, or manifest file> \ | ||
| output_filename=<path to output JSON file> \ | ||
| lang=en \ | ||
| enable_pnc=False \ | ||
| enable_itn=True \ | ||
| asr_output_granularity=segment \ | ||
| ... | ||
| # See ../conf/asr_streaming_inference/*.yaml for all available options | ||
|
|
||
| Note: | ||
| The output file is a json file with the following structure: | ||
| {"audio_filepath": "path/to/audio/file", "text": "transcription of the audio file", "json_filepath": "path/to/json/file"} | ||
| """ | ||
|
|
||
|
|
||
| from time import time | ||
|
|
||
| import hydra | ||
|
|
||
|
|
||
| from nemo.collections.asr.inference.factory.pipeline_builder import PipelineBuilder | ||
| from nemo.collections.asr.inference.utils.manifest_io import calculate_duration, dump_output, get_audio_filepaths | ||
| from nemo.collections.asr.inference.utils.progressbar import TQDMProgressBar | ||
| from nemo.utils import logging | ||
|
|
||
| # disable nemo_text_processing logging | ||
| try: | ||
| from nemo_text_processing.utils import logger as nemo_text_logger | ||
|
|
||
| nemo_text_logger.propagate = False | ||
| except ImportError: | ||
| # NB: nemo_text_processing requires pynini, which is tricky to install on macOS. | ||
| # Since nemo_text_processing is not necessary for ASR, the import is wrapped in try/except. | ||
| logging.warning("NeMo text processing library is unavailable.") | ||
|
|
||
|
|
||
| @hydra.main(version_base=None) | ||
| def main(cfg): | ||
|
||
|
|
||
| # Set the logging level | ||
| logging.setLevel(cfg.log_level) | ||
|
|
||
| # Reading audio filepaths | ||
| audio_filepaths = get_audio_filepaths(cfg.audio_file, sort_by_duration=True) | ||
| logging.info(f"Found {len(audio_filepaths)} audio files") | ||
|
|
||
| # Build the pipeline | ||
| pipeline = PipelineBuilder.build_pipeline(cfg) | ||
| progress_bar = TQDMProgressBar() | ||
|
|
||
| # Run the pipeline | ||
| start = time() | ||
| output = pipeline.run(audio_filepaths, progress_bar=progress_bar) | ||
| exec_dur = time() - start | ||
|
|
||
| # Calculate RTFX | ||
| data_dur = calculate_duration(audio_filepaths) | ||
| rtfx = data_dur / exec_dur if exec_dur > 0 else float('inf') | ||
| logging.info(f"RTFX: {rtfx:.2f} ({data_dur:.2f}s / {exec_dur:.2f}s)") | ||
|
|
||
| # Dump the transcriptions to an output file | ||
| dump_output(output, cfg.output_filename, cfg.output_dir) | ||
| logging.info(f"Transcriptions written to {cfg.output_filename}") | ||
| logging.info("Done!") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() | ||
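A minimal sketch for reading the script's output back, assuming the output file holds one JSON record per line with the keys shown in the docstring (`audio_filepath`, `text`, `json_filepath`). The function name `read_transcriptions` is hypothetical and not part of the PR.

```python
import json


def read_transcriptions(output_filename: str) -> dict[str, str]:
    """Map each audio path to its transcript, assuming one JSON record per line."""
    transcripts = {}
    with open(output_filename, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            # Keys follow the structure shown in the script's docstring.
            transcripts[record["audio_filepath"]] = record["text"]
    return transcripts
```

The `json_filepath` field of each record points at the separate word-level or segment-level output, so it can be loaded the same way if that detail is needed.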
80 changes: 80 additions & 0 deletions
examples/asr/conf/asr_streaming_inference/buffered_ctc.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| # ================================ | ||
| # ASR Configuration | ||
| # ================================ | ||
| asr: | ||
| model_name: nvidia/parakeet-ctc-1.1b # Pre-trained CTC/hybrid model from NGC/HuggingFace or local .nemo file path | ||
| device: cuda # Device for inference: 'cuda' or 'cpu' | ||
| device_id: 0 # GPU device ID | ||
| compute_dtype: bfloat16 # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32' | ||
| use_amp: false # Enable Automatic Mixed Precision | ||
|
|
||
|
|
||
| # ========================================== | ||
| # Inverse Text Normalization Configuration | ||
| # ========================================== | ||
| itn: | ||
| input_case: lower_cased # Input text case handling: 'lower_cased', 'cased' | ||
| whitelist: null # Custom whitelist for ITN processing | ||
| overwrite_cache: false # Whether to overwrite existing cache files | ||
| max_number_of_permutations_per_split: 729 # Maximum permutations allowed per text split during ITN processing | ||
| left_padding_size: 4 # Padding size (#spans) for ITN context | ||
| batch_size: 32 # Batch size for ITN inference | ||
| n_jobs: 16 # Number of parallel jobs for ITN processing | ||
|
|
||
|
|
||
| # ======================== | ||
| # Confidence estimation | ||
| # ======================== | ||
| confidence: | ||
| exclude_blank: true # Exclude blank tokens when calculating confidence | ||
| aggregation: mean # Aggregation method for confidence across time steps | ||
| method_cfg: | ||
| name: entropy # Confidence estimation method: 'max_prob' or 'entropy' | ||
| entropy_type: tsallis | ||
| alpha: 0.5 | ||
| entropy_norm: exp | ||
|
|
||
|
|
||
| # ======================== | ||
| # Endpointing settings | ||
| # ======================== | ||
| endpointing: | ||
| stop_history_eou: 800 # Time window (ms) for evaluating EoU | ||
| residue_tokens_at_end: 2 # Number of residual tokens used for EoU | ||
|
|
||
|
|
||
| # ======================== | ||
| # Streaming configuration | ||
| # ======================== | ||
| streaming: | ||
| sample_rate: 16000 # Audio sample rate in Hz | ||
| batch_size: 256 # Number of audio frames per batch | ||
| left_padding_size: 1.6 # Left padding duration in seconds | ||
| right_padding_size: 1.6 # Right padding duration in seconds | ||
| chunk_size: 4.8 # Audio chunk size in seconds | ||
| word_boundary_tolerance: 4 # Tolerance for word boundaries | ||
| request_type: feature_buffer # Type of request: frame or feature_buffer | ||
| padding_mode: right # Padding mode: left or right. How to pad frames to match the required buffer length | ||
|
|
||
|
|
||
| # ======================== | ||
| # Pipeline settings | ||
| # ======================== | ||
| matmul_precision: high # Matrix multiplication precision: highest, high, medium | ||
| log_level: 20 # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL) | ||
| pipeline_type: buffered # Pipeline type: buffered, cache_aware | ||
| asr_decoding_type: ctc # Decoding method: ctc or rnnt | ||
|
|
||
|
|
||
| # ======================== | ||
| # Runtime arguments set via the command line | ||
| # ======================== | ||
| audio_file: null # Path to audio file, directory, or manifest JSON | ||
| output_filename: null # Path to output transcription JSON file | ||
| output_dir: null # Directory to save time-aligned output | ||
| enable_pnc: false # Whether to apply punctuation & capitalization | ||
| enable_itn: false # Whether to apply inverse text normalization | ||
| asr_output_granularity: segment # Output granularity: word or segment | ||
| cache_dir: null # Directory to store cache (e.g., .far files) | ||
|
||
| lang: null # Language code for ASR model | ||
| return_tail_result: false # Whether to return the tail labels left in the right padded side of the buffer | ||
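The `streaming` settings above imply some simple buffer arithmetic, sketched below. The sketch assumes that each buffered step feeds the model left padding + chunk + right padding; that layout is an assumption about how the padding values combine, and `buffer_layout` is a hypothetical helper rather than code from the PR.

```python
def buffer_layout(
    chunk_size: float,
    left_padding_size: float,
    right_padding_size: float,
    sample_rate: int,
) -> dict[str, float]:
    """Hypothetical helper: derive buffer sizes from the `streaming` settings.

    Assumes the model sees left padding + chunk + right padding on every step.
    """
    buffer_secs = left_padding_size + chunk_size + right_padding_size
    return {
        "buffer_secs": buffer_secs,
        "buffer_samples": round(buffer_secs * sample_rate),
        # Share of each buffer that is new audio rather than context.
        "new_audio_ratio": chunk_size / buffer_secs,
    }


layout = buffer_layout(
    chunk_size=4.8, left_padding_size=1.6, right_padding_size=1.6, sample_rate=16000
)
print(layout)
```

Under that assumption, the defaults give a buffer of about 8 seconds (128,000 samples at 16 kHz), of which 4.8 seconds are new audio on each step. Larger paddings give the model more context at the cost of recomputing more audio per step.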
83 changes: 83 additions & 0 deletions
examples/asr/conf/asr_streaming_inference/buffered_rnnt.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| # ================================ | ||
| # ASR Configuration | ||
| # ================================ | ||
| asr: | ||
| model_name: nvidia/parakeet-rnnt-1.1b # Pre-trained RNNT/hybrid model from NGC/HuggingFace or local .nemo file path | ||
| device: cuda # Device for inference: 'cuda' or 'cpu' | ||
| device_id: 0 # GPU device ID | ||
| compute_dtype: bfloat16 # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32' | ||
| use_amp: false # Enable Automatic Mixed Precision | ||
| ngram_lm_model: "" # Path to ngram language model | ||
| ngram_lm_alpha: 0.0 # Alpha for language model | ||
|
|
||
|
|
||
| # ========================================== | ||
| # Inverse Text Normalization Configuration | ||
| # ========================================== | ||
| itn: | ||
| input_case: lower_cased # Input text case handling: 'lower_cased', 'cased' | ||
| whitelist: null # Custom whitelist for ITN processing | ||
| overwrite_cache: false # Whether to overwrite existing cache files | ||
| max_number_of_permutations_per_split: 729 # Maximum permutations allowed per text split during ITN processing | ||
| left_padding_size: 4 # Padding size (#spans) for ITN context | ||
| batch_size: 32 # Batch size for ITN inference | ||
| n_jobs: 16 # Number of parallel jobs for ITN processing | ||
|
|
||
|
|
||
| # ======================== | ||
| # Confidence estimation | ||
| # ======================== | ||
| confidence: | ||
| exclude_blank: true # Exclude blank tokens when calculating confidence | ||
| aggregation: mean # Aggregation method for confidence across time steps | ||
| method_cfg: | ||
| name: entropy # Confidence estimation method: 'max_prob' or 'entropy' | ||
| entropy_type: tsallis | ||
| alpha: 0.5 | ||
| entropy_norm: exp | ||
|
|
||
|
|
||
| # ======================== | ||
| # Endpointing settings | ||
| # ======================== | ||
| endpointing: | ||
| stop_history_eou: 800 # Time window (ms) for evaluating EoU | ||
| residue_tokens_at_end: 2 # Number of residual tokens used for EoU | ||
|
|
||
|
|
||
| # ======================== | ||
| # Streaming configuration | ||
| # ======================== | ||
| streaming: | ||
| sample_rate: 16000 # Audio sample rate in Hz | ||
| batch_size: 256 # Number of audio frames per batch | ||
| left_padding_size: 1.6 # Left padding duration in seconds | ||
| right_padding_size: 1.6 # Right padding duration in seconds | ||
| chunk_size: 4.8 # Audio chunk size in seconds | ||
| word_boundary_tolerance: 4 # Tolerance for word boundaries | ||
| request_type: feature_buffer # Type of request: frame or feature_buffer | ||
| stateful: true # Whether to use stateful processing | ||
| padding_mode: right # Padding mode: left or right. How to pad frames to match the required buffer length | ||
|
|
||
|
|
||
| # ======================== | ||
| # Pipeline settings | ||
| # ======================== | ||
| matmul_precision: high # Matrix multiplication precision: highest, high, medium | ||
| log_level: 20 # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL) | ||
| pipeline_type: buffered # Pipeline type: buffered, cache_aware | ||
| asr_decoding_type: rnnt # Decoding method: ctc or rnnt | ||
|
|
||
|
|
||
| # ======================== | ||
| # Runtime arguments set via the command line | ||
| # ======================== | ||
| audio_file: null # Path to audio file, directory, or manifest JSON | ||
| output_filename: null # Path to output transcription JSON file | ||
| output_dir: null # Directory to save time-aligned output | ||
| enable_pnc: false # Whether to apply punctuation & capitalization | ||
| enable_itn: false # Whether to apply inverse text normalization | ||
| asr_output_granularity: segment # Output granularity: word or segment | ||
| cache_dir: null # Directory to store cache (e.g., .far files) | ||
| lang: null # Language code for ASR model | ||
| return_tail_result: false # Whether to return the tail labels left in the right padded side of the buffer |
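Endpointing windows such as `stop_history_eou` are given in milliseconds, while decoders work in encoder frames. The conversion can be sketched as below, assuming 80 ms per encoder frame (an assumption that matches FastConformer-style 8x subsampling over a 10 ms feature stride). The helper `ms_to_frames` and the choice to round up are illustrative and may differ from the PR's own conversion function.

```python
import math


def ms_to_frames(duration_ms: int, frame_ms: int = 80) -> int:
    """Hypothetical helper: number of encoder frames needed to cover duration_ms.

    frame_ms=80 is an assumption (8x subsampling over a 10 ms feature stride).
    """
    if duration_ms < 0 or frame_ms <= 0:
        raise ValueError("duration_ms must be >= 0 and frame_ms must be > 0")
    # Round up so a partial frame still counts toward the window.
    return math.ceil(duration_ms / frame_ms)


print(ms_to_frames(800))  # stop_history_eou from the config above
```

With these assumptions, the default 800 ms window corresponds to 10 encoder frames.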
80 changes: 80 additions & 0 deletions
examples/asr/conf/asr_streaming_inference/cache_aware_ctc.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| # ================================ | ||
| # ASR Configuration | ||
| # ================================ | ||
| asr: | ||
| model_name: stt_en_fastconformer_hybrid_large_streaming_multi # Pre-trained CTC/hybrid model from NGC/HuggingFace or local .nemo file path | ||
| device: cuda # Device for inference: 'cuda' or 'cpu' | ||
| device_id: 0 # GPU device ID | ||
| compute_dtype: bfloat16 # Compute precision: 'bfloat16' for Ampere+, 'float16' for older GPUs, or 'float32' | ||
| use_amp: true # Enable Automatic Mixed Precision | ||
|
|
||
|
|
||
| # ========================================== | ||
| # Inverse Text Normalization Configuration | ||
| # ========================================== | ||
| itn: | ||
| input_case: lower_cased # Input text case handling: 'lower_cased', 'cased' | ||
| whitelist: null # Custom whitelist for ITN processing | ||
| overwrite_cache: false # Whether to overwrite existing cache files | ||
| max_number_of_permutations_per_split: 729 # Maximum permutations allowed per text split during ITN processing | ||
| left_padding_size: 4 # Padding size (#spans) for ITN context | ||
| batch_size: 32 # Batch size for ITN inference | ||
| n_jobs: 16 # Number of parallel jobs for ITN processing | ||
|
|
||
|
|
||
| # ======================== | ||
| # Confidence estimation | ||
| # ======================== | ||
| confidence: | ||
| exclude_blank: true # Exclude blank tokens when calculating confidence | ||
| aggregation: mean # Aggregation method for confidence across time steps | ||
| method_cfg: | ||
| name: entropy # Confidence estimation method: 'max_prob' or 'entropy' | ||
| entropy_type: tsallis | ||
| alpha: 0.5 | ||
| entropy_norm: exp | ||
|
|
||
|
|
||
| # ======================== | ||
| # Endpointing settings | ||
| # ======================== | ||
| endpointing: | ||
| stop_history_eou: 800 # Time window (ms) for evaluating EoU | ||
| residue_tokens_at_end: 2 # Number of residual tokens used for EoU | ||
|
|
||
|
|
||
| # ======================== | ||
| # Streaming configuration | ||
| # ======================== | ||
| streaming: | ||
| sample_rate: 16000 # Audio sample rate in Hz | ||
| batch_size: 256 # Number of audio frames per batch | ||
| word_boundary_tolerance: 4 # Tolerance for word boundaries | ||
| att_context_size: [70,13] # Attention context size: [70,13],[70,6],[70,1],[70,0] | ||
| use_cache: true # Whether to use cache for streaming | ||
| use_feat_cache: true # Whether to cache mel-spec features; set to false to recalculate all mel-spec features in the audio buffer | ||
| chunk_size_in_secs: null # Amount of audio to load for each streaming step, e.g., 0.08s for FastConformer. Set to `null` to use the default size of 1 + lookahead frames. | ||
| request_type: frame # Type of request: only `frame` is supported for cache-aware streaming | ||
| num_slots: 1024 # Number of slots in the context manager: must be >= batch_size | ||
|
|
||
|
|
||
| # ======================== | ||
| # Pipeline settings | ||
| # ======================== | ||
| matmul_precision: high # Matrix multiplication precision: highest, high, medium | ||
| log_level: 20 # Logging level: 0 (NOTSET), 10 (DEBUG), 20 (INFO), 30 (WARNING), 40 (ERROR), 50 (CRITICAL) | ||
| pipeline_type: cache_aware # Pipeline type: buffered, cache_aware | ||
| asr_decoding_type: ctc # Decoding method: ctc or rnnt | ||
|
|
||
| # ======================== | ||
| # Runtime arguments set via the command line | ||
| # ======================== | ||
| audio_file: null # Path to audio file, directory, or manifest JSON | ||
| output_filename: null # Path to output transcription JSON file | ||
| output_dir: null # Directory to save time-aligned output | ||
| enable_pnc: false # Whether to apply punctuation & capitalization | ||
| enable_itn: false # Whether to apply inverse text normalization | ||
| asr_output_granularity: segment # Output granularity: word or segment | ||
| cache_dir: null # Directory to store cache (e.g., .far files) | ||
| lang: null # Language code for ASR model | ||
| return_tail_result: false # Whether to return the tail labels left in the right padded side of the buffer |
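The config comments say that a `null` `chunk_size_in_secs` defaults to 1 + lookahead frames, and that `num_slots` must be at least `batch_size`. A sketch of both rules follows, assuming the second entry of `att_context_size` is the lookahead in frames and that one frame spans 80 ms (the 0.08 s FastConformer figure quoted in the config). The function names are hypothetical.

```python
FRAME_MS = 80  # assumption: one encoder frame covers 0.08 s, as quoted for FastConformer


def default_chunk_ms(att_context_size: list[int], frame_ms: int = FRAME_MS) -> int:
    """Hypothetical helper: default chunk duration when chunk_size_in_secs is null."""
    _, lookahead = att_context_size  # assumed order: [left context, lookahead]
    return (1 + lookahead) * frame_ms


def check_slots(num_slots: int, batch_size: int) -> None:
    """Enforce the documented constraint num_slots >= batch_size."""
    if num_slots < batch_size:
        raise ValueError(f"num_slots ({num_slots}) must be >= batch_size ({batch_size})")


for context in ([70, 13], [70, 6], [70, 1], [70, 0]):
    print(context, default_chunk_ms(context), "ms")
check_slots(num_slots=1024, batch_size=256)
```

Under these assumptions, the four listed context sizes map to chunks of 1120, 560, 160, and 80 ms, so a smaller lookahead trades accuracy context for lower latency.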