NVIDIA-NeMo · mohammadaaftabv · Jun 28, 2026 · Jun 29, 2026
@@ -0,0 +1,31 @@
+# Qwen-Omni In-Process ASR Assets
+
+This folder contains prompt templates used by the Qwen-Omni in-process ASR
+adapter.
+
+Install the runtime with `uv sync --extra audio_qwen`. The dedicated extra
+keeps Qwen/vLLM dependencies out of existing `audio_cuda12` installations.
+
+The executable code path is:
+
+```text
+Pipeline
+  -> ManifestReader
+  -> AudioPayloadMaterializeStage
+  -> ASRStage(adapter_target=QwenOmniASRAdapter)
+  -> PayloadReleaseStage
+  -> ManifestWriterStage
+```
+
+The adapter reads prompt text through `prompt_file`, `en_prompt_file`,
+`followup_prompt_file`, or `system_prompt_file`. Curator stage behavior remains
+outside the prompt files:
+
+- graph expansion lives in `nemo_curator/pipeline/payload_lifecycle.py`;
+- audio decode and payload refs live in `nemo_curator/stages/payload_lifecycle.py`;
+- local/windowed ASR model-input segmentation and batching live in
+  `nemo_curator/stages/audio/inference/asr/stage.py`;
+- Qwen/vLLM request construction lives in `nemo_curator/models/asr/qwen_omni.py`.
+
+Prompt files may use `{language}` and `{transcript}` placeholders when the
+stage supplies language or reference text columns.
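
Reviewer sketch (not part of the diff): a minimal illustration of how `{language}` and `{transcript}` placeholders in a prompt file could be filled. `render_prompt` is a hypothetical helper; the adapter's actual substitution logic lives in `nemo_curator/models/asr/qwen_omni.py` and may work differently. The sketch assumes a placeholder stays untouched when the stage supplies no value for it.

```python
import re

# Hypothetical helper for illustration only; not the adapter's real API.
# Only the two placeholder names mentioned in the README are recognized.
_PLACEHOLDER = re.compile(r"\{(language|transcript)\}")


def render_prompt(template: str, **values: str) -> str:
    """Fill known placeholders, leaving any without a supplied value as-is."""

    def _substitute(match: re.Match) -> str:
        key = match.group(1)
        # Fall back to the original "{key}" text when no value was given.
        return values.get(key, match.group(0))

    return _PLACEHOLDER.sub(_substitute, template)


template = "Transcribe the {language} audio into text exactly as the speaker says it."
print(render_prompt(template, language="English"))
```

A regex substitution is used here instead of `str.format` so that any other braces in a prompt file would not raise a `KeyError`; whether the adapter makes the same choice is an assumption.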
@@ -0,0 +1,30 @@
+You receive audio in English.
+
+MAIN GOAL: Faithfully transcribe the audio exactly as spoken, including all disfluencies present in the audio.
+- Do NOT remove, correct, or "clean up" any speech artifacts.
+- Do NOT paraphrase, edit grammar, or make the speech more polished.
+
+FILLER WORDS:
+- Include hesitation markers such as "um", "uh", "hm", "ah", etc. exactly as spoken in the audio.
+
+REPETITIONS:
+- When the same word or short phrase is spoken consecutively and unintentionally, keep all repetitions as-is.
+  - Example: "I I think", "the the problem"
+
+FALSE STARTS:
+- Keep incomplete words or phrases that the speaker abandons, and mark each with a trailing hyphen.
+  - Example: "I was go going to the store." → "I was go- going to the store."
+
+COLLOQUIAL REDUCTIONS:
+- Preserve forms such as "wanna", "gonna", "kinda", "lemme", "lotta", "outta", "Imma", "sorta", "ya", "m'kay", "finna", "tryna", etc., exactly as spoken. Do NOT expand them into standard forms.
+
+WRONG GRAMMAR:
+- Grammatical errors should be faithfully captured in the transcript — do NOT correct them.
+- You MUST NOT fix subject-verb agreement, tense errors, or any other grammatical issues.
+
+NUMERICALS:
+- Keep numbers in words exactly as spoken. Do NOT convert them to digits. For example, "oh eleven" must stay "oh eleven" as spoken in the audio, not "zero eleven".
+
+Output format:
+- Return ONLY the transcription text.
+- No explanations, no JSON, no lists.
@@ -0,0 +1,51 @@
+You receive English audio and a reference transcript. The reference may be cleaned, partially wrong, or missing speech artifacts. The audio is the ground truth.
+
+REFERENCE TRANSCRIPT:
+"{transcript}"
+
+MAIN GOAL: Listen carefully to the audio and revise the reference so it faithfully reflects exactly what is spoken, including all disfluencies present in the audio.
+- Use the reference as a starting point; do not ignore it.
+- When the reference matches the audio, keep it unchanged.
+- When the reference conflicts with the audio, follow the audio.
+- Do NOT invent words or content not spoken in the audio.
+- Do NOT remove substantive content that is spoken in the audio (remove reference words only if they are not spoken).
+- Do NOT paraphrase, polish grammar or rewrite sentences that already match the audio.
+- Prefer minimal edits: fix mismatches and insert missing speech artifacts.
+- Preserve named entities from the reference in their exact written form.
+- Write numbers out in words, as spoken.
+
+ENTITIES (names, places, brands, titles, etc.):
+- Keep every named entity from the reference in its exact written form: spelling, casing, script, and punctuation. This includes names, places, brands, titles, acronyms, and other proper nouns.
+- Never transliterate, translate, re-spell, normalize, or "correct" an entity into another script.
+- If entities are part of code-switched speech, keep them unchanged.
+
+KEEP REFERENCE DISFLUENCIES:
+- If the reference already has fillers, repetitions, false starts, colloquial reductions, or grammatical errors, keep them.
+- Add hesitation markers and fillers natural to English wherever they are spoken in the audio but missing from the reference.
+- Do NOT clean up, normalize, or remove disfluencies that are already in the reference and are spoken in the audio.
+- Add consecutive instances of the same word or short phrase when spoken unintentionally.
+  - Example: reference "I think" → "I I think" if that is what is spoken.
+
+
+BACKGROUND / QUIET / OVERLAPPING SPEECH:
+- Keep all audible speech in the reference, including quieter, distant, or overlapping voices — not just the loudest speaker.
+- Add background or secondary speech that is audible but missing; do not drop words because they sound like background.
+
+FALSE STARTS:
+- Add incomplete words or phrases the speaker abandons, marked with a hyphen.
+  - Example: "I was go- going to the store."
+- Do NOT remove false starts already in the reference if they are spoken in the audio.
+
+COLLOQUIAL REDUCTIONS:
+- If the reference uses standard forms but the speaker used reductions, use the spoken form: "want to" → "wanna", "going to" → "gonna", etc.
+- Preserve forms such as "wanna", "gonna", "kinda", "lemme", "lotta", "outta", "Imma", "sorta", "ya", "m'kay", "finna", "tryna", etc. Do NOT expand them.
+
+WRONG GRAMMAR:
+- Keep grammatical errors as spoken. Do NOT correct subject-verb agreement, tense errors, or other grammar issues.
+
+NUMERICALS:
+- Keep numbers as spoken in words. Do NOT convert them to digits.
+  - Example: keep "oh eleven" or "zero eleven".
+
+Output format:
+- Return ONLY the revised transcription text.
@@ -0,0 +1 @@
+Transcribe the {language} audio into text exactly as the speaker says it. Write numbers as spoken words.
@@ -0,0 +1,23 @@
+You receive:
+1) An audio file,
+2) A Ground Truth Transcription of the audio: {transcript}
+
+Goal: Normalize the numbers in the text into spoken words and add any disfluencies that are present in the audio.
+
+ALLOWED ONLY:
+1) Normalize numeric expressions into words exactly as they are SPOKEN in the audio.
+- Mixed format is forbidden:
+    Bad: "5 percent", "2 zeros"
+    Good: "five percent", "two zeros"
+- Normalize: percentages, currencies, units, ranges, decimals, dates/years — ONLY if they are spoken.
+- If a unit (for example “percent”) is NOT spoken, do not add it.
+2) Add any disfluencies present in the audio.
+- Disfluencies such as "um" and "uh" that are present in the audio should be added to the text.
+- If a word is repeated in the audio but missing from the Ground Truth Transcription, add it to the text.
+
+ENTITIES (names, places, brands, titles, etc.) should be the same as in the Ground Truth Transcription:
+- Keep every named entity from the reference in its exact written form: spelling, casing, script, and punctuation. This includes names, places, brands, titles, acronyms, and other proper nouns.
+
+OUTPUT FORMAT:
+- Return only the final text.
+- No explanations, no JSON, no lists.